Looking Back at My AI/ML Deep Dive (1) General Principles and Non-NN Models
Background
Since joining Kyndryl, I’ve had many more opportunities to interact with AI. I absolutely do not want to become an “LLM guy”—someone who just throws prompts around, pretending to know AI. Driven by this fear and a desire to get a solid, step-by-step grasp of the AI/ML domain, I cut off all holiday temptations and went on an intensive learning binge over the New Year.
Since I recently started this blog, I’ve decided to break down what I learned into a few posts. Rather than a textbook-like detailed explanation, I’ll focus on covering the big picture while keeping the summaries short and sweet. I might write deep dives into individual topics later if the mood strikes.
Here is the planned article series:
- Overview and Selection of Non-NN Models (This article)
- History of NN-based Models (Excluding GenAI)
- Overview of GenAI Models and Related Knowledge
- AI/ML System Design
- AI/ML Implementation and Operations (Libraries, Communities, Kaggle, MLOps, LLMOps)
This time, I’ll walk through the classic machine learning algorithms (Regression, Classification, Clustering, Dimensionality Reduction) that aren’t Deep Learning (NN).
My Knowledge Sources
Here is where I got all my info for this. Since I haven’t actually operated an ML system in a real-world production environment, there might be some misunderstandings regarding the surrounding details. If you notice anything, please let me know in the comments.
Student Days
- University and grad school lectures, labs, and research
- Experiments with Numpy, Pandas, Scikit-Learn
- Deep Learning from Scratch (O’Reilly Book)
- Exploring Everyday Things with R and Ruby (O’Reilly Book)
Past Few Years
- Andrew Ng’s Coursera courses
- Integrating simple AI features into products using LLMs
- Experimental projects streamlining website creation using Multimodal RAG
This Time (Holiday Intensive Study)
- Review of Numpy, Pandas, Scikit-Learn
- Learning PyTorch and Hugging Face
- Reading Kaggle source code
- Andrej Karpathy’s YouTube channel
- ByteByteGo’s ML/GenAI System Design
- Conversations with Gemini
How to Choose a Model
Before looking at specific algorithms, let’s look at the big picture of how to pick a model for regression or classification from a few different angles.
“Go-To Moves” by Domain and Data Type
There are established “first moves” depending on the data type. The basic idea is to build a quick baseline with these, and if the accuracy isn’t good enough, you step up to the next option.

- Text Classification (Spam detection, sentiment analysis, etc.)
  Start with “Naive Bayes,” which is blazing fast to compute. If the accuracy is poor and the assumption of independence seems sketchy, turn the words into numbers (vectors) using “TF-IDF” or “Word2Vec,” and throw them into “Logistic Regression” or “SVM.” I’ll briefly explain “TF-IDF” in the Logistic Regression section, but “Word2Vec” is NN-based, so I’ll cover it next time. If that still fails, bring out the NNs.
- Tabular Data (Structured data in CSV or DB)
  Whether it’s classification or regression, the textbook move is to first run a “health check” and build a baseline using a “Linear Model.” However, in the real world, “Random Forest” is often heavily favored as the initial go-to for the slightly lazy (or rather, highly efficient) reason of: “I just want to dump the raw data in without annoying preprocessing like scaling, and quickly see which features actually matter.” If your business really needs “explainability” (why the AI made that choice) or you want to see the hidden rules in your data, use a simple single “Decision Tree” as your first step. When you have tons of data (tens of thousands of rows) and getting just a tiny bit more accurate means real money for the business, bring out the Kaggle champion “Gradient Boosting Decision Trees (GBDT)” and tune the heck out of it.
- Time Series Data (Stock prices, sensor data)
  Start with good old statistical models that look at short-term trends and seasonality. Once they hit their limit, move up to “GBDT” or “NN.” I won’t go deep into this here since I didn’t focus on it much this time around.
- Image/Audio (Complex messy data)
  In classic ML, pulling useful features straight from pixels or soundwaves is super hard. So, jumping straight to “NNs” is just the standard. For classification, if you don’t have the budget to run heavy NNs, a common trick is to use a pre-trained model (like ResNet or CLIP) to turn the images into vectors, and then sort them using “SVM (RBF kernel)” or “Logistic Regression.”
Data Volume (Sample size: n)
- ~10k samples
  Since there’s not much data, complex models (like NNs) will easily overfit (memorize the noise). Simple models like “Linear Regression” or a single “Decision Tree” are your best starting points. On the other hand, if the data has tricky patterns, “SVM (RBF kernel)” becomes a massive weapon. As I’ll explain later, SVM has a superpower where it learns really well even with tiny amounts of data. But if you feed it too much data, the math gets way too heavy. So it really shines right here.
- 10k ~ 1M samples
  This is the sweet spot. It’s enough data to learn tricky patterns, but still small enough to fit in a normal computer’s RAM. In this zone, powerful combination models like “Random Forest” and “GBDT” are usually your best bet.
- 1M+ samples
  Now the data is huge. Classic models that try to cram everything into memory and check every single detail (like simple Decision Trees or SVMs) will just crash your computer or take forever. You have to switch to things like “Histogram-based GBDT” (which cuts corners smartly to save time) or “NNs” (which can learn bit by bit).
Balance between Data Volume and Number of Features (Dimensions) (n/p)
In machine learning, there’s a good rule of thumb called “The rule of ten” (try to have at least 10 examples for every feature).

- Ideal state (n/p > 10)
  You have plenty of samples, so even complex models (GBDT, NN, etc.) can learn smoothly without easily memorizing the noise.
- High-dimensional, small sample (n/p < 1)
  You have way more features than actual examples. The risk of overfitting is sky-high because the model just memorizes everything, noise included. It’s like a student who doesn’t understand the exam questions at all, but memorized that “the answer to past exams was always ‘C’, so I’ll just write ‘C’,” only to spectacularly bomb the actual, slightly tweaked test. Doing cleanup work first, like squishing down the number of features with PCA or trimming the fat with Lasso regression, becomes absolutely necessary.
Data Complexity and Non-linearity
- Is the relationship in the data simple and straight (linear), or is it a tangled mess (non-linear)?
- First, use linear models as a “health check” for the data. We engineers are creatures that, given the chance, want to jump straight into precision testing using complex NNs or ensemble models. However, holding back and starting with a simple stethoscope is the common sense of a working adult. If a simple linear model gives you decent accuracy, there’s no need to open the Pandora’s box of complex models (think about the poor soul who has to maintain it!). Conversely, if the linear model completely fails, it means there are strong hidden twists in your data (e.g., results spike only when conditions A and B align), so it’s time to step up to GBDT or NN.
Real-World Requirements (Inference Latency, Learning Cost)
When you actually want to put this in production, system limits often matter way more than just raw accuracy.

- Inference Latency
  For example, if the system absolutely has to reply within 10ms, heavy combination models (like Random Forest or GBDT) might just be too slow. In that case, lightning-fast Naive Bayes, simple linear models, or models stripped down to the bare minimum become your top choices.
- Learning Cost (Compute Power)
  Training NNs usually takes pricey GPUs and forever to finish, but algorithms like LightGBM can learn super fast even on cheap CPUs. For systems that need to retrain themselves every single night, keeping this training cost low is a big deal.
Trade-off between Explainability and Performance

This is about whether humans can actually understand “why the AI said that.”

- Black Box (High Accuracy)
  NNs, GBDTs, and SVM (RBF kernel) are powerful, but their internal logic is a massive web of math that humans just can’t untangle.
- White Box (High Interpretability)
  A single decision tree or linear model can clearly explain “this specific variable caused this specific outcome.” When confronted with “Why did you reject this customer’s loan application?”, you can confidently say, “Because they had X or more late payments in the past.” Depending on the situation, this “Excuse (Apology) Function” is absolutely crucial. In fields with strict rules (finance, medical diagnosis, etc.), white-box models are sometimes the only choice, even if you have to sacrifice some accuracy.
Regression & Classification
First, a quick refresher. Regression is about predicting numbers that can slide up and down, like predicting stock prices or rent. Classification is about putting things into buckets, like figuring out if an email is spam (yes/no) or what the weather will be (sunny/rainy/cloudy). Many regression/classification algorithms were first made for one main job, but usually, with a little tweaking, they can do the other task too. Here, I’ll break down how they work by family.
Linear Models
Linear Regression
Good old linear regression. It just tries to draw a line that gets as close as possible to all the right answers (by shrinking the Mean Squared Error). In the rent example, if the data naturally curves or jumps around—like when the rent heavily depends on a specific combo of the nearest station and the walking distance—a straight line just won’t cut it.
When predictions go poorly, it means the total error is high. Actually, this can be broken down like this:
Total Error = Bias^2 + Variance + Unfixable Error
This is the famous Bias-Variance Trade-off equation. It basically says that when your model sucks, it’s usually for one of two reasons. One is high bias, meaning your model is just too dumb. Linear regression is a good example; it’s what happens when you try to explain a wild, curvy reality using a straight ruler. The other is high variance, which is basically overfitting. I’ll explain this in the Decision Tree section.
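To make the “draw the closest line” idea concrete, here is a minimal scikit-learn sketch using the rent example. The numbers are made up for illustration, and in this toy data the rent happens to follow the line exactly, so the Mean Squared Error lands at (essentially) zero:

```python
# Toy rent prediction: floor area (m^2) -> rent (in thousands).
# Linear regression finds the line minimizing the Mean Squared Error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

area = np.array([[20], [25], [30], [40], [50]])  # feature: floor area
rent = np.array([60, 70, 80, 100, 120])          # target: rent

model = LinearRegression().fit(area, rent)
pred = model.predict(area)

print(model.coef_[0], model.intercept_)  # slope and intercept of the fitted line
print(mean_squared_error(rent, pred))    # the quantity training shrinks
```

On real rent data with curves and jumps, that MSE would stay stubbornly high no matter what line you draw, and that residual is the "high bias" case described above.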
Logistic Regression
Despite having “regression” in its name, it’s actually a yes/no classifier built on top of linear regression. It takes the raw output of linear regression (Logits), squishes it into a probability between 0 and 1 using an S-curve (sigmoid function), and learns by trying to make its guesses as close to the truth as possible.
To use logistic regression on messy data like text or images, you first have to turn that data into lists of numbers (vectors) that the math can understand. A super common way to turn text into vectors is TF-IDF.
TF-IDF (Term Frequency - Inverse Document Frequency) is just a score that multiplies how often a word shows up in one text (TF) by how rare it is across all the texts (IDF). It gives big scores to words that “show up a lot in this one document, but almost nowhere else.” For example, in spam detection, the word “Free” shows up a ton in spam emails (high TF) but rarely in normal emails (high IDF), giving it a huge TF-IDF score. On the other hand, words like “the” or “is” show up everywhere, so their IDF is basically zero, keeping their score low.
Here is how vectorization actually works in practice. First, you gather every single unique word from all your training texts and make a massive list. The number of words in that list is your “dimensions.” Then, you represent each text as a list of TF-IDF scores for every word in that master list.
Turning text into vectors can be hard to wrap your head around at first (trust me, I know), so let’s look at a simple example:
- I love cats (3 words)
- I really love cats and dogs (6 words)
- Dogs are scary (3 words)
First, make a list of all words appearing in all documents. This becomes the dimensions of the document vector.
[I, love, really, cats, and, dogs, are, scary] (8 words total = 8 dimensions)
For each document, plugging the TF-IDF score into the corresponding word’s slot gives you that document’s vector.
- Document 1 Vector:
  [TF-IDF(I), TF-IDF(love), 0, TF-IDF(cats), 0, 0, 0, 0]
- Document 2 Vector:
  [TF-IDF(I), TF-IDF(love), TF-IDF(really), TF-IDF(cats), TF-IDF(and), TF-IDF(dogs), 0, 0]
- Document 3 Vector:
  [0, 0, 0, 0, 0, TF-IDF(dogs), TF-IDF(are), TF-IDF(scary)]
Just like that, all sentences are turned into lists of numbers of the exact same length (8 dimensions), and those numbers get fed right into the logistic regression.
Decision Tree Family
Decision trees show up all the time when sorting data. Think of them as simple models that just ask a series of Yes/No questions to branch down to an answer. It’s basically the game Akinator.
Decision Tree
As mentioned above, a decision tree keeps splitting data. To get to the answer quickly (in the Akinator example, guessing the character with the fewest questions), picking the right question to split on is everything. The computer does this by brutally trying every possible split condition on all the data, and repeatedly picking the one that separates the classes the best (by calculating something called “Gini impurity”).
The catch is, if you let the tree ask too many questions (make it too deep), it starts memorizing weird coincidences. To make it easy to understand, let’s look at a weather forecast:
Expected Prediction: “Current pressure is 1015hPa, humidity 65%. Looking at past data, this means ‘Sunny’.”
Slightly Overfitting: “There’s a light breeze of 2.1m/s from the SSW. Ten years ago today, we had a sudden downpour on an afternoon with the exact same wind, so it might be ‘Rain’ this afternoon.”
Complete Overfitting (Obsessed with Noise): “In the past year, days where I rolled a die three times and got 1, 2, 3 were all sunny. Today meets this condition, so it must be sunny!”
This gets a bit into the weeds, but let me quickly touch on how to stop overfitting and tune the settings (hyperparameters).
You catch overfitting by keeping an eye on the error rates for both your training data and a separate set of test data. If your training error keeps dropping but your test error stops dropping (or even starts going up), boom, you’re overfitting. You hit the brakes right there. Also, don’t skip the boring best practices like cross-validation—they naturally help prevent overfitting even in simple experiments.
The absolute best way to fix overfitting is to get better data. Gather more rows, artificially twist your data to make more of it (Data Augmentation), or throw away useless features. The next best thing is purposely putting handcuffs on your model so it can’t memorize things. You can limit how deep the tree goes, shrink the math weights (Ridge regularization), or completely delete the weights of useless features (Lasso regularization).
Also, weirdly enough, it’s often best to build a massive model that is way too smart, and then heavily apply those handcuffs (regularization). People used to think that if the training error hit zero, the model was perfect. But researchers found a weird “double descent” phenomenon: the model actually performs the worst right at the exact moment it perfectly memorizes the training data. If you make the model even bigger past that point, it suddenly stops caring about the noise and starts learning the real underlying patterns. So the modern playbook is: huge model + strong regularization.
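The “watch train vs. test error, then put on handcuffs” idea can be sketched like this. The synthetic dataset and the max_depth=3 limit are illustrative choices, not a recipe; flip_y deliberately injects label noise for the deep tree to memorize:

```python
# Compare an unconstrained decision tree (memorizes noise) against one
# with a depth limit, on noisy synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, flip_y=0.2,
                           random_state=0)  # flip_y=0.2 injects label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)           # no limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# The unconstrained tree hits 100% on training data (memorization),
# but the train/test gap is where you "hit the brakes"
print("deep:   ", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))
```

The deep tree’s training accuracy of 1.0 despite 20% flipped labels is the “rolled a die three times” failure mode in code form.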
Regression Tree
A regression tree is just a decision tree but for predicting numbers. It splits the data so that similar numbers get grouped together (keeping the variance low). In the rent prediction example, you might branch by “nearest station,” then “distance from station,” then “floor area.” The final guess for a new apartment is just the “average” rent of all the past apartments that ended up in that same final bucket.
Like decision trees, it’s prone to overfitting.
Random Forest
Random Forest suppresses the main weakness of decision trees—overfitting (high model variance)—by training many simple decision trees in parallel and taking a majority vote for the final result. Each tree is trained on a different bootstrap sample of the dataset and considers only a random subset of features at each split. Choose this when the dataset is small and the overfitting risk is high.
To use an easy-to-understand system development example: Random Forest is like handing broken, purposely unfinished spec documents to 10 different developers, having them build the system independently, and finally merging everything with a majority vote. If a real Project Manager heard this, they’d foam at the mouth at such a guaranteed-to-burn-down brute-force tactic, but in the machine learning world, this “two heads are better than one” (or ten heads) approach works frighteningly well to prevent overfitting.
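The “ten developers, majority vote” setup above maps directly onto scikit-learn’s parameters. A small sketch on a built-in dataset (the parameter choices are illustrative):

```python
# A forest of 10 trees: each sees a bootstrap sample of the rows and a
# random subset of features per split, then they vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=10,      # the 10 "developers"
                                max_features="sqrt",  # partial "spec docs"
                                random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean())  # solid accuracy with zero scaling or preprocessing

forest.fit(X, y)
print(forest.feature_importances_[:3])  # "which features actually matter"
```

Note the raw features go in untouched, which is exactly the “dump the data in without annoying preprocessing” appeal mentioned in the tabular-data section.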
Gradient Boosting Decision Tree (GBDT)
Similar to Random Forest, it uses multiple simple trees, but instead of putting them side-by-side, it stacks them. The second tree tries to fix the mistakes of the first tree, the third tree fixes the second’s mistakes, and so on. By starting with a simple model and piling on more trees one by one, it slowly fixes the model’s blind spots and makes it smarter.
In a system development example, it’s like developer 1 building a working prototype, handing it to developer 2 who stays up all night debugging it, who then hands it to developer 3 to add features.
This is the go-to method on Kaggle. You use this when you have plenty of data and the time to tweak the settings (like the learning rate and tree depth). If you have enough data, it can catch super complex relationships (like when A and B happening together totally flips the result). Random Forest usually misses these because it’s just taking a big average and smoothing things out.
There are several famous implementations of GBDT:
- XGBoost (2014~): The key player that made GBDT famous. Adding parallel processing and regularization (overfitting prevention), it showed off blazing fast speed and accuracy compared to the then-standard Scikit-Learn GradientBoostingClassifier. It held the Kaggle crown for a long time.
- LightGBM (2017~): Developed by Microsoft. Hugely improved XGBoost’s weaknesses: “memory usage during training” and “computation speed.” Introduced a histogram-based algorithm (skips calculations by binning continuous values) and a leaf-wise growth mechanism that focuses on branching from data that heavily impacts accuracy (large gradients). Currently, it’s often treated as the strongest baseline for tabular data.
- HistGradientBoosting (Scikit-Learn 0.21+): Scikit-Learn incorporated LightGBM’s algorithm. Use this algorithm in environments where external libraries like LightGBM cannot or should not be installed.
Naive Bayes
A fast classifier based on Bayes’ theorem (a way of updating probabilities based on new evidence), mostly used for text sorting like spam detection.
Naive Bayes makes one really bold guess to keep the math easy: it assumes every single feature has absolutely nothing to do with the others. For example, in the spam email case, suppose the current email contains the words “Free” and “Win”. Then (up to a shared normalizing constant):

Score it’s Spam: P(Spam|Free,Win) ∝ P(Free|Spam) * P(Win|Spam) * P(Spam)

Score it’s Normal: P(Normal|Free,Win) ∝ P(Free|Normal) * P(Win|Normal) * P(Normal)
It does the math for both and picks the highest score. It’s much lighter than logistic regression, but it completely trips up if the features actually do interact (like how the words “AI” and “Investment” placed right next to each other make an email way more suspicious). It also panics if it sees a word it’s never seen before. Still, it’s the absolute top candidate to try first for text classification. If it fails, you move up to vectorizing the text (TF-IDF, Embeddings) + Logistic Regression.
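The whole pipeline fits in a few lines of scikit-learn. The tiny corpus below is made up, and MultinomialNB does the per-word probability products from the equations above internally (via log-sums, with smoothing so unseen words don’t zero everything out):

```python
# Word counts in, class probabilities out: the classic Naive Bayes spam filter.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

mails = ["free win money now", "win free prize",       # spam examples
         "meeting at noon", "see you at the meeting"]  # normal examples
labels = ["spam", "spam", "normal", "normal"]

# CountVectorizer builds the word-count features; MultinomialNB
# multiplies per-word likelihoods assuming word independence
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(mails, labels)

print(clf.predict(["free money"]))  # -> ['spam'] on this toy corpus
```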
Support Vector Machine (SVM)
SVM is mostly used for sorting things into two buckets (binary classification), but people also twist it to handle more buckets or even predict numbers.
To understand the core idea, let’s stick to two buckets. SVM tries to draw a line (or a flat plane) between the two groups of data, making sure that the gap between the line and the closest data points (the “support vectors”) is as wide as humanly possible. Any data points sitting far away from this boundary are completely ignored in the math.
Imagine shoving the thickest possible pipe between the two groups of data. How bendy and flexible this pipe can be depends on the “kernel” you choose—which is just a fancy way of saying how you measure the distance. You can pick linear, polynomial, or Gaussian (infinite dimensions), but in the real world, you almost always just pick Gaussian (RBF). Why? Because if you could just use a straight line (linear kernel), you’d probably just use a simpler model like Logistic Regression instead. The Gaussian kernel, however, can squeeze an unbelievable amount of learning out of a tiny amount of data (up to about 10k samples), filling a weird niche that NNs and GBDTs can’t touch.
As mentioned earlier, for messy data (text, images, audio), a great modern trick is to turn them into structured vectors using Embedding models, and then throw those vectors into an RBF kernel SVM. If you use a multimodal embedding model like CLIP, you literally get a multimodal classifier almost for free.
Applying this idea to predicting numbers (regression) is called SVR (Support Vector Regression). It just flips the goal: instead of keeping the pipe empty, you try to fit all the data points inside the pipe, making the pipe as thin as possible.
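The “bendy pipe vs. straight pipe” contrast is easy to see on the classic two-moons dataset, which no straight line can separate. A sketch with illustrative settings (note the scaling step: SVMs are distance-based, so it matters):

```python
# RBF vs. linear kernel on data with a curved class boundary.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Scale first: kernel distances are meaningless across unscaled features
rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
linear = make_pipeline(StandardScaler(), SVC(kernel="linear"))

rbf_score = cross_val_score(rbf, X, y, cv=5).mean()
linear_score = cross_val_score(linear, X, y, cv=5).mean()
print(rbf_score, linear_score)  # the bendy pipe wins on curved boundaries
```

With only 200 samples the RBF kernel already separates the moons well, which is exactly the small-data niche described above.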
Clustering
This is about grouping similar pieces of data together. Unlike classification, there are no “correct answers” given upfront (this is called unsupervised learning), which makes it really annoying to figure out if your model actually did a good job.
K-Means
You tell the algorithm “I want K groups,” and K-Means goes hunting for K circles in your data. It plops down K random center points, draws a circle around each, grabs all the data inside, and then moves the center point to the actual middle of that grabbed data. It repeats this wiggle about 10-50 times until things settle down. Since it relies heavily on measuring distance, you must normalize or scale your data first. Also, because it starts with random points, the final result can vary. Usually, people use a smarter trick (like K-Means++) to spread out those starting points instead of going completely random.
To figure out if your K is actually good, people use the “Elbow Method” (you plot the messiness of the clusters on a graph, and look for where the line bends like an elbow). Also, because it looks for circles, it completely fails if your data is shaped like bananas or squiggles. If the math tells you the clusters aren’t tightly packed (low Silhouette score), you probably need to switch to a density-based algorithm like DBSCAN.
Time complexity is O(N·K·D·I): the product of the number of data points (N), clusters (K), dimensions (D), and iterations (I), where I is generally around 10-50. Memory is only the data itself (N·D) plus the centroids (K·D). Dimensionality (D) heavily impacts performance (the curse of dimensionality), so consider dimensionality reduction as preprocessing.
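The full loop described above (scale first, k-means++ starting points, check quality with a silhouette score) looks like this in scikit-learn, on synthetic blob data chosen for illustration:

```python
# K-Means on three synthetic blobs, with the recommended preprocessing
# and quality check.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)  # distance-based: scale first!

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.n_iter_)                   # usually settles within ~10-50 wiggles
print(silhouette_score(X, labels))  # closer to 1 = tight, round clusters
```

To run the Elbow Method, you would repeat this over a range of K values and plot `km.inertia_` (the cluster “messiness”) against K.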
DBSCAN
This one grows groups in a chain reaction, kind of like the “a friend of a friend is my friend” rule. You just give it two settings: how close someone needs to be to be considered a “friend”, and the minimum size of a friend group. You don’t even have to guess the number of clusters beforehand. Any data points left isolated are just tossed out as noise or weird outliers.
In the background, it uses a spatial index (KDTree) which means it’s pretty fast and memory-efficient. However, just like K-Means, if you have too many features (dimensions), the math bogs down heavily. This is where dimensionality reduction comes in to save the day.
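To see the “friend of a friend” rule beat the circle-hunting of K-Means, here is a sketch on banana-shaped data. The eps and min_samples values are illustrative and usually need tuning per dataset:

```python
# DBSCAN finds the two banana shapes without being told how many
# clusters to look for.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps = how close counts as a "friend"; min_samples = minimum group size
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

clusters = set(db.labels_) - {-1}  # label -1 marks noise/outliers
print(len(clusters))               # both moons found, no K required
```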
Dimensionality Reduction
This is the art of throwing away features (dimensions) while magically keeping as much of the original information as possible.
PCA (Principal Component Analysis)
It looks at all your features mashed together and finds the direction where the data spreads out the most. It’s like looking at a student’s math and physics grades and just combining them into a single “science score.” That becomes your first principal component. It then looks for the next best direction at a right angle to the first, maybe combining reading and writing into a “language score.” Under the hood, it’s just a bunch of heavy matrix math.
Suddenly, I realized I was basically doing mental PCA during my recent house hunting. Noticing the high correlation between “UA Value” and “C Value,” I compressed them into a single parameter called “Insulation Performance,” thereby preventing a brain explosion from variable overload (the classic curse of dimensionality). Though, identifying that variance was purely a gut feeling on my part.
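The “science score” example can be simulated directly. The data below is synthetic: math and physics grades are both generated from one hidden aptitude, so a single principal component captures nearly all the variance:

```python
# Compressing two highly correlated features into one principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ability = rng.normal(size=100)                    # hidden "science aptitude"
math = ability + rng.normal(scale=0.1, size=100)  # noisy observations of it
physics = ability + rng.normal(scale=0.1, size=100)
X = np.column_stack([math, physics])

pca = PCA(n_components=1)
science_score = pca.fit_transform(X)      # 2 dimensions -> 1 "science score"

print(pca.explained_variance_ratio_)      # fraction of information kept
print(science_score.shape)                # (100, 1)
```

`explained_variance_ratio_` is the rigorous version of my house-hunting gut feeling: it tells you exactly how much spread the compressed axis preserves.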
Reconsidering ML and GenAI Application Domains
Given the recent explosion in generative AI, I spent some time thinking about where these classic ML models actually fit in today.
In “systems requiring lightning-fast responses” and “areas where AI hallucinating is strictly forbidden,” classic ML models still absolutely dominate; GenAI hasn’t taken over these areas yet.
But on the flip side, for quick startup prototypes where the ML task is kind of blurry, or when you just need to get a proof-of-concept working fast, a lot of people are just calling an LLM API and calling it a day.
Honestly, even in my own weekend projects, if I wanted a “similar item recommendation” feature in the past, I’d have to scrape data, build features by hand, and train a whole custom model (with all the annoying cross-validation and tuning). Now? Just calling an LLM gives me decent accuracy, and if I want to tweak the logic, I just type a different prompt. It’s almost too easy. Even for sorting text, the vibe nowadays is definitely “let’s just ask an LLM to classify it first before we bother writing Naive Bayes code.”
Conclusion
This time, we looked back at everything from overall machine learning principles to linear models, decision tree combinations, SVMs, and ways to cluster and shrink data.
Organizing this again made me really feel it: the scope of machine learning is simply massive. I started this with a light heart, thinking, “I’ll just whip up a quick cheat sheet over the weekend,” but before I knew it, it turned into a giant essay. My brain’s memory limit is nearly maxed out.
Still, looking back at all this reminded me of a few key things:
- You can’t pick a model just by glancing at a “cheat sheet”. The internet is flooded with simple flowcharts saying “If you have this data, use this model,” but in the real world, that’s not enough. You can’t find the best approach without deeply understanding how each algorithm actually works under the hood (why it does well with small data, why it loves to overfit, etc.), while also balancing brutal reality checks like “will this run fast enough?” and “can I explain to my boss why the AI did this?”.
- On the other hand, clear “standard plays” definitely exist. Even if cheat sheets aren’t enough, the question of “what should I try first?” is mostly solved based on what your data looks like. Starting there, checking if your data curves weirdly or if the model is memorizing noise, and then slowly stepping up to more complex models—while it sounds like the long way around—is actually the smartest, most professional path.
- Even in the GenAI era, classic ML isn’t going anywhere. While LLMs are definitely stealing the spotlight for quick prototypes and easy text sorting, when you’re building real production systems that demand “millisecond response speeds,” “zero hallucination risk,” and “crunching massive data on the cheap,” LightGBM and these other classic models are still absolute must-haves.
If you’re actually running ML systems in the real world right now, you might be reading this thinking, “Hey, that’s not quite how it works in the trenches anymore.” If so, please call me out in the comments!
Next time, I’ll dive into the “History of NN Model Development (Excluding GenAI).”