Machine learning can feel like a black box — but the core algorithms are surprisingly intuitive once you see them in context. This article walks through the most important ML algorithms, explains how each one works, when to use it, and shows the key formulas and visual intuitions behind them. No PhD required.

1. Linear regression

Linear regression is the simplest and most widely used supervised learning algorithm. It finds a straight line (or a flat surface in higher dimensions) that best fits the data — minimising the distance between the predicted values and the actual values.

The formula
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
where ŷ is the predicted value, β₀ the intercept (bias), β₁…βₙ the learned weights, and x₁…xₙ the input features.
[Figure: Linear regression — fitting the line ŷ = β₀ + β₁x to data, with feature (x) on the horizontal axis and target (y) on the vertical axis.]
The algorithm finds the line that minimises the sum of squared residuals (the dashed vertical gaps between each point and the line).

When to use it: Predicting a continuous value — house prices, revenue forecasts, project durations, employee attrition cost. It's fast, interpretable, and often a solid baseline even when more complex models are available.

Cost function (Mean Squared Error)
MSE = (1/n) Σ (yᵢ − ŷᵢ)²

The model adjusts β values to minimise this cost — the average squared difference between actual and predicted values.
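As a sketch, the fit and the MSE cost above can be reproduced in a few lines of NumPy — the data points here are invented for illustration (roughly y = 2x + 1 plus noise):

```python
import numpy as np

# Toy data, invented for illustration: y ≈ 2x + 1 with a little noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

# Design matrix with a column of ones so β₀ (the intercept) is learned too
A = np.column_stack([np.ones_like(X), X])

# Ordinary least squares: the β that minimises the sum of squared residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

y_hat = A @ beta
mse = np.mean((y - y_hat) ** 2)  # the cost function from above
print(intercept, slope, mse)
```

With this toy data the recovered slope is close to 2 and the intercept close to 1, as you'd expect.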

2. Logistic regression

Despite the name, logistic regression is used for classification, not regression. It predicts the probability that an input belongs to a particular class (e.g. "spam" vs "not spam"). The key trick is the sigmoid function, which squashes any number into a value between 0 and 1 — a probability.

The sigmoid function
P(y = 1) = 1 / (1 + e^(−(β₀ + β₁x₁ + … + βₙxₙ)))
The output is always between 0 and 1; a threshold of 0.5 is the usual cut-off for binary classification.
[Figure: The sigmoid (S-curve) mapping z = β₀ + β₁x to a probability — outputs below 0.5 fall in the class 0 region, outputs above 0.5 in the class 1 region.]
The sigmoid function maps any real number to a probability between 0 and 1, creating the characteristic S-curve used for binary classification.

When to use it: Binary yes/no predictions — will a client churn? Is this transaction fraudulent? Should this email be flagged? Also extends to multi-class problems with softmax regression.
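The sigmoid and the 0.5 threshold can be sketched directly. The β values here are hypothetical stand-ins, not weights learned from data:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for a one-feature model (illustrative, not learned)
beta0, beta1 = -4.0, 2.0

def predict_proba(x):
    return sigmoid(beta0 + beta1 * x)

def predict_class(x, threshold=0.5):
    return int(predict_proba(x) >= threshold)

print(predict_proba(1.0))  # sigmoid(-2) ≈ 0.12 → class 0
print(predict_proba(3.0))  # sigmoid(+2) ≈ 0.88 → class 1
```

In practice the β values are found by maximising the likelihood of the training labels, typically via gradient descent.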

3. Decision trees

A decision tree splits data into branches based on feature values, creating a flowchart-like structure. At each node, the algorithm picks the feature and threshold that best separates the data. The result is highly interpretable — you can literally follow the tree to see why a prediction was made.

Decision tree: should we approve the loan?

Income > £50k?
├─ Yes → Debt ratio < 40%?
│         ├─ Yes → Approve
│         └─ No → Review
└─ No → Credit score > 700?
          ├─ Yes → Review
          └─ No → Decline

A simple decision tree for loan approval — each node asks a question, and the branches lead to a decision. Real trees have many more levels.
Splitting criterion (Gini impurity)
Gini = 1 − Σ pᵢ²

The tree picks the split that produces the lowest weighted Gini impurity in the child nodes. A Gini of 0 means perfectly pure (all one class).
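The Gini calculation is short enough to write out. As an illustration, the labels below reuse the loan-approval classes from the tree above:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 − Σ pᵢ², where pᵢ is the proportion of class i."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Impurity of a split: size-weighted average of the child impurities."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["approve"] * 4))             # 0.0 — perfectly pure node
print(gini(["approve", "decline"] * 2))  # 0.5 — maximally mixed (two classes)
print(weighted_gini(["approve"] * 3, ["decline", "approve"]))
```

The tree-building algorithm evaluates `weighted_gini` for every candidate feature/threshold pair and keeps the split with the lowest value.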

When to use it: When interpretability matters — regulated industries, audit trails, client-facing explanations. Also the building block for more powerful ensemble methods.

4. Random forests

A random forest is an ensemble of many decision trees, each trained on a random subset of the data and features. For a prediction, every tree votes, and the majority wins (classification) or the average is taken (regression). This reduces overfitting dramatically compared to a single tree.

Random forest: ensemble voting
Tree 1: Approve · Tree 2: Approve · Tree 3: Decline · Tree 4: Approve · Tree 5: Approve
Majority vote: Approve (4/5)
Five trees each make an independent prediction. The ensemble combines their votes — reducing the risk of any single tree's errors dominating.

When to use it: General-purpose classification and regression when you want high accuracy with moderate interpretability. Excellent for tabular business data — customer behaviour, risk scoring, demand forecasting.
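The combination step itself is simple enough to sketch — here using the five votes from the diagram above, plus invented regression outputs to show the averaging case:

```python
from collections import Counter

# Classification: the five trees from the diagram each cast a vote
tree_votes = ["Approve", "Approve", "Decline", "Approve", "Approve"]
decision, count = Counter(tree_votes).most_common(1)[0]
print(f"Majority vote: {decision} ({count}/{len(tree_votes)})")

# Regression: the ensemble prediction is the average of the trees' outputs
# (these forecasts are invented figures for illustration)
tree_outputs = [231_000, 244_000, 238_000, 251_000, 236_000]
ensemble = sum(tree_outputs) / len(tree_outputs)
print(f"Ensemble forecast: £{ensemble:,.0f}")
```

The hard work in a real random forest is training the diverse trees (bootstrap samples of rows, random subsets of features at each split); libraries such as scikit-learn handle that part.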

5. K-means clustering

K-means is an unsupervised algorithm — it groups unlabelled data into k clusters based on similarity. The algorithm iteratively assigns each data point to the nearest cluster centre (centroid) and then recalculates the centroids until convergence.

[Figure: K-means clustering — three customer segments (low, mid, and high value) plotted by annual spend against engagement score, with + marks at the cluster centroids.]
K-means automatically groups customers into segments based on spending and engagement.
Objective function
J = Σₖ Σᵢ∈Cₖ ‖xᵢ − μₖ‖²

Minimise the total within-cluster variance — the sum of squared distances from each point to its assigned centroid μₖ.

When to use it: Customer segmentation, document grouping, image compression, and any problem where you want to discover natural groupings in unlabelled data.
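The assign-then-update loop described above fits in a short NumPy sketch. The data is a toy set of two well-separated blobs; production implementations add smarter seeding (k-means++) and proper convergence checks:

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid (skip clusters that went empty)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs (invented coordinates for illustration)
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

With blobs this well separated, the two clusters recovered correspond exactly to the two blobs regardless of which points are picked as the initial centroids.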

6. Gradient boosting (XGBoost, LightGBM)

Gradient boosting builds an ensemble of weak decision trees sequentially — each new tree is fitted to the errors the previous trees left behind. The result is often the most accurate model for structured/tabular data, and it's the algorithm behind most Kaggle competition winners.

Boosting principle
F(x) = f₁(x) + f₂(x) + f₃(x) + … + fₘ(x)
Each fₘ is a small tree fitted to the residual errors of the previous ensemble
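A minimal sketch of the principle, using depth-1 "stumps" as the weak learners and invented toy data — real libraries like XGBoost add regularisation, second-order gradients, and far more sophisticated trees:

```python
import numpy as np

def fit_stump(x, residual):
    """Weak learner: a depth-1 'tree' — one threshold, two constant outputs."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((residual - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def boost(x, y, n_trees=50, lr=0.1):
    """F(x) = Σ fₘ(x): each new stump is fitted to the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)   # fit the errors made so far
        pred = pred + lr * stump(x)      # shrink each tree's contribution
        trees.append(stump)
    return lambda z: y.mean() + lr * sum(t(z) for t in trees)

# Toy data: a noisy step function (invented for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.9, 4.0, 4.2, 3.9])
F = boost(x, y)
print(F(np.array([2.0, 5.0])))
```

After 50 rounds the ensemble's predictions sit close to the two step levels (≈1 and ≈4), illustrating how successive residual-fitting sharpens the fit.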

When to use it: When accuracy on tabular data is paramount — fraud detection, credit scoring, pricing optimisation, demand forecasting. Implementations like XGBoost and LightGBM are fast and battle-tested.

7. K-nearest neighbours (KNN)

KNN is one of the simplest algorithms: to classify a new data point, look at the k closest points in the training data and take a majority vote. No actual "training" happens — the model is just the data itself.

[Figure: KNN (k = 3) — classifying a new point by its neighbours; the query point's three nearest neighbours are two of Class B and one of Class A.]
The query point looks at its 3 nearest neighbours: 2 are Class B, 1 is Class A — so it's classified as Class B.
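The whole algorithm is a sort and a vote. This sketch mirrors the k = 3 example from the figure, with invented coordinates:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of ((x, y), label) pairs."""
    by_distance = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

# Invented points: two B neighbours and one A neighbour sit near the query
train = [((1.0, 1.0), "A"), ((6.0, 5.0), "A"),
         ((2.0, 1.5), "B"), ((1.5, 2.0), "B"), ((7.0, 6.0), "B")]
print(knn_predict(train, (1.6, 1.4), k=3))  # → B
```

Note there is no fitting step at all — the "model" is just the stored training data, which is why KNN gets slow as the dataset grows.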

When to use it: Quick prototyping, recommendation systems ("users who are similar to you also liked…"), and small-to-medium datasets where simplicity is valued.

8. Neural networks (deep learning)

Neural networks are loosely inspired by the brain — layers of interconnected nodes ("neurons") that transform input data through weighted sums and non-linear activation functions. Deep learning simply means neural networks with many layers, enabling them to learn complex hierarchical representations.

Single neuron computation
output = activation(w₁x₁ + w₂x₂ + … + wₙxₙ + b)
where w₁…wₙ are the learned weights, b is the bias, and the activation is a non-linear function such as ReLU or sigmoid.
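The single-neuron formula translates almost directly into code. The weights here are illustrative stand-ins — in a real network they are learned by backpropagation:

```python
def relu(z):
    """ReLU activation: pass positives through, clamp negatives to zero."""
    return max(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """One neuron: weighted sum of inputs plus bias, through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative weights, not learned values
out = neuron([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(out)  # relu(0.5·1 − 0.25·2 + 0.1) = relu(0.1) = 0.1
```

A layer is just many such neurons over the same inputs, and a deep network stacks layers so each one transforms the previous layer's outputs.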

When to use it: Image recognition, natural language processing, speech recognition, and any problem with very large datasets and complex, non-linear patterns. Neural networks are the foundation of generative AI.

Algorithm cheat sheet

Algorithm            | Type         | Best for                     | Interpretability
---------------------|--------------|------------------------------|-----------------
Linear regression    | Supervised   | Continuous prediction        | High
Logistic regression  | Supervised   | Binary classification        | High
Decision tree        | Supervised   | Explainable decisions        | High
Random forest        | Supervised   | General tabular data         | Medium
Gradient boosting    | Supervised   | Max accuracy, tabular        | Low
K-nearest neighbours | Supervised   | Small data, prototyping      | High
K-means              | Unsupervised | Clustering / segmentation    | Medium
Neural networks      | Supervised   | Complex patterns, large data | Low

Where to go from here

These eight algorithms cover the vast majority of real-world machine learning problems. For most business applications, you'll use linear/logistic regression for baselines, gradient boosting or random forests for production accuracy, k-means for segmentation, and neural networks for unstructured data (text, images, audio).

The key is not to memorise formulas — it's to understand which algorithm fits which problem, and to build intuition by working with real data. Start with the simplest model that could work, measure it rigorously, and only add complexity when it measurably improves results.

If you'd like help identifying the right ML approach for your use case, we'd be happy to talk. Get in touch to start the conversation.