Each section below documents one algorithm’s implementation: the class signature, the core NumPy logic extracted directly from the notebook, the dataset it’s evaluated on, and the hyperparameters you can tune. Reading the code alongside the math in the notebooks is the fastest way to build intuition for how each model actually works.
Decision Tree
Algorithm: CART (Classification and Regression Trees) using Gini impurity. Recursively finds the feature-threshold pair that produces the lowest weighted impurity across both child nodes.
Dataset: Iris (150 samples, 4 features, 3 classes) via sklearn.datasets.load_iris.
Key hyperparameters: max_depth (controls overfitting), min_samples_split (minimum node size before attempting a split).
The tree is composed of Node objects. Internal nodes store a feature index and threshold; leaf nodes store a value (the majority class). _gini computes impurity for a label array. _best_split exhaustively searches every unique threshold for every feature. _grow recurses until max_depth or a pure node is reached.
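A minimal sketch of how these pieces might fit together. Only Node, _gini, _best_split, _grow, max_depth and min_samples_split are named in the notebook; the exact Node fields, the fit/predict wrappers and the default values here are illustrative assumptions.

```python
import numpy as np

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        # Internal nodes use feature/threshold/left/right; leaves only set value.
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.value = value

class DecisionTree:
    def __init__(self, max_depth=5, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split

    def _gini(self, y):
        # Gini impurity: 1 - sum of squared class proportions.
        _, counts = np.unique(y, return_counts=True)
        p = counts / len(y)
        return 1.0 - np.sum(p ** 2)

    def _best_split(self, X, y):
        best_feature, best_threshold, best_impurity = None, None, np.inf
        for f in range(X.shape[1]):
            for t in np.unique(X[:, f]):
                left, right = y[X[:, f] <= t], y[X[:, f] > t]
                if len(left) == 0 or len(right) == 0:
                    continue
                # Weighted impurity across both children.
                impurity = (len(left) * self._gini(left) + len(right) * self._gini(right)) / len(y)
                if impurity < best_impurity:
                    best_feature, best_threshold, best_impurity = f, t, impurity
        return best_feature, best_threshold

    def _grow(self, X, y, depth=0):
        # Stop on a pure node, the depth cap, or a too-small node -> majority-class leaf.
        if len(np.unique(y)) == 1 or depth >= self.max_depth or len(y) < self.min_samples_split:
            return Node(value=np.bincount(y).argmax())
        feature, threshold = self._best_split(X, y)
        if feature is None:
            return Node(value=np.bincount(y).argmax())
        mask = X[:, feature] <= threshold
        return Node(feature, threshold,
                    self._grow(X[mask], y[mask], depth + 1),
                    self._grow(X[~mask], y[~mask], depth + 1))

    def fit(self, X, y):
        self.root = self._grow(X, y)
        return self

    def _predict_one(self, x, node):
        if node.value is not None:
            return node.value
        branch = node.left if x[node.feature] <= node.threshold else node.right
        return self._predict_one(x, branch)

    def predict(self, X):
        return np.array([self._predict_one(x, self.root) for x in X])
```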
K-Means Clustering
Algorithm: Lloyd’s algorithm. Randomly initialises k centroids, then alternates between assigning each point to its nearest centroid and recomputing centroids as cluster means. Stops when centroid movement falls below tol.
Dataset: Synthetic blobs via sklearn.datasets.make_blobs (400 samples, 4 true clusters).
Key hyperparameters: k (number of clusters), max_iters (iteration cap), tol (convergence threshold).
_assign computes Euclidean distances from every point to every centroid and returns the argmin. _inertia sums squared distances within each cluster (used for the elbow method). Centroids that lose all members retain their previous position to avoid NaN.
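A compact sketch of that loop. Only k, max_iters, tol, _assign, _inertia and the empty-cluster rule come from the notebook; the random initialisation strategy and the centroids_/labels_/inertia_ attribute names are assumptions.

```python
import numpy as np

class KMeans:
    def __init__(self, k=4, max_iters=300, tol=1e-4):
        self.k, self.max_iters, self.tol = k, max_iters, tol

    def _assign(self, X, centroids):
        # Euclidean distance from every point to every centroid, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    def _inertia(self, X, labels, centroids):
        # Sum of squared distances of each point to its assigned centroid (elbow method).
        return np.sum((X - centroids[labels]) ** 2)

    def fit(self, X):
        rng = np.random.default_rng(42)
        centroids = X[rng.choice(len(X), self.k, replace=False)]
        for _ in range(self.max_iters):
            labels = self._assign(X, centroids)
            new_centroids = centroids.copy()          # empty clusters keep their old position
            for j in range(self.k):
                members = X[labels == j]
                if len(members) > 0:
                    new_centroids[j] = members.mean(axis=0)
            moved = np.linalg.norm(new_centroids - centroids)
            centroids = new_centroids
            if moved < self.tol:
                break
        self.centroids_ = centroids
        self.labels_ = self._assign(X, centroids)
        self.inertia_ = self._inertia(X, self.labels_, centroids)
        return self
```

Running fit for increasing k and plotting inertia_ against k gives the elbow curve mentioned above.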
K-Nearest Neighbors
Algorithm: Lazy learner — stores the entire training set and defers computation to prediction time. For each test point, computes the Euclidean distance to every training point, selects the k smallest, and returns the majority class via np.bincount.
Dataset: Small custom 2D dataset demonstrating binary classification.
Key hyperparameters: k (number of neighbors; smaller values = higher variance, larger values = higher bias).
fit only stores X_train and y_train — no training occurs. _predict_one handles a single query point; predict vectorises over the test set. Distance is Euclidean (sqrt(sum((x1-x2)^2))).
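A sketch of the whole class. fit, _predict_one, predict, k and the np.bincount vote follow the description above; the class name, defaults and integer-label assumption are illustrative.

```python
import numpy as np

class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # Lazy learner: just memorise the training data.
        self.X_train = np.asarray(X_train, dtype=float)
        self.y_train = np.asarray(y_train, dtype=int)
        return self

    def _predict_one(self, x):
        # Euclidean distance from the query point to every training point.
        dists = np.sqrt(np.sum((self.X_train - x) ** 2, axis=1))
        nearest = np.argsort(dists)[:self.k]
        # Majority vote among the k nearest labels.
        return np.bincount(self.y_train[nearest]).argmax()

    def predict(self, X):
        return np.array([self._predict_one(x) for x in np.asarray(X, dtype=float)])
```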
Linear Regression
Algorithm: Two variants are implemented — gradient descent (iterative) and Ordinary Least Squares via the normal equation (closed-form). The gradient descent version initialises weights to zero and updates them each iteration using the mean squared error gradient.
Dataset: Toy 1D dataset (X = [1,2,3,4,5], y = [2,4,6,8,10]).
Key hyperparameters (GD): learning_rate, n_iter. OLS has no hyperparameters.
LinearRegression uses gradient descent: dw = (1/m) X^T (ŷ - y) and db = (1/m) sum(ŷ - y). LinearRegressionOLS solves θ = (X^T X)^{-1} X^T y directly using np.linalg.inv.
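A sketch of both variants under those update rules. The class names, gradients and normal equation follow the description above; predict and the bias-column handling in the OLS variant are assumptions.

```python
import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iter=1000):
        self.learning_rate, self.n_iter = learning_rate, n_iter

    def fit(self, X, y):
        m, n = X.shape
        self.w, self.b = np.zeros(n), 0.0     # weights initialised to zero
        for _ in range(self.n_iter):
            y_hat = X @ self.w + self.b
            dw = (1 / m) * X.T @ (y_hat - y)  # MSE gradient w.r.t. weights
            db = (1 / m) * np.sum(y_hat - y)  # MSE gradient w.r.t. bias
            self.w -= self.learning_rate * dw
            self.b -= self.learning_rate * db
        return self

    def predict(self, X):
        return X @ self.w + self.b

class LinearRegressionOLS:
    def fit(self, X, y):
        Xb = np.c_[np.ones(len(X)), X]                    # prepend a bias column of ones
        self.theta = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y  # normal equation
        return self

    def predict(self, X):
        return np.c_[np.ones(len(X)), X] @ self.theta
```

On the toy dataset the OLS variant recovers the exact line y = 2x, while the gradient-descent variant approaches it as n_iter grows.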
Logistic Regression
Algorithm: Binary classifier trained with gradient descent on binary cross-entropy loss. A sigmoid activation squashes the linear output to a probability; predictions are thresholded at 0.5 by default.
Dataset: Toy 2D dataset (X = [[1,2],[2,3],[3,4],[4,5]], y = [0,0,1,1]).
Key hyperparameters: learning_rate, n_iter, threshold (prediction cut-off, adjustable at inference time).
_sigmoid(z) = 1 / (1 + exp(-z)). Gradients take the same form as in linear regression because the cross-entropy gradient simplifies to (ŷ - y). get_probabilities exposes raw sigmoid outputs; predict applies the threshold.
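A sketch of the full class under those rules. _sigmoid, the (ŷ - y) gradient form, get_probabilities and the adjustable threshold are from the description above; the default values and the placement of threshold as a predict argument are assumptions.

```python
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.1, n_iter=1000):
        self.learning_rate, self.n_iter = learning_rate, n_iter

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        m, n = X.shape
        self.w, self.b = np.zeros(n), 0.0
        for _ in range(self.n_iter):
            y_hat = self._sigmoid(X @ self.w + self.b)
            dw = (1 / m) * X.T @ (y_hat - y)   # cross-entropy gradient reduces to (ŷ - y)
            db = (1 / m) * np.sum(y_hat - y)
            self.w -= self.learning_rate * dw
            self.b -= self.learning_rate * db
        return self

    def get_probabilities(self, X):
        return self._sigmoid(X @ self.w + self.b)

    def predict(self, X, threshold=0.5):
        return (self.get_probabilities(X) >= threshold).astype(int)
```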
Naive Bayes
Algorithm: Gaussian Naive Bayes. During fit, the prior probability, per-feature mean, and per-feature variance are computed for each class. At prediction time, the log posterior is computed as log P(y) + sum log P(x_i | y) using the Gaussian PDF, and the class with the highest score wins.
Dataset: Iris via sklearn.datasets.load_iris (80/20 train-test split).
Key hyperparameters: None — the model is fully determined by the training data. A small constant (1e-9) is added to variances to prevent division by zero.
Working in log space (_log_likelihood) avoids numeric underflow from multiplying many small probabilities. The Gaussian log-likelihood for feature i under class c is -0.5 * (log(2πv) + (x-m)²/v). Summing over features exploits the conditional independence assumption.
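A sketch following that recipe. fit, _log_likelihood, the 1e-9 variance floor and the log-posterior decision rule are from the description above; the class name and the dictionary layout for the per-class statistics are assumptions.

```python
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)
            self.means[c] = Xc.mean(axis=0)
            self.vars[c] = Xc.var(axis=0) + 1e-9   # guard against zero variance
        return self

    def _log_likelihood(self, x, c):
        m, v = self.means[c], self.vars[c]
        # Per-feature Gaussian log-densities, summed under conditional independence.
        return np.sum(-0.5 * (np.log(2 * np.pi * v) + (x - m) ** 2 / v))

    def predict(self, X):
        preds = []
        for x in X:
            # Log posterior: log P(y) + sum_i log P(x_i | y); the highest score wins.
            scores = [np.log(self.priors[c]) + self._log_likelihood(x, c) for c in self.classes]
            preds.append(self.classes[np.argmax(scores)])
        return np.array(preds)
```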
Neural Network
Algorithm: Two-layer feedforward network: Input → Hidden (ReLU) → Output (Softmax). Trained with mini-batch gradient descent and cross-entropy loss. Weights are initialised with Xavier initialisation.
Dataset: Digits dataset via sklearn.datasets.load_digits (1797 samples, 64 features, 10 classes).
Key hyperparameters: hidden_dim, lr (learning rate), epochs, batch_size.
forward computes Z1→A1 (ReLU) → Z2→A2 (Softmax). backward derives gradients analytically: dZ2 = A2 - y_true (the softmax + cross-entropy simplification), then applies the chain rule back through the hidden layer via relu_grad. fit shuffles indices each epoch for stochastic mini-batching.
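A sketch of the forward/backward pair and the mini-batch loop, assuming one-hot targets. forward, backward, relu_grad, fit, the dZ2 = A2 - y_true shortcut and the epoch shuffle are from the description above; the exact Xavier scaling constants, parameter names and where epochs/batch_size are passed are assumptions.

```python
import numpy as np

class NeuralNetwork:
    def __init__(self, input_dim, hidden_dim, output_dim, lr=0.1):
        # Xavier-style initialisation: scale by 1/sqrt(fan_in).
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(1 / input_dim)
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim) * np.sqrt(1 / hidden_dim)
        self.b2 = np.zeros(output_dim)
        self.lr = lr

    def relu_grad(self, Z):
        return (Z > 0).astype(float)

    def forward(self, X):
        self.Z1 = X @ self.W1 + self.b1
        self.A1 = np.maximum(0, self.Z1)                  # ReLU
        self.Z2 = self.A1 @ self.W2 + self.b2
        expZ = np.exp(self.Z2 - self.Z2.max(axis=1, keepdims=True))
        self.A2 = expZ / expZ.sum(axis=1, keepdims=True)  # Softmax
        return self.A2

    def backward(self, X, y_true):
        m = len(X)
        dZ2 = self.A2 - y_true                            # softmax + cross-entropy shortcut
        dW2 = self.A1.T @ dZ2 / m
        db2 = dZ2.mean(axis=0)
        dZ1 = (dZ2 @ self.W2.T) * self.relu_grad(self.Z1) # chain rule through the hidden layer
        dW1 = X.T @ dZ1 / m
        db1 = dZ1.mean(axis=0)
        for param, grad in [(self.W1, dW1), (self.b1, db1), (self.W2, dW2), (self.b2, db2)]:
            param -= self.lr * grad                       # in-place gradient step

    def fit(self, X, y_onehot, epochs=50, batch_size=32):
        for _ in range(epochs):
            idx = np.random.permutation(len(X))           # shuffle indices each epoch
            for start in range(0, len(X), batch_size):
                batch = idx[start:start + batch_size]
                self.forward(X[batch])
                self.backward(X[batch], y_onehot[batch])
        return self

    def predict(self, X):
        return self.forward(X).argmax(axis=1)
```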
PCA
Algorithm: Principal Component Analysis via eigen-decomposition of the sample covariance matrix. Data is mean-centered, the covariance matrix is formed, and np.linalg.eigh decomposes it. The top n_components eigenvectors form the projection matrix.
Dataset: Iris (2D projection) and Digits (reconstruction at various component counts), both via scikit-learn.
Key hyperparameters: n_components (number of principal components to retain).
fit stores the mean and the top-k eigenvectors as self.components (shape [n_components, n_features]). transform centers new data and projects it: (X - mean) @ components.T. explained_variance_ratio_ exposes the fraction of variance captured by each component.
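A sketch of fit/transform, plus an inverse_transform added here purely to illustrate the reconstruction experiment. fit, transform, self.components, explained_variance_ratio_ and np.linalg.eigh are from the description above; the use of np.cov and the inverse_transform helper are assumptions.

```python
import numpy as np

class PCA:
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X):
        self.mean = X.mean(axis=0)
        Xc = X - self.mean
        cov = np.cov(Xc, rowvar=False)                   # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending eigenvalues
        order = np.argsort(eigvals)[::-1]                # reorder to descending variance
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        self.components = eigvecs[:, :self.n_components].T   # shape (n_components, n_features)
        self.explained_variance_ratio_ = eigvals[:self.n_components] / eigvals.sum()
        return self

    def transform(self, X):
        return (X - self.mean) @ self.components.T

    def inverse_transform(self, X_proj):
        # Map the low-dimensional projection back for reconstruction comparisons.
        return X_proj @ self.components + self.mean
```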
Random Forest
Algorithm: Ensemble of n_trees decision trees. Each tree is trained on a bootstrap sample (sampling with replacement) and restricted to a random subset of max_features features at each split. Final predictions are determined by majority vote across all trees.
Dataset: Breast Cancer Wisconsin via sklearn.datasets.load_breast_cancer (569 samples, 30 features, binary labels).
Key hyperparameters: n_trees, max_depth, max_features ('sqrt' uses sqrt(d) features per split; 'log2' uses log2(d)).
The notebook re-implements DecisionTree with an added max_features parameter for feature subsampling inside _best_split. RandomForest.fit loops over trees, draws a bootstrap index with np.random.choice(..., replace=True), trains a tree, and appends it to self.trees. predict stacks predictions from all trees and takes the mode column-wise.
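A sketch of the ensemble wrapper, assuming a DecisionTree class that accepts max_depth and max_features (as described above) and integer class labels for the bincount vote. The bootstrap via np.random.choice, self.trees and the column-wise mode are from the description; defaults and the vote implementation are assumptions.

```python
import numpy as np

class RandomForest:
    def __init__(self, n_trees=10, max_depth=10, max_features='sqrt'):
        self.n_trees, self.max_depth, self.max_features = n_trees, max_depth, max_features
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        self.trees = []
        for _ in range(self.n_trees):
            idx = np.random.choice(n, n, replace=True)     # bootstrap sample (with replacement)
            tree = DecisionTree(max_depth=self.max_depth, max_features=self.max_features)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Stack per-tree predictions into shape (n_trees, n_samples), then vote column-wise.
        all_preds = np.stack([tree.predict(X) for tree in self.trees])
        return np.array([np.bincount(col.astype(int)).argmax() for col in all_preds.T])
```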
SVM
Algorithm: Soft-margin linear SVM (primal form) trained with sub-gradient descent on the regularised hinge loss: L = λ‖w‖² + (1/n) Σ max(0, 1 − y_i(w^T x_i + b)). Labels are re-coded to ±1 internally.
Dataset: Synthetic binary classification via sklearn.datasets.make_classification (500 samples, 2 features), standardised with StandardScaler.
Key hyperparameters: lr (learning rate), lambda_param (regularisation strength), n_iters.
At each iteration, margins = y * (Xw + b) identifies support vectors (margins < 1). Sub-gradients are computed only for those violating samples: dw = 2λw - mean(X[mask] * y[mask]), db = -mean(y[mask]). The loss history is tracked in self.losses for convergence plots.
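A sketch of that training loop. lr, lambda_param, n_iters, self.losses, the margin mask and the sub-gradient expressions are from the description above; the loss bookkeeping details, the empty-mask guard and predict are assumptions.

```python
import numpy as np

class SVM:
    def __init__(self, lr=0.001, lambda_param=0.01, n_iters=1000):
        self.lr, self.lambda_param, self.n_iters = lr, lambda_param, n_iters
        self.losses = []

    def fit(self, X, y):
        y_ = np.where(y <= 0, -1, 1)                   # re-code labels to ±1
        n, d = X.shape
        self.w, self.b = np.zeros(d), 0.0
        for _ in range(self.n_iters):
            margins = y_ * (X @ self.w + self.b)
            mask = margins < 1                         # margin violators
            # Track L = λ‖w‖² + mean hinge loss for convergence plots.
            hinge = np.maximum(0, 1 - margins).mean()
            self.losses.append(self.lambda_param * np.sum(self.w ** 2) + hinge)
            dw = 2 * self.lambda_param * self.w        # regularisation sub-gradient
            db = 0.0
            if mask.any():                             # hinge sub-gradient from violators only
                dw = dw - (X[mask] * y_[mask, None]).mean(axis=0)
                db = -y_[mask].mean()
            self.w -= self.lr * dw
            self.b -= self.lr * db
        return self

    def predict(self, X):
        return np.sign(X @ self.w + self.b)
```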