
Part 12: Projects & Applications

April 26, 2026 · Wasil Zafar · 25 min read

The final capstone of the series. We build four complete ML algorithms from scratch using only NumPy, tracing every line of code back to the mathematical foundation built across Parts 1–11: linear regression, Naive Bayes, PCA, and a two-layer neural network with backpropagation.

Table of Contents

  1. Math-to-Algorithm Map
  2. Project 1: Linear Regression
  3. Project 2: Naive Bayes
  4. Project 3: PCA
  5. Project 4: Neural Network
  6. Roadmap for Further Study
  7. Series Conclusion

Math-to-Algorithm Map

Before we build, here is the explicit connection between the prior 11 parts and the algorithms we will implement:

| Algorithm | Core Math Parts | Key Concepts Used |
|---|---|---|
| Linear Regression (OLS + GD) | Parts 7, 8, 9 | Normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$, MSE loss gradient $\nabla_\mathbf{w}\mathcal{L}$, gradient descent update |
| Naive Bayes Classifier | Parts 2, 4, 6 | Bayes' theorem $P(C \mid \mathbf{x})$, conditional independence assumption, log-likelihood for numerical stability |
| PCA | Parts 5, 7, 10 | Covariance matrix, eigendecomposition, variance explained by eigenvectors, projection |
| 2-Layer Neural Network | Parts 7, 8, 9 | Matrix multiply forward pass, chain rule backprop, sigmoid activation, binary cross-entropy, gradient descent |

Project 1: Linear Regression from Scratch

We implement both the closed-form normal equations (Part 7) and gradient descent (Part 8), and verify they agree.

Math Recap: OLS closed form minimizes $\mathcal{L}(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|_2^2$. Setting $\nabla_\mathbf{w}\mathcal{L} = 0$ gives $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$. Gradient descent uses $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla_\mathbf{w}\mathcal{L}$ where $\nabla_\mathbf{w}\mathcal{L} = \frac{2}{n}X^\top(X\mathbf{w} - \mathbf{y})$.
import numpy as np

# ---- Generate synthetic data ----
np.random.seed(42)
n, d = 100, 3
X_raw = np.random.randn(n, d)
true_w = np.array([2.0, -1.5, 0.8])
y = X_raw @ true_w + 0.5 * np.random.randn(n)

# Add bias column (intercept)
X = np.column_stack([np.ones(n), X_raw])   # (100, 4)

# ---- Method 1: Normal Equations (closed form) ----
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
y_pred_ols = X @ w_ols
mse_ols = np.mean((y - y_pred_ols)**2)
print("Normal Equations:")
print(f"  Coefficients: {np.round(w_ols, 4)}")
print(f"  MSE: {mse_ols:.6f}")

# ---- Method 2: Gradient Descent ----
def linear_regression_gd(X, y, lr=0.01, n_epochs=1000):
    n, p = X.shape
    w = np.zeros(p)
    losses = []
    for _ in range(n_epochs):
        y_hat = X @ w
        loss = np.mean((y - y_hat)**2)
        losses.append(loss)
        grad = (2 / n) * (X.T @ (y_hat - y))
        w -= lr * grad
    return w, losses

w_gd, losses = linear_regression_gd(X, y, lr=0.05, n_epochs=500)
y_pred_gd = X @ w_gd
mse_gd = np.mean((y - y_pred_gd)**2)
print("\nGradient Descent:")
print(f"  Coefficients: {np.round(w_gd, 4)}")
print(f"  MSE: {mse_gd:.6f}")
print(f"\nTrue coefficients (bias=0): [0, {true_w[0]}, {true_w[1]}, {true_w[2]}]")
print(f"OLS vs GD agreement: {np.allclose(w_ols, w_gd, atol=1e-3)}")
print(f"Loss after first epoch: {losses[0]:.4f}, after final epoch: {losses[-1]:.4f}")

Project 2: Naive Bayes Classifier from Scratch

Naive Bayes applies Bayes' theorem (Part 4) with the conditional independence assumption (Part 2): given the class, features are independent. We use log-probabilities (Part 6) for numerical stability.

$$\hat{y} = \arg\max_c \left[\log P(C=c) + \sum_{j=1}^d \log P(x_j | C=c)\right]$$
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes: P(x_j|C=c) ~ N(mu_cj, sigma_cj^2)"""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_priors_ = {}
        self.params_ = {}   # {class: [(mu_j, sigma_j) for each feature j]}
        n = len(y)
        for c in self.classes_:
            X_c = X[y == c]
            self.log_priors_[c] = np.log(len(X_c) / n)
            self.params_[c] = (X_c.mean(axis=0), X_c.std(axis=0) + 1e-9)

    def _log_likelihood(self, x, c):
        mu, sigma = self.params_[c]
        # log of Gaussian PDF (ignoring constant term -0.5*log(2π))
        return np.sum(-np.log(sigma) - 0.5 * ((x - mu) / sigma)**2)

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {c: self.log_priors_[c] + self._log_likelihood(x, c)
                              for c in self.classes_}
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)

# --- Test on synthetic 2-class data ---
np.random.seed(0)
# Class 0: centered at (1,1), Class 1: centered at (4,4)
X0 = np.random.randn(50, 2) + np.array([1.0, 1.0])
X1 = np.random.randn(50, 2) + np.array([4.0, 4.0])
X_train = np.vstack([X0[:40], X1[:40]])
y_train = np.array([0]*40 + [1]*40)
X_test  = np.vstack([X0[40:], X1[40:]])
y_test  = np.array([0]*10 + [1]*10)

gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

accuracy = (y_pred == y_test).mean()
print("Gaussian Naive Bayes — Results:")
print(f"  Classes: {gnb.classes_}")
print(f"  Log priors: { {c: round(lp,3) for c, lp in gnb.log_priors_.items()} }")
for c in gnb.classes_:
    mu, sigma = gnb.params_[c]
    print(f"  Class {c}: mu={np.round(mu,2)}, sigma={np.round(sigma,2)}")
print(f"\n  Test accuracy: {accuracy:.2%}")
print(f"  Predictions: {y_pred}")
print(f"  True labels: {y_test}")

Project 3: PCA from Scratch

PCA (Part 7) finds the directions of maximum variance by computing the covariance matrix's eigendecomposition. Projecting onto the top $k$ eigenvectors gives the best rank-$k$ approximation in the least-squares sense.

$$\Sigma = \frac{1}{n-1}(X - \bar{X})^\top(X - \bar{X}), \quad \Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i$$
import numpy as np

class PCA:
    """PCA via covariance matrix eigendecomposition (Part 7)."""
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        cov = (X_centered.T @ X_centered) / (len(X) - 1)
        # eigh: symmetric matrix — eigenvalues in ascending order
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Sort descending (largest variance first)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        self.components_ = eigenvectors[:, :self.n_components].T   # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = eigenvalues[:self.n_components] / eigenvalues.sum()
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, Z):
        return Z @ self.components_ + self.mean_

# --- Test on synthetic 4D data with 2 dominant directions ---
np.random.seed(7)
n = 200
# Generate data along 2 dominant directions in 4D
true_components = np.array([[1,1,0,0],[0,0,1,-1]], dtype=float)
true_components /= np.linalg.norm(true_components, axis=1, keepdims=True)
Z_true = np.random.randn(n, 2) * np.array([3.0, 1.5])   # 2 latent factors
X = Z_true @ true_components + 0.3 * np.random.randn(n, 4)  # 4D observations + noise

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)
X_reconstructed = pca.inverse_transform(Z)

print("PCA from scratch — Results:")
print(f"  Explained variance (components 1,2): {pca.explained_variance_ratio_ * 100}")
print(f"  Total variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.1%}")
print(f"\n  PC1 direction: {np.round(pca.components_[0], 3)}")
print(f"  PC2 direction: {np.round(pca.components_[1], 3)}")

reconstruction_error = np.mean((X - X_reconstructed)**2)
print(f"\n  Reconstruction MSE (2 PCs): {reconstruction_error:.4f}")

# Compare to numpy built-in via SVD
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print(f"\n  Verify against SVD: PC1 cosine similarity = {abs(pca.components_[0] @ Vt[0]):.4f}")
print("  (cosine ~ 1.0 means same direction)")

Project 4: Mini Neural Network from Scratch

A 2-layer neural network forward pass and backpropagation from scratch. Every operation traces back to Parts 7 (matrix multiply), 8 (chain rule / gradient descent), and 9 (binary cross-entropy loss, sigmoid activation).

$$\text{Forward: } \mathbf{z}^{(1)} = X W_1 + b_1, \quad \mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)}), \quad \hat{y} = \sigma(\mathbf{a}^{(1)} W_2 + b_2)$$
$$\mathcal{L} = -\frac{1}{n}\sum_i [y_i \log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)]$$
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def train_nn(X, y, hidden_size=4, lr=0.1, n_epochs=2000, seed=42):
    """Two-layer neural network with backpropagation."""
    n, d = X.shape
    rng = np.random.default_rng(seed)

    # Initialize weights (Xavier: divide by sqrt of fan_in)
    W1 = rng.standard_normal((d, hidden_size)) / np.sqrt(d)
    b1 = np.zeros(hidden_size)
    W2 = rng.standard_normal((hidden_size, 1)) / np.sqrt(hidden_size)
    b2 = np.zeros(1)

    losses = []
    for epoch in range(n_epochs):
        # ---- Forward pass ----
        z1 = X @ W1 + b1          # (n, hidden_size)
        a1 = sigmoid(z1)           # (n, hidden_size)
        z2 = a1 @ W2 + b2          # (n, 1)
        y_hat = sigmoid(z2).ravel()  # (n,)
        loss = binary_cross_entropy(y, y_hat)
        losses.append(loss)

        # ---- Backward pass (chain rule) ----
        # dL/dy_hat → dL/dz2 (sigmoid + BCE cancel nicely: δ2 = y_hat - y)
        delta2 = (y_hat - y).reshape(-1, 1)   # (n, 1)
        dW2 = (a1.T @ delta2) / n              # (hidden, 1)
        db2 = delta2.mean(axis=0)

        # dL/da1 → dL/dz1 (sigmoid gradient: σ'(z) = σ(z)(1-σ(z)))
        delta1 = (delta2 @ W2.T) * (a1 * (1 - a1))   # (n, hidden)
        dW1 = (X.T @ delta1) / n                       # (d, hidden)
        db1 = delta1.mean(axis=0)

        # ---- Update ----
        W1 -= lr * dW1;  b1 -= lr * db1
        W2 -= lr * dW2;  b2 -= lr * db2

    return W1, b1, W2, b2, losses

# --- XOR problem: non-linearly separable ---
X_xor = np.array([[0,0],[0,1],[1,0],[1,1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)  # XOR

W1, b1, W2, b2, losses = train_nn(X_xor, y_xor, hidden_size=4, lr=0.5, n_epochs=5000)

# Final predictions
z1 = X_xor @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2;    y_hat = sigmoid(z2).ravel()
preds = (y_hat > 0.5).astype(int)

print("XOR Neural Network from Scratch:")
print(f"  Architecture: 2 → 4 → 1 (sigmoid activations)")
print(f"  Initial loss: {losses[0]:.4f}")
print(f"  Final loss:   {losses[-1]:.4f}")
print(f"\n  Input   True   Predicted  Probability")
for i in range(4):
    print(f"  {X_xor[i]}  {int(y_xor[i])}      {preds[i]}          {y_hat[i]:.3f}")
print(f"\n  Accuracy: {(preds == y_xor.astype(int)).mean():.0%}")

What just happened: a purely linear model cannot solve XOR, because the two classes are not linearly separable. The hidden layer with sigmoid activations maps the inputs into a new feature space in which XOR becomes linearly separable, a small instance of the expressive power promised by the universal approximation theorem. Every piece of math used here came from this series.
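
A standard way to convince yourself that a hand-derived backward pass is correct is a finite-difference gradient check (Part 8's limit definition of the derivative). Below is a minimal, self-contained sketch that verifies the "sigmoid + BCE" identity $\delta = \hat{y} - y$ on a tiny logistic-regression case; the _chk helper names are mine, not part of the project code:

import numpy as np

def sigmoid_chk(z):
    return 1 / (1 + np.exp(-z))

def bce_chk(w, X, y, eps=1e-8):
    p = np.clip(sigmoid_chk(X @ w), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X_chk = rng.standard_normal((20, 3))
y_chk = (rng.random(20) > 0.5).astype(float)
w_chk = rng.standard_normal(3)

# Analytic gradient from the delta = y_hat - y identity: (1/n) X^T (sigmoid(Xw) - y)
analytic = X_chk.T @ (sigmoid_chk(X_chk @ w_chk) - y_chk) / len(y_chk)

# Central-difference numerical gradient, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(w_chk)
for j in range(len(w_chk)):
    e = np.zeros_like(w_chk)
    e[j] = h
    numeric[j] = (bce_chk(w_chk + e, X_chk, y_chk) - bce_chk(w_chk - e, X_chk, y_chk)) / (2 * h)

print(f"max |analytic - numeric|: {np.max(np.abs(analytic - numeric)):.2e}")   # should be ~1e-8 or smaller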

Roadmap for Further Study

Where to go from here:
  • Deep Learning Math: Convolutional operations as cross-correlations, attention as scaled dot-product $\text{softmax}(QK^\top/\sqrt{d_k})V$ (a short sketch follows this list), backpropagation through time (BPTT)
  • Probabilistic ML: Gaussian processes (Part 11 Bayesian inference at function level), variational autoencoders (ELBO = reconstruction − KL divergence), normalizing flows
  • Optimization Theory: Convergence rates, adaptive learning rates (Adam from Part 8), second-order methods (Newton, L-BFGS)
  • Statistical Learning Theory: VC dimension, Rademacher complexity, uniform convergence (extending Part 11 PAC learning)
  • Numerical Linear Algebra: LU, QR, Cholesky decompositions, iterative solvers for large systems (Conjugate Gradient)
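
The attention formula in the first bullet is nothing more than the matrix machinery of Part 7 followed by a row-wise softmax. A minimal single-head NumPy sketch (no masking, no batching; the shapes and values below are made up purely for illustration):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n_queries, n_keys)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                    # (n_queries, d_v)

rng = np.random.default_rng(1)
Q = rng.standard_normal((3, 4))   # 3 queries, d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys
V = rng.standard_normal((5, 2))   # 5 values, d_v = 2
print(scaled_dot_product_attention(Q, K, V).shape)        # (3, 2)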

Recommended resources: Mathematics for Machine Learning (Deisenroth et al., free PDF); Bishop's Pattern Recognition & Machine Learning; 3Blue1Brown's Essence of Linear Algebra and Calculus series; Deep Learning textbook (Goodfellow et al., free online).

Series Conclusion

Congratulations — you've completed the series! Across 12 parts, you built a complete mathematical foundation for machine learning: from set theory and combinatorics, through probability and information theory, into linear algebra and calculus, and finally to advanced statistical theory and hands-on implementations. Every major ML algorithm now has a traceable mathematical foundation you have worked through.

Here is a final map of how each part connects to real-world ML:

| Part | Topic | Powers in ML |
|---|---|---|
| 1 | Mathematical Thinking | Reading papers, writing proofs, thinking in abstractions |
| 2 | Set Theory | Feature engineering, Jaccard similarity, dataset operations |
| 3 | Combinatorics | Binomial distribution, hyperparameter search counting |
| 4 | Probability | Naive Bayes, Bayesian networks, probabilistic predictions |
| 5 | Statistics | A/B testing, confidence intervals, model evaluation |
| 6 | Information Theory | Cross-entropy loss, KL divergence in VAEs, decision trees |
| 7 | Linear Algebra | Neural network layers, PCA, SVD-based recommendations |
| 8 | Calculus | Gradient descent, backpropagation, Adam optimizer |
| 9 | ML-Specific Math | Loss functions, regularization, bias-variance tradeoff |
| 10 | Computational Math | Vectorized implementations, numerical stability, simulation |
| 11 | Advanced Topics | GMMs (EM), MCMC sampling, PAC generalization bounds |
| 12 | Projects | Everything — all four algorithms built and verified from scratch |