Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
The twelve parts of the series:

- Part 1: Mathematical Thinking (mindset, notation & functions)
- Part 2: Set Theory & Foundations (sets, operations & ML connections)
- Part 3: Combinatorics (counting, permutations & combinations)
- Part 4: Probability Fundamentals (rules, Bayes & distributions)
- Part 5: Statistics (descriptive to inferential)
- Part 6: Information Theory (entropy, cross-entropy & KL divergence)
- Part 7: Linear Algebra (vectors, matrices & transformations)
- Part 8: Calculus & Optimization (derivatives, gradients & descent)
- Part 9: ML-Specific Math (loss functions & regularization)
- Part 10: Computational Math (NumPy, SciPy & simulation)
- Part 11: Advanced Topics (multivariate stats & Bayesian inference)
- Part 12: Projects & Applications (build from scratch: regression, Bayes, PCA)

Math-to-Algorithm Map
Before we build, here is the explicit connection between the prior 11 parts and the algorithms we will implement:
| Algorithm | Core Math Parts | Key Concepts Used |
|---|---|---|
| Linear Regression (OLS + GD) | Parts 7, 8, 9 | Normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$, MSE loss gradient $\nabla_\mathbf{w}\mathcal{L}$, gradient descent update |
| Naive Bayes Classifier | Parts 2, 4, 6 | Bayes' theorem $P(C|\mathbf{x})$, conditional independence assumption, log-likelihood for numerical stability |
| PCA | Parts 5, 7, 10 | Covariance matrix, eigendecomposition, variance explained by eigenvectors, projection |
| 2-Layer Neural Network | Parts 7, 8, 9 | Matrix multiply forward pass, chain rule backprop, sigmoid activation, binary cross-entropy, gradient descent |
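For Project 1 specifically, the closed form in the table falls out of the MSE gradient: with loss $\mathcal{L}(\mathbf{w}) = \frac{1}{n}\lVert \mathbf{y} - X\mathbf{w}\rVert^2$, the gradient is

$$\nabla_\mathbf{w}\mathcal{L} = \frac{2}{n}\, X^\top (X\mathbf{w} - \mathbf{y}),$$

and setting it to zero gives the normal equations $X^\top X\mathbf{w} = X^\top\mathbf{y}$, while gradient descent instead iterates $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_\mathbf{w}\mathcal{L}$.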
Project 1: Linear Regression from Scratch
We implement both the closed-form normal equations (Part 7) and gradient descent (Part 8), and verify they agree.
```python
import numpy as np

# ---- Generate synthetic data ----
np.random.seed(42)
n, d = 100, 3
X_raw = np.random.randn(n, d)
true_w = np.array([2.0, -1.5, 0.8])
y = X_raw @ true_w + 0.5 * np.random.randn(n)

# Add bias column (intercept)
X = np.column_stack([np.ones(n), X_raw])  # (100, 4)

# ---- Method 1: Normal Equations (closed form) ----
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
y_pred_ols = X @ w_ols
mse_ols = np.mean((y - y_pred_ols)**2)
print("Normal Equations:")
print(f"  Coefficients: {np.round(w_ols, 4)}")
print(f"  MSE: {mse_ols:.6f}")

# ---- Method 2: Gradient Descent ----
def linear_regression_gd(X, y, lr=0.01, n_epochs=1000):
    n, p = X.shape
    w = np.zeros(p)
    losses = []
    for _ in range(n_epochs):
        y_hat = X @ w
        loss = np.mean((y - y_hat)**2)
        losses.append(loss)
        grad = (2 / n) * (X.T @ (y_hat - y))
        w -= lr * grad
    return w, losses

w_gd, losses = linear_regression_gd(X, y, lr=0.05, n_epochs=500)
y_pred_gd = X @ w_gd
mse_gd = np.mean((y - y_pred_gd)**2)
print("\nGradient Descent:")
print(f"  Coefficients: {np.round(w_gd, 4)}")
print(f"  MSE: {mse_gd:.6f}")

print(f"\nTrue coefficients (bias=0): [0, {true_w[0]}, {true_w[1]}, {true_w[2]}]")
print(f"OLS vs GD agreement: {np.allclose(w_ols, w_gd, atol=1e-3)}")
print(f"Loss after first epoch: {losses[0]:.4f}, after final epoch: {losses[-1]:.4f}")
```
Project 2: Naive Bayes Classifier from Scratch
Naive Bayes applies Bayes' theorem (Part 4) with the conditional independence assumption (Part 2): given the class, features are independent. We use log-probabilities (Part 6) for numerical stability.
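Concretely, with $d$ features the classifier picks the class with the largest log-posterior, where conditional independence turns the joint likelihood into a sum:

$$\hat{c} = \arg\max_{c}\left[\log P(C=c) + \sum_{j=1}^{d}\log P(x_j \mid C=c)\right].$$

The evidence term $P(\mathbf{x})$ is identical for every class, so it drops out of the argmax; this is exactly what `log_priors_` plus `_log_likelihood` compute in the code below.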
```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes: P(x_j|C=c) ~ N(mu_cj, sigma_cj^2)"""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_priors_ = {}
        self.params_ = {}  # {class: (mu vector, sigma vector), one entry per feature}
        n = len(y)
        for c in self.classes_:
            X_c = X[y == c]
            self.log_priors_[c] = np.log(len(X_c) / n)
            # 1e-9 guards against zero std (a feature constant within a class)
            self.params_[c] = (X_c.mean(axis=0), X_c.std(axis=0) + 1e-9)

    def _log_likelihood(self, x, c):
        mu, sigma = self.params_[c]
        # Log of Gaussian PDF, dropping the constant -0.5*log(2π),
        # which is the same for every class and cancels in the argmax
        return np.sum(-np.log(sigma) - 0.5 * ((x - mu) / sigma)**2)

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {c: self.log_priors_[c] + self._log_likelihood(x, c)
                              for c in self.classes_}
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)

# --- Test on synthetic 2-class data ---
np.random.seed(0)
# Class 0: centered at (1,1); Class 1: centered at (4,4)
X0 = np.random.randn(50, 2) + np.array([1.0, 1.0])
X1 = np.random.randn(50, 2) + np.array([4.0, 4.0])
X_train = np.vstack([X0[:40], X1[:40]])
y_train = np.array([0]*40 + [1]*40)
X_test = np.vstack([X0[40:], X1[40:]])
y_test = np.array([0]*10 + [1]*10)

gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
accuracy = (y_pred == y_test).mean()

print("Gaussian Naive Bayes — Results:")
print(f"  Classes: {gnb.classes_}")
print(f"  Log priors: { {c: round(lp, 3) for c, lp in gnb.log_priors_.items()} }")
for c in gnb.classes_:
    mu, sigma = gnb.params_[c]
    print(f"  Class {c}: mu={np.round(mu, 2)}, sigma={np.round(sigma, 2)}")
print(f"\n  Test accuracy: {accuracy:.2%}")
print(f"  Predictions: {y_pred}")
print(f"  True labels: {y_test}")
```
Project 3: PCA from Scratch
PCA (Part 7) finds the directions of maximum variance by computing the covariance matrix's eigendecomposition. Projecting onto the top $k$ eigenvectors gives the best rank-$k$ approximation in the least-squares sense.
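In symbols, with $X_c$ the mean-centered data matrix:

$$\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma\,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad Z = X_c V_k,$$

where $V_k$ holds the eigenvectors of the $k$ largest eigenvalues as columns, and $\lambda_i / \sum_j \lambda_j$ is the fraction of variance explained by component $i$.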
```python
import numpy as np

class PCA:
    """PCA via covariance matrix eigendecomposition (Part 7)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        cov = (X_centered.T @ X_centered) / (len(X) - 1)
        # eigh: symmetric matrix — eigenvalues in ascending order
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Sort descending (largest variance first)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        self.components_ = eigenvectors[:, :self.n_components].T  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = eigenvalues[:self.n_components] / eigenvalues.sum()
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, Z):
        return Z @ self.components_ + self.mean_

# --- Test on synthetic 4D data with 2 dominant directions ---
np.random.seed(7)
n = 200
# Generate data along 2 dominant directions in 4D
true_components = np.array([[1, 1, 0, 0], [0, 0, 1, -1]], dtype=float)
true_components /= np.linalg.norm(true_components, axis=1, keepdims=True)
Z_true = np.random.randn(n, 2) * np.array([3.0, 1.5])       # 2 latent factors
X = Z_true @ true_components + 0.3 * np.random.randn(n, 4)  # 4D observations + noise

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)
X_reconstructed = pca.inverse_transform(Z)

print("PCA from scratch — Results:")
print(f"  Explained variance ratio, % (components 1, 2): {pca.explained_variance_ratio_ * 100}")
print(f"  Total variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.1%}")
print(f"\n  PC1 direction: {np.round(pca.components_[0], 3)}")
print(f"  PC2 direction: {np.round(pca.components_[1], 3)}")

reconstruction_error = np.mean((X - X_reconstructed)**2)
print(f"\n  Reconstruction MSE (2 PCs): {reconstruction_error:.4f}")

# Compare to NumPy's built-in SVD
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print(f"\n  Verify against SVD: PC1 cosine similarity = {abs(pca.components_[0] @ Vt[0]):.4f}")
print("  (cosine ~ 1.0 means same direction)")
```
Project 4: Mini Neural Network from Scratch
We implement a 2-layer neural network's forward pass and backpropagation from scratch. Every operation traces back to Parts 7 (matrix multiplication), 8 (chain rule and gradient descent), and 9 (binary cross-entropy loss, sigmoid activation).
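In equations, the forward pass is

$$Z_1 = X W_1 + \mathbf{b}_1,\quad A_1 = \sigma(Z_1),\quad Z_2 = A_1 W_2 + \mathbf{b}_2,\quad \hat{\mathbf{y}} = \sigma(Z_2),$$

and because the sigmoid and BCE derivatives cancel, the backward pass reduces to

$$\delta_2 = \hat{\mathbf{y}} - \mathbf{y}, \qquad \delta_1 = (\delta_2 W_2^\top) \odot A_1 \odot (1 - A_1),$$

which is exactly what the code computes.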
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def train_nn(X, y, hidden_size=4, lr=0.1, n_epochs=2000, seed=42):
    """Two-layer neural network with backpropagation."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize weights (Xavier: divide by sqrt of fan_in)
    W1 = rng.standard_normal((d, hidden_size)) / np.sqrt(d)
    b1 = np.zeros(hidden_size)
    W2 = rng.standard_normal((hidden_size, 1)) / np.sqrt(hidden_size)
    b2 = np.zeros(1)
    losses = []
    for epoch in range(n_epochs):
        # ---- Forward pass ----
        z1 = X @ W1 + b1             # (n, hidden_size)
        a1 = sigmoid(z1)             # (n, hidden_size)
        z2 = a1 @ W2 + b2            # (n, 1)
        y_hat = sigmoid(z2).ravel()  # (n,)
        loss = binary_cross_entropy(y, y_hat)
        losses.append(loss)
        # ---- Backward pass (chain rule) ----
        # dL/dy_hat → dL/dz2 (sigmoid + BCE cancel nicely: δ2 = y_hat - y)
        delta2 = (y_hat - y).reshape(-1, 1)         # (n, 1)
        dW2 = (a1.T @ delta2) / n                   # (hidden, 1)
        db2 = delta2.mean(axis=0)
        # dL/da1 → dL/dz1 (sigmoid gradient: σ'(z) = σ(z)(1-σ(z)))
        delta1 = (delta2 @ W2.T) * (a1 * (1 - a1))  # (n, hidden)
        dW1 = (X.T @ delta1) / n                    # (d, hidden)
        db1 = delta1.mean(axis=0)
        # ---- Update ----
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2, losses

# --- XOR problem: non-linearly separable ---
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)  # XOR
W1, b1, W2, b2, losses = train_nn(X_xor, y_xor, hidden_size=4, lr=0.5, n_epochs=5000)

# Final predictions
z1 = X_xor @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2; y_hat = sigmoid(z2).ravel()
preds = (y_hat > 0.5).astype(int)

print("XOR Neural Network from Scratch:")
print("  Architecture: 2 → 4 → 1 (sigmoid activations)")
print(f"  Initial loss: {losses[0]:.4f}")
print(f"  Final loss: {losses[-1]:.4f}")
print("\n  Input      True  Predicted  Probability")
for i in range(4):
    print(f"  {X_xor[i]}   {int(y_xor[i])}     {preds[i]}          {y_hat[i]:.3f}")
print(f"\n  Accuracy: {(preds == y_xor.astype(int)).mean():.0%}")
```
Roadmap for Further Study
Where to Go from Here
- Deep Learning Math: Convolutional operations as cross-correlations, attention as scaled dot-product $\text{softmax}(QK^\top/\sqrt{d_k})V$ (a minimal sketch follows this list), backpropagation through time (BPTT)
- Probabilistic ML: Gaussian processes (Part 11 Bayesian inference at function level), variational autoencoders (ELBO = reconstruction − KL divergence), normalizing flows
- Optimization Theory: Convergence rates, adaptive learning rates (Adam from Part 8), second-order methods (Newton, L-BFGS)
- Statistical Learning Theory: VC dimension, Rademacher complexity, uniform convergence (extending Part 11 PAC learning)
- Numerical Linear Algebra: LU, QR, Cholesky decompositions, iterative solvers for large systems (Conjugate Gradient)
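As a first taste of the deep-learning math above, here is a minimal NumPy sketch of scaled dot-product attention. The toy shapes and inputs are illustrative assumptions, not anything from the series:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4)), rng.standard_normal((5, 4)), rng.standard_normal((5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```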
Recommended resources: Mathematics for Machine Learning (Deisenroth et al., free PDF); Bishop's Pattern Recognition & Machine Learning; 3Blue1Brown's Essence of Linear Algebra and Calculus series; Deep Learning textbook (Goodfellow et al., free online).
Series Conclusion
Here is a final map of how each part connects to real-world ML:
| Part | Topic | Powers in ML |
|---|---|---|
| 1 | Mathematical Thinking | Reading papers, writing proofs, thinking in abstractions |
| 2 | Set Theory | Feature engineering, Jaccard similarity, dataset operations |
| 3 | Combinatorics | Binomial distribution, hyperparameter search counting |
| 4 | Probability | Naive Bayes, Bayesian networks, probabilistic predictions |
| 5 | Statistics | A/B testing, confidence intervals, model evaluation |
| 6 | Information Theory | Cross-entropy loss, KL divergence in VAEs, decision trees |
| 7 | Linear Algebra | Neural network layers, PCA, SVD-based recommendations |
| 8 | Calculus | Gradient descent, backpropagation, Adam optimizer |
| 9 | ML-Specific Math | Loss functions, regularization, bias-variance tradeoff |
| 10 | Computational Math | Vectorized implementations, numerical stability, simulation |
| 11 | Advanced Topics | GMMs (EM), MCMC sampling, PAC generalization bounds |
| 12 | Projects | Everything — all 4 algorithms built and verified from scratch |