Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
The twelve parts of the series:

- Part 1: Mathematical Thinking (mindset, notation & functions)
- Part 2: Set Theory & Foundations (sets, operations & ML connections)
- Part 3: Combinatorics (counting, permutations & combinations)
- Part 4: Probability Fundamentals (rules, Bayes & distributions)
- Part 5: Statistics (descriptive to inferential)
- Part 6: Information Theory (entropy, cross-entropy & KL divergence)
- Part 7: Linear Algebra (vectors, matrices & transformations)
- Part 8: Calculus & Optimization (derivatives, gradients & descent)
- Part 9: ML-Specific Math (loss functions & regularization)
- Part 10: Computational Math (NumPy, SciPy & simulation)
- Part 11: Advanced Topics (multivariate stats & Bayesian inference)
- Part 12: Projects & Applications (build from scratch: regression, Bayes, PCA)

Math-to-Algorithm Map
Before we build, here is the explicit connection between the prior 11 parts and the algorithms we will implement:
| Algorithm | Core Math Parts | Key Concepts Used |
|---|---|---|
| Linear Regression (OLS + GD) | Parts 7, 8, 9 | Normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$, MSE loss gradient $\nabla_\mathbf{w}\mathcal{L}$, gradient descent update |
| Naive Bayes Classifier | Parts 2, 4, 6 | Bayes' theorem $P(C|\mathbf{x})$, conditional independence assumption, log-likelihood for numerical stability |
| PCA | Parts 5, 7, 10 | Covariance matrix, eigendecomposition, variance explained by eigenvectors, projection |
| 2-Layer Neural Network | Parts 7, 8, 9 | Matrix multiply forward pass, chain rule backprop, sigmoid activation, binary cross-entropy, gradient descent |
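For Project 1 specifically, the closed form in the table falls out of the MSE gradient: with loss $\mathcal{L}(\mathbf{w}) = \frac{1}{n}\lVert \mathbf{y} - X\mathbf{w}\rVert^2$, the gradient is

$$\nabla_\mathbf{w}\mathcal{L} = \frac{2}{n}\, X^\top (X\mathbf{w} - \mathbf{y}),$$

and setting it to zero gives the normal equations $X^\top X\mathbf{w} = X^\top\mathbf{y}$, while gradient descent instead iterates $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_\mathbf{w}\mathcal{L}$.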
Project 1: Linear Regression from Scratch
We implement both the closed-form normal equations (Part 7) and gradient descent (Part 8), and verify they agree.
```python
import numpy as np

# ---- Generate synthetic data ----
np.random.seed(42)
n, d = 100, 3
X_raw = np.random.randn(n, d)
true_w = np.array([2.0, -1.5, 0.8])
y = X_raw @ true_w + 0.5 * np.random.randn(n)

# Add bias column (intercept)
X = np.column_stack([np.ones(n), X_raw])  # (100, 4)

# ---- Method 1: Normal Equations (closed form) ----
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
y_pred_ols = X @ w_ols
mse_ols = np.mean((y - y_pred_ols)**2)
print("Normal Equations:")
print(f"  Coefficients: {np.round(w_ols, 4)}")
print(f"  MSE: {mse_ols:.6f}")

# ---- Method 2: Gradient Descent ----
def linear_regression_gd(X, y, lr=0.01, n_epochs=1000):
    n, p = X.shape
    w = np.zeros(p)
    losses = []
    for _ in range(n_epochs):
        y_hat = X @ w
        loss = np.mean((y - y_hat)**2)
        losses.append(loss)
        grad = (2 / n) * (X.T @ (y_hat - y))
        w -= lr * grad
    return w, losses

w_gd, losses = linear_regression_gd(X, y, lr=0.05, n_epochs=500)
y_pred_gd = X @ w_gd
mse_gd = np.mean((y - y_pred_gd)**2)
print("\nGradient Descent:")
print(f"  Coefficients: {np.round(w_gd, 4)}")
print(f"  MSE: {mse_gd:.6f}")

print(f"\nTrue coefficients (bias=0): [0, {true_w[0]}, {true_w[1]}, {true_w[2]}]")
print(f"OLS vs GD agreement: {np.allclose(w_ols, w_gd, atol=1e-3)}")
print(f"Loss after first epoch: {losses[0]:.4f}, after final epoch: {losses[-1]:.4f}")
```
Project 2: Naive Bayes Classifier from Scratch
Naive Bayes applies Bayes' theorem (Part 4) with the conditional independence assumption (Part 2): given the class, features are independent. We use log-probabilities (Part 6) for numerical stability.
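Concretely, with $d$ features the classifier picks the class with the largest log-posterior, where conditional independence turns the joint likelihood into a sum:

$$\hat{c} = \arg\max_{c}\left[\log P(C=c) + \sum_{j=1}^{d}\log P(x_j \mid C=c)\right].$$

The evidence term $P(\mathbf{x})$ is identical for every class, so it drops out of the argmax; this is exactly what `log_priors_` plus `_log_likelihood` compute in the code below.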
```python
import numpy as np

class GaussianNaiveBayes:
    """Gaussian Naive Bayes: P(x_j|C=c) ~ N(mu_cj, sigma_cj^2)"""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.log_priors_ = {}
        self.params_ = {}  # {class: (mu vector, sigma vector), one entry per feature}
        n = len(y)
        for c in self.classes_:
            X_c = X[y == c]
            self.log_priors_[c] = np.log(len(X_c) / n)
            # 1e-9 guards against zero std (a feature constant within a class)
            self.params_[c] = (X_c.mean(axis=0), X_c.std(axis=0) + 1e-9)

    def _log_likelihood(self, x, c):
        mu, sigma = self.params_[c]
        # Log of Gaussian PDF, dropping the constant -0.5*log(2π),
        # which is the same for every class and cancels in the argmax
        return np.sum(-np.log(sigma) - 0.5 * ((x - mu) / sigma)**2)

    def predict(self, X):
        preds = []
        for x in X:
            log_posteriors = {c: self.log_priors_[c] + self._log_likelihood(x, c)
                              for c in self.classes_}
            preds.append(max(log_posteriors, key=log_posteriors.get))
        return np.array(preds)

# --- Test on synthetic 2-class data ---
np.random.seed(0)
# Class 0: centered at (1,1); Class 1: centered at (4,4)
X0 = np.random.randn(50, 2) + np.array([1.0, 1.0])
X1 = np.random.randn(50, 2) + np.array([4.0, 4.0])
X_train = np.vstack([X0[:40], X1[:40]])
y_train = np.array([0]*40 + [1]*40)
X_test = np.vstack([X0[40:], X1[40:]])
y_test = np.array([0]*10 + [1]*10)

gnb = GaussianNaiveBayes()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
accuracy = (y_pred == y_test).mean()

print("Gaussian Naive Bayes — Results:")
print(f"  Classes: {gnb.classes_}")
print(f"  Log priors: { {c: round(lp, 3) for c, lp in gnb.log_priors_.items()} }")
for c in gnb.classes_:
    mu, sigma = gnb.params_[c]
    print(f"  Class {c}: mu={np.round(mu, 2)}, sigma={np.round(sigma, 2)}")
print(f"\n  Test accuracy: {accuracy:.2%}")
print(f"  Predictions: {y_pred}")
print(f"  True labels: {y_test}")
```
Project 3: PCA from Scratch
PCA (Part 7) finds the directions of maximum variance by computing the covariance matrix's eigendecomposition. Projecting onto the top $k$ eigenvectors gives the best rank-$k$ approximation in the least-squares sense.
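In symbols, with $X_c$ the mean-centered data matrix:

$$\Sigma = \frac{1}{n-1} X_c^\top X_c, \qquad \Sigma\,\mathbf{v}_i = \lambda_i \mathbf{v}_i, \qquad Z = X_c V_k,$$

where $V_k$ holds the eigenvectors of the $k$ largest eigenvalues as columns, and $\lambda_i / \sum_j \lambda_j$ is the fraction of variance explained by component $i$.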
```python
import numpy as np

class PCA:
    """PCA via covariance matrix eigendecomposition (Part 7)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        X_centered = X - self.mean_
        cov = (X_centered.T @ X_centered) / (len(X) - 1)
        # eigh: symmetric matrix — eigenvalues in ascending order
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Sort descending (largest variance first)
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        self.components_ = eigenvectors[:, :self.n_components].T  # (k, d)
        self.explained_variance_ = eigenvalues[:self.n_components]
        self.explained_variance_ratio_ = eigenvalues[:self.n_components] / eigenvalues.sum()
        return self

    def transform(self, X):
        return (X - self.mean_) @ self.components_.T

    def inverse_transform(self, Z):
        return Z @ self.components_ + self.mean_

# --- Test on synthetic 4D data with 2 dominant directions ---
np.random.seed(7)
n = 200
# Generate data along 2 dominant directions in 4D
true_components = np.array([[1, 1, 0, 0], [0, 0, 1, -1]], dtype=float)
true_components /= np.linalg.norm(true_components, axis=1, keepdims=True)
Z_true = np.random.randn(n, 2) * np.array([3.0, 1.5])       # 2 latent factors
X = Z_true @ true_components + 0.3 * np.random.randn(n, 4)  # 4D observations + noise

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)
X_reconstructed = pca.inverse_transform(Z)

print("PCA from scratch — Results:")
print(f"  Explained variance ratio, % (components 1, 2): {pca.explained_variance_ratio_ * 100}")
print(f"  Total variance explained by 2 PCs: {pca.explained_variance_ratio_.sum():.1%}")
print(f"\n  PC1 direction: {np.round(pca.components_[0], 3)}")
print(f"  PC2 direction: {np.round(pca.components_[1], 3)}")

reconstruction_error = np.mean((X - X_reconstructed)**2)
print(f"\n  Reconstruction MSE (2 PCs): {reconstruction_error:.4f}")

# Compare to NumPy's built-in SVD
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
print(f"\n  Verify against SVD: PC1 cosine similarity = {abs(pca.components_[0] @ Vt[0]):.4f}")
print("  (cosine ~ 1.0 means same direction)")
```
Project 4: Mini Neural Network from Scratch
We implement a 2-layer neural network's forward pass and backpropagation from scratch. Every operation traces back to Parts 7 (matrix multiplication), 8 (chain rule and gradient descent), and 9 (binary cross-entropy loss, sigmoid activation).
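In equations, the forward pass is

$$Z_1 = X W_1 + \mathbf{b}_1,\quad A_1 = \sigma(Z_1),\quad Z_2 = A_1 W_2 + \mathbf{b}_2,\quad \hat{\mathbf{y}} = \sigma(Z_2),$$

and because the sigmoid and BCE derivatives cancel, the backward pass reduces to

$$\delta_2 = \hat{\mathbf{y}} - \mathbf{y}, \qquad \delta_1 = (\delta_2 W_2^\top) \odot A_1 \odot (1 - A_1),$$

which is exactly what the code computes.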
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def binary_cross_entropy(y_true, y_pred, eps=1e-8):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def train_nn(X, y, hidden_size=4, lr=0.1, n_epochs=2000, seed=42):
    """Two-layer neural network with backpropagation."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialize weights (Xavier: divide by sqrt of fan_in)
    W1 = rng.standard_normal((d, hidden_size)) / np.sqrt(d)
    b1 = np.zeros(hidden_size)
    W2 = rng.standard_normal((hidden_size, 1)) / np.sqrt(hidden_size)
    b2 = np.zeros(1)
    losses = []
    for epoch in range(n_epochs):
        # ---- Forward pass ----
        z1 = X @ W1 + b1             # (n, hidden_size)
        a1 = sigmoid(z1)             # (n, hidden_size)
        z2 = a1 @ W2 + b2            # (n, 1)
        y_hat = sigmoid(z2).ravel()  # (n,)
        loss = binary_cross_entropy(y, y_hat)
        losses.append(loss)
        # ---- Backward pass (chain rule) ----
        # dL/dy_hat → dL/dz2 (sigmoid + BCE cancel nicely: δ2 = y_hat - y)
        delta2 = (y_hat - y).reshape(-1, 1)         # (n, 1)
        dW2 = (a1.T @ delta2) / n                   # (hidden, 1)
        db2 = delta2.mean(axis=0)
        # dL/da1 → dL/dz1 (sigmoid gradient: σ'(z) = σ(z)(1-σ(z)))
        delta1 = (delta2 @ W2.T) * (a1 * (1 - a1))  # (n, hidden)
        dW1 = (X.T @ delta1) / n                    # (d, hidden)
        db1 = delta1.mean(axis=0)
        # ---- Update ----
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2, losses

# --- XOR problem: non-linearly separable ---
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0], dtype=float)  # XOR
W1, b1, W2, b2, losses = train_nn(X_xor, y_xor, hidden_size=4, lr=0.5, n_epochs=5000)

# Final predictions
z1 = X_xor @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2; y_hat = sigmoid(z2).ravel()
preds = (y_hat > 0.5).astype(int)

print("XOR Neural Network from Scratch:")
print("  Architecture: 2 → 4 → 1 (sigmoid activations)")
print(f"  Initial loss: {losses[0]:.4f}")
print(f"  Final loss: {losses[-1]:.4f}")
print("\n  Input      True  Predicted  Probability")
for i in range(4):
    print(f"  {X_xor[i]}   {int(y_xor[i])}     {preds[i]}          {y_hat[i]:.3f}")
print(f"\n  Accuracy: {(preds == y_xor.astype(int)).mean():.0%}")
```
Roadmap for Further Study
Where to Go from Here
- Deep Learning Math: Convolutional operations as cross-correlations, attention as scaled dot-product $\text{softmax}(QK^\top/\sqrt{d_k})V$ (a minimal sketch follows this list), backpropagation through time (BPTT)
- Probabilistic ML: Gaussian processes (Part 11 Bayesian inference at function level), variational autoencoders (ELBO = reconstruction − KL divergence), normalizing flows
- Optimization Theory: Convergence rates, adaptive learning rates (Adam from Part 8), second-order methods (Newton, L-BFGS)
- Statistical Learning Theory: VC dimension, Rademacher complexity, uniform convergence (extending Part 11 PAC learning)
- Numerical Linear Algebra: LU, QR, Cholesky decompositions, iterative solvers for large systems (Conjugate Gradient)
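As a first taste of the deep-learning math above, here is a minimal NumPy sketch of scaled dot-product attention. The toy shapes and inputs are illustrative assumptions, not anything from the series:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n_q, d_v)

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 4)), rng.standard_normal((5, 4)), rng.standard_normal((5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```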
Recommended resources: Mathematics for Machine Learning (Deisenroth et al., free PDF); Bishop's Pattern Recognition & Machine Learning; 3Blue1Brown's Essence of Linear Algebra and Calculus series; Deep Learning textbook (Goodfellow et al., free online).
Series Conclusion
Here is a final map of how each part connects to real-world ML:
| Part | Topic | Powers in ML |
|---|---|---|
| 1 | Mathematical Thinking | Reading papers, writing proofs, thinking in abstractions |
| 2 | Set Theory | Feature engineering, Jaccard similarity, dataset operations |
| 3 | Combinatorics | Binomial distribution, hyperparameter search counting |
| 4 | Probability | Naive Bayes, Bayesian networks, probabilistic predictions |
| 5 | Statistics | A/B testing, confidence intervals, model evaluation |
| 6 | Information Theory | Cross-entropy loss, KL divergence in VAEs, decision trees |
| 7 | Linear Algebra | Neural network layers, PCA, SVD-based recommendations |
| 8 | Calculus | Gradient descent, backpropagation, Adam optimizer |
| 9 | ML-Specific Math | Loss functions, regularization, bias-variance tradeoff |
| 10 | Computational Math | Vectorized implementations, numerical stability, simulation |
| 11 | Advanced Topics | GMMs (EM), MCMC sampling, PAC generalization bounds |
| 12 | Projects | Everything — all 4 algorithms built and verified from scratch |