Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch: regression, Bayes, PCA

Loss Functions
A loss function $\mathcal{L}(y, \hat{y})$ quantifies the discrepancy between true label $y$ and prediction $\hat{y}$. The choice of loss function encodes your assumptions about the noise structure and what kinds of errors matter most.
Regression Loss Functions
- MSE (L2 loss): $\mathcal{L} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ — penalizes large errors heavily; optimal for Gaussian noise; sensitive to outliers
- MAE (L1 loss): $\mathcal{L} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ — robust to outliers; non-differentiable at 0; optimal for Laplace noise
- Huber loss: quadratic for $|e| \leq \delta$, linear for $|e| > \delta$ — combines MSE's smoothness with MAE's outlier robustness
- Log-cosh: $\log(\cosh(y - \hat{y}))$ — smooth approximation of Huber; twice differentiable everywhere
Classification Loss Functions
- Binary cross-entropy: $\mathcal{L} = -y\log\hat{p} - (1-y)\log(1-\hat{p})$ — the negative log-likelihood of a Bernoulli output; for a hard label $y$ it equals the KL divergence $D_{\mathrm{KL}}(y \,\|\, \hat{p})$
- Categorical cross-entropy: $\mathcal{L} = -\sum_k y_k \log \hat{p}_k$ — for multi-class with one-hot $y$; reduces to $-\log \hat{p}_{y_{\text{true}}}$
- Hinge loss (SVM): $\mathcal{L} = \max(0, 1 - y \cdot \hat{y})$ with labels $y \in \{-1, +1\}$ and raw score $\hat{y}$ — creates a maximum-margin classifier; not differentiable at $y\hat{y} = 1$
- Focal loss: $(1-\hat{p})^\gamma \cdot \text{CE}$ — down-weights easy examples; designed for class imbalance in object detection
```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    residual = np.abs(y_true - y_pred)
    return np.mean(np.where(residual <= delta,
                            0.5 * residual**2,
                            delta * residual - 0.5 * delta**2))

def binary_cross_entropy(y_true, y_pred_prob):
    eps = 1e-12  # clip to avoid log(0)
    y_pred_prob = np.clip(y_pred_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred_prob)
                    + (1 - y_true) * np.log(1 - y_pred_prob))

# Example: compare regression losses with an outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 4.5])

print("Regression losses (with outlier at index 4):")
print(f"  MSE:   {mse_loss(y_true, y_pred):.4f} ← dominated by outlier")
print(f"  MAE:   {mae_loss(y_true, y_pred):.4f} ← robust to outlier")
print(f"  Huber: {huber_loss(y_true, y_pred):.4f} ← balanced")

# Classification cross-entropy
y_cls = np.array([1, 0, 1, 1, 0])
y_proba = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
y_proba2 = np.array([0.6, 0.4, 0.6, 0.6, 0.4])  # less confident

print("\nBinary cross-entropy:")
print(f"  Confident correct predictions: {binary_cross_entropy(y_cls, y_proba):.4f}")
print(f"  Less confident predictions:    {binary_cross_entropy(y_cls, y_proba2):.4f}")
```
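The hinge and focal losses from the list above can be sketched the same way. This is an illustrative sketch: `hinge_loss` and `focal_loss` are helper names chosen here, and the focal loss is shown in its binary form with the standard focusing parameter $\gamma$.

```python
import numpy as np

def hinge_loss(y_true_pm, scores):
    """Hinge loss; labels must be in {-1, +1}, scores are raw margins."""
    return np.mean(np.maximum(0, 1 - y_true_pm * scores))

def focal_loss(y_true, y_pred_prob, gamma=2.0):
    """Binary focal loss: down-weights well-classified examples."""
    eps = 1e-12
    p = np.clip(y_pred_prob, eps, 1 - eps)
    # p_t is the probability the model assigns to the true class
    p_t = np.where(y_true == 1, p, 1 - p)
    return np.mean(-(1 - p_t)**gamma * np.log(p_t))

y_pm = np.array([1, -1, 1, 1, -1])              # {-1, +1} labels for hinge
scores = np.array([2.0, -1.5, 0.3, 0.9, -0.2])  # raw classifier scores
print(f"Hinge: {hinge_loss(y_pm, scores):.4f}")

y_cls = np.array([1, 0, 1, 1, 0])
y_proba = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
print(f"Focal (gamma=2): {focal_loss(y_cls, y_proba):.4f}")
print(f"Focal (gamma=0) reduces to plain BCE: {focal_loss(y_cls, y_proba, gamma=0):.4f}")
```

Note how setting $\gamma = 0$ recovers ordinary cross-entropy, while larger $\gamma$ shrinks the contribution of already-confident examples.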
Regularization
Regularization adds a penalty term to the loss to discourage complex models and reduce overfitting:
- L2 regularization (Ridge): $R(\theta) = \|\theta\|^2_2 = \sum_j \theta_j^2$ — penalizes large weights; encourages small, diffuse weights; differentiable everywhere; equivalent to Gaussian prior
- L1 regularization (Lasso): $R(\theta) = \|\theta\|_1 = \sum_j |\theta_j|$ — induces exact sparsity; many weights become exactly 0; equivalent to Laplace prior; non-differentiable at 0 (use subgradients)
- Elastic net: $R(\theta) = \alpha \|\theta\|_1 + (1-\alpha)\|\theta\|^2_2$ — combines sparsity of L1 and numerical stability of L2; useful when features are correlated
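The sparsity contrast can be checked numerically. The sketch below is illustrative, with arbitrary synthetic data and penalty strength `lam`: it fits ridge via its closed form and the lasso via a simple ISTA loop (gradient step on the squared error followed by soft-thresholding, which is the proximal operator of the L1 penalty).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))
# Only the first 3 features actually matter
true_theta = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_theta + rng.normal(0, 0.5, n)

lam = 10.0

# Ridge (L2): closed form (X'X + lam I)^{-1} X'y — shrinks, never zeroes
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Lasso (L1): ISTA — gradient step on 0.5||X@theta - y||^2, then soft-threshold
def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

theta_lasso = np.zeros(p)
step = 1.0 / np.linalg.norm(X, 2)**2  # 1 / Lipschitz constant of the gradient
for _ in range(2000):
    grad = X.T @ (X @ theta_lasso - y)
    theta_lasso = soft_threshold(theta_lasso - step * grad, step * lam)

print("Ridge weights:", np.round(theta_ridge, 3))
print("Lasso weights:", np.round(theta_lasso, 3))
print("Exact zeros — ridge:", np.sum(theta_ridge == 0),
      "| lasso:", np.sum(np.abs(theta_lasso) < 1e-8))
```

Under this setup the lasso drives several of the seven irrelevant coefficients to exactly zero, while ridge leaves all ten small but nonzero.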
Bias-Variance Tradeoff
For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected MSE at a new point $x$ decomposes as:

$\mathbb{E}\left[(y - \hat{f}(x))^2\right] = \left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2 + \mathrm{Var}(\hat{f}(x)) + \sigma^2$
- Bias: error from wrong assumptions in the model (underfitting) — $\text{Bias} = \mathbb{E}[\hat{f}(x)] - f(x)$
- Variance: error from sensitivity to fluctuations in training data (overfitting) — $\text{Var}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)^2] - (\mathbb{E}[\hat{f}(x)])^2$
- Noise: irreducible error from inherent randomness in $y$
Increasing model complexity (more parameters) decreases bias but increases variance. Regularization reduces variance at the cost of slightly higher bias. The sweet spot is the model complexity that minimizes total generalization error.
```python
import numpy as np

# Demonstrate the bias-variance tradeoff via polynomial regression
np.random.seed(42)

def true_function(x):
    return np.sin(2 * np.pi * x)

def train_and_predict(degree, X_train, y_train, X_test):
    """Fit polynomial of given degree and predict on X_test."""
    X_train_poly = np.vander(X_train, degree + 1, increasing=True)
    X_test_poly = np.vander(X_test, degree + 1, increasing=True)
    # Least squares solution
    coeffs, _, _, _ = np.linalg.lstsq(X_train_poly, y_train, rcond=None)
    return X_test_poly @ coeffs

n_datasets = 50
n_train = 10
X_test = np.linspace(0, 1, 100)
y_test_true = true_function(X_test)
degrees = [1, 3, 9]
results = {}

for d in degrees:
    preds = []
    for _ in range(n_datasets):
        X_train = np.random.uniform(0, 1, n_train)
        y_train = true_function(X_train) + np.random.normal(0, 0.2, n_train)
        preds.append(train_and_predict(d, X_train, y_train, X_test))
    preds = np.array(preds)        # (50, 100)
    avg_pred = preds.mean(axis=0)  # (100,)
    bias_sq = ((avg_pred - y_test_true)**2).mean()
    variance = preds.var(axis=0).mean()
    results[d] = {'bias_sq': bias_sq, 'variance': variance, 'total': bias_sq + variance}

print("Bias-Variance Tradeoff (over 50 datasets, noise σ=0.2):")
print(f"{'Degree':>8} {'Bias²':>10} {'Variance':>10} {'Total':>10}")
for d in degrees:
    r = results[d]
    print(f"{d:>8} {r['bias_sq']:>10.4f} {r['variance']:>10.4f} {r['total']:>10.4f}")

print("\nDegree 1: high bias (underfits), low variance")
print("Degree 3: balanced (good fit)")
print("Degree 9: low bias, very high variance (overfits)")
```
MLE & MAP
Maximum Likelihood Estimation (MLE) chooses parameters to maximize the probability of the observed data:

$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$
MSE loss is MLE under Gaussian noise. Cross-entropy loss is MLE under Bernoulli/categorical distributions.
Maximum A Posteriori (MAP) estimation adds a prior $p(\theta)$:

$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \sum_{i=1}^n \log p(x_i \mid \theta) + \log p(\theta) \right]$
- Gaussian prior $p(\theta) \propto e^{-\lambda\|\theta\|^2}$ → MAP adds $-\lambda\|\theta\|^2$ to log-likelihood → equivalent to L2 regularization
- Laplace prior $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ → MAP adds $-\lambda\|\theta\|_1$ → equivalent to L1 regularization
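The MLE-vs-MAP distinction can be made concrete for estimating a Gaussian mean with known noise std. This is a small sketch with arbitrary choices of `sigma`, `tau`, and sample size; the grid search simply locates the maximizers so they can be compared with the closed forms.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                      # known noise std
tau = 1.0                        # prior on the mean: mu ~ N(0, tau^2)
x = rng.normal(3.0, sigma, 20)   # data drawn from N(3, sigma^2)
n, xbar = len(x), x.mean()

# Grid search over candidate means
mus = np.linspace(-1, 6, 70001)
log_lik = -np.sum((x[:, None] - mus)**2, axis=0) / (2 * sigma**2)  # up to constants
log_prior = -mus**2 / (2 * tau**2)

mu_mle = mus[np.argmax(log_lik)]              # maximizes likelihood alone
mu_map = mus[np.argmax(log_lik + log_prior)]  # likelihood + Gaussian prior

# Closed forms: MLE is the sample mean; MAP shrinks it toward the prior mean 0
mu_map_closed = (n / sigma**2) / (n / sigma**2 + 1 / tau**2) * xbar

print(f"MLE (closed form / grid): {xbar:.4f} / {mu_mle:.4f}")
print(f"MAP (closed form / grid): {mu_map_closed:.4f} / {mu_map:.4f}")
print("The Gaussian prior pulls the MAP estimate toward 0 — exactly an L2 penalty on mu.")
```

The log-prior term $-\mu^2/(2\tau^2)$ is literally an L2 penalty on $\mu$, which is the one-parameter version of the ridge correspondence above.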
Softmax & Its Gradient
Softmax converts a vector of real-valued scores $\mathbf{z} \in \mathbb{R}^K$ into a probability distribution:

$\hat{p}_k = \text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$
The Jacobian of softmax is:

$\frac{\partial \hat{p}_k}{\partial z_j} = \hat{p}_k(\delta_{kj} - \hat{p}_j)$

where $\delta_{kj}$ is the Kronecker delta.
When combined with categorical cross-entropy loss, the gradient simplifies beautifully: $\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$ (prediction minus one-hot label). This is why softmax + cross-entropy is the universal choice for multi-class classification.
Activation Functions
Nonlinear activation functions enable neural networks to represent non-linear mappings. The gradient of each activation is essential for backpropagation:
```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x)**2

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # np.asarray so the comparison also works on scalar inputs
    return (np.asarray(x) > 0).astype(float)

def gelu(x):
    """Gaussian Error Linear Unit (tanh approximation) — used in BERT, GPT."""
    return x * 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    """Swish = x * sigmoid(x), used in EfficientNet."""
    return x * sigmoid(x)

x = np.array([-3, -1, -0.5, 0, 0.5, 1, 2, 3], dtype=float)
print("Activation function values:")
print(f"{'x':>8} {'sigmoid':>10} {'tanh':>10} {'ReLU':>8} {'GELU':>10} {'Swish':>10}")
for xi, sg, th, rl, gl, sw in zip(x, sigmoid(x), np.tanh(x), relu(x), gelu(x), swish(x)):
    print(f"{xi:>8.1f} {sg:>10.4f} {th:>10.4f} {rl:>8.4f} {gl:>10.4f} {sw:>10.4f}")

print("\nGradients at key points:")
print(f"sigmoid'(0) = {sigmoid_grad(0):.4f} (max gradient at 0)")
print(f"sigmoid'(5) = {sigmoid_grad(5):.6f} (near-zero — vanishing gradient risk)")
print(f"tanh'(0) = {tanh_grad(0):.4f} (max gradient at 0, stronger than sigmoid)")
print(f"ReLU'(-1) = {relu_grad(-1):.4f} (dead neuron — zero gradient for x<0)")
print(f"ReLU'(1) = {relu_grad(1):.4f} (constant gradient for x>0)")
```
Batch Normalization
Batch normalization normalizes layer inputs to have zero mean and unit variance per mini-batch, then applies a learnable scale $\gamma$ and shift $\beta$:

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$
Where $\mu_B = \frac{1}{m}\sum_i x_i$ and $\sigma^2_B = \frac{1}{m}\sum_i (x_i - \mu_B)^2$ are batch statistics. During inference, running mean/variance (exponential moving averages from training) replace batch statistics.
Benefits: reduces internal covariate shift, smooths the loss landscape (enabling larger learning rates), provides a mild regularization effect, and reduces sensitivity to initialization.
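The forward pass is only a few lines. A minimal training-mode sketch (no running averages for inference, and the batch shape is an arbitrary choice):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm in training mode: x has shape (batch, features)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta             # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))      # batch of 32, 4 features, shifted and scaled
gamma, beta = np.ones(4), np.zeros(4)       # identity scale/shift for illustration

out = batchnorm_forward(x, gamma, beta)
print("input  mean/std per feature:", np.round(x.mean(0), 2), np.round(x.std(0), 2))
print("output mean/std per feature:", np.round(out.mean(0), 2), np.round(out.std(0), 2))
```

At inference time, as noted above, the batch statistics `mu` and `var` would be replaced by running averages accumulated during training.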
Practice Exercises
Derive MLE for a Gaussian Distribution
Given i.i.d. samples $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$, derive the MLE estimators $\hat{\mu}$ and $\hat{\sigma}^2$.
Show Answer
Log-likelihood: $\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$
$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i(x_i - \mu) = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_i x_i$ (sample mean)
$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i(x_i-\mu)^2 = 0 \Rightarrow \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i(x_i-\hat{\mu})^2$
Note: the MLE variance divides by $n$, not $n-1$, making it a biased estimator (Bessel's correction from Part 5 uses $n-1$ to correct this bias).
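The bias of the divide-by-$n$ estimator is easy to see by simulation. A small sketch (sample size, noise level, and trial count are arbitrary; small $n$ makes the bias visible):

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma_true, n = 0.0, 2.0, 10
n_trials = 200_000

samples = rng.normal(mu_true, sigma_true, size=(n_trials, n))
xbar = samples.mean(axis=1, keepdims=True)

var_mle = ((samples - xbar)**2).mean(axis=1)                 # divide by n
var_unbiased = ((samples - xbar)**2).sum(axis=1) / (n - 1)   # Bessel's correction

print(f"True variance:                 {sigma_true**2:.3f}")
print(f"E[MLE variance] (÷n):          {var_mle.mean():.3f}"
      f"  ≈ (n-1)/n · σ² = {(n - 1) / n * sigma_true**2:.3f}")
print(f"E[unbiased variance] (÷(n-1)): {var_unbiased.mean():.3f}")
```

The average of the MLE estimate lands near $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$, exactly the bias Bessel's correction removes.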
L1 vs L2 Regularization: Why L1 is Sparse
For a 2-parameter model minimizing MSE + L2 (iso-contours: circles) vs MSE + L1 (iso-contours: diamonds), explain geometrically why the L1 optimal solution tends to have one parameter exactly equal to zero.
Show Answer
The constrained optimization perspective: minimize MSE subject to $\|\theta\|_1 \leq t$ (L1) or $\|\theta\|_2^2 \leq t$ (L2).
The feasible region for L2 is a smooth ball; the feasible region for L1 is a diamond with corners on the axes. The MSE loss has elliptical iso-contours. The optimal constrained solution occurs where the MSE contour first touches the feasible region boundary.
For L2 (ball), this contact point is typically on a smooth portion of the boundary, giving a non-sparse solution. For L1 (diamond), the corners protrude far toward the axes, making it likely that the MSE contour first touches a corner — where one parameter $\theta_j = 0$. The sharper the corners (higher dimension), the more likely sparse solutions arise.
Conclusion & Next Steps
This part synthesized the math of previous parts into the specific constructs driving model training:
- Loss functions encode assumptions about noise and error importance
- L1/L2 regularization are MAP with Laplace/Gaussian priors — not arbitrary penalties
- Bias-variance decomposition explains the underfitting vs overfitting tension
- MLE = cross-entropy loss, MLE + Gaussian prior = L2 regularization
- Softmax + cross-entropy has the clean gradient $\hat{p} - y$
- Activation gradients determine vanishing/exploding gradient risk
- Batch norm stabilizes training by normalizing pre-activations
Next in the Series
In Part 10: Computational Math with Python, we shift from theory to implementation: NumPy, SciPy, SymPy, Monte Carlo simulation, numerical integration, and performance-focused vectorization.