Part 9: ML-Specific Math

April 26, 2026 · Wasil Zafar · 20 min read

This part is where all the math we've built — calculus, probability, linear algebra, information theory — converges on the specific mathematical constructs you encounter every time you train a model: loss functions, regularization penalties, the bias-variance tradeoff, and the statistical foundations of learning itself.

Table of Contents

  1. Loss Functions
  2. Regularization
  3. Bias-Variance Tradeoff
  4. MLE & MAP
  5. Softmax & Its Gradient
  6. Activation Functions
  7. Batch Normalization
  8. Practice Exercises
  9. Conclusion & Next Steps

Loss Functions

A loss function $\mathcal{L}(y, \hat{y})$ quantifies the discrepancy between true label $y$ and prediction $\hat{y}$. The choice of loss function encodes your assumptions about the noise structure and what kinds of errors matter most.

Regression Losses
  • MSE (L2 loss): $\mathcal{L} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ — penalizes large errors heavily; optimal for Gaussian noise; sensitive to outliers
  • MAE (L1 loss): $\mathcal{L} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ — robust to outliers; non-differentiable at 0; optimal for Laplace noise
  • Huber loss: quadratic for $|e| \leq \delta$, linear for $|e| \gt \delta$ — combines MSE's smoothness with MAE's outlier robustness
  • Log-cosh: $\log(\cosh(y - \hat{y}))$ — smooth approximation of Huber; twice differentiable everywhere
Classification Losses
  • Binary cross-entropy: $\mathcal{L} = -y\log\hat{p} - (1-y)\log(1-\hat{p})$ — the negative log-likelihood of a Bernoulli output; for hard labels it equals the KL divergence $D_{\text{KL}}(y \,\|\, \hat{p})$, since a 0/1 label has zero entropy
  • Categorical cross-entropy: $\mathcal{L} = -\sum_k y_k \log \hat{p}_k$ — for multi-class with one-hot $y$; reduces to $-\log \hat{p}_{y_{\text{true}}}$
  • Hinge loss (SVM): $\mathcal{L} = \max(0, 1 - y \cdot \hat{y})$ — creates maximum-margin classifier; not differentiable at $y\hat{y}=1$
  • Focal loss: $(1-\hat{p}_t)^\gamma \cdot \text{CE}$, where $\hat{p}_t$ is the predicted probability of the true class — down-weights easy examples; designed for class imbalance in object detection
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    residual = np.abs(y_true - y_pred)
    return np.mean(np.where(residual <= delta,
                            0.5 * residual**2,
                            delta * residual - 0.5 * delta**2))

def binary_cross_entropy(y_true, y_pred_prob):
    eps = 1e-12
    y_pred_prob = np.clip(y_pred_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred_prob) + (1 - y_true) * np.log(1 - y_pred_prob))

# Example: compare regression losses with an outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last point is outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 4.5])

print("Regression losses (with outlier at index 4):")
print(f"  MSE:   {mse_loss(y_true, y_pred):.4f}   ← dominated by outlier")
print(f"  MAE:   {mae_loss(y_true, y_pred):.4f}   ← robust to outlier")
print(f"  Huber: {huber_loss(y_true, y_pred):.4f}   ← balanced")

# Classification cross-entropy
y_cls    = np.array([1, 0, 1, 1, 0])
y_proba  = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
y_proba2 = np.array([0.6, 0.4, 0.6, 0.6, 0.4])  # less confident

print("\nBinary cross-entropy:")
print(f"  Confident correct predictions: {binary_cross_entropy(y_cls, y_proba):.4f}")
print(f"  Less confident predictions:    {binary_cross_entropy(y_cls, y_proba2):.4f}")

Regularization

Regularization adds a penalty term to the loss to discourage complex models and reduce overfitting:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \cdot R(\theta)$$
  • L2 regularization (Ridge): $R(\theta) = \|\theta\|^2_2 = \sum_j \theta_j^2$ — penalizes large weights; encourages small, diffuse weights; differentiable everywhere; equivalent to Gaussian prior
  • L1 regularization (Lasso): $R(\theta) = \|\theta\|_1 = \sum_j |\theta_j|$ — induces exact sparsity; many weights become exactly 0; equivalent to Laplace prior; non-differentiable at 0 (use subgradients)
  • Elastic net: $R(\theta) = \alpha \|\theta\|_1 + (1-\alpha)\|\theta\|^2_2$ — combines sparsity of L1 and numerical stability of L2; useful when features are correlated
Geometric intuition: In 2D parameter space, L2 constrains parameters to a ball (smooth boundary) while L1 constrains to a diamond (corners on axes). The optimal solution under L1 tends to land at a corner of the diamond where one coordinate is exactly zero — this is why L1 produces sparse solutions.
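
To see this numerically, here is a minimal sketch; the synthetic data, regularization strength, and step size are arbitrary choices for illustration. Ridge has a closed-form solution, while the lasso solution is computed with proximal gradient descent (ISTA), whose soft-thresholding step is exactly what zeroes out coordinates.

import numpy as np

np.random.seed(0)

# Synthetic regression where only 3 of 10 features matter
n, p = 100, 10
X = np.random.randn(n, p)
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * np.random.randn(n)

lam = 5.0  # regularization strength (arbitrary)

# Ridge (L2): closed form (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Lasso (L1): proximal gradient descent (ISTA) with soft-thresholding
w_lasso = np.zeros(p)
step = 1.0 / np.linalg.norm(X.T @ X, 2)   # 1 / Lipschitz constant of the gradient
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    w = w_lasso - step * grad
    w_lasso = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

print("Ridge weights:", np.round(w_ridge, 3))
print("Lasso weights:", np.round(w_lasso, 3))
print(f"Exact zeros: ridge {np.sum(w_ridge == 0)}, lasso {np.sum(w_lasso == 0)}")

With a penalty this strong, the lasso weights on the irrelevant features come out exactly zero, while the ridge weights are merely shrunk toward zero.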

Bias-Variance Tradeoff

For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected MSE at a new point $x$ decomposes as:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{[\text{Bias}(\hat{f}(x))]^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{estimation variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
  • Bias: error from wrong assumptions in the model (underfitting) — $\text{Bias} = \mathbb{E}[\hat{f}(x)] - f(x)$
  • Variance: error from sensitivity to fluctuations in training data (overfitting) — $\text{Var}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)^2] - (\mathbb{E}[\hat{f}(x)])^2$
  • Noise: irreducible error from inherent randomness in $y$

Increasing model complexity (more parameters) decreases bias but increases variance. Regularization reduces variance at the cost of slightly higher bias. The sweet spot is the model complexity that minimizes total generalization error.

import numpy as np

# Demonstrate bias-variance tradeoff via polynomial regression
np.random.seed(42)

def true_function(x):
    return np.sin(2 * np.pi * x)

def train_and_predict(degree, X_train, y_train, X_test):
    """Fit polynomial of given degree and predict on X_test."""
    X_train_poly = np.vander(X_train, degree + 1, increasing=True)
    X_test_poly  = np.vander(X_test, degree + 1, increasing=True)
    # Least squares solution
    coeffs, _, _, _ = np.linalg.lstsq(X_train_poly, y_train, rcond=None)
    return X_test_poly @ coeffs

n_datasets  = 50
n_train     = 10
X_test      = np.linspace(0, 1, 100)
y_test_true = true_function(X_test)

degrees = [1, 3, 9]
results = {}

for d in degrees:
    preds = []
    for _ in range(n_datasets):
        X_train = np.random.uniform(0, 1, n_train)
        y_train = true_function(X_train) + np.random.normal(0, 0.2, n_train)
        preds.append(train_and_predict(d, X_train, y_train, X_test))
    preds = np.array(preds)          # (50, 100)
    avg_pred = preds.mean(axis=0)    # (100,)
    bias_sq  = ((avg_pred - y_test_true)**2).mean()
    variance = preds.var(axis=0).mean()
    results[d] = {'bias_sq': bias_sq, 'variance': variance, 'total': bias_sq + variance}

print("Bias-Variance Tradeoff (over 50 datasets, noise σ=0.2):")
print(f"{'Degree':>8}  {'Bias²':>10}  {'Variance':>10}  {'Total':>10}")
for d in degrees:
    r = results[d]
    print(f"{d:>8}  {r['bias_sq']:>10.4f}  {r['variance']:>10.4f}  {r['total']:>10.4f}")
print("\nDegree 1: high bias (underfits), low variance")
print("Degree 3: balanced (good fit)  ")
print("Degree 9: low bias, very high variance (overfits)")

MLE & MAP

Maximum Likelihood Estimation (MLE) chooses parameters to maximize the probability of observed data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

Minimizing MSE is MLE under Gaussian noise; minimizing cross-entropy is MLE under a Bernoulli/categorical output distribution.

Maximum A Posteriori (MAP) adds a prior $p(\theta)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \log p(\mathcal{D}\mid\theta) + \log p(\theta)$$
MAP = Regularized MLE:
  • Gaussian prior $p(\theta) \propto e^{-\lambda\|\theta\|^2}$ → MAP adds $-\lambda\|\theta\|^2$ to log-likelihood → equivalent to L2 regularization
  • Laplace prior $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ → MAP adds $-\lambda\|\theta\|_1$ → equivalent to L1 regularization
Regularization is not just an engineering trick — it has a principled Bayesian interpretation as encoding prior beliefs about parameter magnitudes.
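
As a sanity check of the Gaussian-prior case, the following sketch (an assumed linear-Gaussian model with arbitrary noise level and prior precision) numerically maximizes the log-posterior with scipy.optimize.minimize and compares the result to the ridge closed form; the two should agree to numerical precision.

import numpy as np
from scipy.optimize import minimize

np.random.seed(1)
n, p = 80, 5
X = np.random.randn(n, p)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
sigma = 0.5                       # known noise standard deviation
y = X @ w_true + sigma * np.random.randn(n)
lam = 2.0                         # prior precision (arbitrary)

def neg_log_posterior(w):
    # Gaussian likelihood -> squared error / (2*sigma^2); Gaussian prior -> lam/2 * ||w||^2
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)
    neg_log_prior = 0.5 * lam * np.sum(w ** 2)
    return nll + neg_log_prior

w_map = minimize(neg_log_posterior, np.zeros(p)).x

# Ridge closed form implied by the same objective: (X^T X + lam*sigma^2*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * sigma ** 2 * np.eye(p), X.T @ y)

print("MAP estimate:  ", np.round(w_map, 4))
print("Ridge solution:", np.round(w_ridge, 4))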

Softmax & Its Gradient

Softmax converts a vector of real-valued scores $\mathbf{z} \in \mathbb{R}^K$ into a probability distribution:

$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

The Jacobian of softmax is:

$$\frac{\partial \text{softmax}(\mathbf{z})_i}{\partial z_j} = \text{softmax}(\mathbf{z})_i \cdot (\mathbf{1}[i=j] - \text{softmax}(\mathbf{z})_j)$$

When combined with categorical cross-entropy loss, the gradient simplifies beautifully: $\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$ (prediction minus one-hot label). This is why softmax + cross-entropy is the universal choice for multi-class classification.
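
This simplification is easy to verify with a finite-difference check. The sketch below uses arbitrary logits and a one-hot label; the analytic gradient $\hat{p} - y$ should match the numerical gradient of the softmax + cross-entropy composition.

import numpy as np

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5, 1.5])     # arbitrary logits
y = np.array([0.0, 0.0, 1.0, 0.0])      # true class: index 2

analytic = softmax(z) - y               # the simplified gradient p_hat - y

# Central-difference numerical gradient
eps = 1e-6
numerical = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numerical[k] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print("Analytic  dL/dz:", np.round(analytic, 6))
print("Numerical dL/dz:", np.round(numerical, 6))
print("Max abs difference:", np.max(np.abs(analytic - numerical)))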

Activation Functions

Nonlinear activation functions enable neural networks to represent non-linear mappings. The gradient of each activation is essential for backpropagation:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x)**2

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def gelu(x):
    """Gaussian Error Linear Unit — used in BERT, GPT."""
    return x * 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    """Swish = x * sigmoid(x), used in EfficientNet."""
    return x * sigmoid(x)

x = np.array([-3, -1, -0.5, 0, 0.5, 1, 2, 3], dtype=float)

print("Activation function values:")
print(f"{'x':>8}  {'sigmoid':>10}  {'tanh':>10}  {'ReLU':>8}  {'GELU':>10}  {'Swish':>10}")
for xi, sg, th, rl, gl, sw in zip(x, sigmoid(x), np.tanh(x), relu(x), gelu(x), swish(x)):
    print(f"{xi:>8.1f}  {sg:>10.4f}  {th:>10.4f}  {rl:>8.4f}  {gl:>10.4f}  {sw:>10.4f}")

print("\nGradients at key points:")
print(f"sigmoid'(0)  = {sigmoid_grad(0):.4f}  (max gradient at 0)")
print(f"sigmoid'(5)  = {sigmoid_grad(5):.6f} (near-zero — vanishing gradient risk)")
print(f"tanh'(0)     = {tanh_grad(0):.4f}  (max gradient at 0, stronger than sigmoid)")
print(f"ReLU'(-1)    = {relu_grad(-1):.4f}  (dead neuron — zero gradient for x<0)")
print(f"ReLU'(1)     = {relu_grad(1):.4f}  (constant gradient for x>0)")
Vanishing & Exploding Gradients: Sigmoid and tanh saturate at extremes — their gradients approach 0 for large $|x|$. In deep networks, multiplying many near-zero gradients through the chain rule causes gradients to vanish, making early layers learn very slowly. Solutions:
  • ReLU: constant gradient for $x \gt 0$
  • Batch normalization: normalizes pre-activations to prevent saturation
  • Residual connections: skip connections provide a gradient highway
  • Gradient clipping: caps the gradient norm for exploding gradients
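
A toy chain-rule calculation makes the saturation problem concrete. The sketch below assumes, purely for illustration, that every layer sees the same pre-activation $x = 2$ and ignores the weight matrices; it just multiplies per-layer activation derivatives across depth.

import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# Multiply per-layer activation derivatives (ignores weights, purely illustrative)
x = 2.0                                      # assumed pre-activation at every layer
for depth in [5, 10, 20, 50]:
    sig_factor  = sigmoid_grad(x) ** depth   # sigmoid'(2) is about 0.105
    relu_factor = 1.0 ** depth               # ReLU'(2) = 1
    print(f"depth {depth:>2}:  sigmoid factor ≈ {sig_factor:.2e},  ReLU factor = {relu_factor:.0f}")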

Batch Normalization

Batch normalization normalizes layer inputs to have zero mean and unit variance per mini-batch, then applies learnable scale $\gamma$ and shift $\beta$:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$

Where $\mu_B = \frac{1}{m}\sum_i x_i$ and $\sigma^2_B = \frac{1}{m}\sum_i (x_i - \mu_B)^2$ are batch statistics. During inference, running mean/variance (exponential moving averages from training) replace batch statistics.

Benefits: reduces internal covariate shift, smooths the loss landscape (enabling larger learning rates), provides a mild regularization effect, and reduces sensitivity to initialization.
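
A minimal forward-pass sketch of these equations, assuming a (batch, features) input and an illustrative momentum of 0.9 for the running statistics:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Batch norm over a (batch, features) array, per the equations above."""
    if training:
        mu  = x.mean(axis=0)
        var = x.var(axis=0)
        # Exponential moving averages replace batch statistics at inference
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var  = momentum * running_var  + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

np.random.seed(3)
x = np.random.randn(32, 4) * 5 + 10        # batch of 32, 4 features, shifted and scaled
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)

out, rm, rv = batchnorm_forward(x, gamma, beta, rm, rv, training=True)
print("Output mean per feature:", np.round(out.mean(axis=0), 4))   # ~ 0
print("Output std  per feature:", np.round(out.std(axis=0), 4))    # ~ 1
print("Running mean after one batch:", np.round(rm, 3))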

Practice Exercises

Exercise 1: MLE Derivation
Derive MLE for a Gaussian Distribution

Given i.i.d. samples $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$, derive the MLE estimators $\hat{\mu}$ and $\hat{\sigma}^2$.

Answer

Log-likelihood: $\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$

$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i(x_i - \mu) = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_i x_i$ (sample mean)

$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i(x_i-\mu)^2 = 0 \Rightarrow \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i(x_i-\hat{\mu})^2$

Note: the MLE variance divides by $n$, not $n-1$, making it a biased estimator (Bessel's correction from Part 5 uses $n-1$ to correct this bias).
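
A quick Monte Carlo check of that bias (sample size and true variance chosen arbitrarily): the average MLE estimate comes out close to $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.

import numpy as np

np.random.seed(7)
n, true_var = 5, 4.0
trials = 200_000

samples = np.random.normal(0, np.sqrt(true_var), size=(trials, n))
var_mle      = samples.var(axis=1, ddof=0)   # divide by n   (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n-1 (Bessel-corrected)

print(f"True variance:          {true_var}")
print(f"Mean of MLE estimator:  {var_mle.mean():.4f}   (theory: (n-1)/n * sigma^2 = {true_var * (n - 1) / n:.1f})")
print(f"Mean with n-1 divisor:  {var_unbiased.mean():.4f}")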

Exercise 2: Regularization
L1 vs L2 Regularization: Why L1 is Sparse

For a 2-parameter model minimizing MSE + L2 (iso-contours: circles) vs MSE + L1 (iso-contours: diamonds), explain geometrically why the L1 optimal solution tends to have one parameter exactly equal to zero.

Answer

The constrained optimization perspective: minimize MSE subject to $\|\theta\|_1 \leq t$ (L1) or $\|\theta\|_2^2 \leq t$ (L2).

The feasible region for L2 is a smooth ball; the feasible region for L1 is a diamond with corners on the axes. The MSE loss has elliptical iso-contours. The optimal constrained solution occurs where the MSE contour first touches the feasible region boundary.

For L2 (ball), this contact point is typically on a smooth portion of the boundary, giving a non-sparse solution. For L1 (diamond), the corners protrude far toward the axes, making it likely that the MSE contour first touches a corner — where one parameter $\theta_j = 0$. The sharper the corners (higher dimension), the more likely sparse solutions arise.

Conclusion & Next Steps

This part synthesized the math of previous parts into the specific constructs driving model training:

  • Loss functions encode assumptions about noise and error importance
  • L1/L2 regularization are MAP with Laplace/Gaussian priors — not arbitrary penalties
  • Bias-variance decomposition explains the underfitting vs overfitting tension
  • Minimizing cross-entropy is MLE; MAP with a Gaussian prior is MLE plus L2 regularization
  • Softmax + cross-entropy has the clean gradient $\hat{p} - y$
  • Activation gradients determine vanishing/exploding gradient risk
  • Batch norm stabilizes training by normalizing pre-activations

Next in the Series

In Part 10: Computational Math with Python, we shift from theory to implementation: NumPy, SciPy, SymPy, Monte Carlo simulation, numerical integration, and performance-focused vectorization.