Part 9: ML-Specific Math

April 26, 2026 · Wasil Zafar · 20 min read

This part is where all the math we've built — calculus, probability, linear algebra, information theory — converges on the specific mathematical constructs you encounter every time you train a model: loss functions, regularization penalties, the bias-variance tradeoff, and the statistical foundations of learning itself.

Table of Contents

  1. Loss Functions
  2. Regularization
  3. Bias-Variance Tradeoff
  4. MLE & MAP
  5. Softmax & Its Gradient
  6. Activation Functions
  7. Batch Normalization
  8. Practice Exercises
  9. Conclusion & Next Steps

Loss Functions

A loss function $\mathcal{L}(y, \hat{y})$ quantifies the discrepancy between true label $y$ and prediction $\hat{y}$. The choice of loss function encodes your assumptions about the noise structure and what kinds of errors matter most.

Regression Losses
  • MSE (L2 loss): $\mathcal{L} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ — penalizes large errors heavily; optimal for Gaussian noise; sensitive to outliers
  • MAE (L1 loss): $\mathcal{L} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$ — robust to outliers; non-differentiable at 0; optimal for Laplace noise
  • Huber loss: quadratic for $|e| \leq \delta$, linear for $|e| \gt \delta$ — combines MSE's smoothness with MAE's outlier robustness
  • Log-cosh: $\log(\cosh(y - \hat{y}))$ — smooth approximation of Huber; twice differentiable everywhere
Classification Losses
  • Binary cross-entropy: $\mathcal{L} = -y\log\hat{p} - (1-y)\log(1-\hat{p})$ — the negative log-likelihood of a Bernoulli output; for hard labels it equals the KL divergence $D_{\text{KL}}(y \,\|\, \hat{p})$, since a 0/1 label has zero entropy
  • Categorical cross-entropy: $\mathcal{L} = -\sum_k y_k \log \hat{p}_k$ — for multi-class with one-hot $y$; reduces to $-\log \hat{p}_{y_{\text{true}}}$
  • Hinge loss (SVM): $\mathcal{L} = \max(0, 1 - y \cdot \hat{y})$ — creates maximum-margin classifier; not differentiable at $y\hat{y}=1$
  • Focal loss: $(1-\hat{p}_t)^\gamma \cdot \text{CE}$, where $\hat{p}_t$ is the predicted probability of the true class — down-weights easy examples; designed for class imbalance in object detection
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber_loss(y_true, y_pred, delta=1.0):
    residual = np.abs(y_true - y_pred)
    return np.mean(np.where(residual <= delta,
                            0.5 * residual**2,
                            delta * residual - 0.5 * delta**2))

def binary_cross_entropy(y_true, y_pred_prob):
    eps = 1e-12
    y_pred_prob = np.clip(y_pred_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred_prob) + (1 - y_true) * np.log(1 - y_pred_prob))

# Example: compare regression losses with an outlier
y_true = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # last point is outlier
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 4.5])

print("Regression losses (with outlier at index 4):")
print(f"  MSE:   {mse_loss(y_true, y_pred):.4f}   ← dominated by outlier")
print(f"  MAE:   {mae_loss(y_true, y_pred):.4f}   ← robust to outlier")
print(f"  Huber: {huber_loss(y_true, y_pred):.4f}   ← balanced")

# Classification cross-entropy
y_cls    = np.array([1, 0, 1, 1, 0])
y_proba  = np.array([0.9, 0.1, 0.8, 0.7, 0.3])
y_proba2 = np.array([0.6, 0.4, 0.6, 0.6, 0.4])  # less confident

print("\nBinary cross-entropy:")
print(f"  Confident correct predictions: {binary_cross_entropy(y_cls, y_proba):.4f}")
print(f"  Less confident predictions:    {binary_cross_entropy(y_cls, y_proba2):.4f}")

Regularization

Regularization adds a penalty term to the loss to discourage complex models and reduce overfitting:

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \lambda \cdot R(\theta)$$
  • L2 regularization (Ridge): $R(\theta) = \|\theta\|^2_2 = \sum_j \theta_j^2$ — penalizes large weights; encourages small, diffuse weights; differentiable everywhere; equivalent to Gaussian prior
  • L1 regularization (Lasso): $R(\theta) = \|\theta\|_1 = \sum_j |\theta_j|$ — induces exact sparsity; many weights become exactly 0; equivalent to Laplace prior; non-differentiable at 0 (use subgradients)
  • Elastic net: $R(\theta) = \alpha \|\theta\|_1 + (1-\alpha)\|\theta\|^2_2$ — combines sparsity of L1 and numerical stability of L2; useful when features are correlated
Geometric intuition: In 2D parameter space, L2 constrains parameters to a ball (smooth boundary) while L1 constrains to a diamond (corners on axes). The optimal solution under L1 tends to land at a corner of the diamond where one coordinate is exactly zero — this is why L1 produces sparse solutions.
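
To see this numerically, here is a minimal sketch; the synthetic data, regularization strength, and step size are arbitrary choices for illustration. Ridge has a closed-form solution, while the lasso solution is computed with proximal gradient descent (ISTA), whose soft-thresholding step is exactly what zeroes out coordinates.

import numpy as np

np.random.seed(0)

# Synthetic regression where only 3 of 10 features matter
n, p = 100, 10
X = np.random.randn(n, p)
w_true = np.zeros(p)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * np.random.randn(n)

lam = 5.0  # regularization strength (arbitrary)

# Ridge (L2): closed form (X^T X + lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Lasso (L1): proximal gradient descent (ISTA) with soft-thresholding
w_lasso = np.zeros(p)
step = 1.0 / np.linalg.norm(X.T @ X, 2)   # 1 / Lipschitz constant of the gradient
for _ in range(2000):
    grad = X.T @ (X @ w_lasso - y)
    w = w_lasso - step * grad
    w_lasso = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # soft-threshold

print("Ridge weights:", np.round(w_ridge, 3))
print("Lasso weights:", np.round(w_lasso, 3))
print(f"Exact zeros: ridge {np.sum(w_ridge == 0)}, lasso {np.sum(w_lasso == 0)}")

With a penalty this strong, the lasso weights on the irrelevant features come out exactly zero, while the ridge weights are merely shrunk toward zero.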

Bias-Variance Tradeoff

For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected MSE at a new point $x$ decomposes as:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{[\text{Bias}(\hat{f}(x))]^2}_{\text{systematic error}} + \underbrace{\text{Var}(\hat{f}(x))}_{\text{estimation variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$
  • Bias: error from wrong assumptions in the model (underfitting) — $\text{Bias} = \mathbb{E}[\hat{f}(x)] - f(x)$
  • Variance: error from sensitivity to fluctuations in training data (overfitting) — $\text{Var}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)^2] - (\mathbb{E}[\hat{f}(x)])^2$
  • Noise: irreducible error from inherent randomness in $y$

Increasing model complexity (more parameters) decreases bias but increases variance. Regularization reduces variance at the cost of slightly higher bias. The sweet spot is the model complexity that minimizes total generalization error.

import numpy as np

# Demonstrate bias-variance tradeoff via polynomial regression
np.random.seed(42)

def true_function(x):
    return np.sin(2 * np.pi * x)

def train_and_predict(degree, X_train, y_train, X_test):
    """Fit polynomial of given degree and predict on X_test."""
    X_train_poly = np.vander(X_train, degree + 1, increasing=True)
    X_test_poly  = np.vander(X_test, degree + 1, increasing=True)
    # Least squares solution
    coeffs, _, _, _ = np.linalg.lstsq(X_train_poly, y_train, rcond=None)
    return X_test_poly @ coeffs

n_datasets  = 50
n_train     = 10
X_test      = np.linspace(0, 1, 100)
y_test_true = true_function(X_test)

degrees = [1, 3, 9]
results = {}

for d in degrees:
    preds = []
    for _ in range(n_datasets):
        X_train = np.random.uniform(0, 1, n_train)
        y_train = true_function(X_train) + np.random.normal(0, 0.2, n_train)
        preds.append(train_and_predict(d, X_train, y_train, X_test))
    preds = np.array(preds)          # (50, 100)
    avg_pred = preds.mean(axis=0)    # (100,)
    bias_sq  = ((avg_pred - y_test_true)**2).mean()
    variance = preds.var(axis=0).mean()
    results[d] = {'bias_sq': bias_sq, 'variance': variance, 'total': bias_sq + variance}

print("Bias-Variance Tradeoff (over 50 datasets, noise σ=0.2):")
print(f"{'Degree':>8}  {'Bias²':>10}  {'Variance':>10}  {'Total':>10}")
for d in degrees:
    r = results[d]
    print(f"{d:>8}  {r['bias_sq']:>10.4f}  {r['variance']:>10.4f}  {r['total']:>10.4f}")
print("\nDegree 1: high bias (underfits), low variance")
print("Degree 3: balanced (good fit)  ")
print("Degree 9: low bias, very high variance (overfits)")

MLE & MAP

Maximum Likelihood Estimation (MLE) chooses parameters to maximize the probability of observed data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$$

Minimizing MSE is MLE under Gaussian noise; minimizing cross-entropy is MLE under a Bernoulli/categorical output distribution.

Maximum A Posteriori (MAP) adds a prior $p(\theta)$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \log p(\mathcal{D}\mid\theta) + \log p(\theta)$$
MAP = Regularized MLE:
  • Gaussian prior $p(\theta) \propto e^{-\lambda\|\theta\|^2}$ → MAP adds $-\lambda\|\theta\|^2$ to log-likelihood → equivalent to L2 regularization
  • Laplace prior $p(\theta) \propto e^{-\lambda\|\theta\|_1}$ → MAP adds $-\lambda\|\theta\|_1$ → equivalent to L1 regularization
Regularization is not just an engineering trick — it has a principled Bayesian interpretation as encoding prior beliefs about parameter magnitudes.
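
As a sanity check of the Gaussian-prior case, the following sketch (an assumed linear-Gaussian model with arbitrary noise level and prior precision) numerically maximizes the log-posterior with scipy.optimize.minimize and compares the result to the ridge closed form; the two should agree to numerical precision.

import numpy as np
from scipy.optimize import minimize

np.random.seed(1)
n, p = 80, 5
X = np.random.randn(n, p)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
sigma = 0.5                       # known noise standard deviation
y = X @ w_true + sigma * np.random.randn(n)
lam = 2.0                         # prior precision (arbitrary)

def neg_log_posterior(w):
    # Gaussian likelihood -> squared error / (2*sigma^2); Gaussian prior -> lam/2 * ||w||^2
    nll = np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)
    neg_log_prior = 0.5 * lam * np.sum(w ** 2)
    return nll + neg_log_prior

w_map = minimize(neg_log_posterior, np.zeros(p)).x

# Ridge closed form implied by the same objective: (X^T X + lam*sigma^2*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * sigma ** 2 * np.eye(p), X.T @ y)

print("MAP estimate:  ", np.round(w_map, 4))
print("Ridge solution:", np.round(w_ridge, 4))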

Softmax & Its Gradient

Softmax converts a vector of real-valued scores $\mathbf{z} \in \mathbb{R}^K$ into a probability distribution:

$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^K e^{z_j}}$$

The Jacobian of softmax is:

$$\frac{\partial \text{softmax}(\mathbf{z})_i}{\partial z_j} = \text{softmax}(\mathbf{z})_i \cdot (\mathbf{1}[i=j] - \text{softmax}(\mathbf{z})_j)$$

When combined with categorical cross-entropy loss, the gradient simplifies beautifully: $\frac{\partial \mathcal{L}}{\partial z_k} = \hat{p}_k - y_k$ (prediction minus one-hot label). This is why softmax + cross-entropy is the universal choice for multi-class classification.
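
This simplification is easy to verify with a finite-difference check. The sketch below uses arbitrary logits and a one-hot label; the analytic gradient $\hat{p} - y$ should match the numerical gradient of the softmax + cross-entropy composition.

import numpy as np

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, y_onehot):
    return -np.sum(y_onehot * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5, 1.5])     # arbitrary logits
y = np.array([0.0, 0.0, 1.0, 0.0])      # true class: index 2

analytic = softmax(z) - y               # the simplified gradient p_hat - y

# Central-difference numerical gradient
eps = 1e-6
numerical = np.zeros_like(z)
for k in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    numerical[k] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print("Analytic  dL/dz:", np.round(analytic, 6))
print("Numerical dL/dz:", np.round(numerical, 6))
print("Max abs difference:", np.max(np.abs(analytic - numerical)))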

Activation Functions

Nonlinear activation functions enable neural networks to represent non-linear mappings. The gradient of each activation is essential for backpropagation:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x)**2

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)

def gelu(x):
    """Gaussian Error Linear Unit — used in BERT, GPT."""
    return x * 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def swish(x):
    """Swish = x * sigmoid(x), used in EfficientNet."""
    return x * sigmoid(x)

x = np.array([-3, -1, -0.5, 0, 0.5, 1, 2, 3], dtype=float)

print("Activation function values:")
print(f"{'x':>8}  {'sigmoid':>10}  {'tanh':>10}  {'ReLU':>8}  {'GELU':>10}  {'Swish':>10}")
for xi, sg, th, rl, gl, sw in zip(x, sigmoid(x), np.tanh(x), relu(x), gelu(x), swish(x)):
    print(f"{xi:>8.1f}  {sg:>10.4f}  {th:>10.4f}  {rl:>8.4f}  {gl:>10.4f}  {sw:>10.4f}")

print("\nGradients at key points:")
print(f"sigmoid'(0)  = {sigmoid_grad(0):.4f}  (max gradient at 0)")
print(f"sigmoid'(5)  = {sigmoid_grad(5):.6f} (near-zero — vanishing gradient risk)")
print(f"tanh'(0)     = {tanh_grad(0):.4f}  (max gradient at 0, stronger than sigmoid)")
print(f"ReLU'(-1)    = {relu_grad(-1):.4f}  (dead neuron — zero gradient for x<0)")
print(f"ReLU'(1)     = {relu_grad(1):.4f}  (constant gradient for x>0)")
Vanishing & Exploding Gradients: Sigmoid and tanh saturate at extremes — their gradients approach 0 for large $|x|$. In deep networks, multiplying many near-zero gradients through the chain rule causes gradients to vanish, making early layers learn very slowly. Solutions:
  • ReLU: constant gradient for $x \gt 0$
  • Batch normalization: normalizes pre-activations to prevent saturation
  • Residual connections: skip connections provide a gradient highway
  • Gradient clipping: caps the gradient norm for exploding gradients
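
A toy chain-rule calculation makes the saturation problem concrete. The sketch below assumes, purely for illustration, that every layer sees the same pre-activation $x = 2$ and ignores the weight matrices; it just multiplies per-layer activation derivatives across depth.

import numpy as np

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

# Multiply per-layer activation derivatives (ignores weights, purely illustrative)
x = 2.0                                      # assumed pre-activation at every layer
for depth in [5, 10, 20, 50]:
    sig_factor  = sigmoid_grad(x) ** depth   # sigmoid'(2) is about 0.105
    relu_factor = 1.0 ** depth               # ReLU'(2) = 1
    print(f"depth {depth:>2}:  sigmoid factor ≈ {sig_factor:.2e},  ReLU factor = {relu_factor:.0f}")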

Batch Normalization

Batch normalization normalizes layer inputs to have zero mean and unit variance per mini-batch, then applies learnable scale $\gamma$ and shift $\beta$:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$

Where $\mu_B = \frac{1}{m}\sum_i x_i$ and $\sigma^2_B = \frac{1}{m}\sum_i (x_i - \mu_B)^2$ are batch statistics. During inference, running mean/variance (exponential moving averages from training) replace batch statistics.

Benefits: reduces internal covariate shift, smooths the loss landscape (enabling larger learning rates), provides a mild regularization effect, and reduces sensitivity to initialization.
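
A minimal forward-pass sketch of these equations, assuming a (batch, features) input and an illustrative momentum of 0.9 for the running statistics:

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.9, eps=1e-5):
    """Batch norm over a (batch, features) array, per the equations above."""
    if training:
        mu  = x.mean(axis=0)
        var = x.var(axis=0)
        # Exponential moving averages replace batch statistics at inference
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var  = momentum * running_var  + (1 - momentum) * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

np.random.seed(3)
x = np.random.randn(32, 4) * 5 + 10        # batch of 32, 4 features, shifted and scaled
gamma, beta = np.ones(4), np.zeros(4)
rm, rv = np.zeros(4), np.ones(4)

out, rm, rv = batchnorm_forward(x, gamma, beta, rm, rv, training=True)
print("Output mean per feature:", np.round(out.mean(axis=0), 4))   # ~ 0
print("Output std  per feature:", np.round(out.std(axis=0), 4))    # ~ 1
print("Running mean after one batch:", np.round(rm, 3))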

Practice Exercises

Exercise 1: MLE Derivation
Derive MLE for a Gaussian Distribution

Given i.i.d. samples $x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)$, derive the MLE estimators $\hat{\mu}$ and $\hat{\sigma}^2$.

Answer

Log-likelihood: $\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2$

$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i(x_i - \mu) = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_i x_i$ (sample mean)

$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i(x_i-\mu)^2 = 0 \Rightarrow \hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_i(x_i-\hat{\mu})^2$

Note: the MLE variance divides by $n$, not $n-1$, making it a biased estimator (Bessel's correction from Part 5 uses $n-1$ to correct this bias).
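
A quick Monte Carlo check of that bias (sample size and true variance chosen arbitrarily): the average MLE estimate comes out close to $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$.

import numpy as np

np.random.seed(7)
n, true_var = 5, 4.0
trials = 200_000

samples = np.random.normal(0, np.sqrt(true_var), size=(trials, n))
var_mle      = samples.var(axis=1, ddof=0)   # divide by n   (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)   # divide by n-1 (Bessel-corrected)

print(f"True variance:          {true_var}")
print(f"Mean of MLE estimator:  {var_mle.mean():.4f}   (theory: (n-1)/n * sigma^2 = {true_var * (n - 1) / n:.1f})")
print(f"Mean with n-1 divisor:  {var_unbiased.mean():.4f}")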

Exercise 2: Regularization
L1 vs L2 Regularization: Why L1 is Sparse

For a 2-parameter model minimizing MSE + L2 (iso-contours: circles) vs MSE + L1 (iso-contours: diamonds), explain geometrically why the L1 optimal solution tends to have one parameter exactly equal to zero.

Answer

The constrained optimization perspective: minimize MSE subject to $\|\theta\|_1 \leq t$ (L1) or $\|\theta\|_2^2 \leq t$ (L2).

The feasible region for L2 is a smooth ball; the feasible region for L1 is a diamond with corners on the axes. The MSE loss has elliptical iso-contours. The optimal constrained solution occurs where the MSE contour first touches the feasible region boundary.

For L2 (ball), this contact point is typically on a smooth portion of the boundary, giving a non-sparse solution. For L1 (diamond), the corners protrude far toward the axes, making it likely that the MSE contour first touches a corner — where one parameter $\theta_j = 0$. The sharper the corners (higher dimension), the more likely sparse solutions arise.

Conclusion & Next Steps

This part synthesized the math of previous parts into the specific constructs driving model training:

  • Loss functions encode assumptions about noise and error importance
  • L1/L2 regularization are MAP with Laplace/Gaussian priors — not arbitrary penalties
  • Bias-variance decomposition explains the underfitting vs overfitting tension
  • Minimizing cross-entropy is MLE; MAP with a Gaussian prior is MLE plus L2 regularization
  • Softmax + cross-entropy has the clean gradient $\hat{p} - y$
  • Activation gradients determine vanishing/exploding gradient risk
  • Batch norm stabilizes training by normalizing pre-activations

Next in the Series

In Part 10: Computational Math with Python, we shift from theory to implementation: NumPy, SciPy, SymPy, Monte Carlo simulation, numerical integration, and performance-focused vectorization.