Training, Alignment & Evaluation Math

April 30, 2026 · Wasil Zafar · 24 min read

Modern AI quality depends on optimization, preference learning, calibration, and evaluation statistics. This page turns training and alignment terms into equations you can reason about.

Table of Contents

  1. AdamW
  2. Schedules & Clipping
  3. Perplexity & Calibration
  4. Preference Optimization
  5. Evaluation Uncertainty
Modern training math: optimization controls whether a model learns; alignment objectives control what it prefers; evaluation statistics control whether improvements are real.

AdamW

Adam keeps exponential moving averages of gradients and squared gradients:

$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t,\quad v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

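AdamW decouples weight decay from the adaptive update: the decay term is applied directly to the weights rather than added to the gradient, so it is not rescaled by the adaptive $1/\sqrt{v_t}$ factor. This is often more stable for large neural networks.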

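import numpy as np

# One AdamW step (t = 1) on a toy 3-parameter problem
grad = np.array([0.4, -0.2, 0.1])
w = np.array([1.0, -1.0, 0.5])
m = np.zeros_like(w)
v = np.zeros_like(w)
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01
t = 1  # step counter, needed for bias correction

# Exponential moving averages of the gradient and squared gradient
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2

# Bias correction: the averages start at zero, so divide by 1 - beta**t
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)

# Decoupled weight decay: wd * w sits outside the adaptive rescaling
w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
print(np.round(w, 6))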
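
The single step above is the $t=1$ case, where the bias-correction denominators reduce to $1-\beta_1$ and $1-\beta_2$. A minimal sketch of the same update run over many steps might look like the following. The helper name `adamw_step`, the toy quadratic loss, the target vector, and the hyperparameters are illustrative assumptions, not values from a real training run.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update at step t (t starts at 1). Hypothetical helper for illustration."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)  # bias correction depends on the step count
    v_hat = v / (1 - beta2**t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# Toy problem (assumption): minimize 0.5 * ||w - target||^2, whose gradient is w - target
target = np.array([0.5, -0.3, 0.2])
w = np.array([1.0, -1.0, 0.5])
m = np.zeros_like(w)
v = np.zeros_like(w)

initial_loss = 0.5 * np.sum((w - target) ** 2)
for t in range(1, 501):
    grad = w - target
    w, m, v = adamw_step(w, grad, m, v, t)
final_loss = 0.5 * np.sum((w - target) ** 2)

print(f"initial loss: {initial_loss:.4f}")
print(f"final loss:   {final_loss:.6f}")
```

Passing `t` explicitly keeps the bias correction right after the first step, which a hard-coded `1 - beta1` would not.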

Schedules & Clipping

Warmup prevents early unstable updates. Cosine decay gradually reduces learning rate. Gradient clipping rescales gradients when $\|g\|_2$ exceeds a threshold, preventing rare huge updates from destabilizing training.
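
As a rough sketch of these three ideas, the snippet below uses linear warmup followed by cosine decay, and clips by the global L2 norm. The function names `lr_at_step` and `clip_by_global_norm` and all default values are assumptions chosen for illustration. Real training setups differ in warmup shape and minimum learning rate.

```python
import math
import numpy as np

def lr_at_step(step, base_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr (illustrative defaults)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def clip_by_global_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm is at most max_norm; returns (clipped, original_norm)."""
    norm = float(np.linalg.norm(grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return grad * scale, norm

for step in (0, 50, 99, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")

big_grad = np.array([3.0, 4.0])  # norm 5, exceeds the threshold
clipped, original_norm = clip_by_global_norm(big_grad, max_norm=1.0)
print(f"norm before: {original_norm:.2f}, after: {np.linalg.norm(clipped):.2f}")
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length changes.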

Perplexity & Calibration

For language models, perplexity is exponentiated average negative log likelihood:

$$\text{PPL}=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i|x_{<i})\right)$$

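Derivation from cross-entropy: The empirical cross-entropy of the model on a held-out sequence is the average negative log-likelihood, $H = -\frac{1}{N}\sum_{i}\log p(x_i|x_{<i})$, where $p$ is the model's predictive distribution as in the formula above. Perplexity is $e^{H}$ when $H$ is measured in nats (natural log), or equivalently $2^{H}$ when $H$ is measured in bits (log base 2). It can be read as an effective branching factor: a model with perplexity $k$ is, on average, as uncertain as if it were choosing uniformly among $k$ tokens at each step.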

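Connection to bits-per-character (BPC): with $H$ in nats per predicted symbol, $H / \log 2$ is the same quantity in bits. For a character-level model this is BPC. For a token-level model it is bits per token, which must be divided by the average number of characters per token to obtain BPC. Lower values mean more efficient compression. A model with PPL = 20 corresponds to $\log_2 20 \approx 4.32$ bits per predicted symbol.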

Calibration: A model is calibrated if its predicted probabilities match empirical frequencies. Expected Calibration Error (ECE) measures this:

$$\text{ECE} = \sum_{b=1}^B \frac{|B_b|}{N}\left|\text{acc}(B_b) - \text{conf}(B_b)\right|$$

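where $B_b$ is the set of predictions whose confidence falls in bin $b$, $\text{acc}(B_b)$ is the fraction of them that are correct, and $\text{conf}(B_b)$ is their mean confidence. Temperature scaling recalibrates post hoc by dividing the logits by a learned scalar $T > 0$ before the softmax: $p = \text{softmax}(z / T)$. $T > 1$ softens overconfident predictions, and $T < 1$ sharpens underconfident ones.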

import numpy as np

# Perplexity computation for a language model
log_probs = np.array([-2.3, -1.8, -3.1, -2.0, -1.5, -2.8, -1.9, -2.5])
N = len(log_probs)

# Cross-entropy in nats per token (negative average log-prob)
cross_entropy = -np.mean(log_probs)
perplexity = np.exp(cross_entropy)
bits_per_token = cross_entropy / np.log(2)  # equals BPC only for a character-level model

print(f"Average NLL: {cross_entropy:.4f}")
print(f"Perplexity:  {perplexity:.2f}")
print(f"Bits/token:  {bits_per_token:.4f}")

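# ECE computation
confidences = np.array([0.95, 0.85, 0.75, 0.65, 0.92, 0.55, 0.88, 0.72])
correct = np.array([1, 1, 0, 1, 1, 0, 1, 1])
n_samples = len(confidences)  # number of predictions, not the token count N above
n_bins = 4
bin_edges = np.linspace(0.5, 1.0, n_bins + 1)
ece = 0.0
for i in range(n_bins):
    lo, hi = bin_edges[i], bin_edges[i + 1]
    # Half-open bins [lo, hi); the last bin also includes confidence == 1.0
    upper = (confidences <= hi) if i == n_bins - 1 else (confidences < hi)
    mask = (confidences >= lo) & upper
    if mask.sum() > 0:
        bin_acc = correct[mask].mean()
        bin_conf = confidences[mask].mean()
        ece += mask.sum() / n_samples * abs(bin_acc - bin_conf)
print(f"\nECE: {ece:.4f}")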

# Temperature scaling effect
logits = np.array([2.5, 1.0, 0.3])
for T in [0.5, 1.0, 2.0]:
    scaled = logits / T
    exp_scaled = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs = exp_scaled / exp_scaled.sum()
    print(f"T={T}: probs = {np.round(probs, 3)}")
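
The text describes $T$ as learned. One simple way to fit it is to pick the value that minimizes negative log-likelihood on held-out data. The sketch below does this with a grid search on synthetic data: the logits are made overconfident on purpose by multiplying well-calibrated logits by 3, so the fitted temperature should land near 3. The data, the grid, and the helper names are assumptions for illustration. In practice $T$ is usually fit with a gradient-based optimizer on a validation set.

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax, stabilized by subtracting the row maximum."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def nll(logits, labels, T):
    """Mean negative log-likelihood of the labels after dividing logits by T."""
    log_probs = log_softmax(logits / T)
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

# Synthetic held-out set (assumption): labels are drawn from softmax(true_logits)
rng = np.random.default_rng(0)
n, k = 2000, 3
true_logits = rng.normal(size=(n, k))
cdf = np.cumsum(np.exp(log_softmax(true_logits)), axis=1)
u = rng.random(n)
labels = np.minimum((u[:, None] > cdf).sum(axis=1), k - 1)

# The "model" is overconfident by construction: its logits are 3x too large
model_logits = 3.0 * true_logits

grid = np.linspace(0.5, 5.0, 46)  # candidate temperatures in steps of 0.1
losses = np.array([nll(model_logits, labels, T) for T in grid])
best_T = float(grid[np.argmin(losses)])

print(f"NLL at T=1:      {nll(model_logits, labels, 1.0):.4f}")
print(f"NLL at best T:   {losses.min():.4f}")
print(f"Fitted T:        {best_T:.1f}")
```

Dividing by a single scalar leaves the argmax unchanged, so temperature scaling changes confidence but not accuracy.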

Preference Optimization

After supervised training, language models need alignment — learning to produce outputs humans prefer. The math of alignment centers on reward modeling and constrained policy optimization.

RLHF: Reward Modeling & KL-Constrained Optimization

Step 1 — Reward Model: Given preference pairs $(y_w \succ y_l | x)$ from human annotators, train a reward model $r_\psi(x, y)$ using the Bradley-Terry model:

$$P(y_w \succ y_l | x) = \sigma(r_\psi(x, y_w) - r_\psi(x, y_l))$$

The loss maximizes log-likelihood of observed preferences:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma(r_\psi(x, y_w) - r_\psi(x, y_l))\right]$$
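
A minimal numerical sketch of this loss, assuming the reward model has already produced scalar scores for each pair. The score arrays are made-up values and the function names are hypothetical.

```python
import numpy as np

def preference_prob(r_w, r_l):
    """Bradley-Terry probability that y_w is preferred: sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(r_w) - np.asarray(r_l))))

def reward_model_loss(r_w, r_l):
    """Mean of -log sigmoid(r_w - r_l), using -log sigmoid(d) = logaddexp(0, -d)."""
    margin = np.asarray(r_w, dtype=float) - np.asarray(r_l, dtype=float)
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Hypothetical reward scores for 4 preference pairs
r_w = np.array([1.2, 0.4, 2.0, -0.1])  # scores for preferred completions
r_l = np.array([0.3, 0.6, 0.5, -0.9])  # scores for dispreferred completions

print("P(y_w > y_l):", np.round(preference_prob(r_w, r_l), 3))
print(f"Reward model loss: {reward_model_loss(r_w, r_l):.4f}")
```

Only the difference $r_\psi(x, y_w) - r_\psi(x, y_l)$ enters the loss, so adding a constant to every reward leaves it unchanged.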

Step 2 — KL-Constrained Policy Optimization: Maximize expected reward while staying close to a reference policy $\pi_{\text{ref}}$ (the SFT model):

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}[r_\psi(x, y)] - \beta \, \mathbb{E}_{x \sim \mathcal{D}}\left[D_{KL}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right)\right]$$

The KL penalty prevents reward hacking — exploiting quirks of the reward model by drifting too far from sensible language. This objective is optimized using PPO (see Extension 5) with the reward model providing the signal.

Optimal solution: The closed-form optimal policy for the KL-constrained objective is:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ is the partition function. Because the sum ranges over every possible completion $y$, $Z(x)$ is intractable to compute for a language model, but the closed form provides the foundation for DPO.
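
For a single prompt with only a handful of possible completions, $Z(x)$ is easy to compute, so the closed form can be checked numerically. The sketch below assumes four completions with made-up reference probabilities and rewards. It compares $\pi^*$ against randomly sampled policies and checks the identity that the objective at the optimum equals $\beta \log Z(x)$. All names and numbers are illustrative.

```python
import numpy as np

def kl_objective(pi, pi_ref, r, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for one prompt with a finite completion set."""
    safe_pi = np.clip(pi, 1e-300, None)  # avoids log(0); 0 * log(tiny) contributes 0
    return float(np.sum(pi * r) - beta * np.sum(pi * np.log(safe_pi / pi_ref)))

def optimal_policy(pi_ref, r, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed in log space."""
    logits = np.log(pi_ref) + r / beta
    logits = logits - logits.max()
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy setup (assumption): 4 candidate completions for one prompt
pi_ref = np.array([0.5, 0.3, 0.15, 0.05])
r = np.array([0.0, 1.0, 2.0, 0.5])
beta = 0.5

pi_star = optimal_policy(pi_ref, r, beta)
best = kl_objective(pi_star, pi_ref, r, beta)
Z = float(np.sum(pi_ref * np.exp(r / beta)))

# Compare against 1000 random policies drawn from a flat Dirichlet
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(len(pi_ref)), size=1000)
best_random = max(kl_objective(p, pi_ref, r, beta) for p in candidates)

print("pi*:", np.round(pi_star, 4))
print(f"objective at pi*:       {best:.6f}")
print(f"beta * log Z:           {beta * np.log(Z):.6f}")
print(f"best of random samples: {best_random:.6f}")
```

Smaller $\beta$ concentrates $\pi^*$ on the highest-reward completion, while larger $\beta$ keeps it closer to $\pi_{\text{ref}}$.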

DPO: Direct Preference Optimization

Key insight: We can rearrange the optimal policy to express the reward in terms of policies:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting into the Bradley-Terry preference model and noting that $Z(x)$ cancels between $y_w$ and $y_l$:

$$P(y_w \succ y_l | x) = \sigma\left(\beta\left[\log\frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right]\right)$$

Now replace the unknown $\pi^*$ with our learnable policy $\pi_\theta$ and maximize the preference likelihood directly:

$$\boxed{\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log\sigma\left(\beta\left[\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right]\right)\right]}$$

Why DPO works: It implicitly defines a reward $r(x,y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ and optimizes it to match human preferences — without ever training a separate reward model or running RL. The $\beta$ parameter controls how far the policy can deviate from the reference (same role as the KL penalty in RLHF).

import numpy as np

def dpo_loss(log_pi_w, log_pi_l, log_ref_w, log_ref_l, beta=0.1):
    """
    Compute DPO loss for a batch of preference pairs.
    
    Args:
        log_pi_w: log pi_theta(y_w|x) for preferred completions
        log_pi_l: log pi_theta(y_l|x) for dispreferred completions
        log_ref_w: log pi_ref(y_w|x) for preferred completions
        log_ref_l: log pi_ref(y_l|x) for dispreferred completions
        beta: temperature parameter (controls deviation from reference)
    
    Returns:
        Scalar DPO loss (minimize this)
    """
    # Log-ratio differences
    log_ratio_w = log_pi_w - log_ref_w  # log(pi/ref) for winner
    log_ratio_l = log_pi_l - log_ref_l  # log(pi/ref) for loser
    
    # DPO objective: -log sigmoid(beta * (log_ratio_w - log_ratio_l))
    logits = beta * (log_ratio_w - log_ratio_l)
    # -log sigmoid(z) = log(1 + exp(-z)); logaddexp avoids overflow for large |z|
    loss = np.mean(np.logaddexp(0.0, -logits))
    return loss

# Example: 4 preference pairs

# Simulated log-probs (policy assigns higher prob to preferred outputs)
log_pi_w = np.array([-1.2, -0.8, -1.5, -0.9])   # pi_theta on winners
log_pi_l = np.array([-2.1, -1.9, -2.3, -2.0])   # pi_theta on losers
log_ref_w = np.array([-1.5, -1.2, -1.8, -1.3])  # pi_ref on winners
log_ref_l = np.array([-1.8, -1.5, -2.0, -1.6])  # pi_ref on losers

for beta in [0.05, 0.1, 0.5]:
    loss = dpo_loss(log_pi_w, log_pi_l, log_ref_w, log_ref_l, beta)
    print(f"beta={beta}: DPO loss = {loss:.4f}")

# Implicit reward under current policy
implicit_reward_w = 0.1 * (log_pi_w - log_ref_w)
implicit_reward_l = 0.1 * (log_pi_l - log_ref_l)
print(f"\nImplicit rewards (beta=0.1):")
print(f"  Winners: {np.round(implicit_reward_w, 4)}")
print(f"  Losers:  {np.round(implicit_reward_l, 4)}")
print(f"  Reward margin: {np.round(implicit_reward_w - implicit_reward_l, 4)}")

Evaluation Uncertainty

Benchmarks are samples. A 1-point accuracy improvement on 200 examples is a difference of two examples and may be noise; the same improvement on 20,000 examples is far more convincing. Always pair score changes with uncertainty estimates and error analysis.
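
To put rough numbers on this, the sketch below computes the binomial standard error of an accuracy estimate at both benchmark sizes. The accuracy of 0.80 is an assumption chosen only for illustration, and the function name is hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

assumed_accuracy = 0.80  # illustrative assumption, not a measured value
for n in (200, 20_000):
    se = accuracy_standard_error(assumed_accuracy, n)
    print(f"n={n:>6}: SE = {100 * se:.2f} points, "
          f"95% half-width = {100 * 1.96 * se:.2f} points")
```

Under this assumption the standard error is roughly 2.8 points at 200 examples and 0.28 points at 20,000, so a 1-point gain is well inside the noise in the first case and several standard errors in the second.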

Exercise · Evaluation
Confidence Interval for Accuracy

A model gets 870 out of 1000 examples correct. Estimate a 95% confidence interval using $\hat{p}\pm1.96\sqrt{\hat{p}(1-\hat{p})/n}$.
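
One way to check your answer is to plug the numbers into the formula directly. The helper name `normal_approx_ci` is hypothetical. The normal approximation is reasonable here because both the number of successes and the number of failures are large.

```python
import math

def normal_approx_ci(num_correct, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = num_correct / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

low, high = normal_approx_ci(870, 1000)
print(f"p_hat = {870 / 1000:.3f}")
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

The interval comes out to about 0.849 to 0.891, a half-width of roughly 2.1 percentage points.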