
Training, Alignment & Evaluation Math

April 30, 2026 · Wasil Zafar · 24 min read

Modern AI quality depends on optimization, preference learning, calibration, and evaluation statistics. This page turns training and alignment terms into equations you can reason about.

Table of Contents

  1. AdamW
  2. Schedules & Clipping
  3. Perplexity & Calibration
  4. Preference Optimization
  5. Evaluation Uncertainty
Modern training math: optimization controls whether a model learns; alignment objectives control what it prefers; evaluation statistics control whether improvements are real.

AdamW

Adam keeps exponential moving averages of gradients and squared gradients:

$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t,\quad v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$

AdamW decouples weight decay from the adaptive update, which is often more stable for large neural networks.
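
With bias-corrected moments $\hat{m}_t=m_t/(1-\beta_1^t)$ and $\hat{v}_t=v_t/(1-\beta_2^t)$, the AdamW update applies weight decay as a separate term instead of folding it into the gradient:

$$w_t=w_{t-1}-\eta\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}+\lambda w_{t-1}\right)$$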

import numpy as np

# Toy single-step AdamW update for a 3-parameter model.
grad = np.array([0.4, -0.2, 0.1])   # gradient of the loss w.r.t. w
w = np.array([1.0, -1.0, 0.5])      # current weights
m = np.zeros_like(w)                # first-moment EMA
v = np.zeros_like(w)                # second-moment EMA
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01
t = 1                               # step counter for bias correction

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)          # bias-corrected first moment
v_hat = v / (1 - beta2**t)          # bias-corrected second moment
# Decoupled weight decay: wd * w is added to the update, not to the gradient.
w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
print(np.round(w, 6))

Schedules & Clipping

Warmup ramps the learning rate up from a small value to prevent unstable early updates; cosine decay then gradually reduces it over the rest of training. Gradient clipping rescales the gradient when its norm $\|g\|_2$ exceeds a threshold, so a rare huge update cannot destabilize training.
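
A minimal sketch of both controls, assuming linear warmup into cosine decay and global-norm clipping; the function names and hyperparameter values here are illustrative, not tied to any particular framework.

import numpy as np

def lr_at(step, base_lr=3e-4, warmup=1000, total=100_000, min_lr=3e-5):
    # Linear warmup, then cosine decay from base_lr down to min_lr.
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + np.cos(np.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together when their joint L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

print(round(lr_at(500), 6), round(lr_at(50_000), 6))
grads = [np.array([3.0, 4.0])]          # norm 5 -> rescaled to norm 1
print(clip_by_global_norm(grads)[0])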

Perplexity & Calibration

For language models, perplexity is the exponential of the average negative log-likelihood per token:

$$\text{PPL}=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i|x_{<i})\right)$$
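
A minimal sketch: given the probabilities the model assigned to the tokens that actually occurred (the values below are made up), perplexity is the exponential of their mean negative log probability.

import numpy as np

# Hypothetical next-token probabilities assigned to the observed tokens.
token_probs = np.array([0.25, 0.10, 0.60, 0.05, 0.30])
nll = -np.mean(np.log(token_probs))   # average negative log-likelihood
ppl = np.exp(nll)
print(round(nll, 4), round(ppl, 4))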

Calibration asks whether probabilities mean what they say. If a model predicts 80% confidence on 100 examples, about 80 should be correct.
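
A rough way to check this is to bin predictions by confidence and compare mean confidence to observed accuracy in each bin. This is a sketch with synthetic, well-calibrated data; real calibration analysis typically also reports a summary such as expected calibration error.

import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=2000)     # predicted confidence per example
correct = rng.uniform(size=2000) < conf     # synthetic outcomes matching confidence

bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (conf >= lo) & (conf < hi)
    if mask.any():
        print(f"conf {lo:.1f}-{hi:.1f}: mean conf {conf[mask].mean():.3f}, "
              f"accuracy {correct[mask].mean():.3f}")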

Preference Optimization

RLHF trains a reward model from comparisons, then optimizes a policy with a KL penalty to avoid drifting too far from the reference model. DPO writes the preference objective directly in terms of policy log probabilities:

$$\mathcal{L}_{DPO}=-\log\sigma\left(\beta\left[\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}-\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right]\right)$$
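
A minimal sketch of the DPO loss for a single preference pair, assuming you already have the summed log probabilities of the chosen response $y_w$ and the rejected response $y_l$ under both the policy and the frozen reference model; the numbers below are made up.

import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin: how much more the policy prefers y_w over y_l,
    # measured relative to the reference model.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1 / (1 + np.exp(-beta * margin)))   # -log sigmoid(beta * margin)

# Hypothetical sequence log probabilities (summed over tokens).
print(round(dpo_loss(logp_w=-42.0, logp_l=-55.0,
                     ref_logp_w=-45.0, ref_logp_l=-50.0), 4))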

Evaluation Uncertainty

Benchmarks are samples. A 1% improvement on 200 examples may be noise; a 1% improvement on 20,000 examples is more convincing. Always pair score changes with uncertainty estimates and error analysis.
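
A minimal sketch of the normal-approximation confidence interval for accuracy, showing how the same score looks at n = 200 versus n = 20,000; the 71% accuracy used here is illustrative.

import numpy as np

def accuracy_ci(correct, n, z=1.96):
    # Normal-approximation 95% interval: p +/- z * sqrt(p(1-p)/n).
    p = correct / n
    half = z * np.sqrt(p * (1 - p) / n)
    return p, p - half, p + half

for n in (200, 20_000):
    p, lo, hi = accuracy_ci(int(0.71 * n), n)
    print(f"n={n}: acc={p:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")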

Exercise: Evaluation
Confidence Interval for Accuracy

A model gets 870 out of 1000 examples correct. Estimate a 95% confidence interval using $\hat{p}\pm1.96\sqrt{\hat{p}(1-\hat{p})/n}$.