AdamW
Adam keeps exponential moving averages of gradients and squared gradients:
$$m_t=\beta_1m_{t-1}+(1-\beta_1)g_t,\quad v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2$$
AdamW decouples weight decay from the adaptive update: the decay term is applied directly to the weights rather than folded into the gradient, which is often more stable for large neural networks. A single update step on a toy parameter vector:
import numpy as np

# One AdamW step (t = 1) on a toy parameter vector.
grad = np.array([0.4, -0.2, 0.1])
w = np.array([1.0, -1.0, 0.5])
m = np.zeros_like(w)  # first-moment EMA
v = np.zeros_like(w)  # second-moment EMA
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01
t = 1

m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)  # bias correction for zero-initialized moments
v_hat = v / (1 - beta2**t)
# Decoupled weight decay: the wd * w term sits outside the adaptive update.
w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
print(np.round(w, 6))
Schedules & Clipping
Warmup prevents unstable updates early in training. Cosine decay gradually reduces the learning rate over the rest of training. Gradient clipping rescales gradients when $\|g\|_2$ exceeds a threshold, preventing rare huge updates from destabilizing training.
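A minimal sketch of one common recipe, linear warmup into cosine decay plus clipping by global norm; the step counts and learning rates below are illustrative, not values from the text.

import numpy as np

def lr_at(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000, min_lr=0.0):
    # Linear warmup, then cosine decay from max_lr down to min_lr (illustrative defaults).
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together if their joint L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

print(lr_at(0), lr_at(1_000), lr_at(50_000), lr_at(100_000))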
Perplexity & Calibration
For language models, perplexity is exponentiated average negative log likelihood:
$$\text{PPL}=\exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i|x_{<i})\right)$$
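As a small sketch, with made-up per-token probabilities:

import numpy as np

# p(x_i | x_<i) for a 4-token sequence (made-up numbers).
token_probs = np.array([0.2, 0.5, 0.1, 0.4])
ppl = np.exp(-np.mean(np.log(token_probs)))
print(round(ppl, 3))  # the inverse geometric mean of the per-token probabilities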
Calibration asks whether predicted probabilities mean what they say: if a model assigns 80% confidence to each of 100 predictions, about 80 of them should be correct.
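A toy reliability check along these lines, using a simulated model whose confidences happen to be well calibrated; the bins and sample size are illustrative:

import numpy as np

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=1000)  # predicted confidence per example
correct = rng.random(1000) < conf        # simulate a well-calibrated model
bins = np.linspace(0.5, 1.0, 6)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (conf >= lo) & (conf < hi)
    print(f"conf {lo:.1f}-{hi:.1f}: mean confidence {conf[mask].mean():.2f}, accuracy {correct[mask].mean():.2f}")

For a well-calibrated model, mean confidence and observed accuracy should roughly match in every bin.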
Preference Optimization
RLHF trains a reward model from comparisons, then optimizes a policy with a KL penalty to avoid drifting too far from the reference model. DPO writes the preference objective directly in terms of policy log probabilities:
$$\mathcal{L}_{DPO}=-\log\sigma\left(\beta\left[\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)}-\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right]\right)$$
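A minimal sketch of this loss for a single preference pair, with made-up sequence log-probabilities; beta = 0.1 is an assumed value, not one from the text:

import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Log-probability margins of the chosen (w) and rejected (l) responses
    # relative to the reference model, passed through -log sigmoid.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

print(round(dpo_loss(-12.0, -15.0, -13.0, -14.0), 4))  # ~0.598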
Evaluation Uncertainty
Benchmarks are samples. A 1% improvement on 200 examples may be noise; a 1% improvement on 20,000 examples is more convincing. Always pair score changes with uncertainty estimates and error analysis.
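To see why, compare the width of the normal-approximation interval at the two evaluation sizes (assuming accuracy near 70%, an illustrative value):

import numpy as np

p = 0.7  # assumed accuracy level, for illustration
for n in (200, 20_000):
    half_width = 1.96 * np.sqrt(p * (1 - p) / n)
    print(f"n={n:>6}: 95% CI half-width ~ {half_width:.3f}")

At n = 200 the half-width is roughly ±6 points, so a 1-point change sits well inside the noise; at n = 20,000 it shrinks to about ±0.6 points.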
Confidence Interval for Accuracy
A model gets 870 out of 1000 examples correct. Estimate a 95% confidence interval using $\hat{p}\pm1.96\sqrt{\hat{p}(1-\hat{p})/n}$.
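Plugging the numbers into the formula:

import numpy as np

p_hat, n = 870 / 1000, 1000
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(f"accuracy = {p_hat:.3f}, 95% CI = [{p_hat - 1.96*se:.3f}, {p_hat + 1.96*se:.3f}]")

which gives roughly [0.849, 0.891].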