Model Family Map
| Family | Core Objective | Math Tool | Likelihood? |
|---|---|---|---|
| VAE | Maximize ELBO (lower bound on log-likelihood) | Variational inference, KL divergence | Lower bound |
| GAN | Minimax game → minimize JS divergence | Game theory, divergence minimization | Implicit |
| Normalizing Flow | Exact log-likelihood via change of variables | Determinant of Jacobian | Exact |
| Diffusion | Denoising score matching | Markov chains, score functions | Lower bound (variational) |
VAEs & the ELBO
The fundamental problem: we want to maximize $\log p_\theta(x)$ (log-likelihood of observed data), but computing $p_\theta(x) = \int p_\theta(x|z)p(z)\,dz$ is intractable when the latent space is high-dimensional.
The solution: introduce an approximate posterior $q_\phi(z|x)$ and derive a tractable lower bound.
ELBO Derivation via Jensen's Inequality
Step 1: Start with the log-likelihood and introduce $q_\phi(z|x)$:
$$\log p_\theta(x) = \log \int p_\theta(x,z)\,dz = \log \int \frac{p_\theta(x,z)}{q_\phi(z|x)} q_\phi(z|x)\,dz$$
Step 2: Apply Jensen's inequality ($\log$ is concave, so $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):
$$\log p_\theta(x) \geq \int q_\phi(z|x) \log \frac{p_\theta(x,z)}{q_\phi(z|x)}\,dz = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$
Step 3: Expand using $p_\theta(x,z) = p_\theta(x|z)p(z)$:
$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)}\right]$$
$$\boxed{\text{ELBO} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization}}}$$
The gap: How tight is this bound? The gap equals $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$:
$$\log p_\theta(x) = \text{ELBO} + D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) \geq \text{ELBO}$$
When $q_\phi$ perfectly matches the true posterior, the bound is tight.
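The gap identity can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical discrete latent with three states; the tables `prior`, `lik`, and `q` are made-up illustrative numbers, and `elbo_and_gap` is a helper name introduced here, not part of any library.

```python
import numpy as np

def elbo_and_gap(prior, lik, q):
    """Return (log p(x), ELBO, KL(q || posterior)) for a discrete latent z.

    prior: p(z), shape (K,)
    lik:   p(x | z) for one fixed observation x, shape (K,)
    q:     approximate posterior q(z | x), shape (K,)
    """
    joint = prior * lik                 # p(x, z)
    evidence = joint.sum()              # p(x) = sum_z p(x, z)
    posterior = joint / evidence        # true posterior p(z | x)
    elbo = np.sum(q * (np.log(joint) - np.log(q)))
    kl = np.sum(q * (np.log(q) - np.log(posterior)))
    return np.log(evidence), elbo, kl

# Hypothetical toy numbers chosen only for illustration
prior = np.array([0.5, 0.3, 0.2])
lik = np.array([0.10, 0.60, 0.30])
q = np.array([0.2, 0.5, 0.3])

log_px, elbo, kl = elbo_and_gap(prior, lik, q)
print(f"log p(x)  = {log_px:.6f}")
print(f"ELBO      = {elbo:.6f}")
print(f"KL gap    = {kl:.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")
```

Setting `q` equal to the true posterior should drive the KL term to zero, at which point the ELBO coincides with `log p(x)`.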
The Reparameterization Trick
We need gradients of the ELBO with respect to $\phi$, but the ELBO contains an expectation over samples $z \sim q_\phi(z|x)$, and the sampling step itself is not differentiable with respect to $\phi$. So we reparameterize:
$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
Now the randomness is in $\epsilon$ (independent of $\phi$), and the gradient flows through $\mu_\phi$ and $\sigma_\phi$ as through any deterministic function.
import numpy as np
# VAE ELBO computation with reparameterization trick
np.random.seed(42)
# Encoder outputs (for a single data point x)
mu = np.array([0.3, -0.5, 1.2]) # mean of q(z|x)
log_var = np.array([-0.2, 0.1, -0.5]) # log variance of q(z|x)
sigma = np.exp(0.5 * log_var)
# Reparameterization: z = mu + sigma * epsilon
epsilon = np.random.randn(*mu.shape)
z = mu + sigma * epsilon
print("Sampled z:", np.round(z, 4))
# KL divergence: D_KL(N(mu, sigma^2) || N(0, I))
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(f"KL divergence: {kl:.4f}")
# Reconstruction loss (assume Gaussian decoder, MSE)
x_original = np.array([1.0, 0.5, -0.3])
x_reconstructed = z * 0.8 + 0.1 # simplified decoder
recon_loss = 0.5 * np.sum((x_original - x_reconstructed)**2)
print(f"Reconstruction loss: {recon_loss:.4f}")
# Single-sample ELBO estimate, up to the Gaussian decoder's additive constant:
# ELBO = -recon_loss - KL (we maximize ELBO, so minimize -ELBO)
elbo = -recon_loss - kl
print(f"ELBO: {elbo:.4f}")
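To make the gradient claim of the reparameterization trick concrete, here is a minimal sketch that estimates pathwise gradients by Monte Carlo and compares them with a case that has a closed form. It assumes a scalar latent and the test function $f(z) = z^2$, for which $\mathbb{E}[z^2] = \mu^2 + \sigma^2$; the function name `reparam_gradients` and the parameter values are illustrative choices, not part of the example above.

```python
import numpy as np

def reparam_gradients(mu, sigma, n_samples=200_000, seed=0):
    """Monte Carlo pathwise gradients of E[z^2] for z = mu + sigma * eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps               # reparameterized sample
    df_dz = 2.0 * z                    # derivative of f(z) = z^2
    grad_mu = np.mean(df_dz)           # chain rule with dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)  # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

mu, sigma = 1.5, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma)
print(f"d/dmu    estimate: {g_mu:.4f}  (analytic: {2 * mu:.4f})")
print(f"d/dsigma estimate: {g_sigma:.4f}  (analytic: {2 * sigma:.4f})")
```

The estimates should agree with the analytic values $2\mu$ and $2\sigma$ up to Monte Carlo noise, which is the property that lets a VAE encoder be trained by backpropagation.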
GAN Objectives
A GAN is a two-player game: a generator $G$ produces samples, and a discriminator $D$ tries to distinguish real from generated. The value function is:
$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
The game is: $\min_G \max_D V(D, G)$.
Optimal Discriminator
Theorem: For fixed $G$, the optimal discriminator is:
$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$
Proof: For any $x$, the integrand in $V$ that depends on $D(x)$ is:
$$f(D(x)) = p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))$$
This is a function of the form $a\log y + b\log(1-y)$ with $a = p_{\text{data}}(x)$ and $b = p_g(x)$. Taking the derivative and setting to zero:
$$\frac{a}{y} - \frac{b}{1-y} = 0 \implies y^* = \frac{a}{a+b} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \quad \square$$
Equilibrium & JS Divergence
Theorem: When $D = D^*_G$, the minimax game reduces to minimizing the Jensen-Shannon divergence:
$$C(G) = V(D^*_G, G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \| p_g)$$
Proof: Substitute $D^*_G$ into $V$:
$$C(G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$
Let $m(x) = \frac{1}{2}(p_{\text{data}}(x) + p_g(x))$. Then:
$$C(G) = \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}}{2m}\right] + \mathbb{E}_{p_g}\left[\log \frac{p_g}{2m}\right]$$
$$= D_{KL}(p_{\text{data}} \| m) + D_{KL}(p_g \| m) - 2\log 2$$
$$= 2 \cdot D_{JS}(p_{\text{data}} \| p_g) - \log 4$$
Since $D_{JS} \geq 0$ with equality iff $p_{\text{data}} = p_g$, the global minimum is achieved when the generator perfectly matches the data distribution. At equilibrium, $D^*(x) = \frac{1}{2}$ everywhere. $\square$
import numpy as np
def js_divergence(p, q):
    """Compute Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (p + q)
    # Avoid log(0) by adding small epsilon
    eps = 1e-10
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)
# Example: measure how close generator distribution is to data
p_data = np.array([0.1, 0.2, 0.4, 0.2, 0.1]) # true data distribution
p_gen_bad = np.array([0.5, 0.1, 0.1, 0.1, 0.2]) # poor generator
p_gen_good = np.array([0.12, 0.18, 0.38, 0.22, 0.10]) # good generator
print(f"JS(data || bad_gen): {js_divergence(p_data, p_gen_bad):.4f}")
print(f"JS(data || good_gen): {js_divergence(p_data, p_gen_good):.4f}")
print(f"JS(data || data): {js_divergence(p_data, p_data):.6f}")
# GAN loss at optimal discriminator: C(G) = -log(4) + 2*JS
C_bad = -np.log(4) + 2 * js_divergence(p_data, p_gen_bad)
C_good = -np.log(4) + 2 * js_divergence(p_data, p_gen_good)
print(f"\nC(G_bad): {C_bad:.4f}")
print(f"C(G_good): {C_good:.4f}")
print(f"C(G*) = -log(4) = {-np.log(4):.4f} (global minimum)")
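The two theorems can also be checked directly from the value function rather than from the closed form. The sketch below evaluates $V(D, G)$ on discrete distributions, plugs in the optimal discriminator, and compares the result with $-\log 4 + 2 D_{JS}$. The helper names `value_function` and `js`, and the random perturbations of $D^*$, are illustrative assumptions.

```python
import numpy as np

def value_function(d, p_data, p_g):
    """V(D, G) for discrete distributions, with D given as a vector of D(x)."""
    return np.sum(p_data * np.log(d)) + np.sum(p_g * np.log(1.0 - d))

def js(p, q):
    """Jensen-Shannon divergence for strictly positive discrete distributions."""
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

p_data = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_g = np.array([0.5, 0.1, 0.1, 0.1, 0.2])

d_star = p_data / (p_data + p_g)          # optimal discriminator for this G
v_star = value_function(d_star, p_data, p_g)
closed_form = -np.log(4) + 2 * js(p_data, p_g)
print(f"V(D*, G)        = {v_star:.6f}")
print(f"-log 4 + 2 * JS = {closed_form:.6f}")

# Any other discriminator should score lower; try random perturbations of D*
rng = np.random.default_rng(0)
for _ in range(3):
    noise = rng.uniform(-0.1, 0.1, size=d_star.shape)
    d_other = np.clip(d_star + noise, 1e-6, 1 - 1e-6)
    print(f"V(perturbed D, G) = {value_function(d_other, p_data, p_g):.6f}")
```

Because $a\log y + b\log(1-y)$ is strictly concave in $y$, every perturbed discriminator should give a value below $V(D^*_G, G)$.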
Normalizing Flows
Normalizing flows provide exact log-likelihoods by using invertible transformations. The key idea: if we can map a simple distribution to a complex one through invertible functions, we can compute the exact density of the complex distribution.
Change of Variables Derivation
Setup: Let $z \sim p_Z(z)$ (simple, e.g., standard Gaussian) and $x = f(z)$ where $f$ is invertible and differentiable. What is $p_X(x)$?
Derivation: Conservation of probability mass requires:
$$\int_A p_X(x)\,dx = \int_{f^{-1}(A)} p_Z(z)\,dz$$
By the change of variables formula for integrals:
$$\int_A p_X(x)\,dx = \int_A p_Z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}}{\partial x}\right|\,dx$$
Since this holds for all measurable sets $A$:
$$\boxed{p_X(x) = p_Z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}}{\partial x}\right|}$$
Taking logarithms:
$$\log p_X(x) = \log p_Z(f^{-1}(x)) + \log \left|\det J_{f^{-1}}(x)\right|$$
Why invertibility matters: If $f$ is not invertible, we cannot compute $f^{-1}(x)$ to evaluate $p_Z$ at the pre-image. Invertibility guarantees a unique pre-image for every output.
Composition of flows: Chaining $K$ invertible transforms $f = f_K \circ f_{K-1} \circ \ldots \circ f_1$:
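$$\log p_X(x) = \log p_Z(z_0) + \sum_{k=1}^{K} \log \left|\det J_{f_k^{-1}}(z_k)\right|$$
Here $z_K = x$ and $z_{k-1} = f_k^{-1}(z_k)$, so $z_0 = f^{-1}(x)$. Each layer must have an efficiently computable log-determinant. Common architectures (RealNVP, Glow, Neural Spline Flows) achieve this largely through coupling layers, whose Jacobians are triangular so the determinant is the product of the diagonal.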
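To illustrate why a triangular Jacobian makes the log-determinant cheap, here is a minimal sketch of an affine coupling layer in the style of RealNVP. It is not the RealNVP implementation: the "networks" are replaced by hypothetical toy maps (`w_s` passed through `tanh` for the log-scale, `w_t` for the shift), and `numerical_jacobian` is a helper added only to check the result.

```python
import numpy as np

def coupling_forward(z, w_s, w_t):
    """Affine coupling: keep z1, transform z2 using scale/shift computed from z1."""
    d = z.shape[0] // 2
    z1, z2 = z[:d], z[d:]
    log_s = np.tanh(w_s @ z1)       # log-scale from a toy stand-in for a network
    t = w_t @ z1                    # shift from a toy stand-in for a network
    x = np.concatenate([z1, z2 * np.exp(log_s) + t])
    log_det = np.sum(log_s)         # triangular Jacobian: sum of log-diagonal
    return x, log_det

def coupling_inverse(x, w_s, w_t):
    """Invert the coupling layer; x1 equals z1, so scale and shift are recomputable."""
    d = x.shape[0] // 2
    x1, x2 = x[:d], x[d:]
    log_s = np.tanh(w_s @ x1)
    t = w_t @ x1
    return np.concatenate([x1, (x2 - t) * np.exp(-log_s)])

def numerical_jacobian(f, z, h=1e-6):
    """Central-difference Jacobian of f at z (for checking only)."""
    n = z.shape[0]
    jac = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        jac[:, j] = (f(z + e) - f(z - e)) / (2 * h)
    return jac

rng = np.random.default_rng(0)
w_s = rng.normal(size=(2, 2))
w_t = rng.normal(size=(2, 2))
z = rng.normal(size=4)

x, log_det = coupling_forward(z, w_s, w_t)
jac = numerical_jacobian(lambda v: coupling_forward(v, w_s, w_t)[0], z)
sign, num_log_det = np.linalg.slogdet(jac)
print(f"analytic  log|det J_f| = {log_det:.6f}")
print(f"numerical log|det J_f| = {num_log_det:.6f}")
print("inverse recovers z:", np.allclose(coupling_inverse(x, w_s, w_t), z))
```

The code computes the log-determinant of the forward map $f$; the term needed in the density formula is that of $f^{-1}$, which is simply its negative.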
import numpy as np
def log_prob_flow(x, f_inverse, log_det_jacobian, base_log_prob):
    """
    Compute log p_X(x) using change of variables.
    Args:
        x: data point
        f_inverse: inverse transformation function
        log_det_jacobian: log|det(df^{-1}/dx)| function
        base_log_prob: log p_Z(z) function (e.g., standard Gaussian)
    """
    z = f_inverse(x)
    log_pz = base_log_prob(z)
    log_det = log_det_jacobian(x)
    return log_pz + log_det
# Simple example: affine flow x = scale * z + shift
scale = 2.0
shift = 1.0
def f_inverse(x):
    return (x - shift) / scale
def log_det_jacobian(x):
    # For 1D affine: det(J) = 1/scale, log|det| = -log|scale|
    return -np.log(np.abs(scale))
def standard_gaussian_log_prob(z):
    return -0.5 * (z**2 + np.log(2 * np.pi))
# Evaluate density at x = 3.0
x = 3.0
log_px = log_prob_flow(x, f_inverse, log_det_jacobian, standard_gaussian_log_prob)
print(f"x = {x}")
print(f"z = f^{{-1}}(x) = {f_inverse(x):.4f}")
print(f"log p_Z(z) = {standard_gaussian_log_prob(f_inverse(x)):.4f}")
print(f"log |det J| = {log_det_jacobian(x):.4f}")
print(f"log p_X(x) = {log_px:.4f}")
print(f"p_X(x) = {np.exp(log_px):.4f}")
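The composition formula can be sketched the same way: invert the layers from last to first and accumulate each inverse log-determinant. The example below assumes 1D affine layers with hypothetical `(scale, shift)` parameters. Since a chain of affine maps applied to a standard Gaussian is again Gaussian, the result can be compared against a closed-form reference.

```python
import numpy as np

def gaussian_log_prob(z, mean=0.0, std=1.0):
    """Log-density of a 1D Gaussian N(mean, std^2)."""
    return -0.5 * (((z - mean) / std) ** 2 + np.log(2 * np.pi)) - np.log(std)

def composed_affine_log_prob(x, layers):
    """log p_X(x) for x = f_K(...f_1(z)) with f_k(z) = scale_k * z + shift_k.

    layers: list of (scale, shift) pairs in forward order f_1, ..., f_K.
    """
    z = x
    total_log_det = 0.0
    for scale, shift in reversed(layers):        # invert f_K first, f_1 last
        z = (z - shift) / scale
        total_log_det += -np.log(np.abs(scale))  # log|det J| of f_k inverse
    return gaussian_log_prob(z) + total_log_det

layers = [(2.0, 1.0), (0.5, -3.0), (4.0, 0.25)]  # hypothetical parameters
x = 1.7
log_px = composed_affine_log_prob(x, layers)

# Closed-form reference: the composed map is x = total_scale * z + total_shift
total_scale, total_shift = 1.0, 0.0
for scale, shift in layers:
    total_shift = scale * total_shift + shift
    total_scale = scale * total_scale
reference = gaussian_log_prob(x, mean=total_shift, std=abs(total_scale))

print(f"flow log p_X(x)      = {log_px:.6f}")
print(f"reference log-density = {reference:.6f}")
```

The two numbers should match to floating-point precision, since the per-layer log-determinants sum to $-\log|\text{total scale}|$.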
Diffusion Models
Diffusion models define a forward process that gradually destroys data with noise, then learn a reverse process that reconstructs data from noise. The math draws on Markov chains and score functions.
Forward Process
The forward process is a Markov chain that adds Gaussian noise at each step $t = 1, \ldots, T$:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$$
where $\beta_t \in (0,1)$ is a noise schedule (e.g., linear from $\beta_1=10^{-4}$ to $\beta_T=0.02$).
Key property: We can sample $x_t$ directly from $x_0$ without iterating through all steps. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$:
$$\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I)}$$
Derivation: By induction. At $t=1$: $x_1 = \sqrt{\alpha_1}x_0 + \sqrt{1-\alpha_1}\epsilon_1$. At $t=2$: $x_2 = \sqrt{\alpha_2}x_1 + \sqrt{\beta_2}\epsilon_2 = \sqrt{\alpha_2\alpha_1}x_0 + \sqrt{\alpha_2(1-\alpha_1)}\epsilon_1 + \sqrt{\beta_2}\epsilon_2$. The two noise terms are independent Gaussians, so their sum is Gaussian with variance $\alpha_2(1-\alpha_1) + \beta_2 = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2$. The same argument carries step $t-1$ to step $t$.
For a schedule whose $\beta_t$ do not shrink too quickly, $\bar{\alpha}_T \to 0$ as $T \to \infty$ and $q(x_T|x_0) \approx \mathcal{N}(0, I)$, which is pure noise.
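The closed form can be checked empirically by running the Markov chain one step at a time and comparing the sample statistics with $\sqrt{\bar{\alpha}_t}\,x_0$ and $1-\bar{\alpha}_t$. This sketch assumes a scalar data point and a shortened linear schedule (`T = 200`) to keep it fast; both are illustrative choices.

```python
import numpy as np

T = 200                                   # fewer steps than usual, to keep the check fast
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

rng = np.random.default_rng(0)
n = 100_000                               # number of independent chains
x0 = 1.5                                  # scalar data point shared by all chains
x = np.full(n, x0)

t_check = 150                             # number of forward steps to apply
for step in range(t_check):
    eps = rng.standard_normal(n)
    x = np.sqrt(alpha[step]) * x + np.sqrt(beta[step]) * eps

a_bar = alpha_bar[t_check - 1]            # product of the first t_check alphas
print(f"empirical mean {x.mean():.4f}  vs  closed form {np.sqrt(a_bar) * x0:.4f}")
print(f"empirical var  {x.var():.4f}  vs  closed form {1 - a_bar:.4f}")
```

The empirical mean and variance should agree with the closed form up to sampling noise, which is why training can draw $x_t$ in one shot instead of simulating the chain.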
Reverse Process Derivation
The reverse process also forms a Markov chain. When $\beta_t$ is small, the reverse conditional $q(x_{t-1}|x_t)$ is approximately Gaussian. We parameterize:
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$
The training objective is derived from the variational lower bound on $\log p_\theta(x_0)$:
$$\log p_\theta(x_0) \geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]$$
After expanding and simplifying (see DDPM paper), the negative of this bound becomes a loss that matches the reverse conditionals:
$$L = \mathbb{E}_q\left[\sum_{t=2}^T D_{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1)\right] + \text{const}$$
The constant is the prior-matching term $D_{KL}(q(x_T|x_0) \| p(x_T))$, which has no trainable parameters when the forward process is fixed. The posterior $q(x_{t-1}|x_t, x_0)$ is tractable (Gaussian) because we condition on both $x_0$ and $x_t$:
$$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$
where:
$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t, \quad \tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$$
The noise-prediction simplification: Since $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, we can express $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon}{\sqrt{\bar{\alpha}_t}}$. Substituting into $\tilde{\mu}_t$ gives $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon\right)$. Choosing to predict $\epsilon_\theta(x_t, t) \approx \epsilon$ and dropping the per-step weights that the KL terms carry:
$$\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]}$$
This is the denoising objective: predict the noise that was added, and the model implicitly learns the mean of the reverse process.
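The substitution step is easy to get wrong by hand, so here is a small numerical check that the posterior mean written in terms of $x_0$ equals the same mean written in terms of $\epsilon$, followed by a bare-bones evaluation of the simple loss. The batch, the timestep index, and the zero-noise "predictor" are hypothetical stand-ins, and the function names are introduced here for illustration only.

```python
import numpy as np

T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def posterior_mean_x0(x_t, x0, i):
    """Posterior mean in terms of x_0 and x_t (i is a 0-based index, i >= 1)."""
    coef_x0 = np.sqrt(alpha_bar[i - 1]) * beta[i] / (1 - alpha_bar[i])
    coef_xt = np.sqrt(alpha[i]) * (1 - alpha_bar[i - 1]) / (1 - alpha_bar[i])
    return coef_x0 * x0 + coef_xt * x_t

def posterior_mean_eps(x_t, eps, i):
    """The same mean after substituting x_0 = (x_t - sqrt(1 - abar) * eps) / sqrt(abar)."""
    return (x_t - beta[i] / np.sqrt(1 - alpha_bar[i]) * eps) / np.sqrt(alpha[i])

def simple_loss(eps, eps_pred):
    """Squared error between true and predicted noise, averaged over the batch."""
    return np.mean(np.sum((eps - eps_pred) ** 2, axis=-1))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 2))              # hypothetical batch of clean points
i = 400                                   # arbitrary timestep index
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[i]) * x0 + np.sqrt(1 - alpha_bar[i]) * eps

same = np.allclose(posterior_mean_x0(x_t, x0, i), posterior_mean_eps(x_t, eps, i))
print("two forms of the posterior mean agree:", same)

eps_pred = np.zeros_like(eps)             # stand-in for a network that predicts zero noise
print(f"L_simple for the zero predictor: {simple_loss(eps, eps_pred):.4f}")
```

In a real training loop the zero predictor would be replaced by a network evaluated at `(x_t, i)`, and `i` would be sampled uniformly for each example.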
Score Matching
The score function is the gradient of the log-density:
$$s(x) = \nabla_x \log p(x)$$
Score matching trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without knowing $p(x)$ directly. The connection to diffusion: the noise-prediction network $\epsilon_\theta$ is related to the score by:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
Why this works: For $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$:
$$\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$
So predicting $\epsilon$ is equivalent to estimating the score at noise level $t$.
import numpy as np
def forward_diffusion(x0, t, alpha_bar_schedule):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t)*x0, (1-alpha_bar_t)*I)."""
    alpha_bar_t = alpha_bar_schedule[t]
    epsilon = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * epsilon
    return x_t, epsilon
# Create a linear noise schedule
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)
# Demonstrate forward process at different timesteps
np.random.seed(0)
x0 = np.array([1.5, -0.8]) # clean data point
print(f"Original x0: {x0}")
print(f"{'t':>5} {'alpha_bar_t':>12} {'x_t':>20} {'signal/noise':>14}")
print("-" * 55)
for t in [0, 100, 250, 500, 750, 999]:
    x_t, eps = forward_diffusion(x0, t, alpha_bar)
    snr = alpha_bar[t] / (1 - alpha_bar[t])
    print(f"{t:5d} {alpha_bar[t]:12.4f} [{x_t[0]:8.4f}, {x_t[1]:8.4f}] {snr:14.4f}")
# Score at time t: nabla log q(x_t|x0) = -epsilon / sqrt(1 - alpha_bar_t)
t_demo = 500
x_t_demo, eps_demo = forward_diffusion(x0, t_demo, alpha_bar)
true_score = -eps_demo / np.sqrt(1 - alpha_bar[t_demo])
print(f"\nAt t={t_demo}:")
print(f" True epsilon: {eps_demo}")
print(f" True score: {true_score}")
print(f" Relation: score = -epsilon / sqrt(1 - alpha_bar_t)")
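As an independent check of the score formula, the gradient of $\log q(x_t|x_0)$ can be approximated by finite differences and compared with $-\epsilon/\sqrt{1-\bar{\alpha}_t}$. The sketch below assumes a fixed, hypothetical value `a_bar = 0.3` in place of a schedule lookup; `log_q` and `finite_difference_grad` are helper names introduced here.

```python
import numpy as np

def log_q(x_t, x0, a_bar):
    """log N(x_t; sqrt(a_bar) * x0, (1 - a_bar) I), including the normalizing constant."""
    d = x_t.shape[0]
    var = 1.0 - a_bar
    diff = x_t - np.sqrt(a_bar) * x0
    return -0.5 * np.sum(diff ** 2) / var - 0.5 * d * np.log(2 * np.pi * var)

def finite_difference_grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at x."""
    grad = np.zeros_like(x)
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = h
        grad[j] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x0 = np.array([1.5, -0.8])
a_bar = 0.3                               # hypothetical value of alpha_bar_t
eps = rng.standard_normal(2)
x_t = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * eps

numeric = finite_difference_grad(lambda v: log_q(v, x0, a_bar), x_t)
analytic = -eps / np.sqrt(1 - a_bar)
print("finite-difference score:", numeric)
print("-eps / sqrt(1 - a_bar): ", analytic)
```

Note that this is the score of the conditional $q(x_t|x_0)$; a trained network estimates the score of the marginal over all $x_0$, which denoising score matching recovers in expectation.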
Guidance Techniques
Classifier guidance modifies the score during sampling by adding a classifier gradient:
$$\hat{s}(x_t, t, y) = s_\theta(x_t, t) + w \cdot \nabla_{x_t} \log p_\phi(y|x_t)$$
This steers generation toward class $y$ with strength $w$.
Classifier-free guidance avoids needing a separate classifier. During training, a single model is trained both with and without conditioning (by randomly dropping the condition). At inference, with guidance scale $w$:
$$\hat{\epsilon}(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))$$
In score space this reads:
$$\hat{s}(x_t, t, c) = s_\theta(x_t, t) + w \cdot (s_\theta(x_t, t | c) - s_\theta(x_t, t))$$
By Bayes' rule, $\nabla_{x_t}\log p(c|x_t) = \nabla_{x_t}\log p(x_t|c) - \nabla_{x_t}\log p(x_t)$, so the difference of scores plays the role of the gradient of an implicit classifier. The guidance scale $w$ controls the trade-off between diversity and adherence to the condition: $w=0$ is unconditional sampling, $w=1$ recovers the plain conditional model (no extra guidance), and $w \gg 1$ gives stronger adherence at reduced diversity.
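The classifier-free guidance combination is a one-line extrapolation between two noise predictions. The sketch below uses made-up vectors in place of the two model forward passes; `classifier_free_guidance` and `guidance_scale` are names chosen here for illustration.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Made-up noise predictions standing in for two forward passes of a model
eps_uncond = np.array([0.20, -0.10, 0.05])
eps_cond = np.array([0.50, -0.40, 0.00])

for scale in [0.0, 1.0, 3.0, 7.5]:
    guided = classifier_free_guidance(eps_uncond, eps_cond, scale)
    print(f"scale={scale:>4}: {np.round(guided, 3)}")
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional prediction unchanged, and larger scales push further along the same direction.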
Practice Exercises
- ELBO tightness: Show that $\log p_\theta(x) - \text{ELBO} = D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$ by expanding the definition of KL divergence.
- Optimal discriminator: Verify that $D^*_G(x) = \frac{p_{\text{data}}}{p_{\text{data}} + p_g}$ is a maximum (not minimum) by checking the second derivative of $a\log y + b\log(1-y)$.
- JS divergence bounds: Prove that $0 \leq D_{JS}(p\|q) \leq \log 2$. When is each bound achieved?
- Flow composition: For two affine flows $f_1(z) = A_1 z + b_1$ and $f_2(z) = A_2 z + b_2$, derive the log-determinant of the composed transformation.
- Diffusion schedule: Verify by induction that $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$ given the step-wise transition $q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}x_{t-1}, \beta_t I)$.
- Score-epsilon connection: Starting from $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, derive that $\nabla_{x_t}\log q(x_t|x_0) = -\epsilon/\sqrt{1-\bar{\alpha}_t}$.