## Model Family Map
| Family | Core Objective | Math Tool |
|---|---|---|
| VAE | Maximize ELBO | Variational inference + KL |
| GAN | Adversarial min-max game | JS divergence intuition |
| Flow | Exact likelihood | Change of variables |
| Diffusion | Denoising / score matching | Markov chains + gradients of log density |
## VAEs & ELBO
A VAE introduces latent variables $z$ and maximizes a lower bound on the log-likelihood, the evidence lower bound (ELBO):
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z))$$
The first term rewards reconstruction. The KL term regularizes the latent space toward a simple prior, usually $\mathcal{N}(0,I)$.
```python
import numpy as np

# KL between a diagonal Gaussian q(z|x) = N(mu, sigma^2) and p(z) = N(0, I),
# using the closed-form expression that appears in the VAE loss.
mu = np.array([0.2, -0.5, 1.0])
log_var = np.array([-0.1, 0.3, 0.0])
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print("KL(q || p):", round(float(kl), 4))
```
## GAN Objectives
The original GAN objective is a game:
$$\min_G \max_D \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1-D(G(z)))]$$
The discriminator estimates whether a sample is real; the generator learns to produce samples that fool it.
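To make the game concrete, here is a minimal numpy sketch of both losses for a batch of discriminator outputs; `d_real` and `d_fake` are made-up probabilities, not outputs of a trained network. In practice the generator usually maximizes $\log D(G(z))$ (the non-saturating loss) rather than minimizing $\log(1-D(G(z)))$, which saturates early in training:

```python
import numpy as np

# Hypothetical discriminator outputs D(x) in (0, 1) for a small batch.
d_real = np.array([0.9, 0.8, 0.95])  # scores on real samples
d_fake = np.array([0.1, 0.3, 0.2])   # scores on generated samples

# Discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))];
# written as a loss to minimize.
d_loss = -(np.mean(np.log(d_real)) + np.mean(np.log(1 - d_fake)))

# Non-saturating generator loss: maximize E[log D(G(z))].
g_loss = -np.mean(np.log(d_fake))

print("D loss:", round(float(d_loss), 4), "G loss:", round(float(g_loss), 4))
```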
## Normalizing Flows
Flows use invertible transformations $x=f(z)$ and the change-of-variables formula:
$$\log p_X(x)=\log p_Z(f^{-1}(x)) + \log\left|\det \frac{\partial f^{-1}}{\partial x}\right|$$
They trade architectural flexibility for exact likelihood computation.
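As a worked example, consider a one-dimensional affine flow $x = f(z) = az + b$ with a standard normal base distribution; the parameters `a` and `b` below are illustrative. The inverse is $f^{-1}(x) = (x-b)/a$ and the log-determinant term is $\log|1/a|$, so the exact log-likelihood is easy to check by hand:

```python
import numpy as np

# Toy affine flow x = f(z) = a*z + b with base distribution p_Z = N(0, 1).
a, b = 2.0, 1.0                # illustrative flow parameters
x = np.array([0.0, 1.0, 3.0])  # points to evaluate

z = (x - b) / a                              # f^{-1}(x)
log_p_z = -0.5 * (z**2 + np.log(2 * np.pi))  # standard normal log-density
log_det = np.log(np.abs(1.0 / a))            # log|d f^{-1}/dx| = log|1/a|
log_p_x = log_p_z + log_det                  # exact log-likelihood

print("log p_X(x):", np.round(log_p_x, 4))
```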
## Diffusion & Score Matching
Diffusion models gradually add noise to data, then train a network to reverse the process. A common simplified objective predicts the noise $\epsilon$ added to a clean sample $x_0$, where the noisy sample at step $t$ is $x_t=\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon$:
$$\mathcal{L}=\mathbb{E}_{t,x_0,\epsilon}\|\epsilon - \epsilon_\theta(x_t,t)\|_2^2$$
```python
import numpy as np

np.random.seed(0)
x0 = np.array([1.0, -0.5])  # clean sample
alpha_bar = 0.7             # cumulative noise schedule value at some step t
eps = np.random.randn(*x0.shape)

# Forward process in closed form: x_t = sqrt(alpha_bar)*x0 + sqrt(1-alpha_bar)*eps
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps
print("noisy sample:", np.round(xt, 3))
```
## Guidance
Classifier-free guidance combines unconditional and conditional denoising predictions:
$$\epsilon_{\text{guided}}=\epsilon_{\text{uncond}}+s\,(\epsilon_{\text{cond}}-\epsilon_{\text{uncond}})$$
The scale $s$ increases prompt adherence but can reduce diversity or create artifacts when pushed too high.
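A minimal sketch of the combination step, with made-up predictions standing in for the two network passes; note that $s=1$ recovers the conditional prediction, while $s>1$ extrapolates away from the unconditional one:

```python
import numpy as np

# Hypothetical denoiser outputs at one sampling step (illustrative values).
eps_uncond = np.array([0.10, -0.20])  # unconditional prediction
eps_cond = np.array([0.30, -0.50])    # prompt-conditioned prediction
s = 3.0                               # guidance scale

eps_guided = eps_uncond + s * (eps_cond - eps_uncond)
print("guided epsilon:", np.round(eps_guided, 3))
```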