Generative Model Mathematics

May 30, 2026 · Wasil Zafar · 40 min read

Generative models estimate, transform, or sample from distributions. This extension provides full mathematical derivations — not just formulas — for VAEs, GANs, normalizing flows, and diffusion models.

Table of Contents

  1. Model Family Map
  2. VAEs & the ELBO
  3. GAN Objectives
  4. Normalizing Flows
  5. Diffusion Models
  6. Guidance Techniques
  7. Practice Exercises
Math Foundations: This extension builds on Part 4: Probability (expectations, Bayes' rule), Part 6: Information Theory (KL divergence, entropy), and Part 8: Calculus (chain rule, optimization). It provides the canonical derivations for generative content across PyTorch GAN Deep Dive, TensorFlow Stable Diffusion, and AI in the Wild Part 11.

Model Family Map

| Family | Core Objective | Math Tool | Likelihood? |
|---|---|---|---|
| VAE | Maximize ELBO (lower bound on log-likelihood) | Variational inference, KL divergence | Lower bound |
| GAN | Minimax game → minimize JS divergence | Game theory, divergence minimization | Implicit |
| Normalizing Flow | Exact log-likelihood via change of variables | Determinant of Jacobian | Exact |
| Diffusion | Denoising score matching | Markov chains, score functions | Lower bound (ELBO) |

VAEs & the ELBO

The fundamental problem: we want to maximize $\log p_\theta(x)$ (log-likelihood of observed data), but computing $p_\theta(x) = \int p_\theta(x|z)p(z)\,dz$ is intractable when the latent space is high-dimensional.

The solution: introduce an approximate posterior $q_\phi(z|x)$ and derive a tractable lower bound.

ELBO Derivation via Jensen's Inequality

Step 1: Start with the log-likelihood and introduce $q_\phi(z|x)$:

$$\log p_\theta(x) = \log \int p_\theta(x,z)\,dz = \log \int \frac{p_\theta(x,z)}{q_\phi(z|x)} q_\phi(z|x)\,dz$$

Step 2: Apply Jensen's inequality ($\log$ is concave, so $\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):

$$\log p_\theta(x) \geq \int q_\phi(z|x) \log \frac{p_\theta(x,z)}{q_\phi(z|x)}\,dz = \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]$$

Step 3: Expand using $p_\theta(x,z) = p_\theta(x|z)p(z)$:

$$\text{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p(z)}{q_\phi(z|x)}\right]$$ $$\boxed{\text{ELBO} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{Reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{Regularization}}}$$

The gap: How tight is this bound? The gap equals $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$:

$$\log p_\theta(x) = \text{ELBO} + D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) \geq \text{ELBO}$$

When $q_\phi$ perfectly matches the true posterior, the bound is tight.
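
The decomposition $\log p_\theta(x) = \text{ELBO} + D_{KL}$ can be checked numerically on a model small enough that the marginal is tractable. The sketch below assumes a hypothetical toy model with a three-valued discrete latent; the probabilities are illustrative values, not taken from any real model.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x is one fixed observation.
p_z = np.array([0.5, 0.3, 0.2])             # prior p(z)
p_x_given_z = np.array([0.10, 0.40, 0.70])  # likelihood p(x|z) at the observed x
q = np.array([0.2, 0.3, 0.5])               # arbitrary approximate posterior q(z|x)

joint = p_x_given_z * p_z                   # p(x, z) for each z
log_px = np.log(joint.sum())                # exact log-likelihood (tractable here)
posterior = joint / joint.sum()             # true posterior p(z|x)

# ELBO = E_q[log p(x,z) - log q(z|x)]
elbo = np.sum(q * (np.log(joint) - np.log(q)))
# Gap = KL(q(z|x) || p(z|x))
gap = np.sum(q * (np.log(q) - np.log(posterior)))

# Setting q to the true posterior should close the gap
elbo_tight = np.sum(posterior * (np.log(joint) - np.log(posterior)))

print(f"log p(x)        = {log_px:.6f}")
print(f"ELBO            = {elbo:.6f}")
print(f"ELBO + KL       = {elbo + gap:.6f}")
print(f"ELBO (q = post) = {elbo_tight:.6f}")
```

Any other choice of `q` that sums to one should give the same value for `ELBO + KL`, with only the split between the two terms changing.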

Closed-form KL: When $q_\phi(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ is a diagonal Gaussian and $p(z) = \mathcal{N}(0, I)$, the KL divergence has a closed form: $D_{KL} = \frac{1}{2}\sum_{j=1}^d (\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1)$.
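
As a sanity check, the closed form can be compared with a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. This is a minimal sketch; the values of `mu` and `log_var` and the sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, -0.5, 1.2])        # illustrative encoder mean
log_var = np.array([-0.2, 0.1, -0.5])  # illustrative encoder log-variance
sigma = np.exp(0.5 * log_var)

# Closed form: 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - log_var - 1.0)

# Monte Carlo: average of log q(z) - log p(z) over samples z ~ q
n = 200_000
z = mu + sigma * rng.standard_normal((n, mu.size))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

print(f"closed form: {kl_closed:.4f}")
print(f"Monte Carlo: {kl_mc:.4f}")
```

The two numbers should agree up to sampling noise, which shrinks as `n` grows.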

The Reparameterization Trick

We need gradients of the ELBO with respect to $\phi$, but the expectation is taken over samples $z \sim q_\phi(z|x)$, and the sampling step itself is not differentiable. We therefore reparameterize:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

Now the randomness is in $\epsilon$ (independent of $\phi$), and the gradient flows through $\mu_\phi$ and $\sigma_\phi$ normally.
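
To see the gradient actually flow through $\mu$ and $\sigma$, consider a case with a known answer. The sketch below assumes a scalar latent and the test function $f(z) = z^2$, for which $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, so the exact gradients are $2\mu$ and $2\sigma$. The parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 1.3                   # hypothetical scalar parameters
eps = rng.standard_normal(500_000)     # noise, independent of (mu, sigma)
z = mu + sigma * eps                   # reparameterized sample

# f(z) = z^2, so df/dz = 2z; apply the chain rule through z = mu + sigma * eps
grad_mu = np.mean(2 * z * 1.0)         # dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)      # dz/dsigma = eps

print(f"d/dmu    estimate {grad_mu:.3f}  vs exact {2 * mu:.3f}")
print(f"d/dsigma estimate {grad_sigma:.3f}  vs exact {2 * sigma:.3f}")
```

In a real VAE an autodiff framework applies the same chain rule automatically; the explicit derivatives here only make the mechanism visible.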

import numpy as np

# VAE ELBO computation with reparameterization trick
np.random.seed(42)

# Encoder outputs (for a single data point x)
mu = np.array([0.3, -0.5, 1.2])       # mean of q(z|x)
log_var = np.array([-0.2, 0.1, -0.5])  # log variance of q(z|x)
sigma = np.exp(0.5 * log_var)

# Reparameterization: z = mu + sigma * epsilon
epsilon = np.random.randn(*mu.shape)
z = mu + sigma * epsilon
print("Sampled z:", np.round(z, 4))

# KL divergence: D_KL(N(mu, sigma^2) || N(0, I))
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
print(f"KL divergence: {kl:.4f}")

# Reconstruction loss (assume Gaussian decoder, MSE)
x_original = np.array([1.0, 0.5, -0.3])
x_reconstructed = z * 0.8 + 0.1  # simplified decoder
recon_loss = 0.5 * np.sum((x_original - x_reconstructed)**2)
print(f"Reconstruction loss: {recon_loss:.4f}")

# ELBO = -recon_loss - KL, up to the Gaussian normalizing constant
# (we maximize ELBO, so in practice we minimize -ELBO)
elbo = -recon_loss - kl
print(f"ELBO: {elbo:.4f}")

GAN Objectives

A GAN is a two-player game: a generator $G$ produces samples, and a discriminator $D$ tries to distinguish real from generated. The value function is:

$$V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The game is: $\min_G \max_D V(D, G)$.

Optimal Discriminator

Theorem: For fixed $G$, the optimal discriminator is:

$$D^*_G(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

Proof: For any $x$, the integrand in $V$ that depends on $D(x)$ is:

$$f(D(x)) = p_{\text{data}}(x) \log D(x) + p_g(x) \log(1 - D(x))$$

This is a function of the form $a\log y + b\log(1-y)$ with $a = p_{\text{data}}(x)$ and $b = p_g(x)$. Taking the derivative and setting to zero:

$$\frac{a}{y} - \frac{b}{1-y} = 0 \implies y^* = \frac{a}{a+b} = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)} \quad \square$$
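
The pointwise maximizer can be confirmed by brute force. This sketch assumes hypothetical density values `a` and `b` at a single point $x$ and searches a fine grid of candidate discriminator outputs.

```python
import numpy as np

a, b = 0.3, 0.1   # hypothetical values of p_data(x) and p_g(x) at one point x

# Evaluate f(y) = a*log(y) + b*log(1-y) on a fine grid inside (0, 1)
y = np.linspace(1e-4, 1 - 1e-4, 100_001)
f = a * np.log(y) + b * np.log(1 - y)

y_numeric = y[np.argmax(f)]   # grid maximizer
y_closed = a / (a + b)        # closed-form optimum from the proof

print(f"grid maximizer: {y_numeric:.4f}")
print(f"a / (a + b):    {y_closed:.4f}")
```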

Equilibrium & JS Divergence

Theorem: When $D = D^*_G$, the minimax game reduces to minimizing the Jensen-Shannon divergence:

$$C(G) = V(D^*_G, G) = -\log 4 + 2 \cdot D_{JS}(p_{\text{data}} \| p_g)$$

Proof: Substitute $D^*_G$ into $V$:

$$C(G) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$

Let $m(x) = \frac{1}{2}(p_{\text{data}}(x) + p_g(x))$. Then:

$$\begin{aligned} C(G) &= \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}}{2m}\right] + \mathbb{E}_{p_g}\left[\log \frac{p_g}{2m}\right] \\ &= D_{KL}(p_{\text{data}} \| m) + D_{KL}(p_g \| m) - 2\log 2 \\ &= 2 \cdot D_{JS}(p_{\text{data}} \| p_g) - \log 4 \end{aligned}$$

Since $D_{JS} \geq 0$ with equality iff $p_{\text{data}} = p_g$, the global minimum is achieved when the generator perfectly matches the data distribution. At equilibrium, $D^*(x) = \frac{1}{2}$ everywhere. $\square$

import numpy as np

def js_divergence(p, q):
    """Compute Jensen-Shannon divergence between discrete distributions."""
    m = 0.5 * (p + q)
    # Avoid log(0) by adding small epsilon
    eps = 1e-10
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)

# Example: measure how close generator distribution is to data
p_data = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # true data distribution
p_gen_bad = np.array([0.5, 0.1, 0.1, 0.1, 0.2])  # poor generator
p_gen_good = np.array([0.12, 0.18, 0.38, 0.22, 0.10])  # good generator

print(f"JS(data || bad_gen):  {js_divergence(p_data, p_gen_bad):.4f}")
print(f"JS(data || good_gen): {js_divergence(p_data, p_gen_good):.4f}")
print(f"JS(data || data):     {js_divergence(p_data, p_data):.6f}")

# GAN loss at optimal discriminator: C(G) = -log(4) + 2*JS
C_bad = -np.log(4) + 2 * js_divergence(p_data, p_gen_bad)
C_good = -np.log(4) + 2 * js_divergence(p_data, p_gen_good)
print(f"\nC(G_bad):  {C_bad:.4f}")
print(f"C(G_good): {C_good:.4f}")
print(f"C(G*) = -log(4) = {-np.log(4):.4f}  (global minimum)")
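
The identity $C(G) = -\log 4 + 2 D_{JS}$ can also be verified directly, by plugging the optimal discriminator into the value function rather than computing the JS divergence first. A minimal sketch on discrete distributions (the same illustrative values as above), which also checks that a perturbed discriminator scores lower:

```python
import numpy as np

p_data = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_gen = np.array([0.5, 0.1, 0.1, 0.1, 0.2])

def value(d, p_data, p_gen):
    """Discrete version of V(D, G) = E_data[log D] + E_gen[log(1 - D)]."""
    return np.sum(p_data * np.log(d)) + np.sum(p_gen * np.log(1 - d))

# Optimal discriminator and the value it attains
d_star = p_data / (p_data + p_gen)
v_star = value(d_star, p_data, p_gen)

# Right-hand side of the theorem
m = 0.5 * (p_data + p_gen)
js = 0.5 * (np.sum(p_data * np.log(p_data / m)) + np.sum(p_gen * np.log(p_gen / m)))
rhs = -np.log(4) + 2 * js

# Any other discriminator should do no better (here: shift D* by 0.1)
v_other = value(d_star + 0.1, p_data, p_gen)

print(f"V(D*, G)        = {v_star:.6f}")
print(f"-log 4 + 2 * JS = {rhs:.6f}")
print(f"V(D* + 0.1, G)  = {v_other:.6f}")
```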

Normalizing Flows

Normalizing flows provide exact log-likelihoods by using invertible transformations. The key idea: if we can map a simple distribution to a complex one through invertible functions, we can compute the exact density of the complex distribution.

Change of Variables Derivation

Setup: Let $z \sim p_Z(z)$ (simple, e.g., standard Gaussian) and $x = f(z)$ where $f$ is invertible and differentiable. What is $p_X(x)$?

Derivation: Conservation of probability mass requires:

$$\int_A p_X(x)\,dx = \int_{f^{-1}(A)} p_Z(z)\,dz$$

By the change of variables formula for integrals:

$$\int_A p_X(x)\,dx = \int_A p_Z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}}{\partial x}\right|\,dx$$

Since this holds for all measurable sets $A$:

$$\boxed{p_X(x) = p_Z(f^{-1}(x)) \left|\det \frac{\partial f^{-1}}{\partial x}\right|}$$

Taking logarithms:

$$\log p_X(x) = \log p_Z(f^{-1}(x)) + \log \left|\det J_{f^{-1}}(x)\right|$$

Why invertibility matters: If $f$ is not invertible, we cannot compute $f^{-1}(x)$ to evaluate $p_Z$ at the pre-image. Invertibility guarantees a unique pre-image for every output.

Composition of flows: Chaining $K$ invertible transforms $f = f_K \circ f_{K-1} \circ \ldots \circ f_1$, with intermediate values $z_0 = z$, $z_k = f_k(z_{k-1})$, and $z_K = x$:

$$\log p_X(x) = \log p_Z(z_0) + \sum_{k=1}^{K} \log \left|\det J_{f_k^{-1}}(z_k)\right|$$

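Each layer must have an efficiently computable log-determinant. Common architectures (RealNVP, Glow, Neural Spline Flows) achieve this largely through coupling layers, whose Jacobians are triangular, so the determinant is just the product of the diagonal entries.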

import numpy as np

def log_prob_flow(x, f_inverse, log_det_jacobian, base_log_prob):
    """
    Compute log p_X(x) using change of variables.
    
    Args:
        x: data point
        f_inverse: inverse transformation function
        log_det_jacobian: log|det(df^{-1}/dx)| function
        base_log_prob: log p_Z(z) function (e.g., standard Gaussian)
    """
    z = f_inverse(x)
    log_pz = base_log_prob(z)
    log_det = log_det_jacobian(x)
    return log_pz + log_det

# Simple example: affine flow x = scale * z + shift
scale = 2.0
shift = 1.0

def f_inverse(x):
    return (x - shift) / scale

def log_det_jacobian(x):
    # For 1D affine: det(J) = 1/scale, log|det| = -log|scale|
    return -np.log(np.abs(scale))

def standard_gaussian_log_prob(z):
    return -0.5 * (z**2 + np.log(2 * np.pi))

# Evaluate density at x = 3.0
x = 3.0
log_px = log_prob_flow(x, f_inverse, log_det_jacobian, standard_gaussian_log_prob)
print(f"x = {x}")
print(f"z = f^{{-1}}(x) = {f_inverse(x):.4f}")
print(f"log p_Z(z) = {standard_gaussian_log_prob(f_inverse(x)):.4f}")
print(f"log |det J| = {log_det_jacobian(x):.4f}")
print(f"log p_X(x) = {log_px:.4f}")
print(f"p_X(x) = {np.exp(log_px):.4f}")
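
The composition formula can be illustrated the same way. The sketch below assumes a hypothetical two-layer 1D affine flow; because the composition of affine maps of a Gaussian is again Gaussian, the summed log-determinants can be checked against the exact density.

```python
import numpy as np

# Hypothetical two-layer 1D affine flow: f_k(z) = scale_k * z + shift_k
layers = [(2.0, 1.0), (0.5, -3.0)]   # (scale, shift) for f_1, then f_2

def log_prob_composed(x, layers):
    """Invert the layers from last to first, accumulating log|det J_{f_k^{-1}}|."""
    z = x
    log_det_total = 0.0
    for scale, shift in reversed(layers):
        z = (z - shift) / scale
        log_det_total += -np.log(abs(scale))
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))   # standard Gaussian base density
    return log_pz + log_det_total

x = 0.25
log_px = log_prob_composed(x, layers)

# Exact answer: x = a2 * (a1 * z + b1) + b2 is Gaussian with known mean and std
(a1, b1), (a2, b2) = layers
mean = a2 * b1 + b2
std = abs(a1 * a2)
log_px_exact = -0.5 * (((x - mean) / std) ** 2 + np.log(2 * np.pi)) - np.log(std)

print(f"flow log p_X(x):  {log_px:.6f}")
print(f"exact log p_X(x): {log_px_exact:.6f}")
```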

Diffusion Models

Diffusion models define a forward process that gradually destroys data with noise, then learn a reverse process that reconstructs data from noise. The math draws on Markov chains and score functions.

Forward Process

The forward process is a Markov chain that adds Gaussian noise at each step $t = 1, \ldots, T$:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$$

where $\beta_t \in (0,1)$ is a noise schedule (e.g., linear from $\beta_1=10^{-4}$ to $\beta_T=0.02$).

Key property: We can sample $x_t$ directly from $x_0$ without iterating through all steps. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$:

$$\boxed{q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I)}$$

Derivation: By induction. At $t=1$: $x_1 = \sqrt{\alpha_1}x_0 + \sqrt{1-\alpha_1}\epsilon_1$. At $t=2$: $x_2 = \sqrt{\alpha_2}x_1 + \sqrt{1-\alpha_2}\epsilon_2 = \sqrt{\alpha_1\alpha_2}x_0 + \sqrt{\alpha_2(1-\alpha_1)}\epsilon_1 + \sqrt{1-\alpha_2}\epsilon_2$. The two noise terms are independent zero-mean Gaussians, so their sum is Gaussian with variance $\alpha_2(1-\alpha_1) + (1-\alpha_2) = 1 - \alpha_1\alpha_2 = 1 - \bar{\alpha}_2$. The same argument carries step $t-1$ to step $t$.

As $T \to \infty$ (with $\beta_t$ bounded away from zero), $\bar{\alpha}_T \to 0$ and $q(x_T|x_0) \approx \mathcal{N}(0, I)$, i.e. pure noise.
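
The closed-form marginal can be checked empirically by running the step-wise chain on many copies of the same starting point. This is a minimal sketch; the chain length, sample count, and scalar `x0` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
beta = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - beta)

x0 = 1.5                 # illustrative scalar data point
n = 100_000              # number of independent chains
x = np.full(n, x0)

# Iterate q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t)
for t in range(T):
    x = np.sqrt(1.0 - beta[t]) * x + np.sqrt(beta[t]) * rng.standard_normal(n)

emp_mean, emp_var = x.mean(), x.var()
pred_mean = np.sqrt(alpha_bar[-1]) * x0   # closed-form mean of q(x_T | x_0)
pred_var = 1.0 - alpha_bar[-1]            # closed-form variance

print(f"mean: empirical {emp_mean:.4f}, closed form {pred_mean:.4f}")
print(f"var:  empirical {emp_var:.4f}, closed form {pred_var:.4f}")
```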

Reverse Process Derivation

The reverse process also forms a Markov chain. When $\beta_t$ is small, the reverse conditional $q(x_{t-1}|x_t)$ is approximately Gaussian. We parameterize:

$$p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

The training objective is derived from the variational lower bound on $\log p_\theta(x_0)$:

$$\log p_\theta(x_0) \geq \mathbb{E}_q\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]$$

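After expanding and simplifying (see the DDPM paper), the negative bound decomposes into per-step KL terms that match the reverse conditionals, plus a reconstruction term for the final step:

$$L = \mathbb{E}_q\left[\sum_{t=2}^T D_{KL}(q(x_{t-1}|x_t, x_0) \| p_\theta(x_{t-1}|x_t)) - \log p_\theta(x_0|x_1)\right] + \text{const}$$

The constant is the prior-matching term $D_{KL}(q(x_T|x_0) \| p(x_T))$, which has no trainable parameters when the schedule $\beta_t$ is fixed.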

The posterior $q(x_{t-1}|x_t, x_0)$ is tractable (Gaussian) because we condition on both $x_0$ and $x_t$:

$$q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$

where:

$$\tilde{\mu}_t = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t, \quad \tilde{\beta}_t = \frac{(1-\bar{\alpha}_{t-1})\beta_t}{1-\bar{\alpha}_t}$$

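The noise-prediction simplification: Since $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, we can express $x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\epsilon}{\sqrt{\bar{\alpha}_t}}$. Substituting into $\tilde{\mu}_t$ gives $\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon\right)$, so the only unknown is $\epsilon$. Training a network $\epsilon_\theta(x_t, t) \approx \epsilon$ and dropping the $t$-dependent weight on each KL term yields: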

$$\boxed{L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]}$$

This is the denoising objective: predict the noise that was added, and the model implicitly learns the mean of the reverse process.
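
The substitution of $x_0$ into $\tilde{\mu}_t$ is easy to get wrong by hand, so a numerical check is useful. The sketch below computes the posterior mean twice, once from $(x_0, x_t)$ and once from $(x_t, \epsilon)$, and assumes the same linear schedule used elsewhere in this section. Arrays are 0-based, so index `t` holds the quantities for step $t+1$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

t = 500                                   # 0-based index into the schedule
x0 = np.array([1.5, -0.8])                # illustrative clean data point
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Posterior mean written in terms of (x_0, x_t)
coef_x0 = np.sqrt(alpha_bar[t - 1]) * beta[t] / (1 - alpha_bar[t])
coef_xt = np.sqrt(alpha[t]) * (1 - alpha_bar[t - 1]) / (1 - alpha_bar[t])
mu_from_x0 = coef_x0 * x0 + coef_xt * x_t

# The same mean written in terms of (x_t, eps)
mu_from_eps = (x_t - beta[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])

# Posterior variance beta_tilde_t
beta_tilde = (1 - alpha_bar[t - 1]) * beta[t] / (1 - alpha_bar[t])

print("mean via x0: ", mu_from_x0)
print("mean via eps:", mu_from_eps)
print(f"beta_tilde = {beta_tilde:.6f}  (beta = {beta[t]:.6f})")
```

Note that $\tilde{\beta}_t$ is slightly smaller than $\beta_t$, since conditioning on $x_0$ removes some uncertainty about $x_{t-1}$.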

Score Matching

The score function is the gradient of the log-density:

$$s(x) = \nabla_x \log p(x)$$

Score matching trains a model $s_\theta(x) \approx \nabla_x \log p(x)$ without knowing $p(x)$ directly. The connection to diffusion: the noise-prediction network $\epsilon_\theta$ is related to the score by:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

Why this works: For $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$:

$$\nabla_{x_t} \log q(x_t|x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}x_0}{1-\bar{\alpha}_t} = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$

So predicting $\epsilon$ is equivalent to estimating the score at noise level $t$.

import numpy as np

def forward_diffusion(x0, t, alpha_bar_schedule):
    """Sample x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t)*x0, (1-alpha_bar_t)*I)."""
    alpha_bar_t = alpha_bar_schedule[t]
    epsilon = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * epsilon
    return x_t, epsilon

# Create a linear noise schedule
T = 1000
beta = np.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

# Demonstrate the forward process at several timesteps
# (0-based index: t=0 corresponds to the first noising step, alpha_bar_1)
np.random.seed(0)
x0 = np.array([1.5, -0.8])  # clean data point

print(f"Original x0: {x0}")
print(f"{'t':>5} {'alpha_bar_t':>12} {'x_t':>20} {'signal/noise':>14}")
print("-" * 55)

for t in [0, 100, 250, 500, 750, 999]:
    x_t, eps = forward_diffusion(x0, t, alpha_bar)
    snr = alpha_bar[t] / (1 - alpha_bar[t])
    print(f"{t:5d} {alpha_bar[t]:12.4f} [{x_t[0]:8.4f}, {x_t[1]:8.4f}] {snr:14.4f}")

# Score at time t: nabla log q(x_t|x0) = -epsilon / sqrt(1 - alpha_bar_t)
t_demo = 500
x_t_demo, eps_demo = forward_diffusion(x0, t_demo, alpha_bar)
true_score = -eps_demo / np.sqrt(1 - alpha_bar[t_demo])
print(f"\nAt t={t_demo}:")
print(f"  True epsilon: {eps_demo}")
print(f"  True score:   {true_score}")
print("  Relation: score = -epsilon / sqrt(1 - alpha_bar_t)")

Guidance Techniques

Classifier guidance modifies the score during sampling by adding a classifier gradient:

$$\hat{s}(x_t, t, y) = s_\theta(x_t, t) + w \cdot \nabla_{x_t} \log p_\phi(y|x_t)$$

This steers generation toward class $y$ with strength $w$.

Classifier-free guidance avoids needing a separate classifier. During training, the model is trained with and without conditioning (by randomly dropping the condition). At inference:

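$$\hat{\epsilon}(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing))$$

In score space this follows from Bayes' rule: $\nabla_{x_t} \log p(c|x_t) = \nabla_{x_t} \log p(x_t|c) - \nabla_{x_t} \log p(x_t)$, so the difference between the conditional and unconditional scores plays the role of the classifier gradient:

$$\hat{s}(x_t, t, c) = s_\theta(x_t, t) + w \cdot (s_\theta(x_t, t | c) - s_\theta(x_t, t))$$

The guidance scale $w$ controls the trade-off between diversity and adherence to the condition: $w=0$ gives the unconditional model, $w=1$ recovers the plain conditional model (no extra guidance), and $w \gg 1$ strengthens adherence at the cost of diversity.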
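
At inference time the classifier-free combination is a one-line extrapolation between two network outputs. A minimal sketch, where `cfg_combine` is a hypothetical helper name and the two arrays stand in for the unconditional and conditional noise predictions of a trained model:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Stand-in network outputs (hypothetical values, not from a real model)
eps_uncond = np.array([0.2, -0.1])
eps_cond = np.array([0.5, 0.3])

for scale in [0.0, 1.0, 2.0, 7.5]:
    print(f"scale={scale:4.1f} -> {cfg_combine(eps_uncond, eps_cond, scale)}")
```

A scale of 0 returns the unconditional prediction, 1 returns the conditional one, and larger values extrapolate beyond it.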

Practice Exercises

Exercises: These solidify the derivations above.
  1. ELBO tightness: Show that $\log p_\theta(x) - \text{ELBO} = D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$ by expanding the definition of KL divergence.
  2. Optimal discriminator: Verify that $D^*_G(x) = \frac{p_{\text{data}}}{p_{\text{data}} + p_g}$ is a maximum (not minimum) by checking the second derivative of $a\log y + b\log(1-y)$.
  3. JS divergence bounds: Prove that $0 \leq D_{JS}(p\|q) \leq \log 2$. When is each bound achieved?
  4. Flow composition: For two affine flows $f_1(z) = A_1 z + b_1$ and $f_2(z) = A_2 z + b_2$, derive the log-determinant of the composed transformation.
  5. Diffusion schedule: Verify by induction that $q(x_t|x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)$ given the step-wise transition $q(x_t|x_{t-1}) = \mathcal{N}(\sqrt{\alpha_t}x_{t-1}, \beta_t I)$.
  6. Score-epsilon connection: Starting from $x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, derive that $\nabla_{x_t}\log q(x_t|x_0) = -\epsilon/\sqrt{1-\bar{\alpha}_t}$.