
Part 11: Advanced Topics

April 26, 2026 Wasil Zafar 22 min read

This part covers the graduate-level mathematics that underpins some of the most powerful ML techniques: multivariate Gaussian distributions, Bayesian inference and the posterior, Markov chains and stationary distributions, the EM algorithm for latent variable models, and concentration inequalities that give us confidence bounds on learning algorithms.

Table of Contents

  1. Multivariate Gaussian
  2. Bayesian Inference
  3. Markov Chains
  4. EM Algorithm
  5. Concentration Inequalities
  6. PAC Learning
  7. Practice Exercises
  8. Conclusion & Next Steps

Multivariate Gaussian Distribution

The multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ extends the univariate normal to $d$ dimensions:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d\times d}$ is the symmetric positive definite covariance matrix (it must be invertible for the density above to exist). The exponent $(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance — a distance that accounts for feature correlations.
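
To make this concrete, here is a minimal sketch (the 2-D mean and covariance below are purely illustrative) that evaluates the density from the formula above and checks it against scipy.stats.multivariate_normal:

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D example with correlated features
mu    = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
x     = np.array([1.0, -0.5])

# Density computed directly from the formula
d     = len(mu)
diff  = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff          # squared Mahalanobis distance
norm  = 1.0 / ((2 * np.pi)**(d / 2) * np.sqrt(np.linalg.det(Sigma)))
p     = norm * np.exp(-0.5 * maha2)

print(f"Squared Mahalanobis distance: {maha2:.4f}")
print(f"Density (formula):            {p:.6f}")
print(f"Density (scipy):              {multivariate_normal(mu, Sigma).pdf(x):.6f}")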

Key Properties
Properties of the Multivariate Gaussian
  • Marginals are Gaussian: any subset of coordinates is also Gaussian (marginalize by dropping rows/columns of $\Sigma$)
  • Conditionals are Gaussian: $p(x_1 | x_2)$ is Gaussian with updated mean/covariance — the basis of Gaussian processes
  • Linear transformations are Gaussian: $A\mathbf{x} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\Sigma A^\top)$
  • Diagonalizable: $\Sigma = Q\Lambda Q^\top$ (eigendecomposition) — principal axes aligned with eigenvectors
ML Connection — GMMs: Gaussian Mixture Models represent complex data distributions as $p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$ where $\pi_k$ are mixture weights. Each $k$ corresponds to a cluster. Fitting GMMs requires the EM algorithm (see below).
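
As a small illustration of the mixture density (the weights, means, and covariances below are made up), this sketch evaluates $p(\mathbf{x}) = \sum_k \pi_k \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)$ at a few points:

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-component GMM in 2-D
pis    = [0.6, 0.4]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]

def gmm_density(x):
    """p(x) = sum_k pi_k * N(x; mu_k, Sigma_k)"""
    return sum(pi * multivariate_normal(mu, S).pdf(x)
               for pi, mu, S in zip(pis, mus, Sigmas))

for pt in [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([1.5, 1.5])]:
    print(f"p({pt}) = {gmm_density(pt):.4f}")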

Bayesian Inference

Bayesian inference treats parameters $\theta$ as random variables, maintaining a distribution over possible values rather than committing to a single point estimate:

$$\underbrace{p(\theta | \mathcal{D})}_{\text{posterior}} = \frac{\underbrace{p(\mathcal{D} | \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}$$

The evidence $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,d\theta$ normalizes the posterior but is often intractable — motivating approximate inference methods like variational inference and MCMC.

Conjugate Priors
Conjugate Prior: Beta-Binomial

When the prior and posterior are in the same distribution family, we call the prior a conjugate prior — no numerical integration needed:

Model: $X | p \sim \text{Binomial}(n, p)$, prior: $p \sim \text{Beta}(\alpha, \beta)$

Posterior: $p | X=k \sim \text{Beta}(\alpha + k, \beta + n - k)$

This makes Bayesian updating closed-form: each observation simply increments the Beta parameters. The posterior mean is $\frac{\alpha + k}{\alpha + \beta + n}$ — a weighted combination of the prior belief $\frac{\alpha}{\alpha+\beta}$ and MLE $\frac{k}{n}$.

import numpy as np

# Bayesian coin flip: update Beta prior with observed data
def beta_posterior(alpha_prior, beta_prior, n_heads, n_tails):
    """Returns posterior Beta parameters after observing coin flips."""
    alpha_post = alpha_prior + n_heads
    beta_post  = beta_prior  + n_tails
    return alpha_post, beta_post

def beta_mean_std(alpha, beta):
    mean = alpha / (alpha + beta)
    var  = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    return mean, np.sqrt(var)

# Start with uniform prior (Beta(1,1) = Uniform[0,1])
alpha0, beta0 = 1.0, 1.0
print("Bayesian coin flip (updating Beta prior):")
print(f"{'Observation':>25}  {'Alpha':>7}  {'Beta':>7}  {'Posterior Mean':>16}  {'Std':>8}")
print(f"{'Prior Beta(1,1)':>25}  {alpha0:>7.1f}  {beta0:>7.1f}  {alpha0/(alpha0+beta0):>16.4f}  {beta_mean_std(alpha0,beta0)[1]:>8.4f}")

# Observe: 3 heads, 2 tails, then 7 more heads, then 10 tails
observations = [(3, 2, "After 3H, 2T"), (7, 0, "After 7 more H"), (0, 10, "After 10 more T")]
a, b = alpha0, beta0
for h, t, label in observations:
    a, b = beta_posterior(a, b, h, t)
    mean, std = beta_mean_std(a, b)
    print(f"{label:>25}  {a:>7.1f}  {b:>7.1f}  {mean:>16.4f}  {std:>8.4f}")

print(f"\nMLE estimate: {(3+7)/(3+7+2+10):.4f}  (just heads/total, ignores prior)")

Markov Chains

A Markov chain is a sequence of random variables $X_0, X_1, \ldots$ where each state depends only on the immediately preceding state:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, X_0) = P(X_{t+1} = j \mid X_t = i) = T_{ij}$$

The transition matrix $T \in \mathbb{R}^{K \times K}$ has $T_{ij} = P(\text{next}=j \mid \text{current}=i)$. Each row sums to 1.

The distribution after $t$ steps: $\boldsymbol{\pi}^{(t)} = \boldsymbol{\pi}^{(0)} T^t$. A stationary distribution $\boldsymbol{\pi}^*$ satisfies $\boldsymbol{\pi}^* = \boldsymbol{\pi}^* T$ — it is the left eigenvector of $T$ for eigenvalue 1.

import numpy as np

# Weather Markov chain: states Sunny(0), Cloudy(1), Rainy(2)
T = np.array([[0.7, 0.2, 0.1],   # from Sunny
              [0.3, 0.4, 0.3],   # from Cloudy
              [0.2, 0.3, 0.5]])  # from Rainy

# Simulate chain evolution
pi0 = np.array([1.0, 0.0, 0.0])  # start in Sunny
print("Markov chain evolution (Weather model):")
print(f"t=0: {pi0}")
for t in [1, 5, 10, 50, 100]:
    pi_t = pi0 @ np.linalg.matrix_power(T, t)
    print(f"t={t:3d}: [{pi_t[0]:.4f}, {pi_t[1]:.4f}, {pi_t[2]:.4f}]")

# Find stationary distribution via eigenvectors
# Left eigenvectors of T (rows of T.T's right eigenvectors)
eigenvalues, eigenvectors = np.linalg.eig(T.T)
idx = np.argmax(np.real(eigenvalues))  # index of the dominant eigenvalue (= 1 for a stochastic matrix)
stationary = np.real(eigenvectors[:, idx])
stationary /= stationary.sum()  # normalize to probability distribution

print(f"\nStationary distribution π*: [{stationary[0]:.4f}, {stationary[1]:.4f}, {stationary[2]:.4f}]")
print(f"Verify π*T = π*: {np.allclose(stationary @ T, stationary)}")
print(f"\nInterpretation: long-run fraction of time in each state:")
print(f"  Sunny={stationary[0]:.1%}, Cloudy={stationary[1]:.1%}, Rainy={stationary[2]:.1%}")
ML Connection — MCMC & PageRank: Markov Chain Monte Carlo (MCMC) methods like Metropolis-Hastings design a Markov chain whose stationary distribution is the target posterior $p(\theta|\mathcal{D})$ — sampling from it approximates Bayesian inference when the posterior is intractable. Google's PageRank algorithm computes the stationary distribution of a web-graph Markov chain.
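
As a rough illustration (not a production sampler), here is a minimal random-walk Metropolis sketch targeting the coin-flip posterior from the Bayesian section, Beta(11, 13), using only its unnormalized density; the proposal scale, chain length, and burn-in below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)

# Random-walk Metropolis targeting Beta(11, 13): 10 heads, 12 tails with a Beta(1,1) prior
def log_target(p):
    """Unnormalized log posterior: log[ p^10 * (1-p)^12 ]."""
    if p <= 0 or p >= 1:
        return -np.inf
    return 10 * np.log(p) + 12 * np.log(1 - p)

samples, p = [], 0.5
for _ in range(20_000):
    proposal = p + rng.normal(0, 0.1)            # symmetric proposal => Metropolis acceptance rule
    if np.log(rng.uniform()) < log_target(proposal) - log_target(p):
        p = proposal                             # accept; otherwise keep the current state
    samples.append(p)

samples = np.array(samples[5_000:])              # discard burn-in
print(f"MCMC posterior mean: {samples.mean():.4f}  (exact Beta(11,13) mean: {11/24:.4f})")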

The EM Algorithm

The Expectation-Maximization algorithm maximizes likelihood in models with latent (hidden) variables $Z$. Directly maximizing $\log p(\mathcal{X};\theta) = \log \sum_Z p(\mathcal{X}, Z;\theta)$ is hard — the sum inside the log is intractable. EM iterates:

EM Algorithm: E-step and M-step
flowchart LR
    A([Initialize θ⁰]) --> B["E-step: compute Q(θ, θ^old) = E[log p(X,Z|θ) | X, θ^old]"]
    B --> C["M-step: θ^new = argmax Q(θ, θ^old)"]
    C --> D{Converged?}
    D -- No --> B
    D -- Yes --> E([Return θ*])
  • E-step: compute the expected complete-data log-likelihood $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z|\mathcal{X},\theta^{\text{old}}}[\log p(\mathcal{X},Z;\theta)]$
  • M-step: maximize $Q$ with respect to $\theta$, giving $\theta^{\text{new}}$
  • Guarantee: each iteration increases (or maintains) $\log p(\mathcal{X};\theta)$ — EM always ascends the likelihood (but may converge to local maxima)

For GMMs, the latent variables $Z = \{z_{ik}\}$ indicate which component $k$ generated data point $i$. The E-step computes responsibilities $r_{ik} = P(z_{ik}=1|x_i,\theta)$ (soft assignments); the M-step re-estimates $\boldsymbol{\mu}_k, \Sigma_k, \pi_k$ using these weighted assignments.
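
Below is a minimal EM sketch for a two-component GMM in one dimension (synthetic data, a fixed iteration count, and no convergence check) to show the E-step/M-step loop in code:

import numpy as np

rng = np.random.default_rng(42)

# Synthetic 1-D data from two Gaussians (true means -2 and 3, true std devs 1 and 1.5)
data = np.concatenate([rng.normal(-2, 1.0, 300), rng.normal(3, 1.5, 200)])

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Initialize parameters (K = 2 components)
pi_k  = np.array([0.5, 0.5])
mu_k  = np.array([-1.0, 1.0])
var_k = np.array([1.0, 1.0])

for it in range(50):
    # E-step: responsibilities r_ik = P(z_ik = 1 | x_i, theta), shape (n, K)
    weighted = pi_k * normal_pdf(data[:, None], mu_k, var_k)
    r = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, and variances from the soft assignments
    Nk    = r.sum(axis=0)
    pi_k  = Nk / len(data)
    mu_k  = (r * data[:, None]).sum(axis=0) / Nk
    var_k = (r * (data[:, None] - mu_k)**2).sum(axis=0) / Nk

print(f"Estimated weights:  {pi_k.round(3)}")
print(f"Estimated means:    {mu_k.round(3)}   (true: -2, 3)")
print(f"Estimated std devs: {np.sqrt(var_k).round(3)}   (true: 1, 1.5)")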

Concentration Inequalities

Concentration inequalities bound the probability that a random variable deviates far from its expectation — fundamental for proving generalization bounds in ML theory.

Markov's Inequality
Markov's Inequality (weakest, requires only non-negativity)
$$P(X \geq a) \leq \frac{\mathbb{E}[X]}{a} \quad \text{for } X \geq 0, a \gt 0$$

Too loose for practical use but the building block for all others.

Chebyshev's Inequality
Chebyshev's Inequality (requires finite variance)
$$P(|X - \mathbb{E}[X]| \geq k\sigma) \leq \frac{1}{k^2}$$

At most $1/k^2$ of probability mass lies $k$ standard deviations from the mean. This is distribution-free (holds for any finite-variance distribution) but loose for sub-Gaussian distributions.

Hoeffding's Inequality
Hoeffding's Inequality (for bounded independent variables)
$$P\!\left(\bar{X}_n - \mathbb{E}[\bar{X}_n] \geq \epsilon\right) \leq \exp\!\left(-\frac{2n^2\epsilon^2}{\sum_i (b_i - a_i)^2}\right)$$

For i.i.d. bounded $X_i \in [a, b]$: $P(\bar{X}_n - \mu \geq \epsilon) \leq \exp(-2n\epsilon^2/(b-a)^2)$. This decays exponentially in $n$ — showing why more data rapidly concentrates empirical averages around the true mean.
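
To get a feel for how loose or tight these bounds are, the following simulation uses a fair coin (so $X_i \in [0,1]$ and $\text{Var}(X_i) = 0.25$) and compares the empirical two-sided deviation probability with the Chebyshev and two-sided Hoeffding bounds; the sample sizes and $\epsilon$ below are arbitrary:

import numpy as np

rng = np.random.default_rng(1)

# Fair coin: X_i ~ Bernoulli(0.5), bounded in [0, 1], Var(X_i) = 0.25.
mu, eps, trials = 0.5, 0.1, 200_000
print(f"{'n':>5}  {'empirical':>10}  {'Chebyshev':>10}  {'Hoeffding':>10}")
for n in [10, 50, 100, 500]:
    means = rng.binomial(n, mu, size=trials) / n
    empirical = (np.abs(means - mu) >= eps).mean()
    chebyshev = min(1.0, 0.25 / (n * eps**2))            # Var(mean) / eps^2
    hoeffding = min(1.0, 2 * np.exp(-2 * n * eps**2))    # two-sided, (b - a) = 1
    print(f"{n:>5}  {empirical:>10.4f}  {chebyshev:>10.4f}  {hoeffding:>10.4f}")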

Generalization Bounds: Hoeffding's inequality, combined with a union bound over a finite hypothesis class $\mathcal{H}$, underlies many ML generalization bounds. For a binary classifier with loss bounded in $[0,1]$, with probability at least $1-\delta$ every $h \in \mathcal{H}$ satisfies $\mathcal{L}_{\text{true}}(h) \leq \mathcal{L}_{\text{train}}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}$. In other words: with $n$ training examples and a hypothesis class of size $|\mathcal{H}|$, empirical risk approximates true risk to within $\tilde{O}(1/\sqrt{n})$ with high probability.

PAC Learning Framework

Probably Approximately Correct (PAC) learning formalizes when machine learning is possible. A concept class $\mathcal{C}$ is PAC-learnable if there exists an algorithm that, for any target concept $c \in \mathcal{C}$, any distribution $\mathcal{D}$, any $\epsilon, \delta \gt 0$, returns a hypothesis $h$ with:

$$P_{S \sim \mathcal{D}^m}[\mathcal{L}(h) \leq \epsilon] \geq 1 - \delta$$

using at most $m = \text{poly}(1/\epsilon, 1/\delta, n)$ samples, where $n$ measures the size of each instance. For a finite hypothesis class $\mathcal{H}$, Hoeffding's inequality plus a union bound gives a sufficient sample size:

$$m \geq \frac{1}{2\epsilon^2}\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)$$
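
Plugging illustrative numbers into this bound (the specific $|\mathcal{H}|$, $\epsilon$, $\delta$ values below are arbitrary) gives a feel for the required sample sizes:

import numpy as np

def pac_sample_size(H_size, eps, delta):
    """Sufficient m from the finite-class bound: m >= (ln|H| + ln(2/delta)) / (2 eps^2)."""
    return int(np.ceil((np.log(H_size) + np.log(2 / delta)) / (2 * eps**2)))

# How many samples for eps-accurate, (1-delta)-confident learning?
for H_size, eps, delta in [(1000, 0.05, 0.05), (1000, 0.01, 0.05), (10**6, 0.05, 0.05)]:
    print(f"|H|={H_size:>8}, eps={eps}, delta={delta}  ->  m >= {pac_sample_size(H_size, eps, delta)}")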

The VC dimension (Vapnik-Chervonenkis dimension) extends PAC theory to infinite hypothesis classes, measuring the expressiveness of $\mathcal{H}$ by the size of the largest set of points it can shatter (i.e., realize all $2^k$ labelings of some set of $k$ points).

Practice Exercises

Exercise 1: Markov Chains
Detailed Balance & Reversible Markov Chains

A Markov chain satisfies detailed balance if $\pi_i T_{ij} = \pi_j T_{ji}$ for all $i,j$. Verify that the weather chain above satisfies detailed balance with the computed stationary distribution. Does it?

Answer

Check: $\pi_0 T_{01}$ vs $\pi_1 T_{10}$. Using the computed $\pi^* \approx [0.4565, 0.2826, 0.2609]$:

$\pi_0 T_{01} = 0.4565 \times 0.2 \approx 0.0913$, $\pi_1 T_{10} = 0.2826 \times 0.3 \approx 0.0848$ — not equal.

This chain does NOT satisfy detailed balance — it is not reversible. Detailed balance holds only for reversible chains (e.g., Metropolis-Hastings uses it to guarantee the correct stationary distribution). Non-reversible chains can still have a unique stationary distribution, they just aren't time-reversible.

Exercise 2: Bayesian Inference
Conjugate Prior Update for a Poisson Rate

The Gamma distribution is the conjugate prior for the Poisson rate $\lambda$. If $X_1,\ldots,X_n \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(\alpha, \beta)$, derive the posterior $p(\lambda | X_1,\ldots,X_n)$.

Answer

Likelihood: $p(\mathbf{X}|\lambda) = \prod_i \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}$

Prior: $p(\lambda) \propto \lambda^{\alpha-1}e^{-\beta\lambda}$

Posterior: $p(\lambda|\mathbf{X}) \propto \lambda^{\alpha-1}e^{-\beta\lambda} \cdot e^{-n\lambda}\lambda^{\sum x_i} = \lambda^{(\alpha + \sum x_i) - 1} e^{-(\beta+n)\lambda}$

This is $\text{Gamma}(\alpha + \sum x_i, \; \beta + n)$ — the prior count $\alpha$ increases by total observations, and the prior rate $\beta$ increases by sample size $n$.
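
A quick sketch of this update, mirroring the Beta-Binomial code above (the counts and the Gamma(2, 1) prior below are made up):

import numpy as np

# Conjugate Gamma-Poisson update: Gamma(alpha + sum(x), beta + n)
def gamma_posterior(alpha_prior, beta_prior, observations):
    x = np.asarray(observations)
    return alpha_prior + x.sum(), beta_prior + len(x)

# Hypothetical event counts (e.g. arrivals per hour) with a weak Gamma(2, 1) prior
counts = [3, 5, 4, 6, 2, 4]
a_post, b_post = gamma_posterior(2.0, 1.0, counts)
print(f"Posterior: Gamma({a_post:.0f}, {b_post:.0f})")
print(f"Posterior mean rate: {a_post / b_post:.3f}   (MLE: {np.mean(counts):.3f})")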

Conclusion & Next Steps

This part covered the mathematical depth that separates ML practitioners from ML researchers:

  • Multivariate Gaussian — the backbone of Gaussian processes, linear discriminant analysis, and GMMs
  • Bayesian inference — principled uncertainty quantification; regularization as a MAP estimate
  • Markov chains — the foundation of MCMC, PageRank, hidden Markov models
  • EM algorithm — fitting latent variable models like GMMs and k-means (hard-EM)
  • Concentration inequalities — Hoeffding's bound; the mathematical basis of generalization guarantees
  • PAC learning — when learning is tractable; the role of VC dimension and sample complexity

Next in the Series

In Part 12: Projects & Applications, we put everything together: building linear regression, a Naive Bayes classifier, PCA, and a mini neural network entirely from scratch using only NumPy — and review which parts of this series underpin which ML algorithms.