Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series outline:
- Mathematical Thinking: mindset, notation & functions
- Set Theory & Foundations: sets, operations & ML connections
- Combinatorics: counting, permutations & combinations
- Probability Fundamentals: rules, Bayes & distributions
- Statistics: descriptive to inferential
- Information Theory: entropy, cross-entropy & KL divergence
- Linear Algebra: vectors, matrices & transformations
- Calculus & Optimization: derivatives, gradients & descent
- ML-Specific Math: loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: multivariate stats & Bayesian inference
- Projects & Applications: build from scratch (regression, Bayes, PCA)

Multivariate Gaussian Distribution
The multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ extends the univariate normal to $d$ dimensions:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d\times d}$ is the symmetric positive definite covariance matrix (positive definiteness ensures $\Sigma^{-1}$ exists). The exponent $(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance — a distance that accounts for correlations between features.
Properties of the Multivariate Gaussian
- Marginals are Gaussian: any subset of coordinates is also Gaussian (marginalize by dropping rows/columns of $\Sigma$)
- Conditionals are Gaussian: $p(x_1 | x_2)$ is Gaussian with updated mean/covariance — the basis of Gaussian processes
- Linear transformations are Gaussian: $A\mathbf{x} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\Sigma A^\top)$ (checked numerically in the sketch after this list)
- Diagonalizable: $\Sigma = Q\Lambda Q^\top$ (eigendecomposition) — principal axes aligned with eigenvectors
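A minimal numerical sketch of the linear-transformation property and the Mahalanobis distance; the specific $\boldsymbol{\mu}$, $\Sigma$, $A$, $\mathbf{b}$, and test point below are arbitrary examples, not values from the text:

import numpy as np

# Empirical check: if x ~ N(mu, Sigma), then Ax + b ~ N(A mu + b, A Sigma A^T)
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
A = np.array([[1.0, 1.0],
              [0.5, -1.0]])
b = np.array([3.0, 0.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b  # apply the linear map row-wise

print("empirical mean of y :", y.mean(axis=0).round(3))
print("theoretical A mu + b:", (A @ mu + b).round(3))
print("empirical cov of y  :\n", np.cov(y.T).round(3))
print("theoretical A S A^T :\n", (A @ Sigma @ A.T).round(3))

# Squared Mahalanobis distance accounts for correlations between features
z = np.array([2.0, 0.0])
d2 = (z - mu) @ np.linalg.inv(Sigma) @ (z - mu)
print(f"squared Mahalanobis distance of {z}: {d2:.3f}")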
Bayesian Inference
Bayesian inference treats parameters $\theta$ as random variables, maintaining a distribution over possible values rather than committing to a single point estimate:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
The evidence $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,d\theta$ normalizes the posterior but is often intractable — motivating approximate inference methods like variational inference and MCMC.
Conjugate Prior: Beta-Binomial
When the prior and posterior are in the same distribution family, we call the prior a conjugate prior — no numerical integration needed:
Model: $X | p \sim \text{Binomial}(n, p)$, prior: $p \sim \text{Beta}(\alpha, \beta)$
Posterior: $p | X=k \sim \text{Beta}(\alpha + k, \beta + n - k)$
This makes Bayesian updating closed-form: each observation simply increments the Beta parameters. The posterior mean is $\frac{\alpha + k}{\alpha + \beta + n}$ — a weighted combination of the prior belief $\frac{\alpha}{\alpha+\beta}$ and MLE $\frac{k}{n}$.
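To make the weighting explicit, the posterior mean decomposes as a convex combination of the prior mean and the MLE:

$$\frac{\alpha + k}{\alpha + \beta + n} = \frac{\alpha+\beta}{\alpha+\beta+n} \cdot \frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n} \cdot \frac{k}{n}$$

As $n$ grows, the data weight $\frac{n}{\alpha+\beta+n}$ approaches 1, so the posterior mean converges to the MLE.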
import numpy as np
# Bayesian coin flip: update Beta prior with observed data
def beta_posterior(alpha_prior, beta_prior, n_heads, n_tails):
    """Returns posterior Beta parameters after observing coin flips."""
    alpha_post = alpha_prior + n_heads
    beta_post = beta_prior + n_tails
    return alpha_post, beta_post

def beta_mean_std(alpha, beta):
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    return mean, np.sqrt(var)
# Start with uniform prior (Beta(1,1) = Uniform[0,1])
alpha0, beta0 = 1.0, 1.0
print("Bayesian coin flip (updating Beta prior):")
print(f"{'Observation':>25} {'Alpha':>7} {'Beta':>7} {'Posterior Mean':>16} {'Std':>8}")
print(f"{'Prior Beta(1,1)':>25} {alpha0:>7.1f} {beta0:>7.1f} {alpha0/(alpha0+beta0):>16.4f} {beta_mean_std(alpha0,beta0)[1]:>8.4f}")
# Observe: 3 heads, 2 tails, then 7 more heads, then 10 tails
observations = [(3, 2, "After 3H, 2T"), (7, 0, "After 7 more H"), (0, 10, "After 10 more T")]
a, b = alpha0, beta0
for h, t, label in observations:
    a, b = beta_posterior(a, b, h, t)
    mean, std = beta_mean_std(a, b)
    print(f"{label:>25} {a:>7.1f} {b:>7.1f} {mean:>16.4f} {std:>8.4f}")
print(f"\nMLE estimate: {(3+7)/(3+7+2+10):.4f} (just heads/total, ignores prior)")
Markov Chains
A Markov chain is a sequence of random variables $X_0, X_1, \ldots$ where each state depends only on the immediately preceding state:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, X_0) = P(X_{t+1} = j \mid X_t = i)$$
The transition matrix $T \in \mathbb{R}^{K \times K}$ has $T_{ij} = P(\text{next}=j \mid \text{current}=i)$. Each row sums to 1.
The distribution after $t$ steps: $\boldsymbol{\pi}^{(t)} = \boldsymbol{\pi}^{(0)} T^t$. A stationary distribution $\boldsymbol{\pi}^*$ satisfies $\boldsymbol{\pi}^* = \boldsymbol{\pi}^* T$ — it is the left eigenvector of $T$ for eigenvalue 1.
import numpy as np
# Weather Markov chain: states Sunny(0), Cloudy(1), Rainy(2)
T = np.array([[0.7, 0.2, 0.1],   # from Sunny
              [0.3, 0.4, 0.3],   # from Cloudy
              [0.2, 0.3, 0.5]])  # from Rainy
# Simulate chain evolution
pi0 = np.array([1.0, 0.0, 0.0]) # start in Sunny
print("Markov chain evolution (Weather model):")
print(f"t=0: {pi0}")
for t in [1, 5, 10, 50, 100]:
    pi_t = pi0 @ np.linalg.matrix_power(T, t)
    print(f"t={t:3d}: [{pi_t[0]:.4f}, {pi_t[1]:.4f}, {pi_t[2]:.4f}]")
# Find stationary distribution via eigenvectors
# Left eigenvectors of T are the right eigenvectors of T.T
eigenvalues, eigenvectors = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigenvalues - 1))  # index of eigenvalue closest to 1
stationary = np.real(eigenvectors[:, idx])
stationary /= stationary.sum() # normalize to probability distribution
print(f"\nStationary distribution π*: [{stationary[0]:.4f}, {stationary[1]:.4f}, {stationary[2]:.4f}]")
print(f"Verify π*T = π*: {np.allclose(stationary @ T, stationary)}")
print(f"\nInterpretation: long-run fraction of time in each state:")
print(f" Sunny={stationary[0]:.1%}, Cloudy={stationary[1]:.1%}, Rainy={stationary[2]:.1%}")
The EM Algorithm
The Expectation-Maximization algorithm maximizes likelihood in models with latent (hidden) variables $Z$. Directly maximizing $\log p(\mathcal{X};\theta) = \log \sum_Z p(\mathcal{X}, Z;\theta)$ is hard — the sum inside the log is intractable. EM iterates:
flowchart LR
    A([Initialize θ⁰]) --> B
    B["E-step: Compute
    Q(θ,θ^old) = E[log p(X,Z|θ) | X, θ^old]"] --> C
    C["M-step: Update
    θ^new = argmax Q(θ, θ^old)"] --> D{Converged?}
    D -- No --> B
    D -- Yes --> E([Return θ*])
- E-step: compute the expected complete-data log-likelihood $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z|\mathcal{X},\theta^{\text{old}}}[\log p(\mathcal{X},Z;\theta)]$
- M-step: maximize $Q$ with respect to $\theta$, giving $\theta^{\text{new}}$
- Guarantee: each iteration increases (or maintains) $\log p(\mathcal{X};\theta)$ — EM always ascends the likelihood (but may converge to local maxima)
For GMMs, the latent variables $Z = \{z_{ik}\}$ indicate which component $k$ generated data point $i$. The E-step computes responsibilities $r_{ik} = P(z_{ik}=1|x_i,\theta)$ (soft assignments); the M-step re-estimates $\boldsymbol{\mu}_k, \Sigma_k, \pi_k$ using these weighted assignments.
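The sketch below runs EM on a two-component 1D Gaussian mixture; the synthetic data, initialization, and fixed iteration count are illustrative choices, not part of the original text:

import numpy as np

# Minimal EM for a two-component 1D Gaussian mixture model
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pi_k = np.array([0.5, 0.5])      # mixing weights
mu_k = np.array([-1.0, 1.0])     # component means
sigma_k = np.array([1.0, 1.0])   # component standard deviations

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(z_i = k | x_i, theta_old)
    dens = np.stack([pi_k[k] * gaussian_pdf(data, mu_k[k], sigma_k[k])
                     for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments
    Nk = r.sum(axis=0)
    pi_k = Nk / len(data)
    mu_k = (r * data[:, None]).sum(axis=0) / Nk
    sigma_k = np.sqrt((r * (data[:, None] - mu_k) ** 2).sum(axis=0) / Nk)

dens = np.stack([pi_k[k] * gaussian_pdf(data, mu_k[k], sigma_k[k])
                 for k in range(2)], axis=1)
print(f"weights: {pi_k.round(3)}, means: {mu_k.round(3)}, stds: {sigma_k.round(3)}")
print(f"log-likelihood: {np.log(dens.sum(axis=1)).sum():.2f}")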
Concentration Inequalities
Concentration inequalities bound the probability that a random variable deviates far from its expectation — fundamental for proving generalization bounds in ML theory.
Markov's Inequality (weakest, requires only non-negativity)
For any non-negative random variable $X$ and any $a > 0$:
$$P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$
Too loose for practical use on its own, but the building block for all the others.
Chebyshev's Inequality (requires finite variance)
Applying Markov's inequality to $(X - \mu)^2$ gives, for any $k > 0$:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
At most $1/k^2$ of the probability mass lies $k$ or more standard deviations from the mean. This is distribution-free (it holds for any finite-variance distribution) but loose for sub-Gaussian distributions.
Hoeffding's Inequality (for bounded independent variables)
For i.i.d. bounded $X_i \in [a, b]$: $P(\bar{X}_n - \mu \geq \epsilon) \leq \exp(-2n\epsilon^2/(b-a)^2)$. This decays exponentially in $n$ — showing why more data rapidly concentrates empirical averages around the true mean.
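A small simulation makes the exponential decay visible; the choice of Bernoulli(0.5) variables, $\epsilon = 0.1$, and the trial counts below are arbitrary illustrative values:

import numpy as np

# Compare the empirical tail probability P(Xbar - mu >= eps) with
# Hoeffding's bound exp(-2 n eps^2) for variables bounded in [0, 1]
rng = np.random.default_rng(42)
mu, eps, n_trials = 0.5, 0.1, 100_000
print(f"{'n':>5} {'empirical':>10} {'Hoeffding':>10}")
for n in [10, 50, 100, 500]:
    means = (rng.random((n_trials, n)) < mu).mean(axis=1)  # n_trials sample means
    empirical = (means - mu >= eps).mean()
    bound = np.exp(-2 * n * eps ** 2)                      # (b - a) = 1
    print(f"{n:>5} {empirical:>10.5f} {bound:>10.5f}")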
PAC Learning Framework
Probably Approximately Correct (PAC) learning formalizes when machine learning is possible. A concept class $\mathcal{C}$ is PAC-learnable if there exists an algorithm that, for any target concept $c \in \mathcal{C}$, any distribution $\mathcal{D}$, and any $\epsilon, \delta > 0$, returns a hypothesis $h$ with
$$P\big(\text{err}_{\mathcal{D}}(h) \leq \epsilon\big) \geq 1 - \delta$$
using at most $m = \text{poly}(1/\epsilon, 1/\delta, n)$ samples. For a finite hypothesis class $\mathcal{H}$ in the realizable case, the PAC sample complexity satisfies:
$$m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$$
The VC dimension (Vapnik-Chervonenkis dimension) extends PAC theory to infinite hypothesis classes, measuring the expressiveness of $\mathcal{H}$ by the size of the largest set of points it can shatter (realize all $2^m$ possible labelings of some $m$-point set).
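Evaluating the finite-class bound for a few hypothetical class sizes shows the logarithmic dependence on $|\mathcal{H}|$ (the class sizes, $\epsilon$, and $\delta$ below are made-up examples):

import numpy as np

# Finite-class realizable PAC bound: m >= (ln|H| + ln(1/delta)) / eps
def pac_sample_complexity(h_size, eps, delta):
    return int(np.ceil((np.log(h_size) + np.log(1.0 / delta)) / eps))

for h_size in [100, 10_000, 1_000_000]:
    m = pac_sample_complexity(h_size, eps=0.05, delta=0.01)
    print(f"|H| = {h_size:>9,}: m >= {m} samples (eps=0.05, delta=0.01)")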
Practice Exercises
Detailed Balance & Reversible Markov Chains
A Markov chain satisfies detailed balance if $\pi_i T_{ij} = \pi_j T_{ji}$ for all $i,j$. Verify that the weather chain above satisfies detailed balance with the computed stationary distribution. Does it?
Answer
Check: $\pi_0 T_{01}$ vs $\pi_1 T_{10}$. The computed stationary distribution is $\boldsymbol{\pi}^* \approx [0.4565, 0.2826, 0.2609]$ (exactly $[21/46, 13/46, 12/46]$):
$\pi_0 T_{01} = 0.4565 \times 0.2 \approx 0.0913$, $\pi_1 T_{10} = 0.2826 \times 0.3 \approx 0.0848$ — not equal.
This chain does NOT satisfy detailed balance — it is not reversible. Detailed balance holds only for reversible chains (e.g., Metropolis-Hastings uses it to guarantee the correct stationary distribution). Non-reversible chains can still have a unique stationary distribution, they just aren't time-reversible.
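A few lines of NumPy confirm the conclusion, reusing the transition matrix from above:

import numpy as np

# Detailed balance check: the flow matrix F[i, j] = pi_i * T_ij is
# symmetric if and only if the chain is reversible
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
eigenvalues, eigenvectors = np.linalg.eig(T.T)
pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1))])
pi /= pi.sum()
F = pi[:, None] * T
print("stationary pi:", pi.round(4))                    # ~[0.4565, 0.2826, 0.2609]
print("detailed balance holds:", np.allclose(F, F.T))   # False: not reversible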
Conjugate Prior Update for a Poisson Rate
The Gamma distribution is the conjugate prior for the Poisson rate $\lambda$. If $X_1,\ldots,X_n \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(\alpha, \beta)$, derive the posterior $p(\lambda | X_1,\ldots,X_n)$.
Answer
Likelihood: $p(\mathbf{X}|\lambda) = \prod_i \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}$
Prior: $p(\lambda) \propto \lambda^{\alpha-1}e^{-\beta\lambda}$
Posterior: $p(\lambda|\mathbf{X}) \propto \lambda^{\alpha-1}e^{-\beta\lambda} \cdot e^{-n\lambda}\lambda^{\sum x_i} = \lambda^{(\alpha + \sum x_i) - 1} e^{-(\beta+n)\lambda}$
This is $\text{Gamma}(\alpha + \sum x_i, \; \beta + n)$ — the shape parameter $\alpha$ increases by the total observed count $\sum x_i$, and the rate parameter $\beta$ increases by the number of observations $n$.
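To sanity-check the algebra, the sketch below compares the closed-form posterior against a brute-force grid approximation; the prior parameters and observed counts are made-up examples:

import numpy as np
from scipy import stats

# Gamma-Poisson conjugate update, checked against a grid posterior
alpha, beta = 2.0, 1.0                 # Gamma(shape, rate) prior
x = np.array([3, 5, 2, 4, 6])          # observed Poisson counts (hypothetical)
alpha_post = alpha + x.sum()           # shape gains the total count
beta_post = beta + len(x)              # rate gains the number of observations

lam = np.linspace(0.01, 12, 2000)
unnorm = stats.gamma.pdf(lam, a=alpha, scale=1 / beta) * \
         np.prod(stats.poisson.pmf(x[:, None], lam), axis=0)
grid_post = unnorm / (unnorm.sum() * (lam[1] - lam[0]))   # normalize numerically
closed_form = stats.gamma.pdf(lam, a=alpha_post, scale=1 / beta_post)

print(f"posterior: Gamma({alpha_post:.0f}, {beta_post:.0f}), mean = {alpha_post / beta_post:.3f}")
print(f"max |grid - closed form|: {np.abs(grid_post - closed_form).max():.2e}")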
Conclusion & Next Steps
This part covered the mathematical depth that separates ML practitioners from ML researchers:
- Multivariate Gaussian — the backbone of Gaussian processes, linear discriminant analysis, and GMMs
- Bayesian inference — principled uncertainty quantification; regularization as a MAP estimate
- Markov chains — the foundation of MCMC, PageRank, hidden Markov models
- EM algorithm — fitting latent-variable models such as GMMs (k-means is the hard-EM special case)
- Concentration inequalities — Hoeffding's bound; the mathematical basis of generalization guarantees
- PAC learning — when learning is tractable; the role of VC dimension and sample complexity
Next in the Series
In Part 12: Projects & Applications, we put everything together: building linear regression, a Naive Bayes classifier, PCA, and a mini neural network entirely from scratch using only NumPy — and review which parts of this series underpin which ML algorithms.