Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series outline:
- Mathematical Thinking: mindset, notation & functions
- Set Theory & Foundations: sets, operations & ML connections
- Combinatorics: counting, permutations & combinations
- Probability Fundamentals: rules, Bayes & distributions
- Statistics: descriptive to inferential
- Information Theory: entropy, cross-entropy & KL divergence
- Linear Algebra: vectors, matrices & transformations
- Calculus & Optimization: derivatives, gradients & descent
- ML-Specific Math: loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: multivariate stats & Bayesian inference
- Projects & Applications: build from scratch (regression, Bayes, PCA)

Multivariate Gaussian Distribution
The multivariate Gaussian $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ extends the univariate normal to $d$ dimensions:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$
where $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector and $\Sigma \in \mathbb{R}^{d\times d}$ is the symmetric positive definite covariance matrix (positive definiteness ensures $\Sigma^{-1}$ exists). The exponent $(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance — a distance that accounts for correlations between features.
Properties of the Multivariate Gaussian
- Marginals are Gaussian: any subset of coordinates is also Gaussian (marginalize by dropping rows/columns of $\Sigma$)
- Conditionals are Gaussian: $p(x_1 | x_2)$ is Gaussian with updated mean/covariance — the basis of Gaussian processes
- Linear transformations are Gaussian: $A\mathbf{x} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, A\Sigma A^\top)$ (checked numerically in the sketch after this list)
- Diagonalizable: $\Sigma = Q\Lambda Q^\top$ (eigendecomposition) — principal axes aligned with eigenvectors
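A minimal numerical sketch of the linear-transformation property and the Mahalanobis distance; the specific $\boldsymbol{\mu}$, $\Sigma$, $A$, $\mathbf{b}$, and test point below are arbitrary examples, not values from the text:

import numpy as np

# Empirical check: if x ~ N(mu, Sigma), then Ax + b ~ N(A mu + b, A Sigma A^T)
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
A = np.array([[1.0, 1.0],
              [0.5, -1.0]])
b = np.array([3.0, 0.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b  # apply the linear map row-wise

print("empirical mean of y :", y.mean(axis=0).round(3))
print("theoretical A mu + b:", (A @ mu + b).round(3))
print("empirical cov of y  :\n", np.cov(y.T).round(3))
print("theoretical A S A^T :\n", (A @ Sigma @ A.T).round(3))

# Squared Mahalanobis distance accounts for correlations between features
z = np.array([2.0, 0.0])
d2 = (z - mu) @ np.linalg.inv(Sigma) @ (z - mu)
print(f"squared Mahalanobis distance of {z}: {d2:.3f}")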
Bayesian Inference
Bayesian inference treats parameters $\theta$ as random variables, maintaining a distribution over possible values rather than committing to a single point estimate:

$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$$
The evidence $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)p(\theta)\,d\theta$ normalizes the posterior but is often intractable — motivating approximate inference methods like variational inference and MCMC.
Conjugate Prior: Beta-Binomial
When the prior and posterior are in the same distribution family, we call the prior a conjugate prior — no numerical integration needed:
Model: $X | p \sim \text{Binomial}(n, p)$, prior: $p \sim \text{Beta}(\alpha, \beta)$
Posterior: $p | X=k \sim \text{Beta}(\alpha + k, \beta + n - k)$
This makes Bayesian updating closed-form: each observation simply increments the Beta parameters. The posterior mean is $\frac{\alpha + k}{\alpha + \beta + n}$ — a weighted combination of the prior belief $\frac{\alpha}{\alpha+\beta}$ and MLE $\frac{k}{n}$.
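To make the weighting explicit, the posterior mean decomposes as a convex combination of the prior mean and the MLE:

$$\frac{\alpha + k}{\alpha + \beta + n} = \frac{\alpha+\beta}{\alpha+\beta+n} \cdot \frac{\alpha}{\alpha+\beta} + \frac{n}{\alpha+\beta+n} \cdot \frac{k}{n}$$

As $n$ grows, the data weight $\frac{n}{\alpha+\beta+n}$ approaches 1, so the posterior mean converges to the MLE.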
import numpy as np
# Bayesian coin flip: update Beta prior with observed data
def beta_posterior(alpha_prior, beta_prior, n_heads, n_tails):
    """Returns posterior Beta parameters after observing coin flips."""
    alpha_post = alpha_prior + n_heads
    beta_post = beta_prior + n_tails
    return alpha_post, beta_post

def beta_mean_std(alpha, beta):
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    return mean, np.sqrt(var)
# Start with uniform prior (Beta(1,1) = Uniform[0,1])
alpha0, beta0 = 1.0, 1.0
print("Bayesian coin flip (updating Beta prior):")
print(f"{'Observation':>25} {'Alpha':>7} {'Beta':>7} {'Posterior Mean':>16} {'Std':>8}")
print(f"{'Prior Beta(1,1)':>25} {alpha0:>7.1f} {beta0:>7.1f} {alpha0/(alpha0+beta0):>16.4f} {beta_mean_std(alpha0,beta0)[1]:>8.4f}")
# Observe: 3 heads, 2 tails, then 7 more heads, then 10 tails
observations = [(3, 2, "After 3H, 2T"), (7, 0, "After 7 more H"), (0, 10, "After 10 more T")]
a, b = alpha0, beta0
for h, t, label in observations:
    a, b = beta_posterior(a, b, h, t)
    mean, std = beta_mean_std(a, b)
    print(f"{label:>25} {a:>7.1f} {b:>7.1f} {mean:>16.4f} {std:>8.4f}")
print(f"\nMLE estimate: {(3+7)/(3+7+2+10):.4f} (just heads/total, ignores prior)")
Markov Chains
A Markov chain is a sequence of random variables $X_0, X_1, \ldots$ where each state depends only on the immediately preceding state:

$$P(X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, X_0) = P(X_{t+1} = j \mid X_t = i)$$
The transition matrix $T \in \mathbb{R}^{K \times K}$ has $T_{ij} = P(\text{next}=j \mid \text{current}=i)$. Each row sums to 1.
The distribution after $t$ steps: $\boldsymbol{\pi}^{(t)} = \boldsymbol{\pi}^{(0)} T^t$. A stationary distribution $\boldsymbol{\pi}^*$ satisfies $\boldsymbol{\pi}^* = \boldsymbol{\pi}^* T$ — it is the left eigenvector of $T$ for eigenvalue 1.
import numpy as np
# Weather Markov chain: states Sunny(0), Cloudy(1), Rainy(2)
T = np.array([[0.7, 0.2, 0.1],   # from Sunny
              [0.3, 0.4, 0.3],   # from Cloudy
              [0.2, 0.3, 0.5]])  # from Rainy
# Simulate chain evolution
pi0 = np.array([1.0, 0.0, 0.0]) # start in Sunny
print("Markov chain evolution (Weather model):")
print(f"t=0: {pi0}")
for t in [1, 5, 10, 50, 100]:
    pi_t = pi0 @ np.linalg.matrix_power(T, t)
    print(f"t={t:3d}: [{pi_t[0]:.4f}, {pi_t[1]:.4f}, {pi_t[2]:.4f}]")
# Find stationary distribution via eigenvectors
# Left eigenvectors of T are the right eigenvectors of T.T
eigenvalues, eigenvectors = np.linalg.eig(T.T)
idx = np.argmin(np.abs(eigenvalues - 1))  # index of eigenvalue closest to 1
stationary = np.real(eigenvectors[:, idx])
stationary /= stationary.sum() # normalize to probability distribution
print(f"\nStationary distribution π*: [{stationary[0]:.4f}, {stationary[1]:.4f}, {stationary[2]:.4f}]")
print(f"Verify π*T = π*: {np.allclose(stationary @ T, stationary)}")
print(f"\nInterpretation: long-run fraction of time in each state:")
print(f" Sunny={stationary[0]:.1%}, Cloudy={stationary[1]:.1%}, Rainy={stationary[2]:.1%}")
The EM Algorithm
The Expectation-Maximization algorithm maximizes likelihood in models with latent (hidden) variables $Z$. Directly maximizing $\log p(\mathcal{X};\theta) = \log \sum_Z p(\mathcal{X}, Z;\theta)$ is hard — the sum inside the log is intractable. EM iterates:
flowchart LR
    A([Initialize θ⁰]) --> B
    B["E-step: Compute
    Q(θ,θ^old) = E[log p(X,Z|θ) | X, θ^old]"] --> C
    C["M-step: Update
    θ^new = argmax Q(θ, θ^old)"] --> D{Converged?}
    D -- No --> B
    D -- Yes --> E([Return θ*])
- E-step: compute the expected complete-data log-likelihood $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z|\mathcal{X},\theta^{\text{old}}}[\log p(\mathcal{X},Z;\theta)]$
- M-step: maximize $Q$ with respect to $\theta$, giving $\theta^{\text{new}}$
- Guarantee: each iteration increases (or maintains) $\log p(\mathcal{X};\theta)$ — EM always ascends the likelihood (but may converge to local maxima)
For GMMs, the latent variables $Z = \{z_{ik}\}$ indicate which component $k$ generated data point $i$. The E-step computes responsibilities $r_{ik} = P(z_{ik}=1|x_i,\theta)$ (soft assignments); the M-step re-estimates $\boldsymbol{\mu}_k, \Sigma_k, \pi_k$ using these weighted assignments.
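The sketch below runs EM on a two-component 1D Gaussian mixture; the synthetic data, initialization, and fixed iteration count are illustrative choices, not part of the original text:

import numpy as np

# Minimal EM for a two-component 1D Gaussian mixture model
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.5, 200)])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pi_k = np.array([0.5, 0.5])      # mixing weights
mu_k = np.array([-1.0, 1.0])     # component means
sigma_k = np.array([1.0, 1.0])   # component standard deviations

for _ in range(50):
    # E-step: responsibilities r[i, k] = P(z_i = k | x_i, theta_old)
    dens = np.stack([pi_k[k] * gaussian_pdf(data, mu_k[k], sigma_k[k])
                     for k in range(2)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments
    Nk = r.sum(axis=0)
    pi_k = Nk / len(data)
    mu_k = (r * data[:, None]).sum(axis=0) / Nk
    sigma_k = np.sqrt((r * (data[:, None] - mu_k) ** 2).sum(axis=0) / Nk)

dens = np.stack([pi_k[k] * gaussian_pdf(data, mu_k[k], sigma_k[k])
                 for k in range(2)], axis=1)
print(f"weights: {pi_k.round(3)}, means: {mu_k.round(3)}, stds: {sigma_k.round(3)}")
print(f"log-likelihood: {np.log(dens.sum(axis=1)).sum():.2f}")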
Concentration Inequalities
Concentration inequalities bound the probability that a random variable deviates far from its expectation — fundamental for proving generalization bounds in ML theory.
Markov's Inequality (weakest, requires only non-negativity)
For any non-negative random variable $X$ and any $a > 0$:
$$P(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$
Too loose for practical use on its own, but the building block for all the others.
Chebyshev's Inequality (requires finite variance)
Applying Markov's inequality to $(X - \mu)^2$ gives, for any $k > 0$:
$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$
At most $1/k^2$ of the probability mass lies $k$ or more standard deviations from the mean. This is distribution-free (it holds for any finite-variance distribution) but loose for sub-Gaussian distributions.
Hoeffding's Inequality (for bounded independent variables)
For i.i.d. bounded $X_i \in [a, b]$: $P(\bar{X}_n - \mu \geq \epsilon) \leq \exp(-2n\epsilon^2/(b-a)^2)$. This decays exponentially in $n$ — showing why more data rapidly concentrates empirical averages around the true mean.
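A small simulation makes the exponential decay visible; the choice of Bernoulli(0.5) variables, $\epsilon = 0.1$, and the trial counts below are arbitrary illustrative values:

import numpy as np

# Compare the empirical tail probability P(Xbar - mu >= eps) with
# Hoeffding's bound exp(-2 n eps^2) for variables bounded in [0, 1]
rng = np.random.default_rng(42)
mu, eps, n_trials = 0.5, 0.1, 100_000
print(f"{'n':>5} {'empirical':>10} {'Hoeffding':>10}")
for n in [10, 50, 100, 500]:
    means = (rng.random((n_trials, n)) < mu).mean(axis=1)  # n_trials sample means
    empirical = (means - mu >= eps).mean()
    bound = np.exp(-2 * n * eps ** 2)                      # (b - a) = 1
    print(f"{n:>5} {empirical:>10.5f} {bound:>10.5f}")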
PAC Learning Framework
Probably Approximately Correct (PAC) learning formalizes when machine learning is possible. A concept class $\mathcal{C}$ is PAC-learnable if there exists an algorithm that, for any target concept $c \in \mathcal{C}$, any distribution $\mathcal{D}$, and any $\epsilon, \delta > 0$, returns a hypothesis $h$ with
$$P\big(\text{err}_{\mathcal{D}}(h) \leq \epsilon\big) \geq 1 - \delta$$
using at most $m = \text{poly}(1/\epsilon, 1/\delta, n)$ samples. For a finite hypothesis class $\mathcal{H}$ in the realizable case, the PAC sample complexity satisfies:
$$m \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$$
The VC dimension (Vapnik-Chervonenkis dimension) extends PAC theory to infinite hypothesis classes, measuring the expressiveness of $\mathcal{H}$ by the size of the largest set of points it can shatter (realize all $2^m$ possible labelings of some $m$-point set).
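Evaluating the finite-class bound for a few hypothetical class sizes shows the logarithmic dependence on $|\mathcal{H}|$ (the class sizes, $\epsilon$, and $\delta$ below are made-up examples):

import numpy as np

# Finite-class realizable PAC bound: m >= (ln|H| + ln(1/delta)) / eps
def pac_sample_complexity(h_size, eps, delta):
    return int(np.ceil((np.log(h_size) + np.log(1.0 / delta)) / eps))

for h_size in [100, 10_000, 1_000_000]:
    m = pac_sample_complexity(h_size, eps=0.05, delta=0.01)
    print(f"|H| = {h_size:>9,}: m >= {m} samples (eps=0.05, delta=0.01)")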
Practice Exercises
Detailed Balance & Reversible Markov Chains
A Markov chain satisfies detailed balance if $\pi_i T_{ij} = \pi_j T_{ji}$ for all $i,j$. Verify that the weather chain above satisfies detailed balance with the computed stationary distribution. Does it?
Answer
Check: $\pi_0 T_{01}$ vs $\pi_1 T_{10}$. The computed stationary distribution is $\boldsymbol{\pi}^* \approx [0.4565, 0.2826, 0.2609]$ (exactly $[21/46, 13/46, 12/46]$):
$\pi_0 T_{01} = 0.4565 \times 0.2 \approx 0.0913$, $\pi_1 T_{10} = 0.2826 \times 0.3 \approx 0.0848$ — not equal.
This chain does NOT satisfy detailed balance — it is not reversible. Detailed balance holds only for reversible chains (e.g., Metropolis-Hastings uses it to guarantee the correct stationary distribution). Non-reversible chains can still have a unique stationary distribution, they just aren't time-reversible.
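A few lines of NumPy confirm the conclusion, reusing the transition matrix from above:

import numpy as np

# Detailed balance check: the flow matrix F[i, j] = pi_i * T_ij is
# symmetric if and only if the chain is reversible
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
eigenvalues, eigenvectors = np.linalg.eig(T.T)
pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1))])
pi /= pi.sum()
F = pi[:, None] * T
print("stationary pi:", pi.round(4))                    # ~[0.4565, 0.2826, 0.2609]
print("detailed balance holds:", np.allclose(F, F.T))   # False: not reversible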
Conjugate Prior Update for a Poisson Rate
The Gamma distribution is the conjugate prior for the Poisson rate $\lambda$. If $X_1,\ldots,X_n \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(\alpha, \beta)$, derive the posterior $p(\lambda | X_1,\ldots,X_n)$.
Answer
Likelihood: $p(\mathbf{X}|\lambda) = \prod_i \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} \propto e^{-n\lambda}\lambda^{\sum x_i}$
Prior: $p(\lambda) \propto \lambda^{\alpha-1}e^{-\beta\lambda}$
Posterior: $p(\lambda|\mathbf{X}) \propto \lambda^{\alpha-1}e^{-\beta\lambda} \cdot e^{-n\lambda}\lambda^{\sum x_i} = \lambda^{(\alpha + \sum x_i) - 1} e^{-(\beta+n)\lambda}$
This is $\text{Gamma}(\alpha + \sum x_i, \; \beta + n)$ — the shape parameter $\alpha$ increases by the total observed count $\sum x_i$, and the rate parameter $\beta$ increases by the number of observations $n$.
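To sanity-check the algebra, the sketch below compares the closed-form posterior against a brute-force grid approximation; the prior parameters and observed counts are made-up examples:

import numpy as np
from scipy import stats

# Gamma-Poisson conjugate update, checked against a grid posterior
alpha, beta = 2.0, 1.0                 # Gamma(shape, rate) prior
x = np.array([3, 5, 2, 4, 6])          # observed Poisson counts (hypothetical)
alpha_post = alpha + x.sum()           # shape gains the total count
beta_post = beta + len(x)              # rate gains the number of observations

lam = np.linspace(0.01, 12, 2000)
unnorm = stats.gamma.pdf(lam, a=alpha, scale=1 / beta) * \
         np.prod(stats.poisson.pmf(x[:, None], lam), axis=0)
grid_post = unnorm / (unnorm.sum() * (lam[1] - lam[0]))   # normalize numerically
closed_form = stats.gamma.pdf(lam, a=alpha_post, scale=1 / beta_post)

print(f"posterior: Gamma({alpha_post:.0f}, {beta_post:.0f}), mean = {alpha_post / beta_post:.3f}")
print(f"max |grid - closed form|: {np.abs(grid_post - closed_form).max():.2e}")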
Conclusion & Next Steps
This part covered the mathematical depth that separates ML practitioners from ML researchers:
- Multivariate Gaussian — the backbone of Gaussian processes, linear discriminant analysis, and GMMs
- Bayesian inference — principled uncertainty quantification; regularization as a MAP estimate
- Markov chains — the foundation of MCMC, PageRank, hidden Markov models
- EM algorithm — fitting latent-variable models such as GMMs (k-means is the hard-EM special case)
- Concentration inequalities — Hoeffding's bound; the mathematical basis of generalization guarantees
- PAC learning — when learning is tractable; the role of VC dimension and sample complexity
Next in the Series
In Part 12: Projects & Applications, we put everything together: building linear regression, a Naive Bayes classifier, PCA, and a mini neural network entirely from scratch using only NumPy — and review which parts of this series underpin which ML algorithms.