Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Why Information Theory in ML
Claude Shannon (1948) asked a deceptively simple question: how do we measure the amount of "information" in a message? The answer spawned a mathematical framework that now underlies:
- Cross-entropy loss in neural networks (classification training objective)
- KL divergence in VAEs (regularization in the latent space)
- Information gain in decision trees (split criterion)
- Mutual information for feature selection and representation learning
- Entropy regularization in reinforcement learning (maximum entropy RL)
- Perplexity as a language model evaluation metric
Self-Information
The self-information (or surprise) of an event $x$ with probability $p(x)$ is
$$I(x) = -\log p(x)$$
measured in bits when the logarithm is base 2 and in nats when it is base $e$.
Intuition: rare events are more "surprising" (carry more information) than common ones.
- $p = 1$: certain event → $I = 0$ bits (no surprise)
- $p = 0.5$: fair coin → $I = 1$ bit (one binary question answered)
- $p = 0.01$: rare event → $I \approx 6.64$ bits (highly surprising)
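A minimal numerical check of these values (using the identity $-\log_2 p = \log_2(1/p)$):
import numpy as np
def self_information_bits(p):
    """Self-information in bits: -log2(p) == log2(1/p)."""
    return np.log2(1.0 / p)
for p in [1.0, 0.5, 0.01]:
    print(f"p = {p}: I = {self_information_bits(p):.2f} bits")
# expected: 0 bits, 1 bit, ~6.64 bits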
Shannon Entropy
Entropy is the expected (average) self-information of a distribution $p$:
$$H(X) = \mathbb{E}[I(X)] = -\sum_x p(x) \log p(x)$$
Entropy measures the uncertainty of a distribution:
- Maximum entropy: Uniform distribution — all outcomes equally likely, maximum uncertainty
- Minimum entropy: Deterministic distribution — one outcome has $p=1$, zero uncertainty
For a binary variable with $P(X=1) = p$, the binary entropy function is
$$H_b(p) = -p \log_2 p - (1-p) \log_2 (1-p)$$
This is maximized at $p = 0.5$ giving $H = 1$ bit, and equals 0 at $p \in \{0, 1\}$.
import numpy as np
def entropy(probs, base=np.e):
    """Compute Shannon entropy. Default: nats (base e). Use base=2 for bits."""
    probs = np.array(probs, dtype=float)
    # Convention: 0 * log(0) = 0 (drop zero-probability entries)
    mask = probs > 0
    return -np.sum(probs[mask] * np.log(probs[mask]) / np.log(base))
# Discrete distributions: compare entropies
uniform_4 = [0.25, 0.25, 0.25, 0.25]
skewed_4 = [0.7, 0.1, 0.1, 0.1]
det_4 = [1.0, 0.0, 0.0, 0.0]
print("Shannon Entropy (bits):")
print(f" Uniform: {entropy(uniform_4, base=2):.4f} bits (max = log2(4)={np.log2(4):.4f})")
print(f" Skewed: {entropy(skewed_4, base=2):.4f} bits")
print(f" Deterministic:{entropy(det_4, base=2):.4f} bits (min = 0)")
# Binary entropy function
print("\nBinary entropy H(p):")
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    h = entropy([p, 1-p], base=2)
    print(f" p={p}: H={h:.4f} bits")
Joint & Conditional Entropy
Joint entropy of two random variables $X, Y$:
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$$
Conditional entropy — uncertainty in $Y$ given $X$:
$$H(Y \mid X) = -\sum_{x, y} p(x, y) \log p(y \mid x) = H(X, Y) - H(X)$$
Key inequalities: $H(Y \mid X) \leq H(Y)$ — knowing $X$ cannot increase uncertainty about $Y$. And $H(X, Y) \leq H(X) + H(Y)$ — with equality when $X \perp Y$.
Mutual Information
Mutual information measures how much knowing $X$ reduces uncertainty about $Y$:
$$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
Properties: $I(X;Y) \geq 0$ (knowing more can't hurt), and $I(X;Y) = 0 \iff X \perp Y$.
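To make these identities concrete, here is a small standalone sketch that computes $H(X)$, $H(Y)$, $H(X,Y)$, $H(Y \mid X)$, and $I(X;Y)$ from a joint distribution; the 2x2 joint table and the H_bits helper name are made up for illustration:
import numpy as np
def H_bits(p):
    """Shannon entropy in bits; zero-probability entries contribute 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
# Hypothetical joint distribution p(x, y) for two binary variables
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])  # rows index X, columns index Y; entries sum to 1
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)  # marginals
H_X, H_Y, H_XY = H_bits(p_x), H_bits(p_y), H_bits(p_xy)
H_Y_given_X = H_XY - H_X  # chain rule: H(X,Y) = H(X) + H(Y|X)
I_XY = H_Y - H_Y_given_X  # mutual information
print(f"H(X)={H_X:.4f}  H(Y)={H_Y:.4f}  H(X,Y)={H_XY:.4f}")
print(f"H(Y|X)={H_Y_given_X:.4f} <= H(Y)={H_Y:.4f}")
print(f"I(X;Y)={I_XY:.4f} >= 0")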
Mutual Information for Feature Selection
A feature is useful if it has high mutual information with the target variable $Y$. In decision trees, the split criterion information gain is exactly mutual information:
$$\text{IG}(Y; A) = H(Y) - H(Y \mid A)$$
where $A$ is the feature (attribute). ID3 greedily selects the feature maximizing information gain at each node (C4.5 uses the closely related gain ratio). sklearn.feature_selection.mutual_info_classif computes $I(X_i; Y)$ for each feature $X_i$, giving a model-agnostic feature importance score.
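A minimal usage sketch with scikit-learn; the synthetic dataset and all parameter values below are illustrative, not taken from the text:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
# Synthetic data: 2 of the 5 features are informative (shuffle=False keeps them as the first two columns)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
mi_scores = mutual_info_classif(X, y, random_state=0)  # one score per feature, in nats
for i, mi in enumerate(mi_scores):
    print(f"feature {i}: I(X_i; Y) ~ {mi:.3f}")
# The informative features should receive noticeably higher scores than the noise features.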
Cross-Entropy
Cross-entropy measures the expected number of bits needed to encode data from distribution $p$ using a code designed for distribution $q$:
$$H(p, q) = -\sum_x p(x) \log q(x)$$
When $q = p$, this reduces to the ordinary entropy $H(p)$. When $q \neq p$, cross-entropy is strictly larger: $H(p, q) > H(p)$ (Gibbs' inequality).
import numpy as np
def cross_entropy(p_true, q_pred):
    """
    Compute cross-entropy H(p, q).
    p_true: true distribution (e.g., one-hot labels)
    q_pred: predicted distribution (e.g., softmax output)
    """
    p_true = np.array(p_true, dtype=float)
    q_pred = np.clip(q_pred, 1e-12, 1.0)  # prevent log(0)
    return -np.sum(p_true * np.log(q_pred))
# Example: 3-class classification
# True label: class 0 (one-hot)
p_true = [1.0, 0.0, 0.0]
# Good prediction: high probability on true class
q_good = [0.9, 0.07, 0.03]
# Bad prediction: spread probability
q_bad = [0.3, 0.4, 0.3]
# Perfect prediction
q_perfect = [1.0 - 1e-9, 5e-10, 5e-10]
print("Cross-entropy H(p, q):")
print(f" Good prediction: {cross_entropy(p_true, q_good):.4f}")
print(f" Bad prediction: {cross_entropy(p_true, q_bad):.4f}")
print(f" Perfect prediction: {cross_entropy(p_true, q_perfect):.4f}")
print(f" Entropy H(p): {cross_entropy(p_true, p_true):.4f} (lower bound)")
KL Divergence
Kullback-Leibler divergence (relative entropy) measures how different distribution $q$ is from $p$:
$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$
Key properties:
- $D_{KL}(p \| q) \geq 0$ always (Gibbs inequality)
- $D_{KL}(p \| q) = 0 \iff p = q$ everywhere
- Asymmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general
- "Forward KL" ($D_{KL}(p \| q)$): penalizes $q$ placing zero mass where $p$ is nonzero
- "Reverse KL" ($D_{KL}(q \| p)$): penalizes $q$ placing mass where $p$ is zero
KL Divergence in Variational Autoencoders
VAEs are trained by maximizing the ELBO (Evidence Lower BOund); equivalently, the training loss is the negative ELBO:
$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}}$$
The KL term forces the approximate posterior $q_\phi(z|x)$ (encoder output) to stay close to the prior $p(z) = \mathcal{N}(0, I)$. This prevents the encoder from memorizing by collapsing to point estimates, encouraging a smooth, continuous latent space.
For two Gaussians, KL has a closed form: $D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0,1)) = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$.
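A quick sanity check of this closed form against a brute-force Riemann-sum estimate of $\int p(x) \log \frac{p(x)}{q(x)}\, dx$; the values of $\mu$ and $\sigma^2$ are arbitrary:
import numpy as np
def kl_gaussian_vs_standard(mu, sigma2):
    """Closed form for KL( N(mu, sigma2) || N(0, 1) ), in nats."""
    return 0.5 * (mu**2 + sigma2 - 1.0 - np.log(sigma2))
mu, sigma2 = 1.5, 0.64  # arbitrary illustrative values
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
q = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
kl_numeric = np.sum(p * np.log(p / q)) * dx  # Riemann-sum approximation of the integral
print(f"closed form: {kl_gaussian_vs_standard(mu, sigma2):.6f} nats")
print(f"numeric:     {kl_numeric:.6f} nats")  # should agree to several decimals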
The Jensen-Shannon divergence is a symmetrized, bounded version of KL:
$$D_{JS}(p \,\|\, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m), \quad m = \tfrac{1}{2}(p + q)$$
The original GAN objective (Goodfellow et al., 2014) corresponds, at the optimal discriminator, to minimizing $D_{JS}$ between the data and generator distributions; $D_{JS}$ is also used in text generation evaluations.
import numpy as np
def kl_divergence(p, q):
    """KL(p || q). p and q are probability arrays summing to 1."""
    p = np.array(p, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))
def js_divergence(p, q):
    """Jensen-Shannon divergence (symmetric, bounded in [0, log(2)])."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
# True label distribution vs model predictions
p_true = np.array([0.7, 0.2, 0.1]) # true distribution
q_close = np.array([0.65, 0.25, 0.10]) # similar
q_far = np.array([0.1, 0.1, 0.8]) # very different
print("KL and JS Divergences:")
print(f" KL(true || close) = {kl_divergence(p_true, q_close):.6f}")
print(f" KL(true || far) = {kl_divergence(p_true, q_far):.6f}")
print(f" JS(true, close) = {js_divergence(p_true, q_close):.6f}")
print(f" JS(true, far) = {js_divergence(p_true, q_far):.6f}")
print(f"\nKL asymmetry:")
print(f" KL(p || q_far) = {kl_divergence(p_true, q_far):.6f}")
print(f" KL(q_far || p) = {kl_divergence(q_far, p_true):.6f}")
Relationships Between Information Quantities
flowchart TD
    SI["Self-Information<br/>I(x) = -log p(x)"]
    H["Shannon Entropy<br/>H(X) = E[I(X)]"]
    JH["Joint Entropy<br/>H(X,Y)"]
    CH["Conditional Entropy<br/>H(Y|X) = H(X,Y) - H(X)"]
    MI["Mutual Information<br/>I(X;Y) = H(X) - H(X|Y)"]
    CE["Cross-Entropy<br/>H(p,q) = -Σ p log q"]
    KL["KL Divergence<br/>D_KL(p||q) = H(p,q) - H(p)"]
    SI -->|"Expected value"| H
    H -->|"Joint of two vars"| JH
    JH -->|"Subtract H(X)"| CH
    H --> MI
    CH --> MI
    H -->|"Replace log p with log q"| CE
    CE -->|"Subtract H(p)"| KL
ML Connections Summary
Decision Tree Splitting via Entropy
At each node, the ID3 algorithm picks the feature $A$ that maximizes:
$$\text{IG}(Y; A) = H(Y) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(Y \mid A = v)$$
where $S_v$ is the subset of training examples with $A = v$. High information gain means the feature reduces uncertainty about the class label the most. Gini impurity (used by sklearn's CART implementation) is a related but different measure: $G = 1 - \sum_c p_c^2$.
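A from-scratch sketch of this split criterion on a tiny invented dataset; the feature values, labels, and helper names below are made up for illustration:
import numpy as np
def entropy_of_labels(labels):
    """Shannon entropy (bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
def information_gain(feature, labels):
    """IG(Y; A) = H(Y) - sum_v |S_v|/|S| * H(Y | A = v) for one candidate feature."""
    H_Y = entropy_of_labels(labels)
    remainder = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        remainder += len(subset) / len(labels) * entropy_of_labels(subset)
    return H_Y - remainder
# Toy data: a binary feature that mostly (but not perfectly) separates the classes
feature = np.array([0, 0, 0, 1, 1, 1, 1, 0])
labels  = np.array([0, 0, 0, 1, 1, 1, 0, 1])
print(f"IG = {information_gain(feature, labels):.4f} bits")  # ~0.19 bits here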
Practice Exercises
Entropy of a Fair Die
(1) What is the entropy of a fair 6-sided die roll in bits? (2) What is the entropy of a loaded die where face 6 appears with probability 0.5 and others with 0.1 each? Which has more uncertainty?
Answer
(1) Fair die: $H = -6 \times \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$ bits.
(2) Loaded die: $H = -0.5\log_2(0.5) - 5 \times 0.1\log_2(0.1) = 0.5 + 5 \times 0.332 = 0.5 + 1.661 \approx 2.161$ bits.
The fair die has higher entropy (2.585 bits) — more uncertainty since all outcomes are equally likely. The loaded die has lower entropy because face 6 is predictable. The uniform distribution maximizes entropy for a given number of outcomes.
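These numbers can be double-checked in a few lines (a small standalone helper, not part of the exercise):
import numpy as np
def entropy_bits(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))
print(f"fair die:   {entropy_bits([1/6] * 6):.3f} bits")         # ~2.585
print(f"loaded die: {entropy_bits([0.5] + [0.1] * 5):.3f} bits")  # ~2.161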
Cross-Entropy Loss for Binary Classification
For binary classification (labels $y \in \{0,1\}$), show that the general cross-entropy formula $H(p, q) = -\sum_x p(x)\log q(x)$ reduces to the familiar binary cross-entropy: $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$.
Answer
For a single binary example with true label $y \in \{0,1\}$: the true distribution is $p = [y, 1-y]$ (one-hot) and the predicted distribution is $q = [\hat{y}, 1-\hat{y}]$.
$H(p, q) = -p_0 \log q_0 - p_1 \log q_1 = -y \log \hat{y} - (1-y)\log(1-\hat{y})$. QED.
When $y=1$: only $-\log\hat{y}$ matters. When $y=0$: only $-\log(1-\hat{y})$ matters. This is exactly the binary cross-entropy implemented in BCELoss in PyTorch and BinaryCrossentropy in Keras.
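A quick numerical check that the general and binary forms agree (the label and predicted probability below are arbitrary):
import numpy as np
y, y_hat = 1.0, 0.8  # illustrative true label and predicted probability
general = -np.sum(np.array([y, 1 - y]) * np.log([y_hat, 1 - y_hat]))  # H(p, q) with p one-hot
binary = -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)             # textbook binary cross-entropy
print(general, binary)  # both ~0.2231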
Conclusion & Next Steps
Information theory gives ML a mathematically grounded way to measure uncertainty, distributional differences, and feature relevance. Key takeaways:
- Self-information: $I(x) = -\log p(x)$ — rare events are more surprising
- Entropy: $H(X)$ — average uncertainty; maximized by uniform distribution
- Cross-entropy: $H(p,q)$ — the loss function for classification in neural networks
- KL divergence: $D_{KL}(p\|q)$ — asymmetric distance between distributions; used in VAE training
- Mutual information: $I(X;Y)$ — feature relevance measure; decision tree criterion
- Key relationship: $H(p,q) = H(p) + D_{KL}(p\|q)$ — cross-entropy = entropy + extra cost of mismatch
Next in the Series
In Part 7: Linear Algebra, we tackle vectors, matrices, eigenvalues, and singular value decomposition — the computational backbone of neural network layers, PCA, and attention mechanisms.