Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Mathematical Thinking
Mindset, notation & functionsSet Theory & Foundations
Sets, operations & ML connectionsCombinatorics
Counting, permutations & combinationsProbability Fundamentals
Rules, Bayes & distributionsStatistics
Descriptive to inferentialInformation Theory
Entropy, cross-entropy & KL divergenceLinear Algebra
Vectors, matrices & transformationsCalculus & Optimization
Derivatives, gradients & descentML-Specific Math
Loss functions & regularizationComputational Math
NumPy, SciPy & simulationAdvanced Topics
Multivariate stats & Bayesian inferenceProjects & Applications
Build from scratch: regression, Bayes, PCAWhy Information Theory in ML
Claude Shannon (1948) asked a deceptively simple question: how do we measure the amount of "information" in a message? The answer spawned a mathematical framework that now underlies:
- Cross-entropy loss in neural networks (classification training objective)
- KL divergence in VAEs (regularization in the latent space)
- Information gain in decision trees (split criterion)
- Mutual information for feature selection and representation learning
- Entropy regularization in reinforcement learning (maximum entropy RL)
- Perplexity as a language model evaluation metric
Self-Information: Why Logarithm?
The self-information (or surprise) of an event with probability $p$ is:
But why logarithm? Why not $1/p$, or $1 - p$, or some other decreasing function? Shannon (1948) showed that $-\log p$ is the unique function satisfying three natural axioms for an information measure $I(p)$:
| Axiom | Formal Statement | Intuition |
|---|---|---|
| Monotonicity | $I(p)$ is continuous and decreasing in $p$ | Rarer events are more surprising |
| Zero for certainty | $I(1) = 0$ | A certain event carries no information |
| Additivity | $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$ | Independent events' information adds up |
Verification with examples:
- $p = 1$: certain event → $I = -\log(1) = 0$ bits (no surprise) ✓ axiom 2
- $p = 0.5$: fair coin → $I = -\log_2(0.5) = 1$ bit (one binary question answered)
- $p = 0.01$: rare event → $I = -\log_2(0.01) \approx 6.64$ bits (highly surprising)
- Two fair coins (independent): $I(0.25) = 2$ bits $= I(0.5) + I(0.5)$ ✓ axiom 3
Shannon Entropy
Entropy is the expected (average) self-information of a distribution $p$ — combining self-information with the expectation from Part 4:
Entropy measures the average uncertainty of a distribution — equivalently, the minimum average bits needed per symbol for lossless compression.
Proof: Maximum Entropy = Uniform Distribution
Theorem: Among all distributions on $n$ outcomes, the uniform distribution $p(x_i) = 1/n$ uniquely maximizes entropy.
Proof via Lagrange multipliers (constrained optimization from Part 8):
We maximize $H = -\sum_{i=1}^n p_i \ln p_i$ subject to the constraint $\sum_{i=1}^n p_i = 1$.
$$\mathcal{L} = -\sum_{i=1}^n p_i \ln p_i - \lambda\left(\sum_{i=1}^n p_i - 1\right)$$Setting $\frac{\partial \mathcal{L}}{\partial p_i} = 0$:
$$-\ln p_i - 1 - \lambda = 0 \implies p_i = e^{-(1+\lambda)}$$Since the solution is the same for all $i$, all $p_i$ must be equal. Combined with $\sum p_i = 1$, we get $p_i = 1/n$ for all $i$. The maximum entropy is $\boxed{H_{\max} = \ln n}$.
Binary Entropy: Derivation of Maximum
For a binary variable with $P(X=1) = p$, the binary entropy function:
Finding the maximum via calculus (Part 8 preview — single-variable optimization):
$$\frac{dH_b}{dp} = -\ln p - 1 + \ln(1-p) + 1 = \ln\frac{1-p}{p}$$Setting $dH_b/dp = 0$: $\ln\frac{1-p}{p} = 0 \implies \frac{1-p}{p} = 1 \implies p = 1/2$.
Second derivative: $\frac{d^2H_b}{dp^2} = -\frac{1}{p} - \frac{1}{1-p} < 0$ for all $p \in (0,1)$, confirming this is a maximum.
Therefore $H_b(1/2) = \ln 2$ nats $= 1$ bit, and $H_b(0) = H_b(1) = 0$.
import numpy as np
def entropy(probs, base=np.e):
"""Compute Shannon entropy. Default: nats (base e). Use base=2 for bits."""
probs = np.array(probs, dtype=float)
# Convention: 0 * log(0) = 0
mask = probs > 0
return -np.sum(probs[mask] * np.log(probs[mask]) / np.log(base))
# Discrete distributions: compare entropies
uniform_4 = [0.25, 0.25, 0.25, 0.25]
skewed_4 = [0.7, 0.1, 0.1, 0.1]
det_4 = [1.0, 0.0, 0.0, 0.0]
print("Shannon Entropy (bits):")
print(f" Uniform: {entropy(uniform_4, base=2):.4f} bits (max = log2(4)={np.log2(4):.4f})")
print(f" Skewed: {entropy(skewed_4, base=2):.4f} bits")
print(f" Deterministic:{entropy(det_4, base=2):.4f} bits (min = 0)")
# Binary entropy function
print("\nBinary entropy H(p):")
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
h = entropy([p, 1-p], base=2)
print(f" p={p}: H={h:.4f} bits")
Joint & Conditional Entropy
Joint entropy of two random variables $X, Y$:
Conditional entropy — average uncertainty remaining in $Y$ after observing $X$:
Derivation: Entropy Chain Rule
We derive $H(X,Y) = H(X) + H(Y|X)$ from the probability chain rule ($p(x,y) = p(x)p(y|x)$ from Part 4):
$$H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$ $$= -\sum_{x,y} p(x,y) \log\bigl[p(x) \cdot p(y|x)\bigr]$$ $$= -\sum_{x,y} p(x,y) \log p(x) - \sum_{x,y} p(x,y) \log p(y|x)$$ $$= -\sum_x p(x) \log p(x) \underbrace{\sum_y p(y|x)}_{=1} + H(Y|X)$$ $$\boxed{= H(X) + H(Y|X)}$$This generalizes: $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1})$.
Proof: Conditioning Reduces Entropy
Theorem: $H(Y|X) \leq H(Y)$ with equality iff $X \perp Y$.
Proof: We'll prove this after establishing $D_{KL} \geq 0$ (Gibbs inequality, below). The key insight is:
$$H(Y) - H(Y|X) = I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) \geq 0$$Since KL divergence is non-negative (proved in the KL section), conditioning cannot increase entropy. Equality holds iff $p(x,y) = p(x)p(y)$ everywhere — i.e., $X$ and $Y$ are independent, so knowing $X$ tells us nothing about $Y$.
Mutual Information
Mutual information measures how much knowing $X$ reduces uncertainty about $Y$:
Derivation: MI as KL Divergence
We can express MI as a KL divergence measuring "distance from independence":
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$Substituting the definitions and using $p(x,y) = p(x)p(y)$ under independence:
$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = D_{KL}\bigl(p(x,y) \,\|\, p(x)p(y)\bigr)$$Data Processing Inequality
Theorem: If $X \to Y \to Z$ form a Markov chain (meaning $Z$ depends on $X$ only through $Y$), then:
Implication for ML: No post-processing of features $Y$ can create new information about the target $X$. Each layer in a neural network can only preserve or lose information — never create it. This is why representation learning matters: you need representations that retain relevant information, since downstream layers cannot recover what was lost.
Mutual Information for Feature Selection
A feature is useful if it has high mutual information with the target variable $Y$. In decision trees, the split criterion information gain is exactly mutual information:
$$\text{IG}(Y; A) = H(Y) - H(Y \mid A) = I(Y; A)$$Where $A$ is the feature (attribute). ID3 and C4.5 algorithms greedily select the feature maximizing information gain at each node. sklearn.feature_selection.mutual_info_classif computes $I(X_i; Y)$ for each feature $X_i$, giving a model-agnostic feature importance score that captures any dependency (not just linear).
import numpy as np
def entropy(probs):
"""Shannon entropy in nats. Convention: 0*log(0) = 0."""
probs = np.array(probs, dtype=float)
mask = probs > 0
return -np.sum(probs[mask] * np.log(probs[mask]))
def mutual_information(joint_px_y):
"""
Mutual information I(X;Y) from joint probability table.
joint_px_y: 2D array where joint_px_y[i][j] = P(X=i, Y=j)
"""
joint = np.array(joint_px_y, dtype=float)
p_x = joint.sum(axis=1) # marginal of X
p_y = joint.sum(axis=0) # marginal of Y
mi = 0.0
for i in range(joint.shape[0]):
for j in range(joint.shape[1]):
if joint[i, j] > 0:
mi += joint[i, j] * np.log(joint[i, j] / (p_x[i] * p_y[j]))
return mi
# Example: weather (X) vs umbrella (Y)
# X: {sunny=0, rainy=1}, Y: {no_umbrella=0, umbrella=1}
joint = np.array([
[0.45, 0.05], # sunny: usually no umbrella
[0.05, 0.45] # rainy: usually umbrella
])
mi = mutual_information(joint)
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)
print(f"H(X) = {entropy(p_x):.4f} nats")
print(f"H(Y) = {entropy(p_y):.4f} nats")
print(f"I(X;Y) = {mi:.4f} nats")
print(f"H(X|Y) = H(X) - I(X;Y) = {entropy(p_x) - mi:.4f} nats")
print(f"\nKnowing umbrella reduces weather uncertainty by {mi/entropy(p_x)*100:.1f}%")
# Independent case: verify I(X;Y) = 0
joint_indep = np.array([[0.25, 0.25], [0.25, 0.25]])
print(f"\nIndependent case: I(X;Y) = {mutual_information(joint_indep):.6f} (≈ 0)")
Cross-Entropy
Cross-entropy measures the expected number of bits needed to encode data from distribution $p$ using a code optimized for distribution $q$:
Derivation: Decomposition into Entropy + KL
The fundamental identity linking cross-entropy, entropy, and KL divergence:
$$H(p, q) = -\sum_x p(x) \log q(x) = -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)}$$Since $D_{KL}(p\|q) \geq 0$ (proved below), this immediately gives $H(p,q) \geq H(p)$, with equality iff $q = p$. The "extra bits" beyond entropy come entirely from the mismatch between $p$ and $q$.
Why Cross-Entropy = Maximum Likelihood
Claim: Minimizing cross-entropy loss is equivalent to maximum likelihood estimation (MLE from Part 5).
Proof: Given $N$ i.i.d. training samples $\{x_1, \ldots, x_N\}$ drawn from $p$, the negative log-likelihood of model $q$ is:
$$-\frac{1}{N}\sum_{i=1}^N \log q(x_i) \xrightarrow{N \to \infty} -\sum_x p(x) \log q(x) = H(p, q)$$By the law of large numbers (Part 5), the empirical average converges to the expectation. Since $H(p)$ is constant w.r.t. model parameters, minimizing $H(p,q)$ is equivalent to minimizing $D_{KL}(p\|q)$ — making $q$ as close to $p$ as possible.
import numpy as np
def cross_entropy(p_true, q_pred):
"""
Compute cross-entropy H(p, q).
p_true: true distribution (e.g., one-hot labels)
q_pred: predicted distribution (e.g., softmax output)
"""
p_true = np.array(p_true, dtype=float)
q_pred = np.clip(q_pred, 1e-12, 1.0) # prevent log(0)
return -np.sum(p_true * np.log(q_pred))
# Example: 3-class classification
# True label: class 0 (one-hot)
p_true = [1.0, 0.0, 0.0]
# Good prediction: high probability on true class
q_good = [0.9, 0.07, 0.03]
# Bad prediction: spread probability
q_bad = [0.3, 0.4, 0.3]
# Perfect prediction
q_perfect = [1.0 - 1e-9, 5e-10, 5e-10]
print("Cross-entropy H(p, q):")
print(f" Good prediction: {cross_entropy(p_true, q_good):.4f}")
print(f" Bad prediction: {cross_entropy(p_true, q_bad):.4f}")
print(f" Perfect prediction: {cross_entropy(p_true, q_perfect):.4f}")
print(f" Entropy H(p): {cross_entropy(p_true, p_true):.4f} (lower bound)")
KL Divergence
Kullback-Leibler divergence (relative entropy) measures how distribution $q$ differs from a reference $p$:
Proof: Gibbs Inequality ($D_{KL} \geq 0$)
This is arguably the most important inequality in information theory — it underpins why cross-entropy is a valid loss function and why conditioning reduces entropy.
Theorem (Gibbs inequality): $D_{KL}(p \| q) \geq 0$ for any distributions $p, q$, with equality iff $p = q$.
Proof via Jensen's inequality (using convexity from Part 8):
$$-D_{KL}(p \| q) = -\sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log \frac{q(x)}{p(x)}$$Since $\log$ is strictly concave, Jensen's inequality gives:
$$\sum_x p(x) \log \frac{q(x)}{p(x)} \leq \log \left(\sum_x p(x) \cdot \frac{q(x)}{p(x)}\right) = \log\left(\sum_x q(x)\right) = \log(1) = 0$$Therefore $-D_{KL}(p\|q) \leq 0$, which means $\boxed{D_{KL}(p\|q) \geq 0}$.
Equality holds iff $q(x)/p(x)$ is constant for all $x$ (Jensen's equality condition), which requires $q = p$ since both sum to 1.
- $H(p,q) \geq H(p)$: cross-entropy as a valid loss (it's minimized at ground truth)
- $H(Y|X) \leq H(Y)$: conditioning reduces entropy (information can't hurt)
- $I(X;Y) \geq 0$: mutual information is non-negative
- ELBO ≤ log-evidence: the variational bound is a true lower bound
KL Divergence: Forward vs Reverse
KL is asymmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$. The two directions have different behaviors crucial for ML:
| Direction | Name | Penalty | Result | Used In |
|---|---|---|---|---|
| $D_{KL}(p \| q)$ | Forward KL | $q(x)=0$ where $p(x)>0$ → $\infty$ | $q$ must cover all of $p$ → mean-seeking | MLE, cross-entropy loss |
| $D_{KL}(q \| p)$ | Reverse KL | $p(x)=0$ where $q(x)>0$ → $\infty$ | $q$ avoids placing mass where $p=0$ → mode-seeking | Variational inference, VAEs |
Concrete example: If $p$ is bimodal (two peaks) and $q$ is a single Gaussian:
- Forward KL forces $q$ to cover both modes → $q$ spreads across both peaks (over-dispersed)
- Reverse KL lets $q$ collapse onto one mode → $q$ captures one peak perfectly but ignores the other
KL Divergence in Variational Autoencoders
VAEs are trained with the ELBO (Evidence Lower BOund) loss:
$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}}$$The KL term forces the approximate posterior $q_\phi(z|x)$ (encoder output) to stay close to the prior $p(z) = \mathcal{N}(0, I)$. This prevents the encoder from memorizing by collapsing to point estimates, encouraging a smooth, continuous latent space.
For two Gaussians, KL has a closed form: $D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0,1)) = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$.
The Jensen-Shannon divergence is a symmetrized, bounded version of KL:
$D_{JS}$ is used as the training objective in the original GAN paper (Goodfellow et al. 2014) and in text generation evaluations.
import numpy as np
def kl_divergence(p, q):
"""KL(p || q). p and q are probability arrays summing to 1."""
p = np.array(p, dtype=float)
q = np.clip(q, 1e-12, 1.0)
mask = p > 0
return np.sum(p[mask] * np.log(p[mask] / q[mask]))
def js_divergence(p, q):
"""Jensen-Shannon divergence (symmetric, bounded in [0, log(2)])."""
p, q = np.array(p, dtype=float), np.array(q, dtype=float)
m = 0.5 * (p + q)
return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
# True label distribution vs model predictions
p_true = np.array([0.7, 0.2, 0.1]) # true distribution
q_close = np.array([0.65, 0.25, 0.10]) # similar
q_far = np.array([0.1, 0.1, 0.8]) # very different
print("KL and JS Divergences:")
print(f" KL(true || close) = {kl_divergence(p_true, q_close):.6f}")
print(f" KL(true || far) = {kl_divergence(p_true, q_far):.6f}")
print(f" JS(true, close) = {js_divergence(p_true, q_close):.6f}")
print(f" JS(true, far) = {js_divergence(p_true, q_far):.6f}")
print(f"\nKL asymmetry:")
print(f" KL(p || q_far) = {kl_divergence(p_true, q_far):.6f}")
print(f" KL(q_far || p) = {kl_divergence(q_far, p_true):.6f}")
Differential Entropy (Continuous Case)
For continuous random variables with density $f(x)$, we replace the sum with an integral:
Critical difference from discrete entropy: Differential entropy can be negative. For example, a uniform on $[0, 1/2]$ has $h = -\log 2 < 0$.
Derivation: Gaussian Has Maximum Entropy
Theorem: Among all distributions with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian $\mathcal{N}(\mu, \sigma^2)$ uniquely maximizes differential entropy.
Proof sketch (Lagrange multipliers): Maximize $h(X) = -\int f \log f\, dx$ subject to $\int f = 1$, $\int xf = \mu$, $\int (x-\mu)^2 f = \sigma^2$. The Lagrangian yields $f(x) \propto \exp(-\alpha - \beta x - \gamma x^2)$, which is exactly the Gaussian form. The maximum entropy is:
Implications for ML:
- This is why the Gaussian prior in VAEs ($p(z) = \mathcal{N}(0,I)$) is the "least informative" assumption given finite variance
- The KL between two Gaussians has a closed form (used directly in VAE training): $D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$
- Differential entropy is not invariant to coordinate changes (unlike discrete entropy), but KL divergence is — making KL the preferred measure for continuous distributions
import numpy as np
# Differential entropy of common distributions
print("Differential entropy h(X):")
print("="*50)
# Uniform on [a, b]: h = ln(b - a)
a, b = 0, 1
h_uniform = np.log(b - a)
print(f"Uniform[{a},{b}]: h = ln({b-a}) = {h_uniform:.4f} nats")
a, b = 0, 0.5
h_uniform_narrow = np.log(b - a)
print(f"Uniform[{a},{b}]: h = ln({b-a}) = {h_uniform_narrow:.4f} nats (NEGATIVE!)")
# Gaussian N(mu, sigma^2): h = 0.5 * ln(2*pi*e*sigma^2)
for sigma in [0.5, 1.0, 2.0, 5.0]:
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(f"Gaussian σ={sigma}: h = {h_gauss:.4f} nats")
# KL between two Gaussians (closed form)
mu1, sig1 = 1.0, 0.5 # q
mu2, sig2 = 0.0, 1.0 # p (standard normal)
kl_gauss = np.log(sig2/sig1) + (sig1**2 + (mu1-mu2)**2)/(2*sig2**2) - 0.5
print(f"\nKL(N({mu1},{sig1}²) || N({mu2},{sig2}²)) = {kl_gauss:.4f} nats")
print("(This is the VAE regularization term for one latent dimension)")
Relationships Between Information Quantities
flowchart TD
SI["Self-Information
I(x) = -log p(x)"]
H["Shannon Entropy
H(X) = E[I(X)]"]
JH["Joint Entropy
H(X,Y)"]
CH["Conditional Entropy
H(Y|X) = H(X,Y) - H(X)"]
MI["Mutual Information
I(X;Y) = H(X) - H(X|Y)"]
CE["Cross-Entropy
H(p,q) = -Σ p log q"]
KL["KL Divergence
D_KL(p||q) = H(p,q) - H(p)"]
SI -->|"Expected value"| H
H -->|"Joint of two vars"| JH
JH -->|"Subtract H(X)"| CH
H --> MI
CH --> MI
H -->|"Replace log p with log q"| CE
CE -->|"Subtract H(p)"| KL
ML Connections Summary
Decision Tree Splitting via Entropy
At each node, the ID3 algorithm picks the feature $A$ that maximizes:
$$\text{IG}(Y; A) = H(Y) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(Y \mid A = v)$$Where $S_v$ is the subset of training examples with $A = v$. High information gain = feature reduces uncertainty about class label the most. Gini impurity (used in sklearn's CART) is a related but different measure: $G = 1 - \sum_c p_c^2$.
Practice Exercises
Entropy of a Fair Die
(1) What is the entropy of a fair 6-sided die roll in bits? (2) What is the entropy of a loaded die where face 6 appears with probability 0.5 and others with 0.1 each? Which has more uncertainty?
Show Answer
(1) Fair die: $H = -6 \times \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$ bits.
(2) Loaded die: $H = -0.5\log_2(0.5) - 5 \times 0.1\log_2(0.1) = 0.5 + 5 \times 0.332 = 0.5 + 1.661 \approx 2.161$ bits.
The fair die has higher entropy (2.585 bits) — more uncertainty since all outcomes are equally likely. The loaded die has lower entropy because face 6 is predictable. The uniform distribution maximizes entropy for a given number of outcomes.
Cross-Entropy Loss for Binary Classification
For binary classification (labels $y \in \{0,1\}$), show that the general cross-entropy formula $H(p, q) = -\sum_x p(x)\log q(x)$ reduces to the familiar binary cross-entropy: $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$.
Show Answer
For a single binary example with true label $y \in \{0,1\}$: the true distribution is $p = [y, 1-y]$ (one-hot) and the predicted distribution is $q = [\hat{y}, 1-\hat{y}]$.
$H(p, q) = -p_0 \log q_0 - p_1 \log q_1 = -y \log \hat{y} - (1-y)\log(1-\hat{y})$. QED.
When $y=1$: only $-\log\hat{y}$ matters. When $y=0$: only $-\log(1-\hat{y})$ matters. This is exactly the binary cross-entropy implemented in BCELoss in PyTorch and BinaryCrossentropy in Keras.
Prove KL Divergence Between Two Bernoulli Distributions
Derive the closed-form KL divergence $D_{KL}(\text{Bern}(p) \| \text{Bern}(q))$ from the definition. Then verify that $D_{KL}(\text{Bern}(0.3) \| \text{Bern}(0.7)) \neq D_{KL}(\text{Bern}(0.7) \| \text{Bern}(0.3))$.
Show Answer
$D_{KL}(\text{Bern}(p) \| \text{Bern}(q)) = \sum_{x \in \{0,1\}} p(x) \log\frac{p(x)}{q(x)}$
$= p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$
For $p=0.3, q=0.7$: $D_{KL} = 0.3\ln(0.3/0.7) + 0.7\ln(0.7/0.3) = 0.3(-0.847) + 0.7(0.847) = -0.254 + 0.593 = 0.339$ nats.
For $p=0.7, q=0.3$: $D_{KL} = 0.7\ln(0.7/0.3) + 0.3\ln(0.3/0.7) = 0.7(0.847) + 0.3(-0.847) = 0.593 - 0.254 = 0.339$ nats.
In this special case they're equal! That's because Bernoulli with $p$ and $1-p$ have a symmetry. But in general (e.g., $p=0.2, q=0.8$): $D_{KL}(0.2\|0.8) = 0.2\ln(0.25) + 0.8\ln(8/2) \neq D_{KL}(0.8\|0.2)$.
Verify the Entropy Chain Rule Numerically
Given the joint distribution $P(X=0,Y=0)=0.3$, $P(X=0,Y=1)=0.1$, $P(X=1,Y=0)=0.2$, $P(X=1,Y=1)=0.4$: (1) Compute $H(X,Y)$ directly. (2) Compute $H(X) + H(Y|X)$ and verify it equals $H(X,Y)$.
Show Answer
(1) $H(X,Y) = -(0.3\ln0.3 + 0.1\ln0.1 + 0.2\ln0.2 + 0.4\ln0.4)$
$= -(−0.361 − 0.230 − 0.322 − 0.366) = 1.279$ nats.
(2) Marginals: $P(X=0)=0.4$, $P(X=1)=0.6$. So $H(X) = -(0.4\ln0.4 + 0.6\ln0.6) = 0.673$ nats.
Conditionals: $P(Y=0|X=0)=0.75$, $P(Y=1|X=0)=0.25$, $P(Y=0|X=1)=1/3$, $P(Y=1|X=1)=2/3$.
$H(Y|X) = 0.4 \cdot H_b(0.75) + 0.6 \cdot H_b(1/3) = 0.4(0.562) + 0.6(0.637) = 0.225 + 0.382 = 0.607$ nats.
$H(X) + H(Y|X) = 0.673 + 0.607 = 1.280$ nats ≈ $H(X,Y)$ ✓ (rounding).
Apply Gibbs Inequality to Show Cross-Entropy ≥ Entropy
Using only the fact that $D_{KL}(p\|q) \geq 0$ (Gibbs inequality) and the decomposition $H(p,q) = H(p) + D_{KL}(p\|q)$, prove that: (1) The optimal softmax output minimizing cross-entropy loss is the true label distribution. (2) For a one-hot label, the minimum possible cross-entropy loss is 0.
Show Answer
(1) $H(p,q) = H(p) + D_{KL}(p\|q) \geq H(p)$. The minimum is achieved when $D_{KL}(p\|q) = 0$, which happens iff $q = p$. So the optimal model output is $q^* = p$ (the true distribution).
(2) For a one-hot label $p = [0,\ldots,1,\ldots,0]$: $H(p) = -1 \cdot \log(1) - 0 \cdot \log(0) - \ldots = 0$. Therefore $\min_q H(p,q) = H(p) = 0$, achieved when the model outputs probability 1 on the correct class. This is why cross-entropy loss is non-negative and reaches 0 only for perfect predictions on one-hot targets.
Conclusion & Next Steps
Information theory gives ML a mathematically rigorous framework for measuring uncertainty and distributional mismatch. We derived — not just stated — every key result from first principles:
- Self-information: $I(x) = -\log p(x)$ — uniquely determined by Shannon's additivity axiom (Cauchy equation)
- Entropy: $H(X)$ — average uncertainty; maximized by uniform (proved via Lagrange multipliers)
- Chain rule: $H(X,Y) = H(X) + H(Y|X)$ — derived from the probability chain rule
- Gibbs inequality: $D_{KL}(p\|q) \geq 0$ — the foundational theorem, proved via Jensen's inequality
- Cross-entropy = MLE: Minimizing $H(p,q)$ ≡ minimizing $D_{KL}(p\|q)$ ≡ maximizing log-likelihood
- Mutual information: $I(X;Y) = D_{KL}(p_{XY}\|p_X p_Y) \geq 0$ — non-negativity from Gibbs
- Data processing inequality: $X \to Y \to Z \implies I(X;Z) \leq I(X;Y)$ — information can only be lost
- Differential entropy: Gaussian maximizes $h(X)$ for fixed variance — justifying the VAE Gaussian prior
Next in the Series
In Part 7: Linear Algebra, we tackle vectors, matrices, eigenvalues, and singular value decomposition — the computational backbone of neural network layers, PCA, and attention mechanisms.