Back to Mathematics

Part 6: Information Theory

April 26, 2026 Wasil Zafar 32 min read

Every time you minimize cross-entropy loss, you're applying information theory. Claude Shannon's framework for quantifying information provides the mathematical foundation for loss functions, compression, feature selection, and the latent space of VAEs.

Table of Contents

  1. Why Information Theory in ML
  2. Self-Information
  3. Shannon Entropy
  4. Joint & Conditional Entropy
  5. Mutual Information
  6. Cross-Entropy
  7. KL Divergence
  8. Differential Entropy
  9. Relationships Diagram
  10. ML Connections
  11. Practice Exercises
  12. Conclusion & Next Steps

Why Information Theory in ML

Claude Shannon (1948) asked a deceptively simple question: how do we measure the amount of "information" in a message? The answer spawned a mathematical framework that now underlies:

Information theory is everywhere in ML:
  • Cross-entropy loss in neural networks (classification training objective)
  • KL divergence in VAEs (regularization in the latent space)
  • Information gain in decision trees (split criterion)
  • Mutual information for feature selection and representation learning
  • Entropy regularization in reinforcement learning (maximum entropy RL)
  • Perplexity as a language model evaluation metric

Self-Information: Why Logarithm?

The self-information (or surprise) of an event with probability $p$ is:

$$I(x) = -\log_2 p(x) \quad \text{(in bits)}$$

But why logarithm? Why not $1/p$, or $1 - p$, or some other decreasing function? Shannon (1948) showed that $-\log p$ is the unique function satisfying three natural axioms for an information measure $I(p)$:

AxiomFormal StatementIntuition
Monotonicity$I(p)$ is continuous and decreasing in $p$Rarer events are more surprising
Zero for certainty$I(1) = 0$A certain event carries no information
Additivity$I(p_1 \cdot p_2) = I(p_1) + I(p_2)$Independent events' information adds up
Derivation from axioms (connecting to Part 4): From the additivity axiom and independence (Part 4: $P(A \cap B) = P(A) \cdot P(B)$), we need $I(p_1 p_2) = I(p_1) + I(p_2)$. The only continuous, monotonically decreasing functions satisfying $f(xy) = f(x) + f(y)$ are of the form $f(x) = -c\log(x)$ for some constant $c > 0$. Setting $c = 1/\ln(2)$ gives bits; $c = 1$ gives nats. This is a Cauchy functional equation — the logarithm is the unique solution under continuity.

Verification with examples:

  • $p = 1$: certain event → $I = -\log(1) = 0$ bits (no surprise) ✓ axiom 2
  • $p = 0.5$: fair coin → $I = -\log_2(0.5) = 1$ bit (one binary question answered)
  • $p = 0.01$: rare event → $I = -\log_2(0.01) \approx 6.64$ bits (highly surprising)
  • Two fair coins (independent): $I(0.25) = 2$ bits $= I(0.5) + I(0.5)$ ✓ axiom 3
Natural units: Using $\log_e$ (natural log) gives nats; using $\log_2$ gives bits. In ML, we almost always use natural log — this is why cross-entropy loss uses $\ln$, not $\log_2$. The conversion: $1 \text{ nat} = \frac{1}{\ln 2} \approx 1.443$ bits. The concepts are identical; only the unit changes.
Connection to optimal coding (Shannon's source coding theorem): A message with self-information $I(x)$ bits cannot be compressed to fewer than $I(x)$ bits on average. This connects information content to physical limits: entropy (defined next) is the minimum average bits per symbol needed for lossless compression. More probable symbols → shorter codes (Huffman coding uses this directly).

Shannon Entropy

Entropy is the expected (average) self-information of a distribution $p$ — combining self-information with the expectation from Part 4:

$$H(X) = \mathbb{E}[-\log p(X)] = -\sum_{x} p(x) \log p(x)$$

Entropy measures the average uncertainty of a distribution — equivalently, the minimum average bits needed per symbol for lossless compression.

Proof: Maximum Entropy = Uniform Distribution

Theorem: Among all distributions on $n$ outcomes, the uniform distribution $p(x_i) = 1/n$ uniquely maximizes entropy.

Proof via Lagrange multipliers (constrained optimization from Part 8):

We maximize $H = -\sum_{i=1}^n p_i \ln p_i$ subject to the constraint $\sum_{i=1}^n p_i = 1$.

$$\mathcal{L} = -\sum_{i=1}^n p_i \ln p_i - \lambda\left(\sum_{i=1}^n p_i - 1\right)$$

Setting $\frac{\partial \mathcal{L}}{\partial p_i} = 0$:

$$-\ln p_i - 1 - \lambda = 0 \implies p_i = e^{-(1+\lambda)}$$

Since the solution is the same for all $i$, all $p_i$ must be equal. Combined with $\sum p_i = 1$, we get $p_i = 1/n$ for all $i$. The maximum entropy is $\boxed{H_{\max} = \ln n}$.

Alternative proof via Jensen's inequality: Since $-\log$ is convex, by Jensen's: $H(X) = \mathbb{E}[-\log p(X)] \leq -\log \mathbb{E}[p(X)] = -\log\left(\frac{1}{n}\sum p_i^2\right)$... but the cleanest route is noting $H(X) = -\sum p_i \log p_i \leq -\sum p_i \log(1/n) = \log n$, where equality holds iff all $p_i = 1/n$ (from the strict concavity of $\log$).

Binary Entropy: Derivation of Maximum

For a binary variable with $P(X=1) = p$, the binary entropy function:

$$H_b(p) = -p \log p - (1-p)\log(1-p)$$

Finding the maximum via calculus (Part 8 preview — single-variable optimization):

$$\frac{dH_b}{dp} = -\ln p - 1 + \ln(1-p) + 1 = \ln\frac{1-p}{p}$$

Setting $dH_b/dp = 0$: $\ln\frac{1-p}{p} = 0 \implies \frac{1-p}{p} = 1 \implies p = 1/2$.

Second derivative: $\frac{d^2H_b}{dp^2} = -\frac{1}{p} - \frac{1}{1-p} < 0$ for all $p \in (0,1)$, confirming this is a maximum.

Therefore $H_b(1/2) = \ln 2$ nats $= 1$ bit, and $H_b(0) = H_b(1) = 0$.

import numpy as np

def entropy(probs, base=np.e):
    """Compute Shannon entropy. Default: nats (base e). Use base=2 for bits."""
    probs = np.array(probs, dtype=float)
    # Convention: 0 * log(0) = 0
    mask = probs > 0
    return -np.sum(probs[mask] * np.log(probs[mask]) / np.log(base))

# Discrete distributions: compare entropies
uniform_4   = [0.25, 0.25, 0.25, 0.25]
skewed_4    = [0.7,  0.1,  0.1,  0.1]
det_4       = [1.0,  0.0,  0.0,  0.0]

print("Shannon Entropy (bits):")
print(f"  Uniform:      {entropy(uniform_4, base=2):.4f} bits  (max = log2(4)={np.log2(4):.4f})")
print(f"  Skewed:       {entropy(skewed_4, base=2):.4f} bits")
print(f"  Deterministic:{entropy(det_4, base=2):.4f} bits  (min = 0)")

# Binary entropy function
print("\nBinary entropy H(p):")
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    h = entropy([p, 1-p], base=2)
    print(f"  p={p}: H={h:.4f} bits")

Joint & Conditional Entropy

Joint entropy of two random variables $X, Y$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Conditional entropy — average uncertainty remaining in $Y$ after observing $X$:

$$H(Y \mid X) = -\sum_{x,y} p(x,y) \log p(y \mid x)$$

Derivation: Entropy Chain Rule

We derive $H(X,Y) = H(X) + H(Y|X)$ from the probability chain rule ($p(x,y) = p(x)p(y|x)$ from Part 4):

$$H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$ $$= -\sum_{x,y} p(x,y) \log\bigl[p(x) \cdot p(y|x)\bigr]$$ $$= -\sum_{x,y} p(x,y) \log p(x) - \sum_{x,y} p(x,y) \log p(y|x)$$ $$= -\sum_x p(x) \log p(x) \underbrace{\sum_y p(y|x)}_{=1} + H(Y|X)$$ $$\boxed{= H(X) + H(Y|X)}$$

This generalizes: $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^n H(X_i | X_1, \ldots, X_{i-1})$.

Proof: Conditioning Reduces Entropy

Theorem: $H(Y|X) \leq H(Y)$ with equality iff $X \perp Y$.

Proof: We'll prove this after establishing $D_{KL} \geq 0$ (Gibbs inequality, below). The key insight is:

$$H(Y) - H(Y|X) = I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) \geq 0$$

Since KL divergence is non-negative (proved in the KL section), conditioning cannot increase entropy. Equality holds iff $p(x,y) = p(x)p(y)$ everywhere — i.e., $X$ and $Y$ are independent, so knowing $X$ tells us nothing about $Y$.

Bound on joint entropy: From the chain rule: $H(X,Y) = H(X) + H(Y|X) \leq H(X) + H(Y)$, with equality iff $X \perp Y$. Joint entropy of independent variables is the sum of their individual entropies — analogous to $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ for independent variables from Part 5.

Mutual Information

Mutual information measures how much knowing $X$ reduces uncertainty about $Y$:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$$

Derivation: MI as KL Divergence

We can express MI as a KL divergence measuring "distance from independence":

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

Substituting the definitions and using $p(x,y) = p(x)p(y)$ under independence:

$$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = D_{KL}\bigl(p(x,y) \,\|\, p(x)p(y)\bigr)$$
Non-negativity proof: Since $I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))$ and KL divergence is always non-negative (Gibbs inequality, proved below), we immediately get $I(X;Y) \geq 0$. Equality holds iff $p(x,y) = p(x)p(y)$ for all $x,y$ — i.e., $X$ and $Y$ are independent.

Data Processing Inequality

Theorem: If $X \to Y \to Z$ form a Markov chain (meaning $Z$ depends on $X$ only through $Y$), then:

$$I(X; Z) \leq I(X; Y)$$

Implication for ML: No post-processing of features $Y$ can create new information about the target $X$. Each layer in a neural network can only preserve or lose information — never create it. This is why representation learning matters: you need representations that retain relevant information, since downstream layers cannot recover what was lost.

Information bottleneck principle: Good intermediate representations compress $X$ (low $I(Y;X)$) while retaining what matters about label $Z$ (high $I(Y;Z)$). The data processing inequality bounds what's achievable: $I(Y;Z) \leq I(X;Z)$ — you can never predict $Z$ better from features than from raw data.
Feature SelectionDecision Trees
Mutual Information for Feature Selection

A feature is useful if it has high mutual information with the target variable $Y$. In decision trees, the split criterion information gain is exactly mutual information:

$$\text{IG}(Y; A) = H(Y) - H(Y \mid A) = I(Y; A)$$

Where $A$ is the feature (attribute). ID3 and C4.5 algorithms greedily select the feature maximizing information gain at each node. sklearn.feature_selection.mutual_info_classif computes $I(X_i; Y)$ for each feature $X_i$, giving a model-agnostic feature importance score that captures any dependency (not just linear).

import numpy as np

def entropy(probs):
    """Shannon entropy in nats. Convention: 0*log(0) = 0."""
    probs = np.array(probs, dtype=float)
    mask = probs > 0
    return -np.sum(probs[mask] * np.log(probs[mask]))

def mutual_information(joint_px_y):
    """
    Mutual information I(X;Y) from joint probability table.
    joint_px_y: 2D array where joint_px_y[i][j] = P(X=i, Y=j)
    """
    joint = np.array(joint_px_y, dtype=float)
    p_x = joint.sum(axis=1)  # marginal of X
    p_y = joint.sum(axis=0)  # marginal of Y

    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (p_x[i] * p_y[j]))
    return mi

# Example: weather (X) vs umbrella (Y)
# X: {sunny=0, rainy=1}, Y: {no_umbrella=0, umbrella=1}
joint = np.array([
    [0.45, 0.05],   # sunny: usually no umbrella
    [0.05, 0.45]    # rainy: usually umbrella
])

mi = mutual_information(joint)
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

print(f"H(X) = {entropy(p_x):.4f} nats")
print(f"H(Y) = {entropy(p_y):.4f} nats")
print(f"I(X;Y) = {mi:.4f} nats")
print(f"H(X|Y) = H(X) - I(X;Y) = {entropy(p_x) - mi:.4f} nats")
print(f"\nKnowing umbrella reduces weather uncertainty by {mi/entropy(p_x)*100:.1f}%")

# Independent case: verify I(X;Y) = 0
joint_indep = np.array([[0.25, 0.25], [0.25, 0.25]])
print(f"\nIndependent case: I(X;Y) = {mutual_information(joint_indep):.6f} (≈ 0)")

Cross-Entropy

Cross-entropy measures the expected number of bits needed to encode data from distribution $p$ using a code optimized for distribution $q$:

$$H(p, q) = -\sum_x p(x) \log q(x) = \mathbb{E}_{x \sim p}[-\log q(x)]$$

Derivation: Decomposition into Entropy + KL

The fundamental identity linking cross-entropy, entropy, and KL divergence:

$$H(p, q) = -\sum_x p(x) \log q(x) = -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)}$$
$$\boxed{H(p, q) = H(p) + D_{KL}(p \| q)}$$

Since $D_{KL}(p\|q) \geq 0$ (proved below), this immediately gives $H(p,q) \geq H(p)$, with equality iff $q = p$. The "extra bits" beyond entropy come entirely from the mismatch between $p$ and $q$.

Why Cross-Entropy = Maximum Likelihood

Claim: Minimizing cross-entropy loss is equivalent to maximum likelihood estimation (MLE from Part 5).

Proof: Given $N$ i.i.d. training samples $\{x_1, \ldots, x_N\}$ drawn from $p$, the negative log-likelihood of model $q$ is:

$$-\frac{1}{N}\sum_{i=1}^N \log q(x_i) \xrightarrow{N \to \infty} -\sum_x p(x) \log q(x) = H(p, q)$$

By the law of large numbers (Part 5), the empirical average converges to the expectation. Since $H(p)$ is constant w.r.t. model parameters, minimizing $H(p,q)$ is equivalent to minimizing $D_{KL}(p\|q)$ — making $q$ as close to $p$ as possible.

Cross-entropy loss in neural networks: Let $p$ be the true label distribution (one-hot vector) and $q$ be the model's softmax output. Then: $$\mathcal{L}_{\text{CE}} = -\sum_c y_c \log \hat{y}_c$$ For a one-hot label (true class $c^*$), this simplifies to $-\log \hat{y}_{c^*}$ — maximizing the predicted probability of the correct class. This follows from the decomposition above: since one-hot distributions have $H(p) = 0$, minimizing cross-entropy directly minimizes $D_{KL}(p\|\hat{y})$.
import numpy as np

def cross_entropy(p_true, q_pred):
    """
    Compute cross-entropy H(p, q).
    p_true: true distribution (e.g., one-hot labels)
    q_pred: predicted distribution (e.g., softmax output)
    """
    p_true = np.array(p_true, dtype=float)
    q_pred = np.clip(q_pred, 1e-12, 1.0)  # prevent log(0)
    return -np.sum(p_true * np.log(q_pred))

# Example: 3-class classification
# True label: class 0 (one-hot)
p_true = [1.0, 0.0, 0.0]

# Good prediction: high probability on true class
q_good = [0.9, 0.07, 0.03]
# Bad prediction: spread probability
q_bad  = [0.3, 0.4, 0.3]
# Perfect prediction
q_perfect = [1.0 - 1e-9, 5e-10, 5e-10]

print("Cross-entropy H(p, q):")
print(f"  Good prediction:    {cross_entropy(p_true, q_good):.4f}")
print(f"  Bad prediction:     {cross_entropy(p_true, q_bad):.4f}")
print(f"  Perfect prediction: {cross_entropy(p_true, q_perfect):.4f}")
print(f"  Entropy H(p):       {cross_entropy(p_true, p_true):.4f}  (lower bound)")

KL Divergence

Kullback-Leibler divergence (relative entropy) measures how distribution $q$ differs from a reference $p$:

$$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Proof: Gibbs Inequality ($D_{KL} \geq 0$)

This is arguably the most important inequality in information theory — it underpins why cross-entropy is a valid loss function and why conditioning reduces entropy.

Theorem (Gibbs inequality): $D_{KL}(p \| q) \geq 0$ for any distributions $p, q$, with equality iff $p = q$.

Proof via Jensen's inequality (using convexity from Part 8):

$$-D_{KL}(p \| q) = -\sum_x p(x) \log \frac{p(x)}{q(x)} = \sum_x p(x) \log \frac{q(x)}{p(x)}$$

Since $\log$ is strictly concave, Jensen's inequality gives:

$$\sum_x p(x) \log \frac{q(x)}{p(x)} \leq \log \left(\sum_x p(x) \cdot \frac{q(x)}{p(x)}\right) = \log\left(\sum_x q(x)\right) = \log(1) = 0$$

Therefore $-D_{KL}(p\|q) \leq 0$, which means $\boxed{D_{KL}(p\|q) \geq 0}$.

Equality holds iff $q(x)/p(x)$ is constant for all $x$ (Jensen's equality condition), which requires $q = p$ since both sum to 1.

Why this matters: Gibbs inequality is the foundation for:
  • $H(p,q) \geq H(p)$: cross-entropy as a valid loss (it's minimized at ground truth)
  • $H(Y|X) \leq H(Y)$: conditioning reduces entropy (information can't hurt)
  • $I(X;Y) \geq 0$: mutual information is non-negative
  • ELBO ≤ log-evidence: the variational bound is a true lower bound

KL Divergence: Forward vs Reverse

KL is asymmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$. The two directions have different behaviors crucial for ML:

DirectionNamePenaltyResultUsed In
$D_{KL}(p \| q)$Forward KL$q(x)=0$ where $p(x)>0$ → $\infty$$q$ must cover all of $p$ → mean-seekingMLE, cross-entropy loss
$D_{KL}(q \| p)$Reverse KL$p(x)=0$ where $q(x)>0$ → $\infty$$q$ avoids placing mass where $p=0$ → mode-seekingVariational inference, VAEs

Concrete example: If $p$ is bimodal (two peaks) and $q$ is a single Gaussian:

  • Forward KL forces $q$ to cover both modes → $q$ spreads across both peaks (over-dispersed)
  • Reverse KL lets $q$ collapse onto one mode → $q$ captures one peak perfectly but ignores the other
VAEsKL Term
KL Divergence in Variational Autoencoders

VAEs are trained with the ELBO (Evidence Lower BOund) loss:

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}}$$

The KL term forces the approximate posterior $q_\phi(z|x)$ (encoder output) to stay close to the prior $p(z) = \mathcal{N}(0, I)$. This prevents the encoder from memorizing by collapsing to point estimates, encouraging a smooth, continuous latent space.

For two Gaussians, KL has a closed form: $D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0,1)) = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$.

The Jensen-Shannon divergence is a symmetrized, bounded version of KL:

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right) \in [0, 1]$$

$D_{JS}$ is used as the training objective in the original GAN paper (Goodfellow et al. 2014) and in text generation evaluations.

import numpy as np

def kl_divergence(p, q):
    """KL(p || q). p and q are probability arrays summing to 1."""
    p = np.array(p, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence (symmetric, bounded in [0, log(2)])."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# True label distribution vs model predictions
p_true  = np.array([0.7, 0.2, 0.1])     # true distribution
q_close = np.array([0.65, 0.25, 0.10])  # similar
q_far   = np.array([0.1, 0.1, 0.8])     # very different

print("KL and JS Divergences:")
print(f"  KL(true || close) = {kl_divergence(p_true, q_close):.6f}")
print(f"  KL(true || far)   = {kl_divergence(p_true, q_far):.6f}")
print(f"  JS(true, close)   = {js_divergence(p_true, q_close):.6f}")
print(f"  JS(true, far)     = {js_divergence(p_true, q_far):.6f}")
print(f"\nKL asymmetry:")
print(f"  KL(p || q_far)   = {kl_divergence(p_true, q_far):.6f}")
print(f"  KL(q_far || p)   = {kl_divergence(q_far, p_true):.6f}")

Differential Entropy (Continuous Case)

For continuous random variables with density $f(x)$, we replace the sum with an integral:

$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$$

Critical difference from discrete entropy: Differential entropy can be negative. For example, a uniform on $[0, 1/2]$ has $h = -\log 2 < 0$.

Derivation: Gaussian Has Maximum Entropy

Theorem: Among all distributions with fixed mean $\mu$ and variance $\sigma^2$, the Gaussian $\mathcal{N}(\mu, \sigma^2)$ uniquely maximizes differential entropy.

Proof sketch (Lagrange multipliers): Maximize $h(X) = -\int f \log f\, dx$ subject to $\int f = 1$, $\int xf = \mu$, $\int (x-\mu)^2 f = \sigma^2$. The Lagrangian yields $f(x) \propto \exp(-\alpha - \beta x - \gamma x^2)$, which is exactly the Gaussian form. The maximum entropy is:

$$h(\mathcal{N}(\mu, \sigma^2)) = \frac{1}{2}\ln(2\pi e \sigma^2)$$

Implications for ML:

  • This is why the Gaussian prior in VAEs ($p(z) = \mathcal{N}(0,I)$) is the "least informative" assumption given finite variance
  • The KL between two Gaussians has a closed form (used directly in VAE training): $D_{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$
  • Differential entropy is not invariant to coordinate changes (unlike discrete entropy), but KL divergence is — making KL the preferred measure for continuous distributions
import numpy as np

# Differential entropy of common distributions
print("Differential entropy h(X):")
print("="*50)

# Uniform on [a, b]: h = ln(b - a)
a, b = 0, 1
h_uniform = np.log(b - a)
print(f"Uniform[{a},{b}]:     h = ln({b-a}) = {h_uniform:.4f} nats")

a, b = 0, 0.5
h_uniform_narrow = np.log(b - a)
print(f"Uniform[{a},{b}]:   h = ln({b-a}) = {h_uniform_narrow:.4f} nats (NEGATIVE!)")

# Gaussian N(mu, sigma^2): h = 0.5 * ln(2*pi*e*sigma^2)
for sigma in [0.5, 1.0, 2.0, 5.0]:
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
    print(f"Gaussian σ={sigma}:    h = {h_gauss:.4f} nats")

# KL between two Gaussians (closed form)
mu1, sig1 = 1.0, 0.5   # q
mu2, sig2 = 0.0, 1.0   # p (standard normal)
kl_gauss = np.log(sig2/sig1) + (sig1**2 + (mu1-mu2)**2)/(2*sig2**2) - 0.5
print(f"\nKL(N({mu1},{sig1}²) || N({mu2},{sig2}²)) = {kl_gauss:.4f} nats")
print("(This is the VAE regularization term for one latent dimension)")

Relationships Between Information Quantities

Information Theory: Key Relationships
flowchart TD
    SI["Self-Information
I(x) = -log p(x)"] H["Shannon Entropy
H(X) = E[I(X)]"] JH["Joint Entropy
H(X,Y)"] CH["Conditional Entropy
H(Y|X) = H(X,Y) - H(X)"] MI["Mutual Information
I(X;Y) = H(X) - H(X|Y)"] CE["Cross-Entropy
H(p,q) = -Σ p log q"] KL["KL Divergence
D_KL(p||q) = H(p,q) - H(p)"] SI -->|"Expected value"| H H -->|"Joint of two vars"| JH JH -->|"Subtract H(X)"| CH H --> MI CH --> MI H -->|"Replace log p with log q"| CE CE -->|"Subtract H(p)"| KL

ML Connections Summary

Decision TreesInformation Gain
Decision Tree Splitting via Entropy

At each node, the ID3 algorithm picks the feature $A$ that maximizes:

$$\text{IG}(Y; A) = H(Y) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(Y \mid A = v)$$

Where $S_v$ is the subset of training examples with $A = v$. High information gain = feature reduces uncertainty about class label the most. Gini impurity (used in sklearn's CART) is a related but different measure: $G = 1 - \sum_c p_c^2$.

Perplexity — evaluating language models: Perplexity is $2^{H(p,q)}$ (or $e^{H(p,q)}$ in nats). For a language model with cross-entropy loss $\mathcal{L}$ nats per token: $$\text{Perplexity} = e^\mathcal{L}$$ Lower perplexity = model assigns higher probability to actual text = better language model. A perplexity of $k$ means the model is as uncertain as if it had to choose uniformly among $k$ equally likely options at each step.

Practice Exercises

Exercise 1Entropy
Entropy of a Fair Die

(1) What is the entropy of a fair 6-sided die roll in bits? (2) What is the entropy of a loaded die where face 6 appears with probability 0.5 and others with 0.1 each? Which has more uncertainty?

Show Answer

(1) Fair die: $H = -6 \times \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$ bits.

(2) Loaded die: $H = -0.5\log_2(0.5) - 5 \times 0.1\log_2(0.1) = 0.5 + 5 \times 0.332 = 0.5 + 1.661 \approx 2.161$ bits.

The fair die has higher entropy (2.585 bits) — more uncertainty since all outcomes are equally likely. The loaded die has lower entropy because face 6 is predictable. The uniform distribution maximizes entropy for a given number of outcomes.

Exercise 2Cross-Entropy Loss
Cross-Entropy Loss for Binary Classification

For binary classification (labels $y \in \{0,1\}$), show that the general cross-entropy formula $H(p, q) = -\sum_x p(x)\log q(x)$ reduces to the familiar binary cross-entropy: $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$.

Show Answer

For a single binary example with true label $y \in \{0,1\}$: the true distribution is $p = [y, 1-y]$ (one-hot) and the predicted distribution is $q = [\hat{y}, 1-\hat{y}]$.

$H(p, q) = -p_0 \log q_0 - p_1 \log q_1 = -y \log \hat{y} - (1-y)\log(1-\hat{y})$. QED.

When $y=1$: only $-\log\hat{y}$ matters. When $y=0$: only $-\log(1-\hat{y})$ matters. This is exactly the binary cross-entropy implemented in BCELoss in PyTorch and BinaryCrossentropy in Keras.

Exercise 3KL Divergence
Prove KL Divergence Between Two Bernoulli Distributions

Derive the closed-form KL divergence $D_{KL}(\text{Bern}(p) \| \text{Bern}(q))$ from the definition. Then verify that $D_{KL}(\text{Bern}(0.3) \| \text{Bern}(0.7)) \neq D_{KL}(\text{Bern}(0.7) \| \text{Bern}(0.3))$.

Show Answer

$D_{KL}(\text{Bern}(p) \| \text{Bern}(q)) = \sum_{x \in \{0,1\}} p(x) \log\frac{p(x)}{q(x)}$

$= p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$

For $p=0.3, q=0.7$: $D_{KL} = 0.3\ln(0.3/0.7) + 0.7\ln(0.7/0.3) = 0.3(-0.847) + 0.7(0.847) = -0.254 + 0.593 = 0.339$ nats.

For $p=0.7, q=0.3$: $D_{KL} = 0.7\ln(0.7/0.3) + 0.3\ln(0.3/0.7) = 0.7(0.847) + 0.3(-0.847) = 0.593 - 0.254 = 0.339$ nats.

In this special case they're equal! That's because Bernoulli with $p$ and $1-p$ have a symmetry. But in general (e.g., $p=0.2, q=0.8$): $D_{KL}(0.2\|0.8) = 0.2\ln(0.25) + 0.8\ln(8/2) \neq D_{KL}(0.8\|0.2)$.

Exercise 4Chain Rule
Verify the Entropy Chain Rule Numerically

Given the joint distribution $P(X=0,Y=0)=0.3$, $P(X=0,Y=1)=0.1$, $P(X=1,Y=0)=0.2$, $P(X=1,Y=1)=0.4$: (1) Compute $H(X,Y)$ directly. (2) Compute $H(X) + H(Y|X)$ and verify it equals $H(X,Y)$.

Show Answer

(1) $H(X,Y) = -(0.3\ln0.3 + 0.1\ln0.1 + 0.2\ln0.2 + 0.4\ln0.4)$
$= -(−0.361 − 0.230 − 0.322 − 0.366) = 1.279$ nats.

(2) Marginals: $P(X=0)=0.4$, $P(X=1)=0.6$. So $H(X) = -(0.4\ln0.4 + 0.6\ln0.6) = 0.673$ nats.

Conditionals: $P(Y=0|X=0)=0.75$, $P(Y=1|X=0)=0.25$, $P(Y=0|X=1)=1/3$, $P(Y=1|X=1)=2/3$.

$H(Y|X) = 0.4 \cdot H_b(0.75) + 0.6 \cdot H_b(1/3) = 0.4(0.562) + 0.6(0.637) = 0.225 + 0.382 = 0.607$ nats.

$H(X) + H(Y|X) = 0.673 + 0.607 = 1.280$ nats ≈ $H(X,Y)$ ✓ (rounding).

Exercise 5Gibbs Inequality
Apply Gibbs Inequality to Show Cross-Entropy ≥ Entropy

Using only the fact that $D_{KL}(p\|q) \geq 0$ (Gibbs inequality) and the decomposition $H(p,q) = H(p) + D_{KL}(p\|q)$, prove that: (1) The optimal softmax output minimizing cross-entropy loss is the true label distribution. (2) For a one-hot label, the minimum possible cross-entropy loss is 0.

Show Answer

(1) $H(p,q) = H(p) + D_{KL}(p\|q) \geq H(p)$. The minimum is achieved when $D_{KL}(p\|q) = 0$, which happens iff $q = p$. So the optimal model output is $q^* = p$ (the true distribution).

(2) For a one-hot label $p = [0,\ldots,1,\ldots,0]$: $H(p) = -1 \cdot \log(1) - 0 \cdot \log(0) - \ldots = 0$. Therefore $\min_q H(p,q) = H(p) = 0$, achieved when the model outputs probability 1 on the correct class. This is why cross-entropy loss is non-negative and reaches 0 only for perfect predictions on one-hot targets.

Conclusion & Next Steps

Information theory gives ML a mathematically rigorous framework for measuring uncertainty and distributional mismatch. We derived — not just stated — every key result from first principles:

  • Self-information: $I(x) = -\log p(x)$ — uniquely determined by Shannon's additivity axiom (Cauchy equation)
  • Entropy: $H(X)$ — average uncertainty; maximized by uniform (proved via Lagrange multipliers)
  • Chain rule: $H(X,Y) = H(X) + H(Y|X)$ — derived from the probability chain rule
  • Gibbs inequality: $D_{KL}(p\|q) \geq 0$ — the foundational theorem, proved via Jensen's inequality
  • Cross-entropy = MLE: Minimizing $H(p,q)$ ≡ minimizing $D_{KL}(p\|q)$ ≡ maximizing log-likelihood
  • Mutual information: $I(X;Y) = D_{KL}(p_{XY}\|p_X p_Y) \geq 0$ — non-negativity from Gibbs
  • Data processing inequality: $X \to Y \to Z \implies I(X;Z) \leq I(X;Y)$ — information can only be lost
  • Differential entropy: Gaussian maximizes $h(X)$ for fixed variance — justifying the VAE Gaussian prior

Next in the Series

In Part 7: Linear Algebra, we tackle vectors, matrices, eigenvalues, and singular value decomposition — the computational backbone of neural network layers, PCA, and attention mechanisms.