
Part 6: Information Theory

April 26, 2026 · Wasil Zafar · 18 min read

Every time you minimize cross-entropy loss, you're applying information theory. Claude Shannon's framework for quantifying information provides the mathematical foundation for loss functions, compression, feature selection, and the latent space of VAEs.

Table of Contents

  1. Why Information Theory in ML
  2. Self-Information
  3. Shannon Entropy
  4. Joint & Conditional Entropy
  5. Mutual Information
  6. Cross-Entropy
  7. KL Divergence
  8. Relationships Diagram
  9. ML Connections
  10. Practice Exercises
  11. Conclusion & Next Steps

Why Information Theory in ML

Claude Shannon (1948) asked a deceptively simple question: how do we measure the amount of "information" in a message? The answer spawned a mathematical framework that now underlies much of modern machine learning:
  • Cross-entropy loss in neural networks (classification training objective)
  • KL divergence in VAEs (regularization in the latent space)
  • Information gain in decision trees (split criterion)
  • Mutual information for feature selection and representation learning
  • Entropy regularization in reinforcement learning (maximum entropy RL)
  • Perplexity as a language model evaluation metric

Self-Information

The self-information (or surprise) of an event with probability $p$ is:

$$I(x) = -\log_2 p(x) \quad \text{(in bits)}$$

Intuition: rare events are more "surprising" (carry more information) than common ones.

  • $p = 1$: certain event → $I = 0$ bits (no surprise)
  • $p = 0.5$: fair coin → $I = 1$ bit (one binary question answered)
  • $p = 0.01$: rare event → $I \approx 6.64$ bits (highly surprising)

Natural units: Using $\log_e$ (natural log) gives nats; using $\log_2$ gives bits. In ML, we almost always use natural log — this is why cross-entropy loss uses $\ln$, not $\log_2$. The concepts are identical; only the unit changes.
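
A tiny sketch of these values in both units (plain NumPy, using $I(x) = \log(1/p(x))$):

import numpy as np

# Self-information I(x) = log(1 / p(x)), reported in bits and in nats
for p in [1.0, 0.5, 0.25, 0.01]:
    bits = np.log2(1.0 / p)
    nats = np.log(1.0 / p)
    print(f"p = {p:<4}: I = {bits:.2f} bits = {nats:.2f} nats")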

Shannon Entropy

Entropy is the expected (average) self-information of a distribution $p$:

$$H(X) = \mathbb{E}[-\log p(X)] = -\sum_{x} p(x) \log p(x)$$

Entropy measures the uncertainty of a distribution:

  • Maximum entropy: Uniform distribution — all outcomes equally likely, maximum uncertainty
  • Minimum entropy: Deterministic distribution — one outcome has $p=1$, zero uncertainty

For a binary variable with $P(X=1) = p$, the binary entropy function:

$$H_b(p) = -p \log p - (1-p)\log(1-p)$$

This is maximized at $p = 0.5$ giving $H = 1$ bit, and equals 0 at $p \in \{0, 1\}$.

import numpy as np

def entropy(probs, base=np.e):
    """Compute Shannon entropy. Default: nats (base e). Use base=2 for bits."""
    probs = np.array(probs, dtype=float)
    # Convention: 0 * log(0) = 0
    mask = probs > 0
    return -np.sum(probs[mask] * np.log(probs[mask]) / np.log(base))

# Discrete distributions: compare entropies
uniform_4   = [0.25, 0.25, 0.25, 0.25]
skewed_4    = [0.7,  0.1,  0.1,  0.1]
det_4       = [1.0,  0.0,  0.0,  0.0]

print("Shannon Entropy (bits):")
print(f"  Uniform:      {entropy(uniform_4, base=2):.4f} bits  (max = log2(4)={np.log2(4):.4f})")
print(f"  Skewed:       {entropy(skewed_4, base=2):.4f} bits")
print(f"  Deterministic:{entropy(det_4, base=2):.4f} bits  (min = 0)")

# Binary entropy function
print("\nBinary entropy H(p):")
for p in [0.01, 0.1, 0.5, 0.9, 0.99]:
    h = entropy([p, 1-p], base=2)
    print(f"  p={p}: H={h:.4f} bits")

Joint & Conditional Entropy

Joint entropy of two random variables $X, Y$:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Conditional entropy — uncertainty in $Y$ given $X$:

$$H(Y \mid X) = H(X, Y) - H(X) = -\sum_{x,y} p(x,y) \log p(y \mid x)$$

Key inequalities: $H(Y \mid X) \leq H(Y)$ — knowing $X$ cannot increase uncertainty about $Y$. And $H(X, Y) \leq H(X) + H(Y)$ — with equality when $X \perp Y$.

Mutual Information

Mutual information measures how much knowing $X$ reduces uncertainty about $Y$:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$

Properties: $I(X;Y) \geq 0$ (knowing more can't hurt), and $I(X;Y) = 0 \iff X \perp Y$.
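
To make these quantities concrete, here is a short sketch that builds an arbitrary toy joint distribution (the numbers are made up), recovers the marginals, and checks the chain rule $H(Y \mid X) = H(X, Y) - H(X)$ along with $I(X;Y) = H(X) + H(Y) - H(X, Y)$:

import numpy as np

# Toy joint distribution p(x, y) over two binary variables (rows: x, columns: y)
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

def H(p):
    """Shannon entropy in bits; ignores zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy.ravel())
H_Y_given_X = H_XY - H_X        # conditional entropy via the chain rule
I_XY = H_X + H_Y - H_XY         # mutual information

print(f"H(X) = {H_X:.4f}, H(Y) = {H_Y:.4f}, H(X,Y) = {H_XY:.4f} bits")
print(f"H(Y|X) = {H_Y_given_X:.4f} <= H(Y) = {H_Y:.4f}")
print(f"I(X;Y) = {I_XY:.4f} = H(Y) - H(Y|X) = {H_Y - H_Y_given_X:.4f}")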

Mutual Information for Feature Selection

A feature is useful if it has high mutual information with the target variable $Y$. In decision trees, the information gain split criterion is exactly the mutual information between the label and the feature:

$$\text{IG}(Y; A) = H(Y) - H(Y \mid A)$$

Where $A$ is the feature (attribute). ID3 and C4.5 algorithms greedily select the feature maximizing information gain at each node. sklearn.feature_selection.mutual_info_classif computes $I(X_i; Y)$ for each feature $X_i$, giving a model-agnostic feature importance score.
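
As a rough usage sketch of that scorer (assuming scikit-learn is installed; the synthetic dataset and its parameters below are arbitrary illustrations, not anything from this article):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic classification data: only some features carry signal about y
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=1, random_state=0)

# Estimated mutual information I(X_i; Y) for each feature, in nats
mi_scores = mutual_info_classif(X, y, random_state=0)
for i, score in enumerate(mi_scores):
    print(f"I(X_{i}; Y) = {score:.4f} nats")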

Cross-Entropy

Cross-entropy measures the expected number of bits needed to encode data from distribution $p$ using a code designed for distribution $q$:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

When $q = p$, this reduces to the ordinary entropy $H(p)$. In general $H(p, q) \geq H(p)$, with equality if and only if $q = p$: encoding with the wrong distribution always costs extra bits.

Cross-entropy loss in neural networks: Let $p$ be the true label distribution (one-hot vector) and $q$ be the model's softmax output. Then: $$\mathcal{L}_{\text{CE}} = -\sum_c y_c \log \hat{y}_c$$ For a one-hot label (true class $c^*$), this simplifies to $-\log \hat{y}_{c^*}$ — maximizing the predicted probability of the correct class. Minimizing cross-entropy loss is equivalent to maximizing the log-likelihood (MLE) under the categorical distribution.
import numpy as np

def cross_entropy(p_true, q_pred):
    """
    Compute cross-entropy H(p, q).
    p_true: true distribution (e.g., one-hot labels)
    q_pred: predicted distribution (e.g., softmax output)
    """
    p_true = np.array(p_true, dtype=float)
    q_pred = np.clip(q_pred, 1e-12, 1.0)  # prevent log(0)
    return -np.sum(p_true * np.log(q_pred))

# Example: 3-class classification
# True label: class 0 (one-hot)
p_true = [1.0, 0.0, 0.0]

# Good prediction: high probability on true class
q_good = [0.9, 0.07, 0.03]
# Bad prediction: spread probability
q_bad  = [0.3, 0.4, 0.3]
# Perfect prediction
q_perfect = [1.0 - 1e-9, 5e-10, 5e-10]

print("Cross-entropy H(p, q):")
print(f"  Good prediction:    {cross_entropy(p_true, q_good):.4f}")
print(f"  Bad prediction:     {cross_entropy(p_true, q_bad):.4f}")
print(f"  Perfect prediction: {cross_entropy(p_true, q_perfect):.4f}")
print(f"  Entropy H(p):       {cross_entropy(p_true, p_true):.4f}  (lower bound)")

KL Divergence

Kullback-Leibler divergence (relative entropy) measures how different distribution $q$ is from $p$:

$$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Key properties:

  • $D_{KL}(p \| q) \geq 0$ always (Gibbs inequality)
  • $D_{KL}(p \| q) = 0 \iff p = q$ everywhere
  • Asymmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general
  • "Forward KL" ($D_{KL}(p \| q)$): penalizes $q$ placing zero mass where $p$ is nonzero
  • "Reverse KL" ($D_{KL}(q \| p)$): penalizes $q$ placing mass where $p$ is zero
KL Divergence in Variational Autoencoders

VAEs are trained by maximizing the ELBO (Evidence Lower BOund):

$$\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q_\phi(z|x) \| p(z))}_{\text{regularization}}$$

The KL term forces the approximate posterior $q_\phi(z|x)$ (encoder output) to stay close to the prior $p(z) = \mathcal{N}(0, I)$. This prevents the encoder from memorizing by collapsing to point estimates, encouraging a smooth, continuous latent space.

For two Gaussians, KL has a closed form: $D_{KL}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0,1)) = \frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$.
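
Here is a minimal numerical sketch of that closed form; the helper name and the example $(\mu, \sigma^2)$ pairs are my own choices for illustration:

import numpy as np

def kl_gaussian_to_standard_normal(mu, sigma2):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) in nats."""
    return 0.5 * (mu**2 + sigma2 - 1.0 - np.log(sigma2))

# A posterior matching the prior has zero KL; shifting the mean or shrinking
# the variance away from 1 is penalized.
for mu, sigma2 in [(0.0, 1.0), (1.0, 1.0), (0.0, 0.25), (2.0, 0.5)]:
    kl = kl_gaussian_to_standard_normal(mu, sigma2)
    print(f"mu={mu}, sigma^2={sigma2}: KL = {kl:.4f} nats")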

The Jensen-Shannon divergence is a symmetrized, bounded version of KL:

$$D_{JS}(p \| q) = \frac{1}{2} D_{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + \frac{1}{2} D_{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right) \in [0, \log 2]$$

At the optimal discriminator, the original GAN objective (Goodfellow et al. 2014) reduces to minimizing $D_{JS}$ between the data and generator distributions; JS divergence also appears as an evaluation metric in text generation.

import numpy as np

def kl_divergence(p, q):
    """KL(p || q). p and q are probability arrays summing to 1."""
    p = np.array(p, dtype=float)
    q = np.clip(q, 1e-12, 1.0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_divergence(p, q):
    """Jensen-Shannon divergence (symmetric, bounded in [0, log(2)])."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# True label distribution vs model predictions
p_true  = np.array([0.7, 0.2, 0.1])     # true distribution
q_close = np.array([0.65, 0.25, 0.10])  # similar
q_far   = np.array([0.1, 0.1, 0.8])     # very different

print("KL and JS Divergences:")
print(f"  KL(true || close) = {kl_divergence(p_true, q_close):.6f}")
print(f"  KL(true || far)   = {kl_divergence(p_true, q_far):.6f}")
print(f"  JS(true, close)   = {js_divergence(p_true, q_close):.6f}")
print(f"  JS(true, far)     = {js_divergence(p_true, q_far):.6f}")
print(f"\nKL asymmetry:")
print(f"  KL(p || q_far)   = {kl_divergence(p_true, q_far):.6f}")
print(f"  KL(q_far || p)   = {kl_divergence(q_far, p_true):.6f}")

Relationships Between Information Quantities

Information Theory: Key Relationships
flowchart TD
    SI["Self-Information: I(x) = -log p(x)"]
    H["Shannon Entropy: H(X) = E[I(X)]"]
    JH["Joint Entropy: H(X,Y)"]
    CH["Conditional Entropy: H(Y|X) = H(X,Y) - H(X)"]
    MI["Mutual Information: I(X;Y) = H(X) - H(X|Y)"]
    CE["Cross-Entropy: H(p,q) = -Σ p log q"]
    KL["KL Divergence: D_KL(p||q) = H(p,q) - H(p)"]
    SI -->|"Expected value"| H
    H -->|"Joint of two vars"| JH
    JH -->|"Subtract H(X)"| CH
    H --> MI
    CH --> MI
    H -->|"Replace log p with log q"| CE
    CE -->|"Subtract H(p)"| KL

ML Connections Summary

Decision Tree Splitting via Entropy

At each node, the ID3 algorithm picks the feature $A$ that maximizes:

$$\text{IG}(Y; A) = H(Y) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(Y \mid A = v)$$

Where $S_v$ is the subset of training examples with $A = v$. High information gain = feature reduces uncertainty about class label the most. Gini impurity (used in sklearn's CART) is a related but different measure: $G = 1 - \sum_c p_c^2$.
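
Below is a small, self-contained sketch of this split criterion; the toy outlook/play data is invented, and this is not sklearn's implementation (sklearn's trees use Gini or entropy criteria internally):

import numpy as np

def entropy_bits(labels):
    """Entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(Y; A) = H(Y) - sum_v |S_v|/|S| * H(Y | A=v)."""
    total = entropy_bits(labels)
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += len(subset) / len(labels) * entropy_bits(subset)
    return total - weighted

# Hypothetical toy data: "outlook" feature vs a binary play / don't-play label
outlook = np.array(["sunny", "sunny", "rain", "rain", "overcast", "overcast"])
play    = np.array([0, 0, 1, 1, 1, 1])
print(f"IG(play; outlook) = {information_gain(outlook, play):.4f} bits")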

Perplexity — evaluating language models: Perplexity is $2^{H(p,q)}$ (or $e^{H(p,q)}$ in nats). For a language model with cross-entropy loss $\mathcal{L}$ nats per token: $$\text{Perplexity} = e^\mathcal{L}$$ Lower perplexity = model assigns higher probability to actual text = better language model. A perplexity of $k$ means the model is as uncertain as if it had to choose uniformly among $k$ equally likely options at each step.
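
A back-of-the-envelope sketch, using invented per-token probabilities to show the relationship between average cross-entropy in nats and perplexity:

import numpy as np

# Hypothetical probabilities a language model assigned to the actual next tokens
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

cross_entropy_nats = -np.mean(np.log(token_probs))  # average loss per token, in nats
perplexity = np.exp(cross_entropy_nats)
print(f"Cross-entropy: {cross_entropy_nats:.4f} nats/token")
print(f"Perplexity:    {perplexity:.2f}")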

Practice Exercises

Exercise 1: Entropy of a Fair Die

(1) What is the entropy of a fair 6-sided die roll in bits? (2) What is the entropy of a loaded die where face 6 appears with probability 0.5 and others with 0.1 each? Which has more uncertainty?

Answer:

(1) Fair die: $H = -6 \times \frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.585$ bits.

(2) Loaded die: $H = -0.5\log_2(0.5) - 5 \times 0.1\log_2(0.1) = 0.5 + 5 \times 0.332 = 0.5 + 1.661 \approx 2.161$ bits.

The fair die has higher entropy (2.585 bits) — more uncertainty since all outcomes are equally likely. The loaded die has lower entropy because face 6 is predictable. The uniform distribution maximizes entropy for a given number of outcomes.
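
A quick numerical check of both values, assuming SciPy is available:

from scipy.stats import entropy as scipy_entropy

fair   = [1/6] * 6
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(f"Fair die:   {scipy_entropy(fair, base=2):.4f} bits")    # ~2.585
print(f"Loaded die: {scipy_entropy(loaded, base=2):.4f} bits")  # ~2.161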

Exercise 2: Cross-Entropy Loss for Binary Classification

For binary classification (labels $y \in \{0,1\}$), show that the general cross-entropy formula $H(p, q) = -\sum_x p(x)\log q(x)$ reduces to the familiar binary cross-entropy: $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$.

Answer:

For a single binary example with true label $y \in \{0,1\}$: the true distribution is $p = [y, 1-y]$ (one-hot) and the predicted distribution is $q = [\hat{y}, 1-\hat{y}]$.

$H(p, q) = -p_0 \log q_0 - p_1 \log q_1 = -y \log \hat{y} - (1-y)\log(1-\hat{y})$. QED.

When $y=1$: only $-\log\hat{y}$ matters. When $y=0$: only $-\log(1-\hat{y})$ matters. This is exactly the binary cross-entropy implemented in BCELoss in PyTorch and BinaryCrossentropy in Keras.
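
A short NumPy sketch of this reduction (the helper below is just the formula, not PyTorch's BCELoss):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """-y log(y_hat) - (1-y) log(1 - y_hat); the general H(p, q) applied to [y, 1-y] vs [y_hat, 1-y_hat]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # prevent log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_cross_entropy(1, 0.9))   # ~0.105: confident and correct
print(binary_cross_entropy(1, 0.1))   # ~2.303: confident and wrong
print(binary_cross_entropy(0, 0.1))   # ~0.105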

Conclusion & Next Steps

Information theory gives ML a mathematically grounded way to measure uncertainty, distributional differences, and feature relevance. Key takeaways:

  • Self-information: $I(x) = -\log p(x)$ — rare events are more surprising
  • Entropy: $H(X)$ — average uncertainty; maximized by uniform distribution
  • Cross-entropy: $H(p,q)$ — the loss function for classification in neural networks
  • KL divergence: $D_{KL}(p\|q)$ — asymmetric distance between distributions; used in VAE training
  • Mutual information: $I(X;Y)$ — feature relevance measure; decision tree criterion
  • Key relationship: $H(p,q) = H(p) + D_{KL}(p\|q)$ — cross-entropy = entropy + extra cost of mismatch

Next in the Series

In Part 7: Linear Algebra, we tackle vectors, matrices, eigenvalues, and singular value decomposition — the computational backbone of neural network layers, PCA, and attention mechanisms.