The Zero-Probability Problem
An n-gram language model estimates $P(w_i | w_{i-n+1}, \ldots, w_{i-1})$ from corpus counts. The maximum likelihood estimate (MLE) is:
$$P_{\text{MLE}}(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$where $C(\cdot)$ denotes the count. Problem: Any bigram not observed in training gets $P = 0$. Since perplexity involves $\prod P(w_i|w_{i-1})$, a single zero term makes perplexity infinite. We need to assign non-zero probability to unseen events while maintaining $\sum_{w} P(w|h) = 1$.
Laplace (Add-$k$) Smoothing
Idea: Add a constant $k$ to every count, pretending we observed each n-gram $k$ extra times:
$$\boxed{P_{\text{Laplace}}(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}}$$where $|V|$ is the vocabulary size. When $k = 1$, this is classical Laplace smoothing (add-one).
Verification that probabilities sum to 1:
$$\sum_{w_i \in V} P_{\text{Laplace}}(w_i|w_{i-1}) = \frac{\sum_{w_i} [C(w_{i-1}, w_i) + k]}{C(w_{i-1}) + k|V|} = \frac{C(w_{i-1}) + k|V|}{C(w_{i-1}) + k|V|} = 1 \checkmark$$Bayesian interpretation: Laplace smoothing is the posterior mean of a Dirichlet-Multinomial model with uniform prior $\text{Dir}(k, k, \ldots, k)$. The prior effectively contributes $k$ pseudo-counts per outcome.
Absolute Discounting
Key insight (Church & Gale 1991): Empirically, the "true" count of an n-gram is approximately $C - d$ where $d \approx 0.75$ is a fixed discount. This motivates subtracting a constant rather than adding one:
$$P_{\text{AD}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d,\; 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{backoff}}(w_i)$$The discount $d$ is subtracted from each observed count, freeing probability mass $\lambda(w_{i-1})$ that is redistributed via a backoff distribution. Setting $\lambda$ to ensure normalization:
$$\lambda(w_{i-1}) = \frac{d \cdot |\{w : C(w_{i-1}, w) > 0\}|}{C(w_{i-1})}$$The numerator counts how many unique words follow $w_{i-1}$ (each contributes $d$ mass to the pot).
Kneser-Ney Smoothing
The innovation of Kneser-Ney (1995) is the choice of backoff distribution. Rather than using unigram frequency $P(w_i)$, it uses the number of distinct contexts a word appears in — its continuation probability:
$$\boxed{P_{\text{KN}}(w_i) = \frac{|\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}{|\{(w_{j-1}, w_j) : C(w_{j-1}, w_j) > 0\}|}}$$Intuition: The word "Francisco" has high unigram count but almost always follows "San". Its continuation count is low (1 unique context), so $P_{\text{KN}}(\text{Francisco})$ is appropriately small — it shouldn't receive much backoff mass.
Full interpolated Kneser-Ney:
$$P_{\text{IKN}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d,\; 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{KN}}(w_i)$$Modified Kneser-Ney (Chen & Goodman 1999) uses three discount values $d_1, d_2, d_{3+}$ depending on whether the n-gram was seen 1, 2, or 3+ times — yielding state-of-the-art n-gram performance before neural LMs.
import numpy as np
from collections import defaultdict
def kneser_ney_bigram(corpus_bigrams, d=0.75):
"""
Compute interpolated Kneser-Ney bigram probabilities.
Args:
corpus_bigrams: list of (w_{i-1}, w_i) tuples
d: discount value (typically 0.75)
Returns:
Function P(w_i | w_{i-1})
"""
# Count bigrams and unigrams
bigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for w_prev, w_curr in corpus_bigrams:
bigram_counts[(w_prev, w_curr)] += 1
context_counts[w_prev] += 1
# Continuation counts: how many unique left-contexts does w_i appear in?
continuation_counts = defaultdict(set)
for w_prev, w_curr in corpus_bigrams:
continuation_counts[w_curr].add(w_prev)
# Total number of unique bigram types
total_bigram_types = len(set(corpus_bigrams))
# P_KN(w_i) = |{w_{i-1} : C(w_{i-1}, w_i) > 0}| / |total bigram types|
def p_continuation(w):
return len(continuation_counts[w]) / total_bigram_types
# Number of unique words following context
unique_following = defaultdict(set)
for w_prev, w_curr in corpus_bigrams:
unique_following[w_prev].add(w_curr)
# Lambda: interpolation weight
def lam(w_prev):
return d * len(unique_following[w_prev]) / context_counts[w_prev]
# P_IKN(w_i | w_{i-1})
def p_kn(w_curr, w_prev):
numerator = max(bigram_counts[(w_prev, w_curr)] - d, 0)
first_term = numerator / context_counts[w_prev] if context_counts[w_prev] > 0 else 0
return first_term + lam(w_prev) * p_continuation(w_curr)
return p_kn
# Example: small corpus
corpus = "the cat sat on the mat the cat ate the food".split()
bigrams = [(corpus[i], corpus[i+1]) for i in range(len(corpus)-1)]
p_kn = kneser_ney_bigram(bigrams, d=0.75)
print("Interpolated Kneser-Ney bigram probabilities:")
print(f" P(cat | the) = {p_kn('cat', 'the'):.4f}")
print(f" P(mat | the) = {p_kn('mat', 'the'):.4f}")
print(f" P(dog | the) = {p_kn('dog', 'the'):.4f} (unseen bigram, gets backoff mass)")
print(f" P(sat | cat) = {p_kn('sat', 'cat'):.4f}")
# Verify normalization for context "the"
vocab = set(corpus)
total = sum(p_kn(w, 'the') for w in vocab)
print(f"\n Sum P(w|the) for vocab = {total:.4f} (approx 1.0 over full vocab)")
Interpolation Methods
Linear interpolation combines n-gram models of different orders:
$$P_{\text{interp}}(w_i | w_{i-2}, w_{i-1}) = \lambda_3\, P_3(w_i|w_{i-2},w_{i-1}) + \lambda_2\, P_2(w_i|w_{i-1}) + \lambda_1\, P_1(w_i)$$where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and the $\lambda_k$ are tuned on held-out data (typically via EM on a development set). This is valid because any convex combination of proper distributions is itself a proper distribution.