NLP Smoothing Mathematics

Prerequisites: Part 4: Probability (conditional probability, MLE). This specialist note provides the canonical derivation for smoothing content in the NLP: Statistical Language Models article.

The Zero-Probability Problem

An n-gram language model estimates $P(w_i | w_{i-n+1}, \ldots, w_{i-1})$ from corpus counts. The maximum likelihood estimate (MLE) is:

$$P_{\text{MLE}}(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where $C(\cdot)$ denotes the count. Problem: Any bigram not observed in training gets $P = 0$. Since perplexity involves $\prod P(w_i|w_{i-1})$, a single zero term makes perplexity infinite. We need to assign non-zero probability to unseen events while maintaining $\sum_{w} P(w|h) = 1$.

Laplace (Add-$k$) Smoothing

Idea: Add a constant $k$ to every count, pretending we observed each n-gram $k$ extra times:

$$\boxed{P_{\text{Laplace}}(w_i | w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}}$$

where $|V|$ is the vocabulary size. When $k = 1$, this is classical Laplace smoothing (add-one).

Verification that probabilities sum to 1:

$$\sum_{w_i \in V} P_{\text{Laplace}}(w_i|w_{i-1}) = \frac{\sum_{w_i} [C(w_{i-1}, w_i) + k]}{C(w_{i-1}) + k|V|} = \frac{C(w_{i-1}) + k|V|}{C(w_{i-1}) + k|V|} = 1 \checkmark$$

Bayesian interpretation: Laplace smoothing is the posterior mean of a Dirichlet-Multinomial model with uniform prior $\text{Dir}(k, k, \ldots, k)$. The prior effectively contributes $k$ pseudo-counts per outcome.

Limitation: Add-1 smoothing moves too much mass to unseen events when $|V|$ is large (typical NLP: $|V| \sim 10^5$). For bigrams, it can reduce MLE probabilities by a factor of $|V|$. In practice, $k \ll 1$ (e.g., $k = 0.01$) or more sophisticated methods are preferred.

Absolute Discounting

Key insight (Church & Gale 1991): Empirically, the "true" count of an n-gram is approximately $C - d$ where $d \approx 0.75$ is a fixed discount. This motivates subtracting a constant rather than adding one:

$$P_{\text{AD}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d,\; 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{backoff}}(w_i)$$

The discount $d$ is subtracted from each observed count, freeing probability mass $\lambda(w_{i-1})$ that is redistributed via a backoff distribution. Setting $\lambda$ to ensure normalization:

$$\lambda(w_{i-1}) = \frac{d \cdot |\{w : C(w_{i-1}, w) > 0\}|}{C(w_{i-1})}$$

The numerator counts how many unique words follow $w_{i-1}$ (each contributes $d$ mass to the pot).

Kneser-Ney Smoothing

The innovation of Kneser-Ney (1995) is the choice of backoff distribution. Rather than using unigram frequency $P(w_i)$, it uses the number of distinct contexts a word appears in — its continuation probability:

$$\boxed{P_{\text{KN}}(w_i) = \frac{|\{w_{i-1} : C(w_{i-1}, w_i) > 0\}|}{|\{(w_{j-1}, w_j) : C(w_{j-1}, w_j) > 0\}|}}$$

Intuition: The word "Francisco" has high unigram count but almost always follows "San". Its continuation count is low (1 unique context), so $P_{\text{KN}}(\text{Francisco})$ is appropriately small — it shouldn't receive much backoff mass.

Full interpolated Kneser-Ney:

$$P_{\text{IKN}}(w_i | w_{i-1}) = \frac{\max(C(w_{i-1}, w_i) - d,\; 0)}{C(w_{i-1})} + \lambda(w_{i-1})\, P_{\text{KN}}(w_i)$$

Modified Kneser-Ney (Chen & Goodman 1999) uses three discount values $d_1, d_2, d_{3+}$ depending on whether the n-gram was seen 1, 2, or 3+ times — yielding state-of-the-art n-gram performance before neural LMs.

import numpy as np
from collections import defaultdict

def kneser_ney_bigram(corpus_bigrams, d=0.75):
    """
    Compute interpolated Kneser-Ney bigram probabilities.
    
    Args:
        corpus_bigrams: list of (w_{i-1}, w_i) tuples
        d: discount value (typically 0.75)
    
    Returns:
        Function P(w_i | w_{i-1})
    """
    # Count bigrams and unigrams
    bigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for w_prev, w_curr in corpus_bigrams:
        bigram_counts[(w_prev, w_curr)] += 1
        context_counts[w_prev] += 1
    
    # Continuation counts: how many unique left-contexts does w_i appear in?
    continuation_counts = defaultdict(set)
    for w_prev, w_curr in corpus_bigrams:
        continuation_counts[w_curr].add(w_prev)
    
    # Total number of unique bigram types
    total_bigram_types = len(set(corpus_bigrams))
    
    # P_KN(w_i) = |{w_{i-1} : C(w_{i-1}, w_i) > 0}| / |total bigram types|
    def p_continuation(w):
        return len(continuation_counts[w]) / total_bigram_types
    
    # Number of unique words following context
    unique_following = defaultdict(set)
    for w_prev, w_curr in corpus_bigrams:
        unique_following[w_prev].add(w_curr)
    
    # Lambda: interpolation weight
    def lam(w_prev):
        return d * len(unique_following[w_prev]) / context_counts[w_prev]
    
    # P_IKN(w_i | w_{i-1})
    def p_kn(w_curr, w_prev):
        numerator = max(bigram_counts[(w_prev, w_curr)] - d, 0)
        first_term = numerator / context_counts[w_prev] if context_counts[w_prev] > 0 else 0
        return first_term + lam(w_prev) * p_continuation(w_curr)
    
    return p_kn

# Example: small corpus
corpus = "the cat sat on the mat the cat ate the food".split()
bigrams = [(corpus[i], corpus[i+1]) for i in range(len(corpus)-1)]

p_kn = kneser_ney_bigram(bigrams, d=0.75)

print("Interpolated Kneser-Ney bigram probabilities:")
print(f"  P(cat | the) = {p_kn('cat', 'the'):.4f}")
print(f"  P(mat | the) = {p_kn('mat', 'the'):.4f}")
print(f"  P(dog | the) = {p_kn('dog', 'the'):.4f}  (unseen bigram, gets backoff mass)")
print(f"  P(sat | cat) = {p_kn('sat', 'cat'):.4f}")

# Verify normalization for context "the"
vocab = set(corpus)
total = sum(p_kn(w, 'the') for w in vocab)
print(f"\n  Sum P(w|the) for vocab = {total:.4f} (approx 1.0 over full vocab)")

Interpolation Methods

Linear interpolation combines n-gram models of different orders:

$$P_{\text{interp}}(w_i | w_{i-2}, w_{i-1}) = \lambda_3\, P_3(w_i|w_{i-2},w_{i-1}) + \lambda_2\, P_2(w_i|w_{i-1}) + \lambda_1\, P_1(w_i)$$

where $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and the $\lambda_k$ are tuned on held-out data (typically via EM on a development set). This is valid because any convex combination of proper distributions is itself a proper distribution.

Historical context: Modified Kneser-Ney was the gold standard for language modeling from 1999 until neural LMs (2013+). It remains the strongest non-neural baseline and its continuation probability idea influenced subword tokenization (BPE frequency heuristics).

Table of Contents

The Zero-Probability Problem

Laplace (Add-$k$) Smoothing

Absolute Discounting

Kneser-Ney Smoothing

Interpolation Methods

Related Articles

Part 4: Probability Theory

NLP: Statistical Language Models