
Part 3: Combinatorics

April 26, 2026 · Wasil Zafar · 14 min read

Why does hyperparameter search explode? Why are there so many possible neural network architectures? Why is the binomial distribution called binomial? Combinatorics — the mathematics of counting — answers all of these.

Table of Contents

  1. Why Counting Matters in ML
  2. Fundamental Counting Principles
  3. Factorial
  4. Permutations (Order Matters)
  5. Combinations (Order Doesn't Matter)
  6. Binomial Theorem & Pascal's Triangle
  7. Multinomials & Multi-class
  8. ML Applications
  9. Practice Exercises
  10. Conclusion & Next Steps

Why Counting Matters in ML

Counting might sound elementary, but combinatorics is the mathematical engine behind some of the most important practical questions in ML:

Where combinatorics appears in ML:
  • Hyperparameter search: With 5 hyperparameters each having 4 options, you have $4^5 = 1{,}024$ combinations to evaluate.
  • Feature selection: Choosing $k$ features from $d$ total: $\binom{d}{k}$ possible subsets.
  • Neural architecture: The number of possible architectures (layers, widths, activation functions) is combinatorially enormous.
  • Probability: The binomial distribution uses combinations: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$.
  • Tokenization: BPE (Byte-Pair Encoding) selects from all possible character-pair merges combinatorially.
  • Decision trees: The number of possible decision tree structures grows super-exponentially with depth.

Fundamental Counting Principles

1 — The Multiplication Rule

If a task consists of sequential, independent choices — first choosing from $n_1$ options, then from $n_2$ options, … — the total number of outcomes is:

$$n_1 \times n_2 \times \cdots \times n_k$$
Example: Hyperparameter Grid Search

Suppose you're searching over:

  • Learning rate: 3 choices {0.001, 0.01, 0.1}
  • Batch size: 3 choices {32, 64, 128}
  • Architecture: 2 choices {shallow, deep}
  • Optimizer: 2 choices {Adam, SGD}

Total combinations: $3 \times 3 \times 2 \times 2 = 36$. This is why random search and Bayesian optimization are preferred over exhaustive grid search — even with just 4 hyperparameters, a coarse grid produces dozens to hundreds of experiments, each requiring a full training run.
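As a quick check, here is a minimal sketch that enumerates this grid with itertools.product (the grid values are copied from the example above):

from itertools import product

# The example grid from above
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]
architectures = ["shallow", "deep"]
optimizers = ["Adam", "SGD"]

# product() yields every configuration: the multiplication rule in code
grid = list(product(learning_rates, batch_sizes, architectures, optimizers))
print(f"Total configurations: {len(grid)}")  # 3 * 3 * 2 * 2 = 36
print(f"First configuration: {grid[0]}")     # (0.001, 32, 'shallow', 'Adam')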

2 — The Addition Rule

If two events are mutually exclusive (can't both happen), the total number of outcomes is the sum:

$$\text{total} = n_1 + n_2 \quad (\text{if mutually exclusive})$$

For non-disjoint events, apply inclusion-exclusion: $n_1 + n_2 - |\text{overlap}|$ (from set theory, Part 2).
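A quick illustration with Python sets (the elements are arbitrary example values):

# Inclusion-exclusion on two overlapping sets
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}
print(len(A) + len(B) - len(A & B))  # 5 + 4 - 2 = 7
print(len(A | B))                    # 7, the union size, computed directly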

Factorial

Before counting permutations and combinations, you need to understand the factorial — one of the most fundamental operations in combinatorics and probability.

The factorial of a non-negative integer $n$, written $n!$, is the product of all positive integers from 1 up to $n$:

$$n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$$

With the special case:

$$0! = 1 \qquad \text{(by convention — there is exactly one way to arrange zero items)}$$
| $n$ | $n!$ | Calculation | Meaning |
|-----|------|-------------|---------|
| 0 | 1 | — (by definition) | 1 way to arrange 0 items |
| 1 | 1 | $1$ | 1 way to arrange 1 item |
| 2 | 2 | $2 \times 1$ | 2 ways to arrange 2 items |
| 3 | 6 | $3 \times 2 \times 1$ | 6 ways to arrange 3 items |
| 4 | 24 | $4 \times 3 \times 2 \times 1$ | 24 ways to arrange 4 items |
| 5 | 120 | $5 \times 4 \times 3 \times 2 \times 1$ | 120 ways to arrange 5 items |
| 10 | 3,628,800 | $10!$ | Over 3.6 million arrangements! |
| 20 | $\approx 2.4 \times 10^{18}$ | $20!$ | Astronomically large |
Factorials grow incredibly fast: $n!$ eventually outgrows any exponential function $c^n$. This is why:
  • Brute-force sorting by checking all $n!$ orderings of $n$ items is completely infeasible for large $n$
  • Exhaustive search over all $2^n$ feature subsets (or all $n!$ feature orderings) is intractable — use greedy methods instead
  • The number of ways to order $n$ tokens is $n!$ — why position encodings are needed

Recursive definition: $n! = n \times (n-1)!$ for $n \geq 1$, and $0! = 1$. This recursive structure makes factorials easy to compute programmatically.

Stirling's approximation is useful when $n$ is so large that computing $n!$ directly is impractical:

$$n! \approx \sqrt{2\pi n} \left(\frac{n}{e}\right)^n \qquad \text{(very accurate for large } n \text{)}$$
import math

# Factorial basics
for n in range(0, 11):
    print(f"{n:2d}! = {math.factorial(n):>12,}")

# Recursive implementation (to understand the structure)
def factorial_recursive(n):
    """Recursive: n! = n * (n-1)!"""
    if n == 0:
        return 1
    return n * factorial_recursive(n - 1)

print(f"\nRecursive: 5! = {factorial_recursive(5)}")  # 120

# Iterative implementation
def factorial_iterative(n):
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

print(f"Iterative: 5! = {factorial_iterative(5)}")   # 120

# Stirling's approximation for large n
n = 100
exact = math.factorial(n)
stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
print(f"\nFor n={n}:")
print(f"  log10(exact)    = {math.log10(exact):.4f}")
print(f"  log10(stirling) = {math.log10(stirling):.4f}")
print(f"  Relative error  = {abs(exact - stirling)/exact:.6f}")

Permutations — When Order Matters

A permutation is an ordered arrangement. When order matters, different arrangements are counted as distinct outcomes.

Permutations of $n$ items taken $r$ at a time (no repetition):

$$P(n, r) = \frac{n!}{(n-r)!} = n \cdot (n-1) \cdot (n-2) \cdots (n-r+1)$$

All permutations of $n$ items: $P(n, n) = n!$


Counting Permutations: 3 items taken 2 at a time
flowchart TD
    S["3 items: {A, B, C}\nChoose 2 in order"] --> A1["First: A"]
    S --> B1["First: B"]
    S --> C1["First: C"]
    A1 --> AB["(A,B)"]
    A1 --> AC["(A,C)"]
    B1 --> BA["(B,A)"]
    B1 --> BC["(B,C)"]
    C1 --> CA["(C,A)"]
    C1 --> CB["(C,B)"]
    AB & AC & BA & BC & CA & CB --> T["P(3,2) = 3!/(3-2)! = 6 permutations"]

    style S fill:#132440,stroke:#132440,color:#fff
    style T fill:#3B9797,stroke:#3B9797,color:#fff
                            
Why permutations appear in ML: Attention mechanisms in Transformers are permutation-equivariant — the number of ways $n$ tokens can be ordered is $n!$. For $n=512$ tokens, $512!$ is astronomically large. Position encodings are added precisely because the model would otherwise treat all orderings as equivalent.
import math
from itertools import permutations

# Permutations formula
def P(n, r):
    return math.factorial(n) // math.factorial(n - r)

print(f"P(5,3) = {P(5,3)}")   # 60
print(f"P(10,4) = {P(10,4)}") # 5040

# Generate all permutations of 3 items taken 2 at a time
items = ['A', 'B', 'C']
all_perms = list(permutations(items, 2))
print(f"\nAll P(3,2) permutations: {all_perms}")
print(f"Count: {len(all_perms)}")  # 6

# Factorial grows fast — this is why neural architecture search is hard
for n in range(1, 13):
    print(f"{n}! = {math.factorial(n):,}")

Combinations — When Order Doesn't Matter

A combination is a selection where order doesn't matter. $\{A, B, C\}$ and $\{C, B, A\}$ are the same combination but different permutations.

Combinations of $n$ items taken $r$ at a time:

$$\binom{n}{r} = C(n,r) = \frac{n!}{r!\,(n-r)!} = \frac{P(n,r)}{r!}$$

Read $\binom{n}{r}$ as "$n$ choose $r$." We divide by $r!$ to remove the $r!$ orderings of each selection that permutations would count separately.

Key identities:

$$\binom{n}{0} = \binom{n}{n} = 1 \qquad \binom{n}{r} = \binom{n}{n-r} \qquad \sum_{r=0}^{n} \binom{n}{r} = 2^n$$
Example: Why Exhaustive Feature Selection Fails

With $d$ features, selecting the best subset of size $k$ requires evaluating $\binom{d}{k}$ subsets. For $d=50$, $k=10$: $\binom{50}{10} = 10,272,278,170$ — over 10 billion subsets. Even evaluating each in 1 millisecond would take about 119 days. This combinatorial explosion explains why L1 regularization (LASSO), which implicitly selects features by pushing weights to zero, is so practically valuable.

import math
from itertools import combinations

# Combinations formula
def C(n, r):
    return math.comb(n, r)   # math.comb available in Python 3.8+

print(f"C(5,2) = {C(5,2)}")    # 10
print(f"C(10,3) = {C(10,3)}")  # 120

# Generate all combinations of 4 items taken 2 at a time
items = ['a', 'b', 'c', 'd']
all_combs = list(combinations(items, 2))
print(f"\nAll C(4,2) combinations: {all_combs}")
print(f"Count: {len(all_combs)}")   # 6

# Combinatorial explosion in feature selection
print("\nFeature subsets of size k from d features:")
for d in [10, 20, 50, 100]:
    k = d // 5  # select 20% of features
    count = C(d, k)
    print(f"  d={d:3d}, k={k:2d}: C({d},{k}) = {count:,}")

Binomial Theorem & Pascal's Triangle

The binomial theorem expands $(a + b)^n$ using combinations:

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k = \binom{n}{0}a^n + \binom{n}{1}a^{n-1}b + \cdots + \binom{n}{n}b^n$$

The coefficients $\binom{n}{k}$ form Pascal's Triangle, where each entry is the sum of the two entries above it:

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k} \quad \text{(Pascal's identity)}$$
The Binomial Distribution connection: When you flip a biased coin $n$ times with $P(\text{heads}) = p$, the probability of getting exactly $k$ heads is: $$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$$ The $\binom{n}{k}$ term counts the number of ways to arrange $k$ heads in $n$ flips. This is called "binomial" precisely because the coefficients come from the binomial theorem. We'll study this distribution in depth in Part 4.
import math

# Pascal's Triangle
def pascal_triangle(rows):
    triangle = []
    for n in range(rows):
        row = [math.comb(n, k) for k in range(n + 1)]
        triangle.append(row)
    return triangle

triangle = pascal_triangle(8)
print("Pascal's Triangle (rows 0–7):")
for i, row in enumerate(triangle):
    padding = " " * (7 - i) * 2
    print(padding + "  ".join(f"{x:3d}" for x in row))

# Verify binomial theorem: (1+1)^n = sum of row n = 2^n
for n in range(8):
    row_sum = sum(triangle[n])
    print(f"Row {n} sum = {row_sum} = 2^{n} = {2**n}")
import math

# Binomial expansion
def binomial_expand(a_coef, b_coef, n, show_terms=True):
    """Expand (a_coef * a + b_coef * b)^n numerically for a=b=1."""
    terms = []
    total = 0
    for k in range(n + 1):
        coef = math.comb(n, k) * (a_coef ** (n - k)) * (b_coef ** k)
        terms.append(coef)
        total += coef
    if show_terms:
        print(f"Coefficients of (a+b)^{n}: {terms}")
        print(f"Sum (a=b=1): {total} = {a_coef+b_coef}^{n} = {(a_coef+b_coef)**n}")
    return terms

binomial_expand(1, 1, 5)  # 1 5 10 10 5 1

# Binomial probability: 3 heads in 5 fair coin flips
n, k, p = 5, 3, 0.5
prob = math.comb(n, k) * (p**k) * ((1-p)**(n-k))
print(f"\nP(X=3) for n=5, p=0.5: {prob:.4f}")  # 0.3125

Multinomials & Multi-class Classification

The multinomial coefficient generalizes combinations to distributing $n$ objects into $k$ groups:

$$\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2!\, \cdots\, n_k!} \quad \text{where } n_1 + n_2 + \cdots + n_k = n$$

The multinomial distribution describes drawing from $k$ categories with probabilities $p_1, \ldots, p_k$:

$$P(X_1 = n_1, \ldots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!} p_1^{n_1} \cdots p_k^{n_k}$$
Softmax and the multinomial distribution: The softmax function outputs a probability vector $(p_1, \ldots, p_k)$ summing to 1. A single sample from this distribution follows the categorical distribution (multinomial with $n=1$). Cross-entropy loss (Part 6) measures how well your model's softmax output matches the true class distribution.
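As a minimal sketch, the multinomial coefficient and probability mass function above can be computed directly with the standard library (the counts and class probabilities below are made-up example values; multinomial_coef and multinomial_pmf are hypothetical helper names):

import math

def multinomial_coef(counts):
    """n! / (n1! * n2! * ... * nk!) for counts = [n1, ..., nk]."""
    result = math.factorial(sum(counts))
    for c in counts:
        result //= math.factorial(c)
    return result

def multinomial_pmf(counts, probs):
    """P(X1=n1, ..., Xk=nk) for category probabilities probs."""
    p = multinomial_coef(counts)
    for c, pk in zip(counts, probs):
        p *= pk ** c
    return p

# 10 draws from 3 classes with probabilities (0.5, 0.3, 0.2)
counts = [5, 3, 2]
probs = [0.5, 0.3, 0.2]
print(f"Coefficient: {multinomial_coef(counts)}")                  # 10!/(5! 3! 2!) = 2520
print(f"P(X = (5, 3, 2)) = {multinomial_pmf(counts, probs):.3f}")  # ≈ 0.085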

ML Applications Summary

Example: Vocabulary Size and Sequence Complexity

A vocabulary of size $V = 50{,}000$ (typical for GPT-2): the number of possible sequences of length $L$ is $V^L$. For $L = 100$: $50{,}000^{100}$ — an incomprehensibly large number. Language models learn to assign probabilities to this vast combinatorial space from a finite training corpus. This is why they generalize: they learn patterns that let them evaluate unseen combinations. The challenge of language modeling is fundamentally a combinatorial one.

import math

# Combinatorics in NLP
V = 50_000  # vocabulary size (roughly GPT-2 scale)

# Number of possible sequences (selections WITH repetition)
# V^L — each position independently draws from V tokens
print("Number of sequences of length L:")
for L in [3, 5, 10]:
    count = V ** L
    print(f"  L={L}: V^L = {V}^{L} ≈ {count:.2e}")

# BPE: number of candidate character-pair merges to evaluate at each step
# With v tokens in the current vocabulary, there are v*(v-1)/2 unordered pairs
for v in [256, 1000, 10000]:
    pairs = v * (v - 1) // 2
    print(f"\n  V={v}: ~{pairs:,} possible pairs to merge")

Practice Exercises

Exercise 1 (Grid Search): Count the Search Space

A neural network has: 4 choices for learning rate, 3 for batch size, 5 for hidden layer size, 2 for activation function, 3 for optimizer. (1) How many configurations are there in a full grid search? (2) If you randomly sample 10%, approximately how many runs is that?

Answer:

(1) $4 \times 3 \times 5 \times 2 \times 3 = 360$ configurations. (2) $0.10 \times 360 = 36$ runs. Random search with 36 trials often matches or beats exhaustive grid search in practice (Bergstra & Bengio, 2012) because not all hyperparameters are equally important.

Exercise 2 (Probability): Binomial Probability

A model achieves 80% accuracy on each example independently. What is the probability that it gets exactly 7 out of 10 examples correct?

Answer:

$P(X=7) = \binom{10}{7} (0.8)^7 (0.2)^3 = 120 \times 0.2097 \times 0.008 \approx 0.2013 \approx 20.1\%$. Note the peak probability is at $k = np = 10 \times 0.8 = 8$ correct examples.
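Both answers are easy to verify in a couple of lines:

import math

# Exercise 1: multiplication rule
print(4 * 3 * 5 * 2 * 3)                   # 360 configurations; 10% sample = 36 runs
# Exercise 2: binomial probability, n=10, k=7, p=0.8
print(math.comb(10, 7) * 0.8**7 * 0.2**3)  # ≈ 0.2013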

Conclusion & Next Steps

Combinatorics gives you the tools to quantify possibility. The key results:

  • Multiplication rule: Independent sequential choices multiply — hyperparameter grids grow as $n_1 \times n_2 \times \cdots$
  • Permutations $P(n,r) = \frac{n!}{(n-r)!}$ — when order matters (token sequences, rankings)
  • Combinations $\binom{n}{r} = \frac{n!}{r!(n-r)!}$ — when order doesn't matter (feature subsets, cross-validation folds)
  • Binomial theorem: $(a+b)^n = \sum_k \binom{n}{k} a^{n-k} b^k$ — foundation of the binomial distribution
  • Multinomials: Generalize to $k$ categories — direct connection to softmax and cross-entropy

Next in the Series

In Part 4: Probability Fundamentals, we combine sets (Part 2) and counting (Part 3) to build formal probability theory — sample spaces, Bayes' theorem, probability distributions, and the mathematical foundation of all probabilistic ML.