Complete Math + Probability + Statistics for ML/AI/DS Bootcamp

Series outline:
1. Mathematical Thinking: Mindset, notation & functions
2. Set Theory & Foundations: Sets, operations & ML connections
3. Combinatorics: Counting, permutations & combinations
4. Probability Fundamentals: Rules, Bayes & distributions
5. Statistics: Descriptive to inferential
6. Information Theory: Entropy, cross-entropy & KL divergence
7. Linear Algebra: Vectors, matrices & transformations
8. Calculus & Optimization: Derivatives, gradients & descent
9. ML-Specific Math: Loss functions & regularization
10. Computational Math: NumPy, Pandas & simulation
11. Advanced Topics: Multivariate stats & Bayesian inference
12. Projects & Applications: Build from scratch: regression, Bayes, PCA

Why Counting Matters in ML
Counting might sound elementary, but combinatorics is the mathematical engine behind some of the most important practical questions in ML:
- Hyperparameter search: With 5 hyperparameters each having 4 options, you have $4^5 = 1{,}024$ combinations to evaluate.
- Feature selection: Choosing $k$ features from $d$ total: $\binom{d}{k}$ possible subsets.
- Neural architecture: The number of possible architectures (layers, widths, activation functions) is combinatorially enormous.
- Probability: The binomial distribution uses combinations: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$.
- Tokenization: BPE (Byte-Pair Encoding) selects from all possible character-pair merges combinatorially.
- Decision trees: The number of possible decision tree structures grows super-exponentially with depth.
Fundamental Counting Principles
1 — The Multiplication Rule
If a task consists of sequential, independent choices — first choosing from $n_1$ options, then from $n_2$ options, … — the total number of outcomes is the product $n_1 \times n_2 \times \cdots \times n_k$.
Hyperparameter Grid Search
Suppose you're searching over:
- Learning rate: 3 choices {0.001, 0.01, 0.1}
- Batch size: 3 choices {32, 64, 128}
- Architecture: 2 choices {shallow, deep}
- Optimizer: 2 choices {Adam, SGD}
Total combinations: $3 \times 3 \times 2 \times 2 = 36$. This is why random search and Bayesian optimization are preferred over exhaustive grid search — even with just 4 hyperparameters, a coarse grid produces dozens to hundreds of experiments, each requiring a full training run.
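To see the multiplication rule in code, the same grid can be enumerated with `itertools.product`; a minimal sketch using the four option lists above:

from itertools import product

# The four hyperparameter option lists from the example above
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [32, 64, 128]
architectures = ["shallow", "deep"]
optimizers = ["Adam", "SGD"]

# product() yields every (lr, batch, arch, opt) tuple: 3 * 3 * 2 * 2 of them
grid = list(product(learning_rates, batch_sizes, architectures, optimizers))
print(f"Total configurations: {len(grid)}")  # 36
print(f"First configuration:  {grid[0]}")    # (0.001, 32, 'shallow', 'Adam')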
2 — The Addition Rule
If two events are mutually exclusive (can't both happen), the total number of outcomes is the sum $n_1 + n_2$.
For non-disjoint events, apply inclusion-exclusion: $n_1 + n_2 - |\text{overlap}|$ (from set theory, Part 2).
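A small sketch of the addition rule with Python sets; the two feature groups are hypothetical, chosen only to illustrate the inclusion-exclusion correction:

# Hypothetical feature groups flagged by two different selection criteria
high_variance = {"age", "income", "clicks", "tenure"}   # 4 features
high_mutual_info = {"income", "clicks", "country"}      # 3 features

overlap = high_variance & high_mutual_info              # {"income", "clicks"}
union = high_variance | high_mutual_info

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B| = 4 + 3 - 2 = 5
print(len(high_variance) + len(high_mutual_info) - len(overlap))  # 5
print(len(union))                                                 # 5, same count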
Factorial
Before counting permutations and combinations, you need to understand the factorial — one of the most fundamental operations in combinatorics and probability.
The factorial of a non-negative integer $n$, written $n!$, is the product of all positive integers from 1 up to $n$: $n! = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1$, with the special case $0! = 1$.
| $n$ | $n!$ | Calculation | Meaning |
|---|---|---|---|
| 0 | 1 | — (by definition) | 1 way to arrange 0 items |
| 1 | 1 | $1$ | 1 way to arrange 1 item |
| 2 | 2 | $2 \times 1$ | 2 ways to arrange 2 items |
| 3 | 6 | $3 \times 2 \times 1$ | 6 ways to arrange 3 items |
| 4 | 24 | $4 \times 3 \times 2 \times 1$ | 24 ways to arrange 4 items |
| 5 | 120 | $5 \times 4 \times 3 \times 2 \times 1$ | 120 ways to arrange 5 items |
| 10 | 3,628,800 | $10!$ | Over 3.6 million arrangements! |
| 20 | $\approx 2.4 \times 10^{18}$ | $20!$ | Astronomically large |
Why factorial growth matters in ML:
- Brute-force sorting by checking every possible ordering of $n$ items means examining up to $n!$ permutations — completely infeasible for large $n$
- Exhaustive search over all subsets of $n$ features requires $2^n$ evaluations — use greedy or regularized methods instead
- The number of ways to order $n$ tokens is $n!$ — one reason transformers need positional encodings, since self-attention alone ignores token order
Recursive definition: $n! = n \times (n-1)!$ for $n \geq 1$, and $0! = 1$. This recursive structure makes factorials easy to compute programmatically.
Stirling's approximation is useful when $n$ is large (too large to compute exactly): $n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^{n}$.
import math

# Factorial basics
for n in range(0, 11):
    print(f"{n:2d}! = {math.factorial(n):>12,}")

# Recursive implementation (to understand the structure)
def factorial_recursive(n):
    """Recursive: n! = n * (n-1)!"""
    if n == 0:
        return 1
    return n * factorial_recursive(n - 1)

print(f"\nRecursive: 5! = {factorial_recursive(5)}")  # 120

# Iterative implementation
def factorial_iterative(n):
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

print(f"Iterative: 5! = {factorial_iterative(5)}")  # 120

# Stirling's approximation for large n
n = 100
exact = math.factorial(n)
stirling = math.sqrt(2 * math.pi * n) * (n / math.e) ** n
print(f"\nFor n={n}:")
print(f"  log10(exact)    = {math.log10(exact):.4f}")
print(f"  log10(stirling) = {math.log10(stirling):.4f}")
print(f"  Relative error  = {abs(exact - stirling)/exact:.6f}")
Permutations — When Order Matters
A permutation is an ordered arrangement. When order matters, different arrangements are counted as distinct outcomes.
Permutations of $n$ items taken $r$ at a time (no repetition): $P(n, r) = \frac{n!}{(n-r)!}$
All permutations of $n$ items: $P(n, n) = n!$
Factorial: $n! = n \cdot (n-1) \cdots 2 \cdot 1$, with $0! = 1$ by convention.
flowchart TD
S["3 items: {A, B, C}\nChoose 2 in order"] --> A1["First: A"]
S --> B1["First: B"]
S --> C1["First: C"]
A1 --> AB["(A,B)"]
A1 --> AC["(A,C)"]
B1 --> BA["(B,A)"]
B1 --> BC["(B,C)"]
C1 --> CA["(C,A)"]
C1 --> CB["(C,B)"]
AB & AC & BA & BC & CA & CB --> T["P(3,2) = 3!/(3-2)! = 6 permutations"]
style S fill:#132440,stroke:#132440,color:#fff
style T fill:#3B9797,stroke:#3B9797,color:#fff
import math
from itertools import permutations

# Permutations formula
def P(n, r):
    return math.factorial(n) // math.factorial(n - r)

print(f"P(5,3) = {P(5,3)}")    # 60
print(f"P(10,4) = {P(10,4)}")  # 5040

# Generate all permutations of 3 items taken 2 at a time
items = ['A', 'B', 'C']
all_perms = list(permutations(items, 2))
print(f"\nAll P(3,2) permutations: {all_perms}")
print(f"Count: {len(all_perms)}")  # 6

# Factorial grows fast — this is why neural architecture search is hard
for n in range(1, 13):
    print(f"{n}! = {math.factorial(n):,}")
Combinations — When Order Doesn't Matter
A combination is a selection where order doesn't matter. $\{A, B, C\}$ and $\{C, B, A\}$ are the same combination but different permutations.
Combinations of $n$ items taken $r$ at a time: $\binom{n}{r} = \frac{n!}{r!\,(n-r)!} = \frac{P(n,r)}{r!}$
Read $\binom{n}{r}$ as "$n$ choose $r$." We divide by $r!$ to remove the $r!$ orderings of each selection that permutations would count separately.
Key identities: symmetry $\binom{n}{r} = \binom{n}{n-r}$; Pascal's rule $\binom{n}{r} = \binom{n-1}{r-1} + \binom{n-1}{r}$; and the row sum $\sum_{r=0}^{n} \binom{n}{r} = 2^n$, the total number of subsets of an $n$-element set.
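These identities, and the divide-by-$r!$ link between permutations and combinations, are easy to spot-check numerically with `math.comb` and `math.perm` (both available in Python 3.8+):

import math

n, r = 10, 3

# Combinations = permutations with the r! orderings of each selection divided out
assert math.comb(n, r) == math.perm(n, r) // math.factorial(r)

# Symmetry: C(n, r) = C(n, n - r)
assert math.comb(n, r) == math.comb(n, n - r)

# Pascal's rule: C(n, r) = C(n-1, r-1) + C(n-1, r)
assert math.comb(n, r) == math.comb(n - 1, r - 1) + math.comb(n - 1, r)

# Row sum: sum over k of C(n, k) = 2^n (every item is either in a subset or not)
assert sum(math.comb(n, j) for j in range(n + 1)) == 2 ** n

print("All identities verified for n=10, r=3")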
Why Exhaustive Feature Selection Fails
With $d$ features, selecting the best subset of size $k$ requires evaluating $\binom{d}{k}$ subsets. For $d=50$, $k=10$: $\binom{50}{10} = 10,272,278,170$ — over 10 billion subsets. Even evaluating each in 1 millisecond would take nearly 119 days. This combinatorial explosion explains why L1 regularization (LASSO), which implicitly selects features by pushing weights to zero, is so practically valuable.
import math
from itertools import combinations

# Combinations formula
def C(n, r):
    return math.comb(n, r)  # math.comb available in Python 3.8+

print(f"C(5,2) = {C(5,2)}")    # 10
print(f"C(10,3) = {C(10,3)}")  # 120

# Generate all combinations of 4 items taken 2 at a time
items = ['a', 'b', 'c', 'd']
all_combs = list(combinations(items, 2))
print(f"\nAll C(4,2) combinations: {all_combs}")
print(f"Count: {len(all_combs)}")  # 6

# Combinatorial explosion in feature selection
print("\nFeature subsets of size k from d features:")
for d in [10, 20, 50, 100]:
    k = d // 5  # select 20% of features
    count = C(d, k)
    print(f"  d={d:3d}, k={k:2d}: C({d},{k}) = {count:,}")
Binomial Theorem & Pascal's Triangle
The binomial theorem expands $(a + b)^n$ using combinations: $(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k}$
The coefficients $\binom{n}{k}$ form Pascal's Triangle, where each entry is the sum of the two entries above it: $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$.
import math

# Pascal's Triangle
def pascal_triangle(rows):
    triangle = []
    for n in range(rows):
        row = [math.comb(n, k) for k in range(n + 1)]
        triangle.append(row)
    return triangle

triangle = pascal_triangle(8)
print("Pascal's Triangle (rows 0–7):")
for i, row in enumerate(triangle):
    padding = " " * (7 - i) * 2
    print(padding + " ".join(f"{x:3d}" for x in row))

# Verify binomial theorem: (1+1)^n = sum of row n = 2^n
for n in range(8):
    row_sum = sum(triangle[n])
    print(f"Row {n} sum = {row_sum} = 2^{n} = {2**n}")
import math

# Binomial expansion
def binomial_expand(a_coef, b_coef, n, show_terms=True):
    """Expand (a_coef * a + b_coef * b)^n numerically for a=b=1."""
    terms = []
    total = 0
    for k in range(n + 1):
        coef = math.comb(n, k) * (a_coef ** (n - k)) * (b_coef ** k)
        terms.append(coef)
        total += coef
    if show_terms:
        print(f"Coefficients of (a+b)^{n}: {terms}")
        print(f"Sum (a=b=1): {total} = {a_coef+b_coef}^{n} = {(a_coef+b_coef)**n}")
    return terms

binomial_expand(1, 1, 5)  # 1 5 10 10 5 1

# Binomial probability: 3 heads in 5 fair coin flips
n, k, p = 5, 3, 0.5
prob = math.comb(n, k) * (p**k) * ((1-p)**(n-k))
print(f"\nP(X=3) for n=5, p=0.5: {prob:.4f}")  # 0.3125
Multinomials & Multi-class Classification
The multinomial coefficient generalizes combinations to distributing $n$ objects into $k$ groups of sizes $n_1, \ldots, n_k$ (with $n_1 + \cdots + n_k = n$): $\binom{n}{n_1, n_2, \ldots, n_k} = \frac{n!}{n_1!\, n_2! \cdots n_k!}$
The multinomial distribution describes the counts obtained from $n$ independent draws across $k$ categories with probabilities $p_1, \ldots, p_k$: $P(X_1 = n_1, \ldots, X_k = n_k) = \frac{n!}{n_1! \cdots n_k!}\, p_1^{n_1} \cdots p_k^{n_k}$. In multi-class classification, a softmax output parameterizes exactly these category probabilities.
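A minimal sketch of both formulas using only the standard library; the class counts $(5, 3, 2)$ and probabilities $(0.5, 0.3, 0.2)$ are made-up numbers for illustration:

import math

def multinomial_coef(counts):
    """n! / (n1! * n2! * ... * nk!): ways to split n items into groups of these sizes."""
    result = math.factorial(sum(counts))
    for c in counts:
        result //= math.factorial(c)
    return result

# Ways to assign 10 examples to 3 classes with counts (5, 3, 2)
counts = [5, 3, 2]
print(f"Multinomial coefficient: {multinomial_coef(counts)}")  # 10!/(5!3!2!) = 2520

# Multinomial probability with category probabilities (0.5, 0.3, 0.2)
probs = [0.5, 0.3, 0.2]
prob = multinomial_coef(counts)
for c, p in zip(counts, probs):
    prob *= p ** c
print(f"P(counts = {counts}) = {prob:.5f}")  # ≈ 0.08505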
ML Applications Summary
Vocabulary Size and Sequence Complexity
A vocabulary of size $V = 50{,}000$ (typical for GPT-2): the number of possible sequences of length $L$ is $V^L$. For $L = 100$: $50{,}000^{100}$ — an incomprehensibly large number. Language models learn to assign probabilities to this vast combinatorial space from a finite training corpus. This is why they generalize: they learn patterns that let them evaluate unseen combinations. The challenge of language modeling is fundamentally a combinatorial one.
# Combinatorics in NLP
V = 50_000  # vocabulary size
L = 10      # short sequence length

# Number of possible sequences (permutations WITH repetition):
# V^L — each position independently draws from V tokens
print("Number of sequences of length L:")
for l in [3, 5, L]:
    count = V ** l
    print(f"  L={l}: V^L = {V}^{l} ≈ {count:.2e}")

# BPE: number of candidate pair merges at each step.
# With V_current tokens there are C(V_current, 2) = V_current*(V_current-1)/2 unordered pairs
for v in [256, 1000, 10000]:
    pairs = v * (v - 1) // 2
    print(f"\n  V={v}: ~{pairs:,} possible pairs to merge")
Practice Exercises
Count the Search Space
A neural network has: 4 choices for learning rate, 3 for batch size, 5 for hidden layer size, 2 for activation function, 3 for optimizer. (1) How many configurations are there in a full grid search? (2) If you randomly sample 10%, approximately how many runs is that?
Show Answer
(1) $4 \times 3 \times 5 \times 2 \times 3 = 360$ configurations. (2) $0.10 \times 360 = 36$ runs. Random search with 36 trials often matches or beats exhaustive grid search in practice (Bergstra & Bengio, 2012) because not all hyperparameters are equally important.
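To double-check the arithmetic, here is a short sketch that builds the grid by index and draws a uniform 10% sample without replacement (the specific option values don't matter for the count):

import random
from itertools import product

# Index every configuration: 4 learning rates x 3 batch sizes x 5 widths x 2 activations x 3 optimizers
grid = list(product(range(4), range(3), range(5), range(2), range(3)))
print(len(grid))  # 360 configurations

# A 10% random sample of the grid
sample = random.sample(grid, k=int(0.10 * len(grid)))
print(len(sample))  # 36 runs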
Binomial Probability
A model achieves 80% accuracy on each example independently. What is the probability that it gets exactly 7 out of 10 examples correct?
Show Answer
$P(X=7) = \binom{10}{7} (0.8)^7 (0.2)^3 = 120 \times 0.2097 \times 0.008 \approx 0.2013 \approx 20.1\%$. Note the peak probability is at $k = np = 10 \times 0.8 = 8$ correct examples.
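A quick check of the arithmetic, and of the claim about the peak, using `math.comb`:

import math

# P(X = 7) for n = 10 independent examples, per-example accuracy p = 0.8
n, p = 10, 0.8
print(f"P(X=7) = {math.comb(n, 7) * p**7 * (1 - p)**3:.4f}")  # 0.2013

# The most likely count is k = 8, the mode of Binomial(10, 0.8)
pmf = {k: math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
print(f"argmax_k P(X=k) = {max(pmf, key=pmf.get)}")  # 8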
Conclusion & Next Steps
Combinatorics gives you the tools to quantify possibility. The key results:
- Multiplication rule: Independent sequential choices multiply — hyperparameter grids grow as $n_1 \times n_2 \times \cdots$
- Permutations $P(n,r) = \frac{n!}{(n-r)!}$ — when order matters (token sequences, rankings)
- Combinations $\binom{n}{r} = \frac{n!}{r!(n-r)!}$ — when order doesn't matter (feature subsets, cross-validation folds)
- Binomial theorem: $(a+b)^n = \sum_k \binom{n}{k} a^{n-k} b^k$ — foundation of the binomial distribution
- Multinomials: Generalize to $k$ categories — direct connection to softmax and cross-entropy
Next in the Series
In Part 4: Probability Fundamentals, we combine sets (Part 2) and counting (Part 3) to build formal probability theory — sample spaces, Bayes' theorem, probability distributions, and the mathematical foundation of all probabilistic ML.