
Part 4: Probability Fundamentals

April 26, 2026 Wasil Zafar 38 min read

All ML models make decisions under uncertainty. Probability theory is the rigorous language for reasoning about uncertainty — and understanding it transforms you from a practitioner who uses models into one who truly understands why they work.

Table of Contents

  1. Probability and ML
  2. Theoretical vs Empirical Probability
  3. Probability Space & Axioms
  4. Conditional Probability
  5. Bayes' Theorem
  6. Independence & Chain Rule
  7. Random Variables
  8. Key Distributions
  9. Expectation, Variance & Covariance
  10. Law of Total Probability
  11. Geometric Distribution
  12. Hypergeometric Distribution
  13. Z-Score & Standardization
  14. T-Distribution
  15. Skewness & Kurtosis
  16. Sensitivity & Specificity
  17. Decision Tree of Probability
  18. Practice Exercises
  19. Conclusion & Next Steps

Probability and ML

Every ML model expresses uncertainty. Logistic regression outputs a probability. Naive Bayes applies Bayes' theorem directly. Neural networks trained with cross-entropy loss implicitly model probability distributions. Bayesian optimization navigates uncertainty to find hyperparameters. This part builds the formal foundations.

What you'll be able to explain after this part:
  • Why logistic regression's sigmoid output is interpreted as a probability
  • How Naive Bayes derives its prediction rule from Bayes' theorem
  • What "MLE" means and why it's equivalent to minimizing cross-entropy loss
  • Why the Gaussian distribution appears everywhere in ML
  • The difference between variance of a variable and variance of a model's predictions

Theoretical vs Empirical Probability

Before formalizing probability, it helps to understand the different ways we assign probabilities to events:

Theoretical (Classical) Probability — also called a priori probability — is based on reasoning about equally likely outcomes. You don't need to run an experiment:

Notation introduced here:
  • $\Omega$ (capital omega) — the sample space: the set of every possible outcome (e.g., $\Omega = \{1,2,3,4,5,6\}$ for a die roll)
  • $A$ — an event: a specific subset of outcomes we care about (e.g., $A = \{2,4,6\}$ for "roll an even number")
  • $|A|$ — the cardinality of set $A$: simply how many outcomes are in it ($|A| = 3$ for the even-number event above)
  • $|\Omega|$ — the cardinality of the sample space: the total count of all possible outcomes ($|\Omega| = 6$ for a die)
  • $P(A)$ — the probability of event $A$: a number in $[0, 1]$ measuring how likely $A$ is
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes favorable to } A}{\text{total number of equally likely outcomes}}$$

Example: the probability of rolling a 4 on a fair die = $1/6$, because each of the 6 faces is equally likely.

Empirical (Experimental/Frequentist) Probability is based on actual observations from repeated trials:

$$P(A) \approx \frac{f}{n} = \frac{\text{number of times A occurred}}{\text{total number of trials}}$$

Example: you flip a coin 1,000 times and get heads 503 times. The empirical probability of heads $\approx 503/1000 = 0.503$.

Subjective Probability is based on personal belief or expert judgment when neither equally-likely outcomes nor large datasets are available. Examples: a doctor's estimate of a patient's recovery probability, or an analyst's forecast.

Law of Large Numbers: As the number of trials $n \to \infty$, empirical probability converges to theoretical probability: $$\frac{f}{n} \xrightarrow{n \to \infty} P(A)$$ This is why ML models need enough training data — small samples give noisy empirical estimates of the true data distribution. With enough data, the empirical distribution (training set) approximates the theoretical distribution (data-generating process).
import numpy as np

# Theoretical vs Empirical probability of rolling even on a fair die
rng = np.random.default_rng(42)
theoretical = 3/6  # 2, 4, 6 are even

ns = [10, 100, 1000, 10000, 100000]
print(f"Theoretical P(even) = {theoretical:.4f}")
print(f"{'-'*40}")
for n in ns:
    rolls = rng.integers(1, 7, size=n)
    empirical = (rolls % 2 == 0).mean()
    print(f"n={n:>7,}: empirical = {empirical:.4f}  (error = {abs(empirical-theoretical):.4f})")
# Law of Large Numbers: error shrinks as n grows

Probability Space & Axioms

Sample Space & Events

A probability space consists of three components $(\Omega, \mathcal{F}, P)$:

  • $\Omega$ (Omega): the sample space — the set of all possible outcomes
  • $\mathcal{F}$: a sigma-algebra — a collection of measurable subsets of $\Omega$ called events
  • $P$: a probability measure — a function assigning a number $P(A) \in [0,1]$ to each event $A \in \mathcal{F}$

For ML purposes: $\Omega$ is the set of all possible observations (e.g., all possible images), an event $A \subseteq \Omega$ is a property (e.g., "this image contains a cat"), and $P(A)$ is the fraction of the data distribution with that property.

Kolmogorov Axioms

All of probability theory follows from three axioms (Kolmogorov, 1933):

$$\text{1. Non-negativity:} \quad P(A) \geq 0 \;\text{ for all events } A$$
$$\text{2. Normalization:} \quad P(\Omega) = 1$$
$$\text{3. Countable additivity:} \quad \text{for pairwise disjoint } A_1, A_2, \ldots: \quad P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$$

(For just two disjoint events this reduces to the familiar $P(A \cup B) = P(A) + P(B)$.)

From these three axioms, all probability theorems follow. Key derived results:

$$P(\emptyset) = 0 \qquad P(A^c) = 1 - P(A) \qquad P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
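These derived identities can be checked numerically. A quick sketch with die rolls (the events here are chosen purely for illustration):

```python
import numpy as np

# Empirically check two identities derived from the axioms,
# using die rolls: A = "even", B = "greater than 3"
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=200_000)

A = (rolls % 2 == 0)   # {2, 4, 6}
B = (rolls > 3)        # {4, 5, 6}

p_A, p_B = A.mean(), B.mean()
p_union = (A | B).mean()
p_inter = (A & B).mean()

# Complement rule: P(A^c) = 1 - P(A)
print(f"P(not A) = {(~A).mean():.4f}  vs  1 - P(A) = {1 - p_A:.4f}")
# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(f"P(A ∪ B) = {p_union:.4f}  vs  P(A)+P(B)-P(A∩B) = {p_A + p_B - p_inter:.4f}")
```

Both identities hold exactly for empirical frequencies, not just in the limit — they are set-counting facts, independent of sample size.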

Mutually Exclusive vs Non-Mutually Exclusive Events

Two events are mutually exclusive (also called disjoint) if they cannot both occur in the same trial:

$$A \cap B = \emptyset \quad \Rightarrow \quad P(A \cap B) = 0$$

Two events are non-mutually exclusive if they can both occur:

$$A \cap B \neq \emptyset \quad \Rightarrow \quad P(A \cap B) \gt 0$$

| Feature | Mutually Exclusive | Non-Mutually Exclusive |
|---|---|---|
| Can both occur? | No | Yes |
| $P(A \cap B)$ | $0$ | $\gt 0$ |
| Example | Rolling 1 and 6 simultaneously | Rolling even and rolling > 3 |
| ML example | A token is label "cat" or "dog" (not both) | Email is spam AND contains "free" |

Common confusion: Mutually exclusive ≠ Independent. In fact, mutually exclusive events with non-zero probability are necessarily dependent: if $A$ occurs, $B$ definitely cannot — so knowing $A$ changes $P(B)$.

Addition Rule of Probability

The addition rule calculates $P(A \cup B)$ — the probability that at least one of $A$ or $B$ occurs:

General Addition Rule (for any two events):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

The $P(A \cap B)$ term is subtracted because the overlap region gets counted twice when you add $P(A)$ and $P(B)$. This is the Inclusion-Exclusion Principle from set theory applied to probability.

Special case for Mutually Exclusive Events:

$$P(A \cup B) = P(A) + P(B) \quad \text{(since } P(A \cap B) = 0 \text{)}$$
Addition Rule in ML: Multi-Label Classification

In a dataset, 35% of images contain cats ($P(C) = 0.35$), 28% contain dogs ($P(D) = 0.28$), and 10% contain both ($P(C \cap D) = 0.10$). What fraction contain at least one pet?

$P(C \cup D) = 0.35 + 0.28 - 0.10 = 0.53$ — 53% of images contain a cat or dog or both.

For 3+ events: $P(A\cup B\cup C) = P(A)+P(B)+P(C)-P(A\cap B)-P(A\cap C)-P(B\cap C)+P(A\cap B\cap C)$
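A quick numerical check of both forms of the rule — the pet numbers are those from the example above, while the dice events for the three-event case are chosen just for illustration:

```python
import numpy as np

# Two-event addition rule with the multi-label numbers from the text
p_cat, p_dog, p_both = 0.35, 0.28, 0.10
p_cat_or_dog = p_cat + p_dog - p_both
print(f"P(C ∪ D) = {p_cat_or_dog:.2f}")  # 0.53

# Three-event inclusion-exclusion, illustrated with die rolls:
# A = even, B = > 3, C = prime ({2, 3, 5})
rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=200_000)
A, B, C = (rolls % 2 == 0), (rolls > 3), np.isin(rolls, [2, 3, 5])

lhs = (A | B | C).mean()
rhs = (A.mean() + B.mean() + C.mean()
       - (A & B).mean() - (A & C).mean() - (B & C).mean()
       + (A & B & C).mean())
print(f"P(A∪B∪C) = {lhs:.4f}  vs inclusion-exclusion sum = {rhs:.4f}")
```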

Conditional Probability

Conditional probability asks: "What is the probability of $B$ given that $A$ has already happened?" Instead of looking at the full sample space $\Omega$, we restrict our view to only the outcomes inside $A$.

Visualising with a Venn Diagram

Imagine the sample space as a rectangle (total probability = 1). $A$ is a region inside it, and $A \cap B$ is the overlap where both events occur:

[Figure: Venn diagram — the sample space Ω drawn as a rectangle, region A inside it, and the overlap A ∩ B; everything outside A is excluded when conditioning on A]

When we condition on $A$, we shrink the universe to just $A$. Inside that shrunken universe, only $A \cap B$ counts as "$B$ happening."

The Key Idea — shrink the sample space:
  • The new "total probability" (the whole universe, now restricted to $A$) $= P(A)$
  • The part of that restricted universe where $B$ also occurs $= P(A \cap B)$
  • So $P(B \mid A)$ is simply the fraction of $A$ that is also $B$

Deriving the formula — within the restricted universe of $A$, we redefine probability as the ratio of two measures:

$$P(B \mid A) = \frac{\text{favorable outcomes inside } A}{\text{total outcomes inside } A} = \frac{P(A \cap B)}{P(A)} \quad \text{for } P(A) \gt 0$$

This is the formal definition of conditional probability. The labels $A$ and $B$ can be swapped — the logic is identical — giving the symmetric form:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{for } P(B) \gt 0$$

This is the foundation of all probabilistic reasoning in ML — restrict the sample space to the known event, then measure how much of that restricted space satisfies the target event.

Law of Total Probability: If events $B_1, B_2, \ldots, B_k$ partition $\Omega$ (mutually exclusive, exhaustive), then for any event $A$: $$P(A) = \sum_{i=1}^{k} P(A \mid B_i)\, P(B_i)$$ This is used constantly — e.g., $P(\text{email is spam}) = P(\text{spam} \mid \text{from known sender}) \cdot P(\text{known}) + P(\text{spam} \mid \text{unknown}) \cdot P(\text{unknown})$.
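Both views of conditioning — the ratio formula and literally restricting the sample space — can be compared in a short simulation (the die events are chosen purely for illustration):

```python
import numpy as np

# P(roll > 3 | roll is even) on a fair die.
# Theoretically: P(B|A) = P(A ∩ B)/P(A) = (2/6)/(3/6) = 2/3.
rng = np.random.default_rng(7)
rolls = rng.integers(1, 7, size=300_000)

A = (rolls % 2 == 0)   # conditioning event: even
B = (rolls > 3)

# Ratio definition: P(A ∩ B) / P(A)
p_B_given_A = (A & B).mean() / A.mean()
# Direct restriction: look only at the rolls where A occurred
p_B_given_A_restricted = B[A].mean()

print(f"ratio form:      {p_B_given_A:.4f}")
print(f"restricted form: {p_B_given_A_restricted:.4f}")
print(f"theoretical:     {2/3:.4f}")
```

The two computations agree exactly, because dividing counts by the size of $A$ is the same as averaging within $A$ — the "shrunken universe" picture made literal.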

Multiplication Rule of Probability

The multiplication rule calculates $P(A \cap B)$ — the probability that both $A$ and $B$ occur:

General Multiplication Rule (for any two events, using conditional probability):

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$$

This is also called the chain rule for two events. It extends naturally to $n$ events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots$$

Special case for Independent Events:

$$P(A \cap B) = P(A) \cdot P(B) \quad \text{(since } P(B \mid A) = P(B) \text{ when independent)}$$

Dependent vs Independent Events

Two events are independent if the occurrence of one gives no information about the other:

$$A \perp B \iff P(A \mid B) = P(A) \iff P(B \mid A) = P(B) \iff P(A \cap B) = P(A) \cdot P(B)$$

Two events are dependent if knowing one changes the probability of the other:

$$A \not\perp B \iff P(A \mid B) \neq P(A)$$

| | Independent | Dependent |
|---|---|---|
| Intuition | Knowing $B$ doesn't help predict $A$ | Knowing $B$ changes our estimate of $A$ |
| Multiplication | $P(A \cap B) = P(A) \cdot P(B)$ | $P(A \cap B) = P(A) \cdot P(B \mid A)$ |
| Example | Two separate coin flips | Drawing two cards without replacement |
| ML example | Naive Bayes feature assumption | Sequential tokens in language models |

import numpy as np

rng = np.random.default_rng(42)
n = 100000

# Independent events: two separate coin flips
coin1 = rng.integers(0, 2, n)  # 0=tails, 1=heads
coin2 = rng.integers(0, 2, n)

p_head1 = coin1.mean()
p_head2 = coin2.mean()
p_both_heads = (coin1 & coin2).mean()

print("Independent events (two coins):")
print(f"  P(H1)        = {p_head1:.4f}  (theoretical: 0.5)")
print(f"  P(H2)        = {p_head2:.4f}  (theoretical: 0.5)")
print(f"  P(H1 ∩ H2)   = {p_both_heads:.4f}  (theoretical: 0.25)")
print(f"  P(H1)*P(H2) = {p_head1*p_head2:.4f}  ← matches, confirming independence")

# Dependent events: draw cards without replacement
# P(2nd card is ace | 1st was ace) ≠ P(2nd card is ace)
p_first_ace = 4 / 52
p_second_ace_given_first = 3 / 51  # one ace removed
p_second_ace_given_not_first = 4 / 51  # still 4 aces in 51 cards
p_second_ace = p_second_ace_given_first * p_first_ace + p_second_ace_given_not_first * (1 - p_first_ace)

print("\nDependent events (cards without replacement):")
print(f"  P(2nd ace | 1st was ace) = {p_second_ace_given_first:.4f}")
print(f"  P(2nd ace | 1st not ace) = {p_second_ace_given_not_first:.4f}")
print(f"  P(2nd ace overall)       = {p_second_ace:.4f}")
print("  Different conditional probabilities → DEPENDENT")

Bayes' Theorem

Rearranging the definition of conditional probability gives Bayes' theorem. Here is the step-by-step derivation:

Derivation — From Conditional Probability to Bayes' Theorem

Step 1 — Write both conditional probability definitions. The definition holds regardless of which event is in the condition:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad \text{and} \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Both definitions share the same numerator: the joint probability $P(A \cap B)$.

Step 2 — Isolate $P(A \cap B)$ from the right-hand definition. Multiply both sides by $P(A)$:

$$P(A \cap B) = P(B \mid A)\, P(A)$$

Step 3 — Substitute into the left-hand definition. Replace $P(A \cap B)$ in the numerator of the first equation:

$$P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)} \;=\; \frac{P(B \mid A)\, P(A)}{P(B)}$$

The joint term $P(A \cap B)$ disappears as an intermediate. The result expresses the reverse conditional $P(A \mid B)$ purely in terms of the forward conditional $P(B \mid A)$ — the core power of Bayes' theorem: we can flip the direction of conditioning.

$$\boxed{P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}}$$

In Bayesian ML terminology we rename the symbols: $A \to H$ (the hypothesis — the model or class we want to know about) and $B \to D$ (the data — what we actually observed). The formula becomes:

$$\underbrace{P(H \mid D)}_{\text{posterior}} = \frac{\underbrace{P(D \mid H)}_{\text{likelihood}} \cdot \underbrace{P(H)}_{\text{prior}}}{\underbrace{P(D)}_{\text{evidence}}}$$

Each of the four terms has a precise meaning and a natural intuition:

The Four Terms of Bayes' Theorem
| Term | Notation | Plain English | Medical test analogy | Spam filter analogy |
|---|---|---|---|---|
| Prior | $P(H)$ | What you believe before seeing any data — your starting belief about the hypothesis. | 1 in 1,000 people have the disease — the base-rate before any test. | 30% of all emails are spam — the base-rate before reading the email. |
| Likelihood | $P(D \mid H)$ | How probable is this data if the hypothesis is true? This is what the model or generative process predicts. | The test returns positive 99% of the time when the patient truly has the disease. | The word "prize" appears in 40% of actual spam emails. |
| Evidence | $P(D)$ | The total probability of observing this data under all hypotheses — a normalising constant that ensures the posterior sums to 1. | Overall rate of a positive test result in the whole population (sick + healthy). | Overall frequency of "prize" in all emails (spam + ham). |
| Posterior | $P(H \mid D)$ | Your updated belief after seeing the data — what you believe now, combining prior knowledge with new evidence. | Given a positive test, what is the actual probability the patient is ill? | Given the word "prize", what is the probability the email is spam? |
The intuition in one sentence

Bayes' theorem is a belief-update machine: start with your prior belief $P(H)$, multiply by how well the data fits that belief $P(D \mid H)$, and rescale by $P(D)$ so everything still adds up to 1. The output is your posterior $P(H \mid D)$ — a sharper, data-informed belief. In ML the prior encodes what the model knew before training, the likelihood is the training objective, and the posterior is what the model knows after seeing the data.

Watch — Bayes’ Theorem Visually Explained - 3Blue1Brown
Naive Bayes Spam Classification

Given an email containing the word "prize", is it spam? Using Bayes' theorem:

$$P(\text{spam} \mid \text{"prize"}) = \frac{P(\text{"prize"} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{"prize"})}$$

Suppose: $P(\text{spam}) = 0.3$ (30% of emails are spam), $P(\text{"prize"} \mid \text{spam}) = 0.4$, $P(\text{"prize"} \mid \text{not spam}) = 0.01$.

$P(\text{"prize"}) = 0.4 \times 0.3 + 0.01 \times 0.7 = 0.12 + 0.007 = 0.127$

$P(\text{spam} \mid \text{"prize"}) = \frac{0.4 \times 0.3}{0.127} \approx 0.945$ — 94.5% probability of spam.

Naive Bayes extends this to multiple words by assuming conditional independence (the "naive" assumption), multiplying individual $P(w_i \mid \text{class})$ probabilities together.

# Bayes' theorem applied to spam classification
def bayes_update(prior_spam, p_word_given_spam, p_word_given_ham):
    """Apply Bayes' theorem to update spam probability given a word."""
    prior_ham = 1 - prior_spam
    # Law of total probability
    p_word = p_word_given_spam * prior_spam + p_word_given_ham * prior_ham
    # Bayes' theorem
    posterior_spam = (p_word_given_spam * prior_spam) / p_word
    return posterior_spam

# Single word update
prior = 0.3
posterior = bayes_update(prior, p_word_given_spam=0.4, p_word_given_ham=0.01)
print(f"Prior P(spam) = {prior:.2f}")
print(f"After seeing 'prize': P(spam|'prize') = {posterior:.4f}")

# Naive Bayes: multiple words (conditional independence assumption)
words = [
    ("prize",      0.40, 0.01),
    ("free",       0.35, 0.02),
    ("winner",     0.30, 0.005),
    ("claim",      0.25, 0.01),
]

p = 0.3  # prior
for word, p_s, p_h in words:
    p = bayes_update(p, p_s, p_h)
    print(f"After '{word:8s}': P(spam) = {p:.6f}")

Independence & the Chain Rule

Events $A$ and $B$ are independent if knowing $B$ tells you nothing about $A$:

$$A \perp B \iff P(A \cap B) = P(A)\, P(B) \iff P(A \mid B) = P(A)$$

The chain rule of probability expresses joint probabilities as products of conditionals:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$
Chain rule and language models: The probability of a sentence $w_1 w_2 \ldots w_n$ is: $$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots$$ This is the exact formula that GPT models try to learn. Autoregressive generation samples $w_{t+1} \sim P(w_{t+1} \mid w_1, \ldots, w_t)$ at each step.
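As a sketch of this factorization, here is the chain rule applied with a toy bigram model, where each conditional is truncated to depend only on the previous word. The probabilities below are invented for illustration — they do not come from any real language model:

```python
import math

# Chain-rule factorization of a sentence probability with a toy
# bigram approximation: P(w_t | w_1..w_{t-1}) ≈ P(w_t | w_{t-1}).
# All probabilities are made up for illustration only.
bigram = {
    ("<s>", "the"):  0.4,
    ("the", "cat"):  0.1,
    ("cat", "sat"):  0.3,
    ("sat", "</s>"): 0.5,
}

sentence = ["<s>", "the", "cat", "sat", "</s>"]

log_prob = 0.0
for prev, cur in zip(sentence, sentence[1:]):
    p = bigram[(prev, cur)]
    log_prob += math.log(p)   # summing logs avoids underflow on long texts
    print(f"P({cur!r} | {prev!r}) = {p}")

print(f"P(sentence) = {math.exp(log_prob):.4f}")  # 0.4 * 0.1 * 0.3 * 0.5 = 0.006
```

Real autoregressive models do exactly this, except each conditional is produced by a neural network over the full prefix rather than looked up in a table.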

Random Variables

A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that maps outcomes to real numbers. It "converts" qualitative outcomes into numbers we can compute with.

Discrete Random Variables: PMF & CDF

A discrete RV takes countable values. Its distribution is described by the probability mass function (PMF):

$$p_X(x) = P(X = x) \quad \text{with} \quad \sum_x p_X(x) = 1, \quad p_X(x) \geq 0$$

The cumulative distribution function (CDF) accumulates probability up to $x$:

$$F_X(x) = P(X \leq x) = \sum_{k \leq x} p_X(k)$$

Continuous Random Variables: PDF & CDF

A continuous RV takes values in an interval. Its distribution is described by the probability density function (PDF) $f_X(x)$ where:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx \qquad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$

Note: $P(X = x) = 0$ for any single point in the continuous case.
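A minimal sketch contrasting the two cases — a fair die's PMF/CDF for the discrete side, and an interval probability under $\mathcal{N}(0,1)$ for the continuous side:

```python
import numpy as np
from scipy.stats import norm

# Discrete RV: PMF and CDF of a fair die
values = np.arange(1, 7)
pmf = np.full(6, 1/6)
cdf = np.cumsum(pmf)   # F(x) accumulates the PMF

for v, p, F in zip(values, pmf, cdf):
    print(f"x={v}: p(x)={p:.4f}  F(x)={F:.4f}")

# Continuous RV: single points have probability zero;
# probabilities come from integrating the density
print(f"P(-1 <= Z <= 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")  # ≈ 0.6827
```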

Key Probability Distributions

Bernoulli Distribution — Binary Outcomes

$X \sim \text{Bernoulli}(p)$: a single trial with success probability $p$.

$$P(X=1) = p, \quad P(X=0) = 1-p, \quad \mathbb{E}[X] = p, \quad \text{Var}(X) = p(1-p)$$

ML use: Binary classification output layer (sigmoid), coin-flip random variables, click prediction.

Binomial Distribution — Sum of Bernoullis

$X \sim \text{Binomial}(n, p)$: number of successes in $n$ independent Bernoulli trials.

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \mathbb{E}[X] = np, \quad \text{Var}(X) = np(1-p)$$

ML use: Testing significance of accuracy differences; dropout (each neuron dropped with probability $p$) follows a binomial structure.

Gaussian Distribution — The Bell Curve

$X \sim \mathcal{N}(\mu, \sigma^2)$: described by mean $\mu$ and variance $\sigma^2$.

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad \mathbb{E}[X] = \mu, \quad \text{Var}(X) = \sigma^2$$

ML use: Weight initialization (e.g., Xavier/He init), Gaussian noise in data augmentation, residuals in linear regression (assumption for OLS), L2 regularization is equivalent to a Gaussian prior on weights. By the CLT, sample means are approximately Gaussian.

Uniform Distribution

$X \sim \text{Uniform}(a, b)$: all values in $[a, b]$ are equally likely.

$$f(x) = \frac{1}{b-a}, \quad \mathbb{E}[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}$$

ML use: Hyperparameter sampling in random search, weight initialization in some variants, representing maximum-entropy priors.
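As an illustrative sketch of uniform sampling in random search — the hyperparameter names and ranges below are made up for this example, not any library's defaults:

```python
import numpy as np

rng = np.random.default_rng(42)

# Dropout rate: uniform on [0.1, 0.5] (illustrative range)
dropout = rng.uniform(0.1, 0.5, size=5)
# Learning rate: log-uniform on [1e-5, 1e-1] — uniform in log-space,
# the usual choice when a parameter spans several orders of magnitude
log_lr = rng.uniform(np.log(1e-5), np.log(1e-1), size=5)
lr = np.exp(log_lr)

for d, l in zip(dropout, lr):
    print(f"dropout={d:.3f}  lr={l:.2e}")

# Sanity check against the closed-form moments of Uniform(a, b)
a, b = 0.1, 0.5
samples = rng.uniform(a, b, size=100_000)
print(f"mean: theoretical {(a + b) / 2:.3f}  vs empirical {samples.mean():.3f}")
print(f"var:  theoretical {(b - a)**2 / 12:.4f} vs empirical {samples.var():.4f}")
```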

import numpy as np

# Simulate key distributions and compute statistics
rng = np.random.default_rng(42)

# Bernoulli(p=0.7)
p = 0.7
bernoulli_samples = rng.binomial(1, p, size=10000)
print(f"Bernoulli(p=0.7)")
print(f"  Theoretical: E[X]={p:.2f}, Var={p*(1-p):.4f}")
print(f"  Empirical:   E[X]={bernoulli_samples.mean():.4f}, Var={bernoulli_samples.var():.4f}")

# Binomial(n=10, p=0.3)
n, p = 10, 0.3
binom_samples = rng.binomial(n, p, size=10000)
print(f"\nBinomial(n=10, p=0.3)")
print(f"  Theoretical: E[X]={n*p:.2f}, Var={n*p*(1-p):.4f}")
print(f"  Empirical:   E[X]={binom_samples.mean():.4f}, Var={binom_samples.var():.4f}")

# Gaussian(mu=0, sigma=1)
mu, sigma = 0, 1
normal_samples = rng.normal(mu, sigma, size=10000)
print(f"\nGaussian(mu=0, sigma=1)")
print(f"  Theoretical: E[X]={mu:.2f}, Var={sigma**2:.4f}")
print(f"  Empirical:   E[X]={normal_samples.mean():.4f}, Var={normal_samples.var():.4f}")

Expectation, Variance & Covariance

The expectation (mean) of a random variable is its probability-weighted average:

$$\mathbb{E}[X] = \sum_x x \cdot p_X(x) \quad \text{(discrete)} \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f_X(x)\,dx \quad \text{(continuous)}$$

Linearity of expectation (holds even for dependent variables):

$$\mathbb{E}[aX + bY + c] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y] + c$$

The variance measures spread around the mean:

$$\text{Var}(X) = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Covariance measures how two variables move together:

$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$$

If $X \perp Y$ (independent), then $\text{Cov}(X,Y) = 0$ and $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$.
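The variance-of-a-sum identity, including the covariance term for the dependent case, can be verified on correlated Gaussian samples (the covariance matrix below is chosen for illustration):

```python
import numpy as np

# Check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) on correlated samples
rng = np.random.default_rng(3)
cov = [[1.0, 0.8], [0.8, 2.0]]
X, Y = rng.multivariate_normal([0, 0], cov, size=200_000).T

lhs = (X + Y).var()
# Population-form covariance so the identity holds exactly on the sample
rhs = X.var() + Y.var() + 2 * np.mean((X - X.mean()) * (Y - Y.mean()))

print(f"Var(X+Y)                  = {lhs:.4f}")
print(f"Var(X)+Var(Y)+2Cov(X,Y)   = {rhs:.4f}")
print(f"theoretical: 1 + 2 + 1.6  = {1 + 2 + 2 * 0.8}")
```

Dropping the covariance term here would understate the variance by $2 \times 0.8 = 1.6$ — the independence shortcut only applies when $\text{Cov}(X,Y) = 0$.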

MLE and Probability: Maximum Likelihood Estimation finds the parameters $\theta$ that maximize $P(\text{data} \mid \theta)$. For $n$ i.i.d. observations: $$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n P(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log P(x_i \mid \theta)$$ For Gaussian MLE on data $\{x_i\}$: the MLE solution gives $\hat{\mu} = \bar{x}$ (sample mean) and $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$. Minimizing MSE loss in regression is equivalent to Gaussian MLE.
import numpy as np

# Expectation, Variance, Covariance computations
rng = np.random.default_rng(0)

# Create correlated samples using Cholesky decomposition
mean = [2.0, 5.0]
cov_matrix = [[1.0, 0.8], [0.8, 2.0]]  # Covariance matrix
samples = rng.multivariate_normal(mean, cov_matrix, size=5000)

X, Y = samples[:, 0], samples[:, 1]

print("E[X]     =", X.mean().round(4),  "  (theoretical: 2.0)")
print("E[Y]     =", Y.mean().round(4),  "  (theoretical: 5.0)")
print("Var(X)   =", X.var().round(4),   "  (theoretical: 1.0)")
print("Var(Y)   =", Y.var().round(4),   "  (theoretical: 2.0)")
print("Cov(X,Y) =", np.cov(X, Y)[0,1].round(4), "  (theoretical: 0.8)")

# MLE for Gaussian parameters
data = rng.normal(loc=3.5, scale=2.0, size=10000)
mu_mle    = data.mean()
sigma_mle = data.std()   # numpy uses n denominator (MLE)
print(f"\nMLE estimates (true mu=3.5, sigma=2.0):")
print(f"  mu_hat    = {mu_mle:.4f}")
print(f"  sigma_hat = {sigma_mle:.4f}")

Law of Total Probability

The Law of Total Probability lets you compute $P(A)$ by breaking the sample space into exhaustive, mutually exclusive "scenarios" $B_1, B_2, \ldots, B_k$ (a partition of $\Omega$):

$$P(A) = \sum_{i=1}^{k} P(A \mid B_i)\, P(B_i) = P(A \mid B_1)P(B_1) + P(A \mid B_2)P(B_2) + \cdots + P(A \mid B_k)P(B_k)$$

The partition requirement means: $B_i \cap B_j = \emptyset$ for $i \neq j$ (mutually exclusive) and $B_1 \cup B_2 \cup \cdots \cup B_k = \Omega$ (exhaustive, i.e., covers all outcomes).

Example: Disease Testing

Let $A$ = test positive. Partition: $B_1$ = "has disease" ($P(B_1) = 0.01$) and $B_2$ = "no disease" ($P(B_2) = 0.99$).

$P(\text{positive} \mid \text{disease}) = 0.95$, $P(\text{positive} \mid \text{no disease}) = 0.05$ (false positive rate).

$P(\text{positive}) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$

This is the denominator $P(B)$ in Bayes' theorem. Without the Law of Total Probability, we couldn't compute posterior probabilities.

import numpy as np

# Law of Total Probability: compute P(test positive)
# Partition: disease status {sick, healthy}
priors = {'sick': 0.01, 'healthy': 0.99}
likelihoods = {
    'sick':    {'positive': 0.95, 'negative': 0.05},
    'healthy': {'positive': 0.05, 'negative': 0.95},
}

# P(positive) = sum over partition
p_positive = sum(priors[s] * likelihoods[s]['positive'] for s in priors)
print(f"P(test positive) = {p_positive:.4f}")  # 0.059

# Now use Bayes: P(sick | positive)
p_sick_given_positive = (likelihoods['sick']['positive'] * priors['sick']) / p_positive
print(f"P(sick | positive) = {p_sick_given_positive:.4f}")  # ~0.161 (16%)

Geometric Distribution

The geometric distribution models the number of trials needed to get the first success in a sequence of independent Bernoulli trials, each with success probability $p$.

$$P(X = k) = (1-p)^{k-1} \cdot p \quad \text{for } k = 1, 2, 3, \ldots$$

Read as: you fail $k-1$ times (probability $(1-p)^{k-1}$) then succeed on trial $k$ (probability $p$).

$$\mathbb{E}[X] = \frac{1}{p} \qquad \text{Var}(X) = \frac{1-p}{p^2}$$

| $p$ | $E[X] = 1/p$ | Interpretation | ML Example |
|---|---|---|---|
| 0.5 | 2 | Expect 2 tries on average | Binary search in hyperparameter tuning |
| 0.1 | 10 | Expect 10 tries on average | 10% of training samples are "informative" |
| 0.01 | 100 | Expect 100 tries on average | Rare event detection in anomaly detection |

Memoryless property: $P(X \gt s + t \mid X \gt s) = P(X \gt t)$. The geometric distribution has "no memory" — if you've already failed $s$ times, the expected number of additional trials is still $1/p$. In ML: this models scenarios where each attempt is independent regardless of history (e.g., random restarts of gradient descent).
import numpy as np
from scipy import stats

# Geometric distribution: trials until first success
p = 0.2  # 20% success rate per trial

# Theoretical
print(f"Geometric(p={p})")
print(f"  E[X] = 1/p = {1/p:.1f}")
print(f"  Var(X) = (1-p)/p^2 = {(1-p)/p**2:.1f}")

# P(X = k) for first few values
print("\n  k   P(X=k)")
for k in range(1, 8):
    prob = (1-p)**(k-1) * p
    print(f"  {k}   {prob:.4f}")

# Simulation
rng = np.random.default_rng(42)
trials = []
for _ in range(10000):
    k = 1
    while rng.random() > p:
        k += 1
    trials.append(k)
print(f"\nSimulated E[X] = {np.mean(trials):.3f}  (theoretical: {1/p:.1f})")
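The memoryless property can also be checked directly from the survival function $P(X \gt k) = (1-p)^k$:

```python
from scipy.stats import geom

# Memoryless property: P(X > s + t | X > s) should equal P(X > t)
p, s, t = 0.2, 5, 3

# geom.sf(k, p) is the survival function P(X > k) = (1-p)^k
lhs = geom.sf(s + t, p) / geom.sf(s, p)   # conditional probability
rhs = geom.sf(t, p)

print(f"P(X > {s+t} | X > {s}) = {lhs:.6f}")
print(f"P(X > {t})             = {rhs:.6f}")
print(f"(1-p)^t                = {(1 - p)**t:.6f}")
```

All three values coincide: having already failed $s$ times tells you nothing about how many more failures lie ahead.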

Hypergeometric Distribution

The hypergeometric distribution models drawing a sample without replacement from a population. Unlike the binomial (which assumes replacement and independent trials), hypergeometric captures dependence between draws.

Setup: population of $N$ items, $K$ of which are "successes". Draw $n$ items without replacement. Let $X$ = number of successes in the sample:

$$P(X = k) = \frac{\dbinom{K}{k} \dbinom{N-K}{n-k}}{\dbinom{N}{n}} \quad \text{for } k = \max(0, n-(N-K)) \ldots \min(n, K)$$
$$\mathbb{E}[X] = \frac{nK}{N} \qquad \text{Var}(X) = n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}$$

The $\frac{N-n}{N-1}$ factor is the finite population correction — variance is smaller than binomial because sampling without replacement reduces variability.

Example: Evaluating a Classifier Without Replacement

You have 1000 test examples: 200 positive, 800 negative. You pick 50 examples for manual review (without replacement). $N=1000$, $K=200$, $n=50$.

$E[\text{positives in sample}] = \frac{50 \times 200}{1000} = 10$ positives expected.

This is the math behind stratified sampling — ensuring each class is proportionally represented in your evaluation set.

import numpy as np
from scipy.stats import hypergeom

# Hypergeometric distribution
N = 1000  # population size
K = 200   # total successes in population
n = 50    # sample size

rv = hypergeom(N, K, n)  # scipy arg order: hypergeom(M=population, n=successes, N=draws)
mean = rv.mean()
var = rv.var()

print(f"Hypergeometric(N={N}, K={K}, n={n})")
print(f"  E[X] = {mean:.2f}  (theoretical: {n*K/N:.2f})")
print(f"  Var(X) = {var:.4f}")
print(f"  P(X=10) = {rv.pmf(10):.4f}")
print(f"  P(X>=15) = {1-rv.cdf(14):.4f}")

# Compare to binomial (with replacement)
from scipy.stats import binom
binom_rv = binom(n, K/N)
print(f"\nBinomial(n={n}, p={K/N}) for comparison:")
print(f"  E[X] = {binom_rv.mean():.2f}  (same expected value)")
print(f"  Var(X) = {binom_rv.var():.4f}  (higher than hypergeometric)")

Z-Score & Standardization

A Z-score (also called a standard score) expresses how many standard deviations an observation is from the mean of its distribution:

$$z = \frac{x - \mu}{\sigma}$$

Where $\mu$ is the population mean and $\sigma$ is the population standard deviation. For a sample: $z = (x - \bar{x}) / s$.

Standardization transforms an entire dataset so it has mean 0 and standard deviation 1 — every value becomes its Z-score. The resulting distribution is called the Standard Normal Distribution $\mathcal{N}(0,1)$.

What Z-scores tell you:
  • $z = 0$: the value equals the mean
  • $z = 1$: the value is 1 standard deviation above the mean
  • $z = -2$: the value is 2 standard deviations below the mean
  • $|z| \gt 3$: unusual — only ~0.3% of values if the data is normal (outlier threshold)

Using Z-tables: For $X \sim \mathcal{N}(\mu, \sigma^2)$, compute $z = (x - \mu)/\sigma$, then look up $P(Z \leq z) = \Phi(z)$ in the standard normal table. Example: $X \sim \mathcal{N}(70, 10^2)$, $P(X \leq 85) = P(Z \leq 1.5) = \Phi(1.5) \approx 0.9332$.

import numpy as np
from scipy import stats

# Z-score calculation
data = np.array([52, 65, 70, 73, 75, 78, 80, 85, 90, 95])

mu = data.mean()
sigma = data.std(ddof=0)  # population std

z_scores = (data - mu) / sigma
print("Original values:", data)
print(f"Mean: {mu:.2f}, Std: {sigma:.2f}")
print("Z-scores:", z_scores.round(2))

# Using scipy for standard normal probabilities
# P(X <= 85) where X ~ N(70, 10^2)
mu_dist, sigma_dist = 70, 10
x = 85
z = (x - mu_dist) / sigma_dist
prob = stats.norm.cdf(z)
print(f"\nFor X ~ N({mu_dist}, {sigma_dist}^2):")
print(f"  P(X <= {x}) = P(Z <= {z}) = {prob:.4f}")

# Standardization for ML (scikit-learn style)
rng = np.random.default_rng(42)
features = rng.normal(loc=[100, 0.5, 1000], scale=[20, 0.1, 300], size=(100, 3))
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
print(f"\nBefore standardization: mean = {features.mean(axis=0).round(2)}")
print(f"After standardization:  mean = {standardized.mean(axis=0).round(4)}")
print(f"After standardization:  std  = {standardized.std(axis=0).round(4)}")

T-Distribution

The t-distribution (Student's t) is used when the population standard deviation $\sigma$ is unknown and must be estimated from the sample. It has heavier tails than the normal distribution, reflecting extra uncertainty from estimating $\sigma$.

The t-distribution is parameterized by degrees of freedom $\nu = n - 1$ (where $n$ is sample size):

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \sim t(\nu) \qquad \text{where } s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$$
| $\nu$ (degrees of freedom) | Shape | When to use |
|---|---|---|
| $\nu = 1$ | Very heavy tails (Cauchy) | Almost never in practice |
| $\nu = 5$ | Noticeably heavier than normal | Very small samples ($n=6$) |
| $\nu = 30$ | Very close to normal | Most practical applications |
| $\nu \to \infty$ | Converges to $\mathcal{N}(0,1)$ | Large samples — use normal |

T-distribution in ML: Used in two-sample t-tests to compare model accuracy on different test sets. A paired t-test on cross-validation folds checks whether accuracy improvement is statistically significant or could be due to variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Compare tail probabilities of the normal and t distributions

# Heavier tails at x=3
print("Comparison of tail probabilities P(X > 3):")
print(f"  Normal:   {1 - stats.norm.cdf(3):.6f}")
print(f"  t(df=5):  {1 - stats.t.cdf(3, df=5):.6f}  <- much heavier tail")
print(f"  t(df=30): {1 - stats.t.cdf(3, df=30):.6f}")

# Two-sample t-test for comparing model accuracy
model_a_scores = rng.normal(0.82, 0.03, 10)  # 10 cross-val folds
model_b_scores = rng.normal(0.85, 0.03, 10)

t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"\nPaired t-test, two-sided (do the models' accuracies differ?):")
print(f"  t-statistic = {t_stat:.4f}")
print(f"  p-value     = {p_value:.4f}")
if p_value < 0.05:
    print("  Significant difference (p < 0.05)")
else:
    print("  No significant difference (p >= 0.05)")

Skewness & Kurtosis

Skewness measures the asymmetry of a probability distribution around its mean. Kurtosis measures the "tailedness" — how extreme the outliers are.

Skewness is the standardized third central moment:

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mathbb{E}[(X-\mu)^3]}{(\mathbb{E}[(X-\mu)^2])^{3/2}}$$
| Skewness | Shape | Mean vs Median | ML Example |
|---|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Mean = Median | Gaussian noise, weight initializations |
| $\gamma_1 \gt 0$ | Right-skewed (long right tail) | Mean > Median | Income data, response times, loss values |
| $\gamma_1 \lt 0$ | Left-skewed (long left tail) | Mean < Median | Accuracy distributions near ceiling |

Kurtosis is the standardized fourth central moment. Excess kurtosis is kurtosis minus 3 (so a normal distribution has excess kurtosis = 0):

$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = \frac{\mathbb{E}[(X-\mu)^4]}{(\mathbb{E}[(X-\mu)^2])^2} - 3$$
| Excess Kurtosis | Type | Tail behavior |
|---|---|---|
| $\gamma_2 = 0$ | Mesokurtic (normal) | Normal tails |
| $\gamma_2 \gt 0$ | Leptokurtic | Heavier tails, sharper peak — more outliers |
| $\gamma_2 \lt 0$ | Platykurtic | Lighter tails, flatter peak — fewer outliers |
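As a sanity check on the moment formulas, skewness and excess kurtosis can be computed directly from sample central moments and compared against `scipy.stats` (a quick sketch on a simulated right-skewed sample):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(0, 0.5, size=50_000)  # right-skewed sample

mu = x.mean()
m2 = np.mean((x - mu) ** 2)   # second central moment
m3 = np.mean((x - mu) ** 3)   # third central moment
m4 = np.mean((x - mu) ** 4)   # fourth central moment

skew_manual = m3 / m2 ** 1.5      # gamma_1 = mu_3 / sigma^3
kurt_manual = m4 / m2 ** 2 - 3    # gamma_2 = mu_4 / sigma^4 - 3 (excess)

print(f"skewness: manual = {skew_manual:.4f}, scipy = {stats.skew(x):.4f}")
print(f"excess kurtosis: manual = {kurt_manual:.4f}, scipy = {stats.kurtosis(x):.4f}")
```

With their default settings, `stats.skew` and `stats.kurtosis` use exactly these biased moment estimators, so the numbers agree to floating-point precision.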
ML relevance: Loss distributions and gradient distributions during training often have non-zero skewness and heavy tails (leptokurtic). This is why gradient clipping is used — large gradients from heavy-tailed distributions can destabilize training.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Symmetric (normal)
normal_data = rng.normal(0, 1, 10000)
# Right-skewed (log-normal)
lognormal_data = rng.lognormal(0, 1, 10000)
# Heavy-tailed (t with df=3)
heavy_data = stats.t.rvs(df=3, size=10000, random_state=42)

distributions = {
    'Normal(0,1)': normal_data,
    'LogNormal(0,1)': lognormal_data,
    't(df=3)': heavy_data,
}

print(f"{'Distribution':<20} {'Skewness':>10} {'Excess Kurt':>12} {'Type'}")
print("-" * 60)
for name, data in distributions.items():
    sk = stats.skew(data)
    ku = stats.kurtosis(data)  # returns excess kurtosis by default
    if abs(sk) < 0.1:
        sk_type = "symmetric"
    elif sk > 0:
        sk_type = "right-skewed"
    else:
        sk_type = "left-skewed"
    print(f"{name:<20} {sk:>10.3f} {ku:>12.3f}   {sk_type}")

Sensitivity & Specificity

These are probability-based metrics that evaluate a binary classifier's performance on each class separately. They come directly from the conditional probability framework.

Given a classifier with a binary outcome (Positive/Negative) and true labels, define:

  • TP (True Positives), FP (False Positives), TN (True Negatives), FN (False Negatives)
| Metric | Formula | Interpretation | Also called |
|---|---|---|---|
| Sensitivity | $P(\hat{Y}=+ \mid Y=+) = \frac{TP}{TP+FN}$ | Of all actual positives, how many did we catch? | Recall, True Positive Rate (TPR) |
| Specificity | $P(\hat{Y}=- \mid Y=-) = \frac{TN}{TN+FP}$ | Of all actual negatives, how many did we correctly reject? | True Negative Rate (TNR) |
| PPV | $P(Y=+ \mid \hat{Y}=+) = \frac{TP}{TP+FP}$ | Of all predicted positives, how many are truly positive? | Precision |
| NPV | $P(Y=- \mid \hat{Y}=-) = \frac{TN}{TN+FN}$ | Of all predicted negatives, how many are truly negative? | Negative Predictive Value |
Sensitivity-Specificity tradeoff: Increasing the decision threshold raises specificity but lowers sensitivity, and vice versa. The ROC curve plots this tradeoff. In medical diagnosis, high sensitivity is critical (don't miss cases) but may reduce specificity (more false alarms). The "optimal" operating point depends on the cost of FP vs FN.
import numpy as np

# Compute sensitivity, specificity from confusion matrix
y_true = np.array([1,1,1,1,1,0,0,0,0,0,1,0,1,0,1])
y_pred = np.array([1,1,0,1,1,0,0,1,0,0,1,0,0,0,1])

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = TP / (TP + FN)  # recall / TPR
specificity = TN / (TN + FP)  # TNR
ppv         = TP / (TP + FP)  # precision
npv         = TN / (TN + FN)  # negative predictive value

print(f"Confusion Matrix: TP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"\nSensitivity (Recall) = {sensitivity:.4f}  -- P(predict + | actually +)")
print(f"Specificity          = {specificity:.4f}  -- P(predict - | actually -)")
print(f"Precision (PPV)      = {ppv:.4f}  -- P(actually + | predict +)")
print(f"NPV                  = {npv:.4f}  -- P(actually - | predict -)")
print(f"\nAccuracy = {(TP+TN)/(TP+TN+FP+FN):.4f}")
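The sensitivity-specificity tradeoff can be made concrete by sweeping a decision threshold over classifier scores. A small illustrative sketch with simulated scores (not real model output) standing in for a classifier's probability estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated scores: positives centered higher than negatives
y_true = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.normal(0.7, 0.15, 500),   # positives
                         rng.normal(0.4, 0.15, 500)])  # negatives

results = {}
for thresh in [0.3, 0.5, 0.7]:
    y_pred = (scores >= thresh).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    results[thresh] = (sens, spec)
    print(f"threshold={thresh:.1f}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Raising the threshold trades sensitivity for specificity; plotting TPR against FPR over all thresholds gives the ROC curve.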

Decision Tree of Probability

A probability tree (decision tree diagram) is a visual method for tracing conditional branches and computing joint and total probabilities. Each path from root to leaf represents a sequence of outcomes.

Construction rules:

  1. Each branch shows one possible outcome with its probability
  2. Probabilities at each branch level sum to 1
  3. Path probability = product of branch probabilities (Multiplication Rule)
  4. Total probability for an outcome = sum of all paths leading to it (Addition Rule)
Probability Tree: Email Spam Classification
flowchart TD
    E["Email arrives\n(all emails)"] -->|"P(spam)=0.3"| S["Spam"]
    E -->|"P(ham)=0.7"| H["Ham"]
    S -->|"P(free|spam)=0.4"| SF["Spam + 'free'\nP = 0.3×0.4 = 0.12"]
    S -->|"P(no free|spam)=0.6"| SN["Spam, no 'free'\nP = 0.3×0.6 = 0.18"]
    H -->|"P(free|ham)=0.02"| HF["Ham + 'free'\nP = 0.7×0.02 = 0.014"]
    H -->|"P(no free|ham)=0.98"| HN["Ham, no 'free'\nP = 0.7×0.98 = 0.686"]

    style E fill:#132440,stroke:#132440,color:#fff
    style SF fill:#BF092F,stroke:#BF092F,color:#fff
    style SN fill:#BF092F,stroke:#BF092F,color:#fff
    style HF fill:#3B9797,stroke:#3B9797,color:#fff
    style HN fill:#3B9797,stroke:#3B9797,color:#fff
                            
Using the tree:
  • $P(\text{email contains 'free'}) = 0.12 + 0.014 = 0.134$ (sum of all paths with 'free')
  • $P(\text{spam} \mid \text{'free'}) = 0.12 / 0.134 \approx 0.896$ (Bayes' theorem from tree)
  • All four leaf probabilities sum to: $0.12 + 0.18 + 0.014 + 0.686 = 1.0$ ✓
import numpy as np

# Probability tree computation
# Email classification: spam(0.3) / ham(0.7), contains 'free' / no 'free'

# Structure: (branch_prob, conditions)
tree = {
    ('spam', 'free'):    0.3 * 0.40,  # 0.12
    ('spam', 'no_free'): 0.3 * 0.60,  # 0.18
    ('ham',  'free'):    0.7 * 0.02,  # 0.014
    ('ham',  'no_free'): 0.7 * 0.98,  # 0.686
}

print("Leaf probabilities:")
for path, prob in tree.items():
    print(f"  {path[0]:6s} + {path[1]:8s}: {prob:.4f}")

# Total probability of "email contains 'free'"
p_free = sum(p for (cls, word), p in tree.items() if word == 'free')
print(f"\nP('free' in email) = {p_free:.4f}")

# Bayes: P(spam | 'free')
p_spam_and_free = tree[('spam', 'free')]
p_spam_given_free = p_spam_and_free / p_free
print(f"P(spam | 'free') = {p_spam_and_free:.4f} / {p_free:.4f} = {p_spam_given_free:.4f}")

# Verify: all paths sum to 1
assert abs(sum(tree.values()) - 1.0) < 1e-10
print(f"\nAll paths sum to: {sum(tree.values()):.4f} ✓")

Practice Exercises

Exercise 1: Bayes
Medical Diagnosis

A test for a disease is 99% accurate: $P(\text{positive} \mid \text{sick}) = 0.99$ and $P(\text{negative} \mid \text{healthy}) = 0.99$. The disease affects 0.1% of the population. If you test positive, what is the actual probability you have the disease?

Show Answer

$P(\text{sick} \mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.00099 + 0.00999} = \frac{0.00099}{0.01098} \approx 9.0\%$

Despite a 99% accurate test, a positive result only means a 9% chance of actually being sick! This is because the disease is rare (base rate of 0.1%). This "base rate neglect" is a key insight in medical testing and classifier evaluation: even with high sensitivity and specificity, precision (PPV) stays low when the prior probability of the positive class is low. In ML: always check the class imbalance (prior probabilities) before interpreting classifier outputs.
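The arithmetic is easy to verify numerically; a quick sketch of the same Bayes computation:

```python
# Bayes' theorem for the medical test example
sens = 0.99    # P(positive | sick)
spec = 0.99    # P(negative | healthy)
prior = 0.001  # P(sick), the base rate

# Law of total probability: P(positive) over both sick and healthy paths
p_pos = sens * prior + (1 - spec) * (1 - prior)
p_sick_given_pos = sens * prior / p_pos
print(f"P(sick | positive) = {p_sick_given_pos:.4f}")  # roughly 0.09
```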

Exercise 2: Independence
Dropout and Independence

A neural network layer has 100 neurons. Dropout with $p = 0.5$ independently masks each neuron. (1) What is the expected number of active neurons? (2) What is the variance? (3) Why does the independence assumption matter here?

Show Answer

(1) Let $X_i \sim \text{Bernoulli}(0.5)$ indicate if neuron $i$ is active. By linearity: $\mathbb{E}[\sum X_i] = 100 \times 0.5 = 50$ neurons.

(2) Since neurons are masked independently, $\text{Var}(\sum X_i) = \sum \text{Var}(X_i) = 100 \times 0.5 \times 0.5 = 25$. Standard deviation = 5 neurons.

(3) Independence ensures dropout provides diverse training: different subnetworks see different random neuron combinations each step, preventing co-adaptation where neurons rely on each other.
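A simulation confirms the expectation and variance above (a quick sketch; "active" means the neuron survived dropout):

```python
import numpy as np

rng = np.random.default_rng(42)
# 100,000 independent dropout masks for a 100-neuron layer, keep probability 0.5
masks = rng.random((100_000, 100)) < 0.5
active_counts = masks.sum(axis=1)  # active neurons per mask ~ Binomial(100, 0.5)

print(f"Empirical mean:     {active_counts.mean():.2f}  (theory: 50)")
print(f"Empirical variance: {active_counts.var():.2f}  (theory: 25)")
```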

Conclusion & Next Steps

Probability theory is the language ML speaks. The key ideas from this part:

  • Kolmogorov axioms — all probability theory follows from three simple rules
  • Conditional probability $P(A|B) = P(A \cap B)/P(B)$ — restricting to what you know
  • Bayes' theorem — updating beliefs with evidence; the basis of Bayesian ML
  • Independence — simplifies computation; the "naive" assumption in Naive Bayes
  • Random variables — PMF/PDF/CDF describe distributions over possible values
  • Key distributions — Bernoulli, Binomial, Gaussian appear constantly in ML architectures and loss functions
  • MLE — finding parameters by maximizing the likelihood of observed data

Next in the Series

In Part 5: Statistics, we move from single probability distributions to the challenge of drawing conclusions from data samples — hypothesis testing, the Central Limit Theorem, confidence intervals, and the statistical tools behind rigorous ML model evaluation.