Complete Math + Probability + Statistics for ML/AI/DS Bootcamp

- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch (regression, Bayes, PCA)

Probability and ML
Every ML model expresses uncertainty. Logistic regression outputs a probability. Naive Bayes applies Bayes' theorem directly. Neural networks trained with cross-entropy loss implicitly model probability distributions. Bayesian optimization navigates uncertainty to find hyperparameters. This part builds the formal foundations. By the end you should understand:
- Why logistic regression's sigmoid output is interpreted as a probability
- How Naive Bayes derives its prediction rule from Bayes' theorem
- What "MLE" means and why it's equivalent to minimizing cross-entropy loss
- Why the Gaussian distribution appears everywhere in ML
- The difference between variance of a variable and variance of a model's predictions
Theoretical vs Empirical Probability
Before formalizing probability, it helps to understand the different ways we assign probabilities to events:
Theoretical (Classical) Probability, also called a priori probability, is based on reasoning about equally likely outcomes. You don't need to run an experiment:

$$P(A) = \frac{|A|}{|\Omega|}$$

The symbols:
- $\Omega$ (capital omega) — the sample space: the set of every possible outcome (e.g., $\Omega = \{1,2,3,4,5,6\}$ for a die roll)
- $A$ — an event: a specific subset of outcomes we care about (e.g., $A = \{2,4,6\}$ for "roll an even number")
- $|A|$ — the cardinality of set $A$: simply how many outcomes are in it ($|A| = 3$ for the even-number event above)
- $|\Omega|$ — the cardinality of the sample space: the total count of all possible outcomes ($|\Omega| = 6$ for a die)
- $P(A)$ — the probability of event $A$: a number in $[0, 1]$ measuring how likely $A$ is
Example: the probability of rolling a 4 on a fair die = $1/6$, because each of the 6 faces is equally likely.
Empirical (Experimental/Frequentist) Probability is based on actual observations from repeated trials:

$$P(A) \approx \frac{\text{number of trials where } A \text{ occurred}}{\text{total number of trials}}$$
Example: you flip a coin 1,000 times and get heads 503 times. The empirical probability of heads $\approx 503/1000 = 0.503$.
Subjective Probability is based on personal belief or expert judgment when neither equally-likely outcomes nor large datasets are available. Examples: a doctor's estimate of a patient's recovery probability, or an analyst's forecast.
```python
import numpy as np

# Theoretical vs empirical probability of rolling even on a fair die
rng = np.random.default_rng(42)
theoretical = 3/6  # 2, 4, 6 are even

ns = [10, 100, 1000, 10000, 100000]
print(f"Theoretical P(even) = {theoretical:.4f}")
print("-" * 40)
for n in ns:
    rolls = rng.integers(1, 7, size=n)
    empirical = (rolls % 2 == 0).mean()
    print(f"n={n:>7,}: empirical = {empirical:.4f} (error = {abs(empirical - theoretical):.4f})")
# Law of Large Numbers: the error shrinks as n grows
```
Probability Space & Axioms
Sample Space & Events
A probability space consists of three components $(\Omega, \mathcal{F}, P)$:
- $\Omega$ (Omega): the sample space — the set of all possible outcomes
- $\mathcal{F}$: a sigma-algebra — a collection of measurable subsets of $\Omega$ called events
- $P$: a probability measure — a function assigning a number $P(A) \in [0,1]$ to each event $A \in \mathcal{F}$
For ML purposes: $\Omega$ is the set of all possible observations (e.g., all possible images), an event $A \subseteq \Omega$ is a property (e.g., "this image contains a cat"), and $P(A)$ is the fraction of the data distribution with that property.
Kolmogorov Axioms
All of probability theory follows from three axioms (Kolmogorov, 1933):

1. Non-negativity: $P(A) \geq 0$ for every event $A \in \mathcal{F}$
2. Normalization: $P(\Omega) = 1$
3. Countable additivity: for mutually exclusive events $A_1, A_2, \ldots$, $P\left(\bigcup_i A_i\right) = \sum_i P(A_i)$

From these three axioms, all probability theorems follow. Key derived results:

- $P(\emptyset) = 0$
- Complement rule: $P(A^c) = 1 - P(A)$
- Monotonicity: if $A \subseteq B$ then $P(A) \leq P(B)$, so every probability is at most 1
- Inclusion-exclusion: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
Mutually Exclusive vs Non-Mutually Exclusive Events
Two events are mutually exclusive (also called disjoint) if they cannot both occur in the same trial: $A \cap B = \emptyset$, so $P(A \cap B) = 0$.
Two events are non-mutually exclusive if they can both occur: $P(A \cap B) \gt 0$.
| Feature | Mutually Exclusive | Non-Mutually Exclusive |
|---|---|---|
| Can both occur? | No | Yes |
| $P(A \cap B)$ | $0$ | $\gt 0$ |
| Example | Rolling 1 and 6 simultaneously | Rolling even and rolling > 3 |
| ML example | A token is label "cat" or "dog" (not both) | Email is spam AND contains "free" |
Addition Rule of Probability
The addition rule calculates $P(A \cup B)$, the probability that at least one of $A$ or $B$ occurs.

General Addition Rule (for any two events):

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

The $P(A \cap B)$ term is subtracted because the overlap region gets counted twice when you add $P(A)$ and $P(B)$. This is the Inclusion-Exclusion Principle from set theory applied to probability.

Special case for Mutually Exclusive Events ($P(A \cap B) = 0$):

$$P(A \cup B) = P(A) + P(B)$$
Addition Rule in ML: Multi-Label Classification
In a dataset, 35% of images contain cats ($P(C) = 0.35$), 28% contain dogs ($P(D) = 0.28$), and 10% contain both ($P(C \cap D) = 0.10$). What fraction contain at least one pet?
$P(C \cup D) = 0.35 + 0.28 - 0.10 = 0.53$ — 53% of images contain a cat or dog or both.
For 3+ events: $P(A\cup B\cup C) = P(A)+P(B)+P(C)-P(A\cap B)-P(A\cap C)-P(B\cap C)+P(A\cap B\cap C)$
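The addition rule can also be verified by simulation. The two die events below ("even" and "greater than 3", the same pair used in the earlier table) are illustrative:

```python
import numpy as np

# Monte Carlo check of the addition rule on a fair die
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)

even = (rolls % 2 == 0)
gt3 = (rolls > 3)

p_even = even.mean()          # about 3/6
p_gt3 = gt3.mean()            # about 3/6
p_both = (even & gt3).mean()  # about 2/6 (rolls of 4 or 6)
p_union = (even | gt3).mean() # at least one of the two events

# Inclusion-exclusion: the two estimates agree on the same sample
print(f"P(even ∪ >3) directly:       {p_union:.4f}")
print(f"P(even)+P(>3)-P(even ∩ >3): {p_even + p_gt3 - p_both:.4f}")
```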
Conditional Probability
Conditional probability asks: "What is the probability of $B$ given that $A$ has already happened?" Instead of looking at the full sample space $\Omega$, we restrict our view to only the outcomes inside $A$.
Imagine the sample space as a rectangle (total probability = 1). $A$ is a region inside it, and $A \cap B$ is the overlap where both events occur:
When we condition on $A$, we shrink the universe to just $A$. Inside that shrunken universe, only $A \cap B$ counts as "$B$ happening."
- The new "total probability" (the whole universe, now restricted to $A$) $= P(A)$
- The part of that restricted universe where $B$ also occurs $= P(A \cap B)$
- So $P(B \mid A)$ is simply the fraction of $A$ that is also $B$
Deriving the formula: within the restricted universe of $A$, we redefine probability as the ratio of two measures:

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

This is the formal definition of conditional probability. The labels $A$ and $B$ can be swapped, since the logic is identical, giving the symmetric form:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
This is the foundation of all probabilistic reasoning in ML — restrict the sample space to the known event, then measure how much of that restricted space satisfies the target event.
Multiplication Rule of Probability
The multiplication rule calculates $P(A \cap B)$, the probability that both $A$ and $B$ occur.

General Multiplication Rule (for any two events, using conditional probability):

$$P(A \cap B) = P(A) \cdot P(B \mid A) = P(B) \cdot P(A \mid B)$$

This is also called the chain rule for two events. It extends naturally to $n$ events:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})$$

Special case for Independent Events:

$$P(A \cap B) = P(A) \cdot P(B)$$
Dependent vs Independent Events
Two events are independent if the occurrence of one gives no information about the other: $P(A \mid B) = P(A)$, equivalently $P(A \cap B) = P(A) \cdot P(B)$.
Two events are dependent if knowing one changes the probability of the other: $P(A \mid B) \neq P(A)$.
| | Independent | Dependent |
|---|---|---|
| Intuition | Knowing $B$ doesn't help predict $A$ | Knowing $B$ changes our estimate of $A$ |
| Multiplication | $P(A \cap B) = P(A) \cdot P(B)$ | $P(A \cap B) = P(A) \cdot P(B|A)$ |
| Example | Two separate coin flips | Drawing two cards without replacement |
| ML example | Naive Bayes feature assumption | Sequential tokens in language models |
```python
import numpy as np

rng = np.random.default_rng(42)
n = 100000

# Independent events: two separate coin flips
coin1 = rng.integers(0, 2, n)  # 0=tails, 1=heads
coin2 = rng.integers(0, 2, n)
p_head1 = coin1.mean()
p_head2 = coin2.mean()
p_both_heads = (coin1 & coin2).mean()
print("Independent events (two coins):")
print(f"  P(H1) = {p_head1:.4f} (theoretical: 0.5)")
print(f"  P(H2) = {p_head2:.4f} (theoretical: 0.5)")
print(f"  P(H1 ∩ H2) = {p_both_heads:.4f} (theoretical: 0.25)")
print(f"  P(H1)*P(H2) = {p_head1*p_head2:.4f} ← matches, confirming independence")

# Dependent events: draw cards without replacement
# P(2nd card is ace | 1st was ace) ≠ P(2nd card is ace)
p_first_ace = 4 / 52
p_second_ace_given_first = 3 / 51      # one ace removed
p_second_ace_given_not_first = 4 / 51  # still 4 aces in 51 cards
p_second_ace = (p_second_ace_given_first * p_first_ace
                + p_second_ace_given_not_first * (1 - p_first_ace))
print("\nDependent events (cards without replacement):")
print(f"  P(2nd ace | 1st was ace) = {p_second_ace_given_first:.4f}")
print(f"  P(2nd ace | 1st not ace) = {p_second_ace_given_not_first:.4f}")
print(f"  P(2nd ace overall) = {p_second_ace:.4f}")
print("  Different conditional probabilities → DEPENDENT")
```
Bayes' Theorem
Rearranging the definition of conditional probability gives Bayes' theorem. Here is the step-by-step derivation:
Step 1: write both conditional probability definitions. The definition holds regardless of which event is in the condition:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Both definitions share the same numerator: the joint probability $P(A \cap B)$.
Step 2: isolate $P(A \cap B)$ from the right-hand definition. Multiply both sides by $P(A)$:

$$P(A \cap B) = P(B \mid A) \cdot P(A)$$
Step 3: substitute into the left-hand definition. Replace $P(A \cap B)$ in the numerator of the first equation:

$$P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}$$
The joint term $P(A \cap B)$ disappears as an intermediate. The result expresses the reverse conditional $P(A \mid B)$ purely in terms of the forward conditional $P(B \mid A)$ — the core power of Bayes' theorem: we can flip the direction of conditioning.
In Bayesian ML terminology we rename the symbols: $A \to H$ (the hypothesis, the model or class we want to know about) and $B \to D$ (the data, what we actually observed). The formula becomes:

$$P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}$$
Each of the four terms has a precise meaning and a natural intuition:
| Term | Notation | Plain English | Medical test analogy | Spam filter analogy |
|---|---|---|---|---|
| Prior | $P(H)$ | What you believe before seeing any data — your starting belief about the hypothesis. | 1 in 1,000 people have the disease — the base-rate before any test. | 30% of all emails are spam — the base-rate before reading the email. |
| Likelihood | $P(D \mid H)$ | How probable is this data if the hypothesis is true? This is what the model or generative process predicts. | The test returns positive 99% of the time when the patient truly has the disease. | The word "prize" appears in 40% of actual spam emails. |
| Evidence | $P(D)$ | The total probability of observing this data under all hypotheses — a normalising constant that ensures the posterior sums to 1. | Overall rate of a positive test result in the whole population (sick + healthy). | Overall frequency of "prize" in all emails (spam + ham). |
| Posterior | $P(H \mid D)$ | Your updated belief after seeing the data — what you believe now, combining prior knowledge with new evidence. | Given a positive test, what is the actual probability the patient is ill? | Given the word "prize", what is the probability the email is spam? |
Bayes' theorem is a belief-update machine: start with your prior belief $P(H)$, multiply by how well the data fits that belief $P(D \mid H)$, and rescale by $P(D)$ so everything still adds up to 1. The output is your posterior $P(H \mid D)$ — a sharper, data-informed belief. In ML the prior encodes what the model knew before training, the likelihood is the training objective, and the posterior is what the model knows after seeing the data.
Naive Bayes Spam Classification
Given email containing word "prize", is it spam? Using Bayes' theorem:
$$P(\text{spam} \mid \text{"prize"}) = \frac{P(\text{"prize"} \mid \text{spam}) \cdot P(\text{spam})}{P(\text{"prize"})}$$

Suppose: $P(\text{spam}) = 0.3$ (30% of emails are spam), $P(\text{"prize"} \mid \text{spam}) = 0.4$, $P(\text{"prize"} \mid \text{not spam}) = 0.01$.
$P(\text{"prize"}) = 0.4 \times 0.3 + 0.01 \times 0.7 = 0.12 + 0.007 = 0.127$
$P(\text{spam} \mid \text{"prize"}) = \frac{0.4 \times 0.3}{0.127} \approx 0.945$ — 94.5% probability of spam.
Naive Bayes extends this to multiple words by assuming conditional independence (the "naive" assumption), multiplying individual $P(w_i \mid \text{class})$ probabilities together.
```python
# Bayes' theorem applied to spam classification
def bayes_update(prior_spam, p_word_given_spam, p_word_given_ham):
    """Apply Bayes' theorem to update spam probability given a word."""
    prior_ham = 1 - prior_spam
    # Law of total probability
    p_word = p_word_given_spam * prior_spam + p_word_given_ham * prior_ham
    # Bayes' theorem
    posterior_spam = (p_word_given_spam * prior_spam) / p_word
    return posterior_spam

# Single word update
prior = 0.3
posterior = bayes_update(prior, p_word_given_spam=0.4, p_word_given_ham=0.01)
print(f"Prior P(spam) = {prior:.2f}")
print(f"After seeing 'prize': P(spam|'prize') = {posterior:.4f}")

# Naive Bayes: multiple words (conditional independence assumption)
words = [
    ("prize", 0.40, 0.01),
    ("free", 0.35, 0.02),
    ("winner", 0.30, 0.005),
    ("claim", 0.25, 0.01),
]
p = 0.3  # prior
for word, p_s, p_h in words:
    p = bayes_update(p, p_s, p_h)
    print(f"After '{word:8s}': P(spam) = {p:.6f}")
```
Independence & the Chain Rule
Events $A$ and $B$ are independent if knowing $B$ tells you nothing about $A$:

$$P(A \mid B) = P(A) \quad \Longleftrightarrow \quad P(A \cap B) = P(A) \cdot P(B)$$

The chain rule of probability expresses joint probabilities as products of conditionals:

$$P(A_1, A_2, \ldots, A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$$
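A minimal sketch of the chain rule: the probability that the first three cards drawn from a standard deck are all aces, factored as a product of conditionals and cross-checked by a direct counting argument:

```python
from math import comb

# Chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) * P(A2|A1) * P(A3|A1,A2)
# Ai = "the i-th card drawn (without replacement) is an ace"
p_chain = (4/52) * (3/51) * (2/50)

# Cross-check: choose 3 of the 4 aces out of all 3-card hands
p_count = comb(4, 3) / comb(52, 3)

print(f"Chain rule: {p_chain:.6f}")
print(f"Counting:   {p_count:.6f}")
```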
Random Variables
A random variable $X$ is a function $X: \Omega \to \mathbb{R}$ that maps outcomes to real numbers. It "converts" qualitative outcomes into numbers we can compute with.
Discrete Random Variables: PMF & CDF
A discrete RV takes countable values. Its distribution is described by the probability mass function (PMF):

$$p_X(x) = P(X = x), \quad \text{with } \sum_x p_X(x) = 1$$

The cumulative distribution function (CDF) accumulates probability up to $x$:

$$F_X(x) = P(X \leq x) = \sum_{t \leq x} p_X(t)$$
Continuous Random Variables: PDF & CDF
A continuous RV takes values in an interval. Its distribution is described by the probability density function (PDF) $f_X(x)$, where:

$$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx, \qquad F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\,dt$$
Note: $P(X = x) = 0$ for any single point in the continuous case.
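A quick illustration of PMF vs PDF using SciPy's distribution objects; Binomial(10, 0.3) and the standard normal are chosen as examples:

```python
from scipy import stats

# Discrete RV: Binomial(10, 0.3) has a PMF and a step-shaped CDF
X = stats.binom(10, 0.3)
print(f"PMF: P(X=3)  = {X.pmf(3):.4f}")
print(f"CDF: P(X<=3) = {X.cdf(3):.4f}")

# Continuous RV: N(0,1) has a density; single points carry zero probability
Z = stats.norm(0, 1)
print(f"PDF value at 0 (a density, not a probability): {Z.pdf(0):.4f}")
print(f"P(Z=0) = 0 exactly; P(-1 <= Z <= 1) = {Z.cdf(1) - Z.cdf(-1):.4f}")
```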
Key Probability Distributions
Bernoulli Distribution — Binary Outcomes
$X \sim \text{Bernoulli}(p)$: a single trial with success probability $p$.
$$P(X=1) = p, \quad P(X=0) = 1-p, \quad \mathbb{E}[X] = p, \quad \text{Var}(X) = p(1-p)$$

ML use: Binary classification output layer (sigmoid), coin-flip random variables, click prediction.
Binomial Distribution — Sum of Bernoullis
$X \sim \text{Binomial}(n, p)$: number of successes in $n$ independent Bernoulli trials.
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \mathbb{E}[X] = np, \quad \text{Var}(X) = np(1-p)$$

ML use: Testing significance of accuracy differences; dropout (each neuron dropped with probability $p$) follows a binomial structure.
Gaussian Distribution — The Bell Curve
$X \sim \mathcal{N}(\mu, \sigma^2)$: described by mean $\mu$ and variance $\sigma^2$.
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad \mathbb{E}[X] = \mu, \quad \text{Var}(X) = \sigma^2$$

ML use: Weight initialization (e.g., Xavier/He init), Gaussian noise in data augmentation, residuals in linear regression (assumption for OLS), L2 regularization is equivalent to a Gaussian prior on weights. By the CLT, sample means are approximately Gaussian.
Uniform Distribution
$X \sim \text{Uniform}(a, b)$: all values in $[a, b]$ are equally likely.
$$f(x) = \frac{1}{b-a}, \quad \mathbb{E}[X] = \frac{a+b}{2}, \quad \text{Var}(X) = \frac{(b-a)^2}{12}$$

ML use: Hyperparameter sampling in random search, weight initialization in some variants, representing maximum-entropy priors.
```python
import numpy as np

# Simulate key distributions and compute statistics
rng = np.random.default_rng(42)

# Bernoulli(p=0.7)
p = 0.7
bernoulli_samples = rng.binomial(1, p, size=10000)
print(f"Bernoulli(p=0.7)")
print(f"  Theoretical: E[X]={p:.2f}, Var={p*(1-p):.4f}")
print(f"  Empirical:   E[X]={bernoulli_samples.mean():.4f}, Var={bernoulli_samples.var():.4f}")

# Binomial(n=10, p=0.3)
n, p = 10, 0.3
binom_samples = rng.binomial(n, p, size=10000)
print(f"\nBinomial(n=10, p=0.3)")
print(f"  Theoretical: E[X]={n*p:.2f}, Var={n*p*(1-p):.4f}")
print(f"  Empirical:   E[X]={binom_samples.mean():.4f}, Var={binom_samples.var():.4f}")

# Gaussian(mu=0, sigma=1)
mu, sigma = 0, 1
normal_samples = rng.normal(mu, sigma, size=10000)
print(f"\nGaussian(mu=0, sigma=1)")
print(f"  Theoretical: E[X]={mu:.2f}, Var={sigma**2:.4f}")
print(f"  Empirical:   E[X]={normal_samples.mean():.4f}, Var={normal_samples.var():.4f}")
```
Expectation, Variance & Covariance
The expectation (mean) of a random variable is its probability-weighted average:

$$\mathbb{E}[X] = \sum_x x\,p_X(x) \;\;\text{(discrete)}, \qquad \mathbb{E}[X] = \int_{-\infty}^{\infty} x\,f_X(x)\,dx \;\;\text{(continuous)}$$

Linearity of expectation (holds even for dependent variables):

$$\mathbb{E}[aX + bY] = a\,\mathbb{E}[X] + b\,\mathbb{E}[Y]$$

The variance measures spread around the mean:

$$\text{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$

Covariance measures how two variables move together:

$$\text{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big]$$
If $X \perp Y$ (independent), then $\text{Cov}(X,Y) = 0$ and $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$.
```python
import numpy as np

# Expectation, Variance, Covariance computations
rng = np.random.default_rng(0)

# Draw correlated samples from a bivariate normal distribution
mean = [2.0, 5.0]
cov_matrix = [[1.0, 0.8], [0.8, 2.0]]  # covariance matrix
samples = rng.multivariate_normal(mean, cov_matrix, size=5000)
X, Y = samples[:, 0], samples[:, 1]
print("E[X]     =", X.mean().round(4), " (theoretical: 2.0)")
print("E[Y]     =", Y.mean().round(4), " (theoretical: 5.0)")
print("Var(X)   =", X.var().round(4), " (theoretical: 1.0)")
print("Var(Y)   =", Y.var().round(4), " (theoretical: 2.0)")
print("Cov(X,Y) =", np.cov(X, Y)[0, 1].round(4), " (theoretical: 0.8)")

# MLE for Gaussian parameters
data = rng.normal(loc=3.5, scale=2.0, size=10000)
mu_mle = data.mean()
sigma_mle = data.std()  # NumPy defaults to the n denominator (the MLE)
print(f"\nMLE estimates (true mu=3.5, sigma=2.0):")
print(f"  mu_hat    = {mu_mle:.4f}")
print(f"  sigma_hat = {sigma_mle:.4f}")
```
Law of Total Probability
The Law of Total Probability lets you compute $P(A)$ by breaking the sample space into exhaustive, mutually exclusive "scenarios" $B_1, B_2, \ldots, B_k$ (a partition of $\Omega$):

$$P(A) = \sum_{i=1}^{k} P(A \mid B_i)\,P(B_i)$$
The partition requirement means: $B_i \cap B_j = \emptyset$ for $i \neq j$ (mutually exclusive) and $B_1 \cup B_2 \cup \cdots \cup B_k = \Omega$ (exhaustive, i.e., covers all outcomes).
Example: Disease Testing
Let $A$ = test positive. Partition: $B_1$ = "has disease" ($P(B_1) = 0.01$) and $B_2$ = "no disease" ($P(B_2) = 0.99$).
$P(\text{positive} \mid \text{disease}) = 0.95$, $P(\text{positive} \mid \text{no disease}) = 0.05$ (false positive rate).
$P(\text{positive}) = 0.95 \times 0.01 + 0.05 \times 0.99 = 0.0095 + 0.0495 = 0.059$
This is the denominator $P(B)$ in Bayes' theorem. Without the Law of Total Probability, we couldn't compute posterior probabilities.
```python
# Law of Total Probability: compute P(test positive)
# Partition: disease status {sick, healthy}
priors = {'sick': 0.01, 'healthy': 0.99}
likelihoods = {
    'sick': {'positive': 0.95, 'negative': 0.05},
    'healthy': {'positive': 0.05, 'negative': 0.95},
}

# P(positive) = sum over the partition
p_positive = sum(priors[s] * likelihoods[s]['positive'] for s in priors)
print(f"P(test positive) = {p_positive:.4f}")  # 0.059

# Now use Bayes: P(sick | positive)
p_sick_given_positive = (likelihoods['sick']['positive'] * priors['sick']) / p_positive
print(f"P(sick | positive) = {p_sick_given_positive:.4f}")  # ~0.161 (16%)
```
Geometric Distribution
The geometric distribution models the number of trials needed to get the first success in a sequence of independent Bernoulli trials, each with success probability $p$:

$$P(X = k) = (1-p)^{k-1}\,p, \quad \mathbb{E}[X] = \frac{1}{p}, \quad \text{Var}(X) = \frac{1-p}{p^2}$$

Read as: you fail $k-1$ times (probability $(1-p)^{k-1}$) then succeed on trial $k$ (probability $p$).
| $p$ | $E[X] = 1/p$ | Interpretation | ML Example |
|---|---|---|---|
| 0.5 | 2 | Expect 2 tries on average | Binary search in hyperparameter tuning |
| 0.1 | 10 | Expect 10 tries on average | 10% of training samples are "informative" |
| 0.01 | 100 | Expect 100 tries on average | Rare event detection in anomaly detection |
```python
import numpy as np

# Geometric distribution: trials until first success
p = 0.2  # 20% success rate per trial

# Theoretical
print(f"Geometric(p={p})")
print(f"  E[X] = 1/p = {1/p:.1f}")
print(f"  Var(X) = (1-p)/p^2 = {(1-p)/p**2:.1f}")

# P(X = k) for the first few values
print("\n  k    P(X=k)")
for k in range(1, 8):
    prob = (1 - p)**(k - 1) * p
    print(f"  {k}    {prob:.4f}")

# Simulation
rng = np.random.default_rng(42)
trials = []
for _ in range(10000):
    k = 1
    while rng.random() > p:
        k += 1
    trials.append(k)
print(f"\nSimulated E[X] = {np.mean(trials):.3f} (theoretical: {1/p:.1f})")
```
Hypergeometric Distribution
The hypergeometric distribution models drawing a sample without replacement from a population. Unlike the binomial (which assumes replacement and independent trials), hypergeometric captures dependence between draws.
Setup: population of $N$ items, $K$ of which are "successes". Draw $n$ items without replacement. Let $X$ = number of successes in the sample:

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}, \quad \mathbb{E}[X] = \frac{nK}{N}, \quad \text{Var}(X) = n\,\frac{K}{N}\left(1 - \frac{K}{N}\right)\frac{N-n}{N-1}$$

The $\frac{N-n}{N-1}$ factor is the finite population correction: variance is smaller than binomial because sampling without replacement reduces variability.
Example: Evaluating a Classifier Without Replacement
You have 1000 test examples: 200 positive, 800 negative. You pick 50 examples for manual review (without replacement). $N=1000$, $K=200$, $n=50$.
$E[\text{positives in sample}] = \frac{50 \times 200}{1000} = 10$ positives expected.
This is the math behind stratified sampling — ensuring each class is proportionally represented in your evaluation set.
```python
from scipy.stats import binom, hypergeom

# Hypergeometric distribution
N = 1000  # population size
K = 200   # total successes in population
n = 50    # sample size

rv = hypergeom(N, K, n)  # scipy's parameter order: (population, successes, draws)
mean = rv.mean()
var = rv.var()
print(f"Hypergeometric(N={N}, K={K}, n={n})")
print(f"  E[X] = {mean:.2f} (theoretical: {n*K/N:.2f})")
print(f"  Var(X) = {var:.4f}")
print(f"  P(X=10) = {rv.pmf(10):.4f}")
print(f"  P(X>=15) = {1 - rv.cdf(14):.4f}")

# Compare to binomial (with replacement)
binom_rv = binom(n, K/N)
print(f"\nBinomial(n={n}, p={K/N}) for comparison:")
print(f"  E[X] = {binom_rv.mean():.2f} (same expected value)")
print(f"  Var(X) = {binom_rv.var():.4f} (higher than hypergeometric)")
```
Z-Score & Standardization
A Z-score (also called a standard score) expresses how many standard deviations an observation is from the mean of its distribution:

$$z = \frac{x - \mu}{\sigma}$$
Where $\mu$ is the population mean and $\sigma$ is the population standard deviation. For a sample: $z = (x - \bar{x}) / s$.
Standardization transforms an entire dataset so it has mean 0 and standard deviation 1 — every value becomes its Z-score. The resulting distribution is called the Standard Normal Distribution $\mathcal{N}(0,1)$.
- $z = 0$: the value equals the mean
- $z = 1$: the value is 1 standard deviation above the mean
- $z = -2$: the value is 2 standard deviations below the mean
- $|z| \gt 3$: unusual — only ~0.3% of values if the data is normal (outlier threshold)
Using Z-tables: For $X \sim \mathcal{N}(\mu, \sigma^2)$, compute $z = (x - \mu)/\sigma$, then look up $P(Z \leq z) = \Phi(z)$ in the standard normal table. Example: $X \sim \mathcal{N}(70, 10^2)$, $P(X \leq 85) = P(Z \leq 1.5) = \Phi(1.5) \approx 0.9332$.
```python
import numpy as np
from scipy import stats

# Z-score calculation
data = np.array([52, 65, 70, 73, 75, 78, 80, 85, 90, 95])
mu = data.mean()
sigma = data.std(ddof=0)  # population std
z_scores = (data - mu) / sigma
print("Original values:", data)
print(f"Mean: {mu:.2f}, Std: {sigma:.2f}")
print("Z-scores:", z_scores.round(2))

# Using scipy for standard normal probabilities
# P(X <= 85) where X ~ N(70, 10^2)
mu_dist, sigma_dist = 70, 10
x = 85
z = (x - mu_dist) / sigma_dist
prob = stats.norm.cdf(z)
print(f"\nFor X ~ N({mu_dist}, {sigma_dist}^2):")
print(f"  P(X <= {x}) = P(Z <= {z}) = {prob:.4f}")

# Standardization for ML (scikit-learn style)
rng = np.random.default_rng(42)
features = rng.normal(loc=[100, 0.5, 1000], scale=[20, 0.1, 300], size=(100, 3))
standardized = (features - features.mean(axis=0)) / features.std(axis=0)
print(f"\nBefore standardization: mean = {features.mean(axis=0).round(2)}")
print(f"After standardization:  mean = {standardized.mean(axis=0).round(4)}")
print(f"After standardization:  std  = {standardized.std(axis=0).round(4)}")
```
T-Distribution
The t-distribution (Student's t) is used when the population standard deviation $\sigma$ is unknown and must be estimated from the sample. It has heavier tails than the normal distribution, reflecting extra uncertainty from estimating $\sigma$.
The t-distribution is parameterized by degrees of freedom $\nu = n - 1$ (where $n$ is sample size):
| $\nu$ (degrees of freedom) | Shape | When to use |
|---|---|---|
| $\nu = 1$ | Very heavy tails (Cauchy) | Almost never in practice |
| $\nu = 5$ | Noticeably heavier than normal | Very small samples ($n=6$) |
| $\nu = 30$ | Very close to normal | Most practical applications |
| $\nu \to \infty$ | Converges to $\mathcal{N}(0,1)$ | Large samples — use normal |
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# The t-distribution has heavier tails than the normal, e.g. at x = 3
print("Comparison of tail probabilities P(X > 3):")
print(f"  Normal:   {1 - stats.norm.cdf(3):.6f}")
print(f"  t(df=5):  {1 - stats.t.cdf(3, df=5):.6f} <- much heavier tail")
print(f"  t(df=30): {1 - stats.t.cdf(3, df=30):.6f}")

# Paired t-test for comparing model accuracy
model_a_scores = rng.normal(0.82, 0.03, 10)  # 10 cross-val folds
model_b_scores = rng.normal(0.85, 0.03, 10)
t_stat, p_value = stats.ttest_rel(model_a_scores, model_b_scores)
print(f"\nPaired t-test (is Model B significantly better?):")
print(f"  t-statistic = {t_stat:.4f}")
print(f"  p-value = {p_value:.4f}")
if p_value < 0.05:
    print("  Significant difference (p < 0.05)")
else:
    print("  No significant difference (p >= 0.05)")
```
Skewness & Kurtosis
Skewness measures the asymmetry of a probability distribution around its mean. Kurtosis measures the "tailedness" — how extreme the outliers are.
Skewness is the standardized third central moment:

$$\gamma_1 = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right]$$
| Skewness | Shape | Mean vs Median | ML Example |
|---|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Mean = Median | Gaussian noise, weight initializations |
| $\gamma_1 \gt 0$ | Right-skewed (long right tail) | Mean > Median | Income data, response times, loss values |
| $\gamma_1 \lt 0$ | Left-skewed (long left tail) | Mean < Median | Accuracy distributions near ceiling |
Kurtosis is the standardized fourth central moment. Excess kurtosis is kurtosis minus 3 (so a normal distribution has excess kurtosis = 0):

$$\gamma_2 = \mathbb{E}\!\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right] - 3$$
| Excess Kurtosis | Type | Tail behavior |
|---|---|---|
| $\gamma_2 = 0$ | Mesokurtic (normal) | Normal tails |
| $\gamma_2 \gt 0$ | Leptokurtic | Heavier tails, sharper peak — more outliers |
| $\gamma_2 \lt 0$ | Platykurtic | Lighter tails, flatter peak — fewer outliers |
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Symmetric (normal)
normal_data = rng.normal(0, 1, 10000)
# Right-skewed (log-normal)
lognormal_data = rng.lognormal(0, 1, 10000)
# Heavy-tailed (t with df=3)
heavy_data = stats.t.rvs(df=3, size=10000, random_state=42)

distributions = {
    'Normal(0,1)': normal_data,
    'LogNormal(0,1)': lognormal_data,
    't(df=3)': heavy_data,
}

print(f"{'Distribution':<20} {'Skewness':>10} {'Excess Kurt':>12} {'Type'}")
print("-" * 60)
for name, data in distributions.items():
    sk = stats.skew(data)
    ku = stats.kurtosis(data)  # returns excess kurtosis by default
    if abs(sk) < 0.1:
        sk_type = "symmetric"
    elif sk > 0:
        sk_type = "right-skewed"
    else:
        sk_type = "left-skewed"
    print(f"{name:<20} {sk:>10.3f} {ku:>12.3f} {sk_type}")
```
Sensitivity & Specificity
These are probability-based metrics that evaluate a binary classifier's performance on each class separately. They come directly from the conditional probability framework.
Given a classifier with a binary outcome (Positive/Negative) and true labels, define:
- TP (True Positives), FP (False Positives), TN (True Negatives), FN (False Negatives)
| Metric | Formula | Interpretation | Also called |
|---|---|---|---|
| Sensitivity | $P(\hat{Y}=+ \mid Y=+) = \frac{TP}{TP+FN}$ | Of all actual positives, how many did we catch? | Recall, True Positive Rate (TPR) |
| Specificity | $P(\hat{Y}=- \mid Y=-) = \frac{TN}{TN+FP}$ | Of all actual negatives, how many did we correctly reject? | True Negative Rate (TNR) |
| PPV | $P(Y=+ \mid \hat{Y}=+) = \frac{TP}{TP+FP}$ | Of all predicted positives, how many are truly positive? | Precision |
| NPV | $P(Y=- \mid \hat{Y}=-) = \frac{TN}{TN+FN}$ | Of all predicted negatives, how many are truly negative? | Negative Predictive Value |
```python
import numpy as np

# Compute sensitivity, specificity from a confusion matrix
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1])

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

sensitivity = TP / (TP + FN)  # recall / TPR
specificity = TN / (TN + FP)  # TNR
ppv = TP / (TP + FP)          # precision
npv = TN / (TN + FN)          # negative predictive value

print(f"Confusion Matrix: TP={TP}, TN={TN}, FP={FP}, FN={FN}")
print(f"\nSensitivity (Recall) = {sensitivity:.4f} -- P(predict + | actually +)")
print(f"Specificity          = {specificity:.4f} -- P(predict - | actually -)")
print(f"Precision (PPV)      = {ppv:.4f} -- P(actually + | predict +)")
print(f"NPV                  = {npv:.4f} -- P(actually - | predict -)")
print(f"\nAccuracy = {(TP+TN)/(TP+TN+FP+FN):.4f}")
```
Decision Tree of Probability
A probability tree (decision tree diagram) is a visual method for tracing conditional branches and computing joint and total probabilities. Each path from root to leaf represents a sequence of outcomes.
Construction rules:
- Each branch shows one possible outcome with its probability
- Probabilities at each branch level sum to 1
- Path probability = product of branch probabilities (Multiplication Rule)
- Total probability for an outcome = sum of all paths leading to it (Addition Rule)
```mermaid
flowchart TD
    E["Email arrives\n(all emails)"] -->|"P(spam)=0.3"| S["Spam"]
    E -->|"P(ham)=0.7"| H["Ham"]
    S -->|"P(free|spam)=0.4"| SF["Spam + 'free'\nP = 0.3×0.4 = 0.12"]
    S -->|"P(no free|spam)=0.6"| SN["Spam, no 'free'\nP = 0.3×0.6 = 0.18"]
    H -->|"P(free|ham)=0.02"| HF["Ham + 'free'\nP = 0.7×0.02 = 0.014"]
    H -->|"P(no free|ham)=0.98"| HN["Ham, no 'free'\nP = 0.7×0.98 = 0.686"]
    style E fill:#132440,stroke:#132440,color:#fff
    style SF fill:#BF092F,stroke:#BF092F,color:#fff
    style SN fill:#BF092F,stroke:#BF092F,color:#fff
    style HF fill:#3B9797,stroke:#3B9797,color:#fff
    style HN fill:#3B9797,stroke:#3B9797,color:#fff
```
- $P(\text{email contains 'free'}) = 0.12 + 0.014 = 0.134$ (sum of all paths with 'free')
- $P(\text{spam} \mid \text{'free'}) = 0.12 / 0.134 \approx 0.896$ (Bayes' theorem from tree)
- All four leaf probabilities sum to: $0.12 + 0.18 + 0.014 + 0.686 = 1.0$ ✓
```python
# Probability tree computation
# Email classification: spam(0.3) / ham(0.7), contains 'free' / no 'free'
# Structure: {(class, word): path probability = product of branch probabilities}
tree = {
    ('spam', 'free'):    0.3 * 0.40,  # 0.12
    ('spam', 'no_free'): 0.3 * 0.60,  # 0.18
    ('ham', 'free'):     0.7 * 0.02,  # 0.014
    ('ham', 'no_free'):  0.7 * 0.98,  # 0.686
}

print("Leaf probabilities:")
for path, prob in tree.items():
    print(f"  {path[0]:6s} + {path[1]:8s}: {prob:.4f}")

# Total probability of "email contains 'free'"
p_free = sum(p for (cls, word), p in tree.items() if word == 'free')
print(f"\nP('free' in email) = {p_free:.4f}")

# Bayes: P(spam | 'free')
p_spam_and_free = tree[('spam', 'free')]
p_spam_given_free = p_spam_and_free / p_free
print(f"P(spam | 'free') = {p_spam_and_free:.4f} / {p_free:.4f} = {p_spam_given_free:.4f}")

# Verify: all paths sum to 1
assert abs(sum(tree.values()) - 1.0) < 1e-10
print(f"\nAll paths sum to: {sum(tree.values()):.4f} ✓")
```
Practice Exercises
Medical Diagnosis
A test for a disease is 99% accurate: $P(\text{positive} \mid \text{sick}) = 0.99$ and $P(\text{negative} \mid \text{healthy}) = 0.99$. The disease affects 0.1% of the population. If you test positive, what is the actual probability you have the disease?
Show Answer
$P(\text{sick} \mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} = \frac{0.00099}{0.00099 + 0.00999} = \frac{0.00099}{0.01098} \approx 9.0\%$
Despite a 99% accurate test, a positive result only means about a 9% chance of actually being sick! This is because the disease is rare (base rate of 0.1%). This "base rate neglect" is a key insight in medical testing and classifier evaluation: a high positive predictive value requires not only high sensitivity and specificity but also a non-negligible prior probability. In ML: always check the class imbalance (prior probabilities) before interpreting classifier outputs.
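The arithmetic can be checked in a few lines, using exactly the numbers stated in the exercise:

```python
# Verify the medical-test posterior numerically
p_sick = 0.001
p_pos_given_sick = 0.99
p_pos_given_healthy = 0.01  # 1 - specificity

# Law of total probability, then Bayes' theorem
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)
posterior = p_pos_given_sick * p_sick / p_pos
print(f"P(sick | positive) = {posterior:.4f}")  # about 0.090
```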
Dropout and Independence
A neural network layer has 100 neurons. Dropout with $p = 0.5$ independently masks each neuron. (1) What is the expected number of active neurons? (2) What is the variance? (3) Why does the independence assumption matter here?
Show Answer
(1) Let $X_i \sim \text{Bernoulli}(0.5)$ indicate if neuron $i$ is active. By linearity: $\mathbb{E}[\sum X_i] = 100 \times 0.5 = 50$ neurons.
(2) Since neurons are masked independently, $\text{Var}(\sum X_i) = \sum \text{Var}(X_i) = 100 \times 0.5 \times 0.5 = 25$. Standard deviation = 5 neurons.
(3) Independence ensures dropout provides diverse training: different subnetworks see different random neuron combinations each step, preventing co-adaptation where neurons rely on each other.
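The two Bernoulli results above can be verified by simulating dropout masks (the number of simulated steps and the seed are arbitrary choices):

```python
import numpy as np

# Simulate dropout masks over a 100-neuron layer and check E and Var
rng = np.random.default_rng(7)
masks = rng.binomial(1, 0.5, size=(100_000, 100))  # 1 = neuron kept
active_counts = masks.sum(axis=1)  # active neurons per simulated step

print(f"E[active]   = {active_counts.mean():.2f} (theory: 50)")
print(f"Var(active) = {active_counts.var():.2f} (theory: 25)")
```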
Conclusion & Next Steps
Probability theory is the language ML speaks. The key ideas from this part:
- Kolmogorov axioms — all probability theory follows from three simple rules
- Conditional probability $P(A|B) = P(A \cap B)/P(B)$ — restricting to what you know
- Bayes' theorem — updating beliefs with evidence; the basis of Bayesian ML
- Independence — simplifies computation; the "naive" assumption in Naive Bayes
- Random variables — PMF/PDF/CDF describe distributions over possible values
- Key distributions — Bernoulli, Binomial, Gaussian appear constantly in ML architectures and loss functions
- MLE — finding parameters by maximizing the likelihood of observed data
Next in the Series
In Part 5: Statistics, we move from single probability distributions to the challenge of drawing conclusions from data samples — hypothesis testing, the Central Limit Theorem, confidence intervals, and the statistical tools behind rigorous ML model evaluation.