Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series outline:
- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch (regression, Bayes, PCA)

Why Statistics in ML
Probability (Part 4) tells us about theoretical distributions. Statistics tells us what to conclude when we only observe a finite sample from an unknown distribution. This distinction matters enormously:
- Model comparison: Is 92.1% accuracy significantly better than 91.8%? Or is it noise?
- Feature selection: Is this feature statistically significantly correlated with the target?
- Dataset analysis: Are train and test distributions the same? (Kolmogorov-Smirnov test)
- A/B testing: Did the new model version genuinely improve the click-through rate?
- Data cleaning: Is this value an outlier (3 standard deviations from the mean)?
Data Fundamentals
Before computing statistics, understand how data is recorded and organised. Every ML pipeline begins with raw data; understanding its structure prevents subtle errors in preprocessing and feature engineering.
Raw Data, Arrays, and Datasets
Raw data is unprocessed observations as collected. An array (or data vector) is an ordered list of values for a single variable. A dataset is a rectangular table of $n$ observations (rows) × $p$ variables (columns) — the standard input to any ML model.
- Population: $N$ = total size; $\mu$ = population mean; $\sigma^2$ = population variance; $\sigma$ = population standard deviation
- Sample: $n$ = sample size; $\bar{x}$ = sample mean; $s^2$ = sample variance; $s$ = sample standard deviation
- Individual value: $x_i$ for the $i$-th observation (1-indexed)
- Sum notation: $\sum_{i=1}^{n} x_i$ = sum of all values in the sample
Frequency Tables and Tally Charts
A tally chart counts occurrences by placing a mark (|) for each value; every fifth mark is conventionally drawn as a diagonal strike through the previous four, forming a "five-bar gate". A frequency table organises these counts into a table with columns for value, tally, and frequency $f$.
| Score (x) | Tally | Frequency (f) | Relative Freq |
|---|---|---|---|
| 1 | \|\| | 2 | 0.10 |
| 2 | \|\|\|\| | 4 | 0.20 |
| 3 | \|\|\|\| \| | 6 | 0.30 |
| 4 | \|\|\|\| | 5 | 0.25 |
| 5 | \|\|\| | 3 | 0.15 |
| Total | | 20 | 1.00 |
For grouped / class intervals (e.g., 0–9, 10–19, …), each row represents a range. The midpoint of each class is used as the representative value $x_i$ when computing statistics from grouped data.
import numpy as np
from collections import Counter
# Build a frequency table from raw data
data = [1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 2, 2, 1]
n = len(data)
counts = Counter(data)
print(f"{'Value':>6} {'Freq':>5} {'Rel Freq':>9} {'Cum Freq':>9}")
print("-" * 35)
cumulative = 0
for val in sorted(counts):
    f = counts[val]
    cumulative += f
    print(f"{val:>6} {f:>5} {f/n:>9.3f} {cumulative/n:>9.3f}")
print(f"{'Total':>6} {n:>5} {'1.000':>9}")
Descriptive Statistics
Measures of Central Tendency
For a dataset $\{x_1, x_2, \ldots, x_n\}$:
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ — sensitive to outliers
- Median: the middle value when sorted — robust to outliers
- Mode: the most frequent value — useful for categorical data
Measures of Spread
How spread out are the values?
- Range: $\max(x) - \min(x)$ — affected by a single outlier
- Interquartile Range (IQR): $Q_3 - Q_1$ — robust spread measure
- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ — uses $n-1$ (Bessel's correction, see below)
- Standard Deviation: $s = \sqrt{s^2}$ — same units as the data
- Coefficient of Variation: $\text{CV} = s/\bar{x}$ — relative spread (unit-free)
import numpy as np
# Descriptive statistics from scratch (to understand the math)
data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 10, 100]) # note: 100 is an outlier
n = len(data)
mean = np.mean(data)
median = np.median(data)
# Variance: two formulas
var_population = np.var(data, ddof=0) # divide by n
var_sample = np.var(data, ddof=1) # divide by (n-1): Bessel's correction
std_sample = np.std(data, ddof=1)
# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"n = {n}")
print(f"Mean = {mean:.2f} (pulled by outlier 100)")
print(f"Median = {median:.2f} (robust to outlier)")
print(f"Var (pop) = {var_population:.2f} (divides by n)")
print(f"Var (sample)= {var_sample:.2f} (divides by n-1, Bessel)")
print(f"Std Dev = {std_sample:.2f}")
print(f"IQR = {iqr:.2f} (Q1={q1}, Q3={q3})")
Weighted Mean
When different observations carry different importance, the weighted mean assigns a weight $w_i$ to each value $x_i$: $\bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$. Common ML uses:
- Class-weighted loss: minority class gets higher weight $w_i$ so the model doesn't ignore it
- Ensemble averaging: combine model predictions with weights proportional to their validation accuracy
- Exponential moving average (EMA) for training: recent batches weighted more heavily — used in Adam optimizer's momentum estimate (a short EMA sketch follows the code below)
import numpy as np
# Weighted mean examples
scores = np.array([72, 85, 91, 60, 88])
weights = np.array([0.10, 0.20, 0.30, 0.15, 0.25]) # sum to 1.0
# Manual weighted mean
weighted_mean = np.sum(weights * scores) / np.sum(weights)
# NumPy built-in
weighted_mean_np = np.average(scores, weights=weights)
print(f"Scores: {scores}")
print(f"Weights: {weights}")
print(f"Ordinary mean: {np.mean(scores):.2f}")
print(f"Weighted mean: {weighted_mean:.2f}")
print(f"NumPy average: {weighted_mean_np:.2f}")
Properties of the Mean
The mean has algebraic properties that are heavily used in ML math:
| Property | Formula | ML Use |
|---|---|---|
| Linearity — constant shift | $\overline{(x+c)} = \bar{x} + c$ | Feature centring: $x' = x - \bar{x}$ |
| Linearity — scaling | $\overline{(cx)} = c\bar{x}$ | Feature scaling: $x' = x / \sigma$ |
| Sum of means | $\overline{(x+y)} = \bar{x} + \bar{y}$ | Expected value of sum of random variables |
| Deviations sum to zero | $\sum_{i=1}^n (x_i - \bar{x}) = 0$ | Centred data has zero mean — prerequisite for PCA |
| Minimises squared deviations | $\arg\min_c \sum(x_i - c)^2 = \bar{x}$ | MSE loss minimiser is the mean — basis of OLS regression |
| Minimises absolute deviations | $\arg\min_c \sum|x_i - c| = \text{median}$ | MAE loss minimiser is the median — robust regression |
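A quick NumPy check of these properties on illustrative data (the grid search for the minimisers is a brute-force sketch, not how you would compute them in practice):
import numpy as np
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=1000)
c = 3.5
# Linearity: shifting and scaling pass through the mean
print(np.isclose((x + c).mean(), x.mean() + c))   # constant shift
print(np.isclose((c * x).mean(), c * x.mean()))   # scaling
# Deviations from the mean sum to zero (up to floating-point error)
print(np.isclose(np.sum(x - x.mean()), 0.0, atol=1e-8))
# Mean minimises squared deviations; median minimises absolute deviations
cands = np.linspace(x.min(), x.max(), 2001)
sse = [np.sum((x - c0) ** 2) for c0 in cands]
sae = [np.sum(np.abs(x - c0)) for c0 in cands]
print(f"argmin SSE ~ {cands[np.argmin(sse)]:.3f} vs mean   {x.mean():.3f}")
print(f"argmin SAE ~ {cands[np.argmin(sae)]:.3f} vs median {np.median(x):.3f}")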
Mean from a Frequency Distribution
When data is presented in a frequency table with values $x_i$ and frequencies $f_i$, the mean is $\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}$, where $n = \sum_i f_i$.
For grouped data (class intervals), use the midpoint of each class as $x_i$ in the same formula:
import numpy as np
# Mean from frequency distribution
scores = np.array([1, 2, 3, 4, 5])
frequency = np.array([2, 4, 6, 5, 3])
n = frequency.sum()
mean_fd = np.sum(scores * frequency) / n
print(f"n = {n}")
print(f"Mean from frequency distribution = {mean_fd:.4f}")
# Grouped data: class midpoints
class_midpoints = np.array([5.0, 15.0, 25.0, 35.0, 45.0])
class_freq = np.array([3, 8, 12, 7, 5])
n_grouped = class_freq.sum()
mean_grouped = np.sum(class_midpoints * class_freq) / n_grouped
print(f"\nGrouped data: n = {n_grouped}")
print(f"Mean (from midpoints) = {mean_grouped:.2f}")
Median from a Frequency Distribution
For ungrouped frequency data: compute the cumulative frequency and locate the $\frac{n+1}{2}$-th observation (odd $n$) or the average of the $\frac{n}{2}$-th and $(\frac{n}{2}+1)$-th observations (even $n$).
For grouped data, use the interpolation formula:
$$\text{Median} = L + \frac{\frac{n}{2} - F}{f_m} \cdot h$$
where $L$ = lower boundary of the median class, $F$ = cumulative frequency before the median class, $f_m$ = frequency of the median class, $h$ = class width.
Mode from a Frequency Distribution
For ungrouped data: the value with the highest frequency. For grouped data:
$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \cdot h$$
where $L$ = lower boundary of the modal class (the class with the highest frequency), $d_1 = f_m - f_{\text{prev}}$, $d_2 = f_m - f_{\text{next}}$, $h$ = class width.
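A minimal sketch applying both interpolation formulas, reusing the grouped distribution from the mean example above and treating the class limits as boundaries for simplicity:
import numpy as np
# Grouped data: classes starting at 0, 10, 20, 30, 40 with width 10 (illustrative)
lower_bounds = np.array([0, 10, 20, 30, 40], dtype=float)
freq = np.array([3, 8, 12, 7, 5])
h = 10.0
n = freq.sum()
cum = np.cumsum(freq)
# Median class: first class whose cumulative frequency reaches n/2
m = np.searchsorted(cum, n / 2)
F = cum[m - 1] if m > 0 else 0
median = lower_bounds[m] + (n / 2 - F) / freq[m] * h
# Modal class: class with the highest frequency
k = np.argmax(freq)
d1 = freq[k] - (freq[k - 1] if k > 0 else 0)
d2 = freq[k] - (freq[k + 1] if k < len(freq) - 1 else 0)
mode = lower_bounds[k] + d1 / (d1 + d2) * h
print(f"n = {n}: median class starts at {lower_bounds[m]}, modal class at {lower_bounds[k]}")
print(f"Grouped median = {median:.2f}")
print(f"Grouped mode   = {mode:.2f}")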
Mean Deviation (Mean Absolute Deviation)
The mean deviation measures average distance from the mean, using absolute values (not squares) to avoid cancellation:
$$\text{MD} = \frac{1}{n}\sum_{i=1}^n |x_i - \bar{x}|$$
Unlike variance, it uses the same units as the data and is less sensitive to outliers than standard deviation (because it doesn't square the deviations).
Mean Deviation from a Frequency Distribution
When data is given as a frequency table with values $x_i$ and frequencies $f_i$:
$$\text{MD} = \frac{\sum_i f_i\,|x_i - \bar{x}|}{\sum_i f_i}$$
import numpy as np
# Mean Deviation — raw data
data = np.array([4, 7, 13, 16, 21, 7, 4])
n = len(data)
mean = np.mean(data)
MD_raw = np.mean(np.abs(data - mean))
print(f"Data: {data}")
print(f"Mean: {mean:.4f}")
print(f"Mean Deviation (raw): {MD_raw:.4f}")
# Mean Deviation — frequency distribution
x_vals = np.array([1, 2, 3, 4, 5])
freqs = np.array([2, 4, 6, 5, 3])
n_fd = freqs.sum()
mean_fd = np.sum(x_vals * freqs) / n_fd
MD_fd = np.sum(freqs * np.abs(x_vals - mean_fd)) / n_fd
print(f"\nFrequency distribution:")
print(f"Mean: {mean_fd:.4f}")
print(f"Mean Deviation (freq): {MD_fd:.4f}")
# Compare: MD vs Std Dev — MD is less sensitive to outliers
data_outlier = np.append(data, 100)
print(f"\nWith outlier 100:")
print(f"Mean Deviation: {np.mean(np.abs(data_outlier - data_outlier.mean())):.4f}")
print(f"Std Deviation: {np.std(data_outlier, ddof=1):.4f}")
Population vs Sample Statistics
A population is the complete set (all possible data); a sample is the subset we observe. Statistical estimators are functions of the sample that estimate population parameters:
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$
Bessel's correction ($n-1$ instead of $n$) makes $s^2$ an unbiased estimator of $\sigma^2$: $\mathbb{E}[s^2] = \sigma^2$. Using $n$ would systematically underestimate population variance.
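A small simulation, assuming a standard normal population purely for illustration, showing why the $n-1$ divisor is needed:
import numpy as np
# Bessel's correction: E[s^2] = sigma^2 only when dividing by (n-1)
rng = np.random.default_rng(3)
sigma2_true = 1.0          # population variance of N(0, 1)
n = 5                      # small samples make the bias visible
trials = 100_000
samples = rng.normal(0, 1, size=(trials, n))
var_n  = samples.var(axis=1, ddof=0)   # divide by n   (biased)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)
print(f"True sigma^2         = {sigma2_true:.4f}")
print(f"Mean of var (ddof=0) = {var_n.mean():.4f}  (~ (n-1)/n * sigma^2 = {(n-1)/n:.4f})")
print(f"Mean of var (ddof=1) = {var_n1.mean():.4f}  (unbiased)")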
Variable Types
Understanding variable types is prerequisite to choosing the right model, loss function, and evaluation metric. Variables differ in both their role in a study and their measurement scale.
By Role
| Type | Definition | ML Equivalent |
|---|---|---|
| Dependent (outcome) | What you measure / predict — the result | Target $y$ (label) |
| Independent (predictor) | What you manipulate / observe — the inputs | Feature $x$ (input) |
| Control | Held constant to isolate the effect of the independent variable | Fixed hyperparameter (e.g., fixed batch size) |
| Moderating | Changes the direction or strength of the relationship between X and Y | Interaction feature $x_1 \times x_2$ |
| Mediating | Explains the mechanism by which X influences Y (intermediate variable) | Latent representation in a neural network |
By Measurement Scale (Stevens's Scale)
| Scale | Properties | Examples | Valid Statistics |
|---|---|---|---|
| Nominal | Named categories, no order | Colour, country, model architecture | Mode, frequency, $\chi^2$ |
| Ordinal | Ordered categories, unequal gaps | Star ratings (1–5), severity (low/mid/high) | Median, percentile, Spearman $r_s$ |
| Interval | Equal gaps, no true zero | Temperature (°C), year, IQ score | Mean, std, Pearson $r$ |
| Ratio | Equal gaps, true zero exists | Height, weight, time, accuracy, loss | All statistics, geometric mean, CV |
Central Limit Theorem (CLT)
The CLT is arguably the most important theorem in statistics for ML: if $X_1, \ldots, X_n$ are i.i.d. draws from any distribution with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the sample mean $\bar{X}_n$ is approximately distributed as $\mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$, regardless of the shape of the underlying distribution.
The standard error of the mean is $\text{SE} = \sigma/\sqrt{n}$: variance of the sample mean decreases as $1/n$.
Why Mini-batch Gradient Estimates Work
In mini-batch SGD, instead of computing the exact gradient over all $N$ training examples, you estimate it using $B$ random samples. By the CLT, this estimate has mean equal to the true gradient and variance $\sigma^2_{\text{grad}} / B$. Larger batch size $B$ → smaller gradient variance → more stable but more computation per step. Smaller $B$ → noisier gradients → can escape local minima but converges more slowly. The CLT is precisely why this stochastic approximation works (see the gradient-noise simulation after the CLT demo below).
import numpy as np
# Demonstrate the Central Limit Theorem
rng = np.random.default_rng(42)
# Underlying distribution: highly skewed (exponential)
population_size = 100_000
population = rng.exponential(scale=2.0, size=population_size)
print(f"Population: mean={population.mean():.3f}, std={population.std():.3f}")
print(f"Exponential: highly right-skewed, NOT Gaussian\n")
# Sample means for different sample sizes
for n in [5, 30, 100, 500]:
    # Draw 10,000 samples of size n and compute their means
    sample_means = [rng.choice(population, n, replace=False).mean() for _ in range(10_000)]
    sample_means = np.array(sample_means)
    theoretical_se = population.std() / np.sqrt(n)
    print(f"n={n:4d}: mean of means={sample_means.mean():.3f}, "
          f"SE={sample_means.std():.4f} (theoretical: {theoretical_se:.4f})")
Hypothesis Testing
Hypothesis testing provides a formal framework for deciding whether observed data is compatible with a specific claim (the null hypothesis):
- $H_0$ (null hypothesis): the default claim, e.g., "both models have equal accuracy"
- $H_1$ (alternative): what we're trying to show, e.g., "Model B is better than Model A"
The p-value is the probability of observing data at least as extreme as what we saw, assuming $H_0$ is true. A small p-value (typically $p \lt 0.05$) provides evidence against $H_0$.
t-Test: Comparing Means
To compare the means of two groups, the two-sample t-statistic is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Under $H_0$ (equal means), $t$ follows a t-distribution. Large $|t|$ → small p-value → reject $H_0$.
Chi-Squared Test: Categorical Variables
The $\chi^2$ test checks whether observed categorical frequencies $O_i$ differ from the expected frequencies $E_i$:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
In ML: used for feature selection (is this categorical feature independent of the target?), testing whether predicted class distribution matches actual distribution.
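A short sketch of the $\chi^2$ test of independence using scipy's chi2_contingency, on an assumed 2×2 contingency table of feature category vs. target class (the counts are made up for illustration):
import numpy as np
from scipy import stats
# Contingency table: rows = feature category, columns = target class (illustrative counts)
observed = np.array([[90, 60],
                     [40, 110]])
chi2, p_val, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-squared statistic = {chi2:.3f} (dof={dof})")
print(f"p-value               = {p_val:.4g}")
print(f"Expected counts under independence:\n{expected.round(1)}")
print(f"Feature and target independent at alpha=0.05? {p_val >= 0.05}")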
Type I and Type II Errors
A Type I error (false positive) is rejecting $H_0$ when it is actually true; its probability is the significance level $\alpha$. A Type II error (false negative) is failing to reject $H_0$ when it is actually false; its probability is $\beta$. Power $= 1 - \beta$ is the probability of correctly detecting a real effect. Increasing sample size $n$ increases power.
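A quick power-versus-sample-size simulation, assuming two Gaussian accuracy distributions whose true means differ by 0.02 (all numbers illustrative):
import numpy as np
from scipy import stats
# Power = fraction of repeated experiments in which we correctly reject H0
rng = np.random.default_rng(11)
alpha, true_diff, sd = 0.05, 0.02, 0.03
for n in [10, 30, 100, 300]:
    rejections = 0
    for _ in range(2000):
        a = rng.normal(0.85, sd, n)
        b = rng.normal(0.85 + true_diff, sd, n)
        _, p = stats.ttest_ind(a, b)
        rejections += (p < alpha)
    print(f"n={n:4d}: estimated power = {rejections / 2000:.3f}")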
import numpy as np
from scipy import stats
# t-test: compare accuracy of two models on 30 test runs
rng = np.random.default_rng(0)
# Model A: true mean accuracy 0.85
# Model B: true mean accuracy 0.88
model_a = rng.normal(0.85, 0.03, size=30)
model_b = rng.normal(0.88, 0.03, size=30)
t_stat, p_val = stats.ttest_ind(model_a, model_b)
print(f"Model A: mean = {model_a.mean():.4f} +/- {model_a.std():.4f}")
print(f"Model B: mean = {model_b.mean():.4f} +/- {model_b.std():.4f}")
print(f"\nt-statistic = {t_stat:.4f}")
print(f"p-value = {p_val:.4f}")
print(f"Significant at alpha=0.05: {p_val < 0.05}")
Confidence Intervals
A 95% confidence interval (CI) for the population mean, using the CLT:
$$\bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \;\approx\; \bar{x} \;\pm\; 1.96\,\frac{s}{\sqrt{n}} \quad \text{(large } n\text{)}$$
Interpretation: if we repeated the sampling procedure many times, 95% of such intervals would contain the true population mean. It does NOT mean "there's a 95% chance the true mean is in this interval."
import numpy as np
from scipy import stats
# Compute confidence interval for model accuracy
rng = np.random.default_rng(7)
# Simulate binary prediction outcomes (1=correct, 0=incorrect)
n_samples = 200
true_accuracy = 0.87
predictions_correct = rng.binomial(1, true_accuracy, size=n_samples)
sample_acc = predictions_correct.mean()
se = predictions_correct.std(ddof=1) / np.sqrt(n_samples)
# 95% CI using t-distribution (since n < infinity)
ci = stats.t.interval(0.95, df=n_samples-1, loc=sample_acc, scale=se)
print(f"Observed accuracy: {sample_acc:.4f}")
print(f"Standard error: {se:.4f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"CI width: {ci[1]-ci[0]:.4f}")
# How CI width changes with n
print("\nCI width vs sample size (true acc=0.87, std~0.336):")
for n in [50, 100, 200, 500, 1000]:
    w = 2 * 1.96 * 0.336 / np.sqrt(n)
    print(f"  n={n:5d}: width ≈ {w:.4f} (+/- {w/2:.4f})")
Correlation
Pearson correlation measures the linear relationship between $X$ and $Y$:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
Spearman rank correlation uses ranks instead of values — more robust to outliers and non-linear monotonic relationships.
import numpy as np
from scipy import stats
# Pearson and Spearman correlation comparison
rng = np.random.default_rng(42)
# Case 1: Linear relationship
x = np.linspace(0, 10, 100)
y_linear = 2 * x + rng.normal(0, 1, 100)
pearson_lin, _ = stats.pearsonr(x, y_linear)
spearman_lin, _ = stats.spearmanr(x, y_linear)
print(f"Linear relationship:")
print(f" Pearson r = {pearson_lin:.4f}")
print(f" Spearman r = {spearman_lin:.4f}")
# Case 2: Non-linear monotone (Pearson underestimates, Spearman captures)
y_nonlinear = x**3 + rng.normal(0, 5, 100)
pearson_nl, _ = stats.pearsonr(x, y_nonlinear)
spearman_nl, _ = stats.spearmanr(x, y_nonlinear)
print(f"\nNon-linear monotone (x^3):")
print(f" Pearson r = {pearson_nl:.4f} (underestimates)")
print(f" Spearman r = {spearman_nl:.4f} (correctly high)")
# Case 3: No relationship
y_random = rng.normal(0, 1, 100)
pearson_rnd, _ = stats.pearsonr(x, y_random)
print(f"\nNo relationship:")
print(f" Pearson r = {pearson_rnd:.4f} (near zero)")
Spearman Rank Correlation — Explicit Formula
For $n$ pairs, compute rank differences $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. Then:
$$r_s = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2 - 1)}$$
This shortcut formula is exact when there are no tied ranks. For tied ranks, use the standard Pearson formula applied to the ranks.
import numpy as np
from scipy import stats
# Spearman rank correlation — manual and scipy
x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 35, 25, 48, 44])
# Manual: compute ranks, then differences
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
d = rank_x - rank_y
n = len(x)
rs_manual = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))
rs_scipy, _ = stats.spearmanr(x, y)
print(f"Ranks x: {rank_x}")
print(f"Ranks y: {rank_y}")
print(f"d = {d}")
print(f"d^2 = {d**2}, sum = {np.sum(d**2)}")
print(f"\nSpearman rs (manual formula) = {rs_manual:.4f}")
print(f"Spearman rs (scipy) = {rs_scipy:.4f}")
Skewness & Kurtosis
Correlation and mean alone don't describe the shape of a distribution. Skewness measures asymmetry; kurtosis measures tail heaviness — both are crucial for understanding model residuals and selecting appropriate distributions.
Skewness
$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{s^3}$$
| Value | Shape | Mean vs Median | ML Examples |
|---|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Mean = Median | Gaussian noise, weight initializations |
| $\gamma_1 \gt 0$ | Right-skewed (long right tail) | Mean > Median | Income data, response times, loss values |
| $\gamma_1 \lt 0$ | Left-skewed (long left tail) | Mean < Median | Accuracy distributions near ceiling |
Kurtosis (Excess Kurtosis)
$$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^4}{s^4} - 3$$
The $-3$ normalises against the normal distribution, which has excess kurtosis 0 in this form (its raw kurtosis is 3).
| Excess Kurtosis | Name | Tail Behaviour |
|---|---|---|
| $\gamma_2 = 0$ | Mesokurtic (normal) | Normal tails |
| $\gamma_2 \gt 0$ | Leptokurtic | Heavier tails, sharper peak — more outliers |
| $\gamma_2 \lt 0$ | Platykurtic | Lighter tails, flatter peak — fewer outliers |
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
distributions = {
    'Normal': rng.normal(0, 1, 10000),
    'Exponential': rng.exponential(1, 10000),  # right-skewed
    'Uniform': rng.uniform(-2, 2, 10000),      # platykurtic
    'Laplace': rng.laplace(0, 1, 10000),       # leptokurtic
}
print(f"{'Distribution':15} {'Skewness':>10} {'Excess Kurt':>12}")
print("-" * 42)
for name, data in distributions.items():
    sk = stats.skew(data)
    kt = stats.kurtosis(data)  # scipy returns EXCESS kurtosis by default
    print(f"{name:15} {sk:>10.4f} {kt:>12.4f}")
Regression & Collinearity
Linear regression models the relationship between a target $y$ and features $\mathbf{x}$ as a linear combination:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
In matrix form: $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $X$ is the $n \times (p+1)$ design matrix (first column of ones for the intercept).
Ordinary Least Squares (OLS)
OLS minimises the sum of squared residuals $\sum_i (y_i - \hat{y}_i)^2$. The analytical solution is:
$$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$$
For simple linear regression ($p=1$, one feature):
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
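A minimal sketch of the closed-form OLS solution on synthetic data, checked against NumPy's least-squares solver (which is the numerically safer route in practice):
import numpy as np
# OLS via the normal equations, compared with np.linalg.lstsq
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x1, x2])          # design matrix with intercept column
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # preferred numerically
print(f"Normal equations: {beta_hat.round(3)}")
print(f"np.linalg.lstsq : {beta_lstsq.round(3)}")
print(f"True beta       : [1.0, 2.0, -0.5]")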
Multicollinearity & VIF
Multicollinearity occurs when two or more features are highly correlated. This causes:
- Unstable $\hat{\boldsymbol{\beta}}$ estimates — small data changes flip signs of coefficients
- Inflated standard errors — coefficients are statistically insignificant even when the overall model is good
- Difficulty interpreting individual feature importance
The Variance Inflation Factor (VIF) quantifies multicollinearity for feature $j$:
$$\text{VIF}_j = \frac{1}{1 - R^2_j}$$
where $R^2_j$ is the $R^2$ from regressing feature $j$ on all the other features. Rule of thumb: VIF $\gt 5$ → investigate; VIF $\gt 10$ → serious multicollinearity.
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(42)
n = 100
# Generate correlated features (x2 is nearly x1 + noise)
x1 = rng.normal(0, 1, n)
x2 = x1 * 0.95 + rng.normal(0, 0.2, n) # highly correlated with x1
x3 = rng.normal(0, 1, n) # independent
y = 2*x1 + 3*x3 + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2, x3])
# Compute VIF for each feature
def compute_vif(X):
    vifs = []
    for i in range(X.shape[1]):
        y_i = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        reg = LinearRegression().fit(X_rest, y_i)
        r2 = reg.score(X_rest, y_i)
        vif = 1 / (1 - r2) if r2 < 1.0 else float('inf')
        vifs.append(vif)
    return vifs
vifs = compute_vif(X)
for i, v in enumerate(vifs):
    flag = " <-- HIGH multicollinearity!" if v > 5 else ""
    print(f"Feature x{i+1}: VIF = {v:.2f}{flag}")
Regression Error Metrics
After fitting a regression model, we need metrics to assess how well $\hat{y}$ approximates $y$. Different metrics penalise errors differently and suit different use cases.
| Metric | Formula | Units | Penalises Outliers |
|---|---|---|---|
| MAE (Mean Absolute Error) | $\frac{1}{n}\sum|y_i - \hat{y}_i|$ | Same as $y$ | Linearly |
| MSE (Mean Squared Error) | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | $y^2$ | Quadratically (heavy) |
| RMSE (Root MSE) | $\sqrt{\text{MSE}}$ | Same as $y$ | Quadratically |
| $R^2$ (Coefficient of Determination) | $1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$ | Unitless ($\leq 1$) | — |
| Adjusted $R^2$ | $1 - (1-R^2)\frac{n-1}{n-p-1}$ | Unitless | — |
- $R^2 = 1$: perfect fit. $R^2 = 0$: model is no better than always predicting $\bar{y}$. $R^2 \lt 0$: model is worse than the mean; check for a serious bug or for severe overfitting when evaluating on held-out data.
- Adjusted $R^2$ penalises adding useless features ($p$ = number of predictors). Adding a random feature increases $R^2$ but typically decreases Adjusted $R^2$; see the sketch after the code below.
- RMSE vs MAE: If large errors are especially bad (e.g., predicting drug dosage), use RMSE. If outliers should not dominate, use MAE.
- $R^2$ is not a complete diagnostic: a model with $R^2 = 0.95$ but highly skewed residuals is still a bad model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
rng = np.random.default_rng(7)
n = 100
X = rng.normal(0, 1, (n, 3))
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + rng.normal(0, 1, n)
# Fit model
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
# Compute metrics
mae = mean_absolute_error(y, y_hat)
mse = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_hat)
# Adjusted R^2
p = X.shape[1] # number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"MAE = {mae:.4f}")
print(f"MSE = {mse:.4f}")
print(f"RMSE = {rmse:.4f}")
print(f"R^2 = {r2:.4f}")
print(f"Adjusted R^2= {adj_r2:.4f}")
print(f"\nEstimated coefficients: {model.coef_.round(3)}")
print(f"True coefficients: {beta_true}")
Practice Exercises
Is the New Model Actually Better?
You test two recommendation models on 500 users each. Model A clicks: 120/500 (24%). Model B clicks: 140/500 (28%). Is Model B significantly better at $\alpha = 0.05$?
Answer:
This is a two-proportion z-test. Pooled proportion: $\hat{p} = (120+140)/(500+500) = 0.26$. Test statistic: $z = (0.28 - 0.24) / \sqrt{0.26 \times 0.74 \times (1/500 + 1/500)} = 0.04 / \sqrt{0.00077} \approx 0.04/0.0277 \approx 1.44$. The two-sided p-value $\approx 0.15$ ($\approx 0.075$ one-sided), which is greater than $0.05$. No: at this sample size the observed 4-point lift is not statistically significant; detecting a difference this small reliably would need more users per arm. And remember that even when a result is significant, statistical significance $\neq$ practical significance.
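A quick check of this calculation in code, as a manual two-proportion z-test using the counts from the exercise:
import numpy as np
from scipy import stats
# Two-proportion z-test for the A/B exercise above
clicks_a, n_a = 120, 500
clicks_b, n_b = 140, 500
p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_two_sided = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}, pooled = {p_pool:.3f}")
print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
print(f"Significant at alpha=0.05? {p_two_sided < 0.05}")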
Conclusion & Next Steps
Statistics is the discipline of uncertainty quantification from data. Core takeaways:
- Data fundamentals: Frequency tables, tally charts, and class midpoints for grouped data — always know how your data is structured
- Descriptive stats: Mean (simple, weighted), median, mode — compute from raw data and frequency distributions; mean deviation is the absolute-value spread measure
- Properties of mean: Linearity, deviations sum to zero, mean minimises MSE — these underpin loss function math
- Variable types: Role (dependent/independent/control/moderating/mediating) and scale (nominal/ordinal/interval/ratio) determine what operations are valid
- Population vs sample: Bessel's correction makes variance estimators unbiased
- CLT: Sample means are approximately Gaussian for large $n$ — explains why mini-batch SGD works
- Hypothesis testing: p-values quantify evidence against null hypothesis — not probability of the null being true
- Skewness & Kurtosis: Measure distribution shape — right-skewed features need log-transforms; high kurtosis means heavy tails
- Correlation: Pearson (linear), Spearman with explicit rank formula $r_s = 1 - 6\sum d_i^2/(n(n^2-1))$ — visualize before trusting correlations
- Regression & VIF: OLS solution $(X^TX)^{-1}X^Ty$; VIF $\gt 5$ indicates multicollinearity — remove or regularise
- Error metrics: MAE (robust), MSE/RMSE (penalises outliers), $R^2$ (variance explained), Adjusted $R^2$ (penalises extra features)
Next in the Series
In Part 6: Information Theory, we use probability to quantify information — Shannon entropy, cross-entropy loss, KL divergence, and mutual information for feature selection. This is the math directly behind the loss functions you minimize every time you train a neural network.