
Part 5: Statistics

April 26, 2026 Wasil Zafar 45 min read

Statistics is how we reason about populations from samples. From deciding whether one model genuinely outperforms another to understanding why training on too little data is dangerous, statistics provides the rigorous foundation for data-driven decisions.

Table of Contents

  1. Why Statistics in ML
  2. Data Fundamentals
  3. Descriptive Statistics
  4. Population vs Sample
  5. Variable Types
  6. Central Limit Theorem
  7. Hypothesis Testing
  8. Confidence Intervals
  9. Skewness & Kurtosis
  10. Correlation
  11. Regression & Collinearity
  12. Regression Error Metrics
  13. Practice Exercises
  14. Conclusion & Next Steps

Why Statistics in ML

Probability (Part 4) tells us about theoretical distributions. Statistics tells us what to conclude when we only observe a finite sample from an unknown distribution. This distinction matters enormously:

Statistics underlies core ML questions:
  • Model comparison: Is 92.1% accuracy significantly better than 91.8%? Or is it noise?
  • Feature selection: Is this feature statistically significantly correlated with the target?
  • Dataset analysis: Are train and test distributions the same? (Kolmogorov-Smirnov test)
  • A/B testing: Did the new model version genuinely improve the click-through rate?
  • Data cleaning: Is this value an outlier (3 standard deviations from the mean)?

Data Fundamentals

Before computing statistics, understand how data is recorded and organised. Every ML pipeline begins with raw data; understanding its structure prevents subtle errors in preprocessing and feature engineering.

Raw Data, Arrays, and Datasets

Raw data is unprocessed observations as collected. An array (or data vector) is an ordered list of values for a single variable. A dataset is a rectangular table of $n$ observations (rows) × $p$ variables (columns) — the standard input to any ML model.

Notation conventions:
  • Population: $N$ = total size; $\mu$ = population mean; $\sigma^2$ = population variance; $\sigma$ = population standard deviation
  • Sample: $n$ = sample size; $\bar{x}$ = sample mean; $s^2$ = sample variance; $s$ = sample standard deviation
  • Individual value: $x_i$ for the $i$-th observation (1-indexed)
  • Sum notation: $\sum_{i=1}^{n} x_i$ = sum of all values in the sample

Frequency Tables and Tally Charts

A tally chart counts occurrences by placing a mark (|) for each value; every fifth mark is drawn across the previous four to form a "gate", so counts can be read off in groups of five. A frequency table organises these counts into a table with columns for value, tally, and frequency $f$.

| Score (x) | Tally | Frequency (f) | Relative Freq |
| --- | --- | --- | --- |
| 1 | \|\| | 2 | 0.10 |
| 2 | \|\|\|\| | 4 | 0.20 |
| 3 | \|\|\|\| \| | 6 | 0.30 |
| 4 | \|\|\|\| | 5 | 0.25 |
| 5 | \|\|\| | 3 | 0.15 |
| Total | | 20 | 1.00 |

For grouped / class intervals (e.g., 0–9, 10–19, …), each row represents a range. The midpoint of each class is used as the representative value $x_i$ when computing statistics from grouped data.

import numpy as np
from collections import Counter

# Build a frequency table from raw data
data = [1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 2, 2, 1]
n = len(data)

counts = Counter(data)
print(f"{'Value':>6}  {'Freq':>5}  {'Rel Freq':>9}  {'Cum Freq':>9}")
print("-" * 35)
cumulative = 0
for val in sorted(counts):
    f = counts[val]
    cumulative += f
    print(f"{val:>6}  {f:>5}  {f/n:>9.3f}  {cumulative/n:>9.3f}")
print(f"{'Total':>6}  {n:>5}  {'1.000':>9}")

Descriptive Statistics

Measures of Central Tendency

For a dataset $\{x_1, x_2, \ldots, x_n\}$:

  • Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ — sensitive to outliers
  • Median: the middle value when sorted — robust to outliers
  • Mode: the most frequent value — useful for categorical data

Mean vs Median in ML: Mean-based metrics (MSE, RMSE) are sensitive to outliers. If 1% of your labels carry severe label noise, MSE will punish your model heavily on those examples; Mean Absolute Error (MAE) is more robust, and the median absolute error is more robust still. When your loss metric and the data distribution are mismatched, model comparison can be misleading.

Measures of Spread

How spread out are the values?

  • Range: $\max(x) - \min(x)$ — affected by a single outlier
  • Interquartile Range (IQR): $Q_3 - Q_1$ — robust spread measure
  • Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ — uses $n-1$ (Bessel's correction, see below)
  • Standard Deviation: $s = \sqrt{s^2}$ — same units as the data
  • Coefficient of Variation: $\text{CV} = s/\bar{x}$ — relative spread (unit-free)
import numpy as np

# Descriptive statistics from scratch (to understand the math)
data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 10, 100])  # note: 100 is an outlier

n = len(data)
mean = np.mean(data)
median = np.median(data)

# Variance: two formulas
var_population = np.var(data, ddof=0)      # divide by n
var_sample    = np.var(data, ddof=1)       # divide by (n-1): Bessel's correction
std_sample    = np.std(data, ddof=1)

# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(f"n       = {n}")
print(f"Mean    = {mean:.2f}  (pulled by outlier 100)")
print(f"Median  = {median:.2f}  (robust to outlier)")
print(f"Var (pop)   = {var_population:.2f}  (divides by n)")
print(f"Var (sample)= {var_sample:.2f}  (divides by n-1, Bessel)")
print(f"Std Dev = {std_sample:.2f}")
print(f"IQR     = {iqr:.2f}  (Q1={q1}, Q3={q3})")

Weighted Mean

When different observations carry different importance, the weighted mean assigns a weight $w_i$ to each value $x_i$:

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
ML applications of weighted mean:
  • Class-weighted loss: minority class gets higher weight $w_i$ so the model doesn't ignore it
  • Ensemble averaging: combine model predictions with weights proportional to their validation accuracy
  • Exponential moving average (EMA) for training: recent batches weighted more heavily — used in Adam optimizer's momentum estimate
import numpy as np

# Weighted mean examples
scores =  np.array([72, 85, 91, 60, 88])
weights = np.array([0.10, 0.20, 0.30, 0.15, 0.25])  # sum to 1.0

# Manual weighted mean
weighted_mean = np.sum(weights * scores) / np.sum(weights)
# NumPy built-in
weighted_mean_np = np.average(scores, weights=weights)

print(f"Scores:       {scores}")
print(f"Weights:      {weights}")
print(f"Ordinary mean: {np.mean(scores):.2f}")
print(f"Weighted mean: {weighted_mean:.2f}")
print(f"NumPy average: {weighted_mean_np:.2f}")

Properties of the Mean

The mean has algebraic properties that are heavily used in ML math:

| Property | Formula | ML Use |
| --- | --- | --- |
| Linearity — constant shift | $\overline{(x+c)} = \bar{x} + c$ | Feature centring: $x' = x - \bar{x}$ |
| Linearity — scaling | $\overline{(cx)} = c\bar{x}$ | Feature scaling: $x' = x / \sigma$ |
| Sum of means | $\overline{(x+y)} = \bar{x} + \bar{y}$ | Expected value of a sum of random variables |
| Deviations sum to zero | $\sum_{i=1}^n (x_i - \bar{x}) = 0$ | Centred data has zero mean — prerequisite for PCA |
| Minimises squared deviations | $\arg\min_c \sum_i (x_i - c)^2 = \bar{x}$ | MSE loss minimiser is the mean — basis of OLS regression |
| Minimises absolute deviations | $\arg\min_c \sum_i \lvert x_i - c\rvert = \text{median}$ | MAE loss minimiser is the median — robust regression |
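The two minimiser rows in the table can be checked numerically by scanning candidate constants $c$ over a grid — a small sketch with made-up data:

```python
import numpy as np

# Check: sum of squared deviations is minimised at the mean,
# sum of absolute deviations is minimised at the median.
data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 100.0])  # skewed by an outlier

candidates = np.linspace(data.min(), data.max(), 100_001)
sq_loss = ((data[None, :] - candidates[:, None]) ** 2).sum(axis=1)
abs_loss = np.abs(data[None, :] - candidates[:, None]).sum(axis=1)

best_sq = candidates[np.argmin(sq_loss)]   # should land on the mean
best_abs = candidates[np.argmin(abs_loss)]  # should land on the median

print(f"Mean   = {data.mean():.3f}, argmin of squared loss  = {best_sq:.3f}")
print(f"Median = {np.median(data):.3f}, argmin of absolute loss = {best_abs:.3f}")
```

Note how the outlier 100 drags the squared-loss minimiser (the mean) far to the right, while the absolute-loss minimiser (the median) stays put.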

Mean from a Frequency Distribution

When data is presented in a frequency table with values $x_i$ and frequencies $f_i$:

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i x_i}{n}$$

For grouped data (class intervals), use the midpoint of each class as $x_i$:

$$x_i = \frac{\text{lower bound} + \text{upper bound}}{2}$$
import numpy as np

# Mean from frequency distribution
scores    = np.array([1,  2,  3,  4,  5])
frequency = np.array([2,  4,  6,  5,  3])

n = frequency.sum()
mean_fd = np.sum(scores * frequency) / n
print(f"n = {n}")
print(f"Mean from frequency distribution = {mean_fd:.4f}")

# Grouped data: class midpoints
class_midpoints = np.array([5.0, 15.0, 25.0, 35.0, 45.0])
class_freq      = np.array([3,    8,    12,    7,     5])

n_grouped = class_freq.sum()
mean_grouped = np.sum(class_midpoints * class_freq) / n_grouped
print(f"\nGrouped data: n = {n_grouped}")
print(f"Mean (from midpoints) = {mean_grouped:.2f}")

Median from a Frequency Distribution

For ungrouped frequency data: compute the cumulative frequency and locate the value at the $\frac{n+1}{2}$-th observation (odd $n$), or average the $\frac{n}{2}$-th and $\left(\frac{n}{2}+1\right)$-th observations (even $n$).

For grouped data, use the interpolation formula:

$$\text{Median} = L + \frac{\frac{n}{2} - F}{f_m} \times h$$

where $L$ = lower boundary of the median class, $F$ = cumulative frequency before the median class, $f_m$ = frequency of the median class, $h$ = class width.
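A minimal sketch of the interpolation formula, with made-up class frequencies; the classes are taken as $[0, 10), [10, 20), \ldots$ so the stated lower bounds are the class boundaries:

```python
import numpy as np

# Grouped-data median: Median = L + ((n/2 - F) / f_m) * h
lower_bounds = np.array([0, 10, 20, 30, 40])   # classes [0,10), [10,20), ...
freqs        = np.array([3,  8, 12,  7,  5])   # illustrative frequencies
h = 10                                          # class width

n = freqs.sum()                    # 35
cum = np.cumsum(freqs)             # cumulative frequencies [3, 11, 23, 30, 35]
m = np.searchsorted(cum, n / 2)    # first class whose cumulative freq >= n/2
L = lower_bounds[m]
F = cum[m - 1] if m > 0 else 0     # cumulative frequency BEFORE the median class
f_m = freqs[m]

median = L + (n / 2 - F) / f_m * h
print(f"Median class: [{L}, {L + h}), interpolated median = {median:.2f}")
```

Here $n/2 = 17.5$ falls in the third class (cumulative frequency jumps from 11 to 23), so the formula interpolates within $[20, 30)$.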

Mode from a Frequency Distribution

For ungrouped data: the value with the highest frequency. For grouped data:

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times h$$

where $L$ = lower boundary of the modal class (highest frequency), $d_1 = f_m - f_{\text{prev}}$, $d_2 = f_m - f_{\text{next}}$, $h$ = class width.

Mode in ML: In classification, the mode of training labels is the zero-rule baseline — always predicting the most frequent class. Any model must beat this trivial baseline to be useful. Heavily imbalanced datasets (mode frequency > 90%) require special techniques: SMOTE, class weights, or focal loss.
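The grouped-mode formula can be sketched the same way, again with made-up class frequencies:

```python
import numpy as np

# Grouped-data mode: Mode = L + d1 / (d1 + d2) * h
lower_bounds = np.array([0, 10, 20, 30, 40])   # classes [0,10), [10,20), ...
freqs        = np.array([3,  8, 12,  7,  5])   # illustrative frequencies
h = 10                                          # class width

m = np.argmax(freqs)                            # modal class = highest frequency
L = lower_bounds[m]
f_m    = freqs[m]
f_prev = freqs[m - 1] if m > 0 else 0
f_next = freqs[m + 1] if m < len(freqs) - 1 else 0

d1, d2 = f_m - f_prev, f_m - f_next
mode = L + d1 / (d1 + d2) * h
print(f"Modal class: [{L}, {L + h}), mode = {mode:.2f}")
```

The mode lands inside the modal class, pulled toward whichever neighbouring class has the higher frequency.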

Mean Deviation (Mean Absolute Deviation)

The mean deviation measures average distance from the mean, using absolute values (not squares) to avoid cancellation:

$$MD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$

Unlike variance, it uses the same units as the data and is less sensitive to outliers than standard deviation (because it doesn't square the deviations).

Mean Deviation from a Frequency Distribution

When data is given as a frequency table:

$$MD = \frac{\sum f_i |x_i - \bar{x}|}{\sum f_i}$$
import numpy as np

# Mean Deviation — raw data
data = np.array([4, 7, 13, 16, 21, 7, 4])
n = len(data)
mean = np.mean(data)
MD_raw = np.mean(np.abs(data - mean))
print(f"Data:        {data}")
print(f"Mean:        {mean:.4f}")
print(f"Mean Deviation (raw): {MD_raw:.4f}")

# Mean Deviation — frequency distribution
x_vals = np.array([1, 2, 3, 4, 5])
freqs  = np.array([2, 4, 6, 5, 3])
n_fd   = freqs.sum()
mean_fd = np.sum(x_vals * freqs) / n_fd
MD_fd   = np.sum(freqs * np.abs(x_vals - mean_fd)) / n_fd
print(f"\nFrequency distribution:")
print(f"Mean:        {mean_fd:.4f}")
print(f"Mean Deviation (freq): {MD_fd:.4f}")

# Compare: MD vs Std Dev — MD is less sensitive to outliers
data_outlier = np.append(data, 100)
print(f"\nWith outlier 100:")
print(f"Mean Deviation: {np.mean(np.abs(data_outlier - data_outlier.mean())):.4f}")
print(f"Std Deviation:  {np.std(data_outlier, ddof=1):.4f}")

Population vs Sample Statistics

A population is the complete set (all possible data); a sample is the subset we observe. Statistical estimators are functions of the sample that estimate population parameters:

$$\text{Population mean: } \mu = \mathbb{E}[X] \qquad \text{Sample mean: } \hat{\mu} = \bar{x} = \frac{1}{n}\sum x_i$$
$$\text{Population variance: } \sigma^2 = \frac{1}{N}\sum(x_i-\mu)^2 \qquad \text{Sample variance: } s^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2$$

Bessel's correction ($n-1$ instead of $n$) makes $s^2$ an unbiased estimator of $\sigma^2$: $\mathbb{E}[s^2] = \sigma^2$. Using $n$ would systematically underestimate population variance.
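A quick simulation (illustrative, with normal data of known variance) shows the bias directly:

```python
import numpy as np

# Dividing by (n-1) gives an unbiased variance estimate on average;
# dividing by n underestimates sigma^2 by a factor of (n-1)/n.
rng = np.random.default_rng(0)
sigma2_true = 4.0           # population variance (std = 2)
n, trials = 5, 100_000

samples = rng.normal(0, 2.0, size=(trials, n))
var_n  = samples.var(axis=1, ddof=0)   # divide by n (biased)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n-1 (Bessel's correction)

print(f"True sigma^2         = {sigma2_true}")
print(f"Mean of var (ddof=0) = {var_n.mean():.4f}  (biased low, factor (n-1)/n = {(n - 1) / n})")
print(f"Mean of var (ddof=1) = {var_n1.mean():.4f}  (approximately unbiased)")
```

With $n = 5$ the bias is large (a factor of $4/5$); it shrinks as $n$ grows, which is why the correction matters most for small samples.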

Variable Types

Understanding variable types is prerequisite to choosing the right model, loss function, and evaluation metric. Variables differ in both their role in a study and their measurement scale.

By Role

| Type | Definition | ML Equivalent |
| --- | --- | --- |
| Dependent (outcome) | What you measure / predict — the result | Target $y$ (label) |
| Independent (predictor) | What you manipulate / observe — the inputs | Feature $x$ (input) |
| Control | Held constant to isolate the effect of the independent variable | Fixed hyperparameter (e.g., fixed batch size) |
| Moderating | Changes the direction or strength of the relationship between X and Y | Interaction feature $x_1 \times x_2$ |
| Mediating | Explains the mechanism by which X influences Y (intermediate variable) | Latent representation in a neural network |

By Measurement Scale (Stevens's Scale)

| Scale | Properties | Examples | Valid Statistics |
| --- | --- | --- | --- |
| Nominal | Named categories, no order | Colour, country, model architecture | Mode, frequency, $\chi^2$ |
| Ordinal | Ordered categories, unequal gaps | Star ratings (1–5), severity (low/mid/high) | Median, percentile, Spearman $r_s$ |
| Interval | Equal gaps, no true zero | Temperature (°C), year, IQ score | Mean, std, Pearson $r$ |
| Ratio | Equal gaps, true zero exists | Height, weight, time, accuracy, loss | All statistics, geometric mean, CV |

Why scale matters in ML: Nominal variables must be one-hot encoded — computing their mean is meaningless. Ordinal variables can use label encoding only when order is meaningful (tree-based models). Interval/ratio variables support direct arithmetic. Misidentifying a nominal variable as numerical (e.g., using zip codes as integers) is a common preprocessing bug.
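A minimal NumPy sketch of the encoding distinction, with toy values and hypothetical column names:

```python
import numpy as np

# Toy data: one nominal column, one ordinal column (names are made up).
colour   = np.array(['red', 'blue', 'red', 'green'])   # nominal: no order
severity = np.array(['low', 'high', 'mid', 'low'])     # ordinal: low < mid < high

# Nominal -> one-hot: each category gets its own 0/1 column, no order implied
categories = np.unique(colour)                          # sorted: ['blue' 'green' 'red']
one_hot = (colour[:, None] == categories[None, :]).astype(int)
print("One-hot columns:", categories)
print(one_hot)

# Ordinal -> integer codes that PRESERVE the order low < mid < high
order = {'low': 0, 'mid': 1, 'high': 2}
severity_encoded = np.array([order[s] for s in severity])
print("Ordinal codes:", severity_encoded)
```

Doing the reverse (integer-coding the nominal `colour`) would invent a fake ordering that linear models and distance-based methods would happily exploit.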

Central Limit Theorem (CLT)

The CLT is arguably the most important theorem in statistics for ML:

$$\text{If } X_1, X_2, \ldots, X_n \overset{i.i.d.}{\sim} \text{any distribution with mean } \mu \text{ and finite variance } \sigma^2, \text{ then as } n \to \infty:$$
$$\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} \mathcal{N}(0, \sigma^2), \qquad \text{i.e., for large } n: \quad \bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i \approx \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n}\right)$$

The standard error of the mean is $\text{SE} = \sigma/\sqrt{n}$: variance of the sample mean decreases as $1/n$.

CLT in Practice: Mini-batch SGD
Why Mini-batch Gradient Estimates Work

In mini-batch SGD, instead of computing the exact gradient over all $N$ training examples, you estimate it using $B$ randomly sampled examples. This estimate has expectation equal to the true gradient, and by CLT-style averaging its variance scales as $\sigma^2_{\text{grad}} / B$. Larger batch size $B$ → smaller gradient variance → more stable updates but more computation per step. Smaller $B$ → noisier gradients → can escape poor local minima but converges more slowly. The CLT is precisely why this stochastic approximation works.

import numpy as np

# Demonstrate the Central Limit Theorem
rng = np.random.default_rng(42)

# Underlying distribution: highly skewed (exponential)
population_size = 100_000
population = rng.exponential(scale=2.0, size=population_size)
print(f"Population: mean={population.mean():.3f}, std={population.std():.3f}")
print(f"Exponential: highly right-skewed, NOT Gaussian\n")

# Sample means for different sample sizes
for n in [5, 30, 100, 500]:
    # Draw 10,000 samples of size n and compute their means
    sample_means = [rng.choice(population, n, replace=False).mean() for _ in range(10_000)]
    sample_means = np.array(sample_means)

    theoretical_se = population.std() / np.sqrt(n)
    print(f"n={n:4d}: mean of means={sample_means.mean():.3f}, "
          f"SE={sample_means.std():.4f} (theoretical: {theoretical_se:.4f})")

Hypothesis Testing

Hypothesis testing provides a formal framework for deciding whether observed data is compatible with a specific claim (the null hypothesis):

  • $H_0$ (null hypothesis): the default claim, e.g., "both models have equal accuracy"
  • $H_1$ (alternative): what we're trying to show, e.g., "Model B is better than Model A"

The p-value is the probability of observing data at least as extreme as what we saw, assuming $H_0$ is true. A small p-value (typically $p \lt 0.05$) provides evidence against $H_0$.

p-value misconception: "p = 0.03" does NOT mean "there's a 3% chance $H_0$ is true." It means "if $H_0$ were true, there's a 3% chance of observing data this extreme." The probability of $H_0$ being true requires a prior (Bayesian approach). This distinction is crucial for correctly interpreting A/B test results.

t-Test: Comparing Means

To compare means of two groups, the t-statistic is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Under $H_0$: equal means, $t$ follows a t-distribution. Large $|t|$ → small p-value → reject $H_0$.
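The statistic can be computed by hand and checked against SciPy. Since the formula above is the unpooled (Welch) form, `equal_var=False` is the matching option; the data here is illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(5.0, 1.0, 40)   # group 1
g2 = rng.normal(5.5, 1.2, 35)   # group 2: slightly higher mean

# Manual Welch t-statistic: difference of means over unpooled standard error
n1, n2 = len(g1), len(g2)
t_manual = (g1.mean() - g2.mean()) / np.sqrt(g1.var(ddof=1) / n1 + g2.var(ddof=1) / n2)

# SciPy equivalent (equal_var=False selects the Welch test)
t_scipy, p = stats.ttest_ind(g1, g2, equal_var=False)

print(f"t (manual) = {t_manual:.4f}")
print(f"t (scipy)  = {t_scipy:.4f}, p = {p:.4f}")
```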

Chi-Squared Test: Categorical Variables

The $\chi^2$ test checks whether observed categorical frequencies differ from expected:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

In ML: used for feature selection (is this categorical feature independent of the target?), testing whether predicted class distribution matches actual distribution.
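A short example with a toy 2×2 contingency table, checking `scipy.stats.chi2_contingency` against the formula above (continuity correction disabled so the two agree exactly):

```python
import numpy as np
from scipy import stats

# Is a binary feature independent of a binary target? Toy counts:
#                     target=0  target=1
observed = np.array([[30, 10],            # feature = A
                     [20, 40]])           # feature = B

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"Expected counts under independence:\n{expected}")
print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p:.6f}")

# Manual check of the formula: sum over cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(f"chi2 (manual) = {chi2_manual:.4f}")
```

A small p-value here means the feature and target are unlikely to be independent, which is exactly the signal $\chi^2$-based feature selection looks for.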

Type I and Type II Errors

$$\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}) = \text{Type I error (false positive)}$$
$$\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ false}) = \text{Type II error (false negative)}$$

Power $= 1 - \beta$ — the probability of correctly detecting a real effect. Increasing sample size $n$ increases power.
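Power can be estimated by simulation: repeatedly draw two samples with a real difference between them and count how often the t-test rejects. The effect size and trial counts below are illustrative:

```python
import numpy as np
from scipy import stats

# Estimated power of a two-sample t-test for a true difference
# of 0.3 standard deviations, at various per-group sample sizes.
rng = np.random.default_rng(3)
effect, alpha, trials = 0.3, 0.05, 2000

power = {}
for n in [20, 50, 100, 200]:
    rejections = 0
    for _ in range(trials):
        a = rng.normal(0.0, 1.0, n)       # group 1: mean 0
        b = rng.normal(effect, 1.0, n)    # group 2: mean 0.3 (a real effect)
        _, p = stats.ttest_ind(a, b)
        rejections += p < alpha
    power[n] = rejections / trials
    print(f"n={n:4d} per group: estimated power = {power[n]:.3f}")
```

The pattern to notice: for a fixed effect size, power climbs steadily with $n$; small studies routinely miss real effects.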

import numpy as np
from scipy import stats

# t-test: compare accuracy of two models on 30 test runs
rng = np.random.default_rng(0)
# Model A: true mean accuracy 0.85
# Model B: true mean accuracy 0.88
model_a = rng.normal(0.85, 0.03, size=30)
model_b = rng.normal(0.88, 0.03, size=30)

t_stat, p_val = stats.ttest_ind(model_a, model_b)
print(f"Model A: mean = {model_a.mean():.4f} +/- {model_a.std():.4f}")
print(f"Model B: mean = {model_b.mean():.4f} +/- {model_b.std():.4f}")
print(f"\nt-statistic = {t_stat:.4f}")
print(f"p-value     = {p_val:.4f}")
print(f"Significant at alpha=0.05: {p_val < 0.05}")

Confidence Intervals

A 95% confidence interval (CI) for the population mean, using the CLT:

$$\bar{x} \pm z_{0.025} \cdot \frac{s}{\sqrt{n}} = \bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{n}}$$

Interpretation: if we repeated the sampling procedure many times, 95% of such intervals would contain the true population mean. It does NOT mean "there's a 95% chance the true mean is in this interval."

Confidence intervals in ML: Always report uncertainty alongside performance metrics. "Accuracy = 93.2% ± 0.8% (95% CI)" is far more informative than "Accuracy = 93.2%". The width of the CI tells you how much to trust the point estimate — narrow CI (large n) means reliable, wide CI (small n) means more data needed.
import numpy as np
from scipy import stats

# Compute confidence interval for model accuracy
rng = np.random.default_rng(7)

# Simulate binary prediction outcomes (1=correct, 0=incorrect)
n_samples = 200
true_accuracy = 0.87
predictions_correct = rng.binomial(1, true_accuracy, size=n_samples)

sample_acc = predictions_correct.mean()
se = predictions_correct.std(ddof=1) / np.sqrt(n_samples)

# 95% CI using the t-distribution (slightly wider than the normal CI, since s is estimated)
ci = stats.t.interval(0.95, df=n_samples-1, loc=sample_acc, scale=se)

print(f"Observed accuracy: {sample_acc:.4f}")
print(f"Standard error:    {se:.4f}")
print(f"95% CI:            ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"CI width:          {ci[1]-ci[0]:.4f}")

# How CI width changes with n
print("\nCI width vs sample size (true acc=0.87, std~0.336):")
for n in [50, 100, 200, 500, 1000]:
    w = 2 * 1.96 * 0.336 / np.sqrt(n)
    print(f"  n={n:5d}: width ≈ {w:.4f} (+/- {w/2:.4f})")

Correlation

Pearson correlation measures linear relationship between $X$ and $Y$:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \cdot \sum_i (y_i-\bar{y})^2}} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$

Spearman rank correlation uses ranks instead of values — more robust to outliers and non-linear monotonic relationships.

Correlation ≠ Causation: a high $r$ never by itself establishes a causal link. Summary statistics also hide shape: Anscombe's Quartet is four datasets with identical means, variances, and Pearson correlation (0.816) that look completely different when plotted, so always visualize your data. Separately, high correlation between features can cause multicollinearity issues in regression: remove redundant correlated features or use PCA (Part 12).
import numpy as np
from scipy import stats

# Pearson and Spearman correlation comparison
rng = np.random.default_rng(42)

# Case 1: Linear relationship
x = np.linspace(0, 10, 100)
y_linear = 2 * x + rng.normal(0, 1, 100)

pearson_lin, _ = stats.pearsonr(x, y_linear)
spearman_lin, _ = stats.spearmanr(x, y_linear)
print(f"Linear relationship:")
print(f"  Pearson r  = {pearson_lin:.4f}")
print(f"  Spearman r = {spearman_lin:.4f}")

# Case 2: Non-linear monotone (Pearson underestimates, Spearman captures)
y_nonlinear = x**3 + rng.normal(0, 5, 100)

pearson_nl, _ = stats.pearsonr(x, y_nonlinear)
spearman_nl, _ = stats.spearmanr(x, y_nonlinear)
print(f"\nNon-linear monotone (x^3):")
print(f"  Pearson r  = {pearson_nl:.4f}  (underestimates)")
print(f"  Spearman r = {spearman_nl:.4f}  (correctly high)")

# Case 3: No relationship
y_random = rng.normal(0, 1, 100)
pearson_rnd, _ = stats.pearsonr(x, y_random)
print(f"\nNo relationship:")
print(f"  Pearson r  = {pearson_rnd:.4f}  (near zero)")

Spearman Rank Correlation — Explicit Formula

For $n$ pairs, compute rank differences $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. Then:

$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

This shortcut formula is exact when there are no tied ranks. For tied ranks, use the standard Pearson formula applied to the ranks.

import numpy as np
from scipy import stats

# Spearman rank correlation — manual and scipy
x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 35, 25, 48, 44])

# Manual: compute ranks, then differences
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
d = rank_x - rank_y
n = len(x)

rs_manual = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))
rs_scipy, _ = stats.spearmanr(x, y)

print(f"Ranks x: {rank_x}")
print(f"Ranks y: {rank_y}")
print(f"d  = {d}")
print(f"d^2 = {d**2}, sum = {np.sum(d**2)}")
print(f"\nSpearman rs (manual formula) = {rs_manual:.4f}")
print(f"Spearman rs (scipy)          = {rs_scipy:.4f}")

Skewness & Kurtosis

Correlation and mean alone don't describe the shape of a distribution. Skewness measures asymmetry; kurtosis measures tail heaviness — both are crucial for understanding model residuals and selecting appropriate distributions.

Skewness

$$\gamma_1 = \frac{\mu_3}{\sigma^3} = \frac{\mathbb{E}[(X-\mu)^3]}{(\mathbb{E}[(X-\mu)^2])^{3/2}}$$
| Value | Shape | Mean vs Median | ML Examples |
| --- | --- | --- | --- |
| $\gamma_1 = 0$ | Symmetric | Mean = Median | Gaussian noise, weight initializations |
| $\gamma_1 \gt 0$ | Right-skewed (long right tail) | Mean > Median | Income data, response times, loss values |
| $\gamma_1 \lt 0$ | Left-skewed (long left tail) | Mean < Median | Accuracy distributions near ceiling |

Kurtosis (Excess Kurtosis)

$$\gamma_2 = \frac{\mu_4}{\sigma^4} - 3 = \frac{\mathbb{E}[(X-\mu)^4]}{(\mathbb{E}[(X-\mu)^2])^2} - 3$$

The $-3$ normalises against the normal distribution (which has kurtosis 0 in this excess form). Raw kurtosis of normal = 3.

| Excess Kurtosis | Name | Tail Behaviour |
| --- | --- | --- |
| $\gamma_2 = 0$ | Mesokurtic (normal) | Normal tails |
| $\gamma_2 \gt 0$ | Leptokurtic | Heavier tails, sharper peak — more outliers |
| $\gamma_2 \lt 0$ | Platykurtic | Lighter tails, flatter peak — fewer outliers |

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

distributions = {
    'Normal':      rng.normal(0, 1, 10000),
    'Exponential': rng.exponential(1, 10000),      # right-skewed
    'Uniform':     rng.uniform(-2, 2, 10000),       # platykurtic
    'Laplace':     rng.laplace(0, 1, 10000),        # leptokurtic
}

print(f"{'Distribution':15}  {'Skewness':>10}  {'Excess Kurt':>12}")
print("-" * 42)
for name, data in distributions.items():
    sk = stats.skew(data)
    kt = stats.kurtosis(data)   # scipy returns EXCESS kurtosis by default
    print(f"{name:15}  {sk:>10.4f}  {kt:>12.4f}")
Skewness/Kurtosis in ML: Right-skewed features (log-normal income, pixel intensities) often benefit from log-transform before feeding to linear models. Residuals of a good regression should have near-zero skewness and near-zero excess kurtosis (approximately normal). High kurtosis in model errors signals the model is missing extreme events — common with financial time series.

Regression & Collinearity

Linear regression models the relationship between a target $y$ and features $\mathbf{x}$ as a linear combination:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$

In matrix form: $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ where $X$ is the $n \times (p+1)$ design matrix (first column of ones for intercept).

Ordinary Least Squares (OLS)

OLS minimises the sum of squared residuals $\sum_i (y_i - \hat{y}_i)^2$. The analytical solution is:

$$\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$$

For simple linear regression ($p=1$, one feature):

$$\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = \frac{\text{Cov}(x,y)}{s_x^2} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
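Both forms can be checked against each other on synthetic data. The matrix solve below uses `np.linalg.lstsq`, which solves the least-squares system without explicitly forming $(X^TX)^{-1}$ and is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.normal(0, 1, n)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, n)   # true intercept 1.5, slope 2.0

# Matrix form: design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Simple-regression closed form: slope = Cov(x, y) / Var(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Matrix OLS:  intercept = {beta_hat[0]:.4f}, slope = {beta_hat[1]:.4f}")
print(f"Closed form: intercept = {b0:.4f}, slope = {b1:.4f}")
```

The two answers agree to floating-point precision, and both recover the true coefficients up to sampling noise.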

Multicollinearity & VIF

Multicollinearity occurs when two or more features are highly correlated. This causes:

  • Unstable $\hat{\boldsymbol{\beta}}$ estimates — small data changes flip signs of coefficients
  • Inflated standard errors — coefficients are statistically insignificant even when the overall model is good
  • Difficulty interpreting individual feature importance

The Variance Inflation Factor (VIF) quantifies multicollinearity for feature $j$:

$$\text{VIF}_j = \frac{1}{1 - R^2_j}$$

where $R^2_j$ is the $R^2$ from regressing feature $j$ on all other features. Rule of thumb: VIF $\gt 5$ → investigate; VIF $\gt 10$ → serious multicollinearity.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100

# Generate correlated features (x2 is nearly x1 + noise)
x1 = rng.normal(0, 1, n)
x2 = x1 * 0.95 + rng.normal(0, 0.2, n)   # highly correlated with x1
x3 = rng.normal(0, 1, n)                   # independent
y  = 2*x1 + 3*x3 + rng.normal(0, 0.5, n)

X = np.column_stack([x1, x2, x3])

# Compute VIF for each feature
def compute_vif(X):
    vifs = []
    for i in range(X.shape[1]):
        y_i    = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        reg    = LinearRegression().fit(X_rest, y_i)
        r2     = reg.score(X_rest, y_i)
        vif    = 1 / (1 - r2) if r2 < 1.0 else float('inf')
        vifs.append(vif)
    return vifs

vifs = compute_vif(X)
for i, v in enumerate(vifs):
    flag = "  <-- HIGH multicollinearity!" if v > 5 else ""
    print(f"Feature x{i+1}: VIF = {v:.2f}{flag}")

Regression Error Metrics

After fitting a regression model, we need metrics to assess how well $\hat{y}$ approximates $y$. Different metrics penalise errors differently and suit different use cases.

| Metric | Formula | Units | Penalises Outliers |
| --- | --- | --- | --- |
| MAE (Mean Absolute Error) | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i\rvert$ | Same as $y$ | Linearly |
| MSE (Mean Squared Error) | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | $y^2$ | Quadratically (heavy) |
| RMSE (Root MSE) | $\sqrt{\text{MSE}}$ | Same as $y$ | Quadratically |
| $R^2$ (Coefficient of Determination) | $1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$ | Unitless, $\le 1$ (can be negative) | — |
| Adjusted $R^2$ | $1 - (1-R^2)\frac{n-1}{n-p-1}$ | Unitless | — |

Key insights:
  • $R^2 = 1$: perfect fit. $R^2 = 0$: model is no better than predicting $\bar{y}$ always. $R^2 \lt 0$: model is worse than the mean — check for data leakage or a serious bug.
  • Adjusted $R^2$ penalises adding useless features ($p$ = number of predictors). Adding a random feature increases $R^2$ but decreases Adjusted $R^2$.
  • RMSE vs MAE: If large errors are especially bad (e.g., predicting drug dosage), use RMSE. If outliers should not dominate, use MAE.
  • $R^2$ is not a complete diagnostic: a model with $R^2 = 0.95$ but highly skewed residuals is still a bad model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(7)
n = 100
X = rng.normal(0, 1, (n, 3))
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + rng.normal(0, 1, n)

# Fit model
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Compute metrics
mae  = mean_absolute_error(y, y_hat)
mse  = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)
r2   = r2_score(y, y_hat)

# Adjusted R^2
p = X.shape[1]  # number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MAE         = {mae:.4f}")
print(f"MSE         = {mse:.4f}")
print(f"RMSE        = {rmse:.4f}")
print(f"R^2         = {r2:.4f}")
print(f"Adjusted R^2= {adj_r2:.4f}")
print(f"\nEstimated coefficients: {model.coef_.round(3)}")
print(f"True coefficients:      {beta_true}")

Practice Exercises

Exercise 1: A/B Testing
Is the New Model Actually Better?

You test two recommendation models on 500 users each. Model A clicks: 120/500 (24%). Model B clicks: 140/500 (28%). Is Model B significantly better at $\alpha = 0.05$?

Answer

This is a two-proportion z-test. Pooled proportion: $\hat{p} = (120+140)/(500+500) = 0.26$. Standard error: $\sqrt{0.26 \times 0.74 \times (1/500 + 1/500)} = \sqrt{0.00077} \approx 0.0277$. Test statistic: $z = (0.28 - 0.24)/0.0277 \approx 1.44$. One-sided p-value $\approx 0.075$ (two-sided $\approx 0.15$), which is greater than $0.05$. No: despite the 4-point lift, Model B is not statistically significantly better at this sample size; reliably detecting an effect this small needs more users per arm. And remember the converse pitfall too: even when a p-value clears the threshold, statistical significance $\neq$ practical significance.
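A quick script to check the arithmetic, using `scipy.stats.norm` for the normal tail probability:

```python
import numpy as np
from scipy import stats

# Two-proportion z-test for the A/B exercise
clicks_a, n_a = 120, 500
clicks_b, n_b = 140, 500

p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)     # pooled proportion under H0

se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_one_sided = stats.norm.sf(z)                   # H1: Model B is better

print(f"p_pool = {p_pool:.3f}, SE = {se:.5f}")
print(f"z = {z:.3f}, one-sided p = {p_one_sided:.4f}")
```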

Conclusion & Next Steps

Statistics is the discipline of uncertainty quantification from data. Core takeaways:

  • Data fundamentals: Frequency tables, tally charts, and class midpoints for grouped data — always know how your data is structured
  • Descriptive stats: Mean (simple, weighted), median, mode — compute from raw data and frequency distributions; mean deviation is the absolute-value spread measure
  • Properties of mean: Linearity, deviations sum to zero, mean minimises MSE — these underpin loss function math
  • Variable types: Role (dependent/independent/control/moderating/mediating) and scale (nominal/ordinal/interval/ratio) determine what operations are valid
  • Population vs sample: Bessel's correction makes variance estimators unbiased
  • CLT: Sample means are approximately Gaussian for large $n$ — explains why mini-batch SGD works
  • Hypothesis testing: p-values quantify evidence against null hypothesis — not probability of the null being true
  • Skewness & Kurtosis: Measure distribution shape — right-skewed features need log-transforms; high kurtosis means heavy tails
  • Correlation: Pearson (linear), Spearman with explicit rank formula $r_s = 1 - 6\sum d_i^2/(n(n^2-1))$ — visualize before trusting correlations
  • Regression & VIF: OLS solution $(X^TX)^{-1}X^Ty$; VIF $\gt 5$ indicates multicollinearity — remove or regularise
  • Error metrics: MAE (robust), MSE/RMSE (penalises outliers), $R^2$ (variance explained), Adjusted $R^2$ (penalises extra features)

Next in the Series

In Part 6: Information Theory, we use probability to quantify information — Shannon entropy, cross-entropy loss, KL divergence, and mutual information for feature selection. This is the math directly behind the loss functions you minimize every time you train a neural network.