Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series outline:
- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch (regression, Bayes, PCA)

Why Statistics in ML
Probability (Part 4) tells us about theoretical distributions. Statistics tells us what to conclude when we only observe a finite sample from an unknown distribution. This distinction matters enormously:
- Model comparison: Is 92.1% accuracy significantly better than 91.8%? Or is it noise?
- Feature selection: Is this feature statistically significantly correlated with the target?
- Dataset analysis: Are train and test distributions the same? (Kolmogorov-Smirnov test)
- A/B testing: Did the new model version genuinely improve the click-through rate?
- Data cleaning: Is this value an outlier (3 standard deviations from the mean)?
Data Fundamentals
Before computing statistics, understand how data is recorded and organised. Every ML pipeline begins with raw data; understanding its structure prevents subtle errors in preprocessing and feature engineering.
Raw Data, Arrays, and Datasets
Raw data is unprocessed observations as collected. An array (or data vector) is an ordered list of values for a single variable. A dataset is a rectangular table of $n$ observations (rows) × $p$ variables (columns) — the standard input to any ML model.
- Population: $N$ = total size; $\mu$ = population mean; $\sigma^2$ = population variance; $\sigma$ = population standard deviation
- Sample: $n$ = sample size; $\bar{x}$ = sample mean; $s^2$ = sample variance; $s$ = sample standard deviation
- Individual value: $x_i$ for the $i$-th observation (1-indexed)
- Sum notation: $\sum_{i=1}^{n} x_i$ = sum of all values in the sample
Frequency Tables and Tally Charts
A tally chart counts occurrences by placing a mark (|) for each value; every fifth mark is conventionally drawn as a diagonal strike through the previous four, forming a "five-bar gate". A frequency table organises these counts into a table with columns for value, tally, and frequency $f$.
| Score (x) | Tally | Frequency (f) | Relative Freq |
|---|---|---|---|
| 1 | \|\| | 2 | 0.10 |
| 2 | \|\|\|\| | 4 | 0.20 |
| 3 | \|\|\|\| \| | 6 | 0.30 |
| 4 | \|\|\|\| | 5 | 0.25 |
| 5 | \|\|\| | 3 | 0.15 |
| Total | | 20 | 1.00 |
For grouped / class intervals (e.g., 0–9, 10–19, …), each row represents a range. The midpoint of each class is used as the representative value $x_i$ when computing statistics from grouped data.
import numpy as np
from collections import Counter
# Build a frequency table from raw data
data = [1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 2, 2, 1]
n = len(data)
counts = Counter(data)
print(f"{'Value':>6} {'Freq':>5} {'Rel Freq':>9} {'Cum Freq':>9}")
print("-" * 35)
cumulative = 0
for val in sorted(counts):
    f = counts[val]
    cumulative += f
    print(f"{val:>6} {f:>5} {f/n:>9.3f} {cumulative/n:>9.3f}")
print(f"{'Total':>6} {n:>5} {'1.000':>9}")
Descriptive Statistics
Measures of Central Tendency
For a dataset $\{x_1, x_2, \ldots, x_n\}$:
- Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ — sensitive to outliers
- Median: the middle value when sorted — robust to outliers
- Mode: the most frequent value — useful for categorical data
Measures of Spread
How spread out are the values?
- Range: $\max(x) - \min(x)$ — affected by a single outlier
- Interquartile Range (IQR): $Q_3 - Q_1$ — robust spread measure
- Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$ — uses $n-1$ (Bessel's correction, see below)
- Standard Deviation: $s = \sqrt{s^2}$ — same units as the data
- Coefficient of Variation: $\text{CV} = s/\bar{x}$ — relative spread (unit-free)
import numpy as np
# Descriptive statistics from scratch (to understand the math)
data = np.array([2, 4, 4, 4, 5, 5, 7, 9, 10, 100]) # note: 100 is an outlier
n = len(data)
mean = np.mean(data)
median = np.median(data)
# Variance: two formulas
var_population = np.var(data, ddof=0) # divide by n
var_sample = np.var(data, ddof=1) # divide by (n-1): Bessel's correction
std_sample = np.std(data, ddof=1)
# IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(f"n = {n}")
print(f"Mean = {mean:.2f} (pulled by outlier 100)")
print(f"Median = {median:.2f} (robust to outlier)")
print(f"Var (pop) = {var_population:.2f} (divides by n)")
print(f"Var (sample)= {var_sample:.2f} (divides by n-1, Bessel)")
print(f"Std Dev = {std_sample:.2f}")
print(f"IQR = {iqr:.2f} (Q1={q1}, Q3={q3})")
Weighted Mean
When different observations carry different importance, the weighted mean assigns a weight $w_i$ to each value $x_i$: $\bar{x}_w = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$. Common ML uses:
- Class-weighted loss: minority class gets higher weight $w_i$ so the model doesn't ignore it
- Ensemble averaging: combine model predictions with weights proportional to their validation accuracy
- Exponential moving average (EMA) for training: recent batches weighted more heavily — used in Adam optimizer's momentum estimate (a short EMA sketch follows the code below)
import numpy as np
# Weighted mean examples
scores = np.array([72, 85, 91, 60, 88])
weights = np.array([0.10, 0.20, 0.30, 0.15, 0.25]) # sum to 1.0
# Manual weighted mean
weighted_mean = np.sum(weights * scores) / np.sum(weights)
# NumPy built-in
weighted_mean_np = np.average(scores, weights=weights)
print(f"Scores: {scores}")
print(f"Weights: {weights}")
print(f"Ordinary mean: {np.mean(scores):.2f}")
print(f"Weighted mean: {weighted_mean:.2f}")
print(f"NumPy average: {weighted_mean_np:.2f}")
Properties of the Mean
The mean has algebraic properties that are heavily used in ML math:
| Property | Formula | ML Use |
|---|---|---|
| Linearity — constant shift | $\overline{(x+c)} = \bar{x} + c$ | Feature centring: $x' = x - \bar{x}$ |
| Linearity — scaling | $\overline{(cx)} = c\bar{x}$ | Feature scaling: $x' = x / \sigma$ |
| Sum of means | $\overline{(x+y)} = \bar{x} + \bar{y}$ | Expected value of sum of random variables |
| Deviations sum to zero | $\sum_{i=1}^n (x_i - \bar{x}) = 0$ | Centred data has zero mean — prerequisite for PCA |
| Minimises squared deviations | $\arg\min_c \sum(x_i - c)^2 = \bar{x}$ | MSE loss minimiser is the mean — basis of OLS regression |
| Minimises absolute deviations | $\arg\min_c \sum|x_i - c| = \text{median}$ | MAE loss minimiser is the median — robust regression |
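A quick NumPy check of these properties on illustrative data (the grid search for the minimisers is a brute-force sketch, not how you would compute them in practice):
import numpy as np
rng = np.random.default_rng(1)
x = rng.normal(10, 2, size=1000)
c = 3.5
# Linearity: shifting and scaling pass through the mean
print(np.isclose((x + c).mean(), x.mean() + c))   # constant shift
print(np.isclose((c * x).mean(), c * x.mean()))   # scaling
# Deviations from the mean sum to zero (up to floating-point error)
print(np.isclose(np.sum(x - x.mean()), 0.0, atol=1e-8))
# Mean minimises squared deviations; median minimises absolute deviations
cands = np.linspace(x.min(), x.max(), 2001)
sse = [np.sum((x - c0) ** 2) for c0 in cands]
sae = [np.sum(np.abs(x - c0)) for c0 in cands]
print(f"argmin SSE ~ {cands[np.argmin(sse)]:.3f} vs mean   {x.mean():.3f}")
print(f"argmin SAE ~ {cands[np.argmin(sae)]:.3f} vs median {np.median(x):.3f}")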
Mean from a Frequency Distribution
When data is presented in a frequency table with values $x_i$ and frequencies $f_i$, the mean is $\bar{x} = \frac{\sum_i f_i x_i}{\sum_i f_i}$, where $n = \sum_i f_i$.
For grouped data (class intervals), use the midpoint of each class as $x_i$ in the same formula:
import numpy as np
# Mean from frequency distribution
scores = np.array([1, 2, 3, 4, 5])
frequency = np.array([2, 4, 6, 5, 3])
n = frequency.sum()
mean_fd = np.sum(scores * frequency) / n
print(f"n = {n}")
print(f"Mean from frequency distribution = {mean_fd:.4f}")
# Grouped data: class midpoints
class_midpoints = np.array([5.0, 15.0, 25.0, 35.0, 45.0])
class_freq = np.array([3, 8, 12, 7, 5])
n_grouped = class_freq.sum()
mean_grouped = np.sum(class_midpoints * class_freq) / n_grouped
print(f"\nGrouped data: n = {n_grouped}")
print(f"Mean (from midpoints) = {mean_grouped:.2f}")
Median from a Frequency Distribution
For ungrouped frequency data: compute the cumulative frequency and locate the $\frac{n+1}{2}$-th observation (odd $n$) or the average of the $\frac{n}{2}$-th and $(\frac{n}{2}+1)$-th observations (even $n$).
For grouped data, use the interpolation formula:
$$\text{Median} = L + \frac{\frac{n}{2} - F}{f_m} \cdot h$$
where $L$ = lower boundary of the median class, $F$ = cumulative frequency before the median class, $f_m$ = frequency of the median class, $h$ = class width.
Mode from a Frequency Distribution
For ungrouped data: the value with the highest frequency. For grouped data:
$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \cdot h$$
where $L$ = lower boundary of the modal class (the class with the highest frequency), $d_1 = f_m - f_{\text{prev}}$, $d_2 = f_m - f_{\text{next}}$, $h$ = class width.
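A minimal sketch applying both interpolation formulas, reusing the grouped distribution from the mean example above and treating the class limits as boundaries for simplicity:
import numpy as np
# Grouped data: classes starting at 0, 10, 20, 30, 40 with width 10 (illustrative)
lower_bounds = np.array([0, 10, 20, 30, 40], dtype=float)
freq = np.array([3, 8, 12, 7, 5])
h = 10.0
n = freq.sum()
cum = np.cumsum(freq)
# Median class: first class whose cumulative frequency reaches n/2
m = np.searchsorted(cum, n / 2)
F = cum[m - 1] if m > 0 else 0
median = lower_bounds[m] + (n / 2 - F) / freq[m] * h
# Modal class: class with the highest frequency
k = np.argmax(freq)
d1 = freq[k] - (freq[k - 1] if k > 0 else 0)
d2 = freq[k] - (freq[k + 1] if k < len(freq) - 1 else 0)
mode = lower_bounds[k] + d1 / (d1 + d2) * h
print(f"n = {n}: median class starts at {lower_bounds[m]}, modal class at {lower_bounds[k]}")
print(f"Grouped median = {median:.2f}")
print(f"Grouped mode   = {mode:.2f}")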
Mean Deviation (Mean Absolute Deviation)
The mean deviation measures average distance from the mean, using absolute values (not squares) to avoid cancellation:
$$\text{MD} = \frac{1}{n}\sum_{i=1}^n |x_i - \bar{x}|$$
Unlike variance, it uses the same units as the data and is less sensitive to outliers than standard deviation (because it doesn't square the deviations).
Mean Deviation from a Frequency Distribution
When data is given as a frequency table with values $x_i$ and frequencies $f_i$:
$$\text{MD} = \frac{\sum_i f_i\,|x_i - \bar{x}|}{\sum_i f_i}$$
import numpy as np
# Mean Deviation — raw data
data = np.array([4, 7, 13, 16, 21, 7, 4])
n = len(data)
mean = np.mean(data)
MD_raw = np.mean(np.abs(data - mean))
print(f"Data: {data}")
print(f"Mean: {mean:.4f}")
print(f"Mean Deviation (raw): {MD_raw:.4f}")
# Mean Deviation — frequency distribution
x_vals = np.array([1, 2, 3, 4, 5])
freqs = np.array([2, 4, 6, 5, 3])
n_fd = freqs.sum()
mean_fd = np.sum(x_vals * freqs) / n_fd
MD_fd = np.sum(freqs * np.abs(x_vals - mean_fd)) / n_fd
print(f"\nFrequency distribution:")
print(f"Mean: {mean_fd:.4f}")
print(f"Mean Deviation (freq): {MD_fd:.4f}")
# Compare: MD vs Std Dev — MD is less sensitive to outliers
data_outlier = np.append(data, 100)
print(f"\nWith outlier 100:")
print(f"Mean Deviation: {np.mean(np.abs(data_outlier - data_outlier.mean())):.4f}")
print(f"Std Deviation: {np.std(data_outlier, ddof=1):.4f}")
Population vs Sample Statistics
A population is the complete set (all possible data); a sample is the subset we observe. Statistical estimators are functions of the sample that estimate population parameters:
$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad \hat{\sigma}^2 = s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$$
Bessel's correction ($n-1$ instead of $n$) makes $s^2$ an unbiased estimator of $\sigma^2$: $\mathbb{E}[s^2] = \sigma^2$. Using $n$ would systematically underestimate population variance.
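A small simulation, assuming a standard normal population purely for illustration, showing why the $n-1$ divisor is needed:
import numpy as np
# Bessel's correction: E[s^2] = sigma^2 only when dividing by (n-1)
rng = np.random.default_rng(3)
sigma2_true = 1.0          # population variance of N(0, 1)
n = 5                      # small samples make the bias visible
trials = 100_000
samples = rng.normal(0, 1, size=(trials, n))
var_n  = samples.var(axis=1, ddof=0)   # divide by n   (biased)
var_n1 = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)
print(f"True sigma^2         = {sigma2_true:.4f}")
print(f"Mean of var (ddof=0) = {var_n.mean():.4f}  (~ (n-1)/n * sigma^2 = {(n-1)/n:.4f})")
print(f"Mean of var (ddof=1) = {var_n1.mean():.4f}  (unbiased)")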
Variable Types
Understanding variable types is prerequisite to choosing the right model, loss function, and evaluation metric. Variables differ in both their role in a study and their measurement scale.
By Role
| Type | Definition | ML Equivalent |
|---|---|---|
| Dependent (outcome) | What you measure / predict — the result | Target $y$ (label) |
| Independent (predictor) | What you manipulate / observe — the inputs | Feature $x$ (input) |
| Control | Held constant to isolate the effect of the independent variable | Fixed hyperparameter (e.g., fixed batch size) |
| Moderating | Changes the direction or strength of the relationship between X and Y | Interaction feature $x_1 \times x_2$ |
| Mediating | Explains the mechanism by which X influences Y (intermediate variable) | Latent representation in a neural network |
By Measurement Scale (Stevens's Scale)
| Scale | Properties | Examples | Valid Statistics |
|---|---|---|---|
| Nominal | Named categories, no order | Colour, country, model architecture | Mode, frequency, $\chi^2$ |
| Ordinal | Ordered categories, unequal gaps | Star ratings (1–5), severity (low/mid/high) | Median, percentile, Spearman $r_s$ |
| Interval | Equal gaps, no true zero | Temperature (°C), year, IQ score | Mean, std, Pearson $r$ |
| Ratio | Equal gaps, true zero exists | Height, weight, time, accuracy, loss | All statistics, geometric mean, CV |
Central Limit Theorem (CLT)
The CLT is arguably the most important theorem in statistics for ML: if $X_1, \ldots, X_n$ are i.i.d. draws from any distribution with mean $\mu$ and finite variance $\sigma^2$, then for large $n$ the sample mean $\bar{X}_n$ is approximately distributed as $\mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$, regardless of the shape of the underlying distribution.
The standard error of the mean is $\text{SE} = \sigma/\sqrt{n}$: variance of the sample mean decreases as $1/n$.
Why Mini-batch Gradient Estimates Work
In mini-batch SGD, instead of computing the exact gradient over all $N$ training examples, you estimate it using $B$ random samples. By the CLT, this estimate has mean equal to the true gradient and variance $\sigma^2_{\text{grad}} / B$. Larger batch size $B$ → smaller gradient variance → more stable but more computation per step. Smaller $B$ → noisier gradients → can escape local minima but converges more slowly. The CLT is precisely why this stochastic approximation works (see the gradient-noise simulation after the CLT demo below).
import numpy as np
# Demonstrate the Central Limit Theorem
rng = np.random.default_rng(42)
# Underlying distribution: highly skewed (exponential)
population_size = 100_000
population = rng.exponential(scale=2.0, size=population_size)
print(f"Population: mean={population.mean():.3f}, std={population.std():.3f}")
print(f"Exponential: highly right-skewed, NOT Gaussian\n")
# Sample means for different sample sizes
for n in [5, 30, 100, 500]:
    # Draw 10,000 samples of size n and compute their means
    sample_means = [rng.choice(population, n, replace=False).mean() for _ in range(10_000)]
    sample_means = np.array(sample_means)
    theoretical_se = population.std() / np.sqrt(n)
    print(f"n={n:4d}: mean of means={sample_means.mean():.3f}, "
          f"SE={sample_means.std():.4f} (theoretical: {theoretical_se:.4f})")
Hypothesis Testing
Hypothesis testing provides a formal framework for deciding whether observed data is compatible with a specific claim (the null hypothesis):
- $H_0$ (null hypothesis): the default claim, e.g., "both models have equal accuracy"
- $H_1$ (alternative): what we're trying to show, e.g., "Model B is better than Model A"
The p-value is the probability of observing data at least as extreme as what we saw, assuming $H_0$ is true. A small p-value (typically $p \lt 0.05$) provides evidence against $H_0$.
t-Test: Comparing Means
To compare the means of two groups, the two-sample t-statistic is:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
Under $H_0$ (equal means), $t$ follows a t-distribution. Large $|t|$ → small p-value → reject $H_0$.
Chi-Squared Test: Categorical Variables
The $\chi^2$ test checks whether observed categorical frequencies $O_i$ differ from the expected frequencies $E_i$:
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$
In ML: used for feature selection (is this categorical feature independent of the target?), testing whether predicted class distribution matches actual distribution.
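A short sketch of the $\chi^2$ test of independence using scipy's chi2_contingency, on an assumed 2×2 contingency table of feature category vs. target class (the counts are made up for illustration):
import numpy as np
from scipy import stats
# Contingency table: rows = feature category, columns = target class (illustrative counts)
observed = np.array([[90, 60],
                     [40, 110]])
chi2, p_val, dof, expected = stats.chi2_contingency(observed)
print(f"Chi-squared statistic = {chi2:.3f} (dof={dof})")
print(f"p-value               = {p_val:.4g}")
print(f"Expected counts under independence:\n{expected.round(1)}")
print(f"Feature and target independent at alpha=0.05? {p_val >= 0.05}")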
Type I and Type II Errors
A Type I error (false positive) is rejecting $H_0$ when it is actually true; its probability is the significance level $\alpha$. A Type II error (false negative) is failing to reject $H_0$ when it is actually false; its probability is $\beta$. Power $= 1 - \beta$ is the probability of correctly detecting a real effect. Increasing sample size $n$ increases power.
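A quick power-versus-sample-size simulation, assuming two Gaussian accuracy distributions whose true means differ by 0.02 (all numbers illustrative):
import numpy as np
from scipy import stats
# Power = fraction of repeated experiments in which we correctly reject H0
rng = np.random.default_rng(11)
alpha, true_diff, sd = 0.05, 0.02, 0.03
for n in [10, 30, 100, 300]:
    rejections = 0
    for _ in range(2000):
        a = rng.normal(0.85, sd, n)
        b = rng.normal(0.85 + true_diff, sd, n)
        _, p = stats.ttest_ind(a, b)
        rejections += (p < alpha)
    print(f"n={n:4d}: estimated power = {rejections / 2000:.3f}")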
import numpy as np
from scipy import stats
# t-test: compare accuracy of two models on 30 test runs
rng = np.random.default_rng(0)
# Model A: true mean accuracy 0.85
# Model B: true mean accuracy 0.88
model_a = rng.normal(0.85, 0.03, size=30)
model_b = rng.normal(0.88, 0.03, size=30)
t_stat, p_val = stats.ttest_ind(model_a, model_b)
print(f"Model A: mean = {model_a.mean():.4f} +/- {model_a.std():.4f}")
print(f"Model B: mean = {model_b.mean():.4f} +/- {model_b.std():.4f}")
print(f"\nt-statistic = {t_stat:.4f}")
print(f"p-value = {p_val:.4f}")
print(f"Significant at alpha=0.05: {p_val < 0.05}")
Confidence Intervals
A 95% confidence interval (CI) for the population mean, using the CLT:
$$\bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}} \;\approx\; \bar{x} \;\pm\; 1.96\,\frac{s}{\sqrt{n}} \quad \text{(large } n\text{)}$$
Interpretation: if we repeated the sampling procedure many times, 95% of such intervals would contain the true population mean. It does NOT mean "there's a 95% chance the true mean is in this interval."
import numpy as np
from scipy import stats
# Compute confidence interval for model accuracy
rng = np.random.default_rng(7)
# Simulate binary prediction outcomes (1=correct, 0=incorrect)
n_samples = 200
true_accuracy = 0.87
predictions_correct = rng.binomial(1, true_accuracy, size=n_samples)
sample_acc = predictions_correct.mean()
se = predictions_correct.std(ddof=1) / np.sqrt(n_samples)
# 95% CI using t-distribution (since n < infinity)
ci = stats.t.interval(0.95, df=n_samples-1, loc=sample_acc, scale=se)
print(f"Observed accuracy: {sample_acc:.4f}")
print(f"Standard error: {se:.4f}")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"CI width: {ci[1]-ci[0]:.4f}")
# How CI width changes with n
print("\nCI width vs sample size (true acc=0.87, std~0.336):")
for n in [50, 100, 200, 500, 1000]:
    w = 2 * 1.96 * 0.336 / np.sqrt(n)
    print(f"  n={n:5d}: width ≈ {w:.4f} (+/- {w/2:.4f})")
Correlation
Pearson correlation measures the linear relationship between $X$ and $Y$:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}$$
Spearman rank correlation uses ranks instead of values — more robust to outliers and non-linear monotonic relationships.
import numpy as np
from scipy import stats
# Pearson and Spearman correlation comparison
rng = np.random.default_rng(42)
# Case 1: Linear relationship
x = np.linspace(0, 10, 100)
y_linear = 2 * x + rng.normal(0, 1, 100)
pearson_lin, _ = stats.pearsonr(x, y_linear)
spearman_lin, _ = stats.spearmanr(x, y_linear)
print(f"Linear relationship:")
print(f" Pearson r = {pearson_lin:.4f}")
print(f" Spearman r = {spearman_lin:.4f}")
# Case 2: Non-linear monotone (Pearson underestimates, Spearman captures)
y_nonlinear = x**3 + rng.normal(0, 5, 100)
pearson_nl, _ = stats.pearsonr(x, y_nonlinear)
spearman_nl, _ = stats.spearmanr(x, y_nonlinear)
print(f"\nNon-linear monotone (x^3):")
print(f" Pearson r = {pearson_nl:.4f} (underestimates)")
print(f" Spearman r = {spearman_nl:.4f} (correctly high)")
# Case 3: No relationship
y_random = rng.normal(0, 1, 100)
pearson_rnd, _ = stats.pearsonr(x, y_random)
print(f"\nNo relationship:")
print(f" Pearson r = {pearson_rnd:.4f} (near zero)")
Spearman Rank Correlation — Explicit Formula
For $n$ pairs, compute rank differences $d_i = \text{rank}(x_i) - \text{rank}(y_i)$. Then:
$$r_s = 1 - \frac{6\sum_{i=1}^n d_i^2}{n(n^2 - 1)}$$
This shortcut formula is exact when there are no tied ranks. For tied ranks, use the standard Pearson formula applied to the ranks.
import numpy as np
from scipy import stats
# Spearman rank correlation — manual and scipy
x = np.array([10, 20, 30, 40, 50])
y = np.array([12, 35, 25, 48, 44])
# Manual: compute ranks, then differences
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)
d = rank_x - rank_y
n = len(x)
rs_manual = 1 - (6 * np.sum(d**2)) / (n * (n**2 - 1))
rs_scipy, _ = stats.spearmanr(x, y)
print(f"Ranks x: {rank_x}")
print(f"Ranks y: {rank_y}")
print(f"d = {d}")
print(f"d^2 = {d**2}, sum = {np.sum(d**2)}")
print(f"\nSpearman rs (manual formula) = {rs_manual:.4f}")
print(f"Spearman rs (scipy) = {rs_scipy:.4f}")
Skewness & Kurtosis
Correlation and mean alone don't describe the shape of a distribution. Skewness measures asymmetry; kurtosis measures tail heaviness — both are crucial for understanding model residuals and selecting appropriate distributions.
Skewness
$$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^3}{s^3}$$
| Value | Shape | Mean vs Median | ML Examples |
|---|---|---|---|
| $\gamma_1 = 0$ | Symmetric | Mean = Median | Gaussian noise, weight initializations |
| $\gamma_1 \gt 0$ | Right-skewed (long right tail) | Mean > Median | Income data, response times, loss values |
| $\gamma_1 \lt 0$ | Left-skewed (long left tail) | Mean < Median | Accuracy distributions near ceiling |
Kurtosis (Excess Kurtosis)
$$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^4}{s^4} - 3$$
The $-3$ normalises against the normal distribution, which has excess kurtosis 0 in this form (its raw kurtosis is 3).
| Excess Kurtosis | Name | Tail Behaviour |
|---|---|---|
| $\gamma_2 = 0$ | Mesokurtic (normal) | Normal tails |
| $\gamma_2 \gt 0$ | Leptokurtic | Heavier tails, sharper peak — more outliers |
| $\gamma_2 \lt 0$ | Platykurtic | Lighter tails, flatter peak — fewer outliers |
import numpy as np
from scipy import stats
rng = np.random.default_rng(0)
distributions = {
    'Normal': rng.normal(0, 1, 10000),
    'Exponential': rng.exponential(1, 10000),  # right-skewed
    'Uniform': rng.uniform(-2, 2, 10000),      # platykurtic
    'Laplace': rng.laplace(0, 1, 10000),       # leptokurtic
}
print(f"{'Distribution':15} {'Skewness':>10} {'Excess Kurt':>12}")
print("-" * 42)
for name, data in distributions.items():
    sk = stats.skew(data)
    kt = stats.kurtosis(data)  # scipy returns EXCESS kurtosis by default
    print(f"{name:15} {sk:>10.4f} {kt:>12.4f}")
Regression & Collinearity
Linear regression models the relationship between a target $y$ and features $\mathbf{x}$ as a linear combination:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon$$
In matrix form: $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $X$ is the $n \times (p+1)$ design matrix (first column of ones for the intercept).
Ordinary Least Squares (OLS)
OLS minimises the sum of squared residuals $\sum_i (y_i - \hat{y}_i)^2$. The analytical solution is:
$$\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$$
For simple linear regression ($p=1$, one feature):
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
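A minimal sketch of the closed-form OLS solution on synthetic data, checked against NumPy's least-squares solver (which is the numerically safer route in practice):
import numpy as np
# OLS via the normal equations, compared with np.linalg.lstsq
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0, 0.3, n)
X = np.column_stack([np.ones(n), x1, x2])          # design matrix with intercept column
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y        # (X^T X)^{-1} X^T y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None) # preferred numerically
print(f"Normal equations: {beta_hat.round(3)}")
print(f"np.linalg.lstsq : {beta_lstsq.round(3)}")
print(f"True beta       : [1.0, 2.0, -0.5]")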
Multicollinearity & VIF
Multicollinearity occurs when two or more features are highly correlated. This causes:
- Unstable $\hat{\boldsymbol{\beta}}$ estimates — small data changes flip signs of coefficients
- Inflated standard errors — coefficients are statistically insignificant even when the overall model is good
- Difficulty interpreting individual feature importance
The Variance Inflation Factor (VIF) quantifies multicollinearity for feature $j$:
$$\text{VIF}_j = \frac{1}{1 - R^2_j}$$
where $R^2_j$ is the $R^2$ from regressing feature $j$ on all the other features. Rule of thumb: VIF $\gt 5$ → investigate; VIF $\gt 10$ → serious multicollinearity.
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.default_rng(42)
n = 100
# Generate correlated features (x2 is nearly x1 + noise)
x1 = rng.normal(0, 1, n)
x2 = x1 * 0.95 + rng.normal(0, 0.2, n) # highly correlated with x1
x3 = rng.normal(0, 1, n) # independent
y = 2*x1 + 3*x3 + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2, x3])
# Compute VIF for each feature
def compute_vif(X):
    vifs = []
    for i in range(X.shape[1]):
        y_i = X[:, i]
        X_rest = np.delete(X, i, axis=1)
        reg = LinearRegression().fit(X_rest, y_i)
        r2 = reg.score(X_rest, y_i)
        vif = 1 / (1 - r2) if r2 < 1.0 else float('inf')
        vifs.append(vif)
    return vifs
vifs = compute_vif(X)
for i, v in enumerate(vifs):
    flag = " <-- HIGH multicollinearity!" if v > 5 else ""
    print(f"Feature x{i+1}: VIF = {v:.2f}{flag}")
Regression Error Metrics
After fitting a regression model, we need metrics to assess how well $\hat{y}$ approximates $y$. Different metrics penalise errors differently and suit different use cases.
| Metric | Formula | Units | Penalises Outliers |
|---|---|---|---|
| MAE (Mean Absolute Error) | $\frac{1}{n}\sum|y_i - \hat{y}_i|$ | Same as $y$ | Linearly |
| MSE (Mean Squared Error) | $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$ | $y^2$ | Quadratically (heavy) |
| RMSE (Root MSE) | $\sqrt{\text{MSE}}$ | Same as $y$ | Quadratically |
| $R^2$ (Coefficient of Determination) | $1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$ | Unitless ($\leq 1$) | — |
| Adjusted $R^2$ | $1 - (1-R^2)\frac{n-1}{n-p-1}$ | Unitless | — |
- $R^2 = 1$: perfect fit. $R^2 = 0$: model is no better than always predicting $\bar{y}$. $R^2 \lt 0$: model is worse than the mean; check for a serious bug or for severe overfitting when evaluating on held-out data.
- Adjusted $R^2$ penalises adding useless features ($p$ = number of predictors). Adding a random feature increases $R^2$ but typically decreases Adjusted $R^2$; see the sketch after the code below.
- RMSE vs MAE: If large errors are especially bad (e.g., predicting drug dosage), use RMSE. If outliers should not dominate, use MAE.
- $R^2$ is not a complete diagnostic: a model with $R^2 = 0.95$ but highly skewed residuals is still a bad model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
rng = np.random.default_rng(7)
n = 100
X = rng.normal(0, 1, (n, 3))
beta_true = np.array([2.0, -1.5, 0.8])
y = X @ beta_true + rng.normal(0, 1, n)
# Fit model
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)
# Compute metrics
mae = mean_absolute_error(y, y_hat)
mse = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_hat)
# Adjusted R^2
p = X.shape[1] # number of features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"MAE = {mae:.4f}")
print(f"MSE = {mse:.4f}")
print(f"RMSE = {rmse:.4f}")
print(f"R^2 = {r2:.4f}")
print(f"Adjusted R^2= {adj_r2:.4f}")
print(f"\nEstimated coefficients: {model.coef_.round(3)}")
print(f"True coefficients: {beta_true}")
Practice Exercises
Is the New Model Actually Better?
You test two recommendation models on 500 users each. Model A clicks: 120/500 (24%). Model B clicks: 140/500 (28%). Is Model B significantly better at $\alpha = 0.05$?
Answer:
This is a two-proportion z-test. Pooled proportion: $\hat{p} = (120+140)/(500+500) = 0.26$. Test statistic: $z = (0.28 - 0.24) / \sqrt{0.26 \times 0.74 \times (1/500 + 1/500)} = 0.04 / \sqrt{0.00077} \approx 0.04/0.0277 \approx 1.44$. The two-sided p-value $\approx 0.15$ ($\approx 0.075$ one-sided), which is greater than $0.05$. No: at this sample size the observed 4-point lift is not statistically significant; detecting a difference this small reliably would need more users per arm. And remember that even when a result is significant, statistical significance $\neq$ practical significance.
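A quick check of this calculation in code, as a manual two-proportion z-test using the counts from the exercise:
import numpy as np
from scipy import stats
# Two-proportion z-test for the A/B exercise above
clicks_a, n_a = 120, 500
clicks_b, n_b = 140, 500
p_a, p_b = clicks_a / n_a, clicks_b / n_b
p_pool = (clicks_a + clicks_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_two_sided = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}, pooled = {p_pool:.3f}")
print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
print(f"Significant at alpha=0.05? {p_two_sided < 0.05}")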
Conclusion & Next Steps
Statistics is the discipline of uncertainty quantification from data. Core takeaways:
- Data fundamentals: Frequency tables, tally charts, and class midpoints for grouped data — always know how your data is structured
- Descriptive stats: Mean (simple, weighted), median, mode — compute from raw data and frequency distributions; mean deviation is the absolute-value spread measure
- Properties of mean: Linearity, deviations sum to zero, mean minimises MSE — these underpin loss function math
- Variable types: Role (dependent/independent/control/moderating/mediating) and scale (nominal/ordinal/interval/ratio) determine what operations are valid
- Population vs sample: Bessel's correction makes variance estimators unbiased
- CLT: Sample means are approximately Gaussian for large $n$ — explains why mini-batch SGD works
- Hypothesis testing: p-values quantify evidence against null hypothesis — not probability of the null being true
- Skewness & Kurtosis: Measure distribution shape — right-skewed features need log-transforms; high kurtosis means heavy tails
- Correlation: Pearson (linear), Spearman with explicit rank formula $r_s = 1 - 6\sum d_i^2/(n(n^2-1))$ — visualize before trusting correlations
- Regression & VIF: OLS solution $(X^TX)^{-1}X^Ty$; VIF $\gt 5$ indicates multicollinearity — remove or regularise
- Error metrics: MAE (robust), MSE/RMSE (penalises outliers), $R^2$ (variance explained), Adjusted $R^2$ (penalises extra features)
Next in the Series
In Part 6: Information Theory, we use probability to quantify information — Shannon entropy, cross-entropy loss, KL divergence, and mutual information for feature selection. This is the math directly behind the loss functions you minimize every time you train a neural network.