Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math with Python: NumPy, Pandas & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch: regression, Bayes, PCA
Why Mathematical Thinking?
Imagine two engineers. Both copy a neural network from a tutorial. It runs. Then the model starts predicting garbage. Engineer A opens Stack Overflow and tweaks random hyperparameters hoping something improves. Engineer B reads the loss curve, recalls that the sigmoid activation is causing vanishing gradients for deep networks, and swaps in ReLU. The model converges in minutes.
The difference isn't raw intelligence. It's mathematical literacy.
- Level 1 — Black Box User: "I run the library and get predictions." No understanding of what happens inside.
- Level 2 — Configurator: "I know which hyperparameters to tune and why." Some intuition, limited by library docs.
- Level 3 — Builder: "I can derive the algorithm from first principles and modify it." Full mathematical fluency.
This bootcamp takes you from Level 1 to Level 3 — systematically. But before any formula, we need to talk about how mathematicians think. The mindset comes before the mechanics.
How Math Powers ML
Every concept in ML is underpinned by a branch of mathematics:
flowchart LR
A[Your Data] --> B[Linear Algebra\nVectors & Matrices]
A --> C[Probability\nUncertainty & Distributions]
B --> D[Model\nFunction Composition]
C --> D
D --> E[Calculus\nGradient Descent]
E --> F[Trained Model\nPredictions]
G[Statistics\nEvaluation & Inference] --> F
H[Information Theory\nLoss Functions] --> E
None of these areas exist in isolation — they connect and reinforce each other. This series follows a deliberate order: each phase builds the vocabulary needed for the next. We start here, at Phase 0, with the prerequisite that makes everything else learnable: mathematical thinking itself.
The Mathematical Mindset
0.1.1 — Abstraction & Modeling
Abstraction is the act of stripping away irrelevant details to reveal underlying structure. It's not dumbing things down — it's identifying what actually matters for a given question.
The mathematical model that emerges is:
$$y = f(x) + \varepsilon$$
where $\varepsilon$ captures the noise we've chosen to ignore. That $\varepsilon$ is not laziness — it's an intentional modelling decision. All of machine learning is about building and refining such abstractions.
Spam Classification as Abstraction
An email has a subject, body, sender, timestamp, HTML formatting, attached images, and metadata. A spam filter abstracts this to a vector of word frequencies — a bag of words. The model then learns: emails containing "FREE!!!", "CLICK HERE", and "WINNER" are spam. By abstracting text into numbers, we made an unstructured problem solvable with algebra.
The abstraction discards meaning, grammar, and context. That's a deliberate tradeoff: we lose nuance but gain tractability. More sophisticated models (like Transformers) preserve more context — they use a richer abstraction, not a different principle.
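To make this concrete, here is a minimal sketch of the bag-of-words abstraction. The five-word vocabulary and both emails are made up purely for illustration; real spam filters use far larger vocabularies and learned weights.
import numpy as np
# Minimal sketch of the bag-of-words abstraction.
# The vocabulary and emails are invented for illustration only.
vocab = ["free", "click", "winner", "meeting", "report"]
emails = [
    "FREE prize! CLICK here, you are a WINNER",
    "Please review the report before the meeting",
]
def bag_of_words(text, vocab):
    # lowercase, strip simple punctuation, count vocabulary words
    words = text.lower().replace("!", " ").replace(",", " ").split()
    return np.array([words.count(w) for w in vocab])
for email in emails:
    print(bag_of_words(email, vocab), "<-", email)
# [1 1 1 0 0] for the spammy email, [0 0 0 1 1] for the legitimate one:
# the unstructured text is now a vector we can do algebra on.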
Key insight: When you see a mathematical formula, ask yourself: what real-world phenomenon is this abstracting, and what is it intentionally ignoring? That question will unlock 90% of what a formula is trying to say.
0.1.2 — Notation Fluency
Mathematical notation is a compressed language. Like any language, it feels intimidating until you build vocabulary. The secret is that most ML papers use a small, consistent set of symbols repeatedly.
| Symbol | Name | Meaning in ML | Example |
|---|---|---|---|
| $\sum_{i=1}^{n}$ | Summation | Sum over $n$ items | $\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$ |
| $\prod_{i=1}^{n}$ | Product | Multiply $n$ items | $\prod_{i=1}^{n} p_i = p_1 \cdot p_2 \cdots p_n$ |
| $\forall$ | For all | Applies to every element | $\forall x \in \mathbb{R}$: for every real number $x$ |
| $\exists$ | There exists | At least one element satisfies | $\exists\, w$ such that $f(w) = 0$ |
| $\in$ | Element of | Belongs to a set | $x_i \in \mathbb{R}^d$: $x_i$ is a $d$-dimensional vector |
| $\mathbb{R}^n$ | Real $n$-space | Feature vector space | $\mathbf{x} \in \mathbb{R}^{784}$ (28×28 image flattened) |
| $\hat{y}$ | "y hat" | Predicted value | $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ |
| $\|\cdot\|$ | Norm | Size/length of vector | $\|\mathbf{w}\|^2 = \sum_i w_i^2$ |
| $\nabla$ | Nabla / Gradient | Direction of steepest ascent | $\nabla_\theta \mathcal{L}$: gradient of loss w.r.t. $\theta$ |
| $\mathbb{E}[\cdot]$ | Expectation | Average over distribution | $\mathbb{E}[X] = \sum_x x \cdot P(X=x)$ |
Greek letters you'll see constantly: $\alpha$ (learning rate), $\beta$ (regression coefficients), $\theta$ (model parameters), $\mu$ (mean), $\sigma$ (standard deviation), $\lambda$ (regularisation strength), $\varepsilon$ (small noise / error), $\eta$ (step size).
import numpy as np
# Notation in code: the Sigma (summation) symbol
# Mathematical: sum_{i=1}^{n} x_i
x = np.array([3, 7, 2, 9, 1])
total = np.sum(x) # Σ x_i
print("Sum:", total) # 22
# The Product symbol: prod_{i=1}^{n} x_i
product = np.prod(x)
print("Product:", product) # 378
# Norm ||w||^2 = sum_i w_i^2
w = np.array([0.3, -0.7, 0.1])
norm_squared = np.sum(w ** 2)
print("||w||²:", round(norm_squared, 4)) # 0.59
0.1.3 — Proof vs Intuition
Pure mathematics demands proof: a rigorous logical argument showing a statement is always true. Machine learning is more pragmatic — you'll use both proof and intuition, knowing when each is appropriate.
Example — Why does gradient descent converge?
- Intuition: Imagine a ball rolling down a hilly surface. At every step, it moves in the direction of steepest descent. Eventually it reaches a valley (minimum). The ball "knows" where to go locally without needing to see the whole landscape.
- Proof: For a convex function $f$ with Lipschitz-continuous gradient, with step size $\alpha \leq \frac{1}{L}$ (where $L$ is the Lipschitz constant), gradient descent converges at rate $O(1/k)$ after $k$ iterations.
The proof tells you when it works and how fast. The intuition tells you why it makes sense. Non-convex neural network loss landscapes complicate the proof — but the intuition still guides practice.
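Here is the intuition in code: a minimal sketch of gradient descent on the convex quadratic $f(w) = (w-3)^2$. The step size 0.1 is an illustrative choice, not a tuned value; watch the loss shrink toward the minimum at $w = 3$.
# The 'ball rolling downhill' intuition in code: gradient descent on the
# convex quadratic loss(w) = (w - 3)^2, whose minimum sits at w = 3.
def loss(w):
    return (w - 3) ** 2
def grad(w):
    return 2 * (w - 3)
w = 0.0          # arbitrary starting point
alpha = 0.1      # step size (learning rate), chosen for illustration
for k in range(1, 31):
    w = w - alpha * grad(w)          # step opposite the gradient
    if k % 10 == 0:
        print(f"iteration {k:2d}: w = {w:.5f}, loss = {loss(w):.8f}")
# w approaches 3 and the loss approaches 0, matching the intuition above.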
For this bootcamp: We'll develop both. Some sections are proof-heavy (set theory, probability axioms), others are intuition-led (gradient descent, neural networks). You'll learn to recognise which is appropriate for each context.
0.1.4 — Approximation vs Exactness
Here's a fact that surprises many beginners: virtually nothing in machine learning is exact. And that's completely intentional.
Why Computers Can't Be Exact
The number $\frac{1}{3} = 0.33333...$ has infinitely many decimal places. Computers store numbers in 32-bit or 64-bit floating point format, so any number whose binary expansion doesn't terminate (including $\frac{1}{3}$, $0.1$, and irrationals like $\pi$, $e$, $\sqrt{2}$) gets truncated. This creates floating point error. In most ML applications, this error is negligible — but in certain operations (like subtracting two nearly-equal numbers), it can amplify catastrophically.
import numpy as np
# Floating point arithmetic — not always exact
a = 0.1 + 0.2
print(a) # 0.30000000000000004 (not 0.3!)
print(a == 0.3) # False
# Use np.isclose() for safe comparisons
print(np.isclose(a, 0.3)) # True
# Catastrophic cancellation — subtracting nearby numbers
x = 1.0000001
y = 1.0000000
result = x - y
print(result) # ~1e-7, but accumulated errors can dominate
# ML implication: always use log-sum-exp trick for numerical stability
# Instead of log(exp(a) + exp(b)):
a_val, b_val = 1000.0, 999.0
# Naive (overflows):
# np.log(np.exp(a_val) + np.exp(b_val)) # inf!
# Stable version:
m = max(a_val, b_val)
stable = m + np.log(np.exp(a_val - m) + np.exp(b_val - m))
print("log-sum-exp:", stable) # 1000.3132...
In ML, we accept approximations willingly — stochastic gradient descent uses a random subset of data each step (an approximation of the full gradient). Monte Carlo methods estimate integrals by sampling. Variational inference approximates intractable posterior distributions. The art is knowing how much error is acceptable and where exactness genuinely matters.
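To see how cheaply a sampling approximation can buy a good answer, here is a minimal Monte Carlo sketch: estimating $\mathbb{E}[X^2]$ for $X \sim \mathcal{N}(0,1)$ by averaging samples. The exact answer is 1; the estimate typically improves as the sample size grows.
import numpy as np
# Monte Carlo approximation: estimate E[X^2] for X ~ N(0, 1) by sampling.
# The exact answer is 1; the sampling error typically shrinks with more samples.
rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    samples = rng.standard_normal(n)
    estimate = np.mean(samples ** 2)
    print(f"n = {n:>9,}   estimate = {estimate:.4f}   |error| = {abs(estimate - 1):.4f}")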
Where exactness and numerical care genuinely matter: loss computation (use log_softmax, not log(softmax(x))); matrix inversion (prefer the pseudoinverse for ill-conditioned matrices); probability calculations (work in log-space to prevent underflow with very small values).
Functions & Graphs
A function is a rule that takes an input and produces a unique output. Every ML model is, fundamentally, a function: it takes in features and outputs a prediction. Before building complex models, you need deep familiarity with the building blocks.
For example, a classifier maps from feature space to label space: $f: \mathbb{R}^d \to \{0, 1\}$. A regressor maps to real numbers: $f: \mathbb{R}^d \to \mathbb{R}$.
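As a tiny illustration, a classifier really is just a function from $\mathbb{R}^d$ to $\{0, 1\}$. The weights below are made up purely for demonstration:
import numpy as np
# A classifier is literally a function f: R^d -> {0, 1}.
# Minimal sketch: a thresholded linear score with made-up weights.
def classify(x, w, b):
    return int(w @ x + b > 0)    # 1 if the score is positive, else 0
w = np.array([1.5, -2.0])
b = 0.5
print(classify(np.array([2.0, 0.5]), w, b))  # score = 3.0 - 1.0 + 0.5 = 2.5 -> 1
print(classify(np.array([0.0, 1.0]), w, b))  # score = 0.0 - 2.0 + 0.5 = -1.5 -> 0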
0.2.1 — Linear Functions
The simplest and most important class. A linear function of one variable:
$$f(x) = mx + b$$
where $m$ is the slope (rate of change) and $b$ is the y-intercept (value when $x=0$). The slope tells you: "for every unit increase in $x$, $f(x)$ increases by $m$."
In multiple dimensions (the foundation of linear regression):
$$f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$
where $\mathbf{w} = [w_1, w_2, \ldots, w_n]^\top$ is the weight vector and $b$ is the bias. Every linear regression model, every neuron in a neural network before activation, is exactly this formula.
import numpy as np
import matplotlib.pyplot as plt
# Linear function: f(x) = 2x + 1
x = np.linspace(-5, 5, 100)
f = 2 * x + 1 # slope=2, intercept=1
plt.figure(figsize=(7, 4))
plt.plot(x, f, color='#3B9797', linewidth=2.5, label='f(x) = 2x + 1')
plt.axhline(0, color='gray', linewidth=0.5)
plt.axvline(0, color='gray', linewidth=0.5)
plt.scatter([0], [1], color='#BF092F', s=80, zorder=5, label='y-intercept (0, 1)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Linear Function: f(x) = mx + b')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
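The same formula in multiple dimensions is just a dot product plus a bias. A minimal sketch with made-up weights and a single three-feature input:
import numpy as np
# The formula in n dimensions: f(x) = w^T x + b (a dot product plus a bias).
# Weights and features below are made up for illustration.
w = np.array([2.0, -1.0, 0.5])    # weight vector
b = 1.0                            # bias
x = np.array([3.0, 4.0, 2.0])     # one 3-feature input
y_hat = w @ x + b                  # 2*3 + (-1)*4 + 0.5*2 + 1 = 4.0
print("prediction:", y_hat)        # 4.0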
0.2.2 — Polynomial Functions
A polynomial extends the linear idea to higher powers:
$$f(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
The degree $n$ determines the shape and complexity:
- Degree 1 (linear): straight line — $f(x) = 2x + 1$
- Degree 2 (quadratic): parabola — $f(x) = x^2 - 4$, opens up or down
- Degree 3 (cubic): S-curve — one inflection point, can model more complex relationships
- Degree $n$: up to $n-1$ turning points
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# Original feature: one variable x
x = np.array([[1], [2], [3], [4], [5]])
# Add polynomial features up to degree 3
poly = PolynomialFeatures(degree=3, include_bias=False)
x_poly = poly.fit_transform(x)
# Now each sample has features: [x, x^2, x^3]
print("Original features:\n", x.T)
print("Polynomial features:\n", x_poly.T)
# Output row 0: [1, 2, 3, 4, 5] <- x
# Output row 1: [1, 4, 9, 16, 25] <- x^2
# Output row 2: [1, 8, 27, 64, 125] <- x^3
0.2.3 — Exponential & Logarithmic Functions
The exponential function is everywhere in ML:
$$f(x) = e^x, \qquad e \approx 2.71828\ldots$$
Its defining property: the derivative of $e^x$ is $e^x$ itself. It's the only function that equals its own rate of change. This makes it uniquely tractable in calculus-heavy ML derivations.
The natural logarithm $\ln(x) = \log_e(x)$ is the inverse:
$$\ln(e^x) = x \quad \text{and} \quad e^{\ln(x)} = x \ \ (x > 0)$$
Where you'll see these in ML:
- Sigmoid activation: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ — squashes any value into $(0, 1)$
- Softmax: $\text{softmax}(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$ — turns logits into a probability distribution
- Cross-entropy loss: $\mathcal{L} = -\sum_i y_i \ln(\hat{p}_i)$ — penalises wrong probability assignments harshly
- Log-likelihood: products of small probabilities become sums of logs — numerically stable and analytically convenient
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-3, 3, 200)
# Three core exponential/log functions in ML
sigmoid = 1 / (1 + np.exp(-x))
exp_x = np.exp(x)
x_pos = np.linspace(0.01, 3, 200)
log_x = np.log(x_pos)
fig, axes = plt.subplots(1, 3, figsize=(13, 4))
axes[0].plot(x, sigmoid, color='#3B9797', linewidth=2.5)
axes[0].set_title('Sigmoid: 1 / (1 + e^{-x})')
axes[0].axhline(0.5, linestyle='--', color='gray', alpha=0.5)
axes[0].set_xlabel('x'); axes[0].set_ylabel('σ(x)')
axes[1].plot(x, exp_x, color='#16476A', linewidth=2.5)
axes[1].set_title('Exponential: e^x')
axes[1].set_xlabel('x'); axes[1].set_ylabel('e^x')
axes[2].plot(x_pos, log_x, color='#BF092F', linewidth=2.5)
axes[2].set_title('Natural Log: ln(x)')
axes[2].set_xlabel('x'); axes[2].set_ylabel('ln(x)')
for ax in axes:
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
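One of the bullet points above said that products of small probabilities become sums of logs. A quick numeric check, as a minimal sketch with made-up probabilities:
import numpy as np
# "Products of small probabilities become sums of logs": a quick numeric check.
# Multiplying 1000 made-up probabilities of 0.01 underflows to zero,
# while the equivalent sum of logs stays perfectly representable.
p = np.full(1000, 0.01)
print("product of probs:", np.prod(p))          # 0.0 (underflow)
print("sum of log-probs:", np.sum(np.log(p)))   # about -4605.17, no underflow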
0.2.4 — Function Transformations
Every activation function is a transformed version of a simpler function. Understanding transformations means you can decode any function visually:
| Transformation | Formula | Effect on Graph | ML Example |
|---|---|---|---|
| Vertical shift up by $c$ | $f(x) + c$ | Move graph $c$ units up | Adding bias $b$: $\mathbf{w}^\top\mathbf{x} + b$ |
| Horizontal shift right by $c$ | $f(x - c)$ | Move graph $c$ units right | Shifted activation threshold |
| Vertical scale by $a$ | $a \cdot f(x)$ | Stretch ($|a|>1$) or compress ($|a|<1$) vertically | Learning rate scaling gradient |
| Reflection about $x$-axis | $-f(x)$ | Flip upside down | Negating loss for maximisation |
| Horizontal scale by $\frac{1}{a}$ | $f(ax)$ | Compress horizontally by $a$ | Temperature in softmax: $e^{x/T}$ |
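The last row of the table is easy to see numerically. A minimal sketch assuming the common temperature-scaled softmax $\text{softmax}(x/T)$, with made-up logits:
import numpy as np
# Horizontal scaling in practice: temperature T in softmax(x / T).
# The logits are made up; the softmax definition is the standard one.
def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
for T in [0.5, 1.0, 5.0]:
    print(f"T = {T:>3}: {softmax(logits / T).round(3)}")
# Small T sharpens the distribution; large T flattens it toward uniform.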
0.2.5 — Composition of Functions
Composition means applying one function to the output of another:
$$(f \circ g)(x) = f(g(x))$$
Read it right to left: first apply $g$, then apply $f$ to the result. A deep neural network is exactly such a chain of composed layers:
flowchart LR
X["x ∈ ℝ^d\n(Input)"] --> L1["Layer 1\nh₁ = σ(W₁x + b₁)"]
L1 --> L2["Layer 2\nh₂ = σ(W₂h₁ + b₂)"]
L2 --> OUT["Output\nŷ = softmax(W₃h₂ + b₃)"]
OUT --> LOSS["Loss ℒ(ŷ, y)"]
LOSS -->|"Backprop:\nchain rule"| L2
LOSS -->|"chain rule"| L1
During training, we need $\frac{\partial \mathcal{L}}{\partial W_1}$ — how much does the loss change with respect to the first layer's weights? The chain rule for compositions gives us exactly that, layer by layer. We'll derive this fully in Part 8 (Calculus). For now, appreciate that knowing how composition works is what makes that derivation possible.
import numpy as np
# Composition of functions — the neural network forward pass
def sigmoid(z):
return 1 / (1 + np.exp(-z))
def softmax(z):
e = np.exp(z - np.max(z)) # subtract max for numerical stability
return e / e.sum()
# Simulate a 3-layer network: input -> hidden1 -> hidden2 -> output
np.random.seed(42)
x = np.random.randn(4) # 4 input features
W1 = np.random.randn(6, 4) * 0.1 # 6 hidden units, 4 inputs
b1 = np.zeros(6)
W2 = np.random.randn(6, 6) * 0.1 # 6 -> 6
b2 = np.zeros(6)
W3 = np.random.randn(3, 6) * 0.1 # 6 -> 3 output classes
b3 = np.zeros(3)
# Forward pass: composition f3(f2(f1(x)))
h1 = sigmoid(W1 @ x + b1) # g1(x)
h2 = sigmoid(W2 @ h1 + b2) # g2(g1(x))
y_hat = softmax(W3 @ h2 + b3) # g3(g2(g1(x)))
print("Input: ", x.round(3))
print("Hidden layer 1:", h1.round(3))
print("Hidden layer 2:", h2.round(3))
print("Output probs: ", y_hat.round(4))
print("Predicted class:", np.argmax(y_hat))
Interactive Function Explorer
The chart below visualises the four core function families on the same axes. Use it to develop geometric intuition — notice how their rates of change differ.
Practice Exercises
Decode These Expressions
Translate each mathematical expression into plain English, then compute the value:
- $\displaystyle\sum_{i=1}^{4} i^2$ (hint: $1^2 + 2^2 + \ldots$)
- $\displaystyle\prod_{k=1}^{3} (2k - 1)$ (hint: $1 \times 3 \times 5$)
- $\mathbf{w}^\top \mathbf{x}$ where $\mathbf{w} = [2, -1, 3]^\top$ and $\mathbf{x} = [1, 4, 2]^\top$
Show Answers
- $1 + 4 + 9 + 16 = 30$
- $(2\cdot1-1)(2\cdot2-1)(2\cdot3-1) = 1 \cdot 3 \cdot 5 = 15$
- $2(1) + (-1)(4) + 3(2) = 2 - 4 + 6 = 4$
Identify the Function Family
For each ML formula below, identify which function family it belongs to (linear, polynomial, exponential, or composition), and explain why:
- Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
- Linear regression prediction: $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$
- ReLU activation: $\text{ReLU}(x) = \max(0, x)$
- A two-layer neural net: $\hat{y} = \sigma(W_2 \sigma(W_1 x + b_1) + b_2)$
Show Answers
- Composition of exponential and linear: First compute $-z$ (linear), then $e^{-z}$ (exponential), then $1/(1 + \cdot)$ (rational/transformation)
- Linear: It's $\mathbf{w}^\top \mathbf{x} + b$ — a dot product (sum of products) plus a constant. Exactly $f(x) = mx + b$ in multiple dimensions.
- Piecewise linear: Two linear pieces joined at $x=0$. Not polynomial — it has a non-smooth "kink".
- Nested composition: Apply linear, then sigmoid, then linear again, then sigmoid — $(f_4 \circ f_3 \circ f_2 \circ f_1)(x)$.
Implement from Scratch
Without using any ML library, implement a linear function and evaluate it on a dataset. Then plot it against the data points.
import numpy as np
import matplotlib.pyplot as plt
# TODO: complete the linear function
def linear(x, w, b):
# Your implementation here
pass
# Generate toy data: y = 3x - 2 + noise
np.random.seed(0)
x_data = np.linspace(-3, 3, 30)
y_data = 3 * x_data - 2 + np.random.randn(30) * 0.8
# Evaluate your function with w=3, b=-2
# y_pred = linear(x_data, w=3, b=-2)
# Plot: scatter data, line prediction
# plt.scatter(x_data, y_data, ...)
# plt.plot(x_data, y_pred, ...)
# plt.show()
Show Solution
import numpy as np
import matplotlib.pyplot as plt
def linear(x, w, b):
return w * x + b # f(x) = wx + b
np.random.seed(0)
x_data = np.linspace(-3, 3, 30)
y_data = 3 * x_data - 2 + np.random.randn(30) * 0.8
y_pred = linear(x_data, w=3, b=-2)
plt.figure(figsize=(7, 4))
plt.scatter(x_data, y_data, color='#3B9797', label='Data', alpha=0.7)
plt.plot(x_data, y_pred, color='#BF092F', linewidth=2, label='f(x) = 3x - 2')
plt.xlabel('x'); plt.ylabel('y')
plt.legend(); plt.grid(True, alpha=0.3)
plt.title('Linear Function vs Data')
plt.tight_layout(); plt.show()
Conclusion & Next Steps
In this first part of the bootcamp, we've laid the psychological and conceptual groundwork for everything that follows:
- Abstraction & Modeling — every ML model is an intentional simplification of reality; the art is choosing what to keep
- Notation Fluency — $\Sigma$, $\Pi$, $\nabla$, $\mathbb{E}[\cdot]$, Greek letters — these are vocabulary, not barriers
- Proof vs Intuition — use intuition to understand, use proof to validate; both are necessary
- Approximation — nearly everything in ML is approximate by design; know when exactness matters (log-space, numerical stability)
- Functions — linear, polynomial, exponential, and compositions are the atoms of all ML models
Next in the Series
In Part 2: Set Theory & Foundations, we'll build the rigorous language for talking about collections of objects — sets, subsets, power sets, operations (union, intersection, complement), De Morgan's Laws, and their direct connections to feature spaces and probability event spaces in ML.