Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
- Mathematical Thinking: mindset, notation & functions
- Set Theory & Foundations: sets, operations & ML connections
- Combinatorics: counting, permutations & combinations
- Probability Fundamentals: rules, Bayes & distributions
- Statistics: descriptive to inferential
- Information Theory: entropy, cross-entropy & KL divergence
- Linear Algebra: vectors, matrices & transformations
- Calculus & Optimization: derivatives, gradients & descent
- ML-Specific Math: loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: multivariate stats & Bayesian inference
- Projects & Applications: build from scratch (regression, Bayes, PCA)
NumPy Foundation
NumPy provides contiguous-memory arrays and BLAS-backed math operations, typically 10–1000× faster than pure Python loops for numerical work.
import numpy as np
# Create arrays from Python lists
a = np.array([1, 2, 3, 4, 5], dtype=float)
b = np.array([5, 4, 3, 2, 1], dtype=float)
# Element-wise operations
print("a + b =", a + b)
print("a * b =", a * b)
print("a ** 2 =", a ** 2)
print("dot product:", np.dot(a, b)) # 1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35
# Aggregation
print("\nStatistics on a:")
print(f" mean={a.mean():.2f}, std={a.std():.2f}, min={a.min()}, max={a.max()}")
print(f" sum={a.sum()}, cumsum={np.cumsum(a)}")
# 2-D arrays (matrices)
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
print("\nMatrix M:")
print(M)
print(f"Shape: {M.shape}, Rank (actual): {np.linalg.matrix_rank(M)}")
print(f"Trace (sum of diagonal): {np.trace(M):.2f}")
print(f"Column sums: {M.sum(axis=0)}")
print(f"Row sums: {M.sum(axis=1)}")
Broadcasting
Broadcasting lets NumPy perform operations on arrays of different (but compatible) shapes without explicit copying. The rule: dimensions are aligned from the right; a size-1 dimension is stretched to match the other.
import numpy as np
# Feature normalization via broadcasting (common ML preprocessing)
X = np.array([[1.0, 200.0, 0.5],
              [2.0, 150.0, 1.5],
              [3.0, 300.0, 0.8],
              [4.0, 250.0, 2.1]])  # (4, 3) matrix: 4 samples, 3 features
mu = X.mean(axis=0) # (3,) — mean per feature
std = X.std(axis=0) # (3,) — std per feature
X_norm = (X - mu) / (std + 1e-8) # broadcast: (4,3) - (3,) → (4,3)
print("Feature means:", np.round(mu, 2))
print("Feature stds: ", np.round(std, 2))
print("\nNormalized X (should have ~zero mean, ~unit variance per column):")
print(np.round(X_norm, 3))
print("\nVerification — column means:", np.round(X_norm.mean(axis=0), 6))
print("Verification — column stds: ", np.round(X_norm.std(axis=0), 4))
# Broadcasting with outer product (no loops needed)
u = np.array([1, 2, 3])
v = np.array([10, 20])
outer = u[:, np.newaxis] * v[np.newaxis, :] # (3,1) * (1,2) → (3,2)
print("\nOuter product u ⊗ v:")
print(outer)
print("Expected:", np.outer(u, v))
np.linalg — Linear Algebra Operations
NumPy's linear algebra submodule wraps LAPACK/BLAS for high-performance matrix operations:
import numpy as np
# Construct a well-conditioned symmetric positive definite matrix
rng = np.random.default_rng(42)
A_raw = rng.standard_normal((4, 4))
A = A_raw @ A_raw.T + 4 * np.eye(4) # guaranteed SPD
print("Matrix A (SPD):")
print(np.round(A, 3))
# Core operations
det = np.linalg.det(A)
inv_A = np.linalg.inv(A)
eigvals, eigvecs = np.linalg.eigh(A) # eigh for symmetric matrices
U, S, Vt = np.linalg.svd(A)
print(f"\nDeterminant: {det:.4f}")
print(f"Eigenvalues: {np.round(eigvals, 4)}")
print(f"Singular values: {np.round(S, 4)}")
print(f"Condition number: {S.max() / S.min():.4f} (S_max / S_min)")
# Solve linear system Ax = b (preferred over inverting A — numerically stable)
b = np.array([1.0, 2.0, 3.0, 4.0])
x = np.linalg.solve(A, b)
print(f"\nSolution x to Ax=b: {np.round(x, 4)}")
print(f"Verification Ax: {np.round(A @ x, 4)}")
print(f"Residual norm: {np.linalg.norm(A @ x - b):.2e}")
# Norm varieties
v = np.array([3.0, 4.0])
print(f"\nL1 norm: {np.linalg.norm(v, 1):.2f}, L2 norm: {np.linalg.norm(v, 2):.2f}, L∞ norm: {np.linalg.norm(v, np.inf):.2f}")
SciPy: Optimization & Statistics
SciPy extends NumPy with optimization algorithms, probability distributions, and numerical methods used throughout scientific computing and ML.
import numpy as np
from scipy import optimize, stats
# --- Optimization: minimize a function ---
def rosenbrock(x):
    """Classic optimization test function: banana-shaped valley."""
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
x0 = np.array([0.0, 0.0]) # initial guess
result_nelder = optimize.minimize(rosenbrock, x0, method='Nelder-Mead')
result_bfgs = optimize.minimize(rosenbrock, x0, method='BFGS')
print("Rosenbrock minimization (true minimum at [1, 1]):")
print(f" Nelder-Mead: x={np.round(result_nelder.x, 6)}, f={result_nelder.fun:.2e}, iters={result_nelder.nit}")
print(f" BFGS: x={np.round(result_bfgs.x, 6)}, f={result_bfgs.fun:.2e}, iters={result_bfgs.nit}")
# --- Scalar root finding ---
def f_scalar(x):
    return x**3 - x - 2
root = optimize.brentq(f_scalar, 1, 2)
print(f"\nRoot of x³ - x - 2 = 0 on [1,2]: x ≈ {root:.6f}")
print(f"Verification f({root:.4f}) = {f_scalar(root):.2e}")
# --- Statistical testing ---
np.random.seed(42)
sample_a = np.random.normal(loc=5.0, scale=1.0, size=30)
sample_b = np.random.normal(loc=5.5, scale=1.0, size=30)
t_stat, p_value = stats.ttest_ind(sample_a, sample_b)
print(f"\nTwo-sample t-test:")
print(f" Mean A={sample_a.mean():.3f}, Mean B={sample_b.mean():.3f}")
print(f" t-statistic={t_stat:.3f}, p-value={p_value:.4f}")
print(f" Conclusion: {'Reject H₀ (significant difference)' if p_value < 0.05 else 'Fail to reject H₀'} at α=0.05")
SymPy: Symbolic Mathematics
SymPy performs exact symbolic computation — computing derivatives, integrals, and solving equations in closed form. Essential for verifying gradient derivations and exploring mathematical relationships.
from sympy import symbols, diff, integrate, solve, simplify, exp, log, pi, sqrt, latex, sin, cos, oo
# Define symbolic variables (x ranges over all reals; the parameters are positive)
x = symbols('x', real=True)
mu, sigma, lam = symbols('mu sigma lambda', positive=True)
# --- Differentiate: sigmoid function ---
sigmoid_expr = 1 / (1 + exp(-x))
sigmoid_grad = diff(sigmoid_expr, x)
print("σ(x) =", sigmoid_expr)
print("σ'(x) =", simplify(sigmoid_grad))
# Should simplify to σ(x)(1 - σ(x))
# --- Chain rule: compose functions ---
u_expr = x**2 + 1
composite = sin(u_expr)
print("\nd/dx sin(x² + 1) =", diff(composite, x)) # cos(x²+1) * 2x
# --- Integrate: Gaussian normalizer ---
gaussian_unnorm = exp(-x**2 / (2 * sigma**2))
gaussian_integral = integrate(gaussian_unnorm, (x, -oo, oo))
print("\n∫ exp(-x²/2σ²) dx from -∞ to ∞ =", simplify(gaussian_integral))
# Output: sqrt(2)*sqrt(pi)*sigma = σ√(2π) — the Gaussian normalizer
# --- Solve: critical points ---
f_poly = x**4 - 4*x**3 + 6*x**2 - 4*x + 1
f_prime = diff(f_poly, x)
critical_points = solve(f_prime, x)
print(f"\nf(x) = (x-1)⁴ → f'(x) = {f_prime}")
print(f"Critical points: {critical_points}")
# --- LaTeX output for typesetting ---
expr = x**2 / (1 + sqrt(x))
print(f"\nLaTeX for x²/(1+√x): {latex(expr)}")
Monte Carlo Methods
Monte Carlo methods approximate deterministic quantities using random sampling. They shine when analytical solutions are intractable or integration is high-dimensional.
import numpy as np
# --- Monte Carlo π estimation ---
np.random.seed(42)
n_samples = 1_000_000
x = np.random.uniform(-1, 1, n_samples)
y = np.random.uniform(-1, 1, n_samples)
inside_circle = (x**2 + y**2) <= 1.0
pi_estimate = 4 * inside_circle.mean()
print(f"Monte Carlo π estimate (n={n_samples:,}): {pi_estimate:.6f}")
print(f"True π: {np.pi:.6f}")
print(f"Error: {abs(pi_estimate - np.pi):.6f}")
# Standard error of the estimator (σ/√n)
p_hat = inside_circle.mean()
se = np.sqrt(p_hat * (1 - p_hat) / n_samples)
print(f"95% CI: ({4*(p_hat - 1.96*se):.5f}, {4*(p_hat + 1.96*se):.5f})")
# --- Monte Carlo integration: E[X²] for X ~ Uniform(0,1) ---
# True value: ∫₀¹ x² dx = 1/3 ≈ 0.3333
np.random.seed(0)
samples = np.random.uniform(0, 1, 100_000)
mc_integral = (samples**2).mean()
print(f"\nMC estimate of ∫₀¹ x² dx = {mc_integral:.6f} (true: {1/3:.6f})")
# --- Bootstrap confidence interval for the median ---
np.random.seed(1)
data = np.random.lognormal(mean=0, sigma=0.5, size=50)
sample_median = np.median(data)
boot_medians = [np.median(np.random.choice(data, size=len(data), replace=True))
                for _ in range(5000)]
boot_ci = np.percentile(boot_medians, [2.5, 97.5])
print(f"\nBootstrap median CI (n=50, B=5000 resamples):")
print(f"  Sample median: {sample_median:.4f}")
print(f"  95% Bootstrap CI: [{boot_ci[0]:.4f}, {boot_ci[1]:.4f}]")
Numerical Integration
When analytical antiderivatives are unavailable, we approximate integrals numerically. The accuracy improves with smaller step sizes.
import numpy as np
from scipy import integrate as sci_integrate
def f(x):
    """Integrand sin(x²): ∫₀^√π sin(x²) dx has no elementary antiderivative."""
    return np.sin(x**2)
a, b = 0.0, np.sqrt(np.pi)
# Trapezoidal rule: O(h²) error
def trapezoidal(func, a, b, n):
    x = np.linspace(a, b, n + 1)
    y = func(x)
    h = (b - a) / n
    return h * (0.5 * y[0] + np.sum(y[1:-1]) + 0.5 * y[-1])
# Simpson's 1/3 rule: O(h⁴) error (requires even n)
def simpsons(func, a, b, n):
    if n % 2 != 0:
        n += 1
    x = np.linspace(a, b, n + 1)
    y = func(x)
    h = (b - a) / n
    return h / 3 * (y[0] + 4*np.sum(y[1:-1:2]) + 2*np.sum(y[2:-2:2]) + y[-1])
# SciPy adaptive quadrature (high accuracy reference)
scipy_result, scipy_error = sci_integrate.quad(f, a, b)
print(f"∫₀^√π sin(x²) dx:")
print(f" SciPy adaptive quadrature: {scipy_result:.8f} (error bound: {scipy_error:.2e})")
for n in [10, 100, 1000]:
    trap = trapezoidal(f, a, b, n)
    simp = simpsons(f, a, b, n)
    print(f"  n={n:5d}: Trapezoidal={trap:.8f} (err={abs(trap-scipy_result):.2e}), "
          f"Simpson={simp:.8f} (err={abs(simp-scipy_result):.2e})")
# Expected: Simpson's error shrinks as O(h⁴) vs the Trapezoidal rule's O(h²),
# so at the same n it is orders of magnitude more accurate for smooth integrands
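For smooth integrands there is a third common option: fixed-order Gauss-Legendre quadrature, available via np.polynomial.legendre.leggauss. A short sketch, mapping the nodes from [-1, 1] to our interval:
import numpy as np
a, b = 0.0, np.sqrt(np.pi)
for order in [5, 10, 20]:
    nodes, weights = np.polynomial.legendre.leggauss(order)
    t = 0.5 * (b - a) * nodes + 0.5 * (b + a)        # affine map [-1,1] → [a,b]
    est = 0.5 * (b - a) * np.sum(weights * np.sin(t**2))
    print(f"order {order:2d}: {est:.8f}")
# For smooth functions, a modest-order Gauss rule typically matches adaptive quadrature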
Vectorization vs Loops
Python loops are slow for numerical work because they are interpreted one step at a time. NumPy vectorization delegates computation to compiled C/Fortran routines operating on entire arrays at once.
import numpy as np
import time
n = 5_000_000
# --- Python loop ---
a = list(range(n))
b = list(range(n, 2*n))
t0 = time.perf_counter()
result_loop = [a[i] * b[i] for i in range(n)]
t_loop = time.perf_counter() - t0
# --- NumPy vectorized ---
a_np = np.arange(n, dtype=float)
b_np = np.arange(n, n + n, dtype=float)
t0 = time.perf_counter()
result_np = a_np * b_np
t_np = time.perf_counter() - t0
print(f"Element-wise multiplication of {n:,} elements:")
print(f" Python loop: {t_loop:.3f} s")
print(f" NumPy: {t_np:.4f} s")
print(f" Speedup: {t_loop / t_np:.0f}×")
# --- Softmax: loop vs vectorized ---
def softmax_loop(z):
    exp_z = [np.exp(zi) for zi in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]
def softmax_vectorized(z):
    z_shifted = z - z.max()  # subtract max for numerical stability
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()
z = np.random.randn(10000)
t0 = time.perf_counter(); _ = softmax_loop(z); t_loop = time.perf_counter() - t0
t0 = time.perf_counter(); _ = softmax_vectorized(z); t_vec = time.perf_counter() - t0
print(f"\nSoftmax on 10,000-dim vector:")
print(f" Loop: {t_loop*1000:.2f} ms")
print(f" Vectorized: {t_vec*1000:.2f} ms")
print(f" Speedup: {t_loop/t_vec:.0f}×")
Rule of thumb: measure before you optimize. Profile hot spots with %timeit in Jupyter (or the standard-library timeit module in plain scripts) before rewriting code.
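A minimal sketch of the same measurement with the timeit module (the statement and repetition count are arbitrary choices):
import timeit
setup = "import numpy as np; z = np.random.randn(10_000)"
stmt = "e = np.exp(z - z.max()); e / e.sum()"   # stable vectorized softmax
t = timeit.timeit(stmt, setup=setup, number=1_000)
print(f"Vectorized softmax: {t / 1_000 * 1e6:.1f} µs per call")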
Practice Exercises
Monte Carlo Integration — Expected Value of a Normal
Estimate $\mathbb{E}[X^2]$ for $X \sim \mathcal{N}(0, 1)$ using Monte Carlo with $n = 100{,}000$ samples. The true answer is $\text{Var}(X) + \mathbb{E}[X]^2 = 1 + 0 = 1$.
Answer:
import numpy as np
np.random.seed(42)
samples = np.random.randn(100_000)
estimate = (samples**2).mean()
print(f"MC estimate of E[X²]: {estimate:.4f} (true: 1.0)")
print(f"Error: {abs(estimate - 1.0):.4f}")
# Typical output: 0.9989, error ≈ 0.001
Symbolic Gradient of Cross-Entropy + Softmax
Using SymPy, define softmax for 2 classes ($z_1, z_2$) and compute $\partial \mathcal{L} / \partial z_1$ where $\mathcal{L} = -\log(\text{softmax}_1)$. Confirm it equals $p_1 - 1$.
Answer:
from sympy import symbols, exp, log, diff, simplify
z1, z2 = symbols('z1 z2', real=True)
p1 = exp(z1) / (exp(z1) + exp(z2))
loss = -log(p1)
grad = simplify(diff(loss, z1))
print("∂L/∂z1 =", grad) # Should simplify to exp(z2)/(exp(z1)+exp(z2)) = p2 = 1 - p1
# i.e., gradient = p1 - 1 for the true class (y=1), p1 for non-true class
Conclusion & Next Steps
Python's scientific stack maps almost perfectly onto the math of this series:
- NumPy — vectors, matrices, norms, eigenvalues (Parts 7, 3)
- SciPy optimize — gradient descent and constrained optimization (Part 8)
- SciPy stats — hypothesis testing, distributions (Parts 4, 5)
- SymPy — symbolic calculus, exact derivatives (Part 8), MLE derivations (Part 9)
- Monte Carlo — integration, bootstrap CI, simulation (Parts 4, 5)
- Vectorization — the performance foundation of every ML framework
Next in the Series
In Part 11: Advanced Topics, we push into graduate-level material: multivariate Gaussians, Bayesian inference, Markov chains, concentration inequalities, and the EM algorithm — the math behind some of ML's most powerful techniques.