Complete Math + Probability + Statistics for ML/AI/DS Bootcamp
Series outline:
- Mathematical Thinking: Mindset, notation & functions
- Set Theory & Foundations: Sets, operations & ML connections
- Combinatorics: Counting, permutations & combinations
- Probability Fundamentals: Rules, Bayes & distributions
- Statistics: Descriptive to inferential
- Information Theory: Entropy, cross-entropy & KL divergence
- Linear Algebra: Vectors, matrices & transformations
- Calculus & Optimization: Derivatives, gradients & descent
- ML-Specific Math: Loss functions & regularization
- Computational Math: NumPy, SciPy & simulation
- Advanced Topics: Multivariate stats & Bayesian inference
- Projects & Applications: Build from scratch (regression, Bayes, PCA)
Derivatives
A derivative measures how a function changes as its input changes — the instantaneous rate of change. Derivatives are the mathematical engine behind optimization, neural network training, and gradient descent.
Gradient of a Straight Line
For a linear function $y = mx + c$, the gradient (slope) is constant throughout:
$$m = \frac{\Delta y}{\Delta x} = \frac{y_2 - y_1}{x_2 - x_1}$$
This is "rise over run." The gradient is the same regardless of which two points you pick — straight lines have a uniform rate of change. A positive gradient means the line rises left-to-right; negative means it falls; zero means horizontal.
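A quick numerical check of this, as a minimal sketch with an illustrative line $y = 2x + 1$: the slope computed from any two distinct points is identical.
import numpy as np
# Line y = 2x + 1: slope m = 2, intercept c = 1
m, c = 2.0, 1.0
line = lambda x: m * x + c
def slope(x1, x2):
    """Rise over run between two points on the line."""
    return (line(x2) - line(x1)) / (x2 - x1)
for x1, x2 in [(0, 1), (-3, 5), (10, 10.001)]:
    print(f"points x={x1}, x={x2}: slope = {slope(x1, x2):.6f}")
# Every pair gives 2.0 — straight lines have a uniform rate of change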
Gradient of a Curve (Tangent)
Curves have a different gradient at each point. The gradient at a point on a curve $y = f(x)$ is the slope of the tangent line at that point. As we zoom in on any smooth curve, it locally looks like a straight line — the tangent.
Differentiation from First Principles
The formal definition of the derivative:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
Worked example — differentiate $f(x) = x^2$ from first principles:
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$
import numpy as np
def derivative_first_principles(f, x, h=1e-7):
"""Numerical derivative via first principles (forward difference)."""
return (f(x + h) - f(x)) / h
def derivative_central(f, x, h=1e-6):
"""Central difference — more accurate (O(h^2) error vs O(h))."""
return (f(x + h) - f(x - h)) / (2 * h)
# Test on f(x) = x^2 — analytical derivative = 2x
f = lambda x: x**2
grad_f = lambda x: 2*x
print("f(x) = x^2 — First Principles vs Analytical:")
for x in [0, 1, 2, 3]:
forward_diff = derivative_first_principles(f, x)
central_diff = derivative_central(f, x)
analytical = grad_f(x)
print(f" x={x}: forward={forward_diff:.6f}, central={central_diff:.6f}, analytical={analytical:.6f}")
# Test on f(x) = sin(x) — derivative = cos(x)
g = np.sin
print("\nf(x) = sin(x) — derivative = cos(x):")
for x in [0, np.pi/4, np.pi/2, np.pi]:
num = derivative_central(g, x)
ana = np.cos(x)
print(f" x={x:.4f}: numerical={num:.6f}, analytical={ana:.6f}")
The three panels below illustrate what the first derivative means geometrically — from the limit definition as a shrinking secant line, to its sign ruling where a function rises or falls, to its direct correspondence with slope:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="muted")
f = lambda x: x**2
f_prime = lambda x: 2 * x
x_full = np.linspace(-0.5, 2.5, 500)
x0 = 1.0 # point at which we visualise the limit
h_values = [1.0, 0.5, 0.2, 0.05] # secant lines converging to tangent
# ── Tangent line at x0 ────────────────────────────────────────────────────
slope_at_x0 = f_prime(x0)
tangent = lambda x: f(x0) + slope_at_x0 * (x - x0)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle(
"First Derivative — Geometric Significance",
fontsize=13, fontweight="bold", y=1.02
)
# ── Panel 1: Limit definition — secant → tangent ─────────────────────────
ax = axes[0]
ax.plot(x_full, f(x_full), color="#132440", lw=2.5, label="$f(x) = x^2$", zorder=3)
cmap_colors = sns.color_palette("flare", len(h_values))
for h, col in zip(h_values, cmap_colors):
x1, y1 = x0, f(x0)
x2, y2 = x0 + h, f(x0 + h)
secant_slope = (y2 - y1) / (x2 - x1)
secant_line = lambda x, s=secant_slope: f(x0) + s * (x - x0)
x_sec = np.linspace(x0 - 0.3, x0 + h + 0.2, 100)
ax.plot(x_sec, secant_line(x_sec), color=col, lw=1.4, ls="--",
label=f"h = {h:.2f} (slope ≈ {secant_slope:.2f})")
ax.scatter([x2], [y2], color=col, s=45, zorder=4)
# Tangent
x_tan = np.linspace(x0 - 0.6, x0 + 0.6, 100)
ax.plot(x_tan, tangent(x_tan), color="#BF092F", lw=2.2,
label=f"Tangent h→0 (slope={slope_at_x0:.1f})", zorder=5)
ax.scatter([x0], [f(x0)], color="#BF092F", s=80, zorder=6, ec="white", lw=1.5)
ax.set_title("Limit Definition\nSecant lines → tangent as $h \\to 0$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=7.5, loc="upper left")
ax.set_xlim(-0.3, 2.4); ax.set_ylim(-0.3, 6)
# ── Panel 2: f(x) = sin(x) — sign of f'(x) drives rise/fall ─────────────
ax = axes[1]
g = np.sin
g_prime = np.cos
x_trig = np.linspace(0, 3 * np.pi, 600)
ax.plot(x_trig, g(x_trig), color="#132440", lw=2.5, label="$f(x) = \\sin x$")
ax.axhline(0, color="gray", lw=0.7, ls="--", alpha=0.5)
ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) > 0),
alpha=0.18, color="#16476A", label="$f'(x) > 0$ (rising ↑)")
ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) < 0),
alpha=0.18, color="#BF092F", label="$f'(x) < 0$ (falling ↓)")
# Mark zeros of f'
crit_x = np.array([np.pi/2, 3*np.pi/2, 5*np.pi/2])
crit_y = g(crit_x)
crit_labels = ["Max\n$f'=0$", "Min\n$f'=0$", "Max\n$f'=0$"]
crit_colors = ["#16476A", "#BF092F", "#16476A"]
for xi, yi, lbl, col in zip(crit_x, crit_y, crit_labels, crit_colors):
ax.scatter(xi, yi, color=col, s=70, zorder=5, ec="white", lw=1.3)
ax.annotate(lbl, (xi, yi), textcoords="offset points",
xytext=(6, 8), fontsize=8, color=col)
ax.set_title("Sign of $f'(x)$ — Where Does $f$ Rise or Fall?")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.set_xticks([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 5*np.pi/2, 3*np.pi])
ax.set_xticklabels(["0", "$\\pi/2$", "$\\pi$", "$3\\pi/2$", "$2\\pi$", "$5\\pi/2$", "$3\\pi$"],
fontsize=8)
ax.legend(fontsize=8, loc="lower left")
# ── Panel 3: f(x) and f'(x) side-by-side — slope = derivative value ──────
ax = axes[2]
x_p = np.linspace(-2.5, 2.5, 500)
p = lambda x: x**3 / 3 - x # chosen so f' = x^2 - 1 (clear zeros)
dp = lambda x: x**2 - 1
line1, = ax.plot(x_p, p(x_p), color="#132440", lw=2.5, label="$f(x) = x^3/3 - x$")
ax2_twin = ax.twinx()
line2, = ax2_twin.plot(x_p, dp(x_p), color="#3B9797", lw=2.0, ls="-.",
label="$f'(x) = x^2 - 1$")
ax2_twin.axhline(0, color="#3B9797", lw=0.6, ls=":", alpha=0.4)
# Annotate where f'=0 → extrema
for xi in [-1.0, 1.0]:
ax.axvline(xi, color="#BF092F", lw=1.0, ls=":", alpha=0.6)
ax.scatter(xi, p(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)
ax2_twin.scatter(xi, dp(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)
ax.set_title("$f$ vs $f'$ — Slope of $f$ equals value of $f'$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$", color="#132440")
ax2_twin.set_ylabel("$f'(x)$", color="#3B9797")
lines = [line1, line2]
ax.legend(lines, [l.get_label() for l in lines], fontsize=8, loc="upper center")
plt.tight_layout()
plt.show()
# ── Console: convergence of secant slope to true derivative ───────────────
print(f"Convergence of secant slope to f'({x0}) = {f_prime(x0):.1f} [f(x)=x²]")
print(f"{'h':>10} {'Secant slope':>14} {'Error':>10}")
print("-" * 40)
for h in [1.0, 0.5, 0.1, 0.01, 0.001, 1e-6]:
sec = (f(x0 + h) - f(x0)) / h
err = abs(sec - f_prime(x0))
print(f"{h:>10.6f} {sec:>14.8f} {err:>10.2e}")
Power Rule
The power rule is the most frequently used differentiation rule:
$$\frac{d}{dx} x^n = n x^{n-1}$$
Examples: $\frac{d}{dx} x^3 = 3x^2$; $\frac{d}{dx} x^{1/2} = \frac{1}{2}x^{-1/2}$; $\frac{d}{dx} x^{-1} = -x^{-2}$; $\frac{d}{dx} 5 = 0$ (constants vanish).
Companion rules: $\frac{d}{dx}[cf(x)] = c f'(x)$ (constant multiple) and $\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)$ (sum rule).
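A minimal numerical check of the power rule and its companion rules, reusing the central-difference helper from above at an illustrative point $x = 2$:
import numpy as np
def numerical_deriv(f, x, h=1e-6):
    """Central difference derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)
x0 = 2.0
checks = [
    ("x^3 -> 3x^2",             lambda x: x**3,          lambda x: 3 * x**2),
    ("x^(1/2) -> x^(-1/2)/2",   lambda x: x**0.5,        lambda x: 0.5 * x**-0.5),
    ("5x^3 (constant multiple)", lambda x: 5 * x**3,     lambda x: 5 * 3 * x**2),
    ("x^3 + x^2 (sum rule)",    lambda x: x**3 + x**2,   lambda x: 3 * x**2 + 2 * x),
]
for name, f, df in checks:
    num, ana = numerical_deriv(f, x0), df(x0)
    print(f"{name:26s} numerical={num:10.6f} analytical={ana:10.6f}")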
Second Derivatives
The second derivative $f''(x)$ measures the rate of change of the slope — the curvature of the function:
$$f''(x) = \frac{d}{dx}\left[f'(x)\right] = \frac{d^2 f}{dx^2}$$
Second derivatives classify critical points where $f'(x^*) = 0$:
- $f''(x^*) \gt 0$: local minimum (concave up — function curves like a bowl)
- $f''(x^*) \lt 0$: local maximum (concave down — function curves like a hill)
- $f''(x^*) = 0$: inconclusive — may be saddle point or inflection point
The visualization below plots $f(x) = x^3 - 3x$ alongside its first and second derivatives, showing how the sign of $f''$ at each critical point determines concavity:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="muted")
# ── Function and its derivatives ──────────────────────────────────────────
def f(x): return x**3 - 3*x # local max at x=-1, local min at x=1
def f_prime(x): return 3*x**2 - 3 # zero at x = ±1
def f_double(x): return 6*x # sign determines concavity
x = np.linspace(-2.5, 2.5, 500)
# Critical points where f'(x*) = 0
x_crit = np.array([-1.0, 1.0])
y_crit = f(x_crit)
y2_crit = f_double(x_crit)
labels = ["Local Max\n$f''(-1) = -6 < 0$", "Local Min\n$f''(1) = +6 > 0$"]
colors = ["#BF092F", "#16476A"] # crimson = max, navy-blue = min
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=False)
fig.suptitle(
"Second Derivative Test: Curvature & Critical Point Classification",
fontsize=13, fontweight="bold", y=1.02
)
# ── Panel 1: f(x) with concavity shading ──────────────────────────────────
ax = axes[0]
ax.plot(x, f(x), color="#132440", lw=2.5, label="$f(x) = x^3 - 3x$")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) > 0),
color="#16476A", label="Concave up ($f'' > 0$)")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) < 0),
color="#BF092F", label="Concave down ($f'' < 0$)")
ax.axhline(0, color="gray", lw=0.8, ls="--", alpha=0.5)
ax.axvline(0, color="gray", lw=0.8, ls=":", alpha=0.4, label="Inflection ($x=0$)")
for xi, yi, lbl, col in zip(x_crit, y_crit, labels, colors):
ax.scatter(xi, yi, zorder=5, color=col, s=90, ec="white", lw=1.5)
ax.annotate(lbl, (xi, yi), textcoords="offset points",
xytext=(14, 8), fontsize=8.5, color=col,
arrowprops=dict(arrowstyle="->", color=col, lw=1.2))
ax.set_title("$f(x) = x^3 - 3x$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=8, loc="upper left")
ax.set_ylim(-3.5, 3.5)
# ── Panel 2: f'(x) — slope (first derivative) ─────────────────────────────
ax = axes[1]
ax.plot(x, f_prime(x), color="#3B9797", lw=2.5, label="$f'(x) = 3x^2 - 3$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) > 0),
alpha=0.10, color="#3B9797", label="$f' > 0$ (increasing)")
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) < 0),
alpha=0.10, color="#BF092F", label="$f' < 0$ (decreasing)")
for xi, col in zip(x_crit, colors):
ax.scatter(xi, 0, zorder=5, color=col, s=90, ec="white", lw=1.5)
ax.axvline(xi, color=col, lw=1.0, ls=":", alpha=0.6)
ax.set_title("First Derivative $f'(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f'(x)$")
ax.legend(fontsize=8, loc="upper center")
# ── Panel 3: f''(x) — curvature (second derivative) ───────────────────────
ax = axes[2]
ax.plot(x, f_double(x), color="#B8860B", lw=2.5, label="$f''(x) = 6x$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_double(x), 0, where=(f_double(x) > 0),
alpha=0.15, color="#16476A", label="$f'' > 0$ → concave up")
ax.fill_between(x, f_double(x), 0, where=(f_double(x) < 0),
alpha=0.15, color="#BF092F", label="$f'' < 0$ → concave down")
for xi, y2, lbl, col in zip(x_crit, y2_crit, labels, colors):
ax.scatter(xi, y2, zorder=5, color=col, s=90, ec="white", lw=1.5)
ax.annotate(f"$f''({xi:.0f}) = {y2:.0f}$", (xi, y2),
textcoords="offset points", xytext=(8, 10),
fontsize=8.5, color=col,
arrowprops=dict(arrowstyle="->", color=col, lw=1.2))
ax.set_title("Second Derivative $f''(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f''(x)$")
ax.legend(fontsize=8, loc="upper left")
plt.tight_layout()
plt.show()
# ── Console summary ────────────────────────────────────────────────────────
print("Second Derivative Test — Critical Point Classification")
print("=" * 54)
for xi, yi, y2 in zip(x_crit, y_crit, y2_crit):
sign = ">" if y2 > 0 else "<"
kind = "Local Minimum (concave up ↑)" if y2 > 0 else "Local Maximum (concave down ↓)"
print(f" x = {xi:+.1f}: f(x) = {yi:+.1f}, f''(x) = {y2:+.0f} {sign} 0 → {kind}")
Special Derivatives Table
Core derivatives every ML practitioner should know:
| $f(x)$ | $f'(x)$ | ML Relevance |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Polynomial loss surfaces |
| $e^x$ | $e^x$ | Softmax numerator; unchanged by differentiation |
| $\ln x$ | $1/x$ | Cross-entropy: $-\frac{d}{d\hat{y}}\log\hat{y} = -1/\hat{y}$ |
| $\sin x$ | $\cos x$ | Positional encoding (Transformer) |
| $\cos x$ | $-\sin x$ | Positional encoding (Transformer) |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation; max gradient = 0.25 → vanishing gradient risk |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | LSTM/GRU gates; gradient $\in (-1,1)$ |
| $\text{ReLU}(x) = \max(0,x)$ | $0$ if $x \lt 0$, $1$ if $x \gt 0$ (undefined at $x=0$) | Default hidden activation; dead neurons when inputs stay negative |
import numpy as np
def numerical_deriv(f, x, h=1e-6):
"""Central difference derivative — O(h^2) accuracy."""
return (f(x + h) - f(x - h)) / (2 * h)
test_x = 1.0
functions = {
"e^x": (np.exp, lambda x: np.exp(x)),
"ln(x)": (np.log, lambda x: 1/x),
"sin(x)": (np.sin, np.cos),
"cos(x)": (np.cos, lambda x: -np.sin(x)),
"sigma(x)": (lambda x: 1/(1+np.exp(-x)), lambda x: (s := 1/(1+np.exp(-x))) * (1 - s)),
"tanh(x)": (np.tanh, lambda x: 1 - np.tanh(x)**2),
"ReLU(x)": (lambda x: np.maximum(0, x), lambda x: float(x > 0)),
}
print(f"{'Function':12s} {'Numerical':12s} {'Analytical':12s} {'Match':5s}")
print("-" * 52)
for name, (f, df) in functions.items():
num = numerical_deriv(f, test_x)
ana = df(test_x)
match = "✓" if abs(num - ana) < 1e-5 else "✗"
print(f"{name:12s} {num:12.6f} {ana:12.6f} {match}")
Differentiation Rules
Three rules — chain, product, quotient — unlock the derivatives of all composite and combined functions. Every backpropagation computation ultimately applies the chain rule.
Chain Rule
For a composition $y = f(g(x))$, the chain rule states:
$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Or equivalently: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$.
In a framework such as PyTorch, every loss.backward() call is the chain rule running in reverse (reverse-mode automatic differentiation through the computation graph).
import numpy as np
def numerical_deriv(f, x, h=1e-6):
return (f(x + h) - f(x - h)) / (2 * h)
# Chain rule: d/dx [sin(x^2)] = cos(x^2) * 2x
f_comp = lambda x: np.sin(x**2)
df_chain = lambda x: np.cos(x**2) * 2*x # outer' * inner'
print("Chain Rule: d/dx [sin(x^2)] = cos(x^2) * 2x")
for x in [0.5, 1.0, 2.0]:
num = numerical_deriv(f_comp, x)
ana = df_chain(x)
print(f" x={x}: numerical={num:.6f}, chain rule={ana:.6f}")
# Chain rule in a single neuron: y = sigma(wx + b), dy/dw = sigma'(z)*x
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
w, b, x = 0.5, 0.1, 2.0
z = w * x + b
dy_dw_chain = sigmoid(z) * (1 - sigmoid(z)) * x
dy_dw_numerical = numerical_deriv(lambda w_: sigmoid(w_*x + b), w)
print(f"\nNeuron y=sigma(wx+b), dy/dw at w={w}:")
print(f" Chain rule: {dy_dw_chain:.6f}")
print(f" Numerical: {dy_dw_numerical:.6f}")
Product Rule
For $y = u(x) \cdot v(x)$, the product rule states:
$$\frac{dy}{dx} = u'(x)\,v(x) + u(x)\,v'(x)$$
Mnemonic: "derivative of first times second, plus first times derivative of second."
Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.
import numpy as np
def numerical_deriv(f, x, h=1e-6):
return (f(x + h) - f(x - h)) / (2 * h)
# Product rule: d/dx [x^2 * sin(x)]
u = lambda x: x**2; up = lambda x: 2*x
v = lambda x: np.sin(x); vp = lambda x: np.cos(x)
f = lambda x: u(x) * v(x)
df_product = lambda x: up(x)*v(x) + u(x)*vp(x) # product rule
print("Product Rule: d/dx [x^2 * sin(x)] = 2x*sin(x) + x^2*cos(x)")
for x in [1.0, 2.0, np.pi/3]:
num = numerical_deriv(f, x)
ana = df_product(x)
print(f" x={x:.4f}: numerical={num:.6f}, product rule={ana:.6f}")
Quotient Rule
For $y = u(x)\,/\,v(x)$, the quotient rule states:
$$\frac{dy}{dx} = \frac{u'(x)\,v(x) - u(x)\,v'(x)}{v(x)^2}$$
Mnemonic: "lo d-hi minus hi d-lo, over lo-squared" (hi = $u$, lo = $v$).
Classic example — derivative of $\tan x = \sin x\,/\,\cos x$:
$$\frac{d}{dx}\tan x = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \sec^2 x$$
ML application: the softmax $\hat{p}_k = e^{z_k}/\sum_j e^{z_j}$ differentiated w.r.t. $z_k$ uses the quotient rule, yielding $\hat{p}_k(1-\hat{p}_k)$ for the diagonal Jacobian entries.
import numpy as np
def numerical_deriv(f, x, h=1e-6):
return (f(x + h) - f(x - h)) / (2 * h)
# Quotient rule: d/dx [sin(x) / (1 + x^2)]
u = lambda x: np.sin(x); up = lambda x: np.cos(x)
v = lambda x: 1 + x**2; vp = lambda x: 2*x
f = lambda x: u(x) / v(x)
df_quotient = lambda x: (up(x)*v(x) - u(x)*vp(x)) / v(x)**2
print("Quotient Rule: d/dx [sin(x)/(1+x^2)]")
for x in [0.0, 1.0, 2.0]:
num = numerical_deriv(f, x)
ana = df_quotient(x)
print(f" x={x:.1f}: numerical={num:.6f}, quotient rule={ana:.6f}")
# Classic: d/dx [tan(x)] = sec^2(x)
f_tan = np.tan
df_tan = lambda x: 1.0 / np.cos(x)**2
print("\nQuotient Rule: d/dx [tan(x)] = sec^2(x)")
for x in [0.0, 0.5, 1.0]:
num = numerical_deriv(f_tan, x)
ana = df_tan(x)
print(f" x={x:.1f}: numerical={num:.6f}, sec^2(x)={ana:.6f}")
Partial Derivatives & Gradients
For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative $\frac{\partial f}{\partial x_i}$ measures how $f$ changes when we vary $x_i$ while holding all other variables fixed.
The gradient $\nabla f$ is the vector of all partial derivatives:
$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]^\top$$
Key property: $\nabla f(\mathbf{x})$ points in the direction of steepest ascent of $f$ at $\mathbf{x}$. Its negation $-\nabla f(\mathbf{x})$ points toward steepest descent — the direction gradient descent follows.
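A minimal sketch of both ideas, using the illustrative function $f(x, y) = x^2 + 3xy$ (chosen here for the example): partial derivatives by central differences, and the gradient as the direction of steepest ascent.
import numpy as np
def f(v):
    """Example scalar function f(x, y) = x^2 + 3xy."""
    x, y = v
    return x**2 + 3 * x * y
def numerical_gradient(f, v, h=1e-6):
    """Central-difference partial derivative for each coordinate, others held fixed."""
    grad = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        e = np.zeros_like(v, dtype=float)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad
v0 = np.array([1.0, 2.0])
g = numerical_gradient(f, v0)
print(f"Numerical gradient at (1, 2): {g}")          # expect [2x + 3y, 3x] = [8, 3]
print(f"Analytical gradient:          {[2*1 + 3*2, 3*1]}")
# A small step along +gradient increases f the most; along -gradient it decreases
step = 1e-3
unit_g = g / np.linalg.norm(g)
print(f"f(v0)                 = {f(v0):.6f}")
print(f"f(v0 + step*gradient) = {f(v0 + step * unit_g):.6f}  (ascent)")
print(f"f(v0 - step*gradient) = {f(v0 - step * unit_g):.6f}  (descent)")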
Jacobian & Hessian
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J \in \mathbb{R}^{m \times n}$ is the matrix of all first-order partial derivatives:
$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$
The Hessian $H \in \mathbb{R}^{n \times n}$ is the matrix of all second-order partial derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
The Hessian encodes the local curvature of the loss landscape. Positive definite $H$ means local convexity (local minimum). In practice, Hessians are too large to compute for deep networks ($O(n^2)$ entries), so second-order methods like L-BFGS use approximations.
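A minimal sketch, reusing the illustrative $f(x, y) = x^2 + 3xy$ from above: build the Hessian by finite differences and inspect its eigenvalues (all positive would indicate local convexity; mixed signs indicate saddle-like curvature).
import numpy as np
def f(v):
    x, y = v
    return x**2 + 3 * x * y
def numerical_hessian(f, v, h=1e-4):
    """Second-order central differences: H[i, j] = d^2 f / (dx_i dx_j)."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = h, h
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * h**2)
    return H
H = numerical_hessian(f, np.array([1.0, 2.0]))
print("Numerical Hessian:\n", np.round(H, 4))   # expect [[2, 3], [3, 0]]
print("Eigenvalues:", np.round(np.linalg.eigvalsh(H), 4))  # mixed signs -> indefinite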
Gradient Descent
Gradient descent iteratively minimizes a differentiable function $\mathcal{L}(\theta)$ by moving parameters in the direction of steepest descent:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)$$
Where $\alpha \gt 0$ is the learning rate. Key variants:
- Batch gradient descent: gradient over entire dataset — stable but slow for large $N$
- Stochastic gradient descent (SGD): gradient over a single random sample — noisy but fast updates
- Mini-batch SGD: gradient over a batch of $B$ samples — balances speed and stability; standard in practice
import numpy as np
def gradient_descent(f, grad_f, x_init, lr=0.1, max_iters=100):
"""
Minimize f using gradient descent.
f: objective function
grad_f: gradient function
x_init: starting point (scalar or array)
"""
x = np.array(x_init, dtype=float)
history = [x.copy()]
for _ in range(max_iters):
g = grad_f(x)
x = x - lr * g
history.append(x.copy())
return x, history
# Example: minimize f(x) = (x - 3)^2
f = lambda x: (x - 3)**2
grad_f = lambda x: 2 * (x - 3)
x_opt, history = gradient_descent(f, grad_f, x_init=-2.0, lr=0.1, max_iters=50)
print(f"Gradient Descent on f(x) = (x-3)^2:")
print(f" Start: x = -2.0, f = {f(-2.0):.4f}")
print(f" Optimal: x = {x_opt:.6f}, f = {f(x_opt):.8f}")
print(f" True min: x = 3.0, f = 0.0")
# Show convergence trajectory (first 10 steps)
print("\nFirst 10 steps:")
for i, x_i in enumerate(history[:11]):
print(f" iter {i:2d}: x = {float(x_i):8.5f}, f(x) = {f(float(x_i)):.6f}")
Modern Optimizers
From SGD to Adam: A Progression
Momentum accumulates a velocity vector in directions of persistent gradient: $$v_{t+1} = \beta v_t + (1-\beta)\nabla_\theta \mathcal{L} \quad \theta_{t+1} = \theta_t - \alpha v_{t+1}$$
RMSProp adapts learning rate per-parameter using exponential moving average of squared gradients: $$s_{t+1} = \beta s_t + (1-\beta)(\nabla\mathcal{L})^2 \quad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_{t+1}+\epsilon}}\nabla\mathcal{L}$$
Adam combines momentum + RMSProp with bias correction:
$$m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla\mathcal{L}$$ $$v_{t+1} = \beta_2 v_t + (1-\beta_2)(\nabla\mathcal{L})^2$$ $$\hat{m} = \frac{m_{t+1}}{1-\beta_1^{t+1}}, \quad \hat{v} = \frac{v_{t+1}}{1-\beta_2^{t+1}}$$ $$\theta_{t+1} = \theta_t - \frac{\alpha\hat{m}}{\sqrt{\hat{v}}+\epsilon}$$
Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\alpha=10^{-3}$. Adam is the default optimizer for most deep learning applications.
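A minimal sketch comparing the three update rules above on the same 1-D quadratic used earlier, $f(x) = (x-3)^2$ (hyperparameters below are illustrative, not tuned):
import numpy as np
f = lambda x: (x - 3)**2
grad = lambda x: 2 * (x - 3)
def run_sgd(lr=0.1, steps=200):
    x = -2.0
    for _ in range(steps):
        x -= lr * grad(x)
    return x
def run_momentum(lr=0.1, beta=0.9, steps=200):
    x, v = -2.0, 0.0
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(x)     # velocity: EMA of gradients
        x -= lr * v
    return x
def run_adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    x, m, v = -2.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g               # first moment
        v = b2 * v + (1 - b2) * g**2            # second moment
        m_hat = m / (1 - b1**t)                 # bias correction
        v_hat = v / (1 - b2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x
for name, runner in [("SGD", run_sgd), ("Momentum", run_momentum), ("Adam", run_adam)]:
    x_final = runner()
    print(f"{name:9s}: x = {x_final:.5f}, f(x) = {f(x_final):.2e}  (minimum at x = 3)")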
Backpropagation
Backpropagation efficiently computes $\nabla_\theta \mathcal{L}$ for all parameters in a neural network using the chain rule. For a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$:
$$\frac{\partial \mathcal{L}}{\partial f_1} = \frac{\partial \mathcal{L}}{\partial f_L} \cdot \frac{\partial f_L}{\partial f_{L-1}} \cdots \frac{\partial f_2}{\partial f_1}$$
Each factor is computed once and reused as the gradient flows backward through the layers.
flowchart LR
X["Input x"] -->|"W1, b1"| H["Hidden h = σ(W1x + b1)"]
H -->|"W2, b2"| Y["Output ŷ = W2h + b2"]
Y --> L["Loss L(ŷ, y)"]
L -->|"∂L/∂ŷ = backward pass"| Y
Y -->|"∂L/∂W2 = (∂L/∂ŷ) h^T"| W2["∇W2"]
Y -->|"∂L/∂h = W2^T (∂L/∂ŷ)"| H
H -->|"∂L/∂W1 = (∂L/∂h ⊙ σ') x^T"| W1["∇W1"]
import numpy as np
# Implement backpropagation for a 1-layer network on a simple regression problem
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
def sigmoid_deriv(z):
s = sigmoid(z)
return s * (1 - s)
np.random.seed(0)
# Tiny dataset: XOR-like
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float) # (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float) # (4, 1)
# Initialize weights
W1 = np.random.randn(2, 4) * 0.5 # (2, 4) — input to hidden
b1 = np.zeros((1, 4)) # (1, 4)
W2 = np.random.randn(4, 1) * 0.5 # (4, 1) — hidden to output
b2 = np.zeros((1, 1)) # (1, 1)
lr = 0.5
losses = []
for epoch in range(3000):
# Forward pass
z1 = X @ W1 + b1 # (4, 4)
a1 = sigmoid(z1) # (4, 4)
z2 = a1 @ W2 + b2 # (4, 1)
a2 = sigmoid(z2) # (4, 1) — output
# MSE loss
loss = np.mean((a2 - y)**2)
losses.append(loss)
# Backward pass (chain rule)
dL_da2 = 2 * (a2 - y) / len(y) # (4, 1)
dL_dz2 = dL_da2 * sigmoid_deriv(z2) # (4, 1)
dL_dW2 = a1.T @ dL_dz2 # (4, 1)
dL_db2 = dL_dz2.sum(axis=0, keepdims=True)
dL_da1 = dL_dz2 @ W2.T # (4, 4)
dL_dz1 = dL_da1 * sigmoid_deriv(z1) # (4, 4)
dL_dW1 = X.T @ dL_dz1 # (2, 4)
dL_db1 = dL_dz1.sum(axis=0, keepdims=True)
# Gradient updates
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1
print("Backprop training on XOR:")
print(f" Epoch 0: loss = {losses[0]:.6f}")
print(f" Epoch 500: loss = {losses[500]:.6f}")
print(f" Epoch 2999: loss = {losses[2999]:.6f}")
print(f"\nFinal predictions (should be ~0, 1, 1, 0):")
print(np.round(a2, 3).flatten())
Convexity & Loss Landscapes
A function $f$ is convex if for all $\mathbf{x}, \mathbf{y}$ and $t \in [0,1]$:
$$f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t\,f(\mathbf{x}) + (1-t)\,f(\mathbf{y})$$
For convex functions, any local minimum is the global minimum — gradient descent (with a suitable learning rate) is guaranteed to converge. Linear regression's MSE loss is convex. Cross-entropy loss for logistic regression is convex in the weights. Deep-network loss landscapes, by contrast, are non-convex, but several empirical observations make optimization workable in practice:
- Saddle points are more common than local minima in high dimensions — gradient descent can stall but usually escapes
- Good local minima in overparameterized networks tend to generalize well (not obvious why — active research)
- Learning rate matters enormously: too large diverges, too small is slow; warm-up + decay schedules help
- Batch normalization smooths the loss landscape, improving gradient flow and enabling larger learning rates
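A minimal numerical sketch of the convexity inequality: MSE in the weight of a 1-parameter linear model satisfies it, while a small sinusoidal "loss" (an illustrative stand-in for a non-convex landscape) does not.
import numpy as np
rng = np.random.default_rng(1)
X = rng.normal(size=50)
y = 2.0 * X + rng.normal(0, 0.1, size=50)
def mse(w):
    """Convex in w: mean squared error of y_hat = w*x."""
    return np.mean((w * X - y)**2)
def wiggly(w):
    """Non-convex toy 'loss' for contrast."""
    return np.sin(3 * w) + 0.1 * w**2
def check_convexity(f, lo=-4, hi=4, trials=2000):
    """Test f(t*a + (1-t)*b) <= t*f(a) + (1-t)*f(b) on random point pairs."""
    for _ in range(trials):
        a, b = rng.uniform(lo, hi, size=2)
        t = rng.uniform()
        if f(t*a + (1-t)*b) > t*f(a) + (1-t)*f(b) + 1e-9:
            return False
    return True
print(f"MSE loss passes convexity check:        {check_convexity(mse)}")
print(f"Sinusoidal loss passes convexity check: {check_convexity(wiggly)}")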
Integration
Integration is the inverse of differentiation — it accumulates infinitesimal changes to compute totals. In ML, integrals appear in expected values, normalizing constants, and the probabilistic foundations underlying many algorithms.
Introduction to Integration
The antiderivative of $f(x)$ is a function $F(x)$ such that $F'(x) = f(x)$. Integration reverses differentiation:
$$\int f(x)\,dx = F(x) + C$$
where $C$ is the constant of integration (arbitrary, since constants vanish under differentiation). This is the indefinite integral.
Indefinite Integration I — Basic Rules
| Rule | Formula | Example |
|---|---|---|
| Power rule | $\int x^n\,dx = \frac{x^{n+1}}{n+1} + C\;(n \neq -1)$ | $\int x^4\,dx = \frac{x^5}{5} + C$ |
| Constant | $\int k\,dx = kx + C$ | $\int 5\,dx = 5x + C$ |
| Sum/Difference | $\int [f \pm g]\,dx = \int f\,dx \pm \int g\,dx$ | $\int (x^2 + 3)\,dx = \frac{x^3}{3} + 3x + C$ |
| Constant multiple | $\int kf(x)\,dx = k\int f(x)\,dx$ | $\int 4x^3\,dx = x^4 + C$ |
| Reciprocal | $\int \frac{1}{x}\,dx = \ln|x| + C$ | Arises in cross-entropy gradient |
import numpy as np
from scipy import integrate
# Power rule: integral of x^4 from 0 to 1 = 1/5 = 0.2
result, _ = integrate.quad(lambda x: x**4, 0, 1)
print(f"integral x^4 from 0 to 1")
print(f" Numerical: {result:.8f}")
print(f" Analytical (1/5): {0.2:.8f}")
# Sum rule: integral of (x^2 + 3) from 0 to 2 = x^3/3 + 3x from 0 to 2
# = (8/3 + 6) - 0 = 26/3
result2, _ = integrate.quad(lambda x: x**2 + 3, 0, 2)
print(f"\nintegral (x^2+3) from 0 to 2")
print(f" Numerical: {result2:.8f}")
print(f" Analytical (26/3): {26/3:.8f}")
# Reciprocal: integral 1/x from 1 to e = ln(e) - ln(1) = 1
result3, _ = integrate.quad(lambda x: 1/x, 1, np.e)
print(f"\nintegral 1/x from 1 to e")
print(f" Numerical: {result3:.8f}")
print(f" Analytical (ln e = 1): {1.0:.8f}")
Indefinite Integration II — Special Functions
Antiderivatives of exponential, logarithmic, and trigonometric functions appearing in ML:
| $f(x)$ | Antiderivative $F(x) + C$ | ML Use |
|---|---|---|
| $e^x$ | $e^x + C$ | Softmax; exponential smoothing |
| $e^{ax}$ | $\frac{1}{a}e^{ax} + C$ | Laplace distribution log-likelihood |
| $\ln x$ | $x\ln x - x + C$ | Entropy; integration by parts result |
| $\sin x$ | $-\cos x + C$ | Positional encoding; Fourier features |
| $\cos x$ | $\sin x + C$ | Positional encoding; Fourier features |
| $\frac{1}{1+x^2}$ | $\arctan x + C$ | Cauchy distribution; RKHS kernels |
| $e^{-x^2/2}$ | $\sqrt{2\pi}\,\Phi(x) + C$ | Gaussian normalization; GELU activation |
import numpy as np
from scipy import integrate
# e^x: integral from 0 to 2 = e^2 - e^0 = e^2 - 1
result, _ = integrate.quad(np.exp, 0, 2)
print(f"integral e^x from 0 to 2")
print(f" Numerical: {result:.8f}")
print(f" Analytical (e^2 - 1): {np.e**2 - 1:.8f}")
# sin(x): integral from 0 to pi = [-cos(x)] from 0 to pi = 1 + 1 = 2
result2, _ = integrate.quad(np.sin, 0, np.pi)
print(f"\nintegral sin(x) from 0 to pi")
print(f" Numerical: {result2:.8f}")
print(f" Analytical: {2.0:.8f}")
# ln(x): integral from 1 to e = [x*ln(x) - x] from 1 to e = (e - e) - (0 - 1) = 1
result3, _ = integrate.quad(np.log, 1, np.e)
print(f"\nintegral ln(x) from 1 to e")
print(f" Numerical: {result3:.8f}")
print(f" Analytical: {1.0:.8f}")
# Gaussian normalization: integral of N(0,1) from -inf to +inf = 1
result4, _ = integrate.quad(lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi), -np.inf, np.inf)
print(f"\nintegral of N(0,1) pdf from -inf to +inf")
print(f" Numerical: {result4:.8f}")
print(f" Analytical: {1.0:.8f}")
Definite Integration I — Fundamental Theorem
The Fundamental Theorem of Calculus links differentiation and integration. Part 2 (evaluation formula):
$$\int_a^b f(x)\,dx = F(b) - F(a)$$
where $F$ is any antiderivative of $f$. The constant $C$ cancels, so any antiderivative works.
Evaluating $\int_1^4 (3x^2 - 2x + 1)\,dx$
Step 1 — Find antiderivative: $F(x) = x^3 - x^2 + x$
Step 2 — Evaluate: $F(4) - F(1) = (64 - 16 + 4) - (1 - 1 + 1) = 52 - 1 = 51$
import numpy as np
from scipy import integrate
# Fundamental Theorem: integral of (3x^2 - 2x + 1) from 1 to 4
f = lambda x: 3*x**2 - 2*x + 1
F = lambda x: x**3 - x**2 + x # antiderivative
a, b = 1, 4
ftc_result = F(b) - F(a)
print(f"FTC: F({b}) - F({a}) = {F(b)} - {F(a)} = {ftc_result}")
# Numerical quadrature confirms the result
numerical, error = integrate.quad(f, a, b)
print(f"Numerical (scipy.integrate.quad): {numerical:.8f}")
print(f"Match: {abs(ftc_result - numerical) < 1e-9}")
# Properties of definite integrals
print("\nKey properties:")
# 1. Linearity
r1, _ = integrate.quad(lambda x: 2*x**2 + 3*np.sin(x), 0, np.pi)
r2 = 2*integrate.quad(lambda x: x**2, 0, np.pi)[0] + 3*integrate.quad(np.sin, 0, np.pi)[0]
print(f" Linearity: int(2x^2+3sin) = {r1:.6f}, split = {r2:.6f}, match = {abs(r1-r2)<1e-9}")
# 2. Reverse limits: int_a^b f = -int_b^a f
r3, _ = integrate.quad(f, a, b)
r4, _ = integrate.quad(f, b, a)
print(f" Reverse limits: int_1^4 = {r3:.4f}, int_4^1 = {r4:.4f}, sum = {r3+r4:.4f}")
# 3. ML use: E[X^2] for X ~ N(0,1) = 1 (the variance)
ev_x2, _ = integrate.quad(lambda x: x**2 * np.exp(-x**2/2)/np.sqrt(2*np.pi), -10, 10)
print(f" E[X^2] for X~N(0,1) = {ev_x2:.8f} (expected: 1.0 = variance)")
Definite Integration II — Substitution & Integration by Parts
Integration by substitution (reverse chain rule): for $\int f(g(x))g'(x)\,dx$, let $u = g(x)$:
$$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du$$
Example: $\int 2x\cos(x^2)\,dx$ — let $u = x^2$, $du = 2x\,dx$ → $\int \cos u\,du = \sin u + C = \sin(x^2) + C$.
Integration by parts (reverse product rule):
$$\int u\,dv = uv - \int v\,du$$
Example: $\int x e^x\,dx$ — let $u = x$, $dv = e^x\,dx$ → $xe^x - \int e^x\,dx = e^x(x-1) + C$.
import numpy as np
from scipy import integrate
# Substitution: integral of 2x*cos(x^2) from 0 to sqrt(pi)
# Antiderivative = sin(x^2), so F(sqrt(pi)) - F(0) = sin(pi) - sin(0) = 0
f_sub = lambda x: 2 * x * np.cos(x**2)
result_sub, _ = integrate.quad(f_sub, 0, np.sqrt(np.pi))
analytical_sub = np.sin(np.pi) - np.sin(0)
print(f"Substitution: integral 2x*cos(x^2) from 0 to sqrt(pi)")
print(f" Numerical: {result_sub:.8f}")
print(f" Analytical: {analytical_sub:.8f}")
# Integration by parts: integral of x*e^x from 0 to 1
# Antiderivative: e^x(x-1), F(1)-F(0) = 0 - (-1) = 1
f_ibp = lambda x: x * np.exp(x)
result_ibp, _ = integrate.quad(f_ibp, 0, 1)
F_ibp = lambda x: np.exp(x) * (x - 1)
analytical_ibp = F_ibp(1) - F_ibp(0)
print(f"\nIntegration by parts: integral x*e^x from 0 to 1")
print(f" Numerical: {result_ibp:.8f}")
print(f" Analytical: {analytical_ibp:.8f}") # = 1.0
# ML use: expected value E[X] for X ~ Exponential(lambda=1)
# = integral of x * e^(-x) from 0 to inf = 1 (by integration by parts)
ev_exp, _ = integrate.quad(lambda x: x * np.exp(-x), 0, np.inf)
print(f"\nE[X] for X ~ Exp(1) (integral x*e^(-x) from 0 to inf):")
print(f" Numerical: {ev_exp:.8f} (expected: 1.0 = 1/lambda)")
Area Under a Curve
The definite integral gives signed area. For total (unsigned) area between a curve and the x-axis:
$$\text{Area} = \int_a^b |f(x)|\,dx$$
Area between two curves $f(x) \geq g(x)$ on $[a,b]$:
$$\text{Area} = \int_a^b \left[f(x) - g(x)\right]dx$$
import numpy as np
from scipy import integrate
# Area between g(x) = x and f(x) = x^2 on [0,1]
# x >= x^2 on [0,1], so area = integral(x - x^2) from 0 to 1
area, _ = integrate.quad(lambda x: x - x**2, 0, 1)
print(f"Area between g(x)=x and f(x)=x^2 on [0,1]:")
print(f" Numerical: {area:.8f}")
print(f" Analytical: {0.5 - 1/3:.8f}") # = 1/2 - 1/3 = 1/6
# Simulate AUC-ROC calculation (trapezoidal rule)
# Sorted scores and labels for a simple binary classifier
np.random.seed(42)
scores = np.random.beta(5, 2, 100) # biased positive scores
labels = (scores + np.random.normal(0, 0.3, 100) > 0.5).astype(int)
# Build ROC curve manually
thresholds = np.linspace(0, 1, 200)
tprs, fprs = [], []
pos_total = labels.sum()
neg_total = (1 - labels).sum()
for thresh in thresholds:
preds = (scores >= thresh).astype(int)
tp = ((preds == 1) & (labels == 1)).sum()
fp = ((preds == 1) & (labels == 0)).sum()
tprs.append(tp / pos_total)
fprs.append(fp / neg_total)
tprs, fprs = np.array(tprs), np.array(fprs)
# AUC via trapezoid rule (numpy)
auc_trapz = np.trapz(tprs[::-1], fprs[::-1]) # sort fpr ascending
print(f"\nAUC-ROC (trapezoidal rule): {auc_trapz:.4f}")
print(f" (1.0 = perfect, 0.5 = random)")
print(f" This is literally a definite integral computed numerically.")
Practice Exercises
Differentiate the Cross-Entropy Loss via Chain Rule
Let $\hat{y} = \sigma(w \cdot x + b)$ and $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$. Compute $\frac{\partial \mathcal{L}}{\partial w}$.
Answer
Chain rule: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$, where $z = wx + b$.
$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$
$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})$
Multiplying: $\frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y$
Then $\frac{\partial z}{\partial w} = x$. So: $\boxed{\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y) \cdot x}$. Clean result — the gradient is the prediction error times the input.
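A minimal numerical check of this result, with illustrative values for $x$, $y$, $w$, $b$:
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def loss(w, x, b, y):
    """Binary cross-entropy of y_hat = sigma(wx + b)."""
    y_hat = sigmoid(w * x + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)
# Illustrative values
x, y, w, b = 1.5, 1.0, 0.3, -0.2
y_hat = sigmoid(w * x + b)
analytical = (y_hat - y) * x                 # (y_hat - y) * x from the derivation above
h = 1e-6
numerical = (loss(w + h, x, b, y) - loss(w - h, x, b, y)) / (2 * h)
print(f"dL/dw analytical (y_hat - y)*x : {analytical:.8f}")
print(f"dL/dw numerical                : {numerical:.8f}")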
Learning Rate Sensitivity
For $f(x) = x^2$, derive the closed-form expression for $x_t$ after $t$ gradient descent steps starting from $x_0$. What values of learning rate $\alpha$ ensure convergence? What is the optimal $\alpha$?
Answer
$f'(x) = 2x$. Update: $x_{t+1} = x_t - 2\alpha x_t = (1-2\alpha)x_t$.
After $t$ steps: $x_t = (1-2\alpha)^t x_0$.
Converges to 0 iff $|1-2\alpha| \lt 1$, i.e., $0 \lt \alpha \lt 1$.
Fastest convergence: minimize $|1-2\alpha|$, achieved at $\alpha^* = 0.5$ giving $x_t = 0$ in one step. This corresponds to Newton's method for quadratics. For general smooth functions, the optimal step size is $\alpha = 1/L$ where $L$ is the Lipschitz constant of the gradient.
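A minimal sketch of this answer: run gradient descent on $f(x) = x^2$ with several learning rates and compare against the closed form $(1-2\alpha)^t x_0$.
import numpy as np
x0, steps = 5.0, 20
for alpha in [0.1, 0.5, 0.9, 1.0, 1.1]:
    x = x0
    for _ in range(steps):
        x -= alpha * 2 * x                       # gradient of x^2 is 2x
    closed_form = (1 - 2 * alpha)**steps * x0
    status = "converges" if abs(1 - 2 * alpha) < 1 else "diverges/oscillates"
    print(f"alpha={alpha:4.1f}: x_20 = {x: .3e}  closed form = {closed_form: .3e}  ({status})")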
Conclusion & Next Steps
Calculus is what makes neural networks trainable:
- Derivatives: measure instantaneous rate of change — the slope of the loss surface at any point
- Power, product, quotient rules: the mechanical toolkit for differentiating any function
- Chain rule: propagates gradients through compositions — the mathematical core of backpropagation
- Gradient descent: iteratively follows $-\nabla\mathcal{L}$ to minimize loss
- Hessian: encodes curvature — positive definite at a critical point means a local minimum (locally convex)
- Adam: adaptive learning rates with momentum — robust default optimizer
- Integration: computes expected values, normalizing constants, and the AUC-ROC metric — the probabilistic layer under ML
- Non-convexity: deep network landscapes have saddle points and many local minima
Next in the Series
In Part 9: ML-Specific Math, we apply everything learned so far directly to ML: loss functions, regularization, bias-variance tradeoff, MLE/MAP, activation function gradients, and batch normalization.