Back to Mathematics

Part 8: Calculus & Optimization

April 26, 2026 Wasil Zafar 50 min read

Every time you call loss.backward(), you're executing the chain rule of calculus across millions of parameters. Gradient descent is applied calculus. Understanding why it works — and when it doesn't — requires genuine understanding of derivatives, gradients, and the geometry of loss landscapes.

Table of Contents

  1. Derivatives
  2. Differentiation Rules
  3. Partial Derivatives & Gradients
  4. Jacobian & Hessian
  5. Gradient Descent
  6. Modern Optimizers
  7. Backpropagation
  8. Convexity & Loss Landscapes
  9. Integration
  10. Practice Exercises
  11. Conclusion & Next Steps

Derivatives

A derivative measures how a function changes as its input changes — the instantaneous rate of change. Derivatives are the mathematical engine behind optimization, neural network training, and gradient descent.

Gradient of a Straight Line

For a linear function $y = mx + c$, the gradient (slope) is constant throughout:

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x}$$

This is "rise over run." The gradient is the same regardless of which two points you pick — straight lines have a uniform rate of change. A positive gradient means the line rises left-to-right; negative means it falls; zero means horizontal.

Gradient of a Curve (Tangent)

Curves have a different gradient at each point. The gradient at a point on a curve $y = f(x)$ is the slope of the tangent line at that point. As we zoom in on any smooth curve, it locally looks like a straight line — the tangent.

Intuitively, the gradient of a curve at $x = a$ is the limit of the gradient of a secant line connecting $(a, f(a))$ and $(a+h, f(a+h))$ as $h \to 0$. This is exactly the definition of the derivative.

Differentiation from First Principles

The formal definition of the derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Worked example — differentiate $f(x) = x^2$ from first principles:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$
import numpy as np

def derivative_first_principles(f, x, h=1e-7):
    """Numerical derivative via first principles (forward difference)."""
    return (f(x + h) - f(x)) / h

def derivative_central(f, x, h=1e-6):
    """Central difference — more accurate (O(h^2) error vs O(h))."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Test on f(x) = x^2 — analytical derivative = 2x
f      = lambda x: x**2
grad_f = lambda x: 2*x

print("f(x) = x^2 — First Principles vs Analytical:")
for x in [0, 1, 2, 3]:
    forward_diff = derivative_first_principles(f, x)
    central_diff = derivative_central(f, x)
    analytical   = grad_f(x)
    print(f"  x={x}: forward={forward_diff:.6f}, central={central_diff:.6f}, analytical={analytical:.6f}")

# Test on f(x) = sin(x) — derivative = cos(x)
g = np.sin
print("\nf(x) = sin(x) — derivative = cos(x):")
for x in [0, np.pi/4, np.pi/2, np.pi]:
    num = derivative_central(g, x)
    ana = np.cos(x)
    print(f"  x={x:.4f}: numerical={num:.6f}, analytical={ana:.6f}")
Interactive: Secant Line → Tangent as h → 0
1.0
1.0

Drag x₀ to choose a point on f(x) = x². Decrease h toward 0 to watch the gold secant line converge to the crimson tangent — this is the limit definition of the derivative f'(x₀) = 2x₀.

The three panels below illustrate what the first derivative means geometrically — from the limit definition as a shrinking secant line, to its sign ruling where a function rises or falls, to its direct correspondence with slope:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")

f      = lambda x: x**2
f_prime = lambda x: 2 * x

x_full = np.linspace(-0.5, 2.5, 500)
x0 = 1.0          # point at which we visualise the limit
h_values = [1.0, 0.5, 0.2, 0.05]   # secant lines converging to tangent

# ── Tangent line at x0 ────────────────────────────────────────────────────
slope_at_x0 = f_prime(x0)
tangent = lambda x: f(x0) + slope_at_x0 * (x - x0)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle(
    "First Derivative — Geometric Significance",
    fontsize=13, fontweight="bold", y=1.02
)

# ── Panel 1: Limit definition — secant → tangent ─────────────────────────
ax = axes[0]
ax.plot(x_full, f(x_full), color="#132440", lw=2.5, label="$f(x) = x^2$", zorder=3)

cmap_colors = sns.color_palette("flare", len(h_values))
for h, col in zip(h_values, cmap_colors):
    x1, y1 = x0, f(x0)
    x2, y2 = x0 + h, f(x0 + h)
    secant_slope = (y2 - y1) / (x2 - x1)
    secant_line = lambda x, s=secant_slope: f(x0) + s * (x - x0)
    x_sec = np.linspace(x0 - 0.3, x0 + h + 0.2, 100)
    ax.plot(x_sec, secant_line(x_sec), color=col, lw=1.4, ls="--",
            label=f"h = {h:.2f}  (slope ≈ {secant_slope:.2f})")
    ax.scatter([x2], [y2], color=col, s=45, zorder=4)

# Tangent
x_tan = np.linspace(x0 - 0.6, x0 + 0.6, 100)
ax.plot(x_tan, tangent(x_tan), color="#BF092F", lw=2.2,
        label=f"Tangent h→0 (slope={slope_at_x0:.1f})", zorder=5)
ax.scatter([x0], [f(x0)], color="#BF092F", s=80, zorder=6, ec="white", lw=1.5)
ax.set_title("Limit Definition\nSecant lines → tangent as $h \\to 0$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=7.5, loc="upper left")
ax.set_xlim(-0.3, 2.4); ax.set_ylim(-0.3, 6)

# ── Panel 2: f(x) = sin(x) — sign of f'(x) drives rise/fall ─────────────
ax = axes[1]
g       = np.sin
g_prime = np.cos
x_trig  = np.linspace(0, 3 * np.pi, 600)

ax.plot(x_trig, g(x_trig), color="#132440", lw=2.5, label="$f(x) = \\sin x$")
ax.axhline(0, color="gray", lw=0.7, ls="--", alpha=0.5)

ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) > 0),
                alpha=0.18, color="#16476A", label="$f'(x) > 0$  (rising ↑)")
ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) < 0),
                alpha=0.18, color="#BF092F", label="$f'(x) < 0$  (falling ↓)")

# Mark zeros of f'
crit_x = np.array([np.pi/2, 3*np.pi/2, 5*np.pi/2])
crit_y = g(crit_x)
crit_labels = ["Max\n$f'=0$", "Min\n$f'=0$", "Max\n$f'=0$"]
crit_colors = ["#16476A", "#BF092F", "#16476A"]
for xi, yi, lbl, col in zip(crit_x, crit_y, crit_labels, crit_colors):
    ax.scatter(xi, yi, color=col, s=70, zorder=5, ec="white", lw=1.3)
    ax.annotate(lbl, (xi, yi), textcoords="offset points",
                xytext=(6, 8), fontsize=8, color=col)

ax.set_title("Sign of $f'(x)$ — Where Does $f$ Rise or Fall?")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.set_xticks([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 5*np.pi/2, 3*np.pi])
ax.set_xticklabels(["0", "$\\pi/2$", "$\\pi$", "$3\\pi/2$", "$2\\pi$", "$5\\pi/2$", "$3\\pi$"],
                   fontsize=8)
ax.legend(fontsize=8, loc="lower left")

# ── Panel 3: f(x) and f'(x) side-by-side — slope = derivative value ──────
ax = axes[2]
x_p = np.linspace(-2.5, 2.5, 500)
p   = lambda x: x**3 / 3 - x          # chosen so f' = x^2 - 1 (clear zeros)
dp  = lambda x: x**2 - 1

line1, = ax.plot(x_p, p(x_p), color="#132440", lw=2.5, label="$f(x) = x^3/3 - x$")
ax2_twin = ax.twinx()
line2, = ax2_twin.plot(x_p, dp(x_p), color="#3B9797", lw=2.0, ls="-.",
                       label="$f'(x) = x^2 - 1$")
ax2_twin.axhline(0, color="#3B9797", lw=0.6, ls=":", alpha=0.4)

# Annotate where f'=0 → extrema
for xi in [-1.0, 1.0]:
    ax.axvline(xi, color="#BF092F", lw=1.0, ls=":", alpha=0.6)
    ax.scatter(xi, p(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)
    ax2_twin.scatter(xi, dp(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)

ax.set_title("$f$ vs $f'$ — Slope of $f$ equals value of $f'$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$", color="#132440")
ax2_twin.set_ylabel("$f'(x)$", color="#3B9797")

lines = [line1, line2]
ax.legend(lines, [l.get_label() for l in lines], fontsize=8, loc="upper center")

plt.tight_layout()
plt.show()

# ── Console: convergence of secant slope to true derivative ───────────────
print(f"Convergence of secant slope to f'({x0}) = {f_prime(x0):.1f}  [f(x)=x²]")
print(f"{'h':>10}  {'Secant slope':>14}  {'Error':>10}")
print("-" * 40)
for h in [1.0, 0.5, 0.1, 0.01, 0.001, 1e-6]:
    sec = (f(x0 + h) - f(x0)) / h
    err = abs(sec - f_prime(x0))
    print(f"{h:>10.6f}  {sec:>14.8f}  {err:>10.2e}")

Power Rule

The power rule is the most frequently used differentiation rule:

$$\frac{d}{dx} x^n = nx^{n-1} \quad \text{for any real } n$$

Examples: $\frac{d}{dx} x^3 = 3x^2$; $\frac{d}{dx} x^{1/2} = \frac{1}{2}x^{-1/2}$; $\frac{d}{dx} x^{-1} = -x^{-2}$; $\frac{d}{dx} 5 = 0$ (constants vanish).

Companion rules: $\frac{d}{dx}[cf(x)] = c f'(x)$ (constant multiple) and $\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)$ (sum rule).

Second Derivatives

The second derivative $f''(x)$ measures the rate of change of the slope — the curvature of the function:

$$f''(x) = \frac{d^2f}{dx^2} = \frac{d}{dx}\!\left[\frac{df}{dx}\right]$$

Second derivatives classify critical points where $f'(x^*) = 0$:

  • $f''(x^*) \gt 0$: local minimum (concave up — function curves like a bowl)
  • $f''(x^*) \lt 0$: local maximum (concave down — function curves like a hill)
  • $f''(x^*) = 0$: inconclusive — may be saddle point or inflection point

The visualization below plots $f(x) = x^3 - 3x$ alongside its first and second derivatives, showing how the sign of $f''$ at each critical point determines concavity:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")

# ── Function and its derivatives ──────────────────────────────────────────
def f(x):        return x**3 - 3*x         # local max at x=-1, local min at x=1
def f_prime(x):  return 3*x**2 - 3         # zero at x = ±1
def f_double(x): return 6*x                # sign determines concavity

x = np.linspace(-2.5, 2.5, 500)

# Critical points where f'(x*) = 0
x_crit = np.array([-1.0, 1.0])
y_crit = f(x_crit)
y2_crit = f_double(x_crit)
labels = ["Local Max\n$f''(-1) = -6 < 0$", "Local Min\n$f''(1) = +6 > 0$"]
colors = ["#BF092F", "#16476A"]   # crimson = max, navy-blue = min

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=False)
fig.suptitle(
    "Second Derivative Test: Curvature & Critical Point Classification",
    fontsize=13, fontweight="bold", y=1.02
)

# ── Panel 1: f(x) with concavity shading ──────────────────────────────────
ax = axes[0]
ax.plot(x, f(x), color="#132440", lw=2.5, label="$f(x) = x^3 - 3x$")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) > 0),
                color="#16476A", label="Concave up  ($f'' > 0$)")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) < 0),
                color="#BF092F", label="Concave down ($f'' < 0$)")
ax.axhline(0, color="gray", lw=0.8, ls="--", alpha=0.5)
ax.axvline(0, color="gray", lw=0.8, ls=":", alpha=0.4, label="Inflection ($x=0$)")

for xi, yi, lbl, col in zip(x_crit, y_crit, labels, colors):
    ax.scatter(xi, yi, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.annotate(lbl, (xi, yi), textcoords="offset points",
                xytext=(14, 8), fontsize=8.5, color=col,
                arrowprops=dict(arrowstyle="->", color=col, lw=1.2))

ax.set_title("$f(x) = x^3 - 3x$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=8, loc="upper left")
ax.set_ylim(-3.5, 3.5)

# ── Panel 2: f'(x) — slope (first derivative) ─────────────────────────────
ax = axes[1]
ax.plot(x, f_prime(x), color="#3B9797", lw=2.5, label="$f'(x) = 3x^2 - 3$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) > 0),
                alpha=0.10, color="#3B9797", label="$f' > 0$ (increasing)")
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) < 0),
                alpha=0.10, color="#BF092F", label="$f' < 0$ (decreasing)")

for xi, col in zip(x_crit, colors):
    ax.scatter(xi, 0, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.axvline(xi, color=col, lw=1.0, ls=":", alpha=0.6)

ax.set_title("First Derivative $f'(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f'(x)$")
ax.legend(fontsize=8, loc="upper center")

# ── Panel 3: f''(x) — curvature (second derivative) ───────────────────────
ax = axes[2]
ax.plot(x, f_double(x), color="#B8860B", lw=2.5, label="$f''(x) = 6x$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_double(x), 0, where=(f_double(x) > 0),
                alpha=0.15, color="#16476A", label="$f'' > 0$ → concave up")
ax.fill_between(x, f_double(x), 0, where=(f_double(x) < 0),
                alpha=0.15, color="#BF092F", label="$f'' < 0$ → concave down")

for xi, y2, lbl, col in zip(x_crit, y2_crit, labels, colors):
    ax.scatter(xi, y2, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.annotate(f"$f''({xi:.0f}) = {y2:.0f}$", (xi, y2),
                textcoords="offset points", xytext=(8, 10),
                fontsize=8.5, color=col,
                arrowprops=dict(arrowstyle="->", color=col, lw=1.2))

ax.set_title("Second Derivative $f''(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f''(x)$")
ax.legend(fontsize=8, loc="upper left")

plt.tight_layout()
plt.show()

# ── Console summary ────────────────────────────────────────────────────────
print("Second Derivative Test — Critical Point Classification")
print("=" * 54)
for xi, yi, y2 in zip(x_crit, y_crit, y2_crit):
    sign  = ">" if y2 > 0 else "<"
    kind  = "Local Minimum  (concave up ↑)" if y2 > 0 else "Local Maximum  (concave down ↓)"
    print(f"  x = {xi:+.1f}:  f(x) = {yi:+.1f},  f''(x) = {y2:+.0f} {sign} 0  →  {kind}")
Hessian matrix in ML: The multivariate analogue of $f''$ is the Hessian $H$, where $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. A positive definite Hessian at a critical point confirms a local minimum. Large Hessian eigenvalues indicate sharp curvature (fast loss change), small eigenvalues indicate flat regions. Second-order optimizers use Hessian information but are $O(n^2)$ per step — infeasible for large networks, motivating L-BFGS approximations.

Special Derivatives Table

Core derivatives every ML practitioner should know:

$f(x)$$f'(x)$ML Relevance
$x^n$$nx^{n-1}$Polynomial loss surfaces
$e^x$$e^x$Softmax numerator; unchanged by differentiation
$\ln x$$1/x$Cross-entropy: $-\frac{d}{d\hat{y}}\log\hat{y} = -1/\hat{y}$
$\sin x$$\cos x$Positional encoding (Transformer)
$\cos x$$-\sin x$Positional encoding (Transformer)
$\sigma(x) = \frac{1}{1+e^{-x}}$$\sigma(x)(1-\sigma(x))$Sigmoid activation; max gradient = 0.25 → vanishing gradient risk
$\tanh(x)$$1 - \tanh^2(x)$LSTM/GRU gates; gradient $\in (-1,1)$
$\text{ReLU}(x) = \max(0,x)$$0$ if $x \lt 0$, $1$ if $x \gt 0$Default hidden activation; dead neuron problem at $x=0$
import numpy as np

def numerical_deriv(f, x, h=1e-6):
    """Central difference derivative — O(h^2) accuracy."""
    return (f(x + h) - f(x - h)) / (2 * h)

test_x = 1.0

functions = {
    "e^x":      (np.exp,                       lambda x: np.exp(x)),
    "ln(x)":    (np.log,                        lambda x: 1/x),
    "sin(x)":   (np.sin,                        np.cos),
    "cos(x)":   (np.cos,                        lambda x: -np.sin(x)),
    "sigma(x)": (lambda x: 1/(1+np.exp(-x)),    lambda x: (s := 1/(1+np.exp(-x))) * (1 - s)),
    "tanh(x)":  (np.tanh,                       lambda x: 1 - np.tanh(x)**2),
    "ReLU(x)":  (lambda x: np.maximum(0, x),    lambda x: float(x > 0)),
}

print(f"{'Function':12s}  {'Numerical':12s}  {'Analytical':12s}  {'Match':5s}")
print("-" * 52)
for name, (f, df) in functions.items():
    num = numerical_deriv(f, test_x)
    ana = df(test_x)
    match = "✓" if abs(num - ana) < 1e-5 else "✗"
    print(f"{name:12s}  {num:12.6f}  {ana:12.6f}  {match}")

Differentiation Rules

Three rules — chain, product, quotient — unlock the derivatives of all composite and combined functions. Every backpropagation computation ultimately applies the chain rule.

Chain Rule

For a composition $y = f(g(x))$, the chain rule states:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}, \quad \text{where } u = g(x)$$

Or equivalently: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$.

Chain rule in backpropagation: A neural network is a deep composition of functions: $\hat{y} = f_L(f_{L-1}(\ldots f_1(\mathbf{x})))$. Backpropagation is the chain rule applied systematically layer by layer, accumulating the product of local Jacobians from output to input. Each loss.backward() call is the chain rule running in reverse.
import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Chain rule: d/dx [sin(x^2)] = cos(x^2) * 2x
f_comp       = lambda x: np.sin(x**2)
df_chain     = lambda x: np.cos(x**2) * 2*x   # outer' * inner'

print("Chain Rule: d/dx [sin(x^2)] = cos(x^2) * 2x")
for x in [0.5, 1.0, 2.0]:
    num = numerical_deriv(f_comp, x)
    ana = df_chain(x)
    print(f"  x={x}: numerical={num:.6f}, chain rule={ana:.6f}")

# Chain rule in a single neuron: y = sigma(wx + b), dy/dw = sigma'(z)*x
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x = 0.5, 0.1, 2.0
z = w * x + b
dy_dw_chain     = sigmoid(z) * (1 - sigmoid(z)) * x
dy_dw_numerical = numerical_deriv(lambda w_: sigmoid(w_*x + b), w)
print(f"\nNeuron y=sigma(wx+b), dy/dw at w={w}:")
print(f"  Chain rule:  {dy_dw_chain:.6f}")
print(f"  Numerical:   {dy_dw_numerical:.6f}")

Product Rule

For $y = u(x) \cdot v(x)$, the product rule states:

$$\frac{d}{dx}[u \cdot v] = u'v + uv'$$

Mnemonic: "derivative of first times second, plus first times derivative of second."

Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.

import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Product rule: d/dx [x^2 * sin(x)]
u  = lambda x: x**2;        up = lambda x: 2*x
v  = lambda x: np.sin(x);   vp = lambda x: np.cos(x)

f  = lambda x: u(x) * v(x)
df_product = lambda x: up(x)*v(x) + u(x)*vp(x)   # product rule

print("Product Rule: d/dx [x^2 * sin(x)] = 2x*sin(x) + x^2*cos(x)")
for x in [1.0, 2.0, np.pi/3]:
    num = numerical_deriv(f, x)
    ana = df_product(x)
    print(f"  x={x:.4f}: numerical={num:.6f}, product rule={ana:.6f}")

Quotient Rule

For $y = u(x)\,/\,v(x)$, the quotient rule states:

$$\frac{d}{dx}\!\left[\frac{u}{v}\right] = \frac{u'v - uv'}{v^2}$$

Mnemonic: "lo d-hi minus hi d-lo, over lo-squared" (hi = $u$, lo = $v$).

Classic example — derivative of $\tan x = \sin x\,/\,\cos x$:

$$\frac{d}{dx}\tan x = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \sec^2 x$$

ML application: the softmax $\hat{p}_k = e^{z_k}/\sum_j e^{z_j}$ differentiated w.r.t. $z_k$ uses the quotient rule, yielding $\hat{p}_k(1-\hat{p}_k)$ for the diagonal Jacobian entries.

import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Quotient rule: d/dx [sin(x) / (1 + x^2)]
u  = lambda x: np.sin(x);   up = lambda x: np.cos(x)
v  = lambda x: 1 + x**2;    vp = lambda x: 2*x

f  = lambda x: u(x) / v(x)
df_quotient = lambda x: (up(x)*v(x) - u(x)*vp(x)) / v(x)**2

print("Quotient Rule: d/dx [sin(x)/(1+x^2)]")
for x in [0.0, 1.0, 2.0]:
    num = numerical_deriv(f, x)
    ana = df_quotient(x)
    print(f"  x={x:.1f}: numerical={num:.6f}, quotient rule={ana:.6f}")

# Classic: d/dx [tan(x)] = sec^2(x)
f_tan = np.tan
df_tan = lambda x: 1.0 / np.cos(x)**2
print("\nQuotient Rule: d/dx [tan(x)] = sec^2(x)")
for x in [0.0, 0.5, 1.0]:
    num = numerical_deriv(f_tan, x)
    ana = df_tan(x)
    print(f"  x={x:.1f}: numerical={num:.6f}, sec^2(x)={ana:.6f}")

Partial Derivatives & Gradients

A neural network might have millions of parameters — weights $w_1, w_2, \ldots, w_n$. The loss $\mathcal{L}$ depends on all of them simultaneously. To train the network, we need to know: "if I nudge this one weight $w_i$ slightly, how much does the loss change?" That is exactly what a partial derivative answers.

The temperature analogy: Imagine you're standing on a hillside and the temperature depends on both your east-west position ($x$) and your north-south position ($y$). The partial derivative $\frac{\partial T}{\partial x}$ tells you: "if I take one step east (keeping my north-south position fixed), how does the temperature change?" You are isolating one direction at a time, pretending the other dimensions are frozen.

How to Compute a Partial Derivative

To compute $\frac{\partial f}{\partial x_i}$, treat every variable except $x_i$ as a constant and differentiate as usual. For example, with $f(x, y) = x^2 y + 3y^2$:

  • $\frac{\partial f}{\partial x} = 2xy + 0 = 2xy$ — treat $y$ as a constant, differentiate $x^2 y$ with respect to $x$
  • $\frac{\partial f}{\partial y} = x^2 + 6y$ — treat $x$ as a constant, differentiate $3y^2$ with respect to $y$

At the specific point $(x, y) = (2, 1)$: $\frac{\partial f}{\partial x} = 4$, $\frac{\partial f}{\partial y} = 10$. Increasing $y$ a tiny bit here changes $f$ more than increasing $x$ by the same amount.

The Gradient — A Compass on the Loss Surface

The gradient $\nabla f$ stacks all partial derivatives into a single vector:

$$\nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

This vector has two critical properties that make it the heart of optimization:

  • Direction: $\nabla f(\mathbf{x})$ points in the direction of steepest uphill climb from point $\mathbf{x}$. Flipping it, $-\nabla f(\mathbf{x})$ points downhill — toward lower loss.
  • Magnitude: $\|\nabla f(\mathbf{x})\|$ tells you how steep the slope is. A large magnitude means the loss is changing rapidly; near zero means you're on a flat plateau (or near a minimum).
IntuitionML Context
Gradient in Neural Network Training

Suppose a network has two weights $w_1$ and $w_2$, and the loss surface looks like a bowl. At the current position $(w_1, w_2)$, the gradient might be $\nabla \mathcal{L} = (3.2, -0.4)$. This says: increasing $w_1$ increases loss fast (slope 3.2); increasing $w_2$ decreases loss slowly (slope −0.4). So to go downhill, update $w_1$ by a lot in the negative direction and $w_2$ by a little in the positive direction. The gradient tells each parameter exactly how to move.

import numpy as np

# Partial derivatives computed analytically vs numerically
# f(x, y) = x^2 * y + 3*y^2

def f(x, y):
    return x**2 * y + 3 * y**2

# Analytical partial derivatives
def df_dx(x, y):
    return 2 * x * y          # d/dx of x^2*y is 2xy (y constant)

def df_dy(x, y):
    return x**2 + 6 * y       # d/dy of x^2*y + 3y^2 is x^2 + 6y

# Numerical approximation using finite differences: df/dx ≈ (f(x+h)-f(x-h)) / 2h
def numerical_partial(fn, x, y, var='x', h=1e-5):
    if var == 'x':
        return (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    else:
        return (fn(x, y + h) - fn(x, y - h)) / (2 * h)

# Evaluate at point (2, 1)
x0, y0 = 2.0, 1.0
print(f"f(2, 1) = {f(x0, y0)}")
print()
print(f"  df/dx at (2,1): analytical = {df_dx(x0, y0):.4f},  numerical = {numerical_partial(f, x0, y0, 'x'):.4f}")
print(f"  df/dy at (2,1): analytical = {df_dy(x0, y0):.4f}, numerical = {numerical_partial(f, x0, y0, 'y'):.4f}")
print()

# The gradient vector
gradient = np.array([df_dx(x0, y0), df_dy(x0, y0)])
print(f"  Gradient at (2, 1): {gradient}")
print(f"  Gradient magnitude (steepness): {np.linalg.norm(gradient):.4f}")
print(f"  Direction of steepest ascent:   {gradient / np.linalg.norm(gradient)}")
print(f"  Direction of steepest descent:  {-gradient / np.linalg.norm(gradient)}  (← this is what GD follows)")
import numpy as np

# Visualise how gradient magnitude signals proximity to a minimum
# Loss function: L(w1, w2) = (w1 - 2)^2 + (w2 + 1)^2  (minimum at w1=2, w2=-1)

def loss(w):
    return (w[0] - 2)**2 + (w[1] + 1)**2

def gradient(w):
    return np.array([2*(w[0] - 2), 2*(w[1] + 1)])

test_points = [
    np.array([5.0,  3.0]),   # far from minimum
    np.array([3.0,  0.0]),   # medium distance
    np.array([2.5, -0.5]),   # getting closer
    np.array([2.1, -0.9]),   # near minimum
    np.array([2.0, -1.0]),   # at minimum
]

print("Point (w1, w2)     | Loss     | Gradient           | ||grad|| (steepness)")
print("-------------------|----------|--------------------|---------------------")
for w in test_points:
    g = gradient(w)
    print(f"  ({w[0]:+.1f}, {w[1]:+.1f})       | {loss(w):7.3f}  | [{g[0]:+.2f}, {g[1]:+.2f}]       | {np.linalg.norm(g):.4f}")

print("\nAs we approach the minimum, gradient magnitude → 0 (flat region).")
Gradient = GPS for the optimizer. The gradient vector tells the optimizer: (a) which direction is "uphill" on the loss surface, and (b) how steep that hill is right now. Gradient descent simply follows the negative gradient — always choosing to walk downhill, step by step.
Interactive: Gradient Arrows on Loss Surface L(w₁, w₂) = w₁² + w₂²
1.5
1.0

Loss surface L(w₁, w₂) = w₁² + w₂² — a bowl with minimum at the origin. The crimson arrow shows the gradient at your chosen point (uphill direction); the teal dashed arrow shows −∇L (steepest descent direction). Move the sliders to see how the gradient points away from the minimum and shrinks to zero as you approach it.

Jacobian & Hessian

The gradient works perfectly when a function produces a single number output (like a loss value). But neural network layers take a vector of inputs and produce a vector of outputs. To track how all outputs change with respect to all inputs, we need the Jacobian. And to understand the curvature of the loss landscape — not just which way is downhill, but how the slope itself is changing — we need the Hessian.

The Jacobian — Gradient for Vector-Valued Functions

The gradient $\nabla f$ exists for functions that output a scalar ($f: \mathbb{R}^n \to \mathbb{R}$). The Jacobian generalises this to functions that output a vector ($f: \mathbb{R}^n \to \mathbb{R}^m$). It is a matrix where each row is the gradient of one output.

$$J = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n}$$

Entry $J_{ij}$ answers: "how does output $i$ change when input $j$ is nudged?" The entire matrix captures all input-output sensitivities at once.

The spreadsheet analogy: Think of a function that takes 3 raw ingredient quantities and produces 2 nutritional values (calories, protein). The Jacobian is like a sensitivity table: row 1 shows how calories respond to each ingredient; row 2 shows how protein responds to each ingredient. It encodes the complete picture of how outputs respond to inputs.

Why it matters in deep learning: During backpropagation, gradients flow backward through each layer. If a layer maps $\mathbb{R}^n \to \mathbb{R}^m$ (e.g., a linear layer), the backward pass multiplies by the transpose of the Jacobian $J^\top$. This is happening implicitly every time you call loss.backward() in PyTorch.

import numpy as np

# Jacobian example: f: R^2 -> R^2
# f1(x, y) = x^2 + y
# f2(x, y) = 3x - y^2
#
# Analytical Jacobian:
#   J = [[df1/dx, df1/dy],   =   [[2x,  1 ],
#         [df2/dx, df2/dy]]        [ 3, -2y]]

def f_vec(xy):
    x, y = xy
    return np.array([x**2 + y, 3*x - y**2])

def jacobian_analytical(x, y):
    return np.array([[2*x,   1],
                     [  3, -2*y]])

# Numerical Jacobian via finite differences
def jacobian_numerical(fn, xy, h=1e-5):
    n = len(xy)
    m = len(fn(xy))
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (fn(xy + e) - fn(xy - e)) / (2 * h)
    return J

point = np.array([2.0, 3.0])
J_analytical = jacobian_analytical(*point)
J_numerical  = jacobian_numerical(f_vec, point)

print("f([2, 3]) =", f_vec(point))
print()
print("Analytical Jacobian at (2, 3):")
print(J_analytical)
print()
print("Numerical Jacobian at (2, 3):")
print(np.round(J_numerical, 5))
print()
print("Interpretation:")
print(f"  df1/dx = {J_analytical[0,0]} → increasing x by 1 increases f1 by ~{J_analytical[0,0]}")
print(f"  df1/dy = {J_analytical[0,1]} → increasing y by 1 increases f1 by ~{J_analytical[0,1]}")
print(f"  df2/dx = {J_analytical[1,0]} → increasing x by 1 increases f2 by ~{J_analytical[1,0]}")
print(f"  df2/dy = {J_analytical[1,1]} → increasing y by 1 changes f2 by ~{J_analytical[1,1]}")

The Hessian — Curvature of the Loss Landscape

The gradient tells you which way is downhill. The Hessian tells you how the slope is changing — whether the surface is curving toward you (bowl shape) or away (saddle/ridge). It is the matrix of all second-order partial derivatives of a scalar function:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

For a function $f(x, y)$, the Hessian is a $2 \times 2$ matrix:

$$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix}$$
HessianCurvature
What the Hessian Eigenvalues Tell You

The eigenvalues of the Hessian $H$ reveal the local shape of the loss surface:

  • All positive eigenvalues (positive definite $H$): The surface curves upward in every direction — you are at a local minimum. Like the bottom of a bowl.
  • All negative eigenvalues (negative definite $H$): The surface curves downward in every direction — you are at a local maximum. Like the top of a hill.
  • Mixed eigenvalues: The surface curves up in some directions and down in others — you are at a saddle point. Very common in high-dimensional neural network loss landscapes.
  • Some zero eigenvalues: Flat directions exist — the loss doesn't change if you move in that direction. Common in overparameterized networks.
import numpy as np

# Hessian example: f(x, y) = x^2 + 4*x*y + y^2
# Analytical Hessian:
#   d^2f/dx^2  = 2
#   d^2f/dy^2  = 2
#   d^2f/dxdy  = 4
#   H = [[2, 4], [4, 2]]

def f(xy):
    x, y = xy
    return x**2 + 4*x*y + y**2

def hessian_numerical(fn, xy, h=1e-4):
    """Compute Hessian numerically via finite differences."""
    n = len(xy)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (fn(xy+ei+ej) - fn(xy+ei-ej) - fn(xy-ei+ej) + fn(xy-ei-ej)) / (4*h**2)
    return H

point = np.array([1.0, 0.5])
H = hessian_numerical(f, point)
print("Hessian at (1, 0.5):")
print(np.round(H, 4))

eigenvalues = np.linalg.eigvalsh(H)
print(f"\nEigenvalues: {eigenvalues}")
if np.all(eigenvalues > 0):
    print("  → All positive: local MINIMUM (bowl shape)")
elif np.all(eigenvalues < 0):
    print("  → All negative: local MAXIMUM (hill top)")
else:
    print("  → Mixed signs: SADDLE POINT")

print()
# Compare three loss surfaces at their critical points
functions = [
    ("x^2 + y^2 (bowl/min)",     lambda xy: xy[0]**2 + xy[1]**2),
    ("-x^2 - y^2 (hill/max)",    lambda xy: -xy[0]**2 - xy[1]**2),
    ("x^2 - y^2 (saddle)",       lambda xy: xy[0]**2 - xy[1]**2),
]
origin = np.array([0.0, 0.0])
print("Surface shape diagnosis at (0, 0):")
for name, fn in functions:
    evs = np.linalg.eigvalsh(hessian_numerical(fn, origin))
    shape = "local min" if np.all(evs>0) else ("local max" if np.all(evs<0) else "saddle point")
    print(f"  {name:30s} → eigenvalues {evs} → {shape}")
Why we rarely use the Hessian in deep learning: For a network with $n$ parameters, the Hessian has $n^2$ entries. GPT-3 has 175 billion parameters — its Hessian would have ~$3 \times 10^{22}$ entries. That is physically impossible to store or compute. Methods like L-BFGS approximate the Hessian from gradient history. First-order methods (SGD, Adam) ignore it entirely — that's why they scale to billions of parameters.

Gradient Descent

Knowing the gradient is only half the story. Gradient descent is the actual algorithm that uses the gradient to train a model — it is the engine behind virtually every neural network ever trained.

The blindfolded hiker analogy: You're blindfolded on a hilly landscape and want to reach the lowest valley. You can't see the whole map — you can only feel the slope beneath your feet right now. Your strategy: take a small step in whichever direction feels most downhill, stop, re-check the slope, repeat. That's gradient descent. The step size is the learning rate $\alpha$. The slope beneath your feet is the gradient $\nabla \mathcal{L}$.

Formally, gradient descent updates parameters by subtracting a scaled version of the gradient:

$$\theta_{t+1} = \theta_t - \alpha \cdot \nabla_\theta \mathcal{L}(\theta_t)$$

The minus sign is crucial — we move opposite to the gradient because the gradient points uphill and we want to go downhill. $\alpha$ (the learning rate) controls how big each step is.

Learning Rate — The Most Important Hyperparameter

The learning rate $\alpha$ is deceptively simple but enormously impactful. Set it wrong and training either diverges or crawls:

Too LargeToo SmallJust Right
Three Learning Rate Scenarios
  • Too large ($\alpha = 1.5$): The step overshoots the minimum. You land on the other side of the valley, then overshoot back, oscillating wildly. Loss diverges instead of converging.
  • Too small ($\alpha = 0.001$): Each step is so tiny that reaching the minimum takes thousands of iterations. Training is painfully slow and you may never finish.
  • Just right ($\alpha = 0.1$): Steps are large enough to make progress but small enough not to overshoot. Loss decreases smoothly and reaches the minimum efficiently.

In practice, use $\alpha = 10^{-3}$ as a starting point with Adam, or $\alpha = 10^{-1}$ to $10^{-2}$ with SGD. Then use a learning rate schedule (e.g., cosine decay or reduce-on-plateau) to decrease $\alpha$ as training progresses.

Interactive: Gradient Descent — Effect of Learning Rate
0.1

Minimising L(w) = (w − 3)². The animated ball traces gradient descent steps from w = −2. Try α = 0.02 (very slow), α = 0.1 (just right), α = 0.95 (oscillates), α > 1.0 (diverges). Step count and final value update live.

import numpy as np

# Demonstrate learning rate effect on convergence
# Minimize: f(x) = (x - 3)^2, gradient = 2*(x - 3)

def f(x):      return (x - 3)**2
def grad(x):   return 2 * (x - 3)

def run_gd(lr, steps=30, x0=-2.0):
    x = x0
    losses = []
    for _ in range(steps):
        x = x - lr * grad(x)
        losses.append(f(x))
    return x, losses

learning_rates = [0.001, 0.1, 0.9, 1.1]

print(f"{'LR':>8} | {'Final x':>10} | {'Final loss':>12} | Verdict")
print("-" * 60)
for lr in learning_rates:
    x_final, losses = run_gd(lr, steps=50)
    final_loss = losses[-1]
    if np.isnan(final_loss) or final_loss > 100:
        verdict = "DIVERGED"
    elif final_loss < 0.001:
        verdict = "converged well"
    elif final_loss < 1.0:
        verdict = "converging (slow)"
    else:
        verdict = "barely moving"
    print(f"  {lr:>6} | {x_final:>10.4f} | {final_loss:>12.6f} | {verdict}")

Batch, SGD, and Mini-Batch — Three Flavours

In practice the dataset has $N$ training examples. The three variants differ in how many examples are used to estimate the gradient at each step:

VariantExamples per stepGradient qualitySpeed per stepTypical use
Batch GDAll $N$ExactSlow ($O(N)$)Small datasets, convex problems
SGD1 random sampleVery noisyVery fastOnline learning, huge datasets
Mini-batch SGD$B$ samples (32–512)Good estimateFast + parallelDeep learning standard

Why mini-batch? GPUs are built for parallel matrix operations — processing 32 or 256 examples together uses the hardware efficiently. The gradient over a mini-batch is noisy enough to help escape saddle points but stable enough to make reliable progress. The noise is actually helpful: it acts as a kind of regularisation, preventing the optimizer from memorising specific training examples.

import numpy as np

# Compare Batch GD vs Mini-batch SGD on a linear regression problem
# Data: y = 2x + 1 + noise.  Model: y_hat = w*x + b

np.random.seed(42)
N = 200
X = np.random.randn(N, 1)
y = 2 * X + 1 + 0.3 * np.random.randn(N, 1)

def mse_loss(X, y, w, b):
    preds = X * w + b
    return np.mean((preds - y) ** 2)

def mse_gradients(X_batch, y_batch, w, b):
    preds = X_batch * w + b
    err   = preds - y_batch
    dw    = 2 * np.mean(err * X_batch)
    db    = 2 * np.mean(err)
    return dw, db

lr = 0.1
epochs = 30

# --- Batch Gradient Descent ---
w_batch, b_batch = 0.0, 0.0
losses_batch = []
for _ in range(epochs):
    dw, db = mse_gradients(X, y, w_batch, b_batch)  # full dataset
    w_batch -= lr * dw
    b_batch -= lr * db
    losses_batch.append(mse_loss(X, y, w_batch, b_batch))

# --- Mini-batch SGD (batch size = 32) ---
w_mini, b_mini = 0.0, 0.0
losses_mini = []
batch_size = 32
for _ in range(epochs):
    idx = np.random.permutation(N)
    for start in range(0, N, batch_size):
        batch = idx[start:start + batch_size]
        dw, db = mse_gradients(X[batch], y[batch], w_mini, b_mini)
        w_mini -= lr * dw
        b_mini -= lr * db
    losses_mini.append(mse_loss(X, y, w_mini, b_mini))

print("Linear regression training (true: w=2, b=1)")
print(f"\nBatch GD after {epochs} epochs:     w={w_batch:.4f}, b={b_batch:.4f}, loss={losses_batch[-1]:.6f}")
print(f"Mini-batch SGD after {epochs} epochs: w={w_mini:.4f},  b={b_mini:.4f},  loss={losses_mini[-1]:.6f}")

print("\nLoss per 10 epochs:")
print(f"{'Epoch':>7} | {'Batch GD':>10} | {'Mini-batch':>10}")
print("-" * 35)
for i in [0, 9, 19, 29]:
    print(f"  {i+1:5d} | {losses_batch[i]:10.6f} | {losses_mini[i]:10.6f}")
import numpy as np

# Visualise how gradient descent converges on a 2D loss surface
# L(w1, w2) = (w1 - 2)^2 + 4*(w2 - 1)^2   (elliptical bowl)
# Minimum at (2, 1)

def loss(w):
    return (w[0] - 2)**2 + 4*(w[1] - 1)**2

def gradient(w):
    return np.array([2*(w[0] - 2), 8*(w[1] - 1)])

w = np.array([-1.0, 3.5])   # start far away
alpha = 0.1
path = [w.copy()]

for step in range(25):
    w = w - alpha * gradient(w)
    path.append(w.copy())

print("Gradient descent path on 2D elliptical bowl:")
print(f"  Start:  w={path[0]},  loss={loss(path[0]):.4f}")
print()
print(f"{'Step':>5} | {'w1':>8} | {'w2':>8} | {'loss':>10}")
print("-" * 40)
for i, pt in enumerate(path[::5]):
    print(f"  {i*5:3d}  | {pt[0]:8.4f} | {pt[1]:8.4f} | {loss(pt):10.6f}")
print()
print(f"  Final: w={path[-1]},  true minimum=(2, 1)")
print(f"  Loss reduction: {loss(path[0]):.4f} → {loss(path[-1]):.6f}")
Key takeaways on gradient descent:
  • The update rule $\theta \leftarrow \theta - \alpha \nabla \mathcal{L}$ is conceptually simple but powers all of modern AI
  • Learning rate $\alpha$ is the most sensitive hyperparameter: too big diverges, too small is painfully slow
  • Mini-batch SGD (batch size 32–512) is the standard — it balances GPU efficiency, gradient quality, and implicit regularisation
  • Vanilla gradient descent is rarely used directly — modern optimizers (Momentum, Adam) build on it to handle real-world loss landscapes
  • Convergence is not guaranteed for non-convex functions; gradient descent finds a good enough solution, not necessarily the global minimum

Modern Optimizers

An optimizer is the algorithm that updates a neural network's weights to reduce the loss. Gradient descent tells us which direction to move (opposite to the gradient), but it says nothing about how fast to move, or whether to adapt to the shape of the loss landscape. Modern optimizers solve exactly these problems.

The hill-rolling analogy: Imagine you're blindfolded on a hilly landscape and want to reach the lowest valley. Gradient descent says "take a small step downhill." But what if the slope is very steep in one direction and nearly flat in another? You'd want to slow down on steep slopes and speed up on flat ones — that's exactly what adaptive optimizers do.

Vanilla SGD — The Baseline

Stochastic Gradient Descent (SGD) is the simplest optimizer. At each step it moves the weights in the direction that reduces the loss, by a fixed amount called the learning rate $\alpha$:

$$\theta_{t+1} = \theta_t - \alpha \cdot \nabla_\theta \mathcal{L}$$

Think of $\nabla_\theta \mathcal{L}$ as a signpost that says "loss goes up this way" and you go the opposite direction. The learning rate $\alpha$ controls the step size.

Problem with vanilla SGD: A fixed step size is clumsy. On a steep slope you overshoot; on a flat region you move too slowly. It also oscillates wildly in narrow ravines (common in deep networks) and can get stuck in saddle points.
import numpy as np

# Vanilla SGD on a simple 1D loss: L(w) = (w - 3)^2
# Minimum is at w = 3, gradient is 2*(w - 3)

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)

w = 0.0          # start far from minimum
alpha = 0.1      # learning rate
history = [w]

for step in range(20):
    w = w - alpha * grad(w)
    history.append(w)

print("Vanilla SGD — tracking w toward minimum (w=3):")
for i, val in enumerate(history[::4]):
    print(f"  Step {i*4:2d}: w = {val:.4f}, loss = {loss(val):.4f}")

Momentum — Don't Stop, Keep Rolling

Momentum fixes the oscillation problem by giving the optimizer a "memory" of past gradients. Instead of reacting to each new gradient independently, it accumulates a running average — just like a ball rolling downhill picks up speed and rolls past small bumps.

We maintain a velocity vector $v$ that blends the old velocity with the new gradient:

$$v_{t+1} = \beta \cdot v_t + (1-\beta) \cdot \nabla_\theta \mathcal{L}$$
$$\theta_{t+1} = \theta_t - \alpha \cdot v_{t+1}$$

The hyperparameter $\beta$ (typically 0.9) controls how much past gradients are remembered. With $\beta = 0.9$, each step is 90% old velocity + 10% new gradient — so the update is smooth.

IntuitionAnalogy
Why Momentum Helps: The Ravine Problem

Picture a long, narrow valley. The loss drops steeply from side to side (high gradient in the narrow direction) but only very gently toward the bottom of the valley (small gradient in the long direction). Vanilla SGD bounces back and forth on the steep walls while barely progressing toward the goal. Momentum damps out the bouncing (gradients cancel in the narrow direction) while the consistent downward gradient in the long direction accumulates — the optimizer slides smoothly to the minimum.

import numpy as np

# Compare SGD vs Momentum on a 2D "ravine" loss
# L(w1, w2) = w1^2 + 10*w2^2  (steep in w2, shallow in w1)
# Minimum at (0, 0)

def grad_ravine(w):
    return np.array([2 * w[0], 20 * w[1]])

def loss_ravine(w):
    return w[0]**2 + 10 * w[1]**2

w_sgd = np.array([2.0, 1.0])
w_mom = np.array([2.0, 1.0])
v = np.zeros(2)

alpha = 0.05
beta  = 0.9

print("Step | SGD loss   | Momentum loss")
print("-----|------------|---------------")
for step in range(1, 26):
    # SGD update
    w_sgd = w_sgd - alpha * grad_ravine(w_sgd)

    # Momentum update
    v = beta * v + (1 - beta) * grad_ravine(w_mom)
    w_mom = w_mom - alpha * v

    if step % 5 == 0:
        print(f"  {step:2d} | {loss_ravine(w_sgd):.6f} | {loss_ravine(w_mom):.6f}")

print(f"\nFinal w_sgd : {w_sgd}")
print(f"Final w_mom : {w_mom}  ← closer to (0, 0)")

RMSProp — Adaptive Learning Rates

Momentum smooths the direction of updates. RMSProp instead adapts the step size separately for each parameter — shrinking steps for parameters with large gradients and growing steps for parameters with small gradients. This is crucial when different weights need very different learning rates.

It tracks a running average of the squared gradients $s$ for each parameter:

$$s_{t+1} = \beta \cdot s_t + (1-\beta) \cdot (\nabla_\theta \mathcal{L})^2$$
$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_{t+1} + \epsilon}} \cdot \nabla_\theta \mathcal{L}$$

The $\sqrt{s_{t+1}}$ in the denominator is a measure of "how large this gradient has been recently." Dividing by it means: if a gradient has been large, divide by a big number → take a small step. If it's been small, divide by a small number → take a big step. $\epsilon \approx 10^{-8}$ prevents division by zero.

Real-world use case: In NLP, word embeddings for rare words get very small gradients (they rarely appear), while common words get large gradients. RMSProp automatically gives rare-word embeddings larger update steps, so they learn quickly from the few examples they do get.
import numpy as np

# RMSProp on the same ravine loss
# L(w1, w2) = w1^2 + 10*w2^2

def grad_ravine(w):
    return np.array([2 * w[0], 20 * w[1]])

def loss_ravine(w):
    return w[0]**2 + 10 * w[1]**2

w = np.array([2.0, 1.0])
s = np.zeros(2)     # running avg of squared gradients

alpha = 0.05
beta  = 0.9
eps   = 1e-8

print("RMSProp on ravine loss:")
print(f"  Step  0: w = {w}, loss = {loss_ravine(w):.6f}")
for step in range(1, 26):
    g = grad_ravine(w)
    s = beta * s + (1 - beta) * g**2           # update squared-grad average
    w = w - (alpha / (np.sqrt(s) + eps)) * g   # adaptive step

    if step % 5 == 0:
        print(f"  Step {step:2d}: w = [{w[0]:+.4f}, {w[1]:+.4f}], loss = {loss_ravine(w):.6f}")

print(f"\nFinal w: {w}  (target: [0, 0])")

Adam — The Best of Both Worlds

Adam (Adaptive Moment Estimation) combines Momentum and RMSProp into one algorithm. It keeps two running averages per parameter:

  • First moment $m$ (like Momentum): tracks the mean gradient → smooths direction
  • Second moment $v$ (like RMSProp): tracks mean squared gradient → adapts step size
$$m_{t+1} = \beta_1 \cdot m_t + (1-\beta_1) \cdot \nabla\mathcal{L} \qquad \text{(smoothed gradient)}$$
$$v_{t+1} = \beta_2 \cdot v_t + (1-\beta_2) \cdot (\nabla\mathcal{L})^2 \qquad \text{(smoothed squared gradient)}$$

There's a catch: at the very start ($t=1$), both $m$ and $v$ are zero. That means the first few steps are artificially tiny — the averages haven't "warmed up" yet. Adam fixes this with bias correction:

$$\hat{m} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \qquad \hat{v} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}$$

At $t=1$ with $\beta_1 = 0.9$: the denominator is $1 - 0.9 = 0.1$, so the corrected estimate is scaled up 10×, compensating for the cold start. As $t$ grows the denominators approach 1 and the correction vanishes.

The final update is:

$$\theta_{t+1} = \theta_t - \frac{\alpha \cdot \hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 0.001$. These work well out of the box for the vast majority of deep learning tasks.

import numpy as np

# Implement Adam from scratch and compare with SGD on the ravine loss
# L(w1, w2) = w1^2 + 10*w2^2, minimum at (0, 0)

def grad_ravine(w):
    return np.array([2 * w[0], 20 * w[1]])

def loss_ravine(w):
    return w[0]**2 + 10 * w[1]**2

# --- Adam ---
w_adam = np.array([2.0, 1.0])
m = np.zeros(2)   # first moment (momentum)
v = np.zeros(2)   # second moment (RMSProp)
beta1, beta2, eps, alpha = 0.9, 0.999, 1e-8, 0.1

# --- SGD ---
w_sgd = np.array([2.0, 1.0])
lr_sgd = 0.05

print("Step | Adam loss  | SGD loss")
print("-----|------------|----------")
for t in range(1, 31):
    g_adam = grad_ravine(w_adam)
    m = beta1 * m + (1 - beta1) * g_adam
    v = beta2 * v + (1 - beta2) * g_adam**2
    m_hat = m / (1 - beta1**t)          # bias correction
    v_hat = v / (1 - beta2**t)          # bias correction
    w_adam = w_adam - alpha * m_hat / (np.sqrt(v_hat) + eps)

    g_sgd = grad_ravine(w_sgd)
    w_sgd = w_sgd - lr_sgd * g_sgd

    if t % 5 == 0:
        print(f"  {t:2d} | {loss_ravine(w_adam):.6f} | {loss_ravine(w_sgd):.6f}")

print(f"\nFinal Adam w : {w_adam}")
print(f"Final SGD  w : {w_sgd}")
print("Adam converges faster and more smoothly on this asymmetric loss.")
Optimizer Progression: From SGD to Adam
flowchart TD
    SGD["Vanilla SGD
θ = θ − α·∇L
Fixed step size, no memory"] SGD -->|"+ gradient memory"| MOM["Momentum
v = β·v + (1−β)·∇L
θ = θ − α·v
Smooths direction"] SGD -->|"+ adaptive step size"| RMS["RMSProp
s = β·s + (1−β)·(∇L)²
θ = θ − α·∇L / √s
Adapts per-parameter LR"] MOM -->|"combine both"| ADAM["Adam
m = β₁·m + (1−β₁)·∇L
v = β₂·v + (1−β₂)·(∇L)²
θ = θ − α·m̂ / (√v̂ + ε)
Smooth direction + adaptive LR"] ADAM -->|"+ decoupled weight decay"| ADAMW["AdamW
Same as Adam, but weight decay
applied directly to weights
(not inside the gradient)"]
AdamWPractical Tips
AdamW and Choosing an Optimizer in Practice

AdamW is a small but important fix to Adam. In standard Adam, weight decay (L2 regularization) gets tangled with the adaptive learning rates, making it weaker than intended. AdamW decouples them — the weight decay is applied directly to the parameters, independent of the gradient statistics. This is why transformers like BERT and GPT are trained with AdamW rather than Adam.

When to use which:

  • Adam / AdamW — default choice for almost everything; especially neural networks, transformers, CNNs
  • SGD + Momentum — often preferred for CNNs in vision tasks when careful learning rate tuning is done (e.g., ResNet training); can generalize better than Adam with proper schedules
  • RMSProp — historically used for RNNs; largely superseded by Adam
  • AdaGrad — sparse data (e.g., text with rare tokens); rarely used directly today

In practice, start with Adam (lr=1e-3). If you care about best final accuracy, try SGD + Momentum + cosine LR decay. AdamW with a learning rate schedule is the standard for transformer fine-tuning.

import numpy as np

# Side-by-side comparison: SGD, Momentum, RMSProp, Adam on a 2D non-convex loss
# L(w1, w2) = sin(w1) + 0.5*w2^2   (non-convex in w1 dimension)

def loss_fn(w):
    return np.sin(w[0]) + 0.5 * w[1]**2

def grad_fn(w):
    return np.array([np.cos(w[0]), w[1]])

np.random.seed(42)
start = np.array([-1.5, 2.0])   # same start for all

steps = 40
alpha = 0.1

# SGD
w = start.copy()
losses_sgd = []
for _ in range(steps):
    w = w - alpha * grad_fn(w)
    losses_sgd.append(loss_fn(w))

# Momentum
w, vel = start.copy(), np.zeros(2)
losses_mom = []
for _ in range(steps):
    vel = 0.9 * vel + 0.1 * grad_fn(w)
    w = w - alpha * vel
    losses_mom.append(loss_fn(w))

# RMSProp
w, s = start.copy(), np.zeros(2)
losses_rms = []
for _ in range(steps):
    g = grad_fn(w)
    s = 0.9 * s + 0.1 * g**2
    w = w - (alpha / (np.sqrt(s) + 1e-8)) * g
    losses_rms.append(loss_fn(w))

# Adam
w, m, v = start.copy(), np.zeros(2), np.zeros(2)
losses_adam = []
for t in range(1, steps + 1):
    g = grad_fn(w)
    m = 0.9 * m + 0.1 * g
    v = 0.999 * v + 0.001 * g**2
    m_hat = m / (1 - 0.9**t)
    v_hat = v / (1 - 0.999**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + 1e-8)
    losses_adam.append(loss_fn(w))

print("Loss after N steps (lower is better):")
print(f"{'Step':>5} | {'SGD':>8} | {'Momentum':>8} | {'RMSProp':>8} | {'Adam':>8}")
print("-" * 52)
for i in [4, 9, 19, 39]:
    print(f"  {i+1:3d}  | {losses_sgd[i]:8.4f} | {losses_mom[i]:8.4f} | {losses_rms[i]:8.4f} | {losses_adam[i]:8.4f}")
Key takeaways on optimizers:
  • SGD: Simple but sensitive to learning rate and loss landscape shape
  • Momentum: Adds directional memory, damping oscillations and accelerating in consistent directions
  • RMSProp: Automatically scales learning rate per-parameter — large gradients → small step, small gradients → large step
  • Adam: Combines both, plus bias correction for stable early updates; best default choice
  • AdamW: Adam with proper weight decay; standard for transformers
  • No single optimizer wins every task — the best choice depends on architecture, data, and available compute for tuning

Backpropagation

Backpropagation efficiently computes $\nabla_\theta \mathcal{L}$ for all parameters in a neural network using the chain rule. For a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$:

$$\frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdots \frac{\partial a_{k+1}}{\partial a_k} \cdot \frac{\partial a_k}{\partial \theta_k}$$
Backpropagation: Chain Rule Through a Two-Layer Network
flowchart LR
    X["Input x"] -->|"W1, b1"| H["Hidden h = σ(W1x + b1)"]
    H -->|"W2, b2"| Y["Output ŷ = W2h + b2"]
    Y --> L["Loss L(ŷ, y)"]
    L -->|"∂L/∂ŷ = backward pass"| Y
    Y -->|"∂L/∂W2 = (∂L/∂ŷ) h^T"| W2["∇W2"]
    Y -->|"∂L/∂h = W2^T (∂L/∂ŷ)"| H
    H -->|"∂L/∂W1 = (∂L/∂h ⊙ σ') x^T"| W1["∇W1"]
            
import numpy as np

# Implement backpropagation for a 1-layer network on a simple regression problem

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

np.random.seed(0)
# Tiny dataset: XOR-like
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float)               # (4, 1)

# Initialize weights
W1 = np.random.randn(2, 4) * 0.5   # (2, 4) — input to hidden
b1 = np.zeros((1, 4))               # (1, 4)
W2 = np.random.randn(4, 1) * 0.5   # (4, 1) — hidden to output
b2 = np.zeros((1, 1))               # (1, 1)

lr = 0.5
losses = []

for epoch in range(3000):
    # Forward pass
    z1 = X @ W1 + b1         # (4, 4)
    a1 = sigmoid(z1)          # (4, 4)
    z2 = a1 @ W2 + b2         # (4, 1)
    a2 = sigmoid(z2)          # (4, 1) — output

    # MSE loss
    loss = np.mean((a2 - y)**2)
    losses.append(loss)

    # Backward pass (chain rule)
    dL_da2 = 2 * (a2 - y) / len(y)         # (4, 1)
    dL_dz2 = dL_da2 * sigmoid_deriv(z2)    # (4, 1)
    dL_dW2 = a1.T @ dL_dz2                  # (4, 1)
    dL_db2 = dL_dz2.sum(axis=0, keepdims=True)

    dL_da1 = dL_dz2 @ W2.T                  # (4, 4)
    dL_dz1 = dL_da1 * sigmoid_deriv(z1)    # (4, 4)
    dL_dW1 = X.T @ dL_dz1                   # (2, 4)
    dL_db1 = dL_dz1.sum(axis=0, keepdims=True)

    # Gradient updates
    W2 -= lr * dL_dW2
    b2 -= lr * dL_db2
    W1 -= lr * dL_dW1
    b1 -= lr * dL_db1

print("Backprop training on XOR:")
print(f"  Epoch   0: loss = {losses[0]:.6f}")
print(f"  Epoch 500: loss = {losses[500]:.6f}")
print(f"  Epoch 2999: loss = {losses[2999]:.6f}")
print(f"\nFinal predictions (should be ~0, 1, 1, 0):")
print(np.round(a2, 3).flatten())

Convexity & Loss Landscapes

A function $f$ is convex if for all $\mathbf{x}, \mathbf{y}$ and $t \in [0,1]$:

$$f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t f(\mathbf{x}) + (1-t)f(\mathbf{y})$$

For convex functions, any local minimum is the global minimum — gradient descent is guaranteed to converge. Linear regression's MSE loss is convex. Cross-entropy loss for logistic regression is convex in the weights.

Neural network loss landscapes are non-convex, containing numerous local minima, saddle points, and flat regions (plateaus). Practical observations:
  • Saddle points are more common than local minima in high dimensions — gradient descent can stall but usually escapes
  • Good local minima in overparameterized networks tend to generalize well (not obvious why — active research)
  • Learning rate matters enormously: too large diverges, too small is slow; warm-up + decay schedules help
  • Batch normalization smooths the loss landscape, improving gradient flow and enabling larger learning rates

Convergence Rate Theory

Understanding how fast gradient descent converges requires two key properties of the loss function:

L-smoothness: $f$ is $L$-smooth if its gradient is Lipschitz continuous:

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\| \quad \forall \mathbf{x}, \mathbf{y}$$

Equivalently, the Hessian eigenvalues are bounded: $\lambda_{\max}(H) \leq L$. This means the gradient can't change too abruptly.

$\mu$-strong convexity: $f$ is $\mu$-strongly convex if:

$$f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) + \frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|^2$$

Equivalently, $\lambda_{\min}(H) \geq \mu > 0$. This means the function curves at least as fast as a quadratic.

Convergence rates for GD with step size $\eta = 1/L$:

SettingRateIterations to $\epsilon$-optimal
Convex + $L$-smooth$f(x_k) - f^* \leq \frac{L\|x_0-x^*\|^2}{2k}$$O(1/\epsilon)$
$\mu$-strongly convex + $L$-smooth$\|x_k - x^*\|^2 \leq (1 - \mu/L)^k \|x_0-x^*\|^2$$O(\kappa \log(1/\epsilon))$
Non-convex + $L$-smooth$\min_{i\leq k}\|\nabla f(x_i)\|^2 \leq \frac{2L(f(x_0)-f^*)}{k}$$O(1/\epsilon^2)$ to stationary point

The condition number $\kappa = L/\mu$ controls convergence speed. High $\kappa$ means elongated loss landscape (slow convergence). For strongly convex problems, convergence is geometric (exponentially fast), but the rate degrades linearly with $\kappa$.

Saddle Points & Escaping Them

In high-dimensional loss landscapes, saddle points (where $\nabla f = 0$ but the Hessian has both positive and negative eigenvalues) are far more common than local minima. For a random function in $d$ dimensions, the probability that all $d$ Hessian eigenvalues are positive (true local minimum) is exponentially small in $d$.

Why SGD escapes saddle points: The noise in stochastic gradients perturbs the iterate away from saddle points. Near a saddle with negative curvature direction $v$ (Hessian eigenvalue $\lambda < 0$), the dynamics along $v$ are repulsive: $x_{k+1} \approx (1 - \eta\lambda)x_k$ with $|1-\eta\lambda| > 1$. SGD noise projects onto $v$ with high probability in high dimensions.

Condition number in practice: Adam and other adaptive methods effectively precondition the optimization, making the effective condition number closer to 1. This is why Adam converges faster than SGD on ill-conditioned problems (like attention layers with vastly different gradient scales across parameters).

import numpy as np

# Convergence rate comparison: well-conditioned vs ill-conditioned
def gradient_descent(grad_fn, x0, lr, n_steps):
    """Run GD and return trajectory of function values."""
    x = x0.copy()
    trajectory = []
    for _ in range(n_steps):
        g = grad_fn(x)
        x = x - lr * g
        trajectory.append(x.copy())
    return np.array(trajectory)

# 2D quadratic: f(x) = 0.5 * x^T A x (minimum at origin)
# Case 1: Well-conditioned (kappa = 2)
A_good = np.diag([1.0, 2.0])  # eigenvalues: 1, 2 → kappa = 2
# Case 2: Ill-conditioned (kappa = 50)
A_bad = np.diag([1.0, 50.0])  # eigenvalues: 1, 50 → kappa = 50

x0 = np.array([5.0, 5.0])

# Optimal learning rate for L-smooth: 1/L
lr_good = 1.0 / 2.0  # 1/L where L=2
lr_bad = 1.0 / 50.0  # 1/L where L=50

traj_good = gradient_descent(lambda x: A_good @ x, x0, lr_good, 50)
traj_bad = gradient_descent(lambda x: A_bad @ x, x0, lr_bad, 50)

# Distance to optimum at each step
dist_good = np.linalg.norm(traj_good, axis=1)
dist_bad = np.linalg.norm(traj_bad, axis=1)

print("Convergence comparison (distance to optimum):")
print(f"{'Step':>5} {'kappa=2':>10} {'kappa=50':>10}")
for k in [0, 5, 10, 20, 50]:
    idx = min(k, len(dist_good)-1)
    print(f"{k:5d} {dist_good[idx]:10.4f} {dist_bad[idx]:10.4f}")

# Theoretical rates
kappa_good, kappa_bad = 2, 50
print(f"\nTheoretical: (1 - 1/kappa)^k convergence factor")
print(f"  kappa=2:  (1-1/2)^50  = {(1-1/kappa_good)**50:.6f}")
print(f"  kappa=50: (1-1/50)^50 = {(1-1/kappa_bad)**50:.6f}")

Integration

Integration is the inverse of differentiation — it accumulates infinitesimal changes to compute totals. In ML, integrals appear in expected values, normalizing constants, and the probabilistic foundations underlying many algorithms.

Introduction to Integration

The antiderivative of $f(x)$ is a function $F(x)$ such that $F'(x) = f(x)$. Integration reverses differentiation:

$$\int f(x)\,dx = F(x) + C$$

where $C$ is the constant of integration (arbitrary, since constants vanish under differentiation). This is the indefinite integral.

Geometric interpretation: The definite integral $\int_a^b f(x)\,dx$ equals the signed area between the curve $y = f(x)$ and the x-axis from $a$ to $b$. Areas above the x-axis contribute positively, areas below negatively. Total (unsigned) area uses $\int_a^b |f(x)|\,dx$.

Indefinite Integration I — Basic Rules

RuleFormulaExample
Power rule$\int x^n\,dx = \frac{x^{n+1}}{n+1} + C\;(n \neq -1)$$\int x^4\,dx = \frac{x^5}{5} + C$
Constant$\int k\,dx = kx + C$$\int 5\,dx = 5x + C$
Sum/Difference$\int [f \pm g]\,dx = \int f\,dx \pm \int g\,dx$$\int (x^2 + 3)\,dx = \frac{x^3}{3} + 3x + C$
Constant multiple$\int kf(x)\,dx = k\int f(x)\,dx$$\int 4x^3\,dx = x^4 + C$
Reciprocal$\int \frac{1}{x}\,dx = \ln|x| + C$Arises in cross-entropy gradient
import numpy as np
from scipy import integrate

# Power rule: integral of x^4 from 0 to 1 = 1/5 = 0.2
result, _ = integrate.quad(lambda x: x**4, 0, 1)
print(f"integral x^4 from 0 to 1")
print(f"  Numerical:  {result:.8f}")
print(f"  Analytical (1/5): {0.2:.8f}")

# Sum rule: integral of (x^2 + 3) from 0 to 2 = x^3/3 + 3x from 0 to 2
# = (8/3 + 6) - 0 = 26/3
result2, _ = integrate.quad(lambda x: x**2 + 3, 0, 2)
print(f"\nintegral (x^2+3) from 0 to 2")
print(f"  Numerical:  {result2:.8f}")
print(f"  Analytical (26/3): {26/3:.8f}")

# Reciprocal: integral 1/x from 1 to e = ln(e) - ln(1) = 1
result3, _ = integrate.quad(lambda x: 1/x, 1, np.e)
print(f"\nintegral 1/x from 1 to e")
print(f"  Numerical:  {result3:.8f}")
print(f"  Analytical (ln e = 1): {1.0:.8f}")

Indefinite Integration II — Special Functions

Antiderivatives of exponential, logarithmic, and trigonometric functions appearing in ML:

$f(x)$Antiderivative $F(x) + C$ML Use
$e^x$$e^x + C$Softmax; exponential smoothing
$e^{ax}$$\frac{1}{a}e^{ax} + C$Laplace distribution log-likelihood
$\ln x$$x\ln x - x + C$Entropy; integration by parts result
$\sin x$$-\cos x + C$Positional encoding; Fourier features
$\cos x$$\sin x + C$Positional encoding; Fourier features
$\frac{1}{1+x^2}$$\arctan x + C$Cauchy distribution; RKHS kernels
$e^{-x^2/2}$$\sqrt{2\pi}\,\Phi(x) + C$Gaussian normalization; GELU activation
import numpy as np
from scipy import integrate

# e^x: integral from 0 to 2 = e^2 - e^0 = e^2 - 1
result, _ = integrate.quad(np.exp, 0, 2)
print(f"integral e^x from 0 to 2")
print(f"  Numerical:  {result:.8f}")
print(f"  Analytical (e^2 - 1): {np.e**2 - 1:.8f}")

# sin(x): integral from 0 to pi = [-cos(x)] from 0 to pi = 1 + 1 = 2
result2, _ = integrate.quad(np.sin, 0, np.pi)
print(f"\nintegral sin(x) from 0 to pi")
print(f"  Numerical:  {result2:.8f}")
print(f"  Analytical: {2.0:.8f}")

# ln(x): integral from 1 to e = [x*ln(x) - x] from 1 to e = (e - e) - (0 - 1) = 1
result3, _ = integrate.quad(np.log, 1, np.e)
print(f"\nintegral ln(x) from 1 to e")
print(f"  Numerical:  {result3:.8f}")
print(f"  Analytical: {1.0:.8f}")

# Gaussian normalization: integral of N(0,1) from -inf to +inf = 1
result4, _ = integrate.quad(lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi), -np.inf, np.inf)
print(f"\nintegral of N(0,1) pdf from -inf to +inf")
print(f"  Numerical:  {result4:.8f}")
print(f"  Analytical: {1.0:.8f}")

Definite Integration I — Fundamental Theorem

The Fundamental Theorem of Calculus links differentiation and integration. Part 2 (evaluation formula):

$$\int_a^b f(x)\,dx = F(b) - F(a)$$

where $F$ is any antiderivative of $f$. The constant $C$ cancels, so any antiderivative works.

ExampleFTC Step-by-Step
Evaluating $\int_1^4 (3x^2 - 2x + 1)\,dx$

Step 1 — Find antiderivative: $F(x) = x^3 - x^2 + x$

Step 2 — Evaluate: $F(4) - F(1) = (64 - 16 + 4) - (1 - 1 + 1) = 52 - 1 = 51$

import numpy as np
from scipy import integrate

# Fundamental Theorem: integral of (3x^2 - 2x + 1) from 1 to 4
f = lambda x: 3*x**2 - 2*x + 1
F = lambda x: x**3 - x**2 + x   # antiderivative

a, b = 1, 4
ftc_result = F(b) - F(a)
print(f"FTC: F({b}) - F({a}) = {F(b)} - {F(a)} = {ftc_result}")

# Numerical quadrature confirms the result
numerical, error = integrate.quad(f, a, b)
print(f"Numerical (scipy.integrate.quad): {numerical:.8f}")
print(f"Match: {abs(ftc_result - numerical) < 1e-9}")

# Properties of definite integrals
print("\nKey properties:")

# 1. Linearity
r1, _ = integrate.quad(lambda x: 2*x**2 + 3*np.sin(x), 0, np.pi)
r2 = 2*integrate.quad(lambda x: x**2, 0, np.pi)[0] + 3*integrate.quad(np.sin, 0, np.pi)[0]
print(f"  Linearity: int(2x^2+3sin) = {r1:.6f}, split = {r2:.6f}, match = {abs(r1-r2)<1e-9}")

# 2. Reverse limits: int_a^b f = -int_b^a f
r3, _ = integrate.quad(f, a, b)
r4, _ = integrate.quad(f, b, a)
print(f"  Reverse limits: int_1^4 = {r3:.4f}, int_4^1 = {r4:.4f}, sum = {r3+r4:.4f}")

# 3. ML use: E[X^2] for X ~ N(0,1) = 1 (the variance)
ev_x2, _ = integrate.quad(lambda x: x**2 * np.exp(-x**2/2)/np.sqrt(2*np.pi), -10, 10)
print(f"  E[X^2] for X~N(0,1) = {ev_x2:.8f}  (expected: 1.0 = variance)")

Definite Integration II — Substitution & Integration by Parts

Integration by substitution (reverse chain rule): for $\int f(g(x))g'(x)\,dx$, let $u = g(x)$:

$$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du$$

Example: $\int 2x\cos(x^2)\,dx$ — let $u = x^2$, $du = 2x\,dx$ → $\int \cos u\,du = \sin u + C = \sin(x^2) + C$.

Integration by parts (reverse product rule):

$$\int u\,dv = uv - \int v\,du$$

Example: $\int x e^x\,dx$ — let $u = x$, $dv = e^x\,dx$ → $xe^x - \int e^x\,dx = e^x(x-1) + C$.

import numpy as np
from scipy import integrate

# Substitution: integral of 2x*cos(x^2) from 0 to sqrt(pi)
# Antiderivative = sin(x^2), so F(sqrt(pi)) - F(0) = sin(pi) - sin(0) = 0
f_sub = lambda x: 2 * x * np.cos(x**2)
result_sub, _ = integrate.quad(f_sub, 0, np.sqrt(np.pi))
analytical_sub = np.sin(np.pi) - np.sin(0)
print(f"Substitution: integral 2x*cos(x^2) from 0 to sqrt(pi)")
print(f"  Numerical:  {result_sub:.8f}")
print(f"  Analytical: {analytical_sub:.8f}")

# Integration by parts: integral of x*e^x from 0 to 1
# Antiderivative: e^x(x-1), F(1)-F(0) = 0 - (-1) = 1
f_ibp = lambda x: x * np.exp(x)
result_ibp, _ = integrate.quad(f_ibp, 0, 1)
F_ibp = lambda x: np.exp(x) * (x - 1)
analytical_ibp = F_ibp(1) - F_ibp(0)
print(f"\nIntegration by parts: integral x*e^x from 0 to 1")
print(f"  Numerical:  {result_ibp:.8f}")
print(f"  Analytical: {analytical_ibp:.8f}")   # = 1.0

# ML use: expected value E[X] for X ~ Exponential(lambda=1)
# = integral of x * e^(-x) from 0 to inf = 1 (by integration by parts)
ev_exp, _ = integrate.quad(lambda x: x * np.exp(-x), 0, np.inf)
print(f"\nE[X] for X ~ Exp(1) (integral x*e^(-x) from 0 to inf):")
print(f"  Numerical: {ev_exp:.8f}  (expected: 1.0 = 1/lambda)")

Area Under a Curve

The definite integral gives signed area. For total (unsigned) area between a curve and the x-axis:

$$\text{Total area} = \int_a^b |f(x)|\,dx$$

Area between two curves $f(x) \geq g(x)$ on $[a,b]$:

$$\text{Area} = \int_a^b [f(x) - g(x)]\,dx$$
AUC-ROC is a definite integral: The ROC curve plots True Positive Rate vs False Positive Rate as the classification threshold varies. AUC = $\int_0^1 \text{TPR}(\text{FPR})\,d(\text{FPR})$. A perfect classifier has AUC = 1.0 (full unit-square area); a random classifier has AUC = 0.5 (triangle under the diagonal). Numerically, sklearn computes AUC via the trapezoidal rule — a discrete approximation to the integral.
import numpy as np
from scipy import integrate

# Area between g(x) = x and f(x) = x^2 on [0,1]
# x >= x^2 on [0,1], so area = integral(x - x^2) from 0 to 1
area, _ = integrate.quad(lambda x: x - x**2, 0, 1)
print(f"Area between g(x)=x and f(x)=x^2 on [0,1]:")
print(f"  Numerical:  {area:.8f}")
print(f"  Analytical: {0.5 - 1/3:.8f}")  # = 1/2 - 1/3 = 1/6

# Simulate AUC-ROC calculation (trapezoidal rule)
# Sorted scores and labels for a simple binary classifier
np.random.seed(42)
scores = np.random.beta(5, 2, 100)   # biased positive scores
labels = (scores + np.random.normal(0, 0.3, 100) > 0.5).astype(int)

# Build ROC curve manually
thresholds = np.linspace(0, 1, 200)
tprs, fprs = [], []
pos_total = labels.sum()
neg_total = (1 - labels).sum()

for thresh in thresholds:
    preds = (scores >= thresh).astype(int)
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    tprs.append(tp / pos_total)
    fprs.append(fp / neg_total)

tprs, fprs = np.array(tprs), np.array(fprs)

# AUC via trapezoid rule (numpy)
auc_trapz = np.trapz(tprs[::-1], fprs[::-1])   # sort fpr ascending

print(f"\nAUC-ROC (trapezoidal rule): {auc_trapz:.4f}")
print(f"  (1.0 = perfect, 0.5 = random)")
print(f"  This is literally a definite integral computed numerically.")

Practice Exercises

Exercise 1Chain Rule
Differentiate the Cross-Entropy Loss via Chain Rule

Let $\hat{y} = \sigma(w \cdot x + b)$ and $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$. Compute $\frac{\partial \mathcal{L}}{\partial w}$.

Show Answer

Chain rule: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$, where $z = wx + b$.

$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$

$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})$

Multiplying: $\frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y$

Then $\frac{\partial z}{\partial w} = x$. So: $\boxed{\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y) \cdot x}$. Clean result — the gradient is the prediction error times the input.

Exercise 2Gradient Descent
Learning Rate Sensitivity

For $f(x) = x^2$, derive the closed-form expression for $x_t$ after $t$ gradient descent steps starting from $x_0$. What values of learning rate $\alpha$ ensure convergence? What is the optimal $\alpha$?

Show Answer

$f'(x) = 2x$. Update: $x_{t+1} = x_t - 2\alpha x_t = (1-2\alpha)x_t$.

After $t$ steps: $x_t = (1-2\alpha)^t x_0$.

Converges to 0 iff $|1-2\alpha| \lt 1$, i.e., $0 \lt \alpha \lt 1$.

Fastest convergence: minimize $|1-2\alpha|$, achieved at $\alpha^* = 0.5$ giving $x_t = 0$ in one step. This corresponds to Newton's method for quadratics. For general smooth functions, the optimal step size is $\alpha = 1/L$ where $L$ is the Lipschitz constant of the gradient.

Conclusion & Next Steps

Calculus is what makes neural networks trainable:

  • Derivatives: measure instantaneous rate of change — the slope of the loss surface at any point
  • Power, product, quotient rules: the mechanical toolkit for differentiating any function
  • Chain rule: propagates gradients through compositions — the mathematical core of backpropagation
  • Gradient descent: iteratively follows $-\nabla\mathcal{L}$ to minimize loss
  • Hessian: encodes curvature — positive definite = convex = safe convergence
  • Adam: adaptive learning rates with momentum — robust default optimizer
  • Integration: computes expected values, normalizing constants, and the AUC-ROC metric — the probabilistic layer under ML
  • Non-convexity: deep network landscapes have saddle points and many local minima

Next in the Series

In Part 9: ML-Specific Math, we apply everything learned so far directly to ML: loss functions, regularization, bias-variance tradeoff, MLE/MAP, activation function gradients, and batch normalization.