
Part 8: Calculus & Optimization

April 26, 2026 · Wasil Zafar · 50 min read

Every time you call loss.backward(), you're executing the chain rule of calculus across millions of parameters. Gradient descent is applied calculus. Understanding why it works — and when it doesn't — requires genuine understanding of derivatives, gradients, and the geometry of loss landscapes.

Table of Contents

  1. Derivatives
  2. Differentiation Rules
  3. Partial Derivatives & Gradients
  4. Jacobian & Hessian
  5. Gradient Descent
  6. Modern Optimizers
  7. Backpropagation
  8. Convexity & Loss Landscapes
  9. Integration
  10. Practice Exercises
  11. Conclusion & Next Steps

Derivatives

A derivative measures how a function changes as its input changes — the instantaneous rate of change. Derivatives are the mathematical engine behind optimization, neural network training, and gradient descent.

Gradient of a Straight Line

For a linear function $y = mx + c$, the gradient (slope) is constant throughout:

$$m = \frac{y_2 - y_1}{x_2 - x_1} = \frac{\Delta y}{\Delta x}$$

This is "rise over run." The gradient is the same regardless of which two points you pick — straight lines have a uniform rate of change. A positive gradient means the line rises left-to-right; negative means it falls; zero means horizontal.
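A two-line check of this claim (the line $y = 2x + 1$ is an arbitrary choice): whichever pair of points we pick, rise over run returns the same slope.

# Slope of the arbitrary line y = 2x + 1 — "rise over run" from any two points
line = lambda x: 2 * x + 1

for (x1, x2) in [(0, 1), (-3, 5), (10, 10.5)]:
    m = (line(x2) - line(x1)) / (x2 - x1)
    print(f"points x={x1}, x={x2}:  slope = {m:.4f}")   # always 2.0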

Gradient of a Curve (Tangent)

Curves have a different gradient at each point. The gradient at a point on a curve $y = f(x)$ is the slope of the tangent line at that point. As we zoom in on any smooth curve, it locally looks like a straight line — the tangent.

Intuitively, the gradient of a curve at $x = a$ is the limit of the gradient of a secant line connecting $(a, f(a))$ and $(a+h, f(a+h))$ as $h \to 0$. This is exactly the definition of the derivative.

Differentiation from First Principles

The formal definition of the derivative:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Worked example — differentiate $f(x) = x^2$ from first principles:

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$
import numpy as np

def derivative_first_principles(f, x, h=1e-7):
    """Numerical derivative via first principles (forward difference)."""
    return (f(x + h) - f(x)) / h

def derivative_central(f, x, h=1e-6):
    """Central difference — more accurate (O(h^2) error vs O(h))."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Test on f(x) = x^2 — analytical derivative = 2x
f      = lambda x: x**2
grad_f = lambda x: 2*x

print("f(x) = x^2 — First Principles vs Analytical:")
for x in [0, 1, 2, 3]:
    forward_diff = derivative_first_principles(f, x)
    central_diff = derivative_central(f, x)
    analytical   = grad_f(x)
    print(f"  x={x}: forward={forward_diff:.6f}, central={central_diff:.6f}, analytical={analytical:.6f}")

# Test on f(x) = sin(x) — derivative = cos(x)
g = np.sin
print("\nf(x) = sin(x) — derivative = cos(x):")
for x in [0, np.pi/4, np.pi/2, np.pi]:
    num = derivative_central(g, x)
    ana = np.cos(x)
    print(f"  x={x:.4f}: numerical={num:.6f}, analytical={ana:.6f}")

The three panels below illustrate what the first derivative means geometrically — from the limit definition as a shrinking secant line, to its sign ruling where a function rises or falls, to its direct correspondence with slope:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")

f      = lambda x: x**2
f_prime = lambda x: 2 * x

x_full = np.linspace(-0.5, 2.5, 500)
x0 = 1.0          # point at which we visualise the limit
h_values = [1.0, 0.5, 0.2, 0.05]   # secant lines converging to tangent

# ── Tangent line at x0 ────────────────────────────────────────────────────
slope_at_x0 = f_prime(x0)
tangent = lambda x: f(x0) + slope_at_x0 * (x - x0)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle(
    "First Derivative — Geometric Significance",
    fontsize=13, fontweight="bold", y=1.02
)

# ── Panel 1: Limit definition — secant → tangent ─────────────────────────
ax = axes[0]
ax.plot(x_full, f(x_full), color="#132440", lw=2.5, label="$f(x) = x^2$", zorder=3)

cmap_colors = sns.color_palette("flare", len(h_values))
for h, col in zip(h_values, cmap_colors):
    x1, y1 = x0, f(x0)
    x2, y2 = x0 + h, f(x0 + h)
    secant_slope = (y2 - y1) / (x2 - x1)
    secant_line = lambda x, s=secant_slope: f(x0) + s * (x - x0)
    x_sec = np.linspace(x0 - 0.3, x0 + h + 0.2, 100)
    ax.plot(x_sec, secant_line(x_sec), color=col, lw=1.4, ls="--",
            label=f"h = {h:.2f}  (slope ≈ {secant_slope:.2f})")
    ax.scatter([x2], [y2], color=col, s=45, zorder=4)

# Tangent
x_tan = np.linspace(x0 - 0.6, x0 + 0.6, 100)
ax.plot(x_tan, tangent(x_tan), color="#BF092F", lw=2.2,
        label=f"Tangent h→0 (slope={slope_at_x0:.1f})", zorder=5)
ax.scatter([x0], [f(x0)], color="#BF092F", s=80, zorder=6, ec="white", lw=1.5)
ax.set_title("Limit Definition\nSecant lines → tangent as $h \\to 0$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=7.5, loc="upper left")
ax.set_xlim(-0.3, 2.4); ax.set_ylim(-0.3, 6)

# ── Panel 2: f(x) = sin(x) — sign of f'(x) drives rise/fall ─────────────
ax = axes[1]
g       = np.sin
g_prime = np.cos
x_trig  = np.linspace(0, 3 * np.pi, 600)

ax.plot(x_trig, g(x_trig), color="#132440", lw=2.5, label="$f(x) = \\sin x$")
ax.axhline(0, color="gray", lw=0.7, ls="--", alpha=0.5)

ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) > 0),
                alpha=0.18, color="#16476A", label="$f'(x) > 0$  (rising ↑)")
ax.fill_between(x_trig, g(x_trig), 0, where=(g_prime(x_trig) < 0),
                alpha=0.18, color="#BF092F", label="$f'(x) < 0$  (falling ↓)")

# Mark zeros of f'
crit_x = np.array([np.pi/2, 3*np.pi/2, 5*np.pi/2])
crit_y = g(crit_x)
crit_labels = ["Max\n$f'=0$", "Min\n$f'=0$", "Max\n$f'=0$"]
crit_colors = ["#16476A", "#BF092F", "#16476A"]
for xi, yi, lbl, col in zip(crit_x, crit_y, crit_labels, crit_colors):
    ax.scatter(xi, yi, color=col, s=70, zorder=5, ec="white", lw=1.3)
    ax.annotate(lbl, (xi, yi), textcoords="offset points",
                xytext=(6, 8), fontsize=8, color=col)

ax.set_title("Sign of $f'(x)$ — Where Does $f$ Rise or Fall?")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.set_xticks([0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi, 5*np.pi/2, 3*np.pi])
ax.set_xticklabels(["0", "$\\pi/2$", "$\\pi$", "$3\\pi/2$", "$2\\pi$", "$5\\pi/2$", "$3\\pi$"],
                   fontsize=8)
ax.legend(fontsize=8, loc="lower left")

# ── Panel 3: f(x) and f'(x) side-by-side — slope = derivative value ──────
ax = axes[2]
x_p = np.linspace(-2.5, 2.5, 500)
p   = lambda x: x**3 / 3 - x          # chosen so f' = x^2 - 1 (clear zeros)
dp  = lambda x: x**2 - 1

line1, = ax.plot(x_p, p(x_p), color="#132440", lw=2.5, label="$f(x) = x^3/3 - x$")
ax2_twin = ax.twinx()
line2, = ax2_twin.plot(x_p, dp(x_p), color="#3B9797", lw=2.0, ls="-.",
                       label="$f'(x) = x^2 - 1$")
ax2_twin.axhline(0, color="#3B9797", lw=0.6, ls=":", alpha=0.4)

# Annotate where f'=0 → extrema
for xi in [-1.0, 1.0]:
    ax.axvline(xi, color="#BF092F", lw=1.0, ls=":", alpha=0.6)
    ax.scatter(xi, p(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)
    ax2_twin.scatter(xi, dp(xi), color="#BF092F", s=70, zorder=5, ec="white", lw=1.3)

ax.set_title("$f$ vs $f'$ — Slope of $f$ equals value of $f'$")
ax.set_xlabel("$x$")
ax.set_ylabel("$f(x)$", color="#132440")
ax2_twin.set_ylabel("$f'(x)$", color="#3B9797")

lines = [line1, line2]
ax.legend(lines, [l.get_label() for l in lines], fontsize=8, loc="upper center")

plt.tight_layout()
plt.show()

# ── Console: convergence of secant slope to true derivative ───────────────
print(f"Convergence of secant slope to f'({x0}) = {f_prime(x0):.1f}  [f(x)=x²]")
print(f"{'h':>10}  {'Secant slope':>14}  {'Error':>10}")
print("-" * 40)
for h in [1.0, 0.5, 0.1, 0.01, 0.001, 1e-6]:
    sec = (f(x0 + h) - f(x0)) / h
    err = abs(sec - f_prime(x0))
    print(f"{h:>10.6f}  {sec:>14.8f}  {err:>10.2e}")

Power Rule

The power rule is the most frequently used differentiation rule:

$$\frac{d}{dx} x^n = nx^{n-1} \quad \text{for any real } n$$

Examples: $\frac{d}{dx} x^3 = 3x^2$; $\frac{d}{dx} x^{1/2} = \frac{1}{2}x^{-1/2}$; $\frac{d}{dx} x^{-1} = -x^{-2}$; $\frac{d}{dx} 5 = 0$ (constants vanish).

Companion rules: $\frac{d}{dx}[cf(x)] = c f'(x)$ (constant multiple) and $\frac{d}{dx}[f(x) \pm g(x)] = f'(x) \pm g'(x)$ (sum rule).
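As a quick sanity check, a small sketch reusing the central-difference idea from above to compare the power rule and its companion rules against numerical derivatives at an arbitrary point:

import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 1.7   # arbitrary test point

# Power rule: d/dx x^n = n x^(n-1)
for n in [3, 0.5, -1]:
    num = numerical_deriv(lambda x: x**n, x0)
    ana = n * x0**(n - 1)
    print(f"d/dx x^{n}:  numerical={num:.6f}, power rule={ana:.6f}")

# Constant-multiple + sum rules: d/dx [4x^3 - 7x] = 12x^2 - 7
num = numerical_deriv(lambda x: 4*x**3 - 7*x, x0)
ana = 12*x0**2 - 7
print(f"d/dx [4x^3 - 7x]:  numerical={num:.6f}, analytical={ana:.6f}")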

Second Derivatives

The second derivative $f''(x)$ measures the rate of change of the slope — the curvature of the function:

$$f''(x) = \frac{d^2f}{dx^2} = \frac{d}{dx}\!\left[\frac{df}{dx}\right]$$

Second derivatives classify critical points where $f'(x^*) = 0$:

  • $f''(x^*) \gt 0$: local minimum (concave up — function curves like a bowl)
  • $f''(x^*) \lt 0$: local maximum (concave down — function curves like a hill)
  • $f''(x^*) = 0$: inconclusive — the point may still be a minimum ($x^4$), a maximum ($-x^4$), or an inflection point ($x^3$); higher-order derivatives are needed

The visualization below plots $f(x) = x^3 - 3x$ alongside its first and second derivatives, showing how the sign of $f''$ at each critical point determines concavity:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid", palette="muted")

# ── Function and its derivatives ──────────────────────────────────────────
def f(x):        return x**3 - 3*x         # local max at x=-1, local min at x=1
def f_prime(x):  return 3*x**2 - 3         # zero at x = ±1
def f_double(x): return 6*x                # sign determines concavity

x = np.linspace(-2.5, 2.5, 500)

# Critical points where f'(x*) = 0
x_crit = np.array([-1.0, 1.0])
y_crit = f(x_crit)
y2_crit = f_double(x_crit)
labels = ["Local Max\n$f''(-1) = -6 < 0$", "Local Min\n$f''(1) = +6 > 0$"]
colors = ["#BF092F", "#16476A"]   # crimson = max, navy-blue = min

fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=False)
fig.suptitle(
    "Second Derivative Test: Curvature & Critical Point Classification",
    fontsize=13, fontweight="bold", y=1.02
)

# ── Panel 1: f(x) with concavity shading ──────────────────────────────────
ax = axes[0]
ax.plot(x, f(x), color="#132440", lw=2.5, label="$f(x) = x^3 - 3x$")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) > 0),
                color="#16476A", label="Concave up  ($f'' > 0$)")
ax.fill_between(x, f(x), alpha=0.12, where=(f_double(x) < 0),
                color="#BF092F", label="Concave down ($f'' < 0$)")
ax.axhline(0, color="gray", lw=0.8, ls="--", alpha=0.5)
ax.axvline(0, color="gray", lw=0.8, ls=":", alpha=0.4, label="Inflection ($x=0$)")

for xi, yi, lbl, col in zip(x_crit, y_crit, labels, colors):
    ax.scatter(xi, yi, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.annotate(lbl, (xi, yi), textcoords="offset points",
                xytext=(14, 8), fontsize=8.5, color=col,
                arrowprops=dict(arrowstyle="->", color=col, lw=1.2))

ax.set_title("$f(x) = x^3 - 3x$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f(x)$")
ax.legend(fontsize=8, loc="upper left")
ax.set_ylim(-3.5, 3.5)

# ── Panel 2: f'(x) — slope (first derivative) ─────────────────────────────
ax = axes[1]
ax.plot(x, f_prime(x), color="#3B9797", lw=2.5, label="$f'(x) = 3x^2 - 3$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) > 0),
                alpha=0.10, color="#3B9797", label="$f' > 0$ (increasing)")
ax.fill_between(x, f_prime(x), 0, where=(f_prime(x) < 0),
                alpha=0.10, color="#BF092F", label="$f' < 0$ (decreasing)")

for xi, col in zip(x_crit, colors):
    ax.scatter(xi, 0, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.axvline(xi, color=col, lw=1.0, ls=":", alpha=0.6)

ax.set_title("First Derivative $f'(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f'(x)$")
ax.legend(fontsize=8, loc="upper center")

# ── Panel 3: f''(x) — curvature (second derivative) ───────────────────────
ax = axes[2]
ax.plot(x, f_double(x), color="#B8860B", lw=2.5, label="$f''(x) = 6x$")
ax.axhline(0, color="black", lw=0.8, ls="--", alpha=0.6)
ax.fill_between(x, f_double(x), 0, where=(f_double(x) > 0),
                alpha=0.15, color="#16476A", label="$f'' > 0$ → concave up")
ax.fill_between(x, f_double(x), 0, where=(f_double(x) < 0),
                alpha=0.15, color="#BF092F", label="$f'' < 0$ → concave down")

for xi, y2, lbl, col in zip(x_crit, y2_crit, labels, colors):
    ax.scatter(xi, y2, zorder=5, color=col, s=90, ec="white", lw=1.5)
    ax.annotate(f"$f''({xi:.0f}) = {y2:.0f}$", (xi, y2),
                textcoords="offset points", xytext=(8, 10),
                fontsize=8.5, color=col,
                arrowprops=dict(arrowstyle="->", color=col, lw=1.2))

ax.set_title("Second Derivative $f''(x)$")
ax.set_xlabel("$x$"); ax.set_ylabel("$f''(x)$")
ax.legend(fontsize=8, loc="upper left")

plt.tight_layout()
plt.show()

# ── Console summary ────────────────────────────────────────────────────────
print("Second Derivative Test — Critical Point Classification")
print("=" * 54)
for xi, yi, y2 in zip(x_crit, y_crit, y2_crit):
    sign  = ">" if y2 > 0 else "<"
    kind  = "Local Minimum  (concave up ↑)" if y2 > 0 else "Local Maximum  (concave down ↓)"
    print(f"  x = {xi:+.1f}:  f(x) = {yi:+.1f},  f''(x) = {y2:+.0f} {sign} 0  →  {kind}")
Hessian matrix in ML: The multivariate analogue of $f''$ is the Hessian $H$, where $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. A positive definite Hessian at a critical point confirms a local minimum. Large Hessian eigenvalues indicate sharp curvature (fast loss change); small eigenvalues indicate flat regions. Second-order optimizers use Hessian information, but merely storing $H$ costs $O(n^2)$ memory and solving with it costs $O(n^3)$ time per step — infeasible for large networks, motivating quasi-Newton approximations such as L-BFGS.

Special Derivatives Table

Core derivatives every ML practitioner should know:

| $f(x)$ | $f'(x)$ | ML Relevance |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Polynomial loss surfaces |
| $e^x$ | $e^x$ | Softmax numerator; unchanged by differentiation |
| $\ln x$ | $1/x$ | Cross-entropy: $-\frac{d}{d\hat{y}}\log\hat{y} = -1/\hat{y}$ |
| $\sin x$ | $\cos x$ | Positional encoding (Transformer) |
| $\cos x$ | $-\sin x$ | Positional encoding (Transformer) |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation; max gradient = 0.25 → vanishing gradient risk |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | LSTM/GRU gates; gradient $\in (-1,1)$ |
| $\text{ReLU}(x) = \max(0,x)$ | $0$ if $x \lt 0$, $1$ if $x \gt 0$ (undefined at $x = 0$) | Default hidden activation; dead neurons when inputs stay in the $x \lt 0$ region |
import numpy as np

def numerical_deriv(f, x, h=1e-6):
    """Central difference derivative — O(h^2) accuracy."""
    return (f(x + h) - f(x - h)) / (2 * h)

test_x = 1.0

functions = {
    "e^x":      (np.exp,                       lambda x: np.exp(x)),
    "ln(x)":    (np.log,                        lambda x: 1/x),
    "sin(x)":   (np.sin,                        np.cos),
    "cos(x)":   (np.cos,                        lambda x: -np.sin(x)),
    "sigma(x)": (lambda x: 1/(1+np.exp(-x)),    lambda x: (s := 1/(1+np.exp(-x))) * (1 - s)),
    "tanh(x)":  (np.tanh,                       lambda x: 1 - np.tanh(x)**2),
    "ReLU(x)":  (lambda x: np.maximum(0, x),    lambda x: float(x > 0)),
}

print(f"{'Function':12s}  {'Numerical':12s}  {'Analytical':12s}  {'Match':5s}")
print("-" * 52)
for name, (f, df) in functions.items():
    num = numerical_deriv(f, test_x)
    ana = df(test_x)
    match = "✓" if abs(num - ana) < 1e-5 else "✗"
    print(f"{name:12s}  {num:12.6f}  {ana:12.6f}  {match}")

Differentiation Rules

Three rules — chain, product, quotient — unlock the derivatives of all composite and combined functions. Every backpropagation computation ultimately applies the chain rule.

Chain Rule

For a composition $y = f(g(x))$, the chain rule states:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}, \quad \text{where } u = g(x)$$

Or equivalently: $(f \circ g)'(x) = f'(g(x)) \cdot g'(x)$.

Chain rule in backpropagation: A neural network is a deep composition of functions: $\hat{y} = f_L(f_{L-1}(\ldots f_1(\mathbf{x})))$. Backpropagation is the chain rule applied systematically layer by layer, accumulating the product of local Jacobians from output to input. Each loss.backward() call is the chain rule running in reverse.
import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Chain rule: d/dx [sin(x^2)] = cos(x^2) * 2x
f_comp       = lambda x: np.sin(x**2)
df_chain     = lambda x: np.cos(x**2) * 2*x   # outer' * inner'

print("Chain Rule: d/dx [sin(x^2)] = cos(x^2) * 2x")
for x in [0.5, 1.0, 2.0]:
    num = numerical_deriv(f_comp, x)
    ana = df_chain(x)
    print(f"  x={x}: numerical={num:.6f}, chain rule={ana:.6f}")

# Chain rule in a single neuron: y = sigma(wx + b), dy/dw = sigma'(z)*x
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x = 0.5, 0.1, 2.0
z = w * x + b
dy_dw_chain     = sigmoid(z) * (1 - sigmoid(z)) * x
dy_dw_numerical = numerical_deriv(lambda w_: sigmoid(w_*x + b), w)
print(f"\nNeuron y=sigma(wx+b), dy/dw at w={w}:")
print(f"  Chain rule:  {dy_dw_chain:.6f}")
print(f"  Numerical:   {dy_dw_numerical:.6f}")

Product Rule

For $y = u(x) \cdot v(x)$, the product rule states:

$$\frac{d}{dx}[u \cdot v] = u'v + uv'$$

Mnemonic: "derivative of first times second, plus first times derivative of second."

Example: $\frac{d}{dx}[x^2 \sin x] = 2x \sin x + x^2 \cos x$.

import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Product rule: d/dx [x^2 * sin(x)]
u  = lambda x: x**2;        up = lambda x: 2*x
v  = lambda x: np.sin(x);   vp = lambda x: np.cos(x)

f  = lambda x: u(x) * v(x)
df_product = lambda x: up(x)*v(x) + u(x)*vp(x)   # product rule

print("Product Rule: d/dx [x^2 * sin(x)] = 2x*sin(x) + x^2*cos(x)")
for x in [1.0, 2.0, np.pi/3]:
    num = numerical_deriv(f, x)
    ana = df_product(x)
    print(f"  x={x:.4f}: numerical={num:.6f}, product rule={ana:.6f}")

Quotient Rule

For $y = u(x)\,/\,v(x)$, the quotient rule states:

$$\frac{d}{dx}\!\left[\frac{u}{v}\right] = \frac{u'v - uv'}{v^2}$$

Mnemonic: "lo d-hi minus hi d-lo, over lo-squared" (hi = $u$, lo = $v$).

Classic example — derivative of $\tan x = \sin x\,/\,\cos x$:

$$\frac{d}{dx}\tan x = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \sec^2 x$$

ML application: the softmax $\hat{p}_k = e^{z_k}/\sum_j e^{z_j}$ differentiated w.r.t. $z_k$ uses the quotient rule, yielding $\hat{p}_k(1-\hat{p}_k)$ for the diagonal Jacobian entries.

import numpy as np

def numerical_deriv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# Quotient rule: d/dx [sin(x) / (1 + x^2)]
u  = lambda x: np.sin(x);   up = lambda x: np.cos(x)
v  = lambda x: 1 + x**2;    vp = lambda x: 2*x

f  = lambda x: u(x) / v(x)
df_quotient = lambda x: (up(x)*v(x) - u(x)*vp(x)) / v(x)**2

print("Quotient Rule: d/dx [sin(x)/(1+x^2)]")
for x in [0.0, 1.0, 2.0]:
    num = numerical_deriv(f, x)
    ana = df_quotient(x)
    print(f"  x={x:.1f}: numerical={num:.6f}, quotient rule={ana:.6f}")

# Classic: d/dx [tan(x)] = sec^2(x)
f_tan = np.tan
df_tan = lambda x: 1.0 / np.cos(x)**2
print("\nQuotient Rule: d/dx [tan(x)] = sec^2(x)")
for x in [0.0, 0.5, 1.0]:
    num = numerical_deriv(f_tan, x)
    ana = df_tan(x)
    print(f"  x={x:.1f}: numerical={num:.6f}, sec^2(x)={ana:.6f}")

Partial Derivatives & Gradients

For $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative $\frac{\partial f}{\partial x_i}$ measures how $f$ changes when we vary $x_i$ while holding all other variables fixed.

The gradient $\nabla f$ is the vector of all partial derivatives:

$$\nabla f(\mathbf{x}) = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

Key property: $\nabla f(\mathbf{x})$ points in the direction of steepest ascent of $f$ at $\mathbf{x}$. Its negation $-\nabla f(\mathbf{x})$ points toward steepest descent — the direction gradient descent follows.
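A minimal sketch of both ideas on the arbitrary example $f(x, y) = x^2 + 3y^2$: per-coordinate finite differences recover the analytical gradient $(2x, 6y)$, and a step along $-\nabla f$ lowers $f$ more than a random direction of the same length:

import numpy as np

def f(v):
    x, y = v
    return x**2 + 3*y**2

def numerical_gradient(f, v, h=1e-6):
    """Partial derivatives: perturb one coordinate at a time (central difference)."""
    grad = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        e = np.zeros_like(v, dtype=float)
        e[i] = h
        grad[i] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

v0 = np.array([1.0, -2.0])
g_num = numerical_gradient(f, v0)
g_ana = np.array([2*v0[0], 6*v0[1]])
print(f"Numerical gradient:   {g_num}")
print(f"Analytical (2x, 6y):  {g_ana}")

# -gradient is the steepest-descent direction: compare with a random unit direction
np.random.seed(0)
step = 0.1
descent_dir = -g_num / np.linalg.norm(g_num)
rand_dir = np.random.randn(2)
rand_dir /= np.linalg.norm(rand_dir)
print(f"f(v0)                     = {f(v0):.4f}")
print(f"f along -gradient         = {f(v0 + step*descent_dir):.4f}")
print(f"f along random direction  = {f(v0 + step*rand_dir):.4f}")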

Jacobian & Hessian

For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian $J \in \mathbb{R}^{m \times n}$ is the matrix of all first-order partial derivatives:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Hessian $H \in \mathbb{R}^{n \times n}$ is the matrix of all second-order partial derivatives of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$:

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

The Hessian encodes the local curvature of the loss landscape. Positive definite $H$ means local convexity (local minimum). In practice, Hessians are too large to compute for deep networks ($O(n^2)$ entries), so second-order methods like L-BFGS use approximations.
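A small numerical sketch (both test functions are arbitrary choices): a finite-difference Jacobian of the vector-valued $f(x,y) = (x^2 y,\ \sin x + y)$, a finite-difference Hessian of the scalar $g(x,y) = x^2 + xy + 2y^2$, and an eigenvalue check for positive definiteness:

import numpy as np

def f(v):                      # f: R^2 -> R^2
    x, y = v
    return np.array([x**2 * y, np.sin(x) + y])

def g(v):                      # g: R^2 -> R
    x, y = v
    return x**2 + x*y + 2*y**2

def jacobian(f, v, h=1e-6):
    """J[i, j] = d f_i / d x_j via central differences."""
    m, n = len(f(v)), len(v)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = h
        J[:, j] = (f(v + e) - f(v - e)) / (2 * h)
    return J

def hessian(g, v, h=1e-4):
    """H[i, j] = d^2 g / dx_i dx_j via nested central differences."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (g(v + ei + ej) - g(v + ei - ej)
                       - g(v - ei + ej) + g(v - ei - ej)) / (4 * h**2)
    return H

v0 = np.array([1.0, 2.0])
print("Jacobian of f at (1, 2)  (analytical [[2xy, x^2], [cos x, 1]]):")
print(np.round(jacobian(f, v0), 4))

H = hessian(g, v0)
print("\nHessian of g  (analytical [[2, 1], [1, 4]]):")
print(np.round(H, 4))
eigvals = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {np.round(eigvals, 4)}  -> positive definite: {np.all(eigvals > 0)}")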

Gradient Descent

Gradient descent iteratively minimizes a differentiable function $\mathcal{L}(\theta)$ by moving parameters in the direction of steepest descent:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta \mathcal{L}(\theta_t)$$

where $\alpha \gt 0$ is the learning rate. Key variants:

  • Batch gradient descent: gradient over entire dataset — stable but slow for large $N$
  • Stochastic gradient descent (SGD): gradient over a single random sample — noisy but fast updates
  • Mini-batch SGD: gradient over a batch of $B$ samples — balances speed and stability; standard in practice
import numpy as np

def gradient_descent(f, grad_f, x_init, lr=0.1, max_iters=100):
    """
    Minimize f using gradient descent.
    f: objective function
    grad_f: gradient function
    x_init: starting point (scalar or array)
    """
    x = np.array(x_init, dtype=float)
    history = [x.copy()]
    for _ in range(max_iters):
        g = grad_f(x)
        x = x - lr * g
        history.append(x.copy())
    return x, history

# Example: minimize f(x) = (x - 3)^2
f      = lambda x: (x - 3)**2
grad_f = lambda x: 2 * (x - 3)

x_opt, history = gradient_descent(f, grad_f, x_init=-2.0, lr=0.1, max_iters=50)

print(f"Gradient Descent on f(x) = (x-3)^2:")
print(f"  Start:    x = -2.0,   f = {f(-2.0):.4f}")
print(f"  Optimal:  x = {x_opt:.6f}, f = {f(x_opt):.8f}")
print(f"  True min: x = 3.0,    f = 0.0")

# Show convergence trajectory (first 10 steps)
print("\nFirst 10 steps:")
for i, x_i in enumerate(history[:11]):
    print(f"  iter {i:2d}: x = {float(x_i):8.5f}, f(x) = {f(float(x_i)):.6f}")

Modern Optimizers

From SGD to Adam: A Progression

Momentum accumulates a velocity vector in directions of persistent gradient: $$v_{t+1} = \beta v_t + (1-\beta)\nabla_\theta \mathcal{L} \quad \theta_{t+1} = \theta_t - \alpha v_{t+1}$$

RMSProp adapts learning rate per-parameter using exponential moving average of squared gradients: $$s_{t+1} = \beta s_t + (1-\beta)(\nabla\mathcal{L})^2 \quad \theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{s_{t+1}+\epsilon}}\nabla\mathcal{L}$$

Adam combines momentum + RMSProp with bias correction:

$$m_{t+1} = \beta_1 m_t + (1-\beta_1)\nabla\mathcal{L}$$ $$v_{t+1} = \beta_2 v_t + (1-\beta_2)(\nabla\mathcal{L})^2$$ $$\hat{m} = \frac{m_{t+1}}{1-\beta_1^{t+1}}, \quad \hat{v} = \frac{v_{t+1}}{1-\beta_2^{t+1}}$$ $$\theta_{t+1} = \theta_t - \frac{\alpha\hat{m}}{\sqrt{\hat{v}}+\epsilon}$$

Defaults: $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$, $\alpha=10^{-3}$. Adam is the default optimizer for most deep learning applications.
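The three update rules can be compared directly. A compact sketch applying SGD, momentum, and Adam exactly as written above to the same ill-conditioned quadratic (the function, learning rate, and step count are arbitrary illustration choices):

import numpy as np

f      = lambda p: p[0]**2 + 25 * p[1]**2           # ill-conditioned bowl
grad_f = lambda p: np.array([2 * p[0], 50 * p[1]])

def run(optimizer, steps=500, lr=0.01):
    p = np.array([2.0, 1.0])
    m = np.zeros(2); v = np.zeros(2)                 # state for momentum / Adam
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = grad_f(p)
        if optimizer == "sgd":
            p = p - lr * g
        elif optimizer == "momentum":
            m = b1 * m + (1 - b1) * g                # velocity (EMA of gradients)
            p = p - lr * m
        elif optimizer == "adam":
            m = b1 * m + (1 - b1) * g                # first moment
            v = b2 * v + (1 - b2) * g**2             # second moment
            m_hat = m / (1 - b1**t)                  # bias correction
            v_hat = v / (1 - b2**t)
            p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, f(p)

for name in ["sgd", "momentum", "adam"]:
    p, loss = run(name)
    print(f"{name:9s}: final point = {np.round(p, 6)},  f = {loss:.2e}")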

Backpropagation

Backpropagation efficiently computes $\nabla_\theta \mathcal{L}$ for all parameters in a neural network using the chain rule. For a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$:

$$\frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \frac{\partial a_L}{\partial a_{L-1}} \cdots \frac{\partial a_{k+1}}{\partial a_k} \cdot \frac{\partial a_k}{\partial \theta_k}$$
Backpropagation: Chain Rule Through a Two-Layer Network
flowchart LR
    X["Input x"] -->|"W1, b1"| H["Hidden h = σ(W1x + b1)"]
    H -->|"W2, b2"| Y["Output ŷ = W2h + b2"]
    Y --> L["Loss L(ŷ, y)"]
    L -->|"∂L/∂ŷ = backward pass"| Y
    Y -->|"∂L/∂W2 = (∂L/∂ŷ) h^T"| W2["∇W2"]
    Y -->|"∂L/∂h = W2^T (∂L/∂ŷ)"| H
    H -->|"∂L/∂W1 = (∂L/∂h ⊙ σ') x^T"| W1["∇W1"]
            
import numpy as np

# Implement backpropagation for a two-layer network (one hidden layer) on the classic XOR problem

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

np.random.seed(0)
# Tiny dataset: XOR-like
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # (4, 2)
y = np.array([[0], [1], [1], [0]], dtype=float)               # (4, 1)

# Initialize weights
W1 = np.random.randn(2, 4) * 0.5   # (2, 4) — input to hidden
b1 = np.zeros((1, 4))               # (1, 4)
W2 = np.random.randn(4, 1) * 0.5   # (4, 1) — hidden to output
b2 = np.zeros((1, 1))               # (1, 1)

lr = 0.5
losses = []

for epoch in range(3000):
    # Forward pass
    z1 = X @ W1 + b1         # (4, 4)
    a1 = sigmoid(z1)          # (4, 4)
    z2 = a1 @ W2 + b2         # (4, 1)
    a2 = sigmoid(z2)          # (4, 1) — output

    # MSE loss
    loss = np.mean((a2 - y)**2)
    losses.append(loss)

    # Backward pass (chain rule)
    dL_da2 = 2 * (a2 - y) / len(y)         # (4, 1)
    dL_dz2 = dL_da2 * sigmoid_deriv(z2)    # (4, 1)
    dL_dW2 = a1.T @ dL_dz2                  # (4, 1)
    dL_db2 = dL_dz2.sum(axis=0, keepdims=True)

    dL_da1 = dL_dz2 @ W2.T                  # (4, 4)
    dL_dz1 = dL_da1 * sigmoid_deriv(z1)    # (4, 4)
    dL_dW1 = X.T @ dL_dz1                   # (2, 4)
    dL_db1 = dL_dz1.sum(axis=0, keepdims=True)

    # Gradient updates
    W2 -= lr * dL_dW2
    b2 -= lr * dL_db2
    W1 -= lr * dL_dW1
    b1 -= lr * dL_db1

print("Backprop training on XOR:")
print(f"  Epoch   0: loss = {losses[0]:.6f}")
print(f"  Epoch 500: loss = {losses[500]:.6f}")
print(f"  Epoch 2999: loss = {losses[2999]:.6f}")
print(f"\nFinal predictions (should be ~0, 1, 1, 0):")
print(np.round(a2, 3).flatten())

Convexity & Loss Landscapes

A function $f$ is convex if for all $\mathbf{x}, \mathbf{y}$ and $t \in [0,1]$:

$$f(t\mathbf{x} + (1-t)\mathbf{y}) \leq t f(\mathbf{x}) + (1-t)f(\mathbf{y})$$

For convex functions, any local minimum is the global minimum — gradient descent is guaranteed to converge. Linear regression's MSE loss is convex. Cross-entropy loss for logistic regression is convex in the weights.
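The definition can be probed numerically: a small sketch sampling random pairs and interpolation weights, counting violations of the inequality for a convex function ($x^2$) and for an arbitrarily chosen non-convex one:

import numpy as np

def convexity_violations(f, n_trials=10_000, seed=0):
    """Count violations of f(tx + (1-t)y) <= t f(x) + (1-t) f(y) on random samples."""
    rng = np.random.default_rng(seed)
    violations = 0
    for _ in range(n_trials):
        x, y = rng.uniform(-5, 5, size=2)
        t = rng.uniform(0, 1)
        lhs = f(t * x + (1 - t) * y)
        rhs = t * f(x) + (1 - t) * f(y)
        if lhs > rhs + 1e-12:
            violations += 1
    return violations

convex_f    = lambda x: x**2                        # convex
nonconvex_g = lambda x: np.sin(3 * x) + 0.1 * x**2  # many local minima

print(f"x^2               violations: {convexity_violations(convex_f)} / 10000")
print(f"sin(3x) + 0.1x^2  violations: {convexity_violations(nonconvex_g)} / 10000")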

Neural network loss landscapes are non-convex, containing numerous local minima, saddle points, and flat regions (plateaus). Practical observations:
  • Saddle points are more common than local minima in high dimensions — gradient descent can stall but usually escapes
  • Good local minima in overparameterized networks tend to generalize well (not obvious why — active research)
  • Learning rate matters enormously: too large diverges, too small is slow; warm-up + decay schedules help
  • Batch normalization smooths the loss landscape, improving gradient flow and enabling larger learning rates

Integration

Integration is the inverse of differentiation — it accumulates infinitesimal changes to compute totals. In ML, integrals appear in expected values, normalizing constants, and the probabilistic foundations underlying many algorithms.

Introduction to Integration

The antiderivative of $f(x)$ is a function $F(x)$ such that $F'(x) = f(x)$. Integration reverses differentiation:

$$\int f(x)\,dx = F(x) + C$$

where $C$ is the constant of integration (arbitrary, since constants vanish under differentiation). This is the indefinite integral.

Geometric interpretation: The definite integral $\int_a^b f(x)\,dx$ equals the signed area between the curve $y = f(x)$ and the x-axis from $a$ to $b$. Areas above the x-axis contribute positively, areas below negatively. Total (unsigned) area uses $\int_a^b |f(x)|\,dx$.
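A short sketch of the signed-area point: over $[0, 2\pi]$ the positive and negative lobes of $\sin x$ cancel, while the unsigned area $\int_0^{2\pi}|\sin x|\,dx = 4$:

import numpy as np
from scipy import integrate

# Signed area: the positive and negative lobes of sin cancel over [0, 2pi]
signed, _ = integrate.quad(np.sin, 0, 2 * np.pi)

# Total (unsigned) area: integrate |sin(x)|, telling quad about the kink at pi
unsigned, _ = integrate.quad(lambda x: np.abs(np.sin(x)), 0, 2 * np.pi, points=[np.pi])

print(f"Signed area   of sin(x) on [0, 2pi]: {signed:.8f}   (lobes cancel -> 0)")
print(f"Unsigned area of sin(x) on [0, 2pi]: {unsigned:.8f}   (analytical: 4)")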

Indefinite Integration I — Basic Rules

| Rule | Formula | Example |
|---|---|---|
| Power rule | $\int x^n\,dx = \frac{x^{n+1}}{n+1} + C\;(n \neq -1)$ | $\int x^4\,dx = \frac{x^5}{5} + C$ |
| Constant | $\int k\,dx = kx + C$ | $\int 5\,dx = 5x + C$ |
| Sum/Difference | $\int [f \pm g]\,dx = \int f\,dx \pm \int g\,dx$ | $\int (x^2 + 3)\,dx = \frac{x^3}{3} + 3x + C$ |
| Constant multiple | $\int kf(x)\,dx = k\int f(x)\,dx$ | $\int 4x^3\,dx = x^4 + C$ |
| Reciprocal | $\int \frac{1}{x}\,dx = \ln\lvert x\rvert + C$ | Arises in cross-entropy gradient |
import numpy as np
from scipy import integrate

# Power rule: integral of x^4 from 0 to 1 = 1/5 = 0.2
result, _ = integrate.quad(lambda x: x**4, 0, 1)
print(f"integral x^4 from 0 to 1")
print(f"  Numerical:  {result:.8f}")
print(f"  Analytical (1/5): {0.2:.8f}")

# Sum rule: integral of (x^2 + 3) from 0 to 2 = x^3/3 + 3x from 0 to 2
# = (8/3 + 6) - 0 = 26/3
result2, _ = integrate.quad(lambda x: x**2 + 3, 0, 2)
print(f"\nintegral (x^2+3) from 0 to 2")
print(f"  Numerical:  {result2:.8f}")
print(f"  Analytical (26/3): {26/3:.8f}")

# Reciprocal: integral 1/x from 1 to e = ln(e) - ln(1) = 1
result3, _ = integrate.quad(lambda x: 1/x, 1, np.e)
print(f"\nintegral 1/x from 1 to e")
print(f"  Numerical:  {result3:.8f}")
print(f"  Analytical (ln e = 1): {1.0:.8f}")

Indefinite Integration II — Special Functions

Antiderivatives of exponential, logarithmic, and trigonometric functions appearing in ML:

| $f(x)$ | Antiderivative $F(x) + C$ | ML Use |
|---|---|---|
| $e^x$ | $e^x + C$ | Softmax; exponential smoothing |
| $e^{ax}$ | $\frac{1}{a}e^{ax} + C$ | Laplace distribution log-likelihood |
| $\ln x$ | $x\ln x - x + C$ | Entropy; integration by parts result |
| $\sin x$ | $-\cos x + C$ | Positional encoding; Fourier features |
| $\cos x$ | $\sin x + C$ | Positional encoding; Fourier features |
| $\frac{1}{1+x^2}$ | $\arctan x + C$ | Cauchy distribution; RKHS kernels |
| $e^{-x^2/2}$ | $\sqrt{2\pi}\,\Phi(x) + C$ | Gaussian normalization; GELU activation |
import numpy as np
from scipy import integrate

# e^x: integral from 0 to 2 = e^2 - e^0 = e^2 - 1
result, _ = integrate.quad(np.exp, 0, 2)
print(f"integral e^x from 0 to 2")
print(f"  Numerical:  {result:.8f}")
print(f"  Analytical (e^2 - 1): {np.e**2 - 1:.8f}")

# sin(x): integral from 0 to pi = [-cos(x)] from 0 to pi = 1 + 1 = 2
result2, _ = integrate.quad(np.sin, 0, np.pi)
print(f"\nintegral sin(x) from 0 to pi")
print(f"  Numerical:  {result2:.8f}")
print(f"  Analytical: {2.0:.8f}")

# ln(x): integral from 1 to e = [x*ln(x) - x] from 1 to e = (e - e) - (0 - 1) = 1
result3, _ = integrate.quad(np.log, 1, np.e)
print(f"\nintegral ln(x) from 1 to e")
print(f"  Numerical:  {result3:.8f}")
print(f"  Analytical: {1.0:.8f}")

# Gaussian normalization: integral of N(0,1) from -inf to +inf = 1
result4, _ = integrate.quad(lambda x: np.exp(-x**2/2) / np.sqrt(2*np.pi), -np.inf, np.inf)
print(f"\nintegral of N(0,1) pdf from -inf to +inf")
print(f"  Numerical:  {result4:.8f}")
print(f"  Analytical: {1.0:.8f}")

Definite Integration I — Fundamental Theorem

The Fundamental Theorem of Calculus links differentiation and integration. Part 2 (evaluation formula):

$$\int_a^b f(x)\,dx = F(b) - F(a)$$

where $F$ is any antiderivative of $f$. The constant $C$ cancels, so any antiderivative works.

Example — FTC Step-by-Step
Evaluating $\int_1^4 (3x^2 - 2x + 1)\,dx$

Step 1 — Find antiderivative: $F(x) = x^3 - x^2 + x$

Step 2 — Evaluate: $F(4) - F(1) = (64 - 16 + 4) - (1 - 1 + 1) = 52 - 1 = 51$

import numpy as np
from scipy import integrate

# Fundamental Theorem: integral of (3x^2 - 2x + 1) from 1 to 4
f = lambda x: 3*x**2 - 2*x + 1
F = lambda x: x**3 - x**2 + x   # antiderivative

a, b = 1, 4
ftc_result = F(b) - F(a)
print(f"FTC: F({b}) - F({a}) = {F(b)} - {F(a)} = {ftc_result}")

# Numerical quadrature confirms the result
numerical, error = integrate.quad(f, a, b)
print(f"Numerical (scipy.integrate.quad): {numerical:.8f}")
print(f"Match: {abs(ftc_result - numerical) < 1e-9}")

# Properties of definite integrals
print("\nKey properties:")

# 1. Linearity
r1, _ = integrate.quad(lambda x: 2*x**2 + 3*np.sin(x), 0, np.pi)
r2 = 2*integrate.quad(lambda x: x**2, 0, np.pi)[0] + 3*integrate.quad(np.sin, 0, np.pi)[0]
print(f"  Linearity: int(2x^2+3sin) = {r1:.6f}, split = {r2:.6f}, match = {abs(r1-r2)<1e-9}")

# 2. Reverse limits: int_a^b f = -int_b^a f
r3, _ = integrate.quad(f, a, b)
r4, _ = integrate.quad(f, b, a)
print(f"  Reverse limits: int_1^4 = {r3:.4f}, int_4^1 = {r4:.4f}, sum = {r3+r4:.4f}")

# 3. ML use: E[X^2] for X ~ N(0,1) = 1 (the variance)
ev_x2, _ = integrate.quad(lambda x: x**2 * np.exp(-x**2/2)/np.sqrt(2*np.pi), -10, 10)
print(f"  E[X^2] for X~N(0,1) = {ev_x2:.8f}  (expected: 1.0 = variance)")

Definite Integration II — Substitution & Integration by Parts

Integration by substitution (reverse chain rule): for $\int f(g(x))g'(x)\,dx$, let $u = g(x)$:

$$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du$$

Example: $\int 2x\cos(x^2)\,dx$ — let $u = x^2$, $du = 2x\,dx$ → $\int \cos u\,du = \sin u + C = \sin(x^2) + C$.

Integration by parts (reverse product rule):

$$\int u\,dv = uv - \int v\,du$$

Example: $\int x e^x\,dx$ — let $u = x$, $dv = e^x\,dx$ → $xe^x - \int e^x\,dx = e^x(x-1) + C$.

import numpy as np
from scipy import integrate

# Substitution: integral of 2x*cos(x^2) from 0 to sqrt(pi)
# Antiderivative = sin(x^2), so F(sqrt(pi)) - F(0) = sin(pi) - sin(0) = 0
f_sub = lambda x: 2 * x * np.cos(x**2)
result_sub, _ = integrate.quad(f_sub, 0, np.sqrt(np.pi))
analytical_sub = np.sin(np.pi) - np.sin(0)
print(f"Substitution: integral 2x*cos(x^2) from 0 to sqrt(pi)")
print(f"  Numerical:  {result_sub:.8f}")
print(f"  Analytical: {analytical_sub:.8f}")

# Integration by parts: integral of x*e^x from 0 to 1
# Antiderivative: e^x(x-1), F(1)-F(0) = 0 - (-1) = 1
f_ibp = lambda x: x * np.exp(x)
result_ibp, _ = integrate.quad(f_ibp, 0, 1)
F_ibp = lambda x: np.exp(x) * (x - 1)
analytical_ibp = F_ibp(1) - F_ibp(0)
print(f"\nIntegration by parts: integral x*e^x from 0 to 1")
print(f"  Numerical:  {result_ibp:.8f}")
print(f"  Analytical: {analytical_ibp:.8f}")   # = 1.0

# ML use: expected value E[X] for X ~ Exponential(lambda=1)
# = integral of x * e^(-x) from 0 to inf = 1 (by integration by parts)
ev_exp, _ = integrate.quad(lambda x: x * np.exp(-x), 0, np.inf)
print(f"\nE[X] for X ~ Exp(1) (integral x*e^(-x) from 0 to inf):")
print(f"  Numerical: {ev_exp:.8f}  (expected: 1.0 = 1/lambda)")

Area Under a Curve

The definite integral gives signed area. For total (unsigned) area between a curve and the x-axis:

$$\text{Total area} = \int_a^b |f(x)|\,dx$$

Area between two curves $f(x) \geq g(x)$ on $[a,b]$:

$$\text{Area} = \int_a^b [f(x) - g(x)]\,dx$$
AUC-ROC is a definite integral: The ROC curve plots True Positive Rate vs False Positive Rate as the classification threshold varies. AUC = $\int_0^1 \text{TPR}(\text{FPR})\,d(\text{FPR})$. A perfect classifier has AUC = 1.0 (full unit-square area); a random classifier has AUC = 0.5 (triangle under the diagonal). Numerically, sklearn computes AUC via the trapezoidal rule — a discrete approximation to the integral.
import numpy as np
from scipy import integrate

# Area between g(x) = x and f(x) = x^2 on [0,1]
# x >= x^2 on [0,1], so area = integral(x - x^2) from 0 to 1
area, _ = integrate.quad(lambda x: x - x**2, 0, 1)
print(f"Area between g(x)=x and f(x)=x^2 on [0,1]:")
print(f"  Numerical:  {area:.8f}")
print(f"  Analytical: {0.5 - 1/3:.8f}")  # = 1/2 - 1/3 = 1/6

# Simulate AUC-ROC calculation (trapezoidal rule)
# Sorted scores and labels for a simple binary classifier
np.random.seed(42)
scores = np.random.beta(5, 2, 100)   # biased positive scores
labels = (scores + np.random.normal(0, 0.3, 100) > 0.5).astype(int)

# Build ROC curve manually
thresholds = np.linspace(0, 1, 200)
tprs, fprs = [], []
pos_total = labels.sum()
neg_total = (1 - labels).sum()

for thresh in thresholds:
    preds = (scores >= thresh).astype(int)
    tp = ((preds == 1) & (labels == 1)).sum()
    fp = ((preds == 1) & (labels == 0)).sum()
    tprs.append(tp / pos_total)
    fprs.append(fp / neg_total)

tprs, fprs = np.array(tprs), np.array(fprs)

# AUC via trapezoid rule (numpy)
auc_trapz = np.trapz(tprs[::-1], fprs[::-1])   # sort fpr ascending

print(f"\nAUC-ROC (trapezoidal rule): {auc_trapz:.4f}")
print(f"  (1.0 = perfect, 0.5 = random)")
print(f"  This is literally a definite integral computed numerically.")

Practice Exercises

Exercise 1 — Chain Rule
Differentiate the Cross-Entropy Loss via Chain Rule

Let $\hat{y} = \sigma(w \cdot x + b)$ and $\mathcal{L} = -y\log\hat{y} - (1-y)\log(1-\hat{y})$. Compute $\frac{\partial \mathcal{L}}{\partial w}$.

Answer

Chain rule: $\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$, where $z = wx + b$.

$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$

$\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z)) = \hat{y}(1-\hat{y})$

Multiplying: $\frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = -y(1-\hat{y}) + (1-y)\hat{y} = \hat{y} - y$

Then $\frac{\partial z}{\partial w} = x$. So: $\boxed{\frac{\partial \mathcal{L}}{\partial w} = (\hat{y} - y) \cdot x}$. Clean result — the gradient is the prediction error times the input.
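A quick numerical confirmation of the $(\hat{y} - y)\,x$ result (the values of $w$, $b$, $x$, $y$ below are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, b, y):
    y_hat = sigmoid(w * x + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

w, b, x, y = 0.7, -0.2, 1.5, 1.0
h = 1e-6

numerical  = (loss(w + h, x, b, y) - loss(w - h, x, b, y)) / (2 * h)
analytical = (sigmoid(w * x + b) - y) * x      # (y_hat - y) * x

print(f"dL/dw numerical:  {numerical:.8f}")
print(f"dL/dw analytical: {analytical:.8f}")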

Exercise 2 — Gradient Descent
Learning Rate Sensitivity

For $f(x) = x^2$, derive the closed-form expression for $x_t$ after $t$ gradient descent steps starting from $x_0$. What values of learning rate $\alpha$ ensure convergence? What is the optimal $\alpha$?

Answer

$f'(x) = 2x$. Update: $x_{t+1} = x_t - 2\alpha x_t = (1-2\alpha)x_t$.

After $t$ steps: $x_t = (1-2\alpha)^t x_0$.

Converges to 0 iff $|1-2\alpha| \lt 1$, i.e., $0 \lt \alpha \lt 1$.

Fastest convergence: minimize $|1-2\alpha|$, achieved at $\alpha^* = 0.5$ giving $x_t = 0$ in one step. This corresponds to Newton's method for quadratics. For general smooth functions, the optimal step size is $\alpha = 1/L$ where $L$ is the Lipschitz constant of the gradient.
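The closed form can be checked directly. A short sketch iterating gradient descent on $f(x) = x^2$ for several learning rates, including one in the divergent regime $\alpha \gt 1$:

import numpy as np

def gd_on_x_squared(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x          # f'(x) = 2x
    return x

x0, steps = 1.0, 20
print(f"{'alpha':>7}  {'x after 20 steps':>18}  {'closed form (1-2a)^t x0':>24}")
for alpha in [0.1, 0.5, 0.9, 1.1]:
    iterated = gd_on_x_squared(alpha, x0, steps)
    closed   = (1 - 2 * alpha) ** steps * x0
    print(f"{alpha:>7.2f}  {iterated:>18.6g}  {closed:>24.6g}")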

Conclusion & Next Steps

Calculus is what makes neural networks trainable:

  • Derivatives: measure instantaneous rate of change — the slope of the loss surface at any point
  • Power, product, quotient rules: the mechanical toolkit for differentiating any function
  • Chain rule: propagates gradients through compositions — the mathematical core of backpropagation
  • Gradient descent: iteratively follows $-\nabla\mathcal{L}$ to minimize loss
  • Hessian: encodes curvature — positive definite = convex = safe convergence
  • Adam: adaptive learning rates with momentum — robust default optimizer
  • Integration: computes expected values, normalizing constants, and the AUC-ROC metric — the probabilistic layer under ML
  • Non-convexity: deep network landscapes have saddle points and many local minima

Next in the Series

In Part 9: ML-Specific Math, we apply everything learned so far directly to ML: loss functions, regularization, bias-variance tradeoff, MLE/MAP, activation function gradients, and batch normalization.