The Learning Problem
Imagine tuning a radio with dozens of knobs. Each knob position affects the clarity of the signal. Your goal: find the exact combination that minimizes static and maximizes clarity. Neural network training works the same way — except instead of a few knobs, you have millions of “weights” to adjust, and instead of static, you’re minimizing a mathematical loss function.
Loss Landscape Visualization
The relationship between weights and loss creates a loss landscape — a high-dimensional surface where valleys represent good solutions and peaks represent poor ones. Let’s visualize a simple 2-parameter loss landscape:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Create a 2D loss landscape (bowl-shaped with local minima)
w1 = np.linspace(-3, 3, 100)
w2 = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1, w2)
# Loss function: combination of quadratic + sinusoidal bumps
Loss = 0.5 * (W1**2 + W2**2) + 0.3 * np.sin(3 * W1) * np.cos(3 * W2)
# Plot 3D surface
fig = plt.figure(figsize=(12, 5))
# 3D surface plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(W1, W2, Loss, cmap='viridis', alpha=0.8)
ax1.set_xlabel('Weight 1')
ax1.set_ylabel('Weight 2')
ax1.set_zlabel('Loss')
ax1.set_title('3D Loss Landscape')
# Contour plot (bird's-eye view)
ax2 = fig.add_subplot(122)
contour = ax2.contourf(W1, W2, Loss, levels=30, cmap='viridis')
plt.colorbar(contour, ax=ax2)
ax2.set_xlabel('Weight 1')
ax2.set_ylabel('Weight 2')
ax2.set_title('Loss Contour Map')
ax2.plot(0, 0, 'r*', markersize=15, label='Near global minimum')  # the sinusoidal bumps shift the true minimum slightly off the origin
ax2.legend()
plt.tight_layout()
plt.show()
print("The goal: navigate from any starting point to the valley (minimum loss)")
In this landscape, gradient descent acts like a ball rolling downhill — it follows the steepest slope toward the nearest minimum. The challenge is finding the global minimum rather than getting stuck in local minima.
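The pull toward different valleys is easy to check numerically. Below is a minimal sketch (the gradient is derived by hand from the toy loss used in the plot above) that runs plain gradient descent from two different starting points; depending on the start, descent can settle in different dips.

```python
import numpy as np

# Toy loss from the plot above: bowl plus sinusoidal bumps
def loss(w):
    return 0.5 * (w[0]**2 + w[1]**2) + 0.3 * np.sin(3 * w[0]) * np.cos(3 * w[1])

def grad(w):
    # Hand-derived partial derivatives of the loss above
    dw1 = w[0] + 0.9 * np.cos(3 * w[0]) * np.cos(3 * w[1])
    dw2 = w[1] - 0.9 * np.sin(3 * w[0]) * np.sin(3 * w[1])
    return np.array([dw1, dw2])

def descend(start, lr=0.05, steps=2000):
    w = np.array(start, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

for start in [(2.5, 2.5), (-2.5, 0.5)]:
    w = descend(start)
    print(f"start={start} -> end=({w[0]:.3f}, {w[1]:.3f}), "
          f"loss={loss(w):.4f}, |grad|={np.linalg.norm(grad(w)):.2e}")
```

Printing the endpoints makes the basin-of-attraction effect concrete: the starting point, not just the surface, decides which minimum you reach.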
Loss Functions
A loss function (also called cost function or objective function) quantifies how far our network’s predictions are from the true values. It provides the signal that drives learning — without it, the network has no concept of “right” or “wrong.”
The loss function you choose depends on your task:
- Regression (predicting continuous values) → Mean Squared Error
- Binary classification (yes/no) → Binary Cross-Entropy
- Multi-class classification (cat/dog/bird) → Categorical Cross-Entropy
Mean Squared Error (MSE)
MSE measures the average squared difference between predictions and true values:
$$L_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Squaring ensures all errors are positive and penalizes large errors more heavily than small ones.
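The quadratic penalty is easy to see with numbers. In this sketch (illustrative values), a single error of 3.0 contributes far more to the MSE than several 0.1-sized errors combined:

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_close = np.array([2.1, 3.9, 6.1, 7.9])     # every error is 0.1
y_outlier = np.array([2.1, 3.9, 6.1, 11.0])  # one error is 3.0

mse = lambda t, p: np.mean((t - p) ** 2)
print(f"Small errors everywhere: {mse(y_true, y_close):.4f}")    # -> 0.0100
print(f"One large outlier:       {mse(y_true, y_outlier):.4f}")  # -> 2.2575
# The single 3.0 error contributes 9.0/4 = 2.25 on its own,
# dwarfing the three 0.1 errors (0.01 each before averaging).
```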
Binary Cross-Entropy
For binary classification, cross-entropy measures how well predicted probabilities match actual labels:
$$L_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
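One property worth internalizing: cross-entropy punishes confident mistakes much harder than hesitant ones. For a single positive example ($y = 1$) the loss reduces to $-\log(\hat{y})$, as this small sketch shows:

```python
import numpy as np

# For one positive example (y = 1), BCE reduces to -log(y_hat)
for p in [0.9, 0.6, 0.1, 0.01]:
    print(f"predicted p(y=1) = {p:4.2f} -> loss = {-np.log(p):.3f}")
# Predicting 0.01 when the truth is 1 costs about 4.6,
# roughly 44x the ~0.105 cost of predicting 0.9.
```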
Categorical Cross-Entropy
For multi-class problems with one-hot encoded labels, the loss for a single sample is:
$$L_{CCE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$
where $C$ is the number of classes; over a batch, the per-sample losses are averaged.
import numpy as np
# ============================================================
# Implementing Loss Functions from Scratch
# ============================================================
def mse_loss(y_true, y_pred):
"""Mean Squared Error for regression tasks."""
return np.mean((y_true - y_pred) ** 2)
def binary_cross_entropy(y_true, y_pred):
"""Binary Cross-Entropy for binary classification."""
# Clip predictions to avoid log(0)
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
def categorical_cross_entropy(y_true, y_pred):
"""Categorical Cross-Entropy for multi-class classification."""
epsilon = 1e-15
y_pred = np.clip(y_pred, epsilon, 1.0)
return -np.sum(y_true * np.log(y_pred)) / len(y_true)
# --- Demonstrate MSE ---
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_bad = np.array([2.0, 4.0, 8.0, 11.0]) # Poor predictions
y_pred_good = np.array([3.1, 4.9, 7.1, 8.9]) # Good predictions
print("=== Mean Squared Error (Regression) ===")
print(f"Poor predictions MSE: {mse_loss(y_true_reg, y_pred_bad):.4f}")
print(f"Good predictions MSE: {mse_loss(y_true_reg, y_pred_good):.4f}")
# --- Demonstrate Binary Cross-Entropy ---
y_true_bin = np.array([1, 0, 1, 1, 0])
y_pred_bad_bin = np.array([0.4, 0.7, 0.3, 0.5, 0.6]) # Poor
y_pred_good_bin = np.array([0.9, 0.1, 0.8, 0.95, 0.05]) # Good
print("\n=== Binary Cross-Entropy (Classification) ===")
print(f"Poor predictions BCE: {binary_cross_entropy(y_true_bin, y_pred_bad_bin):.4f}")
print(f"Good predictions BCE: {binary_cross_entropy(y_true_bin, y_pred_good_bin):.4f}")
# --- Demonstrate Categorical Cross-Entropy ---
# One-hot labels, shape (1, 3): one sample of class 2 (0-indexed) out of 3
y_true_cat = np.array([[0, 0, 1]])
y_pred_bad_cat = np.array([[0.4, 0.4, 0.2]])     # Confident in wrong class
y_pred_good_cat = np.array([[0.05, 0.05, 0.9]])  # Confident in right class
print("\n=== Categorical Cross-Entropy (Multi-class) ===")
print(f"Poor predictions CCE: {categorical_cross_entropy(y_true_cat, y_pred_bad_cat):.4f}")
print(f"Good predictions CCE: {categorical_cross_entropy(y_true_cat, y_pred_good_cat):.4f}")
Gradient Descent Explained
Now that we can measure how wrong our network is (via loss), how do we improve it? The answer is gradient descent — an iterative algorithm that adjusts weights in the direction that reduces loss.
The Hill-Walking Analogy
Imagine you’re blindfolded on a hilly landscape and want to reach the lowest valley. Your strategy:
- Feel the slope of the ground beneath your feet (compute gradient)
- Take a step in the downhill direction (update weights)
- Repeat until the ground feels flat (converged)
Mathematically, the weight update rule is:
$$w_{new} = w_{old} - \eta \frac{\partial L}{\partial w}$$
where:
- $\eta$ (eta) = learning rate — how big each step is
- $\frac{\partial L}{\partial w}$ = gradient — slope of loss with respect to weight
- The negative sign ensures we move downhill (toward lower loss)
1D Gradient Descent Implementation
import numpy as np
import matplotlib.pyplot as plt
# Define a loss function (non-convex for interest)
def loss_fn(w):
return w**4 - 3*w**3 + 2*w**2 + w
# Analytical gradient (derivative of loss)
def gradient_fn(w):
return 4*w**3 - 9*w**2 + 4*w + 1
# Gradient descent
learning_rate = 0.01
w = 3.0 # Starting point (far from minimum)
history = [w]
for step in range(200):
grad = gradient_fn(w)
w = w - learning_rate * grad
history.append(w)
print(f"Starting weight: {history[0]:.4f}")
print(f"Final weight: {history[-1]:.4f}")
print(f"Final loss: {loss_fn(history[-1]):.6f}")
print(f"Steps taken: {len(history) - 1}")
# Visualize the descent path
w_range = np.linspace(-1, 3.5, 200)
plt.figure(figsize=(10, 5))
plt.plot(w_range, loss_fn(w_range), 'b-', linewidth=2, label='Loss Function')
# Plot descent steps (show every 10th step)
steps_to_show = history[::10]
for i, w_step in enumerate(steps_to_show):
alpha = 0.3 + 0.7 * (i / len(steps_to_show))
plt.plot(w_step, loss_fn(w_step), 'ro', markersize=8, alpha=alpha)
plt.plot(history[0], loss_fn(history[0]), 'g^', markersize=15, label='Start')
plt.plot(history[-1], loss_fn(history[-1]), 'r*', markersize=15, label='End (local minimum)')
plt.xlabel('Weight (w)')
plt.ylabel('Loss L(w)')
plt.title('Gradient Descent: Following the Slope Downhill')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Backpropagation: The Magic Behind Learning
Gradient descent tells us what to do with gradients (subtract them from weights). But how do we compute gradients for a deep network with many layers? The answer is backpropagation — an efficient algorithm that propagates error signals backward through the network using the chain rule of calculus.
flowchart LR
subgraph Forward["Forward Pass (Compute Predictions)"]
direction LR
X["Input X"] --> Z1["z1 = W1*X + b1"]
Z1 --> A1["a1 = ReLU(z1)"]
A1 --> Z2["z2 = W2*a1 + b2"]
Z2 --> A2["a2 = sigmoid(z2)"]
A2 --> L["Loss L"]
end
subgraph Backward["Backward Pass (Compute Gradients)"]
direction RL
dL["dL/da2"] --> dZ2["dL/dz2"]
dZ2 --> dW2["dL/dW2"]
dZ2 --> dA1["dL/da1"]
dA1 --> dZ1["dL/dz1"]
dZ1 --> dW1["dL/dW1"]
end
L -.->|"Error Signal"| dL
The key insight: each layer’s gradient depends on the gradients from the layers after it. We compute gradients from the output layer backward to the input layer — hence “back”-propagation.
Complete 2-Layer Backpropagation (NumPy)
import numpy as np
# ============================================================
# Complete 2-Layer Neural Network with Backpropagation
# ============================================================
np.random.seed(42)
# --- Activation functions and their derivatives ---
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(a):
"""Derivative of sigmoid given the activation (not z)."""
return a * (1 - a)
def relu(z):
return np.maximum(0, z)
def relu_derivative(z):
return (z > 0).astype(float)
# --- Training data: XOR-like problem ---
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # 4 samples, 2 features
y = np.array([[0], [1], [1], [0]]) # XOR labels
# --- Initialize weights ---
# Layer 1: 2 inputs -> 4 hidden neurons
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))
# Layer 2: 4 hidden -> 1 output
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))
learning_rate = 0.5
losses = []
# --- Training loop ---
for epoch in range(1000):
# ===== FORWARD PASS =====
z1 = X.dot(W1) + b1 # Linear: (4,2)*(2,4) = (4,4)
a1 = relu(z1) # Activation: (4,4)
z2 = a1.dot(W2) + b2 # Linear: (4,4)*(4,1) = (4,1)
a2 = sigmoid(z2) # Output: (4,1)
# Compute loss (Binary Cross-Entropy)
epsilon = 1e-15
loss = -np.mean(y * np.log(a2 + epsilon) + (1 - y) * np.log(1 - a2 + epsilon))
losses.append(loss)
# ===== BACKWARD PASS =====
m = X.shape[0] # Number of samples
# Output layer gradients
    dz2 = a2 - y                  # BCE + sigmoid gradient simplifies to this; shape (4,1)
dW2 = (1/m) * a1.T.dot(dz2) # (4,4).T * (4,1) = (4,1)
db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
# Hidden layer gradients (chain rule!)
da1 = dz2.dot(W2.T) # (4,1) * (1,4) = (4,4)
dz1 = da1 * relu_derivative(z1) # Element-wise: (4,4)
    dW1 = (1/m) * X.T.dot(dz1)    # X.T is (2,4); (2,4)*(4,4) = (2,4)
db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
# ===== UPDATE WEIGHTS =====
W2 -= learning_rate * dW2
b2 -= learning_rate * db2
W1 -= learning_rate * dW1
b1 -= learning_rate * db1
# --- Results ---
print("=== Training Complete ===")
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.4f}")
print(f"\nPredictions after training:")
for i in range(4):
print(f" Input: {X[i]} -> Predicted: {a2[i][0]:.4f}, Actual: {y[i][0]}")
Forward pass computes activations layer by layer:
- $z^{[1]} = W^{[1]}X + b^{[1]}$, then $a^{[1]} = \text{ReLU}(z^{[1]})$
- $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, then $a^{[2]} = \sigma(z^{[2]})$
Backward pass computes gradients in reverse:
- $dz^{[2]} = a^{[2]} - y$ (output error)
- $dW^{[2]} = \frac{1}{m} a^{[1]T} \cdot dz^{[2]}$
- $da^{[1]} = dz^{[2]} \cdot W^{[2]T}$ (propagate error back)
- $dz^{[1]} = da^{[1]} \odot \text{ReLU}'(z^{[1]})$ (apply activation derivative)
- $dW^{[1]} = \frac{1}{m} X^T \cdot dz^{[1]}$
The Chain Rule in Action
Backpropagation is fundamentally the chain rule of calculus applied to computational graphs. Every operation in a neural network (multiplication, addition, activation) is a node in a graph. The chain rule lets us compute how the final loss depends on any weight, no matter how deep in the network.
For a weight $w_1$ in the first layer of a 2-layer network:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$
Each term in this product corresponds to one edge in the computational graph:
flowchart LR
w1["w1"] --> z1["z1 = w1*x + b1"]
x["x (input)"] --> z1
z1 --> a1["a1 = ReLU(z1)"]
a1 --> z2["z2 = w2*a1 + b2"]
w2["w2"] --> z2
z2 --> a2["a2 = sigmoid(z2)"]
a2 --> L["L = Loss(a2, y)"]
L -.->|"dL/da2"| a2
a2 -.->|"da2/dz2"| z2
z2 -.->|"dz2/da1 = w2"| a1
a1 -.->|"da1/dz1"| z1
z1 -.->|"dz1/dw1 = x"| w1
Gradient Checking: Verify Your Backprop
A critical debugging technique: compare your analytical gradients (from backprop) against numerical gradients (computed by nudging weights slightly). If they match, your backprop implementation is correct.
import numpy as np
# ============================================================
# Gradient Checking: Numerical vs Analytical Gradients
# ============================================================
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def compute_loss(X, y, W1, b1, W2, b2):
"""Forward pass and loss computation."""
z1 = X.dot(W1) + b1
a1 = np.maximum(0, z1) # ReLU
z2 = a1.dot(W2) + b2
a2 = sigmoid(z2)
epsilon = 1e-15
loss = -np.mean(y * np.log(a2 + epsilon) + (1 - y) * np.log(1 - a2 + epsilon))
return loss
def numerical_gradient(X, y, W1, b1, W2, b2, param, i, j, h=1e-5):
"""Compute gradient numerically using finite differences."""
# f(x + h)
param[i, j] += h
loss_plus = compute_loss(X, y, W1, b1, W2, b2)
# f(x - h)
param[i, j] -= 2 * h
loss_minus = compute_loss(X, y, W1, b1, W2, b2)
# Restore original value
param[i, j] += h
# Central difference approximation
return (loss_plus - loss_minus) / (2 * h)
# Setup
np.random.seed(42)
X = np.array([[0.5, 0.3], [0.8, 0.1]])
y = np.array([[1], [0]])
W1 = np.random.randn(2, 3) * 0.5
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1) * 0.5
b2 = np.zeros((1, 1))
# Analytical gradient (backprop)
m = X.shape[0]
z1 = X.dot(W1) + b1
a1 = np.maximum(0, z1)
z2 = a1.dot(W2) + b2
a2 = sigmoid(z2)
dz2 = a2 - y
dW2_analytical = (1/m) * a1.T.dot(dz2)
da1 = dz2.dot(W2.T)
dz1 = da1 * (z1 > 0).astype(float)
dW1_analytical = (1/m) * X.T.dot(dz1)
# Numerical gradient for W1[0,0]
dW1_numerical_00 = numerical_gradient(X, y, W1, b1, W2, b2, W1, 0, 0)
# Compare
print("=== Gradient Checking ===")
print(f"dW1[0,0] Analytical: {dW1_analytical[0, 0]:.8f}")
print(f"dW1[0,0] Numerical: {dW1_numerical_00:.8f}")
print(f"Relative difference: {abs(dW1_analytical[0,0] - dW1_numerical_00) / (abs(dW1_analytical[0,0]) + 1e-8):.2e}")
print("\nIf relative difference < 1e-5, backprop is correct!")
# Check multiple weights
print("\n--- Full W2 Gradient Check ---")
for i in range(W2.shape[0]):
for j in range(W2.shape[1]):
num = numerical_gradient(X, y, W1, b1, W2, b2, W2, i, j)
ana = dW2_analytical[i, j]
rel_diff = abs(ana - num) / (abs(ana) + 1e-8)
status = "PASS" if rel_diff < 1e-5 else "FAIL"
print(f" W2[{i},{j}]: analytical={ana:.6f}, numerical={num:.6f}, diff={rel_diff:.2e} [{status}]")
Optimizers: Beyond Basic SGD
Vanilla Stochastic Gradient Descent (SGD) has several problems:
- Oscillation in ravines (narrow valleys) where the gradient bounces back and forth
- Same learning rate for all parameters, even when some need larger/smaller updates
- Getting stuck in saddle points where gradients are near zero
Modern optimizers fix these issues. Let’s implement the three most important ones from scratch.
SGD with Momentum
Momentum adds a “velocity” term that accumulates past gradients, like a ball rolling downhill gaining speed:
$$v_t = \gamma v_{t-1} + \eta \nabla L$$
$$w_t = w_{t-1} - v_t$$
where $\gamma$ (typically 0.9) is the momentum coefficient.
RMSprop
RMSprop adapts the learning rate per-parameter by dividing by the running average of squared gradients:
$$s_t = \beta s_{t-1} + (1-\beta)(\nabla L)^2$$
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{s_t + \epsilon}} \nabla L$$
Adam Optimizer
Adam combines the best of Momentum and RMSprop with bias correction:
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L \quad \text{(first moment)}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2 \quad \text{(second moment)}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$
$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$
import numpy as np
import matplotlib.pyplot as plt
# ============================================================
# Implementing Optimizers from Scratch
# ============================================================
# Test function: Rosenbrock (notoriously hard to optimize)
def rosenbrock(w):
"""f(x,y) = (1-x)^2 + 100*(y-x^2)^2"""
return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2
def rosenbrock_grad(w):
"""Gradient of Rosenbrock function."""
dx = -2*(1 - w[0]) + 200*(w[1] - w[0]**2)*(-2*w[0])
dy = 200*(w[1] - w[0]**2)
return np.array([dx, dy])
class SGDMomentum:
def __init__(self, lr=0.001, momentum=0.9):
self.lr = lr
self.momentum = momentum
self.velocity = None
def step(self, w, grad):
if self.velocity is None:
self.velocity = np.zeros_like(w)
self.velocity = self.momentum * self.velocity + self.lr * grad
return w - self.velocity
class RMSprop:
def __init__(self, lr=0.001, beta=0.9, epsilon=1e-8):
self.lr = lr
self.beta = beta
self.epsilon = epsilon
self.s = None
def step(self, w, grad):
if self.s is None:
self.s = np.zeros_like(w)
self.s = self.beta * self.s + (1 - self.beta) * grad**2
return w - self.lr * grad / (np.sqrt(self.s) + self.epsilon)
class Adam:
def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m = None
self.v = None
self.t = 0
def step(self, w, grad):
if self.m is None:
self.m = np.zeros_like(w)
self.v = np.zeros_like(w)
self.t += 1
self.m = self.beta1 * self.m + (1 - self.beta1) * grad
self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
m_hat = self.m / (1 - self.beta1**self.t)
v_hat = self.v / (1 - self.beta2**self.t)
return w - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
# --- Run all three optimizers ---
start = np.array([-1.5, 2.0])
n_steps = 500
optimizers = {
'SGD + Momentum': SGDMomentum(lr=0.001, momentum=0.9),
'RMSprop': RMSprop(lr=0.01),
'Adam': Adam(lr=0.01)
}
paths = {}
loss_histories = {}
for name, opt in optimizers.items():
w = start.copy()
path = [w.copy()]
losses = [rosenbrock(w)]
for _ in range(n_steps):
grad = rosenbrock_grad(w)
w = opt.step(w, grad)
path.append(w.copy())
losses.append(rosenbrock(w))
paths[name] = np.array(path)
loss_histories[name] = losses
print(f"{name:18s} -> Final position: ({w[0]:.4f}, {w[1]:.4f}), Loss: {rosenbrock(w):.6f}")
# --- Plot loss curves ---
plt.figure(figsize=(10, 5))
for name, losses in loss_histories.items():
plt.plot(losses[:200], label=name, linewidth=2)
plt.xlabel('Step')
plt.ylabel('Loss (Rosenbrock)')
plt.title('Optimizer Comparison: Convergence Speed')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
print("\nAdam typically converges fastest due to adaptive learning rates + momentum")
Learning Rate Impact
The learning rate ($\eta$) is arguably the most important hyperparameter in training. Too large and the optimizer overshoots the minimum (diverges). Too small and training takes forever (or gets stuck).
import numpy as np
import matplotlib.pyplot as plt
# ============================================================
# Learning Rate Comparison: Same Problem, Different Rates
# ============================================================
def loss_function(w):
"""Simple bowl: L(w) = w^2 + 0.5*sin(4w)"""
return w**2 + 0.5 * np.sin(4 * w)
def loss_gradient(w):
"""Gradient: dL/dw = 2w + 2*cos(4w)"""
return 2 * w + 2 * np.cos(4 * w)
learning_rates = [0.001, 0.01, 0.1, 0.5, 1.2]
colors = ['blue', 'green', 'orange', 'red', 'purple']
labels = ['0.001 (too slow)', '0.01 (slow)', '0.1 (good)', '0.5 (oscillates)', '1.2 (diverges)']
plt.figure(figsize=(12, 5))
# Plot 1: Loss over time
plt.subplot(1, 2, 1)
for lr, color, label in zip(learning_rates, colors, labels):
w = 3.0
losses = [loss_function(w)]
for _ in range(50):
grad = loss_gradient(w)
w = w - lr * grad
# Clip to prevent overflow
w = np.clip(w, -10, 10)
losses.append(loss_function(w))
plt.plot(losses, color=color, linewidth=2, label=f'lr={label}')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Learning Rate Impact on Convergence')
plt.legend(fontsize=8)
plt.grid(True, alpha=0.3)
plt.ylim(-1, 15)
# Plot 2: Path on loss curve
plt.subplot(1, 2, 2)
w_range = np.linspace(-4, 4, 200)
plt.plot(w_range, loss_function(w_range), 'k-', linewidth=2, alpha=0.5, label='Loss curve')
for lr, color, label in zip([0.01, 0.1, 0.5], ['green', 'orange', 'red'], ['0.01', '0.1', '0.5']):
w = 3.0
path = [w]
for _ in range(20):
grad = loss_gradient(w)
w = w - lr * grad
w = np.clip(w, -10, 10)
path.append(w)
path = np.array(path)
plt.plot(path, loss_function(path), 'o-', color=color, markersize=4, label=f'lr={label}')
plt.xlabel('Weight (w)')
plt.ylabel('Loss')
plt.title('Optimization Paths for Different Learning Rates')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Summary:")
print(" lr=0.001: Barely moves in 50 steps")
print(" lr=0.01: Converges slowly but safely")
print(" lr=0.1: Good balance of speed and stability")
print(" lr=0.5: Oscillates but eventually converges")
print(" lr=1.2: Diverges! Loss increases without bound")
Learning Rate Schedules
Rather than using a fixed learning rate, modern training uses schedules that decrease the rate over time:
import numpy as np
import matplotlib.pyplot as plt
# ============================================================
# Learning Rate Schedules
# ============================================================
epochs = 100
initial_lr = 0.1
# Step decay: reduce by factor every N epochs
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_drop=20):
return initial_lr * (drop ** (epoch // epochs_drop))
# Exponential decay
def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.05):
return initial_lr * np.exp(-decay_rate * epoch)
# Cosine annealing
def cosine_annealing(epoch, initial_lr=0.1, T_max=100):
return initial_lr * (1 + np.cos(np.pi * epoch / T_max)) / 2
# Warmup + cosine decay
def warmup_cosine(epoch, initial_lr=0.1, warmup_epochs=10, total_epochs=100):
if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs  # ramp from a nonzero rate instead of lr=0 at epoch 0
progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
return initial_lr * (1 + np.cos(np.pi * progress)) / 2
epoch_range = np.arange(epochs)
plt.figure(figsize=(10, 6))
plt.plot(epoch_range, [step_decay(e) for e in epoch_range], 'b-', linewidth=2, label='Step Decay')
plt.plot(epoch_range, [exponential_decay(e) for e in epoch_range], 'r-', linewidth=2, label='Exponential Decay')
plt.plot(epoch_range, [cosine_annealing(e) for e in epoch_range], 'g-', linewidth=2, label='Cosine Annealing')
plt.plot(epoch_range, [warmup_cosine(e) for e in epoch_range], 'm-', linewidth=2, label='Warmup + Cosine')
plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Common Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("Schedule recommendations:")
print(" Step Decay: Simple, good baseline")
print(" Exponential: Smooth decrease, common in older papers")
print(" Cosine Annealing: Popular in modern training (ResNets, Transformers)")
print(" Warmup + Cosine: Best for Transformers, prevents early instability")
The learning rate finder (Smith, 2017) is a practical technique:
- Start with a very small LR (e.g., $10^{-7}$)
- Gradually increase it each batch until loss explodes
- Pick the LR where loss was decreasing fastest (typically 1/10th of the rate where loss starts increasing)
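The steps above can be sketched on the 1D toy loss from the learning-rate section. This is a simplified, illustrative version: real LR finders (e.g. in fastai) sweep over minibatches and smooth the loss curve, details this sketch omits.

```python
import numpy as np

def loss_fn(w):
    return w**2 + 0.5 * np.sin(4 * w)

def grad_fn(w):
    return 2 * w + 2 * np.cos(4 * w)

# Sweep the learning rate exponentially from 1e-7 upward,
# taking one gradient step per trial rate in a single run.
lrs = 10 ** np.linspace(-7, 0.5, 60)
w = 3.0
records = []
for lr in lrs:
    w = w - lr * grad_fn(w)
    records.append((lr, loss_fn(w)))
    if loss_fn(w) > 4 * loss_fn(3.0):   # loss exploded -> stop the sweep
        break

# Pick the lr responsible for the steepest drop between consecutive trials
drops = [(records[i][1] - records[i + 1][1], records[i + 1][0])
         for i in range(len(records) - 1)]
best_drop, best_lr = max(drops)
print(f"Tried {len(records)} rates; steepest single-step loss drop at lr ~ {best_lr:.2e}")
```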
Rules of thumb:
- Adam: start with $\eta = 0.001$
- SGD + Momentum: start with $\eta = 0.01$ to $0.1$
- Always use warmup for Transformers
What’s Next
You now understand the complete learning pipeline: loss functions measure error, gradient descent minimizes that error, backpropagation computes the necessary gradients, and optimizers accelerate convergence. These are the fundamental mechanisms that power every neural network ever trained.
In the next article, we’ll put all of this together and build a complete neural network from scratch — tackling the classic XOR problem and extending to real datasets, all without any deep learning framework.
Next in the Series
In Part 4: Build Your First Neural Network from Scratch, we’ll combine everything from Parts 1–3 into a working network that solves XOR, classifies data, and trains on real datasets — all in pure NumPy.