Part 3: How Neural Networks Learn

May 3, 2026 · Wasil Zafar · 35 min read

Master loss functions, gradient descent, backpropagation, and modern optimizers — implemented from scratch so you understand exactly how networks learn.

Table of Contents

  1. The Learning Problem
  2. Loss Functions
  3. Gradient Descent
  4. Backpropagation
  5. The Chain Rule in Action
  6. Optimizers: Beyond Basic SGD
  7. Learning Rate Impact
  8. What’s Next

The Learning Problem

Imagine tuning a radio with dozens of knobs. Each knob position affects the clarity of the signal. Your goal: find the exact combination that minimizes static and maximizes clarity. Neural network training works the same way — except instead of a few knobs, you have millions of “weights” to adjust, and instead of static, you’re minimizing a mathematical loss function.

Core Insight: Training a neural network is an optimization problem. We search through a vast space of possible weight configurations to find the set that produces the lowest error on our training data.

Loss Landscape Visualization

The relationship between weights and loss creates a loss landscape — a high-dimensional surface where valleys represent good solutions and peaks represent poor ones. Let’s visualize a simple 2-parameter loss landscape:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Create a 2D loss landscape (bowl-shaped with local minima)
w1 = np.linspace(-3, 3, 100)
w2 = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1, w2)

# Loss function: combination of quadratic + sinusoidal bumps
Loss = 0.5 * (W1**2 + W2**2) + 0.3 * np.sin(3 * W1) * np.cos(3 * W2)

# Plot 3D surface
fig = plt.figure(figsize=(12, 5))

# 3D surface plot
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(W1, W2, Loss, cmap='viridis', alpha=0.8)
ax1.set_xlabel('Weight 1')
ax1.set_ylabel('Weight 2')
ax1.set_zlabel('Loss')
ax1.set_title('3D Loss Landscape')

# Contour plot (bird's-eye view)
ax2 = fig.add_subplot(122)
contour = ax2.contourf(W1, W2, Loss, levels=30, cmap='viridis')
plt.colorbar(contour, ax=ax2)
ax2.set_xlabel('Weight 1')
ax2.set_ylabel('Weight 2')
ax2.set_title('Loss Contour Map')
# Mark the grid point with the lowest loss (the sinusoidal bumps shift the
# true minimum slightly away from the origin)
min_idx = np.unravel_index(np.argmin(Loss), Loss.shape)
ax2.plot(W1[min_idx], W2[min_idx], 'r*', markersize=15, label='Global Minimum')
ax2.legend()

plt.tight_layout()
plt.show()
print("The goal: navigate from any starting point to the valley (minimum loss)")

In this landscape, gradient descent acts like a ball rolling downhill — it follows the steepest slope toward the nearest minimum. The challenge is finding the global minimum rather than getting stuck in local minima.

Loss Functions

A loss function (also called cost function or objective function) quantifies how far our network’s predictions are from the true values. It provides the signal that drives learning — without it, the network has no concept of “right” or “wrong.”

Concept: Choosing Loss Functions

The loss function you choose depends on your task:

  • Regression (predicting continuous values) → Mean Squared Error
  • Binary classification (yes/no) → Binary Cross-Entropy
  • Multi-class classification (cat/dog/bird) → Categorical Cross-Entropy

Mean Squared Error (MSE)

MSE measures the average squared difference between predictions and true values:

$$L_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Squaring ensures all errors are positive and penalizes large errors more heavily than small ones.

Binary Cross-Entropy

For binary classification, cross-entropy measures how well predicted probabilities match actual labels:

$$L_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

Categorical Cross-Entropy

For multi-class problems with one-hot encoded labels:

$$L_{CCE} = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where $C$ is the number of classes.

import numpy as np

# ============================================================
# Implementing Loss Functions from Scratch
# ============================================================

def mse_loss(y_true, y_pred):
    """Mean Squared Error for regression tasks."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred):
    """Binary Cross-Entropy for binary classification."""
    # Clip predictions to avoid log(0)
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    """Categorical Cross-Entropy for multi-class classification."""
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1.0)
    # Average over samples (rows), not over classes
    y_true = np.atleast_2d(y_true)
    y_pred = np.atleast_2d(y_pred)
    return -np.sum(y_true * np.log(y_pred)) / y_true.shape[0]

# --- Demonstrate MSE ---
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_bad = np.array([2.0, 4.0, 8.0, 11.0])    # Poor predictions
y_pred_good = np.array([3.1, 4.9, 7.1, 8.9])    # Good predictions

print("=== Mean Squared Error (Regression) ===")
print(f"Poor predictions MSE:  {mse_loss(y_true_reg, y_pred_bad):.4f}")
print(f"Good predictions MSE:  {mse_loss(y_true_reg, y_pred_good):.4f}")

# --- Demonstrate Binary Cross-Entropy ---
y_true_bin = np.array([1, 0, 1, 1, 0])
y_pred_bad_bin = np.array([0.4, 0.7, 0.3, 0.5, 0.6])   # Poor
y_pred_good_bin = np.array([0.9, 0.1, 0.8, 0.95, 0.05]) # Good

print("\n=== Binary Cross-Entropy (Classification) ===")
print(f"Poor predictions BCE:  {binary_cross_entropy(y_true_bin, y_pred_bad_bin):.4f}")
print(f"Good predictions BCE:  {binary_cross_entropy(y_true_bin, y_pred_good_bin):.4f}")

# --- Demonstrate Categorical Cross-Entropy ---
# One-hot: class 2 out of 3
y_true_cat = np.array([0, 0, 1])
y_pred_bad_cat = np.array([0.4, 0.4, 0.2])   # Confident in wrong class
y_pred_good_cat = np.array([0.05, 0.05, 0.9]) # Confident in right class

print("\n=== Categorical Cross-Entropy (Multi-class) ===")
print(f"Poor predictions CCE:  {categorical_cross_entropy(y_true_cat, y_pred_bad_cat):.4f}")
print(f"Good predictions CCE:  {categorical_cross_entropy(y_true_cat, y_pred_good_cat):.4f}")

Why clipping matters: We clip predictions to $[\epsilon, 1-\epsilon]$ because $\log(0) = -\infty$. Without clipping, a single confident wrong prediction would produce infinite loss and break training.

Gradient Descent

Now that we can measure how wrong our network is (via loss), how do we improve it? The answer is gradient descent — an iterative algorithm that adjusts weights in the direction that reduces loss.

The Hill-Walking Analogy

Imagine you’re blindfolded on a hilly landscape and want to reach the lowest valley. Your strategy:

  1. Feel the slope of the ground beneath your feet (compute gradient)
  2. Take a step in the downhill direction (update weights)
  3. Repeat until the ground feels flat (converged)

Mathematically, the weight update rule is:

$$w_{new} = w_{old} - \eta \frac{\partial L}{\partial w}$$

where:

  • $\eta$ (eta) = learning rate — how big each step is
  • $\frac{\partial L}{\partial w}$ = gradient — slope of loss with respect to weight
  • The negative sign ensures we move downhill (toward lower loss)

1D Gradient Descent Implementation

import numpy as np
import matplotlib.pyplot as plt

# Define a loss function (non-convex for interest)
def loss_fn(w):
    return w**4 - 3*w**3 + 2*w**2 + w

# Analytical gradient (derivative of loss)
def gradient_fn(w):
    return 4*w**3 - 9*w**2 + 4*w + 1

# Gradient descent
learning_rate = 0.01
w = 3.0  # Starting point (far from minimum)
history = [w]

for step in range(200):
    grad = gradient_fn(w)
    w = w - learning_rate * grad
    history.append(w)

print(f"Starting weight: {history[0]:.4f}")
print(f"Final weight:    {history[-1]:.4f}")
print(f"Final loss:      {loss_fn(history[-1]):.6f}")
print(f"Steps taken:     {len(history) - 1}")

# Visualize the descent path
w_range = np.linspace(-1, 3.5, 200)
plt.figure(figsize=(10, 5))
plt.plot(w_range, loss_fn(w_range), 'b-', linewidth=2, label='Loss Function')

# Plot descent steps (show every 10th step)
steps_to_show = history[::10]
for i, w_step in enumerate(steps_to_show):
    alpha = 0.3 + 0.7 * (i / len(steps_to_show))
    plt.plot(w_step, loss_fn(w_step), 'ro', markersize=8, alpha=alpha)

plt.plot(history[0], loss_fn(history[0]), 'g^', markersize=15, label='Start')
plt.plot(history[-1], loss_fn(history[-1]), 'r*', markersize=15, label='End (Minimum)')
plt.xlabel('Weight (w)')
plt.ylabel('Loss L(w)')
plt.title('Gradient Descent: Following the Slope Downhill')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Backpropagation: The Magic Behind Learning

Gradient descent tells us what to do with gradients (subtract them from weights). But how do we compute gradients for a deep network with many layers? The answer is backpropagation — an efficient algorithm that propagates error signals backward through the network using the chain rule of calculus.

Forward Pass & Backward Pass Flow
flowchart LR
    subgraph Forward["Forward Pass (Compute Predictions)"]
        direction LR
        X["Input X"] --> Z1["z1 = W1*X + b1"]
        Z1 --> A1["a1 = ReLU(z1)"]
        A1 --> Z2["z2 = W2*a1 + b2"]
        Z2 --> A2["a2 = sigmoid(z2)"]
        A2 --> L["Loss L"]
    end

    subgraph Backward["Backward Pass (Compute Gradients)"]
        direction RL
        dL["dL/da2"] --> dZ2["dL/dz2"]
        dZ2 --> dW2["dL/dW2"]
        dZ2 --> dA1["dL/da1"]
        dA1 --> dZ1["dL/dz1"]
        dZ1 --> dW1["dL/dW1"]
    end

    L -.->|"Error Signal"| dL
            

The key insight: each layer’s gradient depends on the gradients from the layers after it. We compute gradients from the output layer backward to the input layer — hence “back”-propagation.

Complete 2-Layer Backpropagation (NumPy)

import numpy as np

# ============================================================
# Complete 2-Layer Neural Network with Backpropagation
# ============================================================

np.random.seed(42)

# --- Activation functions and their derivatives ---
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(a):
    """Derivative of sigmoid given the activation (not z)."""
    return a * (1 - a)

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

# --- Training data: XOR-like problem ---
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # 4 samples, 2 features
y = np.array([[0], [1], [1], [0]])                 # XOR labels

# --- Initialize weights ---
# Layer 1: 2 inputs -> 4 hidden neurons
W1 = np.random.randn(2, 4) * 0.5
b1 = np.zeros((1, 4))

# Layer 2: 4 hidden -> 1 output
W2 = np.random.randn(4, 1) * 0.5
b2 = np.zeros((1, 1))

learning_rate = 0.5
losses = []

# --- Training loop ---
for epoch in range(1000):
    # ===== FORWARD PASS =====
    z1 = X.dot(W1) + b1          # Linear: (4,2)*(2,4) = (4,4)
    a1 = relu(z1)                  # Activation: (4,4)
    z2 = a1.dot(W2) + b2          # Linear: (4,4)*(4,1) = (4,1)
    a2 = sigmoid(z2)               # Output: (4,1)

    # Compute loss (Binary Cross-Entropy)
    epsilon = 1e-15
    loss = -np.mean(y * np.log(a2 + epsilon) + (1 - y) * np.log(1 - a2 + epsilon))
    losses.append(loss)

    # ===== BACKWARD PASS =====
    m = X.shape[0]  # Number of samples

    # Output layer gradients
    dz2 = a2 - y                           # (4,1) - shape of output
    dW2 = (1/m) * a1.T.dot(dz2)           # (4,4).T * (4,1) = (4,1)
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)

    # Hidden layer gradients (chain rule!)
    da1 = dz2.dot(W2.T)                    # (4,1) * (1,4) = (4,4)
    dz1 = da1 * relu_derivative(z1)        # Element-wise: (4,4)
    dW1 = (1/m) * X.T.dot(dz1)            # (4,2).T -> (2,4) * (4,4) = (2,4)
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)

    # ===== UPDATE WEIGHTS =====
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

# --- Results ---
print("=== Training Complete ===")
print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss:   {losses[-1]:.4f}")
print(f"\nPredictions after training:")
for i in range(4):
    print(f"  Input: {X[i]} -> Predicted: {a2[i][0]:.4f}, Actual: {y[i][0]}")

Key Equations: Backprop Step-by-Step

Forward pass computes activations layer by layer:

  • $z^{[1]} = W^{[1]}X + b^{[1]}$, then $a^{[1]} = \text{ReLU}(z^{[1]})$
  • $z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$, then $a^{[2]} = \sigma(z^{[2]})$

Backward pass computes gradients in reverse:

  • $dz^{[2]} = a^{[2]} - y$ (output error)
  • $dW^{[2]} = \frac{1}{m} a^{[1]T} \cdot dz^{[2]}$
  • $da^{[1]} = dz^{[2]} \cdot W^{[2]T}$ (propagate error back)
  • $dz^{[1]} = da^{[1]} \odot \text{ReLU}'(z^{[1]})$ (apply activation derivative)
  • $dW^{[1]} = \frac{1}{m} X^T \cdot dz^{[1]}$

The Chain Rule in Action

Backpropagation is fundamentally the chain rule of calculus applied to computational graphs. Every operation in a neural network (multiplication, addition, activation) is a node in a graph. The chain rule lets us compute how the final loss depends on any weight, no matter how deep in the network.

For a weight $w_1$ in the first layer of a 2-layer network:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Each term in this product corresponds to one edge in the computational graph:

Computational Graph with Chain Rule
flowchart LR
    w1["w1"] --> z1["z1 = w1*x + b1"]
    x["x (input)"] --> z1
    z1 --> a1["a1 = ReLU(z1)"]
    a1 --> z2["z2 = w2*a1 + b2"]
    w2["w2"] --> z2
    z2 --> a2["a2 = sigmoid(z2)"]
    a2 --> L["L = Loss(a2, y)"]

    L -.->|"dL/da2"| a2
    a2 -.->|"da2/dz2"| z2
    z2 -.->|"dz2/da1 = w2"| a1
    a1 -.->|"da1/dz1"| z1
    z1 -.->|"dz1/dw1 = x"| w1
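
The product in the equation above can be checked numerically. A minimal scalar sketch (the input, label, and weight values below are hypothetical) computes each factor one edge at a time, multiplies them along the path, and compares the result against a central-difference estimate:

```python
import numpy as np

# Hypothetical scalar 2-layer network: x -> z1 -> a1 -> z2 -> a2 -> L
x, y = 0.5, 1.0
w1, b1, w2, b2 = 0.8, 0.1, -0.4, 0.2

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass (ReLU hidden, sigmoid output, BCE loss)
z1 = w1 * x + b1
a1 = max(0.0, z1)
z2 = w2 * a1 + b2
a2 = sigmoid(z2)

# One local derivative per edge of the computational graph
dL_da2 = -(y / a2 - (1 - y) / (1 - a2))  # BCE derivative
da2_dz2 = a2 * (1 - a2)                  # sigmoid derivative
dz2_da1 = w2
da1_dz1 = 1.0 if z1 > 0 else 0.0         # ReLU derivative
dz1_dw1 = x

# Chain rule: multiply the local derivatives along the path
dL_dw1 = dL_da2 * da2_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Central-difference check of the same derivative
def loss_at(w):
    h1 = max(0.0, w * x + b1)
    p = sigmoid(w2 * h1 + b2)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

h = 1e-6
numerical = (loss_at(w1 + h) - loss_at(w1 - h)) / (2 * h)
print(f"Chain rule: {dL_dw1:.8f}")
print(f"Numerical:  {numerical:.8f}")
```

The two values agree to several decimal places, which is exactly the check the gradient-checking section below automates for full weight matrices.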
            

Gradient Checking: Verify Your Backprop

A critical debugging technique: compare your analytical gradients (from backprop) against numerical gradients (computed by nudging weights slightly). If they match, your backprop implementation is correct.

import numpy as np

# ============================================================
# Gradient Checking: Numerical vs Analytical Gradients
# ============================================================

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def compute_loss(X, y, W1, b1, W2, b2):
    """Forward pass and loss computation."""
    z1 = X.dot(W1) + b1
    a1 = np.maximum(0, z1)  # ReLU
    z2 = a1.dot(W2) + b2
    a2 = sigmoid(z2)
    epsilon = 1e-15
    loss = -np.mean(y * np.log(a2 + epsilon) + (1 - y) * np.log(1 - a2 + epsilon))
    return loss

def numerical_gradient(X, y, W1, b1, W2, b2, param, i, j, h=1e-5):
    """Compute gradient numerically using finite differences."""
    # f(x + h)
    param[i, j] += h
    loss_plus = compute_loss(X, y, W1, b1, W2, b2)

    # f(x - h)
    param[i, j] -= 2 * h
    loss_minus = compute_loss(X, y, W1, b1, W2, b2)

    # Restore original value
    param[i, j] += h

    # Central difference approximation
    return (loss_plus - loss_minus) / (2 * h)

# Setup
np.random.seed(42)
X = np.array([[0.5, 0.3], [0.8, 0.1]])
y = np.array([[1], [0]])
W1 = np.random.randn(2, 3) * 0.5
b1 = np.zeros((1, 3))
W2 = np.random.randn(3, 1) * 0.5
b2 = np.zeros((1, 1))

# Analytical gradient (backprop)
m = X.shape[0]
z1 = X.dot(W1) + b1
a1 = np.maximum(0, z1)
z2 = a1.dot(W2) + b2
a2 = sigmoid(z2)

dz2 = a2 - y
dW2_analytical = (1/m) * a1.T.dot(dz2)
da1 = dz2.dot(W2.T)
dz1 = da1 * (z1 > 0).astype(float)
dW1_analytical = (1/m) * X.T.dot(dz1)

# Numerical gradient for W1[0,0]
dW1_numerical_00 = numerical_gradient(X, y, W1, b1, W2, b2, W1, 0, 0)

# Compare
print("=== Gradient Checking ===")
print(f"dW1[0,0] Analytical: {dW1_analytical[0, 0]:.8f}")
print(f"dW1[0,0] Numerical:  {dW1_numerical_00:.8f}")
print(f"Relative difference: {abs(dW1_analytical[0,0] - dW1_numerical_00) / (abs(dW1_analytical[0,0]) + 1e-8):.2e}")
print("\nIf relative difference < 1e-5, backprop is correct!")

# Check multiple weights
print("\n--- Full W2 Gradient Check ---")
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        num = numerical_gradient(X, y, W1, b1, W2, b2, W2, i, j)
        ana = dW2_analytical[i, j]
        rel_diff = abs(ana - num) / (abs(ana) + 1e-8)
        status = "PASS" if rel_diff < 1e-5 else "FAIL"
        print(f"  W2[{i},{j}]: analytical={ana:.6f}, numerical={num:.6f}, diff={rel_diff:.2e} [{status}]")

Debugging tip: Always run gradient checking when implementing backprop from scratch. A relative difference greater than $10^{-5}$ usually indicates a bug in your gradient computation. Remember to disable gradient checking during actual training — it’s extremely slow!

Optimizers: Beyond Basic SGD

Vanilla Stochastic Gradient Descent (SGD) has several problems:

  • Oscillation in ravines (narrow valleys) where the gradient bounces back and forth
  • Same learning rate for all parameters, even when some need larger/smaller updates
  • Getting stuck in saddle points where gradients are near zero

Modern optimizers fix these issues. Let’s implement the three most important ones from scratch.

SGD with Momentum

Momentum adds a “velocity” term that accumulates past gradients, like a ball rolling downhill gaining speed:

$$v_t = \gamma v_{t-1} + \eta \nabla L$$

$$w_t = w_{t-1} - v_t$$

where $\gamma$ (typically 0.9) is the momentum coefficient.

RMSprop

RMSprop adapts the learning rate per-parameter by dividing by the running average of squared gradients:

$$s_t = \beta s_{t-1} + (1-\beta)(\nabla L)^2$$

$$w_t = w_{t-1} - \frac{\eta}{\sqrt{s_t} + \epsilon} \nabla L$$

Adam Optimizer

Adam combines the best of Momentum and RMSprop with bias correction:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla L \quad \text{(first moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)(\nabla L)^2 \quad \text{(second moment)}$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$

$$w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\hat{m}_t$$

import numpy as np
import matplotlib.pyplot as plt

# ============================================================
# Implementing Optimizers from Scratch
# ============================================================

# Test function: Rosenbrock (notoriously hard to optimize)
def rosenbrock(w):
    """f(x,y) = (1-x)^2 + 100*(y-x^2)^2"""
    return (1 - w[0])**2 + 100 * (w[1] - w[0]**2)**2

def rosenbrock_grad(w):
    """Gradient of Rosenbrock function."""
    dx = -2*(1 - w[0]) + 200*(w[1] - w[0]**2)*(-2*w[0])
    dy = 200*(w[1] - w[0]**2)
    return np.array([dx, dy])

class SGDMomentum:
    def __init__(self, lr=0.001, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None

    def step(self, w, grad):
        if self.velocity is None:
            self.velocity = np.zeros_like(w)
        self.velocity = self.momentum * self.velocity + self.lr * grad
        return w - self.velocity

class RMSprop:
    def __init__(self, lr=0.001, beta=0.9, epsilon=1e-8):
        self.lr = lr
        self.beta = beta
        self.epsilon = epsilon
        self.s = None

    def step(self, w, grad):
        if self.s is None:
            self.s = np.zeros_like(w)
        self.s = self.beta * self.s + (1 - self.beta) * grad**2
        return w - self.lr * grad / (np.sqrt(self.s) + self.epsilon)

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = None
        self.v = None
        self.t = 0

    def step(self, w, grad):
        if self.m is None:
            self.m = np.zeros_like(w)
            self.v = np.zeros_like(w)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2
        m_hat = self.m / (1 - self.beta1**self.t)
        v_hat = self.v / (1 - self.beta2**self.t)
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)

# --- Run all three optimizers ---
start = np.array([-1.5, 2.0])
n_steps = 500

optimizers = {
    'SGD + Momentum': SGDMomentum(lr=0.001, momentum=0.9),
    'RMSprop': RMSprop(lr=0.01),
    'Adam': Adam(lr=0.01)
}

paths = {}
loss_histories = {}

for name, opt in optimizers.items():
    w = start.copy()
    path = [w.copy()]
    losses = [rosenbrock(w)]

    for _ in range(n_steps):
        grad = rosenbrock_grad(w)
        w = opt.step(w, grad)
        path.append(w.copy())
        losses.append(rosenbrock(w))

    paths[name] = np.array(path)
    loss_histories[name] = losses
    print(f"{name:18s} -> Final position: ({w[0]:.4f}, {w[1]:.4f}), Loss: {rosenbrock(w):.6f}")

# --- Plot loss curves ---
plt.figure(figsize=(10, 5))
for name, losses in loss_histories.items():
    plt.plot(losses[:200], label=name, linewidth=2)
plt.xlabel('Step')
plt.ylabel('Loss (Rosenbrock)')
plt.title('Optimizer Comparison: Convergence Speed')
plt.legend()
plt.yscale('log')
plt.grid(True, alpha=0.3)
plt.show()
print("\nAdam typically converges fastest due to adaptive learning rates + momentum")

Adam is the default choice for most deep learning tasks. It adapts learning rates per-parameter (like RMSprop) and uses momentum for faster convergence. Default hyperparameters ($\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$) work well in most cases.

Learning Rate Impact

The learning rate ($\eta$) is arguably the most important hyperparameter in training. Too large and the optimizer overshoots the minimum (diverges). Too small and training takes forever (or gets stuck).

import numpy as np
import matplotlib.pyplot as plt

# ============================================================
# Learning Rate Comparison: Same Problem, Different Rates
# ============================================================

def loss_function(w):
    """Simple bowl: L(w) = w^2 + 0.5*sin(4w)"""
    return w**2 + 0.5 * np.sin(4 * w)

def loss_gradient(w):
    """Gradient: dL/dw = 2w + 2*cos(4w)"""
    return 2 * w + 2 * np.cos(4 * w)

learning_rates = [0.001, 0.01, 0.1, 0.5, 1.2]
colors = ['blue', 'green', 'orange', 'red', 'purple']
labels = ['0.001 (too slow)', '0.01 (slow)', '0.1 (good)', '0.5 (fast)', '1.2 (diverges)']

plt.figure(figsize=(12, 5))

# Plot 1: Loss over time
plt.subplot(1, 2, 1)
for lr, color, label in zip(learning_rates, colors, labels):
    w = 3.0
    losses = [loss_function(w)]
    for _ in range(50):
        grad = loss_gradient(w)
        w = w - lr * grad
        # Clip to prevent overflow
        w = np.clip(w, -10, 10)
        losses.append(loss_function(w))
    plt.plot(losses, color=color, linewidth=2, label=f'lr={label}')

plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Learning Rate Impact on Convergence')
plt.legend(fontsize=8)
plt.grid(True, alpha=0.3)
plt.ylim(-1, 15)

# Plot 2: Path on loss curve
plt.subplot(1, 2, 2)
w_range = np.linspace(-4, 4, 200)
plt.plot(w_range, loss_function(w_range), 'k-', linewidth=2, alpha=0.5, label='Loss curve')

for lr, color, label in zip([0.01, 0.1, 0.5], ['green', 'orange', 'red'], ['0.01', '0.1', '0.5']):
    w = 3.0
    path = [w]
    for _ in range(20):
        grad = loss_gradient(w)
        w = w - lr * grad
        w = np.clip(w, -10, 10)
        path.append(w)
    path = np.array(path)
    plt.plot(path, loss_function(path), 'o-', color=color, markersize=4, label=f'lr={label}')

plt.xlabel('Weight (w)')
plt.ylabel('Loss')
plt.title('Optimization Paths for Different Learning Rates')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Summary:")
print("  lr=0.001: Barely moves in 50 steps")
print("  lr=0.01:  Converges slowly but safely")
print("  lr=0.1:   Good balance of speed and stability")
print("  lr=0.5:   Oscillates but eventually converges")
print("  lr=1.2:   Diverges! Loss increases without bound")

Learning Rate Schedules

Rather than using a fixed learning rate, modern training uses schedules that decrease the rate over time:

import numpy as np
import matplotlib.pyplot as plt

# ============================================================
# Learning Rate Schedules
# ============================================================

epochs = 100
initial_lr = 0.1

# Step decay: reduce by factor every N epochs
def step_decay(epoch, initial_lr=0.1, drop=0.5, epochs_drop=20):
    return initial_lr * (drop ** (epoch // epochs_drop))

# Exponential decay
def exponential_decay(epoch, initial_lr=0.1, decay_rate=0.05):
    return initial_lr * np.exp(-decay_rate * epoch)

# Cosine annealing
def cosine_annealing(epoch, initial_lr=0.1, T_max=100):
    return initial_lr * (1 + np.cos(np.pi * epoch / T_max)) / 2

# Warmup + cosine decay
def warmup_cosine(epoch, initial_lr=0.1, warmup_epochs=10, total_epochs=100):
    if epoch < warmup_epochs:
        return initial_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return initial_lr * (1 + np.cos(np.pi * progress)) / 2

epoch_range = np.arange(epochs)

plt.figure(figsize=(10, 6))
plt.plot(epoch_range, [step_decay(e) for e in epoch_range], 'b-', linewidth=2, label='Step Decay')
plt.plot(epoch_range, [exponential_decay(e) for e in epoch_range], 'r-', linewidth=2, label='Exponential Decay')
plt.plot(epoch_range, [cosine_annealing(e) for e in epoch_range], 'g-', linewidth=2, label='Cosine Annealing')
plt.plot(epoch_range, [warmup_cosine(e) for e in epoch_range], 'm-', linewidth=2, label='Warmup + Cosine')

plt.xlabel('Epoch')
plt.ylabel('Learning Rate')
plt.title('Common Learning Rate Schedules')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Schedule recommendations:")
print("  Step Decay:       Simple, good baseline")
print("  Exponential:      Smooth decrease, common in older papers")
print("  Cosine Annealing: Popular in modern training (ResNets, Transformers)")
print("  Warmup + Cosine:  Best for Transformers, prevents early instability")

Best Practice: Choosing Learning Rates

The learning rate finder (Smith, 2017) is a practical technique:

  1. Start with a very small LR (e.g., $10^{-7}$)
  2. Gradually increase it each batch until loss explodes
  3. Pick the LR where loss was decreasing fastest (typically 1/10th of the rate where loss starts increasing)

Rules of thumb:

  • Adam: start with $\eta = 0.001$
  • SGD + Momentum: start with $\eta = 0.01$ to $0.1$
  • Always use warmup for Transformers
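
The range test above can be sketched on a toy quadratic loss. This is a minimal illustration, not the paper's implementation; the growth factor, stopping threshold, and back-off divisor below are illustrative choices:

```python
import numpy as np

# Toy loss: L(w) = mean(w^2), with gradient 2w
def loss_fn(w):
    return np.mean(w ** 2)

def grad_fn(w):
    return 2 * w

w = np.full(10, 3.0)   # start far from the minimum
lr = 1e-7              # step 1: begin with a very small LR
growth = 1.3           # step 2: multiply the LR each "batch"
lrs, losses = [], []

while lr < 10.0:
    current = loss_fn(w)
    lrs.append(lr)
    losses.append(current)
    if current > 4 * losses[0]:   # stop once the loss explodes
        break
    w = w - lr * grad_fn(w)
    lr *= growth

# Step 3: find the LR just before the steepest loss drop, back off 10x
steepest = int(np.argmin(np.diff(losses)))
suggested_lr = lrs[steepest] / 10
print(f"LR at steepest drop: {lrs[steepest]:.2e}")
print(f"Suggested LR:        {suggested_lr:.2e}")
```

In practice the same sweep is run on real mini-batches with the actual model, and the loss-versus-LR curve is plotted rather than reduced to a single number.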

What’s Next

You now understand the complete learning pipeline: loss functions measure error, gradient descent minimizes that error, backpropagation computes the necessary gradients, and optimizers accelerate convergence. These are the fundamental mechanisms that power every neural network ever trained.

In the next article, we’ll put all of this together and build a complete neural network from scratch — tackling the classic XOR problem and extending to real datasets, all without any deep learning framework.

Next in the Series

In Part 4: Build Your First Neural Network from Scratch, we’ll combine everything from Parts 1–3 into a working network that solves XOR, classifies data, and trains on real datasets — all in pure NumPy.