The XOR Problem: Why It Matters
In Part 1 of this series, we proved that a single perceptron cannot solve the XOR (exclusive or) problem. A single neuron can only learn linearly separable patterns — and XOR is fundamentally not linearly separable. This limitation, famously highlighted by Minsky and Papert in 1969, nearly killed neural network research for a decade.
The XOR truth table defines the problem: the output is 1 only when the two inputs differ.
import numpy as np

# XOR truth table
# Input pairs and their expected outputs
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

print("XOR Truth Table:")
print("-" * 25)
print(f"{'Input 1':<10}{'Input 2':<10}{'Output':<10}")
print("-" * 25)
for inputs, output in zip(X, y):
    print(f"{inputs[0]:<10}{inputs[1]:<10}{output[0]:<10}")
Visualizing the Data
When we plot the four data points, the impossibility of linear separation becomes visually obvious. Points of the same class sit at opposite corners of a square — no single straight line can divide them.
import numpy as np
import matplotlib.pyplot as plt

# XOR data points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Plot XOR points, one scatter call per class so the legend labels are correct
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], c='red', s=200,
            edgecolors='black', zorder=5, label='Class 0 (red)')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], c='blue', s=200,
            edgecolors='black', zorder=5, label='Class 1 (blue)')

# Label points
for xi, yi in zip(X, y):
    plt.annotate(f"({xi[0]},{xi[1]}) -> {yi}",
                 xy=(xi[0], xi[1]),
                 xytext=(xi[0] + 0.05, xi[1] + 0.08),
                 fontsize=10)

# Show that no single line separates the classes
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Any line fails')
plt.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)

plt.xlim(-0.3, 1.5)
plt.ylim(-0.3, 1.5)
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('XOR Problem: Not Linearly Separable')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Designing the Network Architecture
To solve XOR, we need at least one hidden layer. Our network will have:
- Input layer: 2 neurons (one for each input feature)
- Hidden layer: 4 neurons (overcomplete representation)
- Output layer: 1 neuron (binary classification output)
Why 4 hidden neurons? While 2 hidden neurons are the theoretical minimum to solve XOR, using 4 provides an overcomplete representation that makes training faster and more reliable. With more neurons, the network has multiple paths to find a solution.
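In fact, two hidden neurons suffice if you pick the weights by hand. The sketch below uses hand-picked weights and a hard threshold purely for illustration (this is not something our network learns): one hidden neuron acts as OR, the other as AND, and the output fires when OR is true but AND is not — which is exactly XOR.

import numpy as np

step = lambda z: (z > 0).astype(int)  # hard threshold, for illustration only

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked weights: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])   # OR fires when x1 + x2 > 0.5; AND when > 1.5

# Output ~ h1 AND NOT h2: fires when h1 - h2 > 0.5
W2 = np.array([[1.0], [-1.0]])
b2 = np.array([-0.5])

h = step(X @ W1 + b1)
out = step(h @ W2 + b2)
print(out.ravel())  # [0 1 1 0] -- XOR

Training has to discover an arrangement like this on its own, which is why the extra neurons help: they give gradient descent more chances to stumble onto a working configuration.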
flowchart LR
    subgraph Input["Input Layer (2)"]
        x1["x1"]
        x2["x2"]
    end
    subgraph Hidden["Hidden Layer (4)"]
        h1["h1"]
        h2["h2"]
        h3["h3"]
        h4["h4"]
    end
    subgraph Output["Output Layer (1)"]
        o1["y"]
    end
    x1 --> h1
    x1 --> h2
    x1 --> h3
    x1 --> h4
    x2 --> h1
    x2 --> h2
    x2 --> h3
    x2 --> h4
    h1 --> o1
    h2 --> o1
    h3 --> o1
    h4 --> o1
Weight Matrix Shapes
Understanding the shape of each weight matrix is crucial for implementing the network correctly:
- $W_1$ — shape (2, 4): connects 2 inputs to 4 hidden neurons
- $b_1$ — shape (1, 4): one bias per hidden neuron (kept 2-D so it broadcasts over the batch)
- $W_2$ — shape (4, 1): connects 4 hidden neurons to 1 output
- $b_2$ — shape (1, 1): one bias for the output neuron
Total trainable parameters: $(2 \times 4) + 4 + (4 \times 1) + 1 = 17$
The forward pass computes:
- $z_1 = X \cdot W_1 + b_1$ — linear transform to hidden layer
- $a_1 = \sigma(z_1)$ — sigmoid activation on hidden layer
- $z_2 = a_1 \cdot W_2 + b_2$ — linear transform to output
- $a_2 = \sigma(z_2)$ — sigmoid activation on output (final prediction)
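Before writing the class, here is a minimal sketch of those four equations in raw NumPy (random weights, variable names chosen to match the math) so you can watch the shapes flow through:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])        # (4, 2): all four XOR inputs

W1, b1 = rng.standard_normal((2, 4)), np.zeros((1, 4))  # (2, 4), (1, 4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros((1, 1))  # (4, 1), (1, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z1 = X @ W1 + b1    # (4, 2) @ (2, 4) -> (4, 4): linear transform to hidden layer
a1 = sigmoid(z1)    # (4, 4): hidden activations
z2 = a1 @ W2 + b2   # (4, 4) @ (4, 1) -> (4, 1): linear transform to output
a2 = sigmoid(z2)    # (4, 1): one prediction per sample
print(z1.shape, a1.shape, z2.shape, a2.shape)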
Implementing the Network Class
Now let’s implement the complete neural network. Every line is carefully commented so you can follow the math from the previous sections.
import numpy as np

class NeuralNetwork:
    """A simple 2-layer neural network for binary classification."""

    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        """Initialize weights using Xavier initialization."""
        self.lr = learning_rate
        # Xavier initialization: scale weights by sqrt(1/fan_in)
        # This keeps activations in a reasonable range
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        """Sigmoid activation function: 1 / (1 + exp(-z))"""
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        """Derivative of sigmoid: a * (1 - a) where a = sigmoid(z)"""
        return a * (1.0 - a)

    def forward(self, X):
        """Forward pass: compute predictions step by step."""
        # Hidden layer
        self.z1 = np.dot(X, self.W1) + self.b1        # (N, 4)
        self.a1 = self.sigmoid(self.z1)               # (N, 4)
        # Output layer
        self.z2 = np.dot(self.a1, self.W2) + self.b2  # (N, 1)
        self.a2 = self.sigmoid(self.z2)               # (N, 1)
        return self.a2

    def backward(self, X, y):
        """Backward pass: compute gradients using the chain rule."""
        m = X.shape[0]  # number of samples
        # Output layer error
        # dL/da2 * da2/dz2 = (a2 - y) for binary cross-entropy + sigmoid
        dz2 = self.a2 - y                             # (N, 1)
        dW2 = np.dot(self.a1.T, dz2) / m              # (4, 1)
        db2 = np.sum(dz2, axis=0, keepdims=True) / m  # (1, 1)
        # Hidden layer error (backpropagate through W2)
        da1 = np.dot(dz2, self.W2.T)                  # (N, 4)
        dz1 = da1 * self.sigmoid_derivative(self.a1)  # (N, 4)
        dW1 = np.dot(X.T, dz1) / m                    # (2, 4)
        db1 = np.sum(dz1, axis=0, keepdims=True) / m  # (1, 4)
        # Update weights using gradient descent
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        """Binary cross-entropy loss."""
        epsilon = 1e-8  # prevent log(0)
        loss = -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )
        return loss

    def train(self, X, y, epochs=10000, print_every=1000):
        """Train the network, printing loss at intervals."""
        losses = []
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            # Compute and store loss
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            # Backward pass (updates weights)
            self.backward(X, y)
            # Print progress
            if epoch % print_every == 0:
                print(f"Epoch {epoch:>5d} | Loss: {loss:.6f}")
        return losses

# Create and display the network
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
print("Network created!")
print(f"W1 shape: {nn.W1.shape}")
print(f"b1 shape: {nn.b1.shape}")
print(f"W2 shape: {nn.W2.shape}")
print(f"b2 shape: {nn.b2.shape}")
print(f"Total parameters: {nn.W1.size + nn.b1.size + nn.W2.size + nn.b2.size}")
Forward Pass Explained
The forward pass transforms inputs through the network layer by layer. Each layer applies a linear transformation ($z = Wx + b$) followed by a non-linear activation ($a = \sigma(z)$). Without the non-linearity, stacking layers would be pointless — multiple linear transformations collapse into a single linear transformation.
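That claim is easy to verify numerically. A minimal sketch (the matrices are arbitrary): two stacked linear layers produce exactly the same outputs as a single linear layer with merged weights $W_1 W_2$ and bias $b_1 W_2 + b_2$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
W1, b1 = rng.standard_normal((2, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal(1)

# Two linear layers with no activation in between...
two_layers = (X @ W1 + b1) @ W2 + b2

# ...collapse into one linear layer: W = W1 @ W2, b = b1 @ W2 + b2
one_layer = X @ (W1 @ W2) + (b1 @ W2 + b2)

print(np.allclose(two_layers, one_layer))  # True

No matter how many linear layers you stack, the result is still linear in X — the sigmoid between layers is what buys us non-linear decision boundaries.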
Backward Pass Explained
The backward pass uses the chain rule to compute how much each weight contributed to the error. We work backwards from the output (the sketch after this list verifies these formulas numerically):
- Compute the output error: $\delta_2 = a_2 - y$
- Compute output layer gradients: $\frac{\partial L}{\partial W_2} = \frac{1}{m} a_1^T \cdot \delta_2$
- Propagate the error to the hidden layer: $\delta_1 = (\delta_2 \cdot W_2^T) \odot \sigma'(z_1)$
- Compute hidden layer gradients: $\frac{\partial L}{\partial W_1} = \frac{1}{m} X^T \cdot \delta_1$

Here $m$ is the number of samples; the $\frac{1}{m}$ factors match the mean in the cross-entropy loss and the division by `m` in the code.
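If you want to convince yourself the chain-rule algebra is right, a finite-difference gradient check is the standard tool. This sketch assumes the NeuralNetwork class and the XOR arrays X and y from above are in scope; it compares the analytic gradient for one entry of W2 against a numerical estimate.

import numpy as np

np.random.seed(0)
net = NeuralNetwork()
net.forward(X)  # caches a1 and a2

# Analytic gradient for W2 (same math as backward(), but without the update)
m = X.shape[0]
dz2 = net.a2 - y
dW2 = np.dot(net.a1.T, dz2) / m

# Numerical gradient: nudge W2[0, 0] and measure the change in loss
eps = 1e-6
net.W2[0, 0] += eps
loss_plus = net.compute_loss(net.forward(X), y)
net.W2[0, 0] -= 2 * eps
loss_minus = net.compute_loss(net.forward(X), y)
net.W2[0, 0] += eps  # restore the original weight

numerical = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic: {dW2[0, 0]:.8f}  numerical: {numerical:.8f}")
# The two values should agree to several decimal places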
Training Step by Step
Let’s train our network and observe how it learns. We’ll watch the predictions evolve from random guesses to near-perfect solutions.
import numpy as np

class NeuralNetwork:
    """A simple 2-layer neural network for binary classification."""

    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

    def train(self, X, y, epochs=10000, print_every=1000):
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            self.backward(X, y)
            if epoch % print_every == 0:
                preds = (y_pred > 0.5).astype(int)
                accuracy = np.mean(preds == y) * 100
                print(f"Epoch {epoch:>5d} | Loss: {loss:.6f} | Accuracy: {accuracy:.0f}%")
        return losses

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Train the network
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
print("=" * 50)
print("Training Neural Network on XOR")
print("=" * 50)
losses = nn.train(X, y, epochs=10001, print_every=1000)  # 10001 so epoch 10000 is printed

# Final predictions
print("\n" + "=" * 50)
print("Final Predictions:")
print("=" * 50)
predictions = nn.forward(X)
for inputs, pred, target in zip(X, predictions, y):
    verdict = 'CORRECT' if round(pred[0]) == target[0] else 'WRONG'
    print(f"Input: {inputs} | Predicted: {pred[0]:.4f} | Target: {target[0]} | {verdict}")
Visualizing the Loss Curve
The loss curve reveals how learning progresses. Notice the characteristic shape: rapid initial improvement followed by slower refinement as the network fine-tunes its weights.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# Train and collect losses
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
losses = []
for epoch in range(10000):
    y_pred = nn.forward(X)
    loss = nn.compute_loss(y_pred, y)
    losses.append(loss)
    nn.backward(X, y)

# Plot loss curve
plt.figure(figsize=(10, 5))
plt.plot(losses, color='teal', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Binary Cross-Entropy Loss')
plt.title('Training Loss Curve - XOR Problem')
plt.grid(True, alpha=0.3)
plt.axhline(y=0.01, color='red', linestyle='--', alpha=0.5, label='Near-zero loss')
plt.legend()
plt.tight_layout()
plt.show()

print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.6f}")
print(f"Loss reduction: {((losses[0] - losses[-1]) / losses[0] * 100):.1f}%")
Visualizing Decision Boundaries
The most revealing visualization is the decision boundary — the regions of input space that the network classifies as 0 or 1. For XOR, the network must learn a non-linear boundary that wraps around the class-1 points.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# Train the network
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
for epoch in range(10000):
    nn.forward(X)  # forward pass caches activations for backward
    nn.backward(X, y)

# Create meshgrid for decision boundary visualization
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
                     np.linspace(-0.5, 1.5, 200))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Get predictions for entire grid
Z = nn.forward(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(8, 7))
plt.contourf(xx, yy, Z, levels=50, cmap='RdBu_r', alpha=0.8)
plt.colorbar(label='Network Output')
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Plot XOR data points
colors = ['red' if label == 0 else 'blue' for label in y.flatten()]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=200, edgecolors='black',
            linewidths=2, zorder=5)

# Annotate points
labels_text = ['(0,0)=0', '(0,1)=1', '(1,0)=1', '(1,1)=0']
for i, txt in enumerate(labels_text):
    plt.annotate(txt, (X[i, 0], X[i, 1]),
                 xytext=(5, 10), textcoords='offset points',
                 fontsize=10, fontweight='bold')

plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('Learned Decision Boundary for XOR')
plt.tight_layout()
plt.show()
How the Hidden Layer Transforms Space
The hidden layer's job is to remap the inputs into a space where the classes become linearly separable. Think of it this way: one hidden neuron might learn "is the input in the top-left region?" (near (0,1)) and another might learn "is the input in the bottom-right region?" (near (1,0)). The output neuron then simply fires when either detector fires — which is exactly XOR logic, but implemented with smooth sigmoids instead of hard thresholds.
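You can watch this remapping directly by printing the hidden activations for the four inputs. This sketch reuses the trained nn and the arrays X from the decision-boundary code above; note that which neuron plays which role varies from run to run.

import numpy as np

# Assumes `nn` and `X` from the decision-boundary code above are in scope
hidden = nn.sigmoid(np.dot(X, nn.W1) + nn.b1)  # (4, 4) hidden activations

print("input    hidden activations        output")
for xi, h, o in zip(X, hidden, nn.forward(X)):
    h_str = " ".join(f"{v:.2f}" for v in h)
    print(f"{xi}   [{h_str}]   {o[0]:.3f}")
# The two class-1 inputs, (0,1) and (1,0), end up with hidden patterns
# that the output layer can separate with a single linear boundary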
Experiments: What Happens When You Change Things
Now that we have a working network, let’s experiment. Each code block below is completely independent — copy and run any of them to explore different aspects of the network.
Experiment 1: Different Numbers of Hidden Neurons
How does the number of hidden neurons affect the network’s ability to learn XOR?
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=2.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Test different hidden layer sizes
hidden_sizes = [2, 4, 8]
num_trials = 5
epochs = 10000

print("Effect of Hidden Layer Size on XOR Learning")
print("=" * 60)
print(f"{'Hidden Neurons':<16}{'Avg Final Loss':<18}{'Success Rate':<15}")
print("-" * 60)

for h_size in hidden_sizes:
    final_losses = []
    successes = 0
    for trial in range(num_trials):
        np.random.seed(trial * 10 + h_size)
        nn = NeuralNetwork(input_size=2, hidden_size=h_size, learning_rate=2.0)
        for epoch in range(epochs):
            nn.forward(X)  # caches activations for backward
            nn.backward(X, y)
        final_pred = nn.forward(X)
        final_loss = nn.compute_loss(final_pred, y)
        final_losses.append(final_loss)
        # Check if all predictions are correct
        preds = (final_pred > 0.5).astype(int)
        if np.all(preds == y):
            successes += 1
    avg_loss = np.mean(final_losses)
    success_rate = successes / num_trials * 100
    print(f"{h_size:<16}{avg_loss:<18.6f}{success_rate:.0f}%")
Experiment 2: Different Learning Rates
The learning rate $\eta$ controls how big each weight update step is. Too small and training is slow; too large and it overshoots the minimum.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Compare learning rates
learning_rates = [0.1, 1.0, 5.0]
colors_lr = ['blue', 'green', 'red']
epochs = 10000

plt.figure(figsize=(10, 6))
for lr, color in zip(learning_rates, colors_lr):
    np.random.seed(42)
    nn = NeuralNetwork(input_size=2, hidden_size=4, learning_rate=lr)
    losses = []
    for epoch in range(epochs):
        y_pred = nn.forward(X)
        loss = nn.compute_loss(y_pred, y)
        losses.append(loss)
        nn.backward(X, y)
    plt.plot(losses, color=color, label=f'lr = {lr}', linewidth=1.5)
    print(f"Learning rate {lr}: Final loss = {losses[-1]:.6f}")

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Effect of Learning Rate on Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, 1.0)
plt.tight_layout()
plt.show()
Experiment 3: Different Activation Functions
The choice of activation function affects both the speed of learning and the shape of decision boundaries.
import numpy as np
import matplotlib.pyplot as plt

class FlexibleNetwork:
    """Neural network with configurable activation function."""

    def __init__(self, hidden_size=4, learning_rate=2.0, activation='sigmoid'):
        self.lr = learning_rate
        self.activation = activation
        self.W1 = np.random.randn(2, hidden_size) * np.sqrt(1.0 / 2)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, 1) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, 1))

    def activate(self, z):
        if self.activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif self.activation == 'tanh':
            return np.tanh(z)
        elif self.activation == 'relu':
            return np.maximum(0, z)

    def activate_derivative(self, a, z=None):
        if self.activation == 'sigmoid':
            return a * (1.0 - a)
        elif self.activation == 'tanh':
            return 1.0 - a ** 2
        elif self.activation == 'relu':
            return (z > 0).astype(float)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.activate(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        # Output always uses sigmoid for binary classification
        self.a2 = 1.0 / (1.0 + np.exp(-np.clip(self.z2, -500, 500)))
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.activate_derivative(self.a1, self.z1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Compare activation functions
activations = ['sigmoid', 'tanh', 'relu']
colors_act = ['blue', 'green', 'orange']
epochs = 10000

plt.figure(figsize=(10, 6))
for act, color in zip(activations, colors_act):
    np.random.seed(42)
    nn = FlexibleNetwork(hidden_size=4, learning_rate=2.0, activation=act)
    losses = []
    for epoch in range(epochs):
        y_pred = nn.forward(X)
        loss = nn.compute_loss(y_pred, y)
        losses.append(loss)
        nn.backward(X, y)
    plt.plot(losses, color=color, label=f'{act}', linewidth=1.5)
    # Check final accuracy
    final_pred = nn.forward(X)
    preds = (final_pred > 0.5).astype(int)
    acc = np.mean(preds == y) * 100
    print(f"{act:>8s}: Final loss = {losses[-1]:.6f}, Accuracy = {acc:.0f}%")

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Activation Function Comparison on XOR')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, 0.8)
plt.tight_layout()
plt.show()
A few takeaways from this comparison:
- Tanh often converges faster than sigmoid because its output is centered around zero, leading to larger gradients early in training.
- ReLU can be fast but suffers from “dead neurons” — neurons that output zero and never recover because the gradient is zero for negative inputs (see the sketch after this list).
- Sigmoid is reliable but can suffer from vanishing gradients in deeper networks (less of an issue in our 2-layer network).
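Here is a minimal illustration of the dead-ReLU effect — a toy single neuron with hand-picked values, not taken from the network above. Once the pre-activation is negative for every input, the gradient through the ReLU is zero, so gradient descent can never move the weights again.

import numpy as np

relu = lambda z: np.maximum(0, z)
relu_grad = lambda z: (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# A "dead" neuron: a large negative bias pushes z below zero for all inputs
w, b = np.array([0.1, 0.1]), -5.0
z = X @ w + b
print("activations:", relu(z))        # all zeros
print("gradients:  ", relu_grad(z))   # all zeros -> no weight update, ever

This is one reason initialization matters: a bad draw can strand a ReLU neuron in the dead zone from the very first epoch.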
What’s Next
Congratulations! You’ve built a fully working neural network from scratch. You now understand:
- Why a hidden layer is necessary for non-linear problems
- How to initialize, forward-propagate, and backpropagate through a multi-layer network
- How training transforms random predictions into accurate solutions
- How different hyperparameters (hidden size, learning rate, activation function) affect learning
This XOR network is small, but every concept scales directly to networks with millions of parameters. The difference between this and GPT-4 is scale, not principle.
Next in the Series
In Part 5: Neural Network Architectures Overview, we’ll survey the landscape of neural network architectures — from feedforward networks to CNNs, RNNs, Transformers, and beyond — understanding when and why each design is used.