The XOR Problem: Why It Matters
In Part 1 of this series, we proved that a single perceptron cannot solve the XOR (exclusive or) problem. A single neuron can only learn linearly separable patterns — and XOR is fundamentally not linearly separable. This limitation, famously highlighted by Minsky and Papert in 1969, nearly killed neural network research for a decade.
The XOR truth table defines the problem: the output is 1 only when the two inputs differ.
import numpy as np

# XOR truth table
# Input pairs and their expected outputs
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])
y = np.array([[0],
              [1],
              [1],
              [0]])

print("XOR Truth Table:")
print("-" * 25)
print(f"{'Input 1':<10}{'Input 2':<10}{'Output':<10}")
print("-" * 25)
for inputs, output in zip(X, y):
    print(f"{inputs[0]:<10}{inputs[1]:<10}{output[0]:<10}")
Visualizing the Data
When we plot the four data points, the impossibility of linear separation becomes visually obvious. Points of the same class sit at opposite corners of a square — no single straight line can divide them.
import numpy as np
import matplotlib.pyplot as plt

# XOR data points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Plot XOR points, one scatter call per class so the legend labels are correct
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], c='red', s=200,
            edgecolors='black', zorder=5, label='Class 0 (red)')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], c='blue', s=200,
            edgecolors='black', zorder=5, label='Class 1 (blue)')

# Label points
for xi, yi in zip(X, y):
    plt.annotate(f"({xi[0]},{xi[1]}) -> {yi}",
                 xy=(xi[0], xi[1]),
                 xytext=(xi[0] + 0.05, xi[1] + 0.08),
                 fontsize=10)

# Show that no single line separates the classes
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Any line fails')
plt.axvline(x=0.5, color='gray', linestyle='--', alpha=0.5)

plt.xlim(-0.3, 1.5)
plt.ylim(-0.3, 1.5)
plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('XOR Problem: Not Linearly Separable')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Designing the Network Architecture
To solve XOR, we need at least one hidden layer. Our network will have:
- Input layer: 2 neurons (one for each input feature)
- Hidden layer: 4 neurons (overcomplete representation)
- Output layer: 1 neuron (binary classification output)
Why 4 hidden neurons? While 2 hidden neurons are the theoretical minimum to solve XOR, using 4 provides an overcomplete representation that makes training faster and more reliable. With more neurons, the network has multiple paths to find a solution.
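In fact, two hidden neurons suffice if you pick the weights by hand. The sketch below uses hand-picked weights and a hard threshold purely for illustration (this is not something our network learns): one hidden neuron acts as OR, the other as AND, and the output fires when OR is true but AND is not — which is exactly XOR.

import numpy as np

step = lambda z: (z > 0).astype(int)  # hard threshold, for illustration only

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked weights: h1 ~ OR(x1, x2), h2 ~ AND(x1, x2)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])   # OR fires when x1 + x2 > 0.5; AND when > 1.5

# Output ~ h1 AND NOT h2: fires when h1 - h2 > 0.5
W2 = np.array([[1.0], [-1.0]])
b2 = np.array([-0.5])

h = step(X @ W1 + b1)
out = step(h @ W2 + b2)
print(out.ravel())  # [0 1 1 0] -- XOR

Training has to discover an arrangement like this on its own, which is why the extra neurons help: they give gradient descent more chances to stumble onto a working configuration.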
flowchart LR
    subgraph Input["Input Layer (2)"]
        x1["x1"]
        x2["x2"]
    end
    subgraph Hidden["Hidden Layer (4)"]
        h1["h1"]
        h2["h2"]
        h3["h3"]
        h4["h4"]
    end
    subgraph Output["Output Layer (1)"]
        o1["y"]
    end
    x1 --> h1
    x1 --> h2
    x1 --> h3
    x1 --> h4
    x2 --> h1
    x2 --> h2
    x2 --> h3
    x2 --> h4
    h1 --> o1
    h2 --> o1
    h3 --> o1
    h4 --> o1
Weight Matrix Shapes
Understanding the shape of each weight matrix is crucial for implementing the network correctly:
- $W_1$ — shape (2, 4): connects 2 inputs to 4 hidden neurons
- $b_1$ — shape (1, 4): one bias per hidden neuron (kept 2-D so it broadcasts over the batch)
- $W_2$ — shape (4, 1): connects 4 hidden neurons to 1 output
- $b_2$ — shape (1, 1): one bias for the output neuron
Total trainable parameters: $(2 \times 4) + 4 + (4 \times 1) + 1 = 17$
The forward pass computes:
- $z_1 = X \cdot W_1 + b_1$ — linear transform to hidden layer
- $a_1 = \sigma(z_1)$ — sigmoid activation on hidden layer
- $z_2 = a_1 \cdot W_2 + b_2$ — linear transform to output
- $a_2 = \sigma(z_2)$ — sigmoid activation on output (final prediction)
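Before writing the class, here is a minimal sketch of those four equations in raw NumPy (random weights, variable names chosen to match the math) so you can watch the shapes flow through:

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])        # (4, 2): all four XOR inputs

W1, b1 = rng.standard_normal((2, 4)), np.zeros((1, 4))  # (2, 4), (1, 4)
W2, b2 = rng.standard_normal((4, 1)), np.zeros((1, 1))  # (4, 1), (1, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z1 = X @ W1 + b1    # (4, 2) @ (2, 4) -> (4, 4): linear transform to hidden layer
a1 = sigmoid(z1)    # (4, 4): hidden activations
z2 = a1 @ W2 + b2   # (4, 4) @ (4, 1) -> (4, 1): linear transform to output
a2 = sigmoid(z2)    # (4, 1): one prediction per sample
print(z1.shape, a1.shape, z2.shape, a2.shape)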
Implementing the Network Class
Now let’s implement the complete neural network. Every line is carefully commented so you can follow the math from the previous sections.
import numpy as np

class NeuralNetwork:
    """A simple 2-layer neural network for binary classification."""

    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        """Initialize weights using Xavier initialization."""
        self.lr = learning_rate
        # Xavier initialization: scale weights by sqrt(1/fan_in)
        # This keeps activations in a reasonable range
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        """Sigmoid activation function: 1 / (1 + exp(-z))"""
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        """Derivative of sigmoid: a * (1 - a) where a = sigmoid(z)"""
        return a * (1.0 - a)

    def forward(self, X):
        """Forward pass: compute predictions step by step."""
        # Hidden layer
        self.z1 = np.dot(X, self.W1) + self.b1        # (N, 4)
        self.a1 = self.sigmoid(self.z1)               # (N, 4)
        # Output layer
        self.z2 = np.dot(self.a1, self.W2) + self.b2  # (N, 1)
        self.a2 = self.sigmoid(self.z2)               # (N, 1)
        return self.a2

    def backward(self, X, y):
        """Backward pass: compute gradients using the chain rule."""
        m = X.shape[0]  # number of samples
        # Output layer error
        # dL/da2 * da2/dz2 = (a2 - y) for binary cross-entropy + sigmoid
        dz2 = self.a2 - y                             # (N, 1)
        dW2 = np.dot(self.a1.T, dz2) / m              # (4, 1)
        db2 = np.sum(dz2, axis=0, keepdims=True) / m  # (1, 1)
        # Hidden layer error (backpropagate through W2)
        da1 = np.dot(dz2, self.W2.T)                  # (N, 4)
        dz1 = da1 * self.sigmoid_derivative(self.a1)  # (N, 4)
        dW1 = np.dot(X.T, dz1) / m                    # (2, 4)
        db1 = np.sum(dz1, axis=0, keepdims=True) / m  # (1, 4)
        # Update weights using gradient descent
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        """Binary cross-entropy loss."""
        epsilon = 1e-8  # prevent log(0)
        loss = -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )
        return loss

    def train(self, X, y, epochs=10000, print_every=1000):
        """Train the network, printing loss at intervals."""
        losses = []
        for epoch in range(epochs):
            # Forward pass
            y_pred = self.forward(X)
            # Compute and store loss
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            # Backward pass (updates weights)
            self.backward(X, y)
            # Print progress
            if epoch % print_every == 0:
                print(f"Epoch {epoch:>5d} | Loss: {loss:.6f}")
        return losses

# Create and display the network
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
print("Network created!")
print(f"W1 shape: {nn.W1.shape}")
print(f"b1 shape: {nn.b1.shape}")
print(f"W2 shape: {nn.W2.shape}")
print(f"b2 shape: {nn.b2.shape}")
print(f"Total parameters: {nn.W1.size + nn.b1.size + nn.W2.size + nn.b2.size}")
Forward Pass Explained
The forward pass transforms inputs through the network layer by layer. Each layer applies a linear transformation ($z = Wx + b$) followed by a non-linear activation ($a = \sigma(z)$). Without the non-linearity, stacking layers would be pointless — multiple linear transformations collapse into a single linear transformation.
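That claim is easy to verify numerically. A minimal sketch (the matrices are arbitrary): two stacked linear layers produce exactly the same outputs as a single linear layer with merged weights $W_1 W_2$ and bias $b_1 W_2 + b_2$.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
W1, b1 = rng.standard_normal((2, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 1)), rng.standard_normal(1)

# Two linear layers with no activation in between...
two_layers = (X @ W1 + b1) @ W2 + b2

# ...collapse into one linear layer: W = W1 @ W2, b = b1 @ W2 + b2
one_layer = X @ (W1 @ W2) + (b1 @ W2 + b2)

print(np.allclose(two_layers, one_layer))  # True

No matter how many linear layers you stack, the result is still linear in X — the sigmoid between layers is what buys us non-linear decision boundaries.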
Backward Pass Explained
The backward pass uses the chain rule to compute how much each weight contributed to the error. We work backwards from the output (the sketch after this list verifies these formulas numerically):
- Compute the output error: $\delta_2 = a_2 - y$
- Compute output layer gradients: $\frac{\partial L}{\partial W_2} = \frac{1}{m} a_1^T \cdot \delta_2$
- Propagate the error to the hidden layer: $\delta_1 = (\delta_2 \cdot W_2^T) \odot \sigma'(z_1)$
- Compute hidden layer gradients: $\frac{\partial L}{\partial W_1} = \frac{1}{m} X^T \cdot \delta_1$

Here $m$ is the number of samples; the $\frac{1}{m}$ factors match the mean in the cross-entropy loss and the division by `m` in the code.
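If you want to convince yourself the chain-rule algebra is right, a finite-difference gradient check is the standard tool. This sketch assumes the NeuralNetwork class and the XOR arrays X and y from above are in scope; it compares the analytic gradient for one entry of W2 against a numerical estimate.

import numpy as np

np.random.seed(0)
net = NeuralNetwork()
net.forward(X)  # caches a1 and a2

# Analytic gradient for W2 (same math as backward(), but without the update)
m = X.shape[0]
dz2 = net.a2 - y
dW2 = np.dot(net.a1.T, dz2) / m

# Numerical gradient: nudge W2[0, 0] and measure the change in loss
eps = 1e-6
net.W2[0, 0] += eps
loss_plus = net.compute_loss(net.forward(X), y)
net.W2[0, 0] -= 2 * eps
loss_minus = net.compute_loss(net.forward(X), y)
net.W2[0, 0] += eps  # restore the original weight

numerical = (loss_plus - loss_minus) / (2 * eps)
print(f"analytic: {dW2[0, 0]:.8f}  numerical: {numerical:.8f}")
# The two values should agree to several decimal places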
Training Step by Step
Let’s train our network and observe how it learns. We’ll watch the predictions evolve from random guesses to near-perfect solutions.
import numpy as np

class NeuralNetwork:
    """A simple 2-layer neural network for binary classification."""

    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

    def train(self, X, y, epochs=10000, print_every=1000):
        losses = []
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.compute_loss(y_pred, y)
            losses.append(loss)
            self.backward(X, y)
            if epoch % print_every == 0:
                preds = (y_pred > 0.5).astype(int)
                accuracy = np.mean(preds == y) * 100
                print(f"Epoch {epoch:>5d} | Loss: {loss:.6f} | Accuracy: {accuracy:.0f}%")
        return losses

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Train the network
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
print("=" * 50)
print("Training Neural Network on XOR")
print("=" * 50)
losses = nn.train(X, y, epochs=10001, print_every=1000)  # 10001 so epoch 10000 is printed

# Final predictions
print("\n" + "=" * 50)
print("Final Predictions:")
print("=" * 50)
predictions = nn.forward(X)
for inputs, pred, target in zip(X, predictions, y):
    verdict = 'CORRECT' if round(pred[0]) == target[0] else 'WRONG'
    print(f"Input: {inputs} | Predicted: {pred[0]:.4f} | Target: {target[0]} | {verdict}")
Visualizing the Loss Curve
The loss curve reveals how learning progresses. Notice the characteristic shape: rapid initial improvement followed by slower refinement as the network fine-tunes its weights.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# Train and collect losses
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
losses = []
for epoch in range(10000):
    y_pred = nn.forward(X)
    loss = nn.compute_loss(y_pred, y)
    losses.append(loss)
    nn.backward(X, y)

# Plot loss curve
plt.figure(figsize=(10, 5))
plt.plot(losses, color='teal', linewidth=1.5)
plt.xlabel('Epoch')
plt.ylabel('Binary Cross-Entropy Loss')
plt.title('Training Loss Curve - XOR Problem')
plt.grid(True, alpha=0.3)
plt.axhline(y=0.01, color='red', linestyle='--', alpha=0.5, label='Near-zero loss')
plt.legend()
plt.tight_layout()
plt.show()

print(f"Initial loss: {losses[0]:.4f}")
print(f"Final loss: {losses[-1]:.6f}")
print(f"Loss reduction: {((losses[0] - losses[-1]) / losses[0] * 100):.1f}%")
Visualizing Decision Boundaries
The most revealing visualization is the decision boundary — the regions of input space that the network classifies as 0 or 1. For XOR, the network must learn a non-linear boundary that wraps around the class-1 points.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# Train the network
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])
np.random.seed(42)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=2.0)
for epoch in range(10000):
    nn.forward(X)  # forward pass caches activations for backward
    nn.backward(X, y)

# Create meshgrid for decision boundary visualization
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
                     np.linspace(-0.5, 1.5, 200))
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Get predictions for entire grid
Z = nn.forward(grid_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.figure(figsize=(8, 7))
plt.contourf(xx, yy, Z, levels=50, cmap='RdBu_r', alpha=0.8)
plt.colorbar(label='Network Output')
plt.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=2)

# Plot XOR data points
colors = ['red' if label == 0 else 'blue' for label in y.flatten()]
plt.scatter(X[:, 0], X[:, 1], c=colors, s=200, edgecolors='black',
            linewidths=2, zorder=5)

# Annotate points
labels_text = ['(0,0)=0', '(0,1)=1', '(1,0)=1', '(1,1)=0']
for i, txt in enumerate(labels_text):
    plt.annotate(txt, (X[i, 0], X[i, 1]),
                 xytext=(5, 10), textcoords='offset points',
                 fontsize=10, fontweight='bold')

plt.xlabel('Input 1')
plt.ylabel('Input 2')
plt.title('Learned Decision Boundary for XOR')
plt.tight_layout()
plt.show()
How the Hidden Layer Transforms Space
The hidden layer's job is to remap the inputs into a space where the classes become linearly separable. Think of it this way: one hidden neuron might learn "is the input in the top-left region?" (near (0,1)) and another might learn "is the input in the bottom-right region?" (near (1,0)). The output neuron then simply fires when either detector fires — which is exactly XOR logic, but implemented with smooth sigmoids instead of hard thresholds.
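You can watch this remapping directly by printing the hidden activations for the four inputs. This sketch reuses the trained nn and the arrays X from the decision-boundary code above; note that which neuron plays which role varies from run to run.

import numpy as np

# Assumes `nn` and `X` from the decision-boundary code above are in scope
hidden = nn.sigmoid(np.dot(X, nn.W1) + nn.b1)  # (4, 4) hidden activations

print("input    hidden activations        output")
for xi, h, o in zip(X, hidden, nn.forward(X)):
    h_str = " ".join(f"{v:.2f}" for v in h)
    print(f"{xi}   [{h_str}]   {o[0]:.3f}")
# The two class-1 inputs, (0,1) and (1,0), end up with hidden patterns
# that the output layer can separate with a single linear boundary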
Experiments: What Happens When You Change Things
Now that we have a working network, let’s experiment. Each code block below is completely independent — copy and run any of them to explore different aspects of the network.
Experiment 1: Different Numbers of Hidden Neurons
How does the number of hidden neurons affect the network’s ability to learn XOR?
import numpy as np

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=2.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Test different hidden layer sizes
hidden_sizes = [2, 4, 8]
num_trials = 5
epochs = 10000

print("Effect of Hidden Layer Size on XOR Learning")
print("=" * 60)
print(f"{'Hidden Neurons':<16}{'Avg Final Loss':<18}{'Success Rate':<15}")
print("-" * 60)

for h_size in hidden_sizes:
    final_losses = []
    successes = 0
    for trial in range(num_trials):
        np.random.seed(trial * 10 + h_size)
        nn = NeuralNetwork(input_size=2, hidden_size=h_size, learning_rate=2.0)
        for epoch in range(epochs):
            nn.forward(X)  # caches activations for backward
            nn.backward(X, y)
        final_pred = nn.forward(X)
        final_loss = nn.compute_loss(final_pred, y)
        final_losses.append(final_loss)
        # Check if all predictions are correct
        preds = (final_pred > 0.5).astype(int)
        if np.all(preds == y):
            successes += 1
    avg_loss = np.mean(final_losses)
    success_rate = successes / num_trials * 100
    print(f"{h_size:<16}{avg_loss:<18.6f}{success_rate:.0f}%")
Experiment 2: Different Learning Rates
The learning rate $\eta$ controls how big each weight update step is. Too small and training is slow; too large and it overshoots the minimum.
import numpy as np
import matplotlib.pyplot as plt

class NeuralNetwork:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, learning_rate=1.0):
        self.lr = learning_rate
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1.0 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

    def sigmoid_derivative(self, a):
        return a * (1.0 - a)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.sigmoid_derivative(self.a1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Compare learning rates
learning_rates = [0.1, 1.0, 5.0]
colors_lr = ['blue', 'green', 'red']
epochs = 10000

plt.figure(figsize=(10, 6))
for lr, color in zip(learning_rates, colors_lr):
    np.random.seed(42)
    nn = NeuralNetwork(input_size=2, hidden_size=4, learning_rate=lr)
    losses = []
    for epoch in range(epochs):
        y_pred = nn.forward(X)
        loss = nn.compute_loss(y_pred, y)
        losses.append(loss)
        nn.backward(X, y)
    plt.plot(losses, color=color, label=f'lr = {lr}', linewidth=1.5)
    print(f"Learning rate {lr}: Final loss = {losses[-1]:.6f}")

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Effect of Learning Rate on Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, 1.0)
plt.tight_layout()
plt.show()
Experiment 3: Different Activation Functions
The choice of activation function affects both the speed of learning and the shape of decision boundaries.
import numpy as np
import matplotlib.pyplot as plt

class FlexibleNetwork:
    """Neural network with configurable activation function."""

    def __init__(self, hidden_size=4, learning_rate=2.0, activation='sigmoid'):
        self.lr = learning_rate
        self.activation = activation
        self.W1 = np.random.randn(2, hidden_size) * np.sqrt(1.0 / 2)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, 1) * np.sqrt(1.0 / hidden_size)
        self.b2 = np.zeros((1, 1))

    def activate(self, z):
        if self.activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif self.activation == 'tanh':
            return np.tanh(z)
        elif self.activation == 'relu':
            return np.maximum(0, z)

    def activate_derivative(self, a, z=None):
        if self.activation == 'sigmoid':
            return a * (1.0 - a)
        elif self.activation == 'tanh':
            return 1.0 - a ** 2
        elif self.activation == 'relu':
            return (z > 0).astype(float)

    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = self.activate(self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        # Output always uses sigmoid for binary classification
        self.a2 = 1.0 / (1.0 + np.exp(-np.clip(self.z2, -500, 500)))
        return self.a2

    def backward(self, X, y):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * self.activate_derivative(self.a1, self.z1)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def compute_loss(self, y_pred, y_true):
        epsilon = 1e-8
        return -np.mean(
            y_true * np.log(y_pred + epsilon) +
            (1 - y_true) * np.log(1 - y_pred + epsilon)
        )

# XOR data
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Compare activation functions
activations = ['sigmoid', 'tanh', 'relu']
colors_act = ['blue', 'green', 'orange']
epochs = 10000

plt.figure(figsize=(10, 6))
for act, color in zip(activations, colors_act):
    np.random.seed(42)
    nn = FlexibleNetwork(hidden_size=4, learning_rate=2.0, activation=act)
    losses = []
    for epoch in range(epochs):
        y_pred = nn.forward(X)
        loss = nn.compute_loss(y_pred, y)
        losses.append(loss)
        nn.backward(X, y)
    plt.plot(losses, color=color, label=f'{act}', linewidth=1.5)
    # Check final accuracy
    final_pred = nn.forward(X)
    preds = (final_pred > 0.5).astype(int)
    acc = np.mean(preds == y) * 100
    print(f"{act:>8s}: Final loss = {losses[-1]:.6f}, Accuracy = {acc:.0f}%")

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Activation Function Comparison on XOR')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(0, 0.8)
plt.tight_layout()
plt.show()
A few takeaways from this comparison:
- Tanh often converges faster than sigmoid because its output is centered around zero, leading to larger gradients early in training.
- ReLU can be fast but suffers from “dead neurons” — neurons that output zero and never recover because the gradient is zero for negative inputs (see the sketch after this list).
- Sigmoid is reliable but can suffer from vanishing gradients in deeper networks (less of an issue in our 2-layer network).
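Here is a minimal illustration of the dead-ReLU effect — a toy single neuron with hand-picked values, not taken from the network above. Once the pre-activation is negative for every input, the gradient through the ReLU is zero, so gradient descent can never move the weights again.

import numpy as np

relu = lambda z: np.maximum(0, z)
relu_grad = lambda z: (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# A "dead" neuron: a large negative bias pushes z below zero for all inputs
w, b = np.array([0.1, 0.1]), -5.0
z = X @ w + b
print("activations:", relu(z))        # all zeros
print("gradients:  ", relu_grad(z))   # all zeros -> no weight update, ever

This is one reason initialization matters: a bad draw can strand a ReLU neuron in the dead zone from the very first epoch.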
What’s Next
Congratulations! You’ve built a fully working neural network from scratch. You now understand:
- Why a hidden layer is necessary for non-linear problems
- How to initialize, forward-propagate, and backpropagate through a multi-layer network
- How training transforms random predictions into accurate solutions
- How different hyperparameters (hidden size, learning rate, activation function) affect learning
This XOR network is small, but every concept scales directly to networks with millions of parameters. The difference between this and GPT-4 is scale, not principle.
Next in the Series
In Part 5: Neural Network Architectures Overview, we’ll survey the landscape of neural network architectures — from feedforward networks to CNNs, RNNs, Transformers, and beyond — understanding when and why each design is used.