Introduction: Why Neural Networks?
Imagine teaching a computer to recognize your handwriting, understand spoken language, or even generate realistic images of cats that don't exist. These tasks seem trivial to humans but are incredibly complex for traditional programming approaches. This is where Artificial Neural Networks (ANNs) shine.
Neural networks are computational models inspired by the human brain's structure. Unlike traditional algorithms that follow explicit rules (if-then-else logic), neural networks learn patterns from data. They've revolutionized fields like computer vision, natural language processing, speech recognition, and game playing.
Key Insight
Traditional Programming: You write rules → Computer executes rules → Output
Neural Networks: You provide examples (data) → Network learns patterns → Network makes predictions
This fundamental shift from rule-based to data-driven programming is what makes neural networks so powerful for complex tasks.
In this comprehensive guide, we'll journey from the biological inspiration behind neural networks to building sophisticated architectures like CNNs and RNNs from scratch. You'll understand not just how they work, but why they work.
The Evolution: From Classical ML to Neural Networks
Biological Inspiration: How the Brain Works
The human brain contains approximately 86 billion neurons, each connected to thousands of other neurons through synapses. When you learn something new—like recognizing a friend's face—specific patterns of neurons fire together, strengthening their connections. This process, called Hebbian learning ("neurons that fire together, wire together"), inspired artificial neural networks.
How a Biological Neuron Works
- Dendrites receive electrical signals from other neurons
- Signals accumulate in the cell body (soma)
- If the combined signal exceeds a threshold, the neuron "fires"
- An electrical impulse travels down the axon
- The signal is transmitted to other neurons through synapses
Artificial neurons mimic this process: They receive weighted inputs, sum them, apply a threshold (activation function), and pass the result forward.
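Hebb's rule itself can be sketched in a few lines. This is a toy illustration of our own (not from any library): when a presynaptic input and the postsynaptic output are active together, the connecting weight is strengthened by Δw = η·x·y.

```python
import numpy as np

# Toy sketch of Hebb's rule: when presynaptic input x_i and postsynaptic
# output y are active together, strengthen the synapse:
#   delta_w_i = learning_rate * x_i * y
w = np.full(3, 0.05)               # three small positive synapse strengths
learning_rate = 0.1

x = np.array([1.0, 0.0, 1.0])      # inputs 1 and 3 fire together; input 2 is silent
for _ in range(5):
    y = float(x @ w > 0.0)         # neuron "fires" if the weighted sum is positive
    w += learning_rate * x * y     # "fire together, wire together"

print(w)                           # -> [0.55 0.05 0.55]: co-active synapses grew
```

Notice that the silent input's weight never changes: only connections that participate in firing get reinforced, exactly the intuition behind "neurons that fire together, wire together."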
Early Attempts: Perceptron (1958)
In 1958, Frank Rosenblatt introduced the Perceptron—the first artificial neuron. It was a simple model that could learn to classify inputs into two categories (binary classification). Here's how it worked:
import numpy as np

# Simple Perceptron for AND gate
class Perceptron:
    def __init__(self, input_size, learning_rate=0.1):
        # Initialize random weights and bias
        self.weights = np.random.randn(input_size)
        self.bias = np.random.randn()
        self.learning_rate = learning_rate

    def predict(self, x):
        # Calculate weighted sum + bias
        z = np.dot(x, self.weights) + self.bias
        # Apply step activation (threshold at 0)
        return 1 if z > 0 else 0

    def train(self, X, y, epochs=10):
        for epoch in range(epochs):
            for xi, target in zip(X, y):
                prediction = self.predict(xi)
                # Update weights if prediction is wrong
                error = target - prediction
                self.weights += self.learning_rate * error * xi
                self.bias += self.learning_rate * error

# Training data for AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND logic

# Create and train perceptron
perceptron = Perceptron(input_size=2)
perceptron.train(X, y, epochs=10)

# Test the trained perceptron
print("AND Gate Results:")
for xi, target in zip(X, y):
    pred = perceptron.predict(xi)
    print(f"Input: {xi} → Prediction: {pred}, Expected: {target}")

# Output:
# Input: [0 0] → Prediction: 0, Expected: 0
# Input: [0 1] → Prediction: 0, Expected: 0
# Input: [1 0] → Prediction: 0, Expected: 0
# Input: [1 1] → Prediction: 1, Expected: 1
The XOR Problem: Perceptron's Fatal Flaw
In 1969, Marvin Minsky and Seymour Papert proved that a single perceptron cannot learn the XOR (exclusive OR) function. This is because XOR is not linearly separable—you can't draw a single straight line to separate the classes.
This limitation triggered the first "AI Winter," a period where funding and interest in neural networks plummeted. The solution? Multi-layer networks (which we'll build later in this guide).
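You can see the failure directly by pointing the same kind of single-layer perceptron at XOR. The sketch below is a self-contained toy (mirroring the AND-gate example above): because no straight line separates the XOR classes, no setting of one weight vector and bias can classify all four cases, no matter how long we train.

```python
import numpy as np

# A single-layer perceptron trained on XOR: it can never get all 4 cases
# right, because XOR is not linearly separable.
rng = np.random.default_rng(42)
w, b, lr = rng.standard_normal(2), rng.standard_normal(), 0.1

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR truth table

def predict(xi):
    return 1 if xi @ w + b > 0 else 0

for _ in range(100):        # far more epochs than the AND gate needed
    for xi, t in zip(X, y):
        err = t - predict(xi)
        w += lr * err * xi
        b += lr * err

correct = sum(predict(xi) == t for xi, t in zip(X, y))
print(f"XOR accuracy after training: {correct}/4")  # never reaches 4/4
```

The weights simply oscillate: fixing one misclassified point breaks another. Adding a hidden layer, as we'll do later, resolves this.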
The AI Winters: Why It Took So Long
Between the 1970s and 1990s, neural network research faced two major "AI Winters"—periods of reduced funding and skepticism. Several factors contributed:
- Limited Computing Power: Training even small networks required computational resources unavailable at the time
- Lack of Data: Neural networks need large datasets to learn effectively; the internet explosion hadn't happened yet
- Theoretical Barriers: No one knew how to train multi-layer networks efficiently until backpropagation was rediscovered
- Overhyped Promises: Early claims about AI capabilities led to disappointment when they weren't met
The Renaissance: What Changed?
The 2010s marked neural networks' triumphant return, rebranded as "Deep Learning." Three key factors converged:
The Perfect Storm for Deep Learning
1. Big Data (2000s-present)
- Internet explosion: millions of labeled images (ImageNet), text, videos
- Social media: user-generated content at unprecedented scale
- Sensors everywhere: smartphones, IoT devices generating continuous data streams
2. Computational Power (2010s)
- GPUs (Graphics Processing Units) repurposed for parallel matrix computations
- Cloud computing: AWS, Google Cloud, Azure providing scalable infrastructure
- Specialized hardware: Google's TPUs, NVIDIA's deep learning GPUs
3. Algorithmic Innovations
- Backpropagation rediscovered and optimized (1986, popularized 2000s)
- ReLU activation (2011): solved vanishing gradient problem
- Dropout regularization (2012): prevented overfitting
- Batch normalization (2015): stabilized training
- Adam optimizer (2014): adaptive learning rates
The breakthrough moment came in 2012 when AlexNet—a deep convolutional neural network—won the ImageNet competition by a massive margin, cutting the top-5 error rate from roughly 26% to 15%. This proved that deep learning worked at scale.
Limitations of Classical Machine Learning
Before diving into neural networks, let's understand why we need them. Classical machine learning algorithms like Logistic Regression, Decision Trees, and SVMs work well for many tasks, but they have fundamental limitations when dealing with complex, high-dimensional data.
Manual Feature Engineering
Classical ML requires humans to manually design features—the input variables the model uses to make predictions. This is time-consuming, domain-specific, and often requires expert knowledge.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Example: Classifying handwritten digits WITHOUT neural networks
# Problem: You must manually extract features from raw pixels

# Simulate a 28x28 grayscale image (like MNIST)
raw_image = np.random.rand(28, 28) * 255  # Random pixel values 0-255

# Manual feature engineering (what classical ML requires)
features = []
features.append(raw_image.mean())       # Average brightness
features.append(raw_image.std())        # Contrast
features.append(raw_image.max())        # Brightest pixel
features.append(raw_image.min())        # Darkest pixel
features.append(raw_image[14, 14])      # Center pixel value
features.append(raw_image[:14].mean())  # Top half brightness
features.append(raw_image[14:].mean())  # Bottom half brightness

# Convert to feature vector
X_manual = np.array(features).reshape(1, -1)
print(f"Original image shape: {raw_image.shape}")   # (28, 28) = 784 pixels
print(f"Manual features shape: {X_manual.shape}")   # (1, 7) - only 7 features!
print(f"Information lost: {((784 - 7) / 784) * 100:.1f}%")  # 99.1% lost!

# Classical ML approach: Train on these 7 hand-crafted features
model = LogisticRegression()
# model.fit(X_manual, y)  # Would train on engineered features
print("\n✗ Problem: You had to decide WHICH features matter!")
print("  What if you chose poorly? What if important patterns exist")
print("  in pixel combinations you didn't think of?")

# Neural network approach: Feed raw pixels directly
X_neural = raw_image.flatten().reshape(1, -1)  # Just flatten the image
print(f"\n✓ Neural Network: Uses all {X_neural.shape[1]} pixels directly")
print("  The network LEARNS which patterns matter during training!")
Key Difference: Automatic Feature Learning
Classical ML: Human engineers → Hand-crafted features → Model learns from features
Neural Networks: Raw data → Network learns features automatically → Model learns from learned features
This automatic feature learning is neural networks' superpower. They discover hierarchical patterns humans might never think of.
Linear Decision Boundaries
Many classical algorithms assume data is linearly separable—meaning you can separate classes with a straight line (2D), plane (3D), or hyperplane (higher dimensions). Real-world data is rarely this simple.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_circles
# Generate non-linearly separable data (circles)
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)
# Try to classify with Logistic Regression (linear model)
lr = LogisticRegression()
lr.fit(X, y)
accuracy_linear = lr.score(X, y)
print(f"Logistic Regression Accuracy: {accuracy_linear:.2%}")
# Output: ~50% (no better than random guessing!)
# Visualize the problem
plt.figure(figsize=(12, 4))
# Plot 1: The data (circles)
plt.subplot(1, 2, 1)
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', alpha=0.6)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', alpha=0.6)
plt.title('Non-Linear Data (Circles)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot 2: Linear decision boundary (fails)
plt.subplot(1, 2, 2)
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                     np.linspace(y_min, y_max, 200))
Z = lr.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.3, levels=1, colors=['blue', 'red'])
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', edgecolors='k', alpha=0.6)
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', edgecolors='k', alpha=0.6)
plt.title(f'Linear Boundary (Accuracy: {accuracy_linear:.1%})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n✗ A straight line CANNOT separate these circles!")
print("✓ Neural networks with non-linear activations CAN learn this pattern.")
Scalability Issues with Complex Data
Classical ML algorithms often struggle when data becomes very complex or when the number of features grows large. Let's see why:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import time

# Simulate image classification task
# Small images:  32x32 pixels   = 1,024 features
# Medium images: 128x128 pixels = 16,384 features
# Large images:  512x512 pixels = 262,144 features

def benchmark_classical_ml(image_size, n_samples=1000):
    """Test classical ML on different image sizes"""
    # Generate random image data
    n_features = image_size * image_size
    X = np.random.rand(n_samples, n_features)
    y = np.random.randint(0, 10, n_samples)  # 10 classes

    # Try Random Forest
    print(f"\nImage Size: {image_size}x{image_size} = {n_features:,} features")
    start = time.time()
    rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=42)
    rf.fit(X, y)
    rf_time = time.time() - start
    print(f"  Random Forest training time: {rf_time:.2f}s")

    # Try SVM (gets very slow with many features)
    if n_features <= 4096:  # Skip SVM for huge feature spaces
        start = time.time()
        svm = SVC(kernel='rbf')
        svm.fit(X[:100], y[:100])  # Use only 100 samples
        svm_time = time.time() - start
        print(f"  SVM training time (100 samples): {svm_time:.2f}s")
    else:
        print(f"  SVM: Too slow for {n_features:,} features ✗")
    return rf_time

# Test different image sizes
sizes = [32, 64, 128]
for size in sizes:
    benchmark_classical_ml(size)

print("\n✗ Problem: Training time explodes with feature count!")
print("✓ Neural Networks: Use GPU parallelization, designed for high dimensions")
High-Dimensional Data Challenges
When feature count grows, we encounter the "Curse of Dimensionality"—data becomes sparse, distances become meaningless, and models require exponentially more data to generalize well.
The Curse of Dimensionality Explained
Imagine you have 100 training examples. How well does this cover the feature space?
1D space (1 feature):
- 100 points cover a line pretty well
- Each point has neighbors nearby
2D space (2 features):
- 100 points in a square: √100 = 10 points per dimension
- Still reasonable coverage
10D space (10 features):
- 100 points in a hypercube: ¹⁰√100 ≈ 1.58 points per dimension
- Data becomes very sparse!
1000D space (e.g., 32x32 image = 1,024 pixels):
- ¹⁰⁰⁰√100 ≈ 1.005 points per dimension
- Essentially empty space—no meaningful coverage
Consequence: You'd need 100^1000 = 10^2000 samples to maintain the same density as 100 points in 1D. That's far more than the number of atoms in the observable universe!
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate curse of dimensionality
def distance_in_dimensions(n_dimensions, n_points=1000):
    """Calculate average pairwise distance as dimensions increase"""
    # Generate random points in n-dimensional unit hypercube
    points = np.random.rand(n_points, n_dimensions)
    # Calculate pairwise distances (sample 100 points for speed)
    sample_size = min(100, n_points)
    sample_points = points[np.random.choice(n_points, sample_size, replace=False)]
    distances = []
    for i in range(sample_size):
        for j in range(i + 1, sample_size):
            dist = np.linalg.norm(sample_points[i] - sample_points[j])
            distances.append(dist)
    return np.mean(distances), np.std(distances)

# Test different dimensions
dimensions = [1, 2, 5, 10, 50, 100, 500, 1000]
mean_dists = []
std_dists = []
for dim in dimensions:
    mean_d, std_d = distance_in_dimensions(dim)
    mean_dists.append(mean_d)
    std_dists.append(std_d)
    print(f"Dimensions: {dim:4d} | Avg Distance: {mean_d:.3f} ± {std_d:.3f}")

# Plot results
plt.figure(figsize=(10, 6))
plt.errorbar(dimensions, mean_dists, yerr=std_dists, marker='o', capsize=5)
plt.xlabel('Number of Dimensions', fontsize=12)
plt.ylabel('Average Pairwise Distance', fontsize=12)
plt.title('Curse of Dimensionality: Distances Increase with Dimensions', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.tight_layout()
plt.show()

print("\nObservation: In high dimensions, ALL points are far apart!")
print("  Distance becomes meaningless—everything is equidistant.")
print("\n✓ Neural Networks: Use dimensionality reduction (learned features)")
print("  to map high-D data to meaningful low-D representations.")
Why Neural Networks Excel
Neural networks overcome these limitations through:
- Automatic Feature Learning: No manual engineering needed
- Non-Linear Transformations: Can learn complex, curved decision boundaries
- Hierarchical Representations: Early layers learn simple patterns, deeper layers combine them into complex concepts
- Scalability: GPU parallelization handles millions of parameters efficiently
- Dimensionality Reduction: Hidden layers compress high-D data into meaningful low-D representations
In the next section, we'll build our first neural network from scratch to see exactly how these advantages work in practice.
Understanding Basic ANN: Building Blocks
Now that we understand why neural networks are needed, let's explore how they work. We'll start with the fundamental components and build up to a complete neural network.
Artificial Neurons (Perceptrons)
An artificial neuron (or perceptron) is the basic computational unit of a neural network. It mimics a biological neuron by:
- Receiving multiple inputs (like dendrites)
- Multiplying each input by a weight (synapse strength)
- Summing all weighted inputs plus a bias term
- Applying an activation function (firing threshold)
- Producing an output (axon signal)
import numpy as np
import matplotlib.pyplot as plt

# A single artificial neuron from scratch
class Neuron:
    def __init__(self, n_inputs):
        """
        Initialize a neuron with random weights and bias.
        Parameters:
        - n_inputs: number of input features
        """
        # Random weights for each input (small values near 0)
        self.weights = np.random.randn(n_inputs) * 0.1
        # Random bias term
        self.bias = np.random.randn() * 0.1

    def forward(self, inputs):
        """
        Compute neuron output given inputs.
        Formula: output = activation(w1*x1 + w2*x2 + ... + wn*xn + bias)
        """
        # Weighted sum: multiply each input by its weight
        weighted_sum = np.dot(inputs, self.weights) + self.bias
        # Activation: apply sigmoid function (for now)
        output = 1 / (1 + np.exp(-weighted_sum))  # sigmoid
        return output, weighted_sum

# Create a neuron with 3 inputs
neuron = Neuron(n_inputs=3)

# Example inputs
x = np.array([0.5, -0.2, 0.8])

# Forward pass
output, z = neuron.forward(x)
print("=== Single Neuron Computation ===")
print(f"Inputs: {x}")
print(f"Weights: {neuron.weights}")
print(f"Bias: {neuron.bias:.4f}")
print(f"\nWeighted sum (z): {z:.4f}")
print(f"Output (after sigmoid): {output:.4f}")
# Visualize the neuron
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Neuron diagram
ax1.text(0.1, 0.8, 'Input 1\n(0.5)', ha='center', va='center', fontsize=10,
         bbox=dict(boxstyle='circle', facecolor='lightblue'))
ax1.text(0.1, 0.5, 'Input 2\n(-0.2)', ha='center', va='center', fontsize=10,
         bbox=dict(boxstyle='circle', facecolor='lightblue'))
ax1.text(0.1, 0.2, 'Input 3\n(0.8)', ha='center', va='center', fontsize=10,
         bbox=dict(boxstyle='circle', facecolor='lightblue'))
ax1.text(0.5, 0.5, 'Neuron\nΣwx+b\n→\nσ', ha='center', va='center', fontsize=12,
         bbox=dict(boxstyle='circle', facecolor='orange', edgecolor='black', linewidth=2))
ax1.text(0.9, 0.5, f'Output\n{output:.3f}', ha='center', va='center', fontsize=10,
         bbox=dict(boxstyle='circle', facecolor='lightgreen'))

# Draw arrows with weight labels
for i, (y_pos, w) in enumerate(zip([0.8, 0.5, 0.2], neuron.weights)):
    ax1.arrow(0.15, y_pos, 0.25, 0.5 - y_pos, head_width=0.03, head_length=0.05,
              fc='gray', ec='gray', alpha=0.6)
    ax1.text(0.25, (y_pos + 0.5) / 2, f'w={w:.2f}', fontsize=8, color='red')
ax1.arrow(0.6, 0.5, 0.25, 0, head_width=0.03, head_length=0.05,
          fc='green', ec='green', linewidth=2)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.axis('off')
ax1.set_title('Artificial Neuron Structure', fontsize=14, fontweight='bold')

# Right plot: How different inputs affect output
test_inputs = np.linspace(-2, 2, 100)
outputs = []
for val in test_inputs:
    test_x = np.array([val, val, val])
    out, _ = neuron.forward(test_x)
    outputs.append(out)

ax2.plot(test_inputs, outputs, linewidth=2, color='blue')
ax2.axhline(y=0.5, color='red', linestyle='--', alpha=0.5, label='Decision threshold (0.5)')
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
ax2.set_xlabel('Input Value', fontsize=12)
ax2.set_ylabel('Neuron Output', fontsize=12)
ax2.set_title('Neuron Response Curve (Sigmoid)', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()
plt.tight_layout()
plt.show()

print("\nKey Insight: The neuron transforms inputs into outputs between 0 and 1.")
print("  This allows it to model probabilities or 'confidence' in predictions.")
Weights and Biases Explained
Weights and biases are the learnable parameters of a neural network—they're what the network adjusts during training to improve its predictions.
Weights vs Biases: The Intuition
Weights (w): Control the slope or importance of each input
- Large positive weight → Input has strong positive influence
- Large negative weight → Input has strong negative influence
- Weight near zero → Input is ignored
Bias (b): Controls the threshold for neuron activation
- Positive bias → Neuron activates more easily
- Negative bias → Neuron activates less easily
- Shifts the decision boundary left or right
Analogy: Think of a thermostat controlling your heating system:
- Weight = How sensitive the thermostat is to temperature changes
- Bias = The temperature threshold that triggers heating
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate effect of weights and biases
def neuron_output(x, weight, bias):
    """Calculate sigmoid output for given weight and bias"""
    z = weight * x + bias
    return 1 / (1 + np.exp(-z))

# Input range
x = np.linspace(-10, 10, 200)

# Visualize effect of different weights and biases
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Effect of weights (bias fixed at 0)
ax1.plot(x, neuron_output(x, weight=0.5, bias=0), label='Weight=0.5 (gentle slope)', linewidth=2)
ax1.plot(x, neuron_output(x, weight=1.0, bias=0), label='Weight=1.0 (medium slope)', linewidth=2)
ax1.plot(x, neuron_output(x, weight=2.0, bias=0), label='Weight=2.0 (steep slope)', linewidth=2)
ax1.axhline(y=0.5, color='red', linestyle='--', alpha=0.3)
ax1.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
ax1.set_xlabel('Input Value', fontsize=12)
ax1.set_ylabel('Output', fontsize=12)
ax1.set_title('Effect of Weights (bias=0)', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Effect of bias (weight fixed at 1)
ax2.plot(x, neuron_output(x, weight=1, bias=-3), label='Bias=-3 (shifts right)', linewidth=2)
ax2.plot(x, neuron_output(x, weight=1, bias=0), label='Bias=0 (centered)', linewidth=2)
ax2.plot(x, neuron_output(x, weight=1, bias=3), label='Bias=+3 (shifts left)', linewidth=2)
ax2.axhline(y=0.5, color='red', linestyle='--', alpha=0.3)
ax2.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
ax2.set_xlabel('Input Value', fontsize=12)
ax2.set_ylabel('Output', fontsize=12)
ax2.set_title('Effect of Bias (weight=1)', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Numerical example
print("=== Weight & Bias Impact ===\n")
input_val = 2.0
configs = [
    (1.0, 0.0, "Baseline"),
    (2.0, 0.0, "Double weight → steeper"),
    (1.0, 3.0, "Add bias → shift left"),
    (0.5, -1.0, "Halve weight, negative bias")
]
for w, b, desc in configs:
    output = neuron_output(input_val, w, b)
    print(f"{desc:30s} | w={w:.1f}, b={b:+.1f} → output={output:.4f}")

print("\nDuring training, the network adjusts BOTH weights and biases")
print("  to minimize prediction errors. This is 'learning'!")
Activation Functions (Sigmoid, ReLU, Tanh)
Activation functions introduce non-linearity into neural networks. Without them, no matter how many layers you stack, the network could only learn linear relationships (like a single neuron). Activation functions enable learning complex, curved patterns.
Why Non-Linearity Matters
Without activation functions (linear network):
Layer 1: z1 = W1*x + b1
Layer 2: z2 = W2*z1 + b2 = W2*(W1*x + b1) + b2 = (W2*W1)*x + (W2*b1 + b2)
Result: Equivalent to W_combined * x + b_combined (single layer!)
With activation functions (non-linear network):
Layer 1: a1 = σ(W1*x + b1)
Layer 2: a2 = σ(W2*a1 + b2)
Result: Can approximate ANY continuous function (Universal Approximation Theorem)
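The collapse of stacked linear layers is easy to verify numerically. This short sketch (our own, with arbitrary random weights) checks that two linear layers equal one combined layer, and that inserting a ReLU breaks the equivalence:

```python
import numpy as np

# Two stacked *linear* layers collapse into one linear layer:
#   W2(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2)
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)  # layer 1: 2 → 3
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)  # layer 2: 3 → 1
x = rng.standard_normal(2)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True: depth bought us nothing

# With a non-linearity in between, no such collapse exists:
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(two_layers, with_relu))  # typically False
```

The algebra is exact, so the first comparison always holds up to floating-point error; the ReLU version only matches in the coincidental case where every hidden unit stays positive.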
import numpy as np
import matplotlib.pyplot as plt

# Implement popular activation functions
def sigmoid(z):
    """Sigmoid: smooth S-curve, outputs (0, 1)"""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # clip prevents overflow

def tanh(z):
    """Tanh: smooth S-curve, outputs (-1, 1)"""
    return np.tanh(z)

def relu(z):
    """ReLU: Rectified Linear Unit, outputs [0, ∞)"""
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: like ReLU but allows small negative values"""
    return np.where(z > 0, z, alpha * z)

# Input range
z = np.linspace(-5, 5, 200)
# Plot all activation functions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sigmoid
axes[0, 0].plot(z, sigmoid(z), linewidth=2, color='blue')
axes[0, 0].axhline(y=0, color='gray', linestyle='--', alpha=0.3)
axes[0, 0].axhline(y=0.5, color='red', linestyle='--', alpha=0.3, label='y=0.5')
axes[0, 0].axhline(y=1, color='gray', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='gray', linestyle='--', alpha=0.3)
axes[0, 0].set_title('Sigmoid: σ(z) = 1/(1+e⁻ᶻ)', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Input (z)', fontsize=11)
axes[0, 0].set_ylabel('Output', fontsize=11)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].legend()
axes[0, 0].text(2, 0.2, '✓ Smooth gradient\n✗ Vanishing gradient\n✗ Not zero-centered',
                fontsize=9, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Tanh
axes[0, 1].plot(z, tanh(z), linewidth=2, color='green')
axes[0, 1].axhline(y=0, color='red', linestyle='--', alpha=0.3, label='y=0')
axes[0, 1].axhline(y=1, color='gray', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=-1, color='gray', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='gray', linestyle='--', alpha=0.3)
axes[0, 1].set_title('Tanh: tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ)', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Input (z)', fontsize=11)
axes[0, 1].set_ylabel('Output', fontsize=11)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].legend()
axes[0, 1].text(2, -0.5, '✓ Zero-centered\n✓ Stronger gradient\n✗ Still vanishes',
                fontsize=9, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# ReLU
axes[1, 0].plot(z, relu(z), linewidth=2, color='red')
axes[1, 0].axhline(y=0, color='gray', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='red', linestyle='--', alpha=0.3, label='x=0')
axes[1, 0].set_title('ReLU: max(0, z)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Input (z)', fontsize=11)
axes[1, 0].set_ylabel('Output', fontsize=11)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].legend()
axes[1, 0].text(2, 1, '✓ No vanishing gradient\n✓ Computationally cheap\n✗ Dead neurons (z<0)',
                fontsize=9, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

# Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), linewidth=2, color='purple')
axes[1, 1].axhline(y=0, color='gray', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='red', linestyle='--', alpha=0.3, label='x=0')
axes[1, 1].set_title('Leaky ReLU: max(0.01z, z)', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Input (z)', fontsize=11)
axes[1, 1].set_ylabel('Output', fontsize=11)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].legend()
axes[1, 1].text(2, 1, '✓ Fixes dead neurons\n✓ All ReLU benefits\n✓ Popular in modern networks',
                fontsize=9, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))
plt.tight_layout()
plt.show()
# Demonstrate gradient differences
print("=== Gradient Comparison (at z=3) ===\n")
z_test = 3.0
epsilon = 0.0001

def numerical_gradient(func, z):
    """Approximate gradient using finite differences"""
    return (func(z + epsilon) - func(z - epsilon)) / (2 * epsilon)

print(f"Sigmoid gradient:    {numerical_gradient(sigmoid, z_test):.6f}")
print(f"Tanh gradient:       {numerical_gradient(tanh, z_test):.6f}")
print(f"ReLU gradient:       {numerical_gradient(relu, z_test):.6f}")
print(f"Leaky ReLU gradient: {numerical_gradient(leaky_relu, z_test):.6f}")

print("\nReLU has constant gradient (1.0) for positive inputs")
print("  → No vanishing gradient problem!")
print("  → Training is much faster than sigmoid/tanh")
Choosing Activation Functions: Rule of Thumb
- Hidden Layers: Use ReLU (or Leaky ReLU) → Default choice in modern networks
- Output Layer (Binary Classification): Use Sigmoid → Outputs probability (0 to 1)
- Output Layer (Multi-class): Use Softmax → Outputs probability distribution
- Output Layer (Regression): Use Linear (no activation) → Any real number
- Recurrent Networks: Use Tanh → Zero-centered helps with gradient flow
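Softmax is the only function in this list we haven't implemented yet, so here is a minimal sketch (our own; subtracting the max before exponentiating is the standard trick to avoid overflow and does not change the result):

```python
import numpy as np

def softmax(z):
    """Turn raw scores (logits) into a probability distribution."""
    z = z - np.max(z)   # shift for numerical stability; output is unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw output-layer scores for 3 classes
probs = softmax(logits)
print(probs)         # ≈ [0.659, 0.242, 0.099]
print(probs.sum())   # 1.0: a valid probability distribution
```

Unlike sigmoid, which scores each output independently, softmax couples the outputs: raising one class's logit necessarily lowers the others' probabilities.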
Layers: Input, Hidden, Output
A neural network is organized into layers of neurons. Each layer transforms its input and passes the result to the next layer.
- Input Layer: Receives raw data (e.g., pixel values, word embeddings). Not counted as a "layer" since it does no computation.
- Hidden Layers: Intermediate layers that learn increasingly abstract features. The "deep" in deep learning refers to having many hidden layers.
- Output Layer: Produces final predictions (e.g., class probabilities, regression values).
import numpy as np
import matplotlib.pyplot as plt

# Build a simple 3-layer neural network from scratch
class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """
        Initialize a neural network with one hidden layer.
        Architecture: input_size → hidden_size → output_size
        """
        # Layer 1: Input → Hidden
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        # Layer 2: Hidden → Output
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
        print("=== Network Architecture ===")
        print(f"Input size: {input_size}")
        print(f"Hidden size: {hidden_size}")
        print(f"Output size: {output_size}")
        print(f"\nTotal parameters: {self.count_parameters()}")

    def count_parameters(self):
        """Count total trainable parameters"""
        return (self.W1.size + self.b1.size +
                self.W2.size + self.b2.size)

    def forward(self, X):
        """
        Forward propagation through the network.
        Stores intermediate values as attributes for inspection.
        """
        # Layer 1: Input → Hidden
        self.z1 = np.dot(X, self.W1) + self.b1        # Linear transformation
        self.a1 = np.maximum(0, self.z1)              # ReLU activation
        # Layer 2: Hidden → Output
        self.z2 = np.dot(self.a1, self.W2) + self.b2  # Linear transformation
        self.a2 = 1 / (1 + np.exp(-self.z2))          # Sigmoid activation
        return self.a2
    def visualize_architecture(self):
        """Draw the network structure"""
        fig, ax = plt.subplots(figsize=(12, 6))
        # Layer positions
        layer_x = [0.15, 0.5, 0.85]

        # Draw input layer
        input_neurons = min(self.W1.shape[0], 5)  # Show max 5
        for i in range(input_neurons):
            y = 0.5 + (i - input_neurons / 2) * 0.15
            circle = plt.Circle((layer_x[0], y), 0.04, color='lightblue', ec='black', linewidth=2)
            ax.add_patch(circle)
            ax.text(layer_x[0] - 0.12, y, f'x{i+1}', fontsize=10, ha='center', va='center')

        # Draw hidden layer
        hidden_neurons = min(self.W1.shape[1], 5)
        for i in range(hidden_neurons):
            y = 0.5 + (i - hidden_neurons / 2) * 0.15
            circle = plt.Circle((layer_x[1], y), 0.04, color='orange', ec='black', linewidth=2)
            ax.add_patch(circle)

        # Draw output layer
        output_neurons = min(self.W2.shape[1], 3)
        for i in range(output_neurons):
            y = 0.5 + (i - output_neurons / 2) * 0.15
            circle = plt.Circle((layer_x[2], y), 0.04, color='lightgreen', ec='black', linewidth=2)
            ax.add_patch(circle)
            ax.text(layer_x[2] + 0.12, y, f'y{i+1}', fontsize=10, ha='center', va='center')

        # Draw connections (sample)
        for i in range(min(3, input_neurons)):
            for j in range(min(3, hidden_neurons)):
                y_in = 0.5 + (i - input_neurons / 2) * 0.15
                y_hid = 0.5 + (j - hidden_neurons / 2) * 0.15
                ax.plot([layer_x[0] + 0.04, layer_x[1] - 0.04], [y_in, y_hid],
                        color='gray', alpha=0.3, linewidth=0.5)
        for i in range(min(3, hidden_neurons)):
            for j in range(output_neurons):
                y_hid = 0.5 + (i - hidden_neurons / 2) * 0.15
                y_out = 0.5 + (j - output_neurons / 2) * 0.15
                ax.plot([layer_x[1] + 0.04, layer_x[2] - 0.04], [y_hid, y_out],
                        color='gray', alpha=0.3, linewidth=0.5)

        # Labels
        ax.text(layer_x[0], 0.05, 'Input Layer', ha='center', fontsize=12, fontweight='bold')
        ax.text(layer_x[1], 0.05, 'Hidden Layer\n(ReLU)', ha='center', fontsize=12, fontweight='bold')
        ax.text(layer_x[2], 0.05, 'Output Layer\n(Sigmoid)', ha='center', fontsize=12, fontweight='bold')
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)
        ax.axis('off')
        ax.set_title('Neural Network Architecture', fontsize=16, fontweight='bold', pad=20)
        plt.tight_layout()
        plt.show()
# Create a small network
nn = SimpleNeuralNetwork(input_size=4, hidden_size=5, output_size=2)
nn.visualize_architecture()
# Test forward pass
X_sample = np.array([[0.5, -0.2, 0.8, 0.1]]) # 1 sample, 4 features
predictions = nn.forward(X_sample)
print(f"\n=== Forward Pass Example ===")
print(f"Input shape: {X_sample.shape}")
print(f"Hidden activation: {nn.a1.shape} → {nn.a1[0][:3]}... (showing first 3)")
print(f"Output predictions: {predictions.shape} → {predictions[0]}")
print("\nEach layer transforms the data, learning progressively")
print(" more abstract representations!")
Forward Propagation Step-by-Step
Forward propagation is the process of passing input data through the network to generate predictions. Let's trace exactly what happens at each step.
Forward Propagation: Complete Example
Given: 2 input features, 3 hidden neurons, 1 output neuron
Step 1: Input Layer
- Input: x = [2.0, 3.0]
- Simply pass data forward (no computation)
Step 2: Hidden Layer
- Linear: z1 = x·W1 + b1
- Activation: a1 = ReLU(z1)
- Result: 3 hidden neuron activations
Step 3: Output Layer
- Linear: z2 = a1·W2 + b2
- Activation: a2 = Sigmoid(z2)
- Result: Final prediction (e.g., probability)
import numpy as np
# Detailed forward propagation with manual calculations
class DetailedForwardPass:
def __init__(self):
# Simple network: 2 inputs → 3 hidden → 1 output
# Initialize with specific weights for demonstration
self.W1 = np.array([[0.5, -0.3, 0.8],
[0.2, 0.6, -0.4]]) # Shape: (2, 3)
self.b1 = np.array([[0.1, -0.2, 0.3]]) # Shape: (1, 3)
self.W2 = np.array([[0.4],
[-0.5],
[0.7]]) # Shape: (3, 1)
self.b2 = np.array([[0.2]]) # Shape: (1, 1)
def forward_verbose(self, x):
"""Forward pass with detailed output at each step"""
print("="*60)
print("FORWARD PROPAGATION: DETAILED TRACE")
print("="*60)
# Input
print(f"\nINPUT LAYER")
print(f" x = {x}")
print(f" Shape: {x.shape}")
# Hidden layer - Linear transformation
print(f"\nHIDDEN LAYER - Linear Transformation")
print(f" Weights W1:\n{self.W1}")
print(f" Bias b1: {self.b1}")
z1 = np.dot(x, self.W1) + self.b1
print(f"\n Computation: z1 = x·W1 + b1")
print(f" For neuron 1: ({x[0,0]:.1f} × {self.W1[0,0]:.1f}) + ({x[0,1]:.1f} × {self.W1[1,0]:.1f}) + {self.b1[0,0]:.1f}")
print(f" = {x[0,0]*self.W1[0,0]:.2f} + {x[0,1]*self.W1[1,0]:.2f} + {self.b1[0,0]:.1f}")
print(f" = {z1[0,0]:.3f}")
print(f"\n z1 = {z1}")
# Hidden layer - Activation
print(f"\nHIDDEN LAYER - ReLU Activation")
a1 = np.maximum(0, z1)
print(f" a1 = max(0, z1)")
for i in range(z1.shape[1]):
print(f" Neuron {i+1}: max(0, {z1[0,i]:.3f}) = {a1[0,i]:.3f}")
print(f"\n a1 = {a1}")
# Output layer - Linear transformation
print(f"\nOUTPUT LAYER - Linear Transformation")
print(f" Weights W2:\n{self.W2.T}")
print(f" Bias b2: {self.b2}")
z2 = np.dot(a1, self.W2) + self.b2
print(f"\n Computation: z2 = a1·W2 + b2")
print(f" z2 = ({a1[0,0]:.3f} × {self.W2[0,0]:.1f}) + ({a1[0,1]:.3f} × {self.W2[1,0]:.1f}) + ({a1[0,2]:.3f} × {self.W2[2,0]:.1f}) + {self.b2[0,0]:.1f}")
total = a1[0,0]*self.W2[0,0] + a1[0,1]*self.W2[1,0] + a1[0,2]*self.W2[2,0] + self.b2[0,0]
print(f" = {total:.3f}")
print(f"\n z2 = {z2}")
# Output layer - Activation
print(f"\nOUTPUT LAYER - Sigmoid Activation")
a2 = 1 / (1 + np.exp(-z2))
print(f" a2 = σ(z2) = 1 / (1 + e^(-{z2[0,0]:.3f}))")
print(f" = 1 / (1 + {np.exp(-z2[0,0]):.3f})")
print(f" = {a2[0,0]:.4f}")
print(f"\nFINAL OUTPUT")
print(f" Prediction: {a2[0,0]:.4f}")
print(f" Interpretation: {a2[0,0]*100:.2f}% probability of class 1")
print("="*60)
return a2
# Run detailed forward pass
model = DetailedForwardPass()
x_input = np.array([[2.0, 3.0]])
prediction = model.forward_verbose(x_input)
# Visualize information flow
print("\n\nKEY INSIGHTS:")
print(" 1. Each layer applies: Linear Transform → Activation Function")
print(" 2. Hidden layers learn features, output layer makes prediction")
print(" 3. Information flows in ONE direction: Input → Hidden → Output")
print(" 4. This is called 'feedforward' (as opposed to recurrent)")
print("\n During training, we'll adjust W1, b1, W2, b2 to improve predictions!")
Forward Propagation Summary
What we learned:
- Forward propagation is how neural networks make predictions
- Each layer performs: activation(W·input + b)
- Hidden layers learn intermediate representations
- Output layer produces final predictions
- All parameters (W, b) are initially random—they need training!
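The per-layer rule above can be condensed into a single helper. Here is a minimal NumPy sketch; the `dense_layer` name and the weight values are ours, chosen only for illustration:

```python
import numpy as np

def dense_layer(x, W, b, activation):
    """One fully connected layer: activation(x @ W + b)."""
    return activation(x @ W + b)

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([[2.0, 3.0]])                       # 1 sample, 2 features
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))

a1 = dense_layer(x, W1, b1, relu)                # hidden activations, shape (1, 3)
y_hat = dense_layer(a1, W2, b2, sigmoid)         # prediction, shape (1, 1)
print("hidden:", a1.shape, "output:", y_hat.shape)
```

Stacking the same helper twice is exactly the two-layer forward pass traced above.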
Next up: How does the network learn? We'll explore loss functions, gradient descent, and the famous backpropagation algorithm that makes it all work.
How Neural Networks Learn
We've seen how neural networks make predictions (forward propagation), but with random initial weights, those predictions are terrible. Learning is the process of adjusting weights and biases to minimize errors. This happens through a beautiful mathematical dance involving loss functions, gradient descent, and backpropagation.
Loss Functions (MSE, Cross-Entropy)
A loss function (or cost function) measures how wrong your network's predictions are. It's a single number that quantifies the difference between predicted and actual values. The goal of training is to minimize this loss.
Why We Need Loss Functions
Think of learning to throw darts:
- Without feedback: You throw blindly, never knowing if you hit the target
- With feedback: Someone tells you "You missed by 3 inches left, 2 inches low"
The loss function is that feedback—it tells the network exactly how far off its predictions are, so it knows which direction to adjust weights.
import numpy as np
import matplotlib.pyplot as plt
# Two most common loss functions
def mean_squared_error(y_true, y_pred):
"""
Mean Squared Error (MSE) - for regression tasks
Formula: MSE = (1/n) * Σ(y_true - y_pred)²
Why squared?
- Makes all errors positive (penalizes both over/under predictions)
- Heavily penalizes large errors (squaring amplifies them)
"""
return np.mean((y_true - y_pred) ** 2)
def binary_cross_entropy(y_true, y_pred):
"""
Binary Cross-Entropy - for binary classification
Formula: BCE = -(1/n) * Σ[y·log(y_pred) + (1-y)·log(1-y_pred)]
Why this formula?
- Derived from maximum likelihood estimation
- Penalizes confident wrong predictions heavily
- Works with probabilities (0 to 1)
"""
# Clip predictions to avoid log(0)
y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Example 1: Regression with MSE
print("="*60)
print("EXAMPLE 1: REGRESSION (Predicting House Prices)")
print("="*60)
y_true_reg = np.array([250000, 180000, 320000, 290000]) # Actual prices
y_pred_reg1 = np.array([245000, 175000, 310000, 285000]) # Good predictions
y_pred_reg2 = np.array([300000, 150000, 250000, 350000]) # Bad predictions
mse1 = mean_squared_error(y_true_reg, y_pred_reg1)
mse2 = mean_squared_error(y_true_reg, y_pred_reg2)
print(f"\nActual prices: {y_true_reg}")
print(f"Good predictions: {y_pred_reg1}")
print(f" → MSE: ${mse1:,.0f}²")
print(f"\nBad predictions: {y_pred_reg2}")
print(f" → MSE: ${mse2:,.0f}²")
print("\nLower MSE = Better predictions!")
# Example 2: Binary Classification with Cross-Entropy
print("\n" + "="*60)
print("EXAMPLE 2: BINARY CLASSIFICATION (Email Spam Detection)")
print("="*60)
y_true_clf = np.array([1, 0, 1, 0]) # 1=spam, 0=not spam
y_pred_clf1 = np.array([0.9, 0.1, 0.85, 0.15]) # Confident & correct
y_pred_clf2 = np.array([0.6, 0.4, 0.55, 0.45]) # Uncertain
y_pred_clf3 = np.array([0.1, 0.9, 0.2, 0.8]) # Confident & wrong!
bce1 = binary_cross_entropy(y_true_clf, y_pred_clf1)
bce2 = binary_cross_entropy(y_true_clf, y_pred_clf2)
bce3 = binary_cross_entropy(y_true_clf, y_pred_clf3)
print(f"\nActual labels: {y_true_clf}")
print(f"Confident & correct: {y_pred_clf1}")
print(f" → BCE: {bce1:.4f}")
print(f"\nUncertain predictions: {y_pred_clf2}")
print(f" → BCE: {bce2:.4f}")
print(f"\nConfident & WRONG: {y_pred_clf3}")
print(f" → BCE: {bce3:.4f} ⚠️ HEAVILY PENALIZED!")
# Visualize how loss changes with predictions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# MSE visualization
actual = 100
predictions = np.linspace(50, 150, 100)
mse_values = [(pred - actual) ** 2 for pred in predictions]
ax1.plot(predictions, mse_values, linewidth=2, color='blue')
ax1.axvline(x=actual, color='red', linestyle='--', label=f'True value: {actual}')
ax1.scatter([actual], [0], color='red', s=100, zorder=5)
ax1.set_xlabel('Predicted Value', fontsize=12)
ax1.set_ylabel('Squared Error', fontsize=12)
ax1.set_title('Mean Squared Error (MSE)', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend()
# Cross-Entropy visualization
y_true_single = 1 # Actual class is 1
predictions_prob = np.linspace(0.01, 0.99, 100)
ce_values = [-y_true_single * np.log(p) - (1-y_true_single) * np.log(1-p)
for p in predictions_prob]
ax2.plot(predictions_prob, ce_values, linewidth=2, color='green')
ax2.axvline(x=1.0, color='red', linestyle='--', label='True class: 1')
ax2.set_xlabel('Predicted Probability for Class 1', fontsize=12)
ax2.set_ylabel('Cross-Entropy Loss', fontsize=12)
ax2.set_title('Binary Cross-Entropy', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.legend()
ax2.text(0.3, 2, 'Confidently wrong\n→ High penalty!', fontsize=10,
bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
plt.tight_layout()
plt.show()
print("\nKEY INSIGHT:")
print(" MSE: Penalizes distance from true value (regression)")
print(" Cross-Entropy: Penalizes confident wrong predictions (classification)")
Gradient Descent Explained
Gradient descent is the optimization algorithm that finds the best weights to minimize the loss. Imagine you're blindfolded on a mountain and want to reach the valley (minimum loss). Your strategy: feel the slope beneath your feet and take steps downhill.
The Mountain Climbing Analogy
Your Position: Current weight values
Elevation: Loss function value (higher = worse)
Goal: Reach the lowest point (minimum loss)
Strategy:
- Calculate the slope (gradient) at your current position
- Take a small step in the downhill direction (opposite of gradient)
- Repeat until you can't go lower (convergence)
Learning Rate: How big your steps are
- Too small → Takes forever to reach the bottom
- Too large → You overshoot and bounce around
- Just right → Efficient convergence
import numpy as np
import matplotlib.pyplot as plt
# Gradient Descent from scratch on a simple function
def loss_function(w):
"""Simple quadratic loss: L(w) = (w - 3)²"""
return (w - 3) ** 2
def gradient(w):
"""Derivative of loss: dL/dw = 2(w - 3)"""
return 2 * (w - 3)
def gradient_descent(starting_point, learning_rate, num_iterations):
"""
Perform gradient descent to find minimum.
Update rule: w_new = w_old - learning_rate * gradient
"""
w = starting_point
history = [w]
for i in range(num_iterations):
grad = gradient(w)
w = w - learning_rate * grad # Take step opposite to gradient
history.append(w)
if i < 5 or i % 10 == 0:
print(f"Iteration {i:2d}: w={w:.4f}, loss={loss_function(w):.4f}, gradient={grad:.4f}")
return w, history
# Test different learning rates
print("="*70)
print("GRADIENT DESCENT: Finding minimum of L(w) = (w-3)²")
print("True minimum is at w=3 (where loss=0)")
print("="*70)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Visualize loss function
w_range = np.linspace(-2, 8, 200)
loss_range = loss_function(w_range)
scenarios = [
(0.0, 0.1, "Good: Learning rate = 0.1"),
(0.0, 0.5, "Too fast: Learning rate = 0.5"),
(0.0, 0.01, "Too slow: Learning rate = 0.01"),
(7.0, 0.1, "Different start: w=7.0")
]
for idx, (start, lr, title) in enumerate(scenarios):
ax = axes[idx // 2, idx % 2]
print(f"\n{title}")
print("-" * 70)
final_w, history = gradient_descent(start, lr, 50)
# Plot loss function
ax.plot(w_range, loss_range, 'b-', linewidth=2, alpha=0.6, label='Loss function')
ax.axvline(x=3, color='red', linestyle='--', alpha=0.5, label='True minimum (w=3)')
# Plot gradient descent path
history_loss = [loss_function(w) for w in history]
ax.plot(history, history_loss, 'go-', linewidth=2, markersize=4,
alpha=0.7, label='GD path')
ax.scatter([history[0]], [loss_function(history[0])], color='green',
s=200, marker='*', zorder=5, label='Start')
ax.scatter([history[-1]], [loss_function(history[-1])], color='orange',
s=200, marker='*', zorder=5, label='End')
ax.set_xlabel('Weight (w)', fontsize=11)
ax.set_ylabel('Loss', fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_ylim(-1, max(20, max(history_loss) * 1.1))
plt.tight_layout()
plt.show()
print("\n" + "="*70)
print("OBSERVATIONS:")
print(" ✓ Learning rate 0.1: Smooth convergence to minimum")
print(" ⚠️ Learning rate 0.5: Overshoots but still converges")
print(" ⚠️ Learning rate 0.01: Converges slowly (needs more iterations)")
print(" ✓ Starting point doesn't matter (for convex functions)")
print("="*70)
Types of Gradient Descent
1. Batch Gradient Descent:
- Uses entire dataset to calculate gradient
- Accurate but slow for large datasets
- Formula: w = w - α · (1/N) · Σᵢ ∇L(xᵢ)
2. Stochastic Gradient Descent (SGD):
- Uses one random sample at a time
- Fast but noisy updates
- Formula: w = w - α · ∇L(xᵢ)
3. Mini-Batch Gradient Descent (Most Common):
- Uses small batches (e.g., 32, 64, 128 samples)
- Best of both worlds: fast + stable
- Formula: w = w - α · (1/B) · Σᵢ ∇L(xᵢ), where B = batch size
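The three update rules differ only in how many samples feed each gradient estimate. A minimal sketch fitting the one-parameter model y = 2x with plain NumPy (the toy data, helper names, and hyperparameters are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=200)
y = 2.0 * X + rng.normal(scale=0.05, size=200)   # true slope: 2

def grad(w, xb, yb):
    """Gradient of MSE (1/B)·Σ(w·x - y)² with respect to w."""
    return np.mean(2 * (w * xb - yb) * xb)

def fit(batch_size, lr=0.3, epochs=50):
    """Same loop for all three variants; only batch_size changes."""
    w = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))            # shuffle each epoch
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            w -= lr * grad(w, X[b], y[b])
    return w

w_batch, w_sgd, w_mini = fit(200), fit(1), fit(32)
print(f"Batch      (B=200): w = {w_batch:.3f}")
print(f"SGD        (B=1):   w = {w_sgd:.3f}")
print(f"Mini-batch (B=32):  w = {w_mini:.3f}")
```

All three land near the true slope of 2; SGD gets there with noisier steps, mini-batch with a good speed/stability trade-off.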
Backpropagation: The Magic Behind Learning
Backpropagation ("backward propagation of errors") is the algorithm that computes gradients efficiently in neural networks. It's the reason deep learning works—without it, training would be impossibly slow.
The key insight: Use the chain rule from calculus to propagate the error backward through the network, calculating how much each weight contributed to the final error.
Backpropagation Intuition: The Blame Game
Imagine your network made a wrong prediction. Who's to blame?
The Investigation:
- Output layer: "I was wrong by X amount"
- Ask hidden layer: "How much of this error is YOUR fault?"
- Hidden layer calculates: "Based on my weights to output, I contributed Y to the error"
- Ask input layer: Same process continues backward
- Result: Every weight knows exactly how much to change
Mathematical Magic: The chain rule lets us compute all these "blame assignments" in one backward pass—same cost as one forward pass!
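The "blame game" is exactly the chain rule. On a single neuron ŷ = σ(w·x + b) with squared-error loss, the links can be multiplied out by hand and sanity-checked against a finite difference (the input and parameter values here are illustrative):

```python
import numpy as np

# One neuron: y_hat = sigmoid(w*x + b), loss L = (y_hat - y)^2
x, y = 2.0, 1.0          # one training example (illustrative values)
w, b = 0.3, -0.1         # current parameters

z = w * x + b
a = 1 / (1 + np.exp(-z))         # y_hat
L = (a - y) ** 2

# Chain rule, link by link:
dL_da = 2 * (a - y)              # dL/da
da_dz = a * (1 - a)              # sigmoid derivative
dz_dw = x                        # d(wx + b)/dw
dL_dw = dL_da * da_dz * dz_dw    # multiply the links together

# Sanity check: nudge w slightly and watch how L changes
eps = 1e-6
z_p = (w + eps) * x + b
L_p = (1 / (1 + np.exp(-z_p)) - y) ** 2
numeric = (L_p - L) / eps
print(f"analytic dL/dw = {dL_dw:.6f}, numeric = {numeric:.6f}")
```

The two numbers agree, which is precisely what backpropagation does at scale: one chain-rule product per weight, computed in a single backward sweep.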
import numpy as np
# Complete Backpropagation Implementation from Scratch
class SimpleNeuralNetworkWithBackprop:
def __init__(self, input_size, hidden_size, output_size, learning_rate=0.1):
"""
Neural network: input → hidden → output
"""
# Initialize weights
self.W1 = np.random.randn(input_size, hidden_size) * 0.01
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * 0.01
self.b2 = np.zeros((1, output_size))
self.learning_rate = learning_rate
# For storing intermediate values during forward pass
self.cache = {}
def sigmoid(self, z):
"""Sigmoid activation"""
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(self, a):
"""Derivative of sigmoid: σ'(z) = σ(z) * (1 - σ(z))"""
return a * (1 - a)
def forward(self, X):
"""Forward pass - compute predictions and save intermediate values"""
# Layer 1
self.cache['X'] = X
self.cache['z1'] = np.dot(X, self.W1) + self.b1
self.cache['a1'] = self.sigmoid(self.cache['z1'])
# Layer 2
self.cache['z2'] = np.dot(self.cache['a1'], self.W2) + self.b2
self.cache['a2'] = self.sigmoid(self.cache['z2'])
return self.cache['a2']
def backward(self, X, y):
"""
Backpropagation - compute gradients using chain rule.
Chain rule breakdown:
dL/dW2 = dL/da2 * da2/dz2 * dz2/dW2
dL/dW1 = dL/da2 * da2/dz2 * dz2/da1 * da1/dz1 * dz1/dW1
"""
m = X.shape[0] # Number of samples
# Output layer gradients
# dz2 = a2 - y is the exact gradient for cross-entropy with a sigmoid
# output; it's used here as a common shortcut even though we report MSE
dz2 = self.cache['a2'] - y
dW2 = (1/m) * np.dot(self.cache['a1'].T, dz2)
db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
# Hidden layer gradients (chain rule applied!)
da1 = np.dot(dz2, self.W2.T) # Error propagated back
dz1 = da1 * self.sigmoid_derivative(self.cache['a1'])
dW1 = (1/m) * np.dot(X.T, dz1)
db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
# Store gradients
gradients = {
'dW1': dW1, 'db1': db1,
'dW2': dW2, 'db2': db2
}
return gradients
def update_parameters(self, gradients):
"""Update weights using gradient descent"""
self.W1 -= self.learning_rate * gradients['dW1']
self.b1 -= self.learning_rate * gradients['db1']
self.W2 -= self.learning_rate * gradients['dW2']
self.b2 -= self.learning_rate * gradients['db2']
def train_step(self, X, y):
"""One complete training step: forward → backward → update"""
# Forward pass
predictions = self.forward(X)
# Calculate loss
loss = np.mean((predictions - y) ** 2)
# Backward pass
gradients = self.backward(X, y)
# Update weights
self.update_parameters(gradients)
return loss, gradients
# Demonstrate backpropagation on XOR problem
print("="*70)
print("BACKPROPAGATION DEMO: Learning XOR")
print("="*70)
# XOR dataset (the problem single perceptron couldn't solve!)
X_xor = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y_xor = np.array([[0],
[1],
[1],
[0]])
# Create network
nn = SimpleNeuralNetworkWithBackprop(input_size=2, hidden_size=4, output_size=1,
learning_rate=0.5)
print("\nInitial predictions (random weights):")
initial_preds = nn.forward(X_xor)
for i, (x, y_true, y_pred) in enumerate(zip(X_xor, y_xor, initial_preds)):
print(f" {x} → Predicted: {y_pred[0]:.4f}, Actual: {y_true[0]}")
# Training loop
print("\nTraining...")
losses = []
for epoch in range(1000):
loss, grads = nn.train_step(X_xor, y_xor)
losses.append(loss)
if epoch % 200 == 0:
print(f" Epoch {epoch:4d}: Loss = {loss:.6f}")
print("\nFinal predictions (after training):")
final_preds = nn.forward(X_xor)
for i, (x, y_true, y_pred) in enumerate(zip(X_xor, y_xor, final_preds)):
print(f" {x} → Predicted: {y_pred[0]:.4f}, Actual: {y_true[0]} ✓")
# Visualize training
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(losses, linewidth=2, color='blue')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss (MSE)', fontsize=12)
plt.title('Training Loss Over Time', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.subplot(1, 2, 2)
# Decision boundary
xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200),
np.linspace(-0.5, 1.5, 200))
Z = nn.forward(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.7)
plt.colorbar(label='Prediction')
plt.scatter(X_xor[y_xor.ravel()==0, 0], X_xor[y_xor.ravel()==0, 1],
c='blue', s=200, edgecolors='black', linewidth=2, label='Class 0')
plt.scatter(X_xor[y_xor.ravel()==1, 0], X_xor[y_xor.ravel()==1, 1],
c='red', s=200, edgecolors='black', linewidth=2, label='Class 1')
plt.xlabel('Input 1', fontsize=12)
plt.ylabel('Input 2', fontsize=12)
plt.title('Learned Decision Boundary (XOR)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nBackpropagation successfully learned XOR!")
print(" A single neuron CANNOT do this, but a neural network CAN!")
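A standard way to validate a backprop implementation like the one above is gradient checking: compare every analytic gradient against a centered finite difference of the loss. The sketch below redefines a small version of the network; note it uses the exact MSE gradient dz2 = 2(a2 - y)·a2·(1 - a2)/m, because the tutorial's dz2 = a2 - y shortcut is the exact gradient for cross-entropy, not MSE, and the check only passes with the exact form:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
params = {'W1': rng.normal(size=(2, 3)) * 0.5, 'b1': np.zeros((1, 3)),
          'W2': rng.normal(size=(3, 1)) * 0.5, 'b2': np.zeros((1, 1))}

def forward(p):
    a1 = sigmoid(X @ p['W1'] + p['b1'])
    a2 = sigmoid(a1 @ p['W2'] + p['b2'])
    return a1, a2

def mse(p):
    _, a2 = forward(p)
    return np.mean((a2 - y) ** 2)

def backward(p):
    """Analytic gradients of the MSE loss (exact, including the 2/m factor)."""
    a1, a2 = forward(p)
    m = X.shape[0]
    dz2 = 2 * (a2 - y) * a2 * (1 - a2) / m
    dz1 = (dz2 @ p['W2'].T) * a1 * (1 - a1)
    return {'W2': a1.T @ dz2, 'b2': dz2.sum(axis=0, keepdims=True),
            'W1': X.T @ dz1, 'b1': dz1.sum(axis=0, keepdims=True)}

analytic = backward(params)
eps, max_diffs = 1e-5, {}
for name, p in params.items():
    numeric = np.zeros_like(p)
    for idx in np.ndindex(p.shape):          # perturb one entry at a time
        p[idx] += eps; up = mse(params)
        p[idx] -= 2 * eps; down = mse(params)
        p[idx] += eps                        # restore original value
        numeric[idx] = (up - down) / (2 * eps)
    max_diffs[name] = np.max(np.abs(numeric - analytic[name]))
    print(f"{name}: max |numeric - analytic| = {max_diffs[name]:.2e}")
```

Discrepancies should be tiny (around 1e-9 or less); anything larger usually means a bug in the backward pass.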
Optimization Techniques (SGD, Adam, RMSprop)
While basic gradient descent works, modern optimizers add clever tricks to train faster and more reliably. Let's explore the most popular optimizers used in practice.
import numpy as np
import matplotlib.pyplot as plt
# Implement popular optimizers from scratch
class SGD:
"""Stochastic Gradient Descent with Momentum"""
def __init__(self, learning_rate=0.01, momentum=0.9):
self.lr = learning_rate
self.momentum = momentum
self.velocity = {}
def update(self, params, grads):
"""Update parameters with momentum"""
for key in params:
if key not in self.velocity:
self.velocity[key] = np.zeros_like(params[key])
# Momentum: accumulate velocity
self.velocity[key] = self.momentum * self.velocity[key] - self.lr * grads[key]
params[key] += self.velocity[key]
class RMSprop:
"""RMSprop: adapts learning rate for each parameter"""
def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
self.lr = learning_rate
self.decay_rate = decay_rate
self.epsilon = epsilon
self.cache = {}
def update(self, params, grads):
"""Update with adaptive learning rates"""
for key in params:
if key not in self.cache:
self.cache[key] = np.zeros_like(params[key])
# Accumulate squared gradients
self.cache[key] = self.decay_rate * self.cache[key] + \
(1 - self.decay_rate) * grads[key]**2
# Adaptive update
params[key] -= self.lr * grads[key] / (np.sqrt(self.cache[key]) + self.epsilon)
class Adam:
"""Adam: combines momentum and RMSprop"""
def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
self.lr = learning_rate
self.beta1 = beta1 # Momentum decay
self.beta2 = beta2 # RMSprop decay
self.epsilon = epsilon
self.m = {} # First moment (momentum)
self.v = {} # Second moment (RMSprop)
self.t = 0 # Time step
def update(self, params, grads):
"""Update with bias-corrected moments"""
self.t += 1
for key in params:
if key not in self.m:
self.m[key] = np.zeros_like(params[key])
self.v[key] = np.zeros_like(params[key])
# Update biased first moment (momentum)
self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
# Update biased second moment (RMSprop)
self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key]**2)
# Bias correction
m_hat = self.m[key] / (1 - self.beta1**self.t)
v_hat = self.v[key] / (1 - self.beta2**self.t)
# Update parameters
params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.epsilon)
# Compare optimizers on a challenging function
def rosenbrock(x, y):
"""Rosenbrock function: (1-x)² + 100(y-x²)²"""
return (1 - x)**2 + 100 * (y - x**2)**2
def rosenbrock_gradient(x, y):
"""Gradient of Rosenbrock function"""
dx = -2*(1-x) - 400*x*(y - x**2)
dy = 200*(y - x**2)
return np.array([dx, dy])
# Test all optimizers
optimizers = {
'SGD': SGD(learning_rate=0.001, momentum=0.9),
'RMSprop': RMSprop(learning_rate=0.01),
'Adam': Adam(learning_rate=0.01)
}
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Create contour plot
x = np.linspace(-2, 2, 400)
y = np.linspace(-1, 3, 400)
X, Y = np.meshgrid(x, y)
Z = rosenbrock(X, Y)
for idx, (name, optimizer) in enumerate(optimizers.items()):
ax = axes[idx]
# Plot function landscape
contour = ax.contour(X, Y, Z, levels=np.logspace(-1, 3.5, 20), cmap='viridis', alpha=0.6)
ax.clabel(contour, inline=True, fontsize=8)
# Optimize
params = {'w': np.array([-1.5, 2.5])} # Starting point
path = [params['w'].copy()]
for i in range(500):
# Calculate gradient
grad = rosenbrock_gradient(params['w'][0], params['w'][1])
grads = {'w': grad}
# Update
optimizer.update(params, grads)
path.append(params['w'].copy())
# Stop if converged
if np.linalg.norm(grad) < 1e-5:
break
path = np.array(path)
# Plot optimization path
ax.plot(path[:, 0], path[:, 1], 'r.-', linewidth=2, markersize=3, alpha=0.7)
ax.scatter([path[0, 0]], [path[0, 1]], color='green', s=200, marker='*',
zorder=5, label='Start', edgecolors='black', linewidth=2)
ax.scatter([1], [1], color='red', s=200, marker='*',
zorder=5, label='Optimum (1,1)', edgecolors='black', linewidth=2)
ax.scatter([path[-1, 0]], [path[-1, 1]], color='orange', s=100,
marker='o', zorder=5, label=f'End (iter={len(path)})')
ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title(f'{name} Optimizer', fontsize=14, fontweight='bold')
ax.legend(fontsize=9)
ax.grid(True, alpha=0.3)
ax.set_xlim(-2, 2)
ax.set_ylim(-1, 3)
plt.tight_layout()
plt.show()
# Summary comparison
print("="*70)
print("OPTIMIZER COMPARISON")
print("="*70)
print(f"{'Optimizer':<15} {'Pros':<30} {'Cons':<30}")
print("-"*70)
print(f"{'SGD':<15} {'Simple, reliable':<30} {'Slow, sensitive to LR':<30}")
print(f"{'RMSprop':<15} {'Adaptive LR, fast':<30} {'Can be unstable':<30}")
print(f"{'Adam':<15} {'Fast, robust, popular':<30} {'Memory overhead':<30}")
print("-"*70)
print("\nIn practice: Adam is the default choice for most deep learning tasks")
print(" It combines the best of momentum and adaptive learning rates!")
Modern Training Recipe
Standard setup for training neural networks (2020s):
- Optimizer: Adam (learning_rate=0.001, default betas)
- Batch size: 32-128 (balance speed vs memory)
- Loss function:
  - Regression → MSE or MAE
  - Binary classification → Binary Cross-Entropy
  - Multi-class → Categorical Cross-Entropy
- Epochs: Train until validation loss stops improving (early stopping)
- Learning rate schedule: Reduce LR when plateauing
That's it! This recipe works for 80% of problems. Fine-tune only if needed.
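The early-stopping and LR-schedule items of the recipe can be sketched without a framework (in Keras you would reach for the `EarlyStopping` and `ReduceLROnPlateau` callbacks instead). A toy version on a quadratic stand-in for validation loss, with patience and thresholds chosen by us for illustration:

```python
import numpy as np

loss = lambda w: (w - 3.0) ** 2        # stand-in for validation loss
grad = lambda w: 2 * (w - 3.0)

w, lr = -4.0, 0.05
best, patience, bad_epochs = np.inf, 5, 0

for epoch in range(500):
    w -= lr * grad(w)                  # one "epoch" of training
    val = loss(w)
    if val < best - 1e-8:              # meaningful improvement
        best, bad_epochs = val, 0
    else:
        bad_epochs += 1
        if bad_epochs == 3:            # plateau: halve the learning rate
            lr *= 0.5
        if bad_epochs >= patience:     # still no progress: stop early
            print(f"Stopped at epoch {epoch}, w = {w:.4f}")
            break
```

The loop stops long before the 500-epoch budget, once improvements fall below the threshold, which is the whole point of early stopping.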
Building Your First Neural Network from Scratch
Now it's time to put everything together! We'll build a complete neural network from scratch using only NumPy—no frameworks, no magic. You'll understand every line of code and see exactly how neural networks work under the hood.
Problem: XOR Classification
We'll solve the XOR (exclusive OR) problem—the classic challenge that single-layer perceptrons cannot solve. This proves our network truly learns non-linear patterns.
What is XOR?
XOR (exclusive OR) returns True only when inputs are different:
| Input 1 | Input 2 | XOR Output |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Why it's hard: You cannot draw a single straight line to separate the green (1) from red (0) points. This requires a non-linear decision boundary, which only multi-layer networks can learn.
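One way to see that a single hidden layer is enough: XOR decomposes into two linearly separable sub-problems, since XOR(a, b) = (a OR b) AND NOT (a AND b). With hand-picked weights (ours, for illustration), two threshold units plus one output unit compute it exactly:

```python
import numpy as np

# Heaviside-style threshold unit: fires when its weighted sum is positive
step = lambda z: (z > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

h1 = step(X @ np.array([1, 1]) - 0.5)   # OR  unit: fires if a + b > 0.5
h2 = step(X @ np.array([1, 1]) - 1.5)   # AND unit: fires if a + b > 1.5
out = step(h1 - h2 - 0.5)               # OR AND NOT(AND) = XOR
print(out)   # [0 1 1 0]
```

Training finds weights like these automatically; the hidden layer's job is to carve the plane into the two separable pieces.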
import numpy as np
import matplotlib.pyplot as plt
# Visualize why XOR is non-linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], s=300, c='red', marker='o',
edgecolors='black', linewidth=3, label='Class 0', alpha=0.7)
plt.scatter(X[y==1, 0], X[y==1, 1], s=300, c='green', marker='s',
edgecolors='black', linewidth=3, label='Class 1', alpha=0.7)
# Try to draw linear separators (they all fail!)
x_line = np.linspace(-0.2, 1.2, 100)
plt.plot(x_line, 0.5*np.ones_like(x_line), 'b--', alpha=0.5, linewidth=2,
label='Horizontal line (fails)')
plt.plot(0.5*np.ones_like(x_line), x_line, 'purple', linestyle='--', alpha=0.5,
linewidth=2, label='Vertical line (fails)')
plt.plot(x_line, x_line, 'orange', linestyle='--', alpha=0.5, linewidth=2,
label='Diagonal line (fails)')
plt.xlabel('Input 1', fontsize=14)
plt.ylabel('Input 2', fontsize=14)
plt.title('XOR Problem: No Linear Separator Exists', fontsize=16, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(-0.2, 1.2)
plt.ylim(-0.2, 1.2)
plt.tight_layout()
plt.show()
print("✗ Single perceptron: CANNOT solve XOR")
print("✓ Multi-layer network: CAN solve XOR")
print("\nLet's build one from scratch!")
Python Implementation with NumPy
Here's our complete neural network implementation. Every component is explained with comments.
import numpy as np
import matplotlib.pyplot as plt
class NeuralNetwork:
"""
A simple neural network with one hidden layer.
Architecture: input_size → hidden_size → output_size
"""
def __init__(self, input_size, hidden_size, output_size, learning_rate=0.5):
"""Initialize network with random weights"""
# Xavier initialization: scale by sqrt(1/n) for better training
self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)
self.b2 = np.zeros((1, output_size))
self.learning_rate = learning_rate
# For storing values during forward/backward pass
self.cache = {}
self.grads = {}
def sigmoid(self, z):
"""Sigmoid activation: σ(z) = 1 / (1 + e^(-z))"""
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(self, a):
"""Derivative of sigmoid: σ'(z) = σ(z) * (1 - σ(z))"""
return a * (1 - a)
def forward(self, X):
"""
Forward propagation: compute predictions.
Flow: X → W1,b1 → sigmoid → W2,b2 → sigmoid → predictions
"""
# Hidden layer
self.cache['X'] = X
self.cache['z1'] = np.dot(X, self.W1) + self.b1
self.cache['a1'] = self.sigmoid(self.cache['z1'])
# Output layer
self.cache['z2'] = np.dot(self.cache['a1'], self.W2) + self.b2
self.cache['a2'] = self.sigmoid(self.cache['z2'])
return self.cache['a2']
def backward(self, y):
"""
Backpropagation: compute gradients using chain rule.
Computes: dL/dW2, dL/db2, dL/dW1, dL/db1
"""
m = self.cache['X'].shape[0] # Number of samples
# Output layer gradients
dz2 = self.cache['a2'] - y # dL/dz2 (exact for cross-entropy + sigmoid; common shortcut with MSE)
self.grads['dW2'] = (1/m) * np.dot(self.cache['a1'].T, dz2)
self.grads['db2'] = (1/m) * np.sum(dz2, axis=0, keepdims=True)
# Hidden layer gradients (chain rule!)
da1 = np.dot(dz2, self.W2.T)
dz1 = da1 * self.sigmoid_derivative(self.cache['a1'])
self.grads['dW1'] = (1/m) * np.dot(self.cache['X'].T, dz1)
self.grads['db1'] = (1/m) * np.sum(dz1, axis=0, keepdims=True)
def update_parameters(self):
"""Gradient descent: update weights and biases"""
self.W1 -= self.learning_rate * self.grads['dW1']
self.b1 -= self.learning_rate * self.grads['db1']
self.W2 -= self.learning_rate * self.grads['dW2']
self.b2 -= self.learning_rate * self.grads['db2']
def compute_loss(self, y_true, y_pred):
"""Mean Squared Error loss"""
return np.mean((y_true - y_pred) ** 2)
def train(self, X, y, epochs=10000, print_every=1000):
"""
Complete training loop.
For each epoch:
1. Forward pass (get predictions)
2. Compute loss
3. Backward pass (compute gradients)
4. Update parameters
"""
losses = []
for epoch in range(epochs):
# Forward
predictions = self.forward(X)
# Loss
loss = self.compute_loss(y, predictions)
losses.append(loss)
# Backward
self.backward(y)
# Update
self.update_parameters()
# Print progress
if epoch % print_every == 0:
print(f"Epoch {epoch:5d} | Loss: {loss:.6f}")
return losses
def predict(self, X):
"""Make predictions (prob > 0.5 → class 1, prob ≤ 0.5 → class 0)"""
probs = self.forward(X)
return (probs > 0.5).astype(int)
# Create the network
print("="*70)
print("BUILDING NEURAL NETWORK FROM SCRATCH")
print("="*70)
print("\nArchitecture: 2 inputs → 4 hidden neurons → 1 output")
print("Activation: Sigmoid (both layers)")
print("Loss: Mean Squared Error")
print("Optimizer: Gradient Descent (learning_rate=0.5)")
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=0.5)
print(f"\nInitial parameters:")
print(f" W1 shape: {nn.W1.shape} (weights: input ? hidden)")
print(f" b1 shape: {nn.b1.shape} (biases: hidden layer)")
print(f" W2 shape: {nn.W2.shape} (weights: hidden ? output)")
print(f" b2 shape: {nn.b2.shape} (biases: output layer)")
print(f"\n Total parameters: {nn.W1.size + nn.b1.size + nn.W2.size + nn.b2.size}")
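The parameter count printed above follows directly from the shapes: each weight matrix contributes in×out entries and each bias vector one entry per neuron. A tiny helper (the name is ours) makes the arithmetic explicit:

```python
def count_params(input_size, hidden_size, output_size):
    """Parameters of a 1-hidden-layer dense net: W1 + b1 + W2 + b2."""
    return (input_size * hidden_size + hidden_size
            + hidden_size * output_size + output_size)

# For the 2 → 4 → 1 network above: 2*4 + 4 + 4*1 + 1 = 17
print(count_params(2, 4, 1))  # 17
```

This is worth internalizing: dense layers grow quadratically with width, which is why large networks get expensive fast.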
Training the Network Step-by-Step
Let's train our network on the XOR dataset and watch it learn!
import numpy as np
import matplotlib.pyplot as plt
# Using the NeuralNetwork class from previous code block
class NeuralNetwork:
"""Complete implementation (same as above)"""
def __init__(self, input_size, hidden_size, output_size, learning_rate=0.5):
self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)
self.b2 = np.zeros((1, output_size))
self.learning_rate = learning_rate
self.cache = {}
self.grads = {}
def sigmoid(self, z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(self, a):
return a * (1 - a)
def forward(self, X):
self.cache['X'] = X
self.cache['z1'] = np.dot(X, self.W1) + self.b1
self.cache['a1'] = self.sigmoid(self.cache['z1'])
self.cache['z2'] = np.dot(self.cache['a1'], self.W2) + self.b2
self.cache['a2'] = self.sigmoid(self.cache['z2'])
return self.cache['a2']
def backward(self, y):
m = self.cache['X'].shape[0]
dz2 = self.cache['a2'] - y
self.grads['dW2'] = (1/m) * np.dot(self.cache['a1'].T, dz2)
self.grads['db2'] = (1/m) * np.sum(dz2, axis=0, keepdims=True)
da1 = np.dot(dz2, self.W2.T)
dz1 = da1 * self.sigmoid_derivative(self.cache['a1'])
self.grads['dW1'] = (1/m) * np.dot(self.cache['X'].T, dz1)
self.grads['db1'] = (1/m) * np.sum(dz1, axis=0, keepdims=True)
def update_parameters(self):
self.W1 -= self.learning_rate * self.grads['dW1']
self.b1 -= self.learning_rate * self.grads['db1']
self.W2 -= self.learning_rate * self.grads['dW2']
self.b2 -= self.learning_rate * self.grads['db2']
def compute_loss(self, y_true, y_pred):
return np.mean((y_true - y_pred) ** 2)
def train(self, X, y, epochs=10000, print_every=1000):
losses = []
for epoch in range(epochs):
predictions = self.forward(X)
loss = self.compute_loss(y, predictions)
losses.append(loss)
self.backward(y)
self.update_parameters()
if epoch % print_every == 0:
print(f"Epoch {epoch:5d} | Loss: {loss:.6f}")
return losses
def predict(self, X):
probs = self.forward(X)
return (probs > 0.5).astype(int)
# XOR dataset
X_train = np.array([[0, 0],
[0, 1],
[1, 0],
[1, 1]])
y_train = np.array([[0],
[1],
[1],
[0]])
print("="*70)
print("TRAINING ON XOR DATASET")
print("="*70)
# Test initial predictions (should be random/bad)
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=0.5)
print("\nBEFORE TRAINING:")
initial_preds = nn.forward(X_train)
for i in range(len(X_train)):
print(f" Input: {X_train[i]} → Prediction: {initial_preds[i][0]:.4f}, True: {y_train[i][0]}")
# Train the network
print("\nTRAINING...")
losses = nn.train(X_train, y_train, epochs=10000, print_every=2000)
# Test final predictions
print("\nAFTER TRAINING:")
final_preds = nn.forward(X_train)
predictions = nn.predict(X_train)
for i in range(len(X_train)):
prob = final_preds[i][0]
pred_class = predictions[i][0]
true_class = y_train[i][0]
status = "✓" if pred_class == true_class else "✗"
print(f" Input: {X_train[i]} → Probability: {prob:.4f}, Predicted: {pred_class}, True: {true_class} {status}")
# Visualize training progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Loss curve
ax1.plot(losses, linewidth=2, color='blue')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (MSE)', fontsize=12)
ax1.set_title('Training Loss Over Time', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')
# Predictions at different epochs
epochs_to_show = [0, 100, 500, 1000, 5000, 9999]
for epoch in epochs_to_show:
# Re-create network and train to this epoch
temp_nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=0.5)
np.random.seed(42) # Same initialization
temp_nn.W1 = np.random.randn(2, 4) * np.sqrt(1. / 2)
temp_nn.b1 = np.zeros((1, 4))
temp_nn.W2 = np.random.randn(4, 1) * np.sqrt(1. / 4)
temp_nn.b2 = np.zeros((1, 1))
if epoch > 0:
temp_nn.train(X_train, y_train, epochs=epoch, print_every=100000)
preds = temp_nn.forward(X_train)
avg_error = np.mean(np.abs(preds - y_train))
ax2.plot([epoch] * 4, preds.flatten(), 'o-', label=f'Epoch {epoch} (error={avg_error:.3f})',
markersize=8, alpha=0.7)
ax2.axhline(y=0, color='red', linestyle='--', alpha=0.3, label='Target: Class 0')
ax2.axhline(y=1, color='green', linestyle='--', alpha=0.3, label='Target: Class 1')
ax2.set_xlabel('Training Epoch', fontsize=12)
ax2.set_ylabel('Network Output', fontsize=12)
ax2.set_title('How Predictions Improve During Training', fontsize=14, fontweight='bold')
ax2.legend(fontsize=8)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nSuccess! The network learned XOR perfectly!")
print(" Notice how predictions converge to correct values over time.")
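A useful habit whenever you write a backward pass by hand is a finite-difference gradient check: perturb each weight slightly and confirm the numerical slope matches the analytic gradient. Below is a minimal standalone sketch for a single sigmoid layer with MSE loss (all names and sizes here are illustrative; note this version keeps the sigmoid-derivative factor in the chain rule):

```python
import numpy as np

# Single sigmoid layer with MSE loss; data and weights are arbitrary.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
y = rng.random((4, 1))
W = rng.standard_normal((3, 1))
b = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W, b):
    a = sigmoid(X @ W + b)
    return np.mean((y - a) ** 2)

# Analytic gradient via the chain rule: dL/dz = 2 * (a - y) * a * (1 - a) / m
a = sigmoid(X @ W + b)
m = X.shape[0]
dz = 2 * (a - y) * a * (1 - a) / m
dW = X.T @ dz

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp, b) - loss(Wm, b)) / (2 * eps)

print("Max gradient difference:", np.max(np.abs(dW - dW_num)))
```

If the two gradients disagree by more than roughly 1e-6, there is almost certainly a bug in the backward pass.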
Visualizing Decision Boundaries
The best way to understand what our network learned is to visualize its decision boundary—the curve separating different classes in the input space.
import numpy as np
import matplotlib.pyplot as plt
# Re-defining and re-training the network so this block runs standalone
class NeuralNetwork:
"""Complete implementation"""
def __init__(self, input_size, hidden_size, output_size, learning_rate=0.5):
np.random.seed(42) # For reproducibility
self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(1. / input_size)
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(1. / hidden_size)
self.b2 = np.zeros((1, output_size))
self.learning_rate = learning_rate
self.cache = {}
self.grads = {}
def sigmoid(self, z):
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def sigmoid_derivative(self, a):
return a * (1 - a)
def forward(self, X):
self.cache['X'] = X
self.cache['z1'] = np.dot(X, self.W1) + self.b1
self.cache['a1'] = self.sigmoid(self.cache['z1'])
self.cache['z2'] = np.dot(self.cache['a1'], self.W2) + self.b2
self.cache['a2'] = self.sigmoid(self.cache['z2'])
return self.cache['a2']
def backward(self, y):
m = self.cache['X'].shape[0]
dz2 = self.cache['a2'] - y
self.grads['dW2'] = (1/m) * np.dot(self.cache['a1'].T, dz2)
self.grads['db2'] = (1/m) * np.sum(dz2, axis=0, keepdims=True)
da1 = np.dot(dz2, self.W2.T)
dz1 = da1 * self.sigmoid_derivative(self.cache['a1'])
self.grads['dW1'] = (1/m) * np.dot(self.cache['X'].T, dz1)
self.grads['db1'] = (1/m) * np.sum(dz1, axis=0, keepdims=True)
def update_parameters(self):
self.W1 -= self.learning_rate * self.grads['dW1']
self.b1 -= self.learning_rate * self.grads['db1']
self.W2 -= self.learning_rate * self.grads['dW2']
self.b2 -= self.learning_rate * self.grads['db2']
def compute_loss(self, y_true, y_pred):
return np.mean((y_true - y_pred) ** 2)
def train(self, X, y, epochs=10000):
for epoch in range(epochs):
predictions = self.forward(X)
self.backward(y)
self.update_parameters()
# Train network
X_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_train = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork(input_size=2, hidden_size=4, output_size=1, learning_rate=0.5)
nn.train(X_train, y_train, epochs=10000)
# Create decision boundary plot
print("="*70)
print("VISUALIZING DECISION BOUNDARY")
print("="*70)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Create mesh grid
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200))
# Get predictions for all points in grid
Z = nn.forward(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot 1: Continuous probability heatmap
im1 = axes[0].contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.8)
axes[0].scatter(X_train[y_train.ravel()==0, 0], X_train[y_train.ravel()==0, 1],
s=300, c='blue', marker='o', edgecolors='black', linewidth=3,
label='Class 0', zorder=5)
axes[0].scatter(X_train[y_train.ravel()==1, 0], X_train[y_train.ravel()==1, 1],
s=300, c='red', marker='s', edgecolors='black', linewidth=3,
label='Class 1', zorder=5)
axes[0].set_xlabel('Input 1', fontsize=12)
axes[0].set_ylabel('Input 2', fontsize=12)
axes[0].set_title('Decision Boundary (Probability Heatmap)', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
fig.colorbar(im1, ax=axes[0], label='P(Class 1)')
# Plot 2: Binary classification regions
Z_binary = (Z > 0.5).astype(int)
axes[1].contourf(xx, yy, Z_binary, levels=[-0.5, 0.5, 1.5], colors=['lightblue', 'lightcoral'], alpha=0.6)
axes[1].contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=3,
linestyles='solid') # the p=0.5 decision boundary (contour ignores a label kwarg)
axes[1].scatter(X_train[y_train.ravel()==0, 0], X_train[y_train.ravel()==0, 1],
s=300, c='blue', marker='o', edgecolors='black', linewidth=3,
label='Class 0', zorder=5)
axes[1].scatter(X_train[y_train.ravel()==1, 0], X_train[y_train.ravel()==1, 1],
s=300, c='red', marker='s', edgecolors='black', linewidth=3,
label='Class 1', zorder=5)
axes[1].set_xlabel('Input 1', fontsize=12)
axes[1].set_ylabel('Input 2', fontsize=12)
axes[1].set_title('Binary Classification Regions', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
# Plot 3: 3D surface (swap the third 2D axis for a 3D one)
axes[2].remove() # otherwise the empty 2D axis overlaps the 3D plot
ax3 = fig.add_subplot(1, 3, 3, projection='3d')
surf = ax3.plot_surface(xx, yy, Z, cmap='RdYlBu_r', alpha=0.8,
edgecolor='none', antialiased=True)
ax3.scatter(X_train[:, 0], X_train[:, 1], y_train.ravel(),
s=200, c=['blue', 'red', 'red', 'blue'], marker='o',
edgecolors='black', linewidth=2, depthshade=False)
ax3.set_xlabel('Input 1', fontsize=11)
ax3.set_ylabel('Input 2', fontsize=11)
ax3.set_zlabel('Output Probability', fontsize=11)
ax3.set_title('3D Output Surface', fontsize=14, fontweight='bold')
ax3.view_init(elev=20, azim=45)
fig.colorbar(surf, ax=ax3, label='P(Class 1)', shrink=0.5)
plt.tight_layout()
plt.show()
print("\nDecision Boundary Analysis:")
print(" - The boundary is a CURVED line (non-linear!)")
print(" - Blue region: Network predicts Class 0")
print(" - Red region: Network predicts Class 1")
print(" - The 4 XOR points are correctly separated")
print("\nThis proves our network learned a non-linear function!")
print(" Something a single perceptron could NEVER do.")
# Show what each hidden neuron learned
print("\n" + "="*70)
print("WHAT DID THE HIDDEN NEURONS LEARN?")
print("="*70)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
# Hidden-layer activations across the grid (computed once for all neurons)
z1_grid = np.dot(np.c_[xx.ravel(), yy.ravel()], nn.W1) + nn.b1
a1_grid = nn.sigmoid(z1_grid)
for neuron_idx in range(min(4, nn.W1.shape[1])):
# Get this neuron's activation across the grid
neuron_activation = a1_grid[:, neuron_idx].reshape(xx.shape)
ax = axes[neuron_idx]
im = ax.contourf(xx, yy, neuron_activation, levels=20, cmap='viridis', alpha=0.8)
ax.scatter(X_train[:, 0], X_train[:, 1], s=200, c='red',
edgecolors='black', linewidth=2, zorder=5)
ax.set_xlabel('Input 1', fontsize=11)
ax.set_ylabel('Input 2', fontsize=11)
ax.set_title(f'Hidden Neuron {neuron_idx+1} Activation', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
fig.colorbar(im, ax=ax, label='Activation')
plt.tight_layout()
plt.show()
print("\nEach hidden neuron learns a different 'feature':")
print(" - Each carves out a soft half-plane (a linear boundary through a sigmoid)")
print(" - The output neuron COMBINES these features")
print(" - This combination creates the final XOR pattern!")
print("\nYou've successfully built a neural network from scratch!")
What You've Accomplished
Congratulations! You've built a complete neural network from scratch using only NumPy. Here's what you now understand:
- ✓ Forward Propagation: How data flows through the network to make predictions
- ✓ Loss Functions: How to measure prediction errors
- ✓ Backpropagation: How gradients are computed using the chain rule
- ✓ Gradient Descent: How weights are updated to minimize loss
- ✓ Non-Linear Learning: How hidden layers enable learning complex patterns
- ✓ Decision Boundaries: How networks partition the input space
Key Insight: The XOR problem, unsolvable by a single perceptron, is trivial for a network with even one hidden layer. This demonstrates the power of non-linear hidden layers.
Next: We'll explore different types of neural network architectures (feedforward, CNN, RNN) and when to use each one.
Types of Neural Network Architectures
Not all neural networks are created equal. Different problems require different architectures. Just as you wouldn't use a hammer to cut wood, you wouldn't use a CNN for time series prediction or an RNN for image classification. Let's explore the main architecture families and when to use each.
The Neural Network Family Tree
Quick Reference Guide:
- Feedforward NN: Tabular data, simple classification/regression
- CNN: Images, spatial data, pattern recognition
- RNN/LSTM: Sequences, time series, text, speech
- Autoencoders: Dimensionality reduction, denoising, anomaly detection
- GANs: Generating new data, image synthesis, data augmentation
- Transformers: NLP, large-scale sequence modeling, vision tasks
Feedforward Neural Networks (FNN)
Feedforward Neural Networks (also called Multi-Layer Perceptrons or MLPs) are the simplest architecture we've been using so far. Information flows in one direction: input → hidden layers → output. No loops, no feedback.
Feedforward Neural Network Characteristics
Structure:
- Fully connected layers (every neuron connects to all neurons in next layer)
- Information flows forward only (no cycles)
- Each layer transforms input using:
activation(W·x + b)
Best For:
- Tabular data (spreadsheets, databases)
- Simple classification (spam detection, fraud detection)
- Regression (price prediction, scoring)
- Feature learning from fixed-size inputs
Limitations:
- No spatial awareness (treats pixels as independent features)
- No temporal memory (can't process sequences)
- Explodes in size with high-dimensional inputs (images, text)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Simple Feedforward Network for Classification
class FeedforwardNN:
"""Classic MLP: Input → Hidden → Output"""
def __init__(self, input_size, hidden_size, output_size, lr=0.01):
# Initialize weights
self.W1 = np.random.randn(input_size, hidden_size) * 0.01
self.b1 = np.zeros((1, hidden_size))
self.W2 = np.random.randn(hidden_size, output_size) * 0.01
self.b2 = np.zeros((1, output_size))
self.lr = lr
def relu(self, z):
return np.maximum(0, z)
def softmax(self, z):
exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
return exp_z / np.sum(exp_z, axis=1, keepdims=True)
def forward(self, X):
self.z1 = np.dot(X, self.W1) + self.b1
self.a1 = self.relu(self.z1)
self.z2 = np.dot(self.a1, self.W2) + self.b2
self.a2 = self.softmax(self.z2)
return self.a2
def predict(self, X):
probs = self.forward(X)
return np.argmax(probs, axis=1)
# Example: Iris dataset (tabular data)
iris = load_iris()
X, y = iris.data, iris.target
# Preprocess
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Create network (untrained: the predictions below only demonstrate the
# forward pass, so the outputs are essentially arbitrary)
fnn = FeedforwardNN(input_size=4, hidden_size=10, output_size=3)
predictions = fnn.predict(X_test)
print("="*60)
print("FEEDFORWARD NN EXAMPLE: Iris Classification")
print("="*60)
print(f"Input features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print("\nSample predictions (untrained network, so outputs are arbitrary):")
for i in range(5):
print(f" Features: {X_test[i]} → Predicted: {iris.target_names[predictions[i]]}, "
f"True: {iris.target_names[y_test[i]]}")
print("\nFeedforward networks work great for structured, tabular data!")
print(" But for images or sequences, specialized architectures perform better.")
Convolutional Neural Networks (CNN)
Convolutional Neural Networks are designed for grid-like data, especially images. Instead of treating pixels as independent features, CNNs use convolutional filters that slide across the image, detecting local patterns like edges, textures, and shapes.
Why CNNs for Images?
Problem with Feedforward NNs for Images:
- A tiny 28×28 grayscale image = 784 input neurons
- A small 224×224 color image = 150,528 input neurons!
- First hidden layer with 1,000 neurons = 150 million weights!
- Spatial relationships destroyed (pixel at (10,10) unrelated to (10,11))
CNN Solution:
- Local connectivity: Each neuron only looks at small region (e.g., 3×3 pixels)
- Weight sharing: Same filter applied across entire image → far fewer parameters
- Spatial hierarchy: Early layers detect edges → middle layers detect shapes → deep layers detect objects
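The weight-count arithmetic above can be checked directly. The layer sizes below mirror the bullet points; the 64 filters of size 3×3 in the convolutional case are a typical illustrative choice, not a fixed rule:

```python
# Fully connected vs. convolutional parameter counts for a 224x224 RGB image.
h, w, c = 224, 224, 3
inputs = h * w * c                    # 150,528 input values
hidden = 1000
dense_weights = inputs * hidden       # every input connects to every neuron
conv_weights = 3 * 3 * c * 64         # 64 shared 3x3x3 filters

print(f"Dense layer weights: {dense_weights:,}")   # 150,528,000
print(f"Conv layer weights:  {conv_weights:,}")    # 1,728
```

Weight sharing buys roughly a five-orders-of-magnitude reduction in this example, which is why convolutions make image-scale inputs tractable.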
import numpy as np
import matplotlib.pyplot as plt
# Demonstrate convolution operation
def convolve2d(image, kernel):
"""
Apply 2D convolution: slide kernel over image.
This is the core operation in CNNs!
"""
i_height, i_width = image.shape
k_height, k_width = kernel.shape
# Output size (assuming no padding)
out_height = i_height - k_height + 1
out_width = i_width - k_width + 1
output = np.zeros((out_height, out_width))
# Slide kernel across image
for i in range(out_height):
for j in range(out_width):
# Extract region
region = image[i:i+k_height, j:j+k_width]
# Element-wise multiply and sum
output[i, j] = np.sum(region * kernel)
return output
# Create a simple image with an edge
image = np.zeros((10, 10))
image[:, 5:] = 1 # Vertical edge at column 5
# Edge detection kernels
vertical_edge_kernel = np.array([[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]]) # Sobel filter
horizontal_edge_kernel = np.array([[-1, -2, -1],
[ 0, 0, 0],
[ 1, 2, 1]])
# Apply convolutions
vertical_edges = convolve2d(image, vertical_edge_kernel)
horizontal_edges = convolve2d(image, horizontal_edge_kernel)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original Image\n(Vertical Edge)', fontsize=12, fontweight='bold')
axes[0].axis('off')
axes[1].imshow(vertical_edges, cmap='seismic')
axes[1].set_title('After Vertical Edge Filter\n(Strong Response!)', fontsize=12, fontweight='bold')
axes[1].axis('off')
axes[2].imshow(horizontal_edges, cmap='seismic')
axes[2].set_title('After Horizontal Edge Filter\n(Weak Response)', fontsize=12, fontweight='bold')
axes[2].axis('off')
plt.tight_layout()
plt.show()
print("="*60)
print("CNN CORE CONCEPT: Convolution")
print("="*60)
print("Original image shape:", image.shape)
print("Kernel shape:", vertical_edge_kernel.shape)
print("Output shape:", vertical_edges.shape)
print("\nThe kernel 'slides' across the image, detecting patterns!")
print(" Different kernels detect different features (edges, blobs, textures).")
print(" CNNs LEARN these kernels automatically during training!")
# We'll do a deep dive on CNNs in Section 8
Recurrent Neural Networks (RNN)
Recurrent Neural Networks have loops—they maintain hidden state that persists across time steps. This memory allows them to process sequences of any length, making them perfect for text, speech, and time series.
Why RNNs for Sequences?
Problem with Feedforward NNs for Sequences:
- Fixed input size (can't handle variable-length sequences)
- No memory of previous inputs
- Can't learn temporal dependencies
Example: Predicting next word in sentence
"The cat sat on the ___"
- Feedforward: Only sees "the" → can't predict sensibly
- RNN: Remembers entire context "The cat sat on the" → predicts "mat" or "floor"
RNN Solution:
- Hidden state: Acts as memory, updated at each time step
- Recurrent connection: Output feeds back into network
- Parameter sharing: Same weights used at every time step
import numpy as np
# Simple RNN cell implementation
class SimpleRNN:
"""Basic RNN: processes sequences one step at a time"""
def __init__(self, input_size, hidden_size, output_size):
# Weights for input → hidden
self.Wxh = np.random.randn(input_size, hidden_size) * 0.01
# Weights for hidden → hidden (recurrent!)
self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
# Weights for hidden → output
self.Why = np.random.randn(hidden_size, output_size) * 0.01
# Biases
self.bh = np.zeros((1, hidden_size))
self.by = np.zeros((1, output_size))
def forward(self, inputs):
"""
Process a sequence.
inputs: list of input vectors (one per time step)
Returns: list of outputs and final hidden state
"""
h = np.zeros((1, self.Whh.shape[0])) # Initial hidden state
outputs = []
for x in inputs:
# Update hidden state: combine current input with previous hidden state
h = np.tanh(np.dot(x, self.Wxh) + np.dot(h, self.Whh) + self.bh)
# Compute output
y = np.dot(h, self.Why) + self.by
outputs.append(y)
return outputs, h
# Example: Process a sequence
rnn = SimpleRNN(input_size=3, hidden_size=5, output_size=2)
# Sequence of 4 time steps
sequence = [
np.array([[1.0, 0.5, 0.2]]), # t=0
np.array([[0.8, 0.3, 0.1]]), # t=1
np.array([[0.6, 0.7, 0.4]]), # t=2
np.array([[0.3, 0.9, 0.6]]) # t=3
]
outputs, final_hidden = rnn.forward(sequence)
print("="*60)
print("RNN EXAMPLE: Processing a Sequence")
print("="*60)
print(f"Input sequence length: {len(sequence)} time steps")
print(f"Input size at each step: {sequence[0].shape}")
print(f"\nOutputs at each time step:")
for t, output in enumerate(outputs):
print(f" t={t}: {output[0]}")
print(f"\nFinal hidden state: {final_hidden[0]}")
print("\nRNN maintains 'memory' via hidden state!")
print(" Each time step updates the hidden state based on:")
print(" - Current input")
print(" - Previous hidden state (memory of past)")
# We'll do a deep dive on RNNs in Section 9
Autoencoders
Autoencoders are neural networks trained to reconstruct their input. They compress data into a lower-dimensional representation (encoding) and then reconstruct it (decoding). The compressed representation learns meaningful features.
import numpy as np
# Simple Autoencoder
class Autoencoder:
"""
Autoencoder: Input → Compress (Encoder) → Decompress (Decoder) → Output
Goal: Output ≈ Input (reconstruction)
"""
def __init__(self, input_size, encoding_size):
# Encoder: compress input to lower dimension
self.W_encoder = np.random.randn(input_size, encoding_size) * 0.01
self.b_encoder = np.zeros((1, encoding_size))
# Decoder: reconstruct from compressed representation
self.W_decoder = np.random.randn(encoding_size, input_size) * 0.01
self.b_decoder = np.zeros((1, input_size))
def encode(self, X):
"""Compress input to lower dimension"""
return np.tanh(np.dot(X, self.W_encoder) + self.b_encoder)
def decode(self, encoding):
"""Reconstruct from compressed representation"""
return np.dot(encoding, self.W_decoder) + self.b_decoder
def forward(self, X):
"""Full pass: encode then decode"""
encoding = self.encode(X)
reconstruction = self.decode(encoding)
return reconstruction, encoding
# Example: Compress 100D data to 10D
autoencoder = Autoencoder(input_size=100, encoding_size=10)
# Random input
X = np.random.randn(1, 100)
# Encode and reconstruct
reconstruction, encoding = autoencoder.forward(X)
print("="*60)
print("AUTOENCODER EXAMPLE: Dimensionality Reduction")
print("="*60)
print(f"Original input size: {X.shape}")
print(f"Compressed encoding size: {encoding.shape}")
print(f"Reconstructed output size: {reconstruction.shape}")
print(f"\nCompression ratio: {X.shape[1] / encoding.shape[1]:.1f}x")
print("\nAutoencoders learn to compress data efficiently!")
print(" Applications:")
print(" - Dimensionality reduction (like PCA but non-linear)")
print(" - Denoising (train to reconstruct clean data from noisy input)")
print(" - Anomaly detection (reconstruction error high for anomalies)")
print(" - Feature learning (encoding layer captures essence of data)")
# We'll do a deep dive on Autoencoders in Section 10
Generative Adversarial Networks (GANs)
GANs consist of two networks playing a game: a Generator creates fake data, while a Discriminator tries to distinguish fake from real. Through this adversarial training, the generator learns to create incredibly realistic data.
The GAN Game
Analogy: Art Forger vs Detective
Generator (Forger):
- Tries to create fake paintings that look real
- Starts terrible, improves over time
- Learns from detective's feedback
Discriminator (Detective):
- Examines paintings, labels "real" or "fake"
- Gets better at spotting fakes over time
- Forces forger to improve
End Result: Generator becomes so good that even the discriminator can't tell real from fake (50% accuracy = random guessing). At this point, you have a generator that creates realistic data!
import numpy as np
# Simplified GAN structure
class SimpleGAN:
"""
GAN: Two networks in competition
"""
def __init__(self, noise_size, data_size, hidden_size=32):
# Generator: noise → fake data
self.G_W1 = np.random.randn(noise_size, hidden_size) * 0.01
self.G_b1 = np.zeros((1, hidden_size))
self.G_W2 = np.random.randn(hidden_size, data_size) * 0.01
self.G_b2 = np.zeros((1, data_size))
# Discriminator: data → real/fake probability
self.D_W1 = np.random.randn(data_size, hidden_size) * 0.01
self.D_b1 = np.zeros((1, hidden_size))
self.D_W2 = np.random.randn(hidden_size, 1) * 0.01
self.D_b2 = np.zeros((1, 1))
def generator(self, noise):
"""Generate fake data from random noise"""
h = np.tanh(np.dot(noise, self.G_W1) + self.G_b1)
fake_data = np.dot(h, self.G_W2) + self.G_b2
return fake_data
def discriminator(self, data):
"""Predict if data is real (1) or fake (0)"""
h = np.tanh(np.dot(data, self.D_W1) + self.D_b1)
prob_real = 1 / (1 + np.exp(-np.dot(h, self.D_W2) - self.D_b2))
return prob_real
# Example usage
gan = SimpleGAN(noise_size=10, data_size=20)
# Generate fake data
noise = np.random.randn(5, 10) # 5 random noise vectors
fake_data = gan.generator(noise)
# Discriminator judges it
real_data = np.random.randn(5, 20) # Some "real" data
prob_real_is_real = gan.discriminator(real_data)
prob_fake_is_real = gan.discriminator(fake_data)
print("="*60)
print("GAN EXAMPLE: Generator vs Discriminator")
print("="*60)
print(f"Generated fake data shape: {fake_data.shape}")
print(f"\nDiscriminator scores (probability of being real):")
print(f" Real data: {prob_real_is_real.mean():.3f} (should be high)")
print(f" Fake data: {prob_fake_is_real.mean():.3f} (should be low)")
print("\nDuring training:")
print(" 1. Generator tries to maximize P(fake is classified as real)")
print(" 2. Discriminator tries to correctly classify real vs fake")
print(" 3. They improve together until equilibrium")
print("\n Result: Generator creates realistic data!")
# We'll do a deep dive on GANs in Section 11
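The training steps in the printout correspond to standard binary cross-entropy losses computed from each side of the game. The sketch below uses made-up discriminator scores (not a trained model) and the common non-saturating generator loss:

```python
import numpy as np

# Illustrative discriminator outputs in (0, 1); values are invented for the demo.
D_real = np.array([0.9, 0.8, 0.95])   # D's scores on real samples
D_fake = np.array([0.1, 0.2, 0.05])   # D's scores on generated samples

# Discriminator objective: push D(real) toward 1 and D(fake) toward 0
d_loss = -np.mean(np.log(D_real)) - np.mean(np.log(1 - D_fake))

# Generator objective (non-saturating form): push D(fake) toward 1
g_loss = -np.mean(np.log(D_fake))

print(f"Discriminator loss: {d_loss:.3f}")  # small here: D is currently winning
print(f"Generator loss:     {g_loss:.3f}")  # large here: G has room to improve
```

In real training the two losses are minimized in alternation, one gradient step for the discriminator, then one for the generator, until neither side can improve.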
Transformers
Transformers revolutionized NLP (and now computer vision) by replacing RNNs with attention mechanisms. Instead of processing sequences step-by-step, transformers look at all positions simultaneously and learn which parts to focus on.
Why Transformers Beat RNNs
RNN Limitations:
- Sequential processing: Must process word-by-word, can't parallelize
- Vanishing gradients: Struggles with long sequences (>100 tokens)
- No direct access: To relate word 1 to word 100, signal must pass through 99 hidden states
Transformer Advantages:
- Parallel processing: All positions processed simultaneously → much faster
- Direct connections: Any position can attend to any other position
- Scalable: Works on sequences of 1,000+ tokens (GPT, BERT)
- Attention visualization: Can see what the model focuses on
Famous Transformers: GPT-4, BERT, T5, Vision Transformer (ViT)
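The vanishing-gradient limitation listed above can be made concrete with a toy calculation: backpropagating through a tanh RNN multiplies the gradient by a Jacobian at every time step, and those factors compound. The dimensions and weight scale below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
Whh = rng.standard_normal((5, 5)) * 0.1   # small recurrent weight matrix
grad = np.eye(5)                          # start: gradient of h_T w.r.t. itself

for t in range(100):
    h_pre = rng.standard_normal(5)        # stand-in pre-activations at this step
    J = np.diag(1 - np.tanh(h_pre) ** 2) @ Whh   # one-step Jacobian: tanh' * Whh
    grad = J @ grad                       # chain rule through one more time step

# After 100 steps the gradient has shrunk geometrically toward zero
print(f"Gradient norm after 100 steps: {np.linalg.norm(grad):.2e}")
```

Each step contributes a factor with norm below one (tanh' is at most 1, and the recurrent weights here are small), so the signal from step 1 to step 100 is effectively erased, which is exactly what attention's direct connections avoid.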
import numpy as np
# Simplified Self-Attention (core of Transformers)
def scaled_dot_product_attention(Q, K, V):
"""
Self-Attention: Let each position attend to all other positions.
Q (Query): What am I looking for?
K (Key): What do I contain?
V (Value): What do I actually output?
Formula: Attention(Q,K,V) = softmax(QK^T / sqrt(d_k)) V
"""
d_k = Q.shape[-1] # Dimension of keys
# Compute attention scores (similarity between queries and keys)
scores = np.dot(Q, K.T) / np.sqrt(d_k)
# Softmax to get attention weights (sum to 1)
exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
# Weighted sum of values
output = np.dot(attention_weights, V)
return output, attention_weights
# Example: 4-word sentence, 8-dimensional embeddings
sentence = ["The", "cat", "sat", "down"]
seq_length = 4
d_model = 8
# Random embeddings for each word
embeddings = np.random.randn(seq_length, d_model)
# Self-attention: Q = K = V = embeddings (simplified)
Q = K = V = embeddings
# Apply attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print("="*60)
print("TRANSFORMER EXAMPLE: Self-Attention")
print("="*60)
print(f"Sentence: {' '.join(sentence)}")
print(f"Embedding dimension: {d_model}")
print(f"\nAttention weights (who attends to whom):")
print(" ", " ".join([f"{w:>5}" for w in sentence]))
for i, word in enumerate(sentence):
weights_str = " ".join([f"{w:5.2f}" for w in attention_weights[i]])
print(f"{word:>8}: [{weights_str}]")
print("\nNote: the embeddings here are random, so these weights are arbitrary.")
print(" With trained embeddings, attention reveals relationships, e.g.:")
print(" - 'sat' attending to 'cat' (verb-subject)")
print(" - Each word can directly access any other word!")
print("\n Transformers stack multiple attention layers to learn")
print(" increasingly complex relationships.")
# We'll do a deep dive on Transformers in Section 12
Choosing the Right Architecture
| Data Type | Best Architecture | Example Tasks |
|---|---|---|
| Tabular/Structured | Feedforward NN | Fraud detection, customer churn, scoring |
| Images | CNN | Object detection, image classification, segmentation |
| Text/NLP | Transformer | Translation, sentiment analysis, question answering |
| Time Series | RNN/LSTM or Transformer | Stock prediction, anomaly detection, forecasting |
| Speech/Audio | RNN or Transformer | Speech recognition, music generation |
| Data Generation | GAN or VAE | Image synthesis, data augmentation, style transfer |
| Compression | Autoencoder | Dimensionality reduction, denoising, anomaly detection |
Next sections: We'll do deep dives on CNN (Section 8), RNN (Section 9), Autoencoders (Section 10), GANs (Section 11), and Transformers (Section 12), building each from scratch with complete working code!
Convolutional Neural Networks (CNN) - Deep Dive
CNNs are the workhorses of computer vision, powering everything from facial recognition to autonomous vehicles. Let's build one from scratch and understand exactly how they work.
Understanding Convolution Operations
A convolution is a mathematical operation where a small matrix (the kernel or filter) slides across an image, computing element-wise products and summing them. Different kernels detect different features.
Symbolic Convolution Mathematics
import sympy as sp
from sympy import symbols, IndexedBase
print("="*60)
print("CONVOLUTION OPERATION - SYMBOLIC MATHEMATICS")
print("="*60)
# Define symbolic variables for convolution
i, j, m, n = symbols('i j m n', integer=True)
k_h, k_w = symbols('k_h k_w', integer=True, positive=True) # kernel height, width
# Input and kernel as indexed symbols
X = IndexedBase('X') # Input image
K = IndexedBase('K') # Kernel/filter
Y = IndexedBase('Y') # Output feature map
print("\n1. CONVOLUTION FORMULA (2D)")
print("-" * 60)
print("For each output position (i, j):")
print("")
print("Y[i,j] = Σ_{m=0..k_h-1} Σ_{n=0..k_w-1} X[i+m, j+n] × K[m, n]")
print("")
print("Where:")
print(" X[i,j] = input pixel at position (i,j)")
print(" K[m,n] = kernel weight at position (m,n)")
print(" Y[i,j] = output feature at position (i,j)")
# Create symbolic expression for 3x3 kernel
print("\n2. EXAMPLE: 3×3 KERNEL CONVOLUTION")
print("-" * 60)
# Define 3x3 kernel symbolically
K00, K01, K02 = symbols('K_{00} K_{01} K_{02}')
K10, K11, K12 = symbols('K_{10} K_{11} K_{12}')
K20, K21, K22 = symbols('K_{20} K_{21} K_{22}')
kernel_matrix = sp.Matrix([
[K00, K01, K02],
[K10, K11, K12],
[K20, K21, K22]
])
print("Kernel K:")
for row in range(3):
print(f" [{kernel_matrix[row,0]:<6} {kernel_matrix[row,1]:<6} {kernel_matrix[row,2]:<6}]")
# Input patch
X00, X01, X02 = symbols('X_{00} X_{01} X_{02}')
X10, X11, X12 = symbols('X_{10} X_{11} X_{12}')
X20, X21, X22 = symbols('X_{20} X_{21} X_{22}')
input_patch = sp.Matrix([
[X00, X01, X02],
[X10, X11, X12],
[X20, X21, X22]
])
print("\nInput patch X:")
for row in range(3):
print(f" [{input_patch[row,0]:<6} {input_patch[row,1]:<6} {input_patch[row,2]:<6}]")
# Element-wise multiplication and sum
conv_result = sum([kernel_matrix[i,j] * input_patch[i,j]
for i in range(3) for j in range(3)])
print("\nConvolution output Y[i,j]:")
print(f" {conv_result}")
print("\nExpanded:")
expanded = sp.expand(conv_result)
terms = str(expanded).split(' + ')
for idx, term in enumerate(terms[:6], 1): # Show first 6 terms
print(f" {term} +")
print(" ...")
# Numerical example: Edge detection
print("\n3. NUMERICAL EXAMPLE: VERTICAL EDGE DETECTION")
print("-" * 60)
# Sobel vertical edge detector
sobel_vertical = sp.Matrix([
[-1, 0, 1],
[-2, 0, 2],
[-1, 0, 1]
])
print("Sobel vertical kernel:")
for row in range(3):
print(f" [{sobel_vertical[row,0]:3} {sobel_vertical[row,1]:3} {sobel_vertical[row,2]:3}]")
# Test on simple edge pattern
test_patch = sp.Matrix([
[0, 0, 255], # Dark | Bright transition
[0, 0, 255],
[0, 0, 255]
])
print("\nTest input (vertical edge):")
for row in range(3):
print(f" [{test_patch[row,0]:3} {test_patch[row,1]:3} {test_patch[row,2]:3}]")
# Compute convolution
edge_response = sum([sobel_vertical[i,j] * test_patch[i,j]
for i in range(3) for j in range(3)])
print(f"\nEdge response: {edge_response}")
print("Interpretation: large positive response → strong vertical edge detected!")
# Stride and padding formulas
print("\n4. OUTPUT SIZE FORMULAS")
print("-" * 60)
H_in, W_in = symbols('H_{in} W_{in}', positive=True, integer=True)
K_h, K_w = symbols('K_h K_w', positive=True, integer=True)
S, P = symbols('S P', positive=True, integer=True)
print("Given:")
print(" H_in, W_in = input height, width")
print(" K_h, K_w = kernel height, width")
print(" S = stride")
print(" P = padding")
# Output height formula
H_out = (H_in + 2*P - K_h) / S + 1
print(f"\nOutput height: H_out = (H_in + 2P - K_h)/S + 1")
print(f" = {H_out}")
# Output width formula
W_out = (W_in + 2*P - K_w) / S + 1
print(f"\nOutput width: W_out = (W_in + 2P - K_w)/S + 1")
print(f" = {W_out}")
# Example calculation
vals = {H_in: 32, W_in: 32, K_h: 3, K_w: 3, S: 1, P: 1}
H_out_val = H_out.subs(vals)
W_out_val = W_out.subs(vals)
print(f"\nExample: 32×32 input, 3×3 kernel, stride=1, padding=1")
print(f" Output: {H_out_val}×{W_out_val}")
print("\n?? Key insights:")
print(" - Convolution = local weighted sum (dot product)")
print(" - Same kernel applied across entire image (parameter sharing)")
print(" - Output size controlled by stride and padding")
print(" - Padding='same' preserves spatial dimensions")
What is a Convolutional Filter?
Analogy: Detective's Magnifying Glass
- The filter is like a magnifying glass that examines small regions
- It slides across the entire image (left→right, top→bottom)
- At each position, it checks: "Does this region match the pattern I'm looking for?"
- Different filters look for different patterns (edges, corners, textures, shapes)
Key Parameters:
- Kernel size: How big is the filter? (e.g., 3×3, 5×5)
- Stride: How many pixels does the filter move each step? (stride=1 → move 1 pixel, stride=2 → move 2 pixels)
- Padding: How many zeros to add around the image border? (padding can keep the output the same size as the input)
import numpy as np
import matplotlib.pyplot as plt

def convolve2d_with_stride_padding(image, kernel, stride=1, padding=0):
    """
    Full convolution implementation with stride and padding.
    Parameters:
    - image: Input image (H x W)
    - kernel: Filter to apply (kH x kW)
    - stride: Step size when sliding kernel
    - padding: Zeros to add around border
    """
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)
    i_height, i_width = image.shape
    k_height, k_width = kernel.shape
    # Calculate output dimensions
    out_height = (i_height - k_height) // stride + 1
    out_width = (i_width - k_width) // stride + 1
    output = np.zeros((out_height, out_width))
    # Slide kernel with stride
    for i in range(out_height):
        for j in range(out_width):
            # Extract region
            i_start = i * stride
            j_start = j * stride
            region = image[i_start:i_start+k_height, j_start:j_start+k_width]
            # Convolution: element-wise multiply and sum
            output[i, j] = np.sum(region * kernel)
    return output

# Create a test image with various features
image = np.zeros((20, 20))
# Vertical line
image[:, 10] = 1
# Horizontal line
image[5, :] = 1
# Diagonal line
for i in range(15):
    image[i, i] = 1

# Different edge detection kernels
kernels = {
    'Vertical Edge (Sobel)': np.array([[-1, 0, 1],
                                       [-2, 0, 2],
                                       [-1, 0, 1]]),
    'Horizontal Edge (Sobel)': np.array([[-1, -2, -1],
                                         [ 0,  0,  0],
                                         [ 1,  2,  1]]),
    'Diagonal Edge': np.array([[ 0,  1, 2],
                               [-1,  0, 1],
                               [-2, -1, 0]]),
    'Sharpen': np.array([[ 0, -1,  0],
                         [-1,  5, -1],
                         [ 0, -1,  0]]),
    'Blur (Box)': np.ones((3, 3)) / 9
}

# Apply all kernels
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()
# Original image
axes[0].imshow(image, cmap='gray')
axes[0].set_title('Original Image', fontsize=12, fontweight='bold')
axes[0].axis('off')
# Apply each kernel
for idx, (name, kernel) in enumerate(kernels.items(), 1):
    filtered = convolve2d_with_stride_padding(image, kernel, stride=1, padding=0)
    axes[idx].imshow(filtered, cmap='seismic')
    axes[idx].set_title(f'{name}\nOutput: {filtered.shape}', fontsize=10, fontweight='bold')
    axes[idx].axis('off')
plt.tight_layout()
plt.show()

print("="*60)
print("CONVOLUTION: Different Kernels, Different Features")
print("="*60)
print(f"Original image: {image.shape}")
print("Kernel size: 3×3")
print(f"Output size: {filtered.shape} (shrinks without padding)")
print("\nEach kernel detects different features:")
print("  - Vertical Sobel: Strong response to vertical edges")
print("  - Horizontal Sobel: Strong response to horizontal edges")
print("  - Diagonal: Detects diagonal lines")
print("  - Sharpen: Enhances edges (center weight > 0)")
print("  - Blur: Smooths image (all positive weights)")
print("\nCNNs LEARN these kernel weights during training!")
import numpy as np
import matplotlib.pyplot as plt

# Demonstrate stride and padding effects
def show_stride_padding_effects():
    """Visualize how stride and padding change output size"""
    # Simple 6x6 image
    image = np.random.rand(6, 6)
    kernel = np.ones((3, 3)) / 9  # 3x3 averaging filter
    configs = [
        {'stride': 1, 'padding': 0, 'name': 'Stride=1, No Padding'},
        {'stride': 2, 'padding': 0, 'name': 'Stride=2, No Padding'},
        {'stride': 1, 'padding': 1, 'name': 'Stride=1, Padding=1'},
        {'stride': 2, 'padding': 1, 'name': 'Stride=2, Padding=1'}
    ]
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    # Original
    axes[0].imshow(image, cmap='viridis', interpolation='nearest')
    axes[0].set_title(f'Original Image\n{image.shape}', fontsize=11, fontweight='bold')
    axes[0].grid(True, color='white', linewidth=1)
    axes[0].set_xticks(np.arange(-0.5, 6, 1))
    axes[0].set_yticks(np.arange(-0.5, 6, 1))
    axes[0].set_xticklabels([])
    axes[0].set_yticklabels([])
    # Apply convolutions with different configs
    for idx, config in enumerate(configs, 1):
        output = convolve2d_with_stride_padding(
            image, kernel,
            stride=config['stride'],
            padding=config['padding']
        )
        axes[idx].imshow(output, cmap='viridis', interpolation='nearest')
        axes[idx].set_title(f"{config['name']}\nOutput: {output.shape}",
                            fontsize=10, fontweight='bold')
        axes[idx].grid(True, color='white', linewidth=1)
        axes[idx].set_xticks(np.arange(-0.5, output.shape[1], 1))
        axes[idx].set_yticks(np.arange(-0.5, output.shape[0], 1))
        axes[idx].set_xticklabels([])
        axes[idx].set_yticklabels([])
    # Hide last subplot
    axes[5].axis('off')
    plt.tight_layout()
    plt.show()

print("="*60)
print("STRIDE & PADDING: Impact on Output Size")
print("="*60)
print("Formula: output_size = (input_size - kernel_size + 2*padding) / stride + 1")
print("\nExamples (input=6, kernel=3):")
print("  stride=1, padding=0 → (6-3+0)/1+1 = 4")
print("  stride=2, padding=0 → (6-3+0)/2+1 = 2.5 → 2 (floor)")
print("  stride=1, padding=1 → (6-3+2)/1+1 = 6 (same size!)")
print("  stride=2, padding=1 → (6-3+2)/2+1 = 3.5 → 3")
print("\nCommon practices:")
print("  - stride=1, padding=1: Keep spatial dimensions (feature extraction)")
print("  - stride=2, padding=0: Reduce dimensions (downsample)")

show_stride_padding_effects()
Pooling Layers
Pooling reduces spatial dimensions by summarizing regions. It makes the network more robust to small translations and reduces computation.
Why Pooling?
Three Benefits:
- Translation invariance: Cat slightly shifted in image → still detected
- Dimensionality reduction: 100×100 → 50×50 with 2×2 pooling
- Computational efficiency: Fewer parameters, faster training
Common Pooling Operations:
- Max Pooling: Take maximum value in region (most common)
- Average Pooling: Take average value in region
- Global Pooling: Reduce entire feature map to single value
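The translation-invariance claim can be checked directly: shift a bright spot by one pixel and a 2×2 max-pool often produces the same summary. A minimal sketch with illustrative arrays (not from the tutorial's dataset):

```python
import numpy as np

def max_pool_2x2(img):
    """2x2 max pooling with stride 2 (image sides must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A bright spot at (0, 0) ...
a = np.zeros((4, 4)); a[0, 0] = 1.0
# ... and the same spot shifted one pixel to (1, 1)
b = np.zeros((4, 4)); b[1, 1] = 1.0

# Both land in the same 2x2 window, so the pooled outputs are identical
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```

Note the invariance is only partial: a shift that crosses a pooling-window boundary does change the output.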
import numpy as np
import matplotlib.pyplot as plt

def max_pooling(image, pool_size=2, stride=None):
    """
    Max pooling: Take maximum value in each region.
    Typical: pool_size=2, stride=2 → reduce dimensions by half
    """
    if stride is None:
        stride = pool_size
    i_height, i_width = image.shape
    out_height = (i_height - pool_size) // stride + 1
    out_width = (i_width - pool_size) // stride + 1
    output = np.zeros((out_height, out_width))
    for i in range(out_height):
        for j in range(out_width):
            i_start = i * stride
            j_start = j * stride
            region = image[i_start:i_start+pool_size, j_start:j_start+pool_size]
            output[i, j] = np.max(region)  # Max pooling
    return output

def avg_pooling(image, pool_size=2, stride=None):
    """Average pooling: Take average value in each region."""
    if stride is None:
        stride = pool_size
    i_height, i_width = image.shape
    out_height = (i_height - pool_size) // stride + 1
    out_width = (i_width - pool_size) // stride + 1
    output = np.zeros((out_height, out_width))
    for i in range(out_height):
        for j in range(out_width):
            i_start = i * stride
            j_start = j * stride
            region = image[i_start:i_start+pool_size, j_start:j_start+pool_size]
            output[i, j] = np.mean(region)  # Average pooling
    return output

# Create test image with distinct features
image = np.array([
    [1, 3, 2, 4, 1, 2],
    [5, 6, 1, 8, 3, 1],
    [2, 1, 7, 3, 9, 2],
    [4, 3, 2, 1, 4, 5],
    [1, 9, 3, 6, 2, 1],
    [3, 2, 4, 1, 7, 3]
], dtype=float)

# Apply pooling
max_pooled = max_pooling(image, pool_size=2, stride=2)
avg_pooled = avg_pooling(image, pool_size=2, stride=2)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
im0 = axes[0].imshow(image, cmap='viridis', interpolation='nearest')
axes[0].set_title(f'Original Image\n{image.shape}', fontsize=12, fontweight='bold')
axes[0].grid(True, color='white', linewidth=2)
plt.colorbar(im0, ax=axes[0], fraction=0.046)
im1 = axes[1].imshow(max_pooled, cmap='viridis', interpolation='nearest')
axes[1].set_title(f'Max Pooling (2×2)\n{max_pooled.shape}', fontsize=12, fontweight='bold')
axes[1].grid(True, color='white', linewidth=2)
plt.colorbar(im1, ax=axes[1], fraction=0.046)
im2 = axes[2].imshow(avg_pooled, cmap='viridis', interpolation='nearest')
axes[2].set_title(f'Average Pooling (2×2)\n{avg_pooled.shape}', fontsize=12, fontweight='bold')
axes[2].grid(True, color='white', linewidth=2)
plt.colorbar(im2, ax=axes[2], fraction=0.046)
for ax in axes:
    ax.set_xticks(np.arange(-0.5, ax.images[0].get_array().shape[1], 1))
    ax.set_yticks(np.arange(-0.5, ax.images[0].get_array().shape[0], 1))
    ax.set_xticklabels([])
    ax.set_yticklabels([])
plt.tight_layout()
plt.show()

print("="*60)
print("POOLING: Downsampling Feature Maps")
print("="*60)
print(f"Original: {image.shape}")
print(f"After 2×2 pooling: {max_pooled.shape}")
print(f"\nDimension reduction: {image.size / max_pooled.size:.1f}x")
print("\nMax pooled output:")
print(max_pooled)
print("\nAverage pooled output:")
print(avg_pooled)
print("\nMax pooling preserves strongest features (most common)")
print("Average pooling preserves overall brightness")
print("Both reduce spatial dimensions by 4x with 2×2 pooling")
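The code above covers max and average pooling; global pooling (the third operation listed earlier) collapses each entire feature map to one number and is commonly used just before the final classifier. A minimal sketch:

```python
import numpy as np

def global_avg_pooling(feature_maps):
    """Collapse each (H, W) feature map to a single scalar.

    feature_maps: (channels, H, W) array
    Returns: (channels,) vector, one value per feature map.
    """
    return feature_maps.mean(axis=(1, 2))

# Example: 3 feature maps of size 4x4 with known values
maps = np.arange(48, dtype=float).reshape(3, 4, 4)
pooled = global_avg_pooling(maps)
print(pooled)  # [ 7.5 23.5 39.5]
```

Because the output shape no longer depends on H and W, global pooling lets the same classifier head handle variable-sized inputs.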
Building a CNN from Scratch
Now let's build a complete CNN with multiple convolutional layers, pooling, and fully connected layers. We'll implement forward and backward passes.
import numpy as np

class ConvLayer:
    """Convolutional layer with multiple filters"""
    def __init__(self, num_filters, filter_size, input_channels):
        """
        Initialize convolutional layer.
        Parameters:
        - num_filters: Number of filters to learn
        - filter_size: Size of each filter (e.g., 3 for 3×3)
        - input_channels: Depth of input (1 for grayscale, 3 for RGB)
        """
        self.num_filters = num_filters
        self.filter_size = filter_size
        # Initialize filters with He initialization (scaled by fan-in)
        scale = np.sqrt(2.0 / (filter_size * filter_size * input_channels))
        self.filters = np.random.randn(num_filters, input_channels,
                                       filter_size, filter_size) * scale
        self.biases = np.zeros(num_filters)

    def forward(self, input_data):
        """
        Forward pass: Apply all filters to input.
        input_data: (batch_size, channels, height, width)
        Returns: (batch_size, num_filters, out_height, out_width)
        """
        self.last_input = input_data
        batch_size, in_channels, in_height, in_width = input_data.shape
        # Calculate output dimensions (assuming stride=1, padding=0)
        out_height = in_height - self.filter_size + 1
        out_width = in_width - self.filter_size + 1
        # Initialize output
        output = np.zeros((batch_size, self.num_filters, out_height, out_width))
        # Apply each filter
        for b in range(batch_size):
            for f in range(self.num_filters):
                for i in range(out_height):
                    for j in range(out_width):
                        # Extract region
                        region = input_data[b, :, i:i+self.filter_size, j:j+self.filter_size]
                        # Convolution: element-wise multiply and sum across all channels
                        output[b, f, i, j] = np.sum(region * self.filters[f]) + self.biases[f]
        return output

    def backward(self, grad_output, learning_rate):
        """
        Backward pass: Compute gradients and update filters.
        grad_output: Gradient from next layer
        Returns: Gradient to pass to previous layer
        """
        batch_size, _, out_height, out_width = grad_output.shape
        _, in_channels, in_height, in_width = self.last_input.shape
        # Initialize gradients
        grad_filters = np.zeros_like(self.filters)
        grad_biases = np.zeros_like(self.biases)
        grad_input = np.zeros_like(self.last_input)
        # Compute gradients (simplified - full implementation is more complex)
        for b in range(batch_size):
            for f in range(self.num_filters):
                for i in range(out_height):
                    for j in range(out_width):
                        # Extract region
                        region = self.last_input[b, :, i:i+self.filter_size, j:j+self.filter_size]
                        # Gradient for this filter
                        grad_filters[f] += grad_output[b, f, i, j] * region
                        grad_biases[f] += grad_output[b, f, i, j]
                        # Gradient for input
                        grad_input[b, :, i:i+self.filter_size, j:j+self.filter_size] += \
                            grad_output[b, f, i, j] * self.filters[f]
        # Average over batch
        grad_filters /= batch_size
        grad_biases /= batch_size
        # Update parameters
        self.filters -= learning_rate * grad_filters
        self.biases -= learning_rate * grad_biases
        return grad_input

class MaxPoolLayer:
    """Max pooling layer"""
    def __init__(self, pool_size=2):
        self.pool_size = pool_size

    def forward(self, input_data):
        """
        Forward pass: Max pooling.
        input_data: (batch_size, channels, height, width)
        Returns: (batch_size, channels, height//pool_size, width//pool_size)
        """
        self.last_input = input_data
        batch_size, channels, in_height, in_width = input_data.shape
        out_height = in_height // self.pool_size
        out_width = in_width // self.pool_size
        output = np.zeros((batch_size, channels, out_height, out_width))
        # Store argmax for backward pass
        self.max_indices = np.zeros_like(output, dtype=int)
        for b in range(batch_size):
            for c in range(channels):
                for i in range(out_height):
                    for j in range(out_width):
                        i_start = i * self.pool_size
                        j_start = j * self.pool_size
                        region = input_data[b, c, i_start:i_start+self.pool_size,
                                            j_start:j_start+self.pool_size]
                        output[b, c, i, j] = np.max(region)
                        self.max_indices[b, c, i, j] = np.argmax(region)
        return output

# Test the layers
print("="*60)
print("CNN LAYERS: Convolution + Max Pooling")
print("="*60)
# Create sample input (1 image, 1 channel, 8×8)
sample_input = np.random.randn(1, 1, 8, 8)
# Convolutional layer: 3 filters of size 3×3
conv_layer = ConvLayer(num_filters=3, filter_size=3, input_channels=1)
conv_output = conv_layer.forward(sample_input)
print(f"Input shape: {sample_input.shape} (batch, channels, height, width)")
print(f"After Conv (3 filters, 3×3): {conv_output.shape}")
# Max pooling layer
pool_layer = MaxPoolLayer(pool_size=2)
pool_output = pool_layer.forward(conv_output)
print(f"After Max Pooling (2×2): {pool_output.shape}")
print("\nTypical CNN architecture:")
print("  Input → [Conv → ReLU → Pool] × N → Flatten → Dense → Output")
print("  - Conv: Extract features")
print("  - ReLU: Non-linearity")
print("  - Pool: Reduce dimensions")
print("  - Repeat N times for deeper features")
print("  - Flatten: Convert to vector")
print("  - Dense: Final classification")
Training on Real Image Data
Let's build a complete CNN and train it on a simple image classification task: distinguishing between simple geometric shapes.
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic dataset: circles vs squares
def generate_shape_dataset(num_samples=200, img_size=16):
    """
    Generate simple geometric shapes for classification.
    Returns:
    - X: Images (num_samples, 1, img_size, img_size)
    - y: Labels (num_samples,) - 0 for circle, 1 for square
    """
    X = []
    y = []
    for _ in range(num_samples // 2):
        # Generate circle
        img = np.zeros((img_size, img_size))
        center = img_size // 2
        radius = np.random.randint(3, img_size // 3)
        for i in range(img_size):
            for j in range(img_size):
                if (i - center)**2 + (j - center)**2 <= radius**2:
                    img[i, j] = 1
        # Add noise
        img += np.random.randn(img_size, img_size) * 0.1
        X.append(img)
        y.append(0)  # Circle

        # Generate square
        img = np.zeros((img_size, img_size))
        size = np.random.randint(6, img_size // 2)
        top_left = np.random.randint(2, img_size - size - 2)
        img[top_left:top_left+size, top_left:top_left+size] = 1
        # Add noise
        img += np.random.randn(img_size, img_size) * 0.1
        X.append(img)
        y.append(1)  # Square

    # Convert to numpy arrays
    X = np.array(X)[:, np.newaxis, :, :]  # Add channel dimension
    y = np.array(y)
    # Shuffle
    indices = np.random.permutation(len(y))
    X, y = X[indices], y[indices]
    return X, y

# Generate dataset
X_train, y_train = generate_shape_dataset(num_samples=160, img_size=16)
X_test, y_test = generate_shape_dataset(num_samples=40, img_size=16)

# Visualize samples
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i in range(5):
    # Circles
    axes[0, i].imshow(X_train[y_train == 0][i, 0], cmap='gray')
    axes[0, i].set_title('Circle', fontsize=11, fontweight='bold')
    axes[0, i].axis('off')
    # Squares
    axes[1, i].imshow(X_train[y_train == 1][i, 0], cmap='gray')
    axes[1, i].set_title('Square', fontsize=11, fontweight='bold')
    axes[1, i].axis('off')
plt.tight_layout()
plt.show()

print("="*60)
print("DATASET: Circles vs Squares")
print("="*60)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Image shape: {X_train.shape[1:]}")
print("Classes: 0=Circle, 1=Square")
import numpy as np
import matplotlib.pyplot as plt

# Simple CNN for binary classification
class SimpleCNN:
    """
    Complete CNN: Conv → ReLU → Pool → Flatten → Dense → Sigmoid
    """
    def __init__(self, img_size=16, num_filters=8, filter_size=3):
        self.img_size = img_size
        self.num_filters = num_filters
        self.filter_size = filter_size
        # Conv layer filters
        self.filters = np.random.randn(num_filters, 1, filter_size, filter_size) * 0.1
        self.conv_bias = np.zeros(num_filters)
        # Calculate dimensions after conv and pool
        conv_out_size = img_size - filter_size + 1  # 16 - 3 + 1 = 14
        pool_out_size = conv_out_size // 2          # 14 // 2 = 7
        flatten_size = num_filters * pool_out_size * pool_out_size  # 8 * 7 * 7 = 392
        # Fully connected layer
        self.fc_weights = np.random.randn(flatten_size, 1) * 0.01
        self.fc_bias = np.zeros(1)
        print("CNN Architecture:")
        print(f"  Input: (1, {img_size}, {img_size})")
        print(f"  Conv: {num_filters} filters of {filter_size}×{filter_size} → ({num_filters}, {conv_out_size}, {conv_out_size})")
        print(f"  Pool: 2×2 max pooling → ({num_filters}, {pool_out_size}, {pool_out_size})")
        print(f"  Flatten: → ({flatten_size},)")
        print("  Dense: → (1,)")
        print(f"  Total parameters: {self.filters.size + num_filters + self.fc_weights.size + 1}")

    def relu(self, x):
        return np.maximum(0, x)

    def relu_derivative(self, x):
        return (x > 0).astype(float)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        """Forward pass through entire network"""
        batch_size = X.shape[0]
        # 1. Convolution
        conv_out_size = self.img_size - self.filter_size + 1
        self.conv_out = np.zeros((batch_size, self.num_filters, conv_out_size, conv_out_size))
        for b in range(batch_size):
            for f in range(self.num_filters):
                for i in range(conv_out_size):
                    for j in range(conv_out_size):
                        region = X[b, :, i:i+self.filter_size, j:j+self.filter_size]
                        self.conv_out[b, f, i, j] = np.sum(region * self.filters[f]) + self.conv_bias[f]
        # 2. ReLU
        self.relu_out = self.relu(self.conv_out)
        # 3. Max Pooling (2×2)
        pool_out_size = conv_out_size // 2
        self.pool_out = np.zeros((batch_size, self.num_filters, pool_out_size, pool_out_size))
        for b in range(batch_size):
            for f in range(self.num_filters):
                for i in range(pool_out_size):
                    for j in range(pool_out_size):
                        region = self.relu_out[b, f, i*2:i*2+2, j*2:j*2+2]
                        self.pool_out[b, f, i, j] = np.max(region)
        # 4. Flatten
        self.flatten = self.pool_out.reshape(batch_size, -1)
        # 5. Fully connected + Sigmoid
        self.fc_out = np.dot(self.flatten, self.fc_weights) + self.fc_bias
        output = self.sigmoid(self.fc_out)
        return output

    def predict(self, X):
        """Predict class (0 or 1)"""
        probs = self.forward(X)
        return (probs > 0.5).astype(int)

    def compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss"""
        epsilon = 1e-7
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    def train_step(self, X, y, learning_rate=0.01):
        """
        Single training step with backpropagation.
        Simplified for demonstration: only the fully connected layer
        is updated; the conv filters stay fixed (random features).
        """
        batch_size = X.shape[0]
        # Forward pass
        output = self.forward(X)
        # Compute loss
        loss = self.compute_loss(y.reshape(-1, 1), output)
        # Backward pass (simplified)
        # Gradient of loss w.r.t. pre-sigmoid output (logit)
        grad_output = (output - y.reshape(-1, 1)) / batch_size
        # Gradient through FC layer
        grad_fc_weights = np.dot(self.flatten.T, grad_output)
        grad_fc_bias = np.sum(grad_output, axis=0)
        # Update FC layer
        self.fc_weights -= learning_rate * grad_fc_weights
        self.fc_bias -= learning_rate * grad_fc_bias
        return loss

# Create and train CNN
cnn = SimpleCNN(img_size=16, num_filters=8, filter_size=3)

# Training loop
epochs = 50
losses = []
accuracies = []
print("\n" + "="*60)
print("TRAINING CNN")
print("="*60)
for epoch in range(epochs):
    # Train on batches
    batch_size = 16
    epoch_losses = []
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train[i:i+batch_size]
        y_batch = y_train[i:i+batch_size]
        loss = cnn.train_step(X_batch, y_batch, learning_rate=0.05)
        epoch_losses.append(loss)
    avg_loss = np.mean(epoch_losses)
    losses.append(avg_loss)
    # Evaluate on test set
    test_preds = cnn.predict(X_test)
    accuracy = np.mean(test_preds.flatten() == y_test)
    accuracies.append(accuracy)
    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1}/{epochs} - Loss: {avg_loss:.4f}, Test Accuracy: {accuracy:.4f}")

# Plot training progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.plot(losses, linewidth=2, color='#BF092F')
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('Training Loss', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax2.plot(accuracies, linewidth=2, color='#3B9797')
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy', fontsize=12)
ax2.set_title('Test Accuracy', fontsize=14, fontweight='bold')
ax2.axhline(y=0.5, color='gray', linestyle='--', label='Random Guess')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nFinal Test Accuracy: {accuracies[-1]:.2%}")
print("\nThe CNN learned to distinguish circles from squares!")
import numpy as np
import matplotlib.pyplot as plt

# Visualize what the CNN learned
def visualize_learned_filters(cnn):
    """Show what features the CNN filters detect"""
    fig, axes = plt.subplots(2, 4, figsize=(15, 7))
    axes = axes.flatten()
    for i in range(min(8, cnn.num_filters)):
        # Get filter weights
        filter_img = cnn.filters[i, 0]  # First channel
        # Normalize for visualization
        filter_img = (filter_img - filter_img.min()) / (filter_img.max() - filter_img.min())
        axes[i].imshow(filter_img, cmap='seismic', interpolation='nearest')
        axes[i].set_title(f'Filter {i+1}', fontsize=11, fontweight='bold')
        axes[i].axis('off')
    plt.suptitle('Learned Convolutional Filters (What CNN Looks For)',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize predictions
def visualize_predictions(cnn, X_test, y_test, num_samples=8):
    """Show CNN predictions on test images"""
    predictions = cnn.predict(X_test)
    probs = cnn.forward(X_test)
    fig, axes = plt.subplots(2, 4, figsize=(15, 7))
    axes = axes.flatten()
    for i in range(num_samples):
        axes[i].imshow(X_test[i, 0], cmap='gray')
        true_label = 'Circle' if y_test[i] == 0 else 'Square'
        pred_label = 'Circle' if predictions[i] == 0 else 'Square'
        confidence = probs[i, 0] if predictions[i] == 1 else 1 - probs[i, 0]
        color = 'green' if predictions[i] == y_test[i] else 'red'
        axes[i].set_title(f'True: {true_label}\nPred: {pred_label} ({confidence:.2%})',
                          fontsize=10, fontweight='bold', color=color)
        axes[i].axis('off')
    plt.suptitle('CNN Predictions on Test Images', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Visualize learned filters
visualize_learned_filters(cnn)
# Visualize predictions
visualize_predictions(cnn, X_test, y_test, num_samples=8)

print("="*60)
print("CNN VISUALIZATION")
print("="*60)
print("Learned Filters:")
print("  - Each filter learned to detect specific patterns")
print("  - Early filters: edges, corners, basic shapes")
print("  - These combine to distinguish circles from squares")
print("\nPredictions:")
print("  - Green: Correct prediction")
print("  - Red: Incorrect prediction")
print(f"  - Overall accuracy: {np.mean(cnn.predict(X_test).flatten() == y_test):.2%}")
print("\nNext steps for better CNNs:")
print("  1. More conv layers (deeper = more abstract features)")
print("  2. Batch normalization (faster, more stable training)")
print("  3. Dropout (prevent overfitting)")
print("  4. Data augmentation (rotations, flips, crops)")
print("  5. Transfer learning (use pre-trained networks)")
CNN Deep Dive Summary
What We Built:
- ✓ Complete convolution operation with stride and padding
- ✓ Max and average pooling layers
- ✓ Full CNN architecture from scratch (Conv → ReLU → Pool → Dense)
- ✓ Training loop with backpropagation
- ✓ Real classification task (circles vs squares)
- ✓ Visualization of learned filters and predictions
Key Insights:
- Convolution extracts local features using sliding filters
- Pooling reduces dimensions and adds translation invariance
- Multiple layers build hierarchical representations (edges → shapes → objects)
- Weight sharing makes CNNs parameter-efficient for images
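The parameter-efficiency point is easy to quantify: compare one conv layer against a dense layer producing the same number of outputs. The sizes below are illustrative choices (a 32×32 RGB input with 16 filters), not the tutorial's dataset:

```python
# Parameter count: conv layer vs dense layer on a 32x32 RGB input
# (assuming 'same' padding so spatial dimensions are preserved)
H, W, C = 32, 32, 3        # input height, width, channels
num_filters, k = 16, 3     # 16 filters of size 3x3

# Conv: each filter has k*k*C weights + 1 bias, reused at every position
conv_params = num_filters * (k * k * C + 1)

# Dense: every input value connects to every output unit
out_units = num_filters * H * W
dense_params = (H * W * C) * out_units + out_units

print(conv_params)                  # 448
print(dense_params)                 # 50348032
print(dense_params // conv_params)  # 112384
```

A factor of over 100,000 fewer parameters for the same output volume is why convolution, not dense connectivity, is the default for images.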
Next: We'll explore Recurrent Neural Networks (RNNs) for sequential data like text and time series!
Recurrent Neural Networks (RNN) - Deep Dive
RNNs are designed for sequences: text, speech, time series, video. Unlike feedforward networks, RNNs have memory—they maintain hidden state that persists across time steps, allowing them to capture temporal dependencies.
RNN Architecture and Memory
Why RNNs Need Memory
Problem: Context Matters in Sequences
- "The clouds are in the ___" → "sky"
- "I grew up in France. I speak fluent ___" → "French"
- Stock price at t=10 depends on prices at t=0 through t=9
Feedforward networks can't handle this because:
- Fixed input size (can't process variable-length sequences)
- No memory of previous inputs
- Each prediction is independent
RNN Solution: Hidden State
- Hidden state h_t acts as memory, storing information from previous time steps
- Updated at each step: h_t = f(h_{t-1}, x_t)
- Same weights used at every time step (parameter sharing)
import numpy as np

# Visualize RNN unrolling through time
def visualize_rnn_unrolling():
    """
    RNNs process sequences one step at a time.
    The same network is 'unrolled' across time steps.
    """
    # Sequence: "hello"
    sequence = ['h', 'e', 'l', 'l', 'o']
    print("="*60)
    print("RNN: UNROLLING THROUGH TIME")
    print("="*60)
    print(f"Input sequence: {sequence}")
    print(f"Sequence length: {len(sequence)}")
    print("\nAt each time step:")
    print("  Current input: x_t")
    print("  Previous hidden state: h_{t-1} (memory)")
    print("  Compute: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)")
    print("  Output: y_t = W_hy @ h_t + b_y")
    print("\nKey insight:")
    print("  - Same weights (W_xh, W_hh, W_hy) used at EVERY time step")
    print("  - Hidden state h_t carries information from all previous steps")
    print("  - This allows the network to 'remember' context")

    # Simulate simple RNN forward pass
    vocab = ['h', 'e', 'l', 'o']
    char_to_idx = {ch: i for i, ch in enumerate(vocab)}
    hidden_size = 3
    vocab_size = len(vocab)
    # Initialize weights (small random values)
    W_xh = np.random.randn(vocab_size, hidden_size) * 0.01
    W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
    W_hy = np.random.randn(hidden_size, vocab_size) * 0.01
    b_h = np.zeros((1, hidden_size))
    b_y = np.zeros((1, vocab_size))

    # Process sequence
    h = np.zeros((1, hidden_size))  # Initial hidden state
    print("\n" + "-"*60)
    print("FORWARD PASS THROUGH SEQUENCE")
    print("-"*60)
    for t, char in enumerate(sequence):
        # One-hot encode character
        x = np.zeros((1, vocab_size))
        if char in char_to_idx:
            x[0, char_to_idx[char]] = 1
        # Update hidden state
        h = np.tanh(np.dot(x, W_xh) + np.dot(h, W_hh) + b_h)
        # Compute output
        y = np.dot(h, W_hy) + b_y
        print(f"t={t}, input='{char}', hidden_state={h[0]}, output={y[0]}")

    print("\nNotice how the hidden state changes with each input!")
    print("It accumulates information from the entire sequence.")

visualize_rnn_unrolling()
Building RNN from Scratch
Let's implement a complete RNN with forward and backward passes. We'll build a character-level language model that learns to predict the next character in a sequence.
import numpy as np
class CharRNN:
"""
Character-level RNN for sequence prediction.
Given a sequence of characters, predicts the next character.
Example: "hell" ? predict "o" in "hello"
"""
def __init__(self, vocab_size, hidden_size, seq_length, learning_rate=0.01):
self.vocab_size = vocab_size # Number of unique characters
self.hidden_size = hidden_size # Size of hidden state
self.seq_length = seq_length # Length of sequences to process
self.learning_rate = learning_rate
# Initialize weights with Xavier initialization
self.W_xh = np.random.randn(vocab_size, hidden_size) * np.sqrt(2.0 / vocab_size)
self.W_hh = np.random.randn(hidden_size, hidden_size) * np.sqrt(2.0 / hidden_size)
self.W_hy = np.random.randn(hidden_size, vocab_size) * np.sqrt(2.0 / hidden_size)
# Biases
self.b_h = np.zeros((1, hidden_size))
self.b_y = np.zeros((1, vocab_size))
# For AdaGrad (adaptive learning rates)
self.memory_W_xh = np.zeros_like(self.W_xh)
self.memory_W_hh = np.zeros_like(self.W_hh)
self.memory_W_hy = np.zeros_like(self.W_hy)
self.memory_b_h = np.zeros_like(self.b_h)
self.memory_b_y = np.zeros_like(self.b_y)
def forward(self, inputs, h_prev):
"""
Forward pass through time.
inputs: List of input vectors (one-hot encoded characters)
h_prev: Previous hidden state
Returns: outputs, hidden states
"""
xs, hs, ys, ps = {}, {}, {}, {}
hs[-1] = np.copy(h_prev)
# Forward through time
for t in range(len(inputs)):
xs[t] = inputs[t]
# Hidden state: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
hs[t] = np.tanh(np.dot(xs[t], self.W_xh) +
np.dot(hs[t-1], self.W_hh) + self.b_h)
# Output: y_t = W_hy @ h_t + b_y
ys[t] = np.dot(hs[t], self.W_hy) + self.b_y
# Probabilities via softmax
ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))
return xs, hs, ys, ps
def backward(self, xs, hs, ps, targets):
"""
Backward pass through time (BPTT).
Computes gradients for all parameters.
"""
# Initialize gradients
dW_xh = np.zeros_like(self.W_xh)
dW_hh = np.zeros_like(self.W_hh)
dW_hy = np.zeros_like(self.W_hy)
db_h = np.zeros_like(self.b_h)
db_y = np.zeros_like(self.b_y)
dh_next = np.zeros_like(hs[0])
# Backward through time
for t in reversed(range(len(xs))):
# Gradient of loss w.r.t. output
dy = np.copy(ps[t])
dy[0, targets[t]] -= 1 # Softmax + cross-entropy gradient
# Output layer gradients
dW_hy += np.dot(hs[t].T, dy)
db_y += dy
# Gradient w.r.t. hidden state
dh = np.dot(dy, self.W_hy.T) + dh_next
# Gradient through tanh
dh_raw = (1 - hs[t] * hs[t]) * dh
# Weight gradients
dW_xh += np.dot(xs[t].T, dh_raw)
dW_hh += np.dot(hs[t-1].T, dh_raw)
db_h += dh_raw
# Gradient for next time step
dh_next = np.dot(dh_raw, self.W_hh.T)
# Clip gradients to prevent exploding gradients
for grad in [dW_xh, dW_hh, dW_hy, db_h, db_y]:
np.clip(grad, -5, 5, out=grad)
return dW_xh, dW_hh, dW_hy, db_h, db_y
def update_weights(self, dW_xh, dW_hh, dW_hy, db_h, db_y):
"""Update weights using AdaGrad"""
for param, dparam, mem in zip(
[self.W_xh, self.W_hh, self.W_hy, self.b_h, self.b_y],
[dW_xh, dW_hh, dW_hy, db_h, db_y],
[self.memory_W_xh, self.memory_W_hh, self.memory_W_hy,
self.memory_b_h, self.memory_b_y]
):
mem += dparam * dparam
param -= self.learning_rate * dparam / (np.sqrt(mem) + 1e-8)
def sample(self, h, seed_idx, n):
"""
Sample a sequence of characters from the model.
h: Initial hidden state
seed_idx: Starting character index
n: Number of characters to generate
"""
x = np.zeros((1, self.vocab_size))
x[0, seed_idx] = 1
indices = []
for _ in range(n):
# Forward pass
h = np.tanh(np.dot(x, self.W_xh) + np.dot(h, self.W_hh) + self.b_h)
y = np.dot(h, self.W_hy) + self.b_y
p = np.exp(y) / np.sum(np.exp(y))
# Sample from probability distribution
idx = np.random.choice(range(self.vocab_size), p=p.ravel())
# Prepare next input
x = np.zeros((1, self.vocab_size))
x[0, idx] = 1
indices.append(idx)
return indices
# Example usage
print("="*60)
print("CHARACTER-LEVEL RNN")
print("="*60)
# Small vocabulary
data = "hello world"
chars = list(set(data))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
print(f"Text: '{data}'")
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")
# Create RNN
rnn = CharRNN(vocab_size=vocab_size, hidden_size=16, seq_length=3)
print(f"\nRNN Parameters:")
print(f" Hidden size: {rnn.hidden_size}")
print(f" Total parameters: {rnn.W_xh.size + rnn.W_hh.size + rnn.W_hy.size + rnn.b_h.size + rnn.b_y.size}")
# Prepare a simple sequence: "hel" → predict "ell" (each target is the next character)
input_chars = ['h', 'e', 'l']
target_chars = ['e', 'l', 'l']
# One-hot encode
inputs = []
targets = []
for i, t in zip(input_chars, target_chars):
x = np.zeros((1, vocab_size))
x[0, char_to_idx[i]] = 1
inputs.append(x)
targets.append(char_to_idx[t])
# Forward pass
h_prev = np.zeros((1, rnn.hidden_size))
xs, hs, ys, ps = rnn.forward(inputs, h_prev)
print(f"\nForward pass with sequence: {input_chars}")
print("Predictions (before training):")
for t in range(len(inputs)):
predicted_idx = np.argmax(ps[t])
predicted_char = idx_to_char[predicted_idx]
target_char = idx_to_char[targets[t]]
print(f" Input: '{input_chars[t]}' → Predicted: '{predicted_char}', Target: '{target_char}'")
print("\n💡 Before training, predictions are random!")
print(" After training, the RNN learns patterns in the sequence.")
Training RNN on Text Data
Let's train our RNN to learn simple patterns in text. We'll use a small dataset and watch it learn to predict characters.
import numpy as np
import matplotlib.pyplot as plt
# Prepare training data
text = "hello world hello there world is beautiful"
chars = sorted(list(set(text)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
print("="*60)
print("TRAINING CHARACTER-LEVEL RNN")
print("="*60)
print(f"Training text: '{text}'")
print(f"Text length: {len(text)} characters")
print(f"Vocabulary: {chars}")
print(f"Vocab size: {vocab_size}")
# Create RNN
seq_length = 10 # Process 10 characters at a time
rnn = CharRNN(vocab_size=vocab_size, hidden_size=32, seq_length=seq_length, learning_rate=0.1)
# Training loop
iterations = 3000
losses = []
smooth_loss = -np.log(1.0 / vocab_size) * seq_length # Initial loss
print(f"\nTraining for {iterations} iterations...")
h_prev = np.zeros((1, rnn.hidden_size))
for iteration in range(iterations):
# Prepare batch
if len(text) - seq_length - 1 < 1:
break
# Random starting position
start_idx = np.random.randint(0, len(text) - seq_length - 1)
# Get input and target sequences
input_seq = text[start_idx:start_idx + seq_length]
target_seq = text[start_idx + 1:start_idx + seq_length + 1]
# One-hot encode
inputs = []
targets = []
for ch in input_seq:
x = np.zeros((1, vocab_size))
x[0, char_to_idx[ch]] = 1
inputs.append(x)
for ch in target_seq:
targets.append(char_to_idx[ch])
# Forward pass
xs, hs, ys, ps = rnn.forward(inputs, h_prev)
# Compute loss
loss = 0
for t in range(len(inputs)):
loss += -np.log(ps[t][0, targets[t]])
smooth_loss = smooth_loss * 0.999 + loss * 0.001
losses.append(smooth_loss)
# Backward pass
dW_xh, dW_hh, dW_hy, db_h, db_y = rnn.backward(xs, hs, ps, targets)
# Update weights
rnn.update_weights(dW_xh, dW_hh, dW_hy, db_h, db_y)
# Update hidden state for next iteration
h_prev = hs[len(inputs) - 1]
# Print progress
if iteration % 500 == 0:
print(f"Iteration {iteration}, Loss: {smooth_loss:.4f}")
# Sample from model
sample_length = 30
sample_h = np.zeros((1, rnn.hidden_size))
sample_indices = rnn.sample(sample_h, char_to_idx[chars[0]], sample_length)
sample_text = ''.join([idx_to_char[idx] for idx in sample_indices])
print(f"Sample: '{sample_text}'")
print()
# Plot training loss
plt.figure(figsize=(12, 5))
plt.plot(losses, linewidth=2, color='#BF092F')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('RNN Training Loss', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("="*60)
print("TRAINING COMPLETE")
print("="*60)
print(f"Final loss: {smooth_loss:.4f}")
print(f"Initial loss: {-np.log(1.0 / vocab_size) * seq_length:.4f}")
print(f"Improvement: {(-np.log(1.0 / vocab_size) * seq_length - smooth_loss):.4f}")
# Generate longer samples
print("\nGenerated text samples (after training):")
for i in range(3):
sample_h = np.zeros((1, rnn.hidden_size))
seed = chars[np.random.randint(0, len(chars))]
sample_indices = rnn.sample(sample_h, char_to_idx[seed], 50)
sample_text = seed + ''.join([idx_to_char[idx] for idx in sample_indices])
print(f" Sample {i+1}: '{sample_text}'")
print("\n💡 Notice how the RNN learned:")
print(" - Character patterns from the training text")
print(" - Common letter combinations")
print(" - With more data and training, it would generate coherent words!")
The Vanishing Gradient Problem
Symbolic Proof of Vanishing Gradients
import sympy as sp
from sympy import symbols, Function, diff, simplify, tanh, sqrt, exp, product
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("VANISHING GRADIENT PROBLEM - MATHEMATICAL PROOF")
print("="*60)
# Define symbolic variables
t = symbols('t', integer=True, positive=True)
T = symbols('T', integer=True, positive=True)
W_h = symbols('W_h', real=True) # Recurrent weight
print("\n1. RNN GRADIENT THROUGH TIME")
print("-" * 60)
print("RNN update: h_t = tanh(W_h × h_{t-1} + W_x × x_t + b)")
print("")
print("When computing ∂L/∂h_0, gradient flows through T timesteps:")
print("∂L/∂h_0 = ∂L/∂h_T × ∂h_T/∂h_{T-1} × ... × ∂h_2/∂h_1 × ∂h_1/∂h_0")
# Jacobian of hidden state transition
h_t, h_prev = symbols('h_t h_{t-1}', real=True)
# Simplified RNN: h_t = tanh(W_h * h_{t-1})
# (ignoring input for clarity)
h_transition = sp.tanh(W_h * h_prev)
# Gradient of h_t w.r.t. h_{t-1}
jac = diff(h_transition, h_prev)
print(f"\n∂h_t/∂h_{{t-1}} = {jac}")
print(f"Simplified: W_h × (1 - tanh²(W_h × h_{{t-1}}))")
# Gradient through T steps (product of Jacobians)
print("\n2. GRADIENT MAGNITUDE AFTER T STEPS")
print("-" * 60)
# Maximum gradient value (when tanh derivative is largest)
print("Tanh derivative: σ'(z) = 1 - tanh²(z)")
print("Range: (0, 1], maximum at z=0 where σ'(0) = 1")
# For typical activations (not near 0), tanh derivative ≈ 0.25 to 0.5
sigma_prime = symbols('sigma_prime', positive=True, real=True)
print("\nTypical value: σ' ≈ 0.25 (when h is moderately activated)")
print(f"\nGradient after T steps: (W_h × σ')^T")
# Show exponential decay
T_vals = [5, 10, 20, 50]
W_h_val = 0.5 # Small weight
sigma_val = 0.25 # Typical derivative
print(f"\nExample: W_h = {W_h_val}, σ' = {sigma_val}")
print(f"Product per step: {W_h_val} × {sigma_val} = {W_h_val * sigma_val}")
print("\nGradient magnitude:")
for T_val in T_vals:
grad_magnitude = (W_h_val * sigma_val) ** T_val
print(f" T={T_val:2d} steps: ({W_h_val * sigma_val})^{T_val} = {grad_magnitude:.2e}")
print("\n⚠️ Gradient vanishes exponentially with sequence length!")
# Exploding gradients (opposite problem)
print("\n3. EXPLODING GRADIENTS (W_h > 1)")
print("-" * 60)
W_h_large = 2.0 # Large weight
print(f"Example: W_h = {W_h_large}, σ' = {sigma_val}")
print(f"Product per step: {W_h_large} × {sigma_val} = {W_h_large * sigma_val}")
print("\nGradient magnitude:")
for T_val in T_vals:
grad_magnitude = (W_h_large * sigma_val) ** T_val
print(f" T={T_val:2d} steps: ({W_h_large * sigma_val})^{T_val} = {grad_magnitude:.2e}")
print("\n⚠️ Gradient explodes exponentially!")
# Condition for stable gradients
print("\n4. STABILITY CONDITION")
print("-" * 60)
print("For stable gradients (neither vanishing nor exploding):")
print("We need: |W_h × σ'| ≈ 1")
print("")
print("But this is impossible to maintain across all timesteps because:")
print(" 1. σ' varies with activation (0 to 1)")
print(" 2. Different timesteps have different activations")
print(" 3. A single W_h can't satisfy this for all states")
print("")
print("Solution: LSTM/GRU with gating mechanisms!")
# Visualize gradient flow
T_range = np.arange(1, 51)
# Different scenarios
vanishing = (0.5 * 0.25) ** T_range # W_h=0.5, σ'=0.25
stable = (1.0 * 0.25) ** T_range # W_h=1.0, σ'=0.25 (still decays!)
exploding = (2.0 * 0.5) ** T_range # W_h=2.0, σ'=0.5
plt.figure(figsize=(12, 6))
plt.semilogy(T_range, vanishing, linewidth=2, label='Vanishing (W_h=0.5, σ\'=0.25)',
color='#BF092F', marker='o', markersize=4, markevery=5)
plt.semilogy(T_range, stable, linewidth=2, label='Moderate (W_h=1.0, σ\'=0.25)',
color='#3B9797', marker='s', markersize=4, markevery=5)
plt.semilogy(T_range, np.minimum(exploding, 1e10), linewidth=2,
label='Exploding (W_h=2.0, σ\'=0.5)',
color='#132440', marker='^', markersize=4, markevery=5)
plt.axhline(y=1, color='green', linestyle='--', linewidth=2, alpha=0.5, label='Ideal (magnitude=1)')
plt.axhline(y=1e-5, color='red', linestyle='--', linewidth=1, alpha=0.5, label='Vanishing threshold')
plt.xlabel('Timesteps (T)', fontsize=12)
plt.ylabel('Gradient Magnitude (log scale)', fontsize=12)
plt.title('Gradient Flow Through Time in RNNs', fontsize=14, fontweight='bold')
plt.legend(loc='upper left', fontsize=10)
plt.grid(True, alpha=0.3)
plt.ylim([1e-15, 1e10])
plt.tight_layout()
plt.show()
print("\n💡 Key takeaways:")
print(" 1. Gradient = product of many small terms (< 1)")
print(" 2. Exponential decay with sequence length")
print(" 3. Learning long-term dependencies becomes impossible")
print(" 4. LSTM/GRU solve this with additive gradient paths")
Why Simple RNNs Struggle with Long Sequences
The Problem: Gradients vanish as they backpropagate through time
Analogy: Telephone Game
- Person 1 whispers "The cat sat on the mat" to Person 2
- Person 2 whispers to Person 3 (slightly garbled)
- Person 3 to Person 4 (more garbled)
- By Person 10, message is incomprehensible
In RNNs:
- Gradient must flow backward through many time steps
- At each step, gradient is multiplied by weight matrix and activation derivative
- If values < 1, repeated multiplication makes gradient → 0 (vanishing)
- If values > 1, repeated multiplication makes gradient → ∞ (exploding)
Consequence: RNN can't learn long-term dependencies (>10-20 steps)
Solution: LSTM and GRU architectures
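The exploding case also has a standard stopgap: gradient clipping (the BPTT code earlier clamps each gradient entry to ±5 with `np.clip`). A minimal sketch comparing that value clipping with norm clipping, a common alternative; `clip_by_value` and `clip_by_norm` are illustrative names, not library functions:

```python
import numpy as np

def clip_by_value(grad, limit=5.0):
    """Clamp each entry independently (as the BPTT code above does)."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=5.0):
    """Rescale the whole gradient so its L2 norm is at most max_norm,
    preserving its direction (the scheme many frameworks default to)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

exploded = np.array([[100.0, -200.0], [3.0, 0.5]])  # a blown-up gradient
print(clip_by_value(exploded))                      # entries clamped to [-5, 5]
print(np.linalg.norm(clip_by_norm(exploded)))       # norm rescaled to 5.0
```

Value clipping distorts the gradient's direction (large entries saturate while small ones pass through unchanged); norm clipping keeps the direction and only shrinks the magnitude. Neither helps with vanishing gradients, which is why gated architectures are still needed.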
import numpy as np
import matplotlib.pyplot as plt
def demonstrate_vanishing_gradient():
"""
Show how gradients vanish as sequence length increases.
"""
# Simulate gradient backpropagation through time
sequence_lengths = [5, 10, 20, 50, 100]
# Weight matrix for RNN hidden state
W_hh = np.random.randn(10, 10) * 0.1 # Small values
gradients = []
for T in sequence_lengths:
# Initial gradient
grad = np.random.randn(10, 10)
# Backpropagate through time
for t in range(T):
# Simplified: gradient gets multiplied by W_hh at each step
grad = np.dot(grad, W_hh.T)
# Measure gradient magnitude
grad_norm = np.linalg.norm(grad)
gradients.append(grad_norm)
# Plot
plt.figure(figsize=(12, 5))
plt.semilogy(sequence_lengths, gradients, marker='o', linewidth=2,
markersize=8, color='#BF092F')
plt.xlabel('Sequence Length (time steps)', fontsize=12)
plt.ylabel('Gradient Magnitude (log scale)', fontsize=12)
plt.title('Vanishing Gradient Problem in RNNs', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.axhline(y=1e-10, color='gray', linestyle='--', label='Effectively zero')
plt.legend()
plt.tight_layout()
plt.show()
print("="*60)
print("VANISHING GRADIENT DEMONSTRATION")
print("="*60)
print("Gradient magnitude after backpropagating through time:")
for T, grad_norm in zip(sequence_lengths, gradients):
print(f" Sequence length {T:3d}: {grad_norm:.2e}")
print("\n💡 Notice: Gradient shrinks exponentially with sequence length!")
print(" After 100 steps, gradient is effectively 0.")
print(" This means the RNN can't learn from early parts of long sequences.")
demonstrate_vanishing_gradient()
Long Short-Term Memory (LSTM)
LSTM solves the vanishing gradient problem using gates that control information flow. LSTMs can learn dependencies across hundreds of time steps.
LSTM Architecture: Gates and Cell State
Key Innovation: Cell State
- Separate "memory highway" that runs through entire sequence
- Information can flow unchanged or be modified by gates
- Prevents gradient from vanishing
Three Gates Control Information:
- Forget Gate: What to forget from cell state? (0 = forget all, 1 = keep all)
- Input Gate: What new information to add to cell state?
- Output Gate: What to output based on cell state?
Analogy: Note-Taking in Class
- Cell state: Your notebook (persistent memory)
- Forget gate: Erase old, irrelevant notes
- Input gate: Write down new important information
- Output gate: Read relevant parts for current question
import numpy as np
class LSTMCell:
"""
Single LSTM cell with forget, input, and output gates.
"""
def __init__(self, input_size, hidden_size):
self.input_size = input_size
self.hidden_size = hidden_size
# Combined weight matrices for efficiency (concatenate x and h)
combined_size = input_size + hidden_size
# Forget gate: decides what to forget from cell state
self.W_f = np.random.randn(combined_size, hidden_size) * 0.01
self.b_f = np.zeros((1, hidden_size))
# Input gate: decides what new information to add
self.W_i = np.random.randn(combined_size, hidden_size) * 0.01
self.b_i = np.zeros((1, hidden_size))
# Candidate values: new information to potentially add
self.W_c = np.random.randn(combined_size, hidden_size) * 0.01
self.b_c = np.zeros((1, hidden_size))
# Output gate: decides what to output
self.W_o = np.random.randn(combined_size, hidden_size) * 0.01
self.b_o = np.zeros((1, hidden_size))
def sigmoid(self, x):
"""Sigmoid activation (for gates: output between 0 and 1)"""
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def forward(self, x, h_prev, c_prev):
"""
Forward pass through LSTM cell.
x: Input at current time step (1, input_size)
h_prev: Previous hidden state (1, hidden_size)
c_prev: Previous cell state (1, hidden_size)
Returns: h_next, c_next
"""
# Concatenate input and previous hidden state
combined = np.concatenate([x, h_prev], axis=1)
# 1. Forget gate: what to forget from cell state
f_t = self.sigmoid(np.dot(combined, self.W_f) + self.b_f)
# 2. Input gate: what new information to add
i_t = self.sigmoid(np.dot(combined, self.W_i) + self.b_i)
# 3. Candidate values: new information
c_tilde = np.tanh(np.dot(combined, self.W_c) + self.b_c)
# 4. Update cell state
c_next = f_t * c_prev + i_t * c_tilde
# 5. Output gate: what to output
o_t = self.sigmoid(np.dot(combined, self.W_o) + self.b_o)
# 6. Hidden state (output)
h_next = o_t * np.tanh(c_next)
return h_next, c_next, (f_t, i_t, c_tilde, o_t)
# Test LSTM cell
print("="*60)
print("LSTM CELL ARCHITECTURE")
print("="*60)
input_size = 5
hidden_size = 4
lstm = LSTMCell(input_size, hidden_size)
# Initial states
h_prev = np.zeros((1, hidden_size))
c_prev = np.zeros((1, hidden_size))
# Process a sequence
sequence = [np.random.randn(1, input_size) for _ in range(5)]
print(f"Input size: {input_size}")
print(f"Hidden size: {hidden_size}")
print(f"Sequence length: {len(sequence)}")
print("\nProcessing sequence:")
for t, x in enumerate(sequence):
h_next, c_next, gates = lstm.forward(x, h_prev, c_prev)
f_t, i_t, c_tilde, o_t = gates
print(f"\nTime step {t}:")
print(f" Forget gate (mean): {f_t.mean():.3f} (1=keep, 0=forget)")
print(f" Input gate (mean): {i_t.mean():.3f} (1=add new info, 0=ignore)")
print(f" Output gate (mean): {o_t.mean():.3f} (1=output, 0=hide)")
print(f" Cell state norm: {np.linalg.norm(c_next):.3f}")
print(f" Hidden state norm: {np.linalg.norm(h_next):.3f}")
# Update for next time step
h_prev = h_next
c_prev = c_next
print("\n💡 LSTM gates adaptively control information flow:")
print(" - Forget gate removes irrelevant past information")
print(" - Input gate adds relevant new information")
print(" - Output gate exposes relevant information")
print(" - Cell state provides 'highway' for gradients → no vanishing!")
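The "highway" claim can be checked with a little arithmetic. Differentiating the cell-state update c_t = f_t · c_{t-1} + i_t · c̃_t with respect to c_{t-1} gives (ignoring the gates' own dependence on the state) just f_t, so the gradient along the cell state is a product of forget-gate values rather than of W_h × tanh′ terms. A small sketch; the forget-gate value of 0.95 is an assumption chosen for illustration:

```python
import numpy as np

T = 50  # sequence length

# Vanilla RNN path: gradient is multiplied by W_h * tanh'(.) at every step.
rnn_grad = 1.0
for _ in range(T):
    rnn_grad *= 1.0 * 0.25          # W_h = 1.0, typical tanh derivative ~ 0.25

# LSTM cell-state path: d c_t / d c_{t-1} = f_t (no weight matrix,
# no squashing nonlinearity), so the gradient is a product of forget gates.
forget_gates = np.full(T, 0.95)     # network learned to mostly keep its memory
lstm_grad = float(np.prod(forget_gates))

print(f"Vanilla RNN gradient after {T} steps:     {rnn_grad:.2e}")
print(f"LSTM cell-state gradient after {T} steps: {lstm_grad:.2e}")
```

If the forget gate stays near 1, gradients survive hundreds of steps; when it drops toward 0, the network is deliberately discarding that memory, a learned decision rather than an unavoidable decay.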
Gated Recurrent Unit (GRU)
GRU is a simplified version of LSTM with only two gates (reset and update), making it faster to train while retaining most of LSTM's power.
import numpy as np
class GRUCell:
"""
Gated Recurrent Unit: simpler alternative to LSTM.
Only 2 gates instead of 3, no separate cell state.
"""
def __init__(self, input_size, hidden_size):
self.input_size = input_size
self.hidden_size = hidden_size
combined_size = input_size + hidden_size
# Update gate: how much of previous hidden state to keep
self.W_z = np.random.randn(combined_size, hidden_size) * 0.01
self.b_z = np.zeros((1, hidden_size))
# Reset gate: how much of previous hidden state to forget when computing candidate
self.W_r = np.random.randn(combined_size, hidden_size) * 0.01
self.b_r = np.zeros((1, hidden_size))
# Candidate hidden state: new information
self.W_h = np.random.randn(combined_size, hidden_size) * 0.01
self.b_h = np.zeros((1, hidden_size))
def sigmoid(self, x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
def forward(self, x, h_prev):
"""
Forward pass through GRU cell.
x: Input at current time step
h_prev: Previous hidden state
Returns: h_next
"""
# Concatenate input and previous hidden
combined = np.concatenate([x, h_prev], axis=1)
# 1. Reset gate: how much past to forget
r_t = self.sigmoid(np.dot(combined, self.W_r) + self.b_r)
# 2. Update gate: how much to update
z_t = self.sigmoid(np.dot(combined, self.W_z) + self.b_z)
# 3. Candidate hidden state (using reset gate)
combined_reset = np.concatenate([x, r_t * h_prev], axis=1)
h_tilde = np.tanh(np.dot(combined_reset, self.W_h) + self.b_h)
# 4. Final hidden state: interpolate between previous and candidate
h_next = (1 - z_t) * h_prev + z_t * h_tilde
return h_next, (r_t, z_t, h_tilde)
# Compare LSTM vs GRU parameter counts
print("="*60)
print("LSTM vs GRU: Parameter Comparison")
print("="*60)
input_size = 100
hidden_size = 128
# LSTM parameters
lstm_params = 4 * ((input_size + hidden_size) * hidden_size + hidden_size)
print(f"LSTM parameters: {lstm_params:,}")
print(f" - 4 weight blocks (forget, input, output gates + candidate) × (input→hidden + hidden→hidden + bias)")
# GRU parameters
gru_params = 3 * ((input_size + hidden_size) * hidden_size + hidden_size)
print(f"\nGRU parameters: {gru_params:,}")
print(f" - 3 weight blocks (reset, update gates + candidate) × (input→hidden + hidden→hidden + bias)")
print(f"\nParameter reduction: {(1 - gru_params/lstm_params)*100:.1f}%")
# Test GRU
gru = GRUCell(input_size=5, hidden_size=4)
h_prev = np.zeros((1, 4))
x = np.random.randn(1, 5)
h_next, gates = gru.forward(x, h_prev)
r_t, z_t, h_tilde = gates
print("\n" + "="*60)
print("GRU GATES IN ACTION")
print("="*60)
print(f"Reset gate (mean): {r_t.mean():.3f}")
print(f" - Controls how much past info to use for candidate")
print(f"Update gate (mean): {z_t.mean():.3f}")
print(f" - Controls interpolation: (1-z)*h_old + z*h_new")
print("\n💡 When to use LSTM vs GRU:")
print(" - LSTM: Longer sequences, more complex patterns, have compute budget")
print(" - GRU: Faster training, simpler patterns, limited compute")
print(" - In practice: Try both! GRU often works just as well.")
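The update-gate interpolation `h_next = (1 - z) * h_prev + z * h_tilde` is the heart of the GRU; a toy sketch with hand-picked gate values shows its two extremes:

```python
import numpy as np

h_prev = np.array([1.0, -2.0, 0.5])   # previous hidden state
h_tilde = np.array([0.0, 3.0, 0.5])   # candidate state from the current input

for z in [0.0, 0.5, 1.0]:             # pretend update-gate values
    h_next = (1 - z) * h_prev + z * h_tilde
    print(f"z = {z}: h_next = {h_next}")

# z = 0 copies the past unchanged (long-term memory survives);
# z = 1 overwrites it entirely with the candidate;
# values in between blend the two.
```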
RNN Deep Dive Summary
What We Built:
- ✓ Complete vanilla RNN from scratch with BPTT
- ✓ Character-level language model
- ✓ Training loop generating text
- ✓ Vanishing gradient demonstration
- ✓ LSTM cell with 3 gates and cell state
- ✓ GRU cell as simpler alternative
Key Insights:
- RNN hidden state acts as memory across time steps
- Vanishing gradients prevent vanilla RNNs from learning long dependencies
- LSTM gates (forget, input, output) control information flow
- GRU simplifies LSTM to 2 gates with similar performance
- Applications: NLP, time series, speech, any sequential data
Next: We'll explore Autoencoders for unsupervised learning and dimensionality reduction!
Autoencoders - Deep Dive
Autoencoders are neural networks that learn to compress data into a lower-dimensional representation and then reconstruct it. They're trained in an unsupervised manner—no labels needed! The network learns to extract the most important features automatically.
Understanding Autoencoder Architecture
The Compression-Reconstruction Game
Analogy: Packing a Suitcase
- Input: All your clothes (high-dimensional)
- Encoder: Compress into suitcase (low-dimensional bottleneck)
- Decoder: Unpack and try to recover original clothes
- Goal: Learn what's essential vs what can be discarded
Autoencoder Components:
- Encoder: Compresses input X → low-dimensional code Z
- Bottleneck (Latent Space): Compressed representation (Z)
- Decoder: Reconstructs from code Z → output X'
- Loss: Reconstruction error ‖X - X'‖² (how well did we recover the original?)
Key Insight: By forcing the network through a narrow bottleneck, it must learn to extract only the most important features!
import numpy as np
import matplotlib.pyplot as plt
# Simple illustration of autoencoder concept
def visualize_autoencoder_concept():
"""
Demonstrate dimensionality reduction and reconstruction.
"""
# Compress 10-dimensional data to 2 dimensions, then reconstruct (10 → 2 → 10)
np.random.seed(42)
# Original high-dimensional data (simplified as 10D for visualization)
original_dims = 10
compressed_dims = 2
num_samples = 5
# Random data
original_data = np.random.randn(num_samples, original_dims)
# Simulate encoder (compress to 2D)
encoder_weights = np.random.randn(original_dims, compressed_dims) * 0.1
compressed = np.dot(original_data, encoder_weights)
# Simulate decoder (reconstruct to 10D)
decoder_weights = np.random.randn(compressed_dims, original_dims) * 0.1
reconstructed = np.dot(compressed, decoder_weights)
# Compute reconstruction error
reconstruction_error = np.mean((original_data - reconstructed) ** 2)
print("="*60)
print("AUTOENCODER CONCEPT: Compression and Reconstruction")
print("="*60)
print(f"Original dimensions: {original_dims}")
print(f"Compressed dimensions: {compressed_dims}")
print(f"Compression ratio: {original_dims / compressed_dims:.1f}x")
print(f"\nReconstruction error: {reconstruction_error:.4f}")
print("\nSample comparison:")
for i in range(min(3, num_samples)):
print(f"\n Sample {i+1}:")
print(f" Original: {original_data[i][:5]} ... (10 dims)")
print(f" Compressed: {compressed[i]} (2 dims)")
print(f" Reconstructed: {reconstructed[i][:5]} ... (10 dims)")
print(f" Error: {np.mean((original_data[i] - reconstructed[i])**2):.4f}")
# Visualize compression
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Original data heatmap
axes[0].imshow(original_data.T, cmap='viridis', aspect='auto')
axes[0].set_title(f'Original Data\n({num_samples} samples × {original_dims} dims)',
fontsize=12, fontweight='bold')
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Dimension')
# Compressed data
axes[1].scatter(compressed[:, 0], compressed[:, 1], s=100, c=range(num_samples),
cmap='viridis', edgecolors='black', linewidths=2)
axes[1].set_title(f'Compressed Representation\n({compressed_dims}D Latent Space)',
fontsize=12, fontweight='bold')
axes[1].set_xlabel('Latent Dim 1')
axes[1].set_ylabel('Latent Dim 2')
axes[1].grid(True, alpha=0.3)
# Reconstructed data
axes[2].imshow(reconstructed.T, cmap='viridis', aspect='auto')
axes[2].set_title(f'Reconstructed Data\n({num_samples} samples × {original_dims} dims)',
fontsize=12, fontweight='bold')
axes[2].set_xlabel('Sample')
axes[2].set_ylabel('Dimension')
plt.tight_layout()
plt.show()
print("\n💡 Autoencoder learns to:")
print(" 1. Extract essential features (encoder)")
print(" 2. Compress to low-dimensional representation")
print(" 3. Reconstruct original from compressed form (decoder)")
print(" 4. Minimize reconstruction error through training")
visualize_autoencoder_concept()
Building a Basic Autoencoder
Let's build a complete autoencoder from scratch and train it to compress and reconstruct data.
import numpy as np
class Autoencoder:
"""
Basic autoencoder: Input → Encoder → Bottleneck → Decoder → Reconstruction
"""
def __init__(self, input_size, encoding_size, learning_rate=0.01):
"""
Initialize autoencoder.
input_size: Original data dimensions
encoding_size: Compressed representation size (bottleneck)
"""
self.input_size = input_size
self.encoding_size = encoding_size
self.learning_rate = learning_rate
# Encoder weights: input ? encoding
self.W_encoder = np.random.randn(input_size, encoding_size) * np.sqrt(2.0 / input_size)
self.b_encoder = np.zeros((1, encoding_size))
# Decoder weights: encoding ? output
self.W_decoder = np.random.randn(encoding_size, input_size) * np.sqrt(2.0 / encoding_size)
self.b_decoder = np.zeros((1, input_size))
def relu(self, x):
"""ReLU activation"""
return np.maximum(0, x)
def relu_derivative(self, x):
"""Derivative of ReLU"""
return (x > 0).astype(float)
def encode(self, X):
"""
Encoder: compress input to lower dimension.
X: Input data (batch_size, input_size)
Returns: Compressed representation (batch_size, encoding_size)
"""
z = np.dot(X, self.W_encoder) + self.b_encoder
encoding = self.relu(z)
return encoding, z
def decode(self, encoding):
"""
Decoder: reconstruct from compressed representation.
encoding: Compressed data (batch_size, encoding_size)
Returns: Reconstructed data (batch_size, input_size)
"""
reconstruction = np.dot(encoding, self.W_decoder) + self.b_decoder
return reconstruction
def forward(self, X):
"""
Full forward pass: encode then decode.
Returns: reconstruction, encoding
"""
self.X = X
self.encoding, self.z_encoder = self.encode(X)
self.reconstruction = self.decode(self.encoding)
return self.reconstruction, self.encoding
def compute_loss(self, X, reconstruction):
"""Mean Squared Error loss"""
return np.mean((X - reconstruction) ** 2)
def backward(self, X, reconstruction):
"""
Backpropagation to compute gradients.
"""
batch_size = X.shape[0]
# Gradient of loss w.r.t. reconstruction
grad_reconstruction = 2 * (reconstruction - X) / batch_size
# Decoder gradients
grad_W_decoder = np.dot(self.encoding.T, grad_reconstruction)
grad_b_decoder = np.sum(grad_reconstruction, axis=0, keepdims=True)
# Gradient w.r.t. encoding
grad_encoding = np.dot(grad_reconstruction, self.W_decoder.T)
# Apply ReLU derivative
grad_encoding = grad_encoding * self.relu_derivative(self.z_encoder)
# Encoder gradients
grad_W_encoder = np.dot(X.T, grad_encoding)
grad_b_encoder = np.sum(grad_encoding, axis=0, keepdims=True)
return grad_W_encoder, grad_b_encoder, grad_W_decoder, grad_b_decoder
def update_weights(self, grad_W_encoder, grad_b_encoder, grad_W_decoder, grad_b_decoder):
"""Update weights using gradient descent"""
self.W_encoder -= self.learning_rate * grad_W_encoder
self.b_encoder -= self.learning_rate * grad_b_encoder
self.W_decoder -= self.learning_rate * grad_W_decoder
self.b_decoder -= self.learning_rate * grad_b_decoder
def train_step(self, X):
"""Single training step"""
# Forward pass
reconstruction, encoding = self.forward(X)
# Compute loss
loss = self.compute_loss(X, reconstruction)
# Backward pass
grads = self.backward(X, reconstruction)
# Update weights
self.update_weights(*grads)
return loss
# Create synthetic dataset
print("="*60)
print("BASIC AUTOENCODER: Dimensionality Reduction")
print("="*60)
# Generate correlated data (high-dimensional but low intrinsic dimension)
np.random.seed(42)
num_samples = 200
intrinsic_dims = 3
observed_dims = 20
# True low-dimensional data
true_latent = np.random.randn(num_samples, intrinsic_dims)
# Project to high dimensions with random projection
projection = np.random.randn(intrinsic_dims, observed_dims)
data = np.dot(true_latent, projection)
# Add small noise
data += np.random.randn(num_samples, observed_dims) * 0.1
# Normalize
data = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-8)
print(f"Dataset: {num_samples} samples")
print(f"Original dimensions: {observed_dims}")
print(f"True intrinsic dimensions: {intrinsic_dims}")
print(f"Target encoding dimensions: {intrinsic_dims}")
# Create autoencoder
autoencoder = Autoencoder(input_size=observed_dims, encoding_size=intrinsic_dims, learning_rate=0.01)
# Training loop
epochs = 1000
batch_size = 32
losses = []
print(f"\nTraining autoencoder for {epochs} epochs...")
for epoch in range(epochs):
epoch_losses = []
# Mini-batch training
indices = np.random.permutation(num_samples)
for i in range(0, num_samples, batch_size):
batch_indices = indices[i:i+batch_size]
X_batch = data[batch_indices]
loss = autoencoder.train_step(X_batch)
epoch_losses.append(loss)
avg_loss = np.mean(epoch_losses)
losses.append(avg_loss)
if (epoch + 1) % 200 == 0:
print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.6f}")
# Plot training loss
plt.figure(figsize=(12, 5))
plt.plot(losses, linewidth=2, color='#BF092F')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Reconstruction Loss (MSE)', fontsize=12)
plt.title('Autoencoder Training Loss', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nFinal loss: {losses[-1]:.6f}")
print(f"Initial loss: {losses[0]:.6f}")
print(f"Improvement: {(1 - losses[-1]/losses[0])*100:.1f}%")
print("\n💡 Autoencoder successfully learned to:")
print(" - Compress 20D data to 3D")
print(" - Reconstruct original with minimal error")
print(" - Discovered the intrinsic low-dimensional structure!")
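One practical use of the per-sample reconstruction error is anomaly detection: points far from the learned low-dimensional structure reconstruct poorly. A self-contained sketch, using a top-3 PCA projection as a stand-in for a trained linear encoder/decoder (a linear autoencoder converges to the same subspace), with an artificial outlier appended:

```python
import numpy as np

np.random.seed(0)
# Inliers live near a 3-D subspace of a 20-D space, like the dataset above.
latent = np.random.randn(200, 3)
projection = np.random.randn(3, 20)
X = latent @ projection + 0.05 * np.random.randn(200, 20)
X = np.vstack([X, 5 * np.random.randn(1, 20)])   # append one anomaly

# Stand-in for the trained encoder/decoder: project onto the top-3
# principal directions and back.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
V3 = Vt[:3]                                      # (3, 20) "encoder"
X_hat = (X - mean) @ V3.T @ V3 + mean            # encode, then decode

errors = np.mean((X - X_hat) ** 2, axis=1)       # per-sample reconstruction MSE
threshold = errors[:-1].mean() + 3 * errors[:-1].std()
print(f"Anomaly error: {errors[-1]:.3f}  vs  threshold: {threshold:.3f}")
print("Flagged as anomaly:", errors[-1] > threshold)
```

The same recipe applies to the nonlinear `Autoencoder` class above: train on normal data only, then flag any sample whose reconstruction error exceeds a threshold set from the training-error distribution.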
import numpy as np
import matplotlib.pyplot as plt
# Visualize learned representations
def visualize_autoencoder_results(autoencoder, data, true_latent):
"""
Compare learned encoding with true latent structure.
"""
# Encode all data
reconstruction, learned_encoding = autoencoder.forward(data)
# Compute reconstruction error per sample
reconstruction_errors = np.mean((data - reconstruction) ** 2, axis=1)
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Original vs Reconstructed (first 5 samples)
axes[0, 0].plot(data[:5].T, alpha=0.7, linewidth=2, label='Original')
axes[0, 0].plot(reconstruction[:5].T, '--', alpha=0.7, linewidth=2, label='Reconstructed')
axes[0, 0].set_title('Original vs Reconstructed Data (First 5 Samples)',
fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Dimension')
axes[0, 0].set_ylabel('Value')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Reconstruction error distribution
axes[0, 1].hist(reconstruction_errors, bins=30, color='#3B9797', alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Reconstruction Error Distribution', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('MSE')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].axvline(np.mean(reconstruction_errors), color='#BF092F',
linestyle='--', linewidth=2, label=f'Mean: {np.mean(reconstruction_errors):.4f}')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# 3. True latent space (first 2 dimensions)
scatter1 = axes[1, 0].scatter(true_latent[:, 0], true_latent[:, 1],
c=reconstruction_errors, cmap='viridis',
s=50, alpha=0.6, edgecolors='black')
axes[1, 0].set_title('True Latent Space (3D → showing 2D)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('True Latent Dim 1')
axes[1, 0].set_ylabel('True Latent Dim 2')
axes[1, 0].grid(True, alpha=0.3)
plt.colorbar(scatter1, ax=axes[1, 0], label='Reconstruction Error')
# 4. Learned encoding space (first 2 dimensions)
scatter2 = axes[1, 1].scatter(learned_encoding[:, 0], learned_encoding[:, 1],
c=reconstruction_errors, cmap='viridis',
s=50, alpha=0.6, edgecolors='black')
axes[1, 1].set_title('Learned Encoding Space (3D → showing 2D)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Learned Encoding Dim 1')
axes[1, 1].set_ylabel('Learned Encoding Dim 2')
axes[1, 1].grid(True, alpha=0.3)
plt.colorbar(scatter2, ax=axes[1, 1], label='Reconstruction Error')
plt.tight_layout()
plt.show()
print("="*60)
print("AUTOENCODER RESULTS")
print("="*60)
print(f"Mean reconstruction error: {np.mean(reconstruction_errors):.6f}")
print(f"Std reconstruction error: {np.std(reconstruction_errors):.6f}")
print(f"\nCompression achieved:")
print(f" Input: {data.shape[1]} dimensions")
print(f" Encoding: {learned_encoding.shape[1]} dimensions")
print(f" Compression ratio: {data.shape[1] / learned_encoding.shape[1]:.1f}x")
print("\nVisualization insights:")
print(" - Top-left: Reconstructed signals closely match originals")
print(" - Top-right: Most samples have low reconstruction error")
print(" - Bottom: Learned encoding captures similar structure to true latent space")
visualize_autoencoder_results(autoencoder, data, true_latent)
Denoising Autoencoders
Denoising autoencoders learn to remove noise from corrupted inputs. They're trained to reconstruct the clean signal from a noise-corrupted copy (clean → noisy → clean), which makes them robust feature extractors.
Why Denoising Autoencoders?
Problem with Basic Autoencoders:
- May learn identity function (copy input to output)
- Doesn't generalize well to noisy or incomplete data
- Features may be brittle and overfit
Denoising Solution:
- Add noise to input: X → X_noisy
- Train to reconstruct clean version: X_noisy → X_clean
- Forces network to learn robust, meaningful features
- Can't just memorize—must understand structure
Applications:
- Image denoising (remove grain, artifacts)
- Audio restoration (remove background noise)
- Data imputation (fill missing values)
- Robust feature learning for downstream tasks
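The corruption step is only a couple of lines of NumPy. A minimal sketch (the `make_denoising_pair` helper and the 0.3 noise level are illustrative, mirroring the training setup below):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_pair(X_clean, noise_level=0.3, rng=rng):
    """Return (noisy input, clean target) for denoising training."""
    # Corrupt with additive Gaussian noise
    X_noisy = X_clean + rng.normal(0.0, noise_level, size=X_clean.shape)
    # Keep pixel values in the valid [0, 1] range
    X_noisy = np.clip(X_noisy, 0.0, 1.0)
    return X_noisy, X_clean

X = rng.random((4, 16))  # 4 toy "images" of 16 pixels each
X_noisy, X_target = make_denoising_pair(X)
print(X_noisy.shape, X_target.shape)  # both (4, 16)
```

The network then sees `X_noisy` as input while the loss is computed against `X_target`, so memorizing the input is never a winning strategy.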
import numpy as np
import matplotlib.pyplot as plt
# Create simple image dataset (geometric patterns)
def create_pattern_dataset(num_samples=100, img_size=16):
    """
    Generate simple patterns (stripes, checkerboards, gradients).
    """
    patterns = []
    for _ in range(num_samples):
        pattern_type = np.random.choice(['vertical', 'horizontal', 'checkerboard', 'gradient'])
        img = np.zeros((img_size, img_size))
        if pattern_type == 'vertical':
            # Vertical stripes
            stripe_width = np.random.randint(2, 5)
            for i in range(0, img_size, stripe_width * 2):
                img[:, i:i+stripe_width] = 1
        elif pattern_type == 'horizontal':
            # Horizontal stripes
            stripe_width = np.random.randint(2, 5)
            for i in range(0, img_size, stripe_width * 2):
                img[i:i+stripe_width, :] = 1
        elif pattern_type == 'checkerboard':
            # Checkerboard
            block_size = 4
            for i in range(0, img_size, block_size):
                for j in range(0, img_size, block_size):
                    if (i // block_size + j // block_size) % 2 == 0:
                        img[i:i+block_size, j:j+block_size] = 1
        else:  # gradient
            img = np.linspace(0, 1, img_size).reshape(-1, 1)
            img = np.tile(img, (1, img_size))
        patterns.append(img.flatten())
    return np.array(patterns)
# Generate dataset
img_size = 16
num_samples = 200
clean_data = create_pattern_dataset(num_samples, img_size)
# Add noise for training denoising autoencoder
noise_level = 0.3
noisy_data = clean_data + np.random.randn(*clean_data.shape) * noise_level
noisy_data = np.clip(noisy_data, 0, 1) # Keep in valid range
print("="*60)
print("DENOISING AUTOENCODER: Removing Noise")
print("="*60)
print(f"Dataset: {num_samples} pattern images")
print(f"Image size: {img_size}×{img_size} = {img_size**2} pixels")
print(f"Noise level: {noise_level}")
# Visualize clean vs noisy
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i in range(5):
    # Clean
    axes[0, i].imshow(clean_data[i].reshape(img_size, img_size), cmap='gray', vmin=0, vmax=1)
    axes[0, i].set_title('Clean', fontsize=10, fontweight='bold')
    axes[0, i].axis('off')
    # Noisy
    axes[1, i].imshow(noisy_data[i].reshape(img_size, img_size), cmap='gray', vmin=0, vmax=1)
    axes[1, i].set_title('Noisy', fontsize=10, fontweight='bold')
    axes[1, i].axis('off')
plt.suptitle('Clean vs Noisy Patterns', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
import numpy as np
import matplotlib.pyplot as plt
# Train denoising autoencoder
input_size = img_size ** 2 # 256
encoding_size = 32 # Compress to 32 dimensions
denoising_ae = Autoencoder(input_size=input_size, encoding_size=encoding_size, learning_rate=0.01)
print(f"\nDenoising Autoencoder Architecture:")
print(f" Input: {input_size} pixels")
print(f" Encoding: {encoding_size} dimensions")
print(f" Output: {input_size} pixels (reconstructed)")
# Training loop
epochs = 500
batch_size = 32
losses = []
print(f"\nTraining for {epochs} epochs...")
for epoch in range(epochs):
    epoch_losses = []
    indices = np.random.permutation(num_samples)
    for i in range(0, num_samples, batch_size):
        batch_indices = indices[i:i+batch_size]
        # Input: noisy data
        X_noisy = noisy_data[batch_indices]
        # Target: clean data
        X_clean = clean_data[batch_indices]
        # Forward pass with noisy input
        reconstruction, _ = denoising_ae.forward(X_noisy)
        # Compute loss against clean target
        loss = denoising_ae.compute_loss(X_clean, reconstruction)
        # Backward pass and update (using clean target)
        grads = denoising_ae.backward(X_clean, reconstruction)
        denoising_ae.update_weights(*grads)
        epoch_losses.append(loss)
    avg_loss = np.mean(epoch_losses)
    losses.append(avg_loss)
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.6f}")
# Plot training
plt.figure(figsize=(12, 5))
plt.plot(losses, linewidth=2, color='#BF092F')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Reconstruction Loss', fontsize=12)
plt.title('Denoising Autoencoder Training', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Test on unseen noisy images
test_samples = 8
test_clean = create_pattern_dataset(test_samples, img_size)
test_noisy = test_clean + np.random.randn(*test_clean.shape) * noise_level
test_noisy = np.clip(test_noisy, 0, 1)
# Denoise
test_denoised, _ = denoising_ae.forward(test_noisy)
# Visualize results
fig, axes = plt.subplots(3, test_samples, figsize=(16, 6))
for i in range(test_samples):
    # Original clean
    axes[0, i].imshow(test_clean[i].reshape(img_size, img_size), cmap='gray', vmin=0, vmax=1)
    if i == 0:
        axes[0, i].set_ylabel('Original\nClean', fontsize=11, fontweight='bold')
    axes[0, i].axis('off')
    # Noisy input
    axes[1, i].imshow(test_noisy[i].reshape(img_size, img_size), cmap='gray', vmin=0, vmax=1)
    if i == 0:
        axes[1, i].set_ylabel('Noisy\nInput', fontsize=11, fontweight='bold')
    axes[1, i].axis('off')
    # Denoised output
    axes[2, i].imshow(test_denoised[i].reshape(img_size, img_size), cmap='gray', vmin=0, vmax=1)
    if i == 0:
        axes[2, i].set_ylabel('Denoised\nOutput', fontsize=11, fontweight='bold')
    axes[2, i].axis('off')
plt.suptitle('Denoising Autoencoder Results', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
# Compute metrics
mse_noisy = np.mean((test_clean - test_noisy) ** 2)
mse_denoised = np.mean((test_clean - test_denoised) ** 2)
print("="*60)
print("DENOISING RESULTS")
print("="*60)
print(f"MSE (noisy vs clean): {mse_noisy:.6f}")
print(f"MSE (denoised vs clean): {mse_denoised:.6f}")
print(f"Improvement: {(1 - mse_denoised / mse_noisy) * 100:.1f}%")
print("\nDenoising autoencoder successfully:")
print(" - Learned to remove noise from corrupted images")
print(" - Reconstructs clean patterns from noisy inputs")
print(" - Generalizes to unseen test data")
print(" - Can be used for image restoration, data cleaning, etc.")
Variational Autoencoders (VAE)
Variational Autoencoders learn a probabilistic latent space, enabling them to generate new data. Unlike standard autoencoders, VAEs model the distribution of data rather than just compressing it.
Symbolic VAE Loss Derivation
import sympy as sp
from sympy import symbols, exp, log, sqrt, pi, summation, simplify
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("VARIATIONAL AUTOENCODER (VAE) - LOSS FUNCTION")
print("="*60)
# Define symbolic variables
x, z = symbols('x z', real=True) # Data and latent variable
mu, sigma = symbols('mu sigma', positive=True, real=True) # Encoder outputs
mu_z, sigma_z = symbols('mu_z sigma_z', real=True) # Prior parameters
print("\n1. VAE PROBABILISTIC FRAMEWORK")
print("-" * 60)
print("Encoder: q(z|x) ≈ p(z|x)")
print(" Maps input x to latent distribution")
print(" Outputs: μ(x), σ(x)")
print(" Latent: z ~ N(μ(x), σ²(x))")
print("\nDecoder: p(x|z)")
print(" Maps latent z to reconstruction")
print(" Outputs: x̂")
print("\nPrior: p(z) = N(0, I)")
print(" Standard normal distribution")
# Gaussian distribution formula
print("\n2. GAUSSIAN DISTRIBUTION (Encoder Output)")
print("-" * 60)
# Probability density function
gaussian = (1 / (sigma * sqrt(2 * pi))) * exp(-(z - mu)**2 / (2 * sigma**2))
print("q(z|x) = N(z; μ, σ²)")
print(f" = {gaussian}")
# Log probability (simpler for computation)
log_gaussian = log(1 / (sigma * sqrt(2 * pi))) - (z - mu)**2 / (2 * sigma**2)
log_gaussian_simplified = simplify(log_gaussian)
print(f"\nlog q(z|x) = {log_gaussian_simplified}")
# VAE loss components
print("\n3. VAE LOSS FUNCTION (ELBO)")
print("-" * 60)
print("VAE maximizes the Evidence Lower Bound (ELBO):")
print("")
print("L = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))")
print("")
print("Component 1: Reconstruction Loss")
print(" E_q[log p(x|z)] = Expected log-likelihood")
print(" ≈ -||x - x̂||² (MSE for a Gaussian decoder)")
print("")
print("Component 2: KL Divergence")
print(" D_KL(q(z|x) || p(z))")
print(" = How different is q(z|x) from the prior p(z)?")
# KL divergence formula (closed form for Gaussians)
print("\n4. KL DIVERGENCE (CLOSED FORM)")
print("-" * 60)
print("For q(z|x) = N(μ, σ²) and p(z) = N(0, 1):")
print("")
# Symbolic KL divergence
d = symbols('d', integer=True, positive=True)  # Latent dimension
mu_i, sigma_i = symbols('mu_i sigma_i', real=True)
i = symbols('i', integer=True)
print("D_KL = (1/2) × Σ_{i=1..d} [μ_i² + σ_i² - log(σ_i²) - 1]")
print("")
print("Per dimension:")
kl_per_dim = (mu_i**2 + sigma_i**2 - log(sigma_i**2) - 1) / 2
print(f" KL_i = {kl_per_dim}")
# Numerical example
print("\n5. NUMERICAL EXAMPLE")
print("-" * 60)
# Encoder outputs for a single data point
mu_val = np.array([0.5, -0.3])
sigma_val = np.array([1.2, 0.8])
print(f"Encoder outputs:")
print(f" μ = {mu_val}")
print(f" σ = {sigma_val}")
# KL divergence per dimension
kl_dims = 0.5 * (mu_val**2 + sigma_val**2 - np.log(sigma_val**2) - 1)
kl_total = np.sum(kl_dims)
print(f"\nKL divergence per dimension:")
for i, kl in enumerate(kl_dims):
    print(f" Dim {i}: μ={mu_val[i]:.2f}, σ={sigma_val[i]:.2f} → KL={kl:.4f}")
print(f"\nTotal KL divergence: {kl_total:.4f}")
# Reconstruction loss (example)
x_original = np.array([0.8, 0.9, 0.7, 0.6])
x_reconstructed = np.array([0.75, 0.88, 0.72, 0.58])
recon_loss = np.mean((x_original - x_reconstructed)**2)
print(f"\nReconstruction loss (MSE): {recon_loss:.6f}")
# Total VAE loss
beta = 1.0  # KL weight
vae_loss = recon_loss + beta * kl_total
print(f"\nTotal VAE loss:")
print(f" L = Recon + β×KL")
print(f"   = {recon_loss:.6f} + {beta}×{kl_total:.4f}")
print(f"   = {vae_loss:.6f}")
# Reparameterization trick
print("\n6. REPARAMETERIZATION TRICK")
print("-" * 60)
print("Challenge: Can't backprop through sampling z ~ N(μ, σ²)")
print("")
print("Solution: Reparameterize")
print(" Instead of: z ~ N(μ, σ²)")
print(" Use: z = μ + σ × ε, where ε ~ N(0, 1)")
print("")
print("Now the gradient flows through μ and σ!")
# Symbolic representation
epsilon = symbols('epsilon', real=True)
z_reparam = mu + sigma * epsilon
print(f"\nz = {z_reparam}, where ε ~ N(0, 1)")
print("\nGradients:")
dz_dmu = sp.diff(z_reparam, mu)
dz_dsigma = sp.diff(z_reparam, sigma)
print(f" ∂z/∂μ = {dz_dmu}")
print(f" ∂z/∂σ = {dz_dsigma}")
# Visualization
import matplotlib.pyplot as plt
# Generate samples from learned distribution vs prior
np.random.seed(42)
n_samples = 1000
# Prior N(0, 1)
prior_samples = np.random.randn(n_samples, 2)
# Learned distribution N(µ, s²)
learned_samples = mu_val + sigma_val * np.random.randn(n_samples, 2)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Prior
axes[0].scatter(prior_samples[:, 0], prior_samples[:, 1], alpha=0.3, color='#3B9797', s=10)
axes[0].set_xlim(-4, 4)
axes[0].set_ylim(-4, 4)
axes[0].axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[0].axvline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[0].set_title('Prior: p(z) = N(0, I)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('z1')
axes[0].set_ylabel('z2')
axes[0].grid(True, alpha=0.3)
# Learned
axes[1].scatter(learned_samples[:, 0], learned_samples[:, 1], alpha=0.3, color='#BF092F', s=10)
axes[1].scatter(mu_val[0], mu_val[1], color='#132440', s=200, marker='*',
edgecolor='white', linewidth=2, label='μ', zorder=5)
axes[1].set_xlim(-4, 4)
axes[1].set_ylim(-4, 4)
axes[1].axhline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[1].axvline(0, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[1].set_title(f'Learned: q(z|x) = N({mu_val}, diag({sigma_val}²))', fontsize=14, fontweight='bold')
axes[1].set_xlabel('z1')
axes[1].set_ylabel('z2')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey insights:")
print(" 1. VAE loss = Reconstruction + KL divergence")
print(" 2. KL divergence regularizes latent space (keeps it close to prior)")
print(" 3. Reparameterization trick enables backprop through sampling")
print(" 4. Lower KL → latent codes closer to N(0,1) → better generation")
print(" 5. Trade-off: Reconstruction accuracy vs. latent space regularity")
From Compression to Generation
Standard Autoencoder Limitation:
- Latent space may have "holes" with no meaning
- Can't smoothly interpolate between encodings
- Can't generate new samples (only reconstruct existing ones)
VAE Innovation:
- Encoder outputs distribution parameters (mean μ and variance σ²)
- Sample from distribution: z ~ N(μ, σ²)
- Decoder reconstructs from sampled z
- Regularization ensures smooth, continuous latent space
VAE Loss = Reconstruction Loss + KL Divergence
- Reconstruction loss: How well can we rebuild input?
- KL divergence: Keep latent distribution close to standard normal N(0,1)
Result: Can sample random z ~ N(0,1) and decode to generate NEW data!
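The reparameterized sampling step can be checked numerically: drawing z = μ + σ·ε with ε ~ N(0, 1) reproduces N(μ, σ²). A quick sketch (the values μ = 0.5, σ = 1.2 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.5, 1.2

# Reparameterization: deterministic transform of standard-normal noise
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# Sample statistics should match the target distribution N(mu, sigma^2)
print(f"sample mean ≈ {z.mean():.3f} (target {mu})")
print(f"sample std  ≈ {z.std():.3f} (target {sigma})")
```

Because `z` is a deterministic function of μ and σ given ε, gradients can flow into the encoder parameters even though the latent code is sampled.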
import numpy as np
class VariationalAutoencoder:
    """
    VAE: Learns probabilistic latent space for generation.
    """
    def __init__(self, input_size, latent_size, learning_rate=0.001):
        self.input_size = input_size
        self.latent_size = latent_size
        self.learning_rate = learning_rate
        # Encoder: input → (mu, log_var)
        hidden_size = 128
        self.W_enc_hidden = np.random.randn(input_size, hidden_size) * 0.01
        self.b_enc_hidden = np.zeros((1, hidden_size))
        # Mean and log-variance branches
        self.W_mu = np.random.randn(hidden_size, latent_size) * 0.01
        self.b_mu = np.zeros((1, latent_size))
        self.W_logvar = np.random.randn(hidden_size, latent_size) * 0.01
        self.b_logvar = np.zeros((1, latent_size))
        # Decoder: z → reconstruction
        self.W_dec_hidden = np.random.randn(latent_size, hidden_size) * 0.01
        self.b_dec_hidden = np.zeros((1, hidden_size))
        self.W_dec_out = np.random.randn(hidden_size, input_size) * 0.01
        self.b_dec_out = np.zeros((1, input_size))

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def encode(self, X):
        """
        Encode input to latent distribution parameters.
        Returns: mu, log_var
        """
        # Hidden layer
        h = self.relu(np.dot(X, self.W_enc_hidden) + self.b_enc_hidden)
        # Mean and log-variance
        mu = np.dot(h, self.W_mu) + self.b_mu
        log_var = np.dot(h, self.W_logvar) + self.b_logvar
        return mu, log_var

    def reparameterize(self, mu, log_var):
        """
        Reparameterization trick: z = mu + sigma * epsilon,
        where epsilon ~ N(0, 1).
        This allows backpropagation through sampling.
        """
        std = np.exp(0.5 * log_var)
        epsilon = np.random.randn(*std.shape)
        z = mu + std * epsilon
        return z

    def decode(self, z):
        """
        Decode latent vector to reconstruction.
        """
        # Hidden layer
        h = self.relu(np.dot(z, self.W_dec_hidden) + self.b_dec_hidden)
        # Output (sigmoid to ensure [0, 1])
        reconstruction = self.sigmoid(np.dot(h, self.W_dec_out) + self.b_dec_out)
        return reconstruction

    def forward(self, X):
        """Full forward pass"""
        # Encode
        self.mu, self.log_var = self.encode(X)
        # Sample latent vector
        self.z = self.reparameterize(self.mu, self.log_var)
        # Decode
        reconstruction = self.decode(self.z)
        return reconstruction, self.mu, self.log_var, self.z

    def compute_loss(self, X, reconstruction, mu, log_var):
        """
        VAE loss = Reconstruction loss + KL divergence.
        KL divergence: KL(N(mu, sigma^2) || N(0, 1))
        """
        # Reconstruction loss (binary cross-entropy)
        recon_loss = -np.sum(X * np.log(reconstruction + 1e-8) +
                             (1 - X) * np.log(1 - reconstruction + 1e-8))
        # KL divergence
        kl_loss = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
        total_loss = recon_loss + kl_loss
        return total_loss / X.shape[0], recon_loss / X.shape[0], kl_loss / X.shape[0]

    def generate(self, num_samples=1):
        """
        Generate new samples by sampling from N(0, 1) and decoding.
        """
        # Sample from standard normal prior
        z = np.random.randn(num_samples, self.latent_size)
        # Decode
        generated = self.decode(z)
        return generated
# Example usage
print("="*60)
print("VARIATIONAL AUTOENCODER (VAE)")
print("="*60)
input_size = 256 # 16×16 images
latent_size = 8 # 8-dimensional latent space
vae = VariationalAutoencoder(input_size=input_size, latent_size=latent_size)
print(f"VAE Architecture:")
print(f" Input: {input_size} pixels")
print(f" Encoder: {input_size} → 128 hidden → (mu, log_var) in {latent_size}D")
print(f" Reparameterization: z = mu + sigma * epsilon")
print(f" Decoder: {latent_size}D → 128 hidden → {input_size} pixels")
# Test forward pass
X_test = np.random.rand(5, input_size)
reconstruction, mu, log_var, z = vae.forward(X_test)
print(f"\nForward pass test:")
print(f" Input shape: {X_test.shape}")
print(f" Latent mu shape: {mu.shape}")
print(f" Latent log_var shape: {log_var.shape}")
print(f" Sampled z shape: {z.shape}")
print(f" Reconstruction shape: {reconstruction.shape}")
# Compute loss
total_loss, recon_loss, kl_loss = vae.compute_loss(X_test, reconstruction, mu, log_var)
print(f"\nLoss components:")
print(f" Reconstruction loss: {recon_loss:.4f}")
print(f" KL divergence: {kl_loss:.4f}")
print(f" Total loss: {total_loss:.4f}")
# Generate new samples
generated = vae.generate(num_samples=5)
print(f"\nGenerated samples shape: {generated.shape}")
print("\nVAE advantages:")
print(" - Smooth, continuous latent space")
print(" - Can generate NEW data (not just reconstruct)")
print(" - Can interpolate between samples")
print(" - Probabilistic interpretation")
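Interpolation between samples amounts to blending two latent codes and decoding each step. A minimal sketch (the `interpolate_latents` helper is illustrative; in practice each row would be passed to `vae.decode`):

```python
import numpy as np

def interpolate_latents(z1, z2, steps=5):
    """Return `steps` latent vectors blending z1 into z2 linearly."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.array([(1 - t) * z1 + t * z2 for t in ts])

z1 = np.zeros(8)   # latent code of sample A (toy values)
z2 = np.ones(8)    # latent code of sample B (toy values)
path = interpolate_latents(z1, z2, steps=5)
print(path.shape)  # (5, 8): endpoints plus three intermediate codes
```

Because the KL term keeps the latent space close to N(0, I), these intermediate codes fall in regions the decoder has seen, so the decoded images morph smoothly rather than passing through meaningless "holes".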
Autoencoders Deep Dive Summary
What We Built:
- ✓ Basic autoencoder with encoder-decoder architecture
- ✓ Training on dimensionality reduction task (20D → 3D)
- ✓ Denoising autoencoder for image restoration
- ✓ Variational autoencoder (VAE) for generation
- ✓ Visualizations of latent spaces and reconstructions
Key Insights:
- Basic AE: Learns compressed representation through bottleneck
- Denoising AE: Robust features by reconstructing clean from noisy
- VAE: Probabilistic latent space enables data generation
- Applications: Dimensionality reduction, denoising, anomaly detection, generation
Next: We'll dive into Generative Adversarial Networks (GANs) for even more powerful data generation!
Generative Adversarial Networks (GANs) - Deep Dive
GANs are one of the most exciting developments in deep learning. Two neural networks—a Generator and a Discriminator—compete in a game, and through this competition, the Generator learns to create incredibly realistic data.
The Adversarial Game
Symbolic Minimax Game Formulation
import sympy as sp
from sympy import symbols, log, exp, integrate, simplify, oo
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("GAN MINIMAX GAME - SYMBOLIC FORMULATION")
print("="*60)
# Define symbolic variables
x, z = symbols('x z', real=True) # Real data and latent noise
theta_d, theta_g = symbols('theta_D theta_G', real=True) # Parameters
print("\n1. GAN OBJECTIVE FUNCTION")
print("-" * 60)
print("Minimax game between Generator (G) and Discriminator (D):")
print("")
print("min max V(D, G)")
print(" G D")
print("")
print("where:")
print("V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]")
print("")
print("Components:")
print(" E_x[log D(x)] : Discriminator correctly identifies real data")
print(" E_z[log(1 - D(G(z)))] : Discriminator correctly rejects fake data")
# Symbolic discriminator output
D_x = symbols('D(x)', real=True, positive=True)  # D(x) ∈ (0, 1)
D_G_z = symbols('D(G(z))', real=True, positive=True)  # D(G(z)) ∈ (0, 1)
# Value function
V = log(D_x) + log(1 - D_G_z)
print(f"\nSymbolic V(D, G) = {V}")
print("\n2. DISCRIMINATOR'S OBJECTIVE (Maximize)")
print("-" * 60)
print("Discriminator wants to maximize V:")
print(" - Maximize log D(x) → D(x) → 1 (classify real as real)")
print(" - Maximize log(1 - D(G(z))) → D(G(z)) → 0 (classify fake as fake)")
# Optimal discriminator (closed form)
print("\nOptimal Discriminator (given fixed G):")
print("D*(x) = p_data(x) / (p_data(x) + p_g(x))")
print("")
print("Where:")
print(" p_data(x) = real data distribution")
print(" p_g(x) = generator distribution")
print("\n3. GENERATOR'S OBJECTIVE (Minimize)")
print("-" * 60)
print("Generator wants to minimize V:")
print(" - Minimize log(1 - D(G(z))) → D(G(z)) → 1 (fool the discriminator)")
print("")
print("Alternative (non-saturating) objective:")
print(" Maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))")
print(" (Stronger gradients early in training)")
# Numerical example: optimal discriminator
print("\n4. NUMERICAL EXAMPLE")
print("-" * 60)
# Probabilities at different points in input space
p_data_val = 0.8  # Real data density at this point
p_g_val = 0.2     # Generated data density at this point
D_optimal = p_data_val / (p_data_val + p_g_val)
print(f"At point x:")
print(f" p_data(x) = {p_data_val}")
print(f" p_g(x) = {p_g_val}")
print(f" D*(x) = {p_data_val}/{p_data_val + p_g_val} = {D_optimal:.3f}")
print("\n Interpretation: 80% real, 20% fake → D predicts 80% real")
# When the generator matches the data distribution
p_data_perfect = 0.5
p_g_perfect = 0.5
D_perfect = p_data_perfect / (p_data_perfect + p_g_perfect)
print(f"\nWhen G is perfect (p_g = p_data):")
print(f" p_data(x) = {p_data_perfect}")
print(f" p_g(x) = {p_g_perfect}")
print(f" D*(x) = {D_perfect:.3f}")
print("\n Discriminator can't tell real from fake (Nash equilibrium)!")
# Loss values
print("\n5. LOSS CALCULATIONS")
print("-" * 60)
# Discriminator loss on real data
D_real = 0.9  # Good discriminator
loss_real = -np.log(D_real)
print(f"Real data: D(x) = {D_real}")
print(f" Loss: -log({D_real}) = {loss_real:.4f}")
# Discriminator loss on fake data
D_fake_good_D = 0.1  # Good discriminator (correctly rejects fake)
D_fake_bad_D = 0.9   # Bad discriminator (fooled by fake)
loss_fake_good = -np.log(1 - D_fake_good_D)
loss_fake_bad = -np.log(1 - D_fake_bad_D)
print(f"\nFake data (good D): D(G(z)) = {D_fake_good_D}")
print(f" Loss: -log(1-{D_fake_good_D}) = {loss_fake_good:.4f}")
print(f"\nFake data (bad D): D(G(z)) = {D_fake_bad_D}")
print(f" Loss: -log(1-{D_fake_bad_D}) = {loss_fake_bad:.4f} (high loss!)")
# Generator loss
print("\n6. GENERATOR TRAINING")
print("-" * 60)
# Original (saturating) objective
loss_g_saturating = np.log(1 - D_fake_good_D)
print(f"Original objective: log(1 - D(G(z)))")
print(f" When D(G(z)) = {D_fake_good_D}: loss = {loss_g_saturating:.4f}")
# Non-saturating objective
loss_g_nonsaturating = -np.log(D_fake_good_D)
print(f"\nNon-saturating objective: -log D(G(z))")
print(f" When D(G(z)) = {D_fake_good_D}: loss = {loss_g_nonsaturating:.4f}")
print("\n Provides a stronger gradient when D is good at detecting fakes")
# Gradient comparison
import matplotlib.pyplot as plt
D_range = np.linspace(0.01, 0.99, 100)
saturating_loss = np.log(1 - D_range)
nonsaturating_loss = -np.log(D_range)
# Gradients
saturating_grad = -1 / (1 - D_range)
nonsaturating_grad = -1 / D_range
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Loss curves
axes[0].plot(D_range, saturating_loss, linewidth=2, color='#BF092F',
label='Saturating: log(1-D(G(z)))')
axes[0].plot(D_range, nonsaturating_loss, linewidth=2, color='#3B9797',
label='Non-saturating: -log D(G(z))')
axes[0].axvline(x=0.5, color='black', linestyle='--', linewidth=1, alpha=0.5, label='D=0.5 (equilibrium)')
axes[0].set_xlabel('D(G(z)) - Discriminator output on fake', fontsize=12)
axes[0].set_ylabel('Generator Loss', fontsize=12)
axes[0].set_title('GAN Generator Loss Functions', fontsize=14, fontweight='bold')
axes[0].legend(loc='upper right', fontsize=10)
axes[0].grid(True, alpha=0.3)
# Gradient curves
axes[1].plot(D_range, np.abs(saturating_grad), linewidth=2, color='#BF092F',
label='Saturating gradient')
axes[1].plot(D_range, nonsaturating_grad, linewidth=2, color='#3B9797',
label='Non-saturating gradient')
axes[1].axvline(x=0.5, color='black', linestyle='--', linewidth=1, alpha=0.5)
axes[1].set_xlabel('D(G(z)) - Discriminator output on fake', fontsize=12)
axes[1].set_ylabel('|Gradient| magnitude', fontsize=12)
axes[1].set_title('Generator Gradient Magnitude', fontsize=14, fontweight='bold')
axes[1].legend(loc='upper right', fontsize=10)
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim([0, 10])
# Annotate low D(G(z)) region
axes[1].annotate('Strong gradient\n(non-saturating)', xy=(0.1, 8), xytext=(0.3, 8),
arrowprops=dict(arrowstyle='->', color='#3B9797', lw=2),
fontsize=11, color='#3B9797', fontweight='bold')
plt.tight_layout()
plt.show()
print("\nKey insights:")
print(" 1. GAN = two-player minimax game")
print(" 2. Optimal D* knows the exact probability ratio of real vs fake")
print(" 3. At Nash equilibrium: D(x) = 0.5 everywhere (can't distinguish)")
print(" 4. Non-saturating loss provides stronger gradients early")
print(" 5. Training is a delicate balance (D too good → G can't learn)")
Understanding the GAN Game
Analogy: Art Forger vs. Detective
The Generator (Forger):
- Goal: Create fake paintings that look real
- Input: Random noise (like throwing paint randomly)
- Output: Fake painting
- Success metric: Fool the detective into thinking it's real
The Discriminator (Detective):
- Goal: Distinguish real paintings from fakes
- Input: Real or fake painting
- Output: Probability that painting is real (0 to 1)
- Success metric: Correctly identify real vs fake
The Competition:
- Generator creates fakes (initially terrible)
- Discriminator learns to spot them
- Generator improves to fool improved discriminator
- Discriminator gets better at detecting improved fakes
- Cycle continues until equilibrium: Generator creates perfect fakes!
Mathematical Formulation:
min_G max_D V(D, G) = E_x~p_data[log D(x)] + E_z~p_z[log(1 - D(G(z)))]
- Discriminator maximizes: log D(x) for real + log(1 - D(G(z))) for fake
- Generator minimizes: log(1 - D(G(z))) → wants D to output 1 (fooled!)
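Plugging the equilibrium values D(x) = D(G(z)) = 0.5 into the value function gives the classic optimum of -log 4, a quick sanity check:

```python
import numpy as np

# At Nash equilibrium the discriminator outputs 0.5 everywhere,
# so each expectation term in V(D, G) reduces to log(0.5).
D_real = 0.5
D_fake = 0.5
V_equilibrium = np.log(D_real) + np.log(1 - D_fake)
print(f"V at equilibrium = {V_equilibrium:.4f}")  # -1.3863 = -log 4
```

Any deviation of D from 0.5 (with a perfect generator) can only lower the discriminator's payoff, which is why -log 4 is the value of the game.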
import numpy as np
import matplotlib.pyplot as plt
# Visualize GAN training dynamics
def visualize_gan_concept():
    """
    Demonstrate how Generator and Discriminator improve over time.
    """
    # Simulate training progress
    epochs = np.arange(0, 101, 10)
    # Generator quality: starts low, improves
    generator_quality = 1 - np.exp(-epochs / 30)
    # Discriminator accuracy: starts high (easy to detect bad fakes),
    # decreases as generator improves, stabilizes at ~50% (can't tell difference)
    discriminator_accuracy = 0.95 - 0.45 * (1 - np.exp(-epochs / 25))
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    # Generator improvement
    ax1.plot(epochs, generator_quality, linewidth=3, color='#3B9797', marker='o', markersize=8)
    ax1.set_xlabel('Training Epoch', fontsize=12)
    ax1.set_ylabel('Generator Quality', fontsize=12)
    ax1.set_title('Generator: Learning to Create Realistic Data', fontsize=13, fontweight='bold')
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim([0, 1.1])
    ax1.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Perfect Quality')
    ax1.legend()
    # Discriminator accuracy
    ax2.plot(epochs, discriminator_accuracy, linewidth=3, color='#BF092F', marker='s', markersize=8)
    ax2.set_xlabel('Training Epoch', fontsize=12)
    ax2.set_ylabel('Discriminator Accuracy', fontsize=12)
    ax2.set_title('Discriminator: Ability to Detect Fakes', fontsize=13, fontweight='bold')
    ax2.grid(True, alpha=0.3)
    ax2.set_ylim([0, 1.1])
    ax2.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='Random Guess (Equilibrium)')
    ax2.legend()
    plt.tight_layout()
    plt.show()
    print("="*60)
    print("GAN TRAINING DYNAMICS")
    print("="*60)
    print("Early Training (Epoch 0-20):")
    print(" - Generator: Creates obvious fakes")
    print(" - Discriminator: Easily spots them (>90% accuracy)")
    print("\nMid Training (Epoch 20-50):")
    print(" - Generator: Improves quality")
    print(" - Discriminator: Gets challenged, accuracy drops")
    print("\nLate Training (Epoch 50+):")
    print(" - Generator: Creates realistic data")
    print(" - Discriminator: ~50% accuracy (can't tell real from fake!)")
    print("\nEquilibrium (Nash Equilibrium):")
    print(" Generator creates perfect fakes")
    print(" Discriminator can only guess randomly (50%)")
    print(" Training complete!")
visualize_gan_concept()
Building a GAN from Scratch
Let's implement a complete GAN with Generator and Discriminator networks. We'll train it to generate simple 2D data distributions.
import numpy as np
class Generator:
    """
    Generator network: Random noise → Fake data
    """
    def __init__(self, noise_dim, output_dim, hidden_dim=32):
        self.noise_dim = noise_dim
        self.output_dim = output_dim
        # Network: noise → hidden → output
        self.W1 = np.random.randn(noise_dim, hidden_dim) * 0.1
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = np.random.randn(hidden_dim, output_dim) * 0.1
        self.b2 = np.zeros((1, output_dim))

    def relu(self, x):
        return np.maximum(0, x)

    def forward(self, noise):
        """
        Generate fake data from noise.
        noise: Random vectors (batch_size, noise_dim)
        Returns: Fake data (batch_size, output_dim)
        """
        # Hidden layer
        self.z1 = np.dot(noise, self.W1) + self.b1
        self.a1 = self.relu(self.z1)
        # Output layer (no activation for real-valued data)
        self.output = np.dot(self.a1, self.W2) + self.b2
        return self.output

    def backward(self, noise, grad_output, learning_rate):
        """
        Backpropagate gradients and update weights.
        grad_output: Gradient from discriminator
        """
        batch_size = noise.shape[0]
        # Output layer gradients
        grad_W2 = np.dot(self.a1.T, grad_output) / batch_size
        grad_b2 = np.sum(grad_output, axis=0, keepdims=True) / batch_size
        # Hidden layer gradients
        grad_a1 = np.dot(grad_output, self.W2.T)
        grad_z1 = grad_a1 * (self.z1 > 0)  # ReLU derivative
        grad_W1 = np.dot(noise.T, grad_z1) / batch_size
        grad_b1 = np.sum(grad_z1, axis=0, keepdims=True) / batch_size
        # Update weights
        self.W1 -= learning_rate * grad_W1
        self.b1 -= learning_rate * grad_b1
        self.W2 -= learning_rate * grad_W2
        self.b2 -= learning_rate * grad_b2

class Discriminator:
    """
    Discriminator network: Data → Probability of being real
    """
    def __init__(self, input_dim, hidden_dim=32):
        self.input_dim = input_dim
        # Network: input → hidden → probability
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.1
        self.b1 = np.zeros((1, hidden_dim))
        self.W2 = np.random.randn(hidden_dim, 1) * 0.1
        self.b2 = np.zeros((1, 1))

    def relu(self, x):
        return np.maximum(0, x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, data):
        """
        Predict if data is real or fake.
        data: Input samples (batch_size, input_dim)
        Returns: Probability of being real (batch_size, 1)
        """
        # Hidden layer
        self.z1 = np.dot(data, self.W1) + self.b1
        self.a1 = self.relu(self.z1)
        # Output layer (sigmoid for probability)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.prob_real = self.sigmoid(self.z2)
        return self.prob_real

    def backward(self, data, grad_output, learning_rate):
        """
        Backpropagate gradients and update weights.
        """
        batch_size = data.shape[0]
        # Gradient through sigmoid
        grad_z2 = grad_output * self.prob_real * (1 - self.prob_real)
        # Output layer gradients
        grad_W2 = np.dot(self.a1.T, grad_z2) / batch_size
        grad_b2 = np.sum(grad_z2, axis=0, keepdims=True) / batch_size
        # Hidden layer gradients
        grad_a1 = np.dot(grad_z2, self.W2.T)
        grad_z1 = grad_a1 * (self.z1 > 0)  # ReLU derivative
        grad_W1 = np.dot(data.T, grad_z1) / batch_size
        grad_b1 = np.sum(grad_z1, axis=0, keepdims=True) / batch_size
        # Update weights
        self.W1 -= learning_rate * grad_W1
        self.b1 -= learning_rate * grad_b1
        self.W2 -= learning_rate * grad_W2
        self.b2 -= learning_rate * grad_b2
        # Return gradient w.r.t. the input, used to train the generator
        return np.dot(grad_z1, self.W1.T)
# Test the networks
print("="*60)
print("GAN ARCHITECTURE")
print("="*60)
noise_dim = 10
data_dim = 2
hidden_dim = 32
generator = Generator(noise_dim=noise_dim, output_dim=data_dim, hidden_dim=hidden_dim)
discriminator = Discriminator(input_dim=data_dim, hidden_dim=hidden_dim)
print(f"Generator:")
print(f" Input: {noise_dim}D random noise")
print(f" Hidden: {hidden_dim} neurons")
print(f" Output: {data_dim}D fake data")
print(f" Parameters: {generator.W1.size + generator.b1.size + generator.W2.size + generator.b2.size}")
print(f"\nDiscriminator:")
print(f" Input: {data_dim}D data (real or fake)")
print(f" Hidden: {hidden_dim} neurons")
print(f" Output: Probability (0=fake, 1=real)")
print(f" Parameters: {discriminator.W1.size + discriminator.b1.size + discriminator.W2.size + discriminator.b2.size}")
# Test forward pass
noise = np.random.randn(5, noise_dim)
fake_data = generator.forward(noise)
prob_real = discriminator.forward(fake_data)
print(f"\nTest forward pass:")
print(f" Generated fake data shape: {fake_data.shape}")
print(f" Discriminator predictions: {prob_real.flatten()}")
print(f" (Before training, predictions are random)")
print("\nTraining alternates between:")
print(" 1. Train Discriminator: Maximize ability to detect fakes")
print(" 2. Train Generator: Maximize ability to fool discriminator")
Training the GAN
Now let's train our GAN to generate 2D points that match a target distribution (e.g., a circle or mixture of Gaussians).
import numpy as np
import matplotlib.pyplot as plt
# Create target distribution: Circle
def sample_circle(num_samples, radius=2.0, noise=0.1):
"""Sample points in a circle."""
angles = np.random.uniform(0, 2*np.pi, num_samples)
radii = radius + np.random.randn(num_samples) * noise
x = radii * np.cos(angles)
y = radii * np.sin(angles)
return np.column_stack([x, y])
# Generate real data
real_data = sample_circle(1000)
# Visualize real data
plt.figure(figsize=(8, 8))
plt.scatter(real_data[:, 0], real_data[:, 1], alpha=0.5, s=20, color='#3B9797')
plt.title('Real Data Distribution (Circle)', fontsize=14, fontweight='bold')
plt.xlabel('X')
plt.ylabel('Y')
plt.axis('equal')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("="*60)
print("GAN TRAINING: Learning to Generate Circle Data")
print("="*60)
print(f"Real data samples: {len(real_data)}")
print(f"Real data range: X=[{real_data[:, 0].min():.2f}, {real_data[:, 0].max():.2f}], "
f"Y=[{real_data[:, 1].min():.2f}, {real_data[:, 1].max():.2f}]")
import numpy as np
import matplotlib.pyplot as plt
# Initialize GAN
noise_dim = 10
data_dim = 2
generator = Generator(noise_dim=noise_dim, output_dim=data_dim, hidden_dim=64)
discriminator = Discriminator(input_dim=data_dim, hidden_dim=64)
# Training parameters
epochs = 5000
batch_size = 64
d_learning_rate = 0.001
g_learning_rate = 0.001
# Track losses
d_losses = []
g_losses = []
print("\nTraining GAN...")
print("Epoch | D Loss | G Loss | D(real) | D(fake)")
print("-" * 50)
for epoch in range(epochs):
# ========================================
# 1. Train Discriminator
# ========================================
# Sample real data
indices = np.random.randint(0, len(real_data), batch_size)
real_batch = real_data[indices]
# Generate fake data
noise = np.random.randn(batch_size, noise_dim)
fake_batch = generator.forward(noise)
# Discriminator forward/backward on real data
# (backward must directly follow its own forward, because backward
# uses the activations cached by the most recent forward pass)
d_real = discriminator.forward(real_batch)
d_loss_real = -np.mean(np.log(d_real + 1e-8))
# Gradient: want D(real) → 1
grad_real = -(1 / (d_real + 1e-8)) / batch_size
discriminator.backward(real_batch, grad_real, d_learning_rate)
# Discriminator forward/backward on fake data
d_fake = discriminator.forward(fake_batch)
d_loss_fake = -np.mean(np.log(1 - d_fake + 1e-8))
# Gradient: want D(fake) → 0
grad_fake = (1 / (1 - d_fake + 1e-8)) / batch_size
discriminator.backward(fake_batch, grad_fake, d_learning_rate)
# Total discriminator loss: -[log(D(real)) + log(1 - D(fake))]
d_loss = d_loss_real + d_loss_fake
# ========================================
# 2. Train Generator
# ========================================
# Generate new fake data
noise = np.random.randn(batch_size, noise_dim)
fake_batch = generator.forward(noise)
# Discriminator's opinion on fake data
d_fake = discriminator.forward(fake_batch)
# Generator loss: -log(D(fake))
# Want discriminator to think fake is real (D(fake) → 1)
g_loss = -np.mean(np.log(d_fake + 1e-8))
# Generator backward pass
# Gradient flows from discriminator
grad_g = -(1 / (d_fake + 1e-8)) / batch_size
grad_data = discriminator.backward(fake_batch, grad_g, 0) # Don't update discriminator
generator.backward(noise, grad_data, g_learning_rate)
# Track losses
d_losses.append(d_loss)
g_losses.append(g_loss)
# Print progress
if (epoch + 1) % 1000 == 0:
print(f"{epoch+1:5d} | {d_loss:.4f} | {g_loss:.4f} | "
f"{d_real.mean():.4f} | {d_fake.mean():.4f}")
print("\nTraining complete!")
import numpy as np
import matplotlib.pyplot as plt
# Visualize training progress
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Training losses
axes[0, 0].plot(d_losses, label='Discriminator Loss', linewidth=2, color='#BF092F', alpha=0.7)
axes[0, 0].plot(g_losses, label='Generator Loss', linewidth=2, color='#3B9797', alpha=0.7)
axes[0, 0].set_xlabel('Iteration', fontsize=12)
axes[0, 0].set_ylabel('Loss', fontsize=12)
axes[0, 0].set_title('GAN Training Losses', fontsize=13, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Real data
axes[0, 1].scatter(real_data[:, 0], real_data[:, 1], alpha=0.5, s=20, color='#3B9797')
axes[0, 1].set_title('Real Data Distribution', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('X')
axes[0, 1].set_ylabel('Y')
axes[0, 1].axis('equal')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_xlim([-4, 4])
axes[0, 1].set_ylim([-4, 4])
# 3. Generated data (after training)
noise = np.random.randn(1000, noise_dim)
generated_data = generator.forward(noise)
axes[1, 0].scatter(generated_data[:, 0], generated_data[:, 1], alpha=0.5, s=20, color='#BF092F')
axes[1, 0].set_title('Generated Data (After Training)', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('X')
axes[1, 0].set_ylabel('Y')
axes[1, 0].axis('equal')
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_xlim([-4, 4])
axes[1, 0].set_ylim([-4, 4])
# 4. Overlay comparison
axes[1, 1].scatter(real_data[:, 0], real_data[:, 1], alpha=0.4, s=20,
color='#3B9797', label='Real Data')
axes[1, 1].scatter(generated_data[:, 0], generated_data[:, 1], alpha=0.4, s=20,
color='#BF092F', label='Generated Data')
axes[1, 1].set_title('Real vs Generated Data Overlay', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('X')
axes[1, 1].set_ylabel('Y')
axes[1, 1].axis('equal')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_xlim([-4, 4])
axes[1, 1].set_ylim([-4, 4])
plt.tight_layout()
plt.show()
print("="*60)
print("GAN TRAINING RESULTS")
print("="*60)
print(f"Final Discriminator loss: {d_losses[-1]:.4f}")
print(f"Final Generator loss: {g_losses[-1]:.4f}")
print(f"\nGenerated data statistics:")
print(f" Mean: [{generated_data[:, 0].mean():.3f}, {generated_data[:, 1].mean():.3f}]")
print(f" Std: [{generated_data[:, 0].std():.3f}, {generated_data[:, 1].std():.3f}]")
print(f"\nReal data statistics:")
print(f" Mean: [{real_data[:, 0].mean():.3f}, {real_data[:, 1].mean():.3f}]")
print(f" Std: [{real_data[:, 0].std():.3f}, {real_data[:, 1].std():.3f}]")
print("\nSuccess! Generator learned to create circle-shaped data")
print(" that matches the real data distribution!")
Training Challenges and Solutions
Common GAN Training Issues
1. Mode Collapse
- Problem: Generator produces only a few types of outputs (ignores diversity)
- Why: Generator finds one "easy win" that fools discriminator, sticks with it
- Example: Asked to generate digits 0-9, only generates 1's
- Solution: Minibatch discrimination, Wasserstein GAN, feature matching
2. Vanishing Gradients
- Problem: If the discriminator gets too good, the generator's gradient → 0
- Why: log(1 - D(G(z))) saturates when D(G(z)) → 0
- Solution: Use -log(D(G(z))) instead of log(1-D(G(z))) for generator loss
3. Training Instability
- Problem: Losses oscillate wildly, networks don't converge
- Why: Generator and discriminator in arms race, no stable equilibrium
- Solution: Careful learning rates, architectural choices, regularization
4. Discriminator Dominance
- Problem: Discriminator too strong, always wins
- Solution: Train discriminator less frequently, use one-sided label smoothing
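One-sided label smoothing (issue 4's second fix) takes only a few lines. The sketch below is a minimal illustration, not part of the training loop above: the discriminator's real-sample target is softened from 1.0 to a value like 0.9 (0.9 is the conventional choice, an assumption here), while fake targets stay at 0 — hence "one-sided".

```python
import numpy as np

def d_loss_smoothed(d_real, d_fake, real_target=0.9):
    """Discriminator BCE loss with one-sided label smoothing.

    Real predictions are pushed toward `real_target` (e.g. 0.9) instead
    of 1.0, which keeps the discriminator from becoming overconfident.
    Fake targets remain 0, so only the real side is smoothed.
    """
    eps = 1e-8
    loss_real = -np.mean(real_target * np.log(d_real + eps)
                         + (1 - real_target) * np.log(1 - d_real + eps))
    loss_fake = -np.mean(np.log(1 - d_fake + eps))
    return loss_real + loss_fake

d_real = np.array([[0.8], [0.95]])
d_fake = np.array([[0.2], [0.1]])
print(f"Smoothed D loss: {d_loss_smoothed(d_real, d_fake):.4f}")
```

With smoothing, the loss is minimized when D(real) ≈ 0.9 rather than 1.0, so an overconfident prediction like 0.999 is actually penalized more than 0.9.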
import numpy as np
import matplotlib.pyplot as plt
# Demonstrate mode collapse
def demonstrate_mode_collapse():
"""
Show what happens when generator ignores diversity.
"""
# Real data: mixture of 4 Gaussians (4 modes)
def sample_mixture_of_gaussians(n_samples):
centers = [[-2, -2], [2, -2], [-2, 2], [2, 2]]
data = []
for _ in range(n_samples):
# Randomly pick a center
center = centers[np.random.randint(0, len(centers))]
# Sample from Gaussian around that center
point = center + np.random.randn(2) * 0.3
data.append(point)
return np.array(data)
# Real data with 4 modes
real_data = sample_mixture_of_gaussians(400)
# Simulate mode collapse: Generator only learns 1 mode
collapsed_data = np.random.randn(400, 2) * 0.3 + np.array([2, 2])
# Healthy GAN: Captures all modes
healthy_data = sample_mixture_of_gaussians(400)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Real data
axes[0].scatter(real_data[:, 0], real_data[:, 1], alpha=0.6, s=30, color='#3B9797')
axes[0].set_title('Real Data\n(4 Modes)', fontsize=12, fontweight='bold')
axes[0].set_xlim([-4, 4])
axes[0].set_ylim([-4, 4])
axes[0].grid(True, alpha=0.3)
axes[0].axis('equal')
# Mode collapse
axes[1].scatter(collapsed_data[:, 0], collapsed_data[:, 1], alpha=0.6, s=30, color='#BF092F')
axes[1].set_title('Mode Collapse\n(Only 1 Mode)', fontsize=12, fontweight='bold')
axes[1].set_xlim([-4, 4])
axes[1].set_ylim([-4, 4])
axes[1].grid(True, alpha=0.3)
axes[1].axis('equal')
# Healthy GAN
axes[2].scatter(healthy_data[:, 0], healthy_data[:, 1], alpha=0.6, s=30, color='#16476A')
axes[2].set_title('Healthy GAN\n(All 4 Modes)', fontsize=12, fontweight='bold')
axes[2].set_xlim([-4, 4])
axes[2].set_ylim([-4, 4])
axes[2].grid(True, alpha=0.3)
axes[2].axis('equal')
plt.tight_layout()
plt.show()
print("="*60)
print("MODE COLLAPSE DEMONSTRATION")
print("="*60)
print("Real data has 4 distinct clusters (modes)")
print("\nMode Collapse:")
print(" - Generator only learns ONE mode")
print(" - Ignores diversity in real data")
print(" - All generated samples look similar")
print("\nHealthy GAN:")
print(" - Generator captures ALL modes")
print(" - Generated data has same diversity as real data")
print("\nDetecting mode collapse:")
print(" - Visually inspect generated samples")
print(" - Check diversity metrics (inception score, FID)")
print(" - Compare coverage of real vs generated distributions")
demonstrate_mode_collapse()
Advanced GAN Architectures
Evolution of GANs
1. DCGAN (Deep Convolutional GAN)
- Uses convolutional layers instead of fully connected
- Generator: upsampling (transposed) convolutions; Discriminator: downsampling convolutions
- Batch normalization for stability
- Best for image generation
2. WGAN (Wasserstein GAN)
- Uses Wasserstein distance instead of JS divergence
- Critic (not discriminator) outputs unbounded score
- Much more stable training
- Meaningful loss metric (correlates with quality)
3. StyleGAN
- Controls style at different levels (coarse to fine)
- Mapping network + synthesis network
- Incredible photorealistic faces, art
- Basis for many creative applications
4. Conditional GAN (cGAN)
- Conditions generation on labels or other data
- Example: Generate "dog" vs "cat" based on label
- Enables controlled generation
- Used in image-to-image translation (pix2pix)
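The WGAN idea in item 2 can be sketched numerically. This is an illustrative fragment, not a full WGAN: the critic outputs unbounded scores (no sigmoid), the losses are simple differences of mean scores, and weight clipping (the original paper's 0.01 clip value, assumed here) crudely enforces the Lipschitz constraint.

```python
import numpy as np

def wgan_losses(score_real, score_fake):
    """WGAN objectives. The critic maximizes E[score(real)] - E[score(fake)],
    so we minimize its negation; the generator minimizes -E[score(fake)]."""
    critic_loss = -(np.mean(score_real) - np.mean(score_fake))
    gen_loss = -np.mean(score_fake)
    return critic_loss, gen_loss

def clip_weights(weight_list, c=0.01):
    """Weight clipping: crude enforcement of the 1-Lipschitz constraint."""
    return [np.clip(W, -c, c) for W in weight_list]

# Unbounded critic scores — unlike the discriminator's [0, 1] probabilities
score_real = np.array([1.5, 2.0, 1.8])
score_fake = np.array([-0.5, 0.2, -1.0])
c_loss, g_loss = wgan_losses(score_real, score_fake)
print(f"Critic loss: {c_loss:.3f}, Generator loss: {g_loss:.3f}")
```

Because the critic loss tracks the Wasserstein distance estimate, it tends to decrease monotonically as sample quality improves — unlike BCE losses, which oscillate.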
import numpy as np
# Conceptual comparison of GAN variants
def compare_gan_variants():
"""
Compare key characteristics of different GAN types.
"""
variants = {
'Vanilla GAN': {
'loss': 'Binary Cross-Entropy',
'stability': 'Low',
'quality': 'Low',
'training_speed': 'Fast',
'use_case': 'Simple 2D distributions, learning'
},
'DCGAN': {
'loss': 'Binary Cross-Entropy',
'stability': 'Medium',
'quality': 'Medium',
'training_speed': 'Medium',
'use_case': 'Image generation, general vision'
},
'WGAN': {
'loss': 'Wasserstein Distance',
'stability': 'High',
'quality': 'Medium',
'training_speed': 'Slow',
'use_case': 'Stable training needed, research'
},
'StyleGAN': {
'loss': 'Modified WGAN-GP',
'stability': 'High',
'quality': 'Very High',
'training_speed': 'Very Slow',
'use_case': 'High-quality faces, art generation'
},
'Conditional GAN': {
'loss': 'Binary Cross-Entropy (conditional)',
'stability': 'Medium',
'quality': 'Medium',
'training_speed': 'Medium',
'use_case': 'Controlled generation, image translation'
}
}
print("="*70)
print("GAN VARIANTS COMPARISON")
print("="*70)
print(f"{'Variant':<20} {'Loss':<30} {'Stability':<12} {'Quality':<12}")
print("-"*70)
for name, props in variants.items():
print(f"{name:<20} {props['loss']:<30} {props['stability']:<12} {props['quality']:<12}")
print("\n" + "="*70)
print("DETAILED USE CASES")
print("="*70)
for name, props in variants.items():
print(f"\n{name}:")
print(f" Use Case: {props['use_case']}")
print(f" Training Speed: {props['training_speed']}")
print("\nChoosing a GAN variant:")
print(" - Starting out? Vanilla GAN or DCGAN")
print(" - Need stability? WGAN")
print(" - Need quality? StyleGAN (but requires resources)")
print(" - Need control? Conditional GAN")
compare_gan_variants()
GANs Deep Dive Summary
What We Built:
- ✓ Complete Generator and Discriminator networks from scratch
- ✓ Adversarial training loop with alternating updates
- ✓ 2D data generation (circle distribution)
- ✓ Training dynamics visualization
- ✓ Mode collapse demonstration
- ✓ Comparison of GAN variants (DCGAN, WGAN, StyleGAN, cGAN)
Key Insights:
- Adversarial game: Generator vs Discriminator competition drives learning
- Nash equilibrium: Training succeeds when discriminator can't tell real from fake
- Mode collapse: Generator may ignore diversity, only produce similar outputs
- Training challenges: Instability, vanishing gradients, balance issues
- Modern variants: WGAN, StyleGAN solve many stability and quality issues
Applications:
- Image generation (faces, art, scenes)
- Data augmentation (create training data)
- Image-to-image translation (style transfer, colorization)
- Super-resolution (enhance image quality)
- Text-to-image (DALL-E, Stable Diffusion concepts)
Next: We'll explore Transformers, the architecture behind GPT and BERT!
Transformers - Deep Dive
Transformers revolutionized deep learning, powering GPT, BERT, ChatGPT, and modern AI systems. They replaced RNNs for sequence tasks by introducing the attention mechanism—a way for models to focus on relevant parts of the input, regardless of distance.
The Attention Mechanism
What is Attention?
Analogy: Reading a Research Paper
- You're reading a sentence: "The cat, which was sleeping on the mat, woke up."
- To understand "woke up", you need to remember "the cat" (not "the mat")
- Your brain attends to relevant words, ignoring others
- Attention = learned focus on important information
RNN Problem:
- Sequential processing: must go through every word one-by-one
- Long-range dependencies difficult (vanishing gradients)
- Can't parallelize (each step depends on previous)
Attention Solution:
- Look at ALL words simultaneously
- Compute relevance scores: how much should word i attend to word j?
- Weighted sum based on relevance
- Fully parallelizable → much faster training
Key Insight: "Attention is all you need" — no recurrence, just attention!
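The sequential-vs-parallel contrast above can be made concrete with a toy sketch (shapes and weights here are arbitrary illustrations): an RNN-style pass needs a Python loop where each hidden state waits for the previous one, while an attention-style pass scores all token pairs in a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                      # 6 tokens, 4-dimensional embeddings
X = rng.standard_normal((n, d))  # token embeddings
W = rng.standard_normal((d, d)) * 0.1

# RNN-style: inherently sequential — step t cannot start before step t-1
h = np.zeros(d)
for t in range(n):
    h = np.tanh(X[t] + W @ h)    # each step depends on the previous hidden state

# Attention-style: all pairwise relevance scores in one matrix product
scores = X @ X.T / np.sqrt(d)    # (n, n): token i vs token j, computed at once
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
context = weights @ X            # every output row computed in parallel

print(context.shape)  # (6, 4)
```

The loop has a hard data dependency between iterations; the matrix products have none, which is exactly what lets GPUs parallelize attention across all positions.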
import numpy as np
import matplotlib.pyplot as plt
# Simple attention visualization
def visualize_attention_concept():
"""
Demonstrate how attention focuses on relevant words.
"""
sentence = ["The", "cat", "sat", "on", "the", "mat"]
# Manual attention weights: when predicting "sat", what to attend to?
# High weight on "cat" (subject), lower on others
attention_for_sat = np.array([0.1, 0.6, 0.0, 0.1, 0.1, 0.1])
# When predicting "mat", attend to "on" (preposition context)
attention_for_mat = np.array([0.05, 0.1, 0.15, 0.5, 0.1, 0.1])
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# Attention for "sat"
axes[0].bar(sentence, attention_for_sat, color='#3B9797', alpha=0.7, edgecolor='black')
axes[0].set_title('Attention When Predicting "sat"\n(Focus on subject "cat")',
fontsize=12, fontweight='bold')
axes[0].set_ylabel('Attention Weight', fontsize=11)
axes[0].set_ylim([0, 0.7])
axes[0].grid(True, alpha=0.3, axis='y')
# Attention for "mat"
axes[1].bar(sentence, attention_for_mat, color='#BF092F', alpha=0.7, edgecolor='black')
axes[1].set_title('Attention When Predicting "mat"\n(Focus on preposition "on")',
fontsize=12, fontweight='bold')
axes[1].set_ylabel('Attention Weight', fontsize=11)
axes[1].set_ylim([0, 0.7])
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("="*60)
print("ATTENTION MECHANISM: Focus on Relevant Information")
print("="*60)
print(f"Sentence: {' '.join(sentence)}")
print("\nWhen predicting 'sat':")
print(f" Highest attention: 'cat' ({attention_for_sat[1]:.1f})")
print(f" → Makes sense: 'cat' is the subject doing the action")
print("\nWhen predicting 'mat':")
print(f" Highest attention: 'on' ({attention_for_mat[3]:.1f})")
print(f" → Makes sense: 'on' provides locational context")
print("\nKey idea:")
print(" - Different words attend to different parts of the sentence")
print(" - Attention weights are LEARNED during training")
print(" - No recurrence needed — all positions processed in parallel!")
visualize_attention_concept()
Self-Attention Implementation
Self-Attention (also called Scaled Dot-Product Attention) is the core mechanism. Each word creates three vectors: Query (what I'm looking for), Key (what I contain), and Value (what I output).
Symbolic Attention Formula Derivation
import sympy as sp
from sympy import symbols, Matrix, exp, sqrt, summation, IndexedBase, Function
from sympy import simplify, latex
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("ATTENTION MECHANISM - SYMBOLIC DERIVATION")
print("="*60)
# Define symbolic variables
i, j, k = symbols('i j k', integer=True)
n, d_k = symbols('n d_k', integer=True, positive=True) # sequence length, key dimension
# Indexed bases for matrices
Q = IndexedBase('Q') # Query matrix
K = IndexedBase('K') # Key matrix
V = IndexedBase('V') # Value matrix
A = IndexedBase('A') # Attention weights
print("\n1. SCALED DOT-PRODUCT ATTENTION FORMULA")
print("-" * 60)
print("Attention(Q, K, V) = softmax(QK^T / √d_k) V")
print("")
print("Where:")
print(" Q = Query matrix (n × d_k)")
print(" K = Key matrix (n × d_k)")
print(" V = Value matrix (n × d_v)")
print(" n = sequence length")
print(" d_k = dimension of keys/queries")
# Step 1: Compute similarity scores
print("\n2. STEP-BY-STEP DERIVATION")
print("-" * 60)
print("\nStep 1: Compute similarity scores (dot products)")
print(" S[i,j] = Q[i,:] · K[j,:] = Σ Q[i,k] × K[j,k]")
print(" k=1 to d_k")
# Create symbolic 2x2 example
print("\nExample (2 tokens, d_k=3):")
q1_1, q1_2, q1_3 = symbols('q_{1,1} q_{1,2} q_{1,3}')
q2_1, q2_2, q2_3 = symbols('q_{2,1} q_{2,2} q_{2,3}')
Q_matrix = Matrix([
[q1_1, q1_2, q1_3],
[q2_1, q2_2, q2_3]
])
k1_1, k1_2, k1_3 = symbols('k_{1,1} k_{1,2} k_{1,3}')
k2_1, k2_2, k2_3 = symbols('k_{2,1} k_{2,2} k_{2,3}')
K_matrix = Matrix([
[k1_1, k1_2, k1_3],
[k2_1, k2_2, k2_3]
])
print(f"\nQ = ")
for row in range(2):
print(f" {Q_matrix[row,:]}")
print(f"\nK = ")
for row in range(2):
print(f" {K_matrix[row,:]}")
# Compute QK^T
scores = Q_matrix * K_matrix.T
print(f"\nScores S = QK^T:")
for row in range(2):
print(f" S[{row+1},:] = {scores[row,:]}")
# Step 2: Scale by sqrt(d_k)
print("\nStep 2: Scale by √d_k (prevents large values in softmax)")
d_k_sym = symbols('d_k', positive=True)
scaled_scores = scores / sqrt(d_k_sym)
print(f" Scaled[i,j] = S[i,j] / √{d_k_sym}")
print(f"\nWhy scale? Large dot products → extreme softmax → vanishing gradients")
# Step 3: Softmax
print("\nStep 3: Apply softmax (row-wise)")
print(" For each query position i:")
print(" α[i,j] = exp(Scaled[i,j]) / Σ exp(Scaled[i,k])")
print(" k=1 to n")
print("")
print(" Result: attention weights (how much to attend to each position)")
print(" Properties: α[i,j] ∈ [0,1], Σ_j α[i,j] = 1")
# Symbolic softmax for first row
s11, s12 = symbols('s_{11} s_{12}', real=True)
exp_s11 = exp(s11)
exp_s12 = exp(s12)
alpha_11 = exp_s11 / (exp_s11 + exp_s12)
alpha_12 = exp_s12 / (exp_s11 + exp_s12)
print(f"\nExample (first query):")
print(f" α[1,1] = exp(s_{{1,1}}) / (exp(s_{{1,1}}) + exp(s_{{1,2}}))")
print(f" = {alpha_11}")
print(f"\n α[1,2] = exp(s_{{1,2}}) / (exp(s_{{1,1}}) + exp(s_{{1,2}}))")
print(f" = {alpha_12}")
print(f"\n Sum: α[1,1] + α[1,2] = {simplify(alpha_11 + alpha_12)}")
# Step 4: Weighted sum of values
print("\nStep 4: Weighted sum of Values")
print(" Output[i,:] = Σ α[i,j] × V[j,:]")
print(" j=1 to n")
print("")
print(" Each output is a weighted combination of all value vectors")
print(" Weights determined by query-key similarity")
# Numerical example
print("\n3. NUMERICAL EXAMPLE")
print("-" * 60)
# Simple 2x2 case
Q_num = np.array([[1.0, 0.0], [0.0, 1.0]])
K_num = np.array([[1.0, 0.0], [0.0, 1.0]])
V_num = np.array([[10.0, 20.0], [30.0, 40.0]])
d_k_num = 2
print(f"Q = \n{Q_num}")
print(f"\nK = \n{K_num}")
print(f"\nV = \n{V_num}")
print(f"\nd_k = {d_k_num}")
# Compute attention
scores_num = Q_num @ K_num.T
print(f"\nScores (QK^T) = \n{scores_num}")
scaled_scores_num = scores_num / np.sqrt(d_k_num)
print(f"\nScaled scores (÷√{d_k_num}) = \n{scaled_scores_num}")
# Softmax
exp_scores = np.exp(scaled_scores_num - np.max(scaled_scores_num, axis=1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
print(f"\nAttention weights (softmax) = \n{attention_weights}")
print(f"Row sums: {attention_weights.sum(axis=1)}")
# Output
output = attention_weights @ V_num
print(f"\nOutput (attention × V) = \n{output}")
print("\n4. INTERPRETATION")
print("-" * 60)
print("Query 1 ([1,0]):")
print(f" Attends to Key 1 with weight {attention_weights[0,0]:.3f}")
print(f" Attends to Key 2 with weight {attention_weights[0,1]:.3f}")
print(f" Output: {output[0]} (mostly Value 1)")
print("\nQuery 2 ([0,1]):")
print(f" Attends to Key 1 with weight {attention_weights[1,0]:.3f}")
print(f" Attends to Key 2 with weight {attention_weights[1,1]:.3f}")
print(f" Output: {output[1]} (mostly Value 2)")
print("\nKey insights:")
print(" 1. Attention = learned weighted sum")
print(" 2. Weights based on query-key similarity")
print(" 3. Scaling prevents saturation in softmax")
print(" 4. Output is context-aware combination of values")
print(" 5. Fully differentiable → learnable via backprop!")
Query, Key, Value: The Attention Trinity
Analogy: Library Search
- Query (Q): Your search question ("books about neural networks")
- Key (K): Book titles/metadata (what each book is about)
- Value (V): Book contents (actual information you retrieve)
Process:
- Compare Query with all Keys → similarity scores
- Apply softmax → attention weights (sum to 1)
- Weighted sum of Values → output
Formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
- QK^T: Dot product = similarity scores
- √d_k: Scale factor (prevents large values)
- softmax: Convert to probabilities
- × V: Weighted sum of values
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
"""
Scaled Dot-Product Attention (core of Transformers).
Q: Query matrix (seq_len, d_k)
K: Key matrix (seq_len, d_k)
V: Value matrix (seq_len, d_v)
mask: Optional mask to prevent attending to certain positions
Returns: Output (seq_len, d_v), Attention weights (seq_len, seq_len)
"""
d_k = Q.shape[-1] # Dimension of keys
# 1. Compute attention scores (similarity between queries and keys)
scores = np.dot(Q, K.T) / np.sqrt(d_k)
# 2. Apply mask if provided (e.g., for padding or causal masking)
if mask is not None:
scores = scores + (mask * -1e9)
# 3. Softmax to get attention weights (probabilities)
exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
# 4. Weighted sum of values
output = np.dot(attention_weights, V)
return output, attention_weights
# Example: 4-word sentence with 8-dimensional embeddings
print("="*60)
print("SCALED DOT-PRODUCT ATTENTION")
print("="*60)
sentence = ["The", "cat", "sat", "down"]
seq_len = len(sentence)
d_model = 8 # Embedding dimension
# Random embeddings for demonstration
embeddings = np.random.randn(seq_len, d_model)
# Linear projections to get Q, K, V (in real transformers, these are learned)
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
Q = np.dot(embeddings, W_Q)
K = np.dot(embeddings, W_K)
V = np.dot(embeddings, W_V)
print(f"Sentence: {sentence}")
print(f"Sequence length: {seq_len}")
print(f"Embedding dimension: {d_model}")
print(f"\nMatrix shapes:")
print(f" Q (Query): {Q.shape}")
print(f" K (Key): {K.shape}")
print(f" V (Value): {V.shape}")
# Apply attention
output, attention_weights = scaled_dot_product_attention(Q, K, V)
print(f"\nOutput shape: {output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
# Visualize attention weights
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
plt.imshow(attention_weights, cmap='viridis', aspect='auto')
plt.colorbar(label='Attention Weight')
plt.xlabel('Key Position (attending to)', fontsize=11)
plt.ylabel('Query Position (attending from)', fontsize=11)
plt.title('Self-Attention Weights\n(How much each word attends to every other word)',
fontsize=12, fontweight='bold')
plt.xticks(range(seq_len), sentence)
plt.yticks(range(seq_len), sentence)
# Add values as text
for i in range(seq_len):
for j in range(seq_len):
plt.text(j, i, f'{attention_weights[i, j]:.2f}',
ha='center', va='center', color='white', fontsize=10)
plt.tight_layout()
plt.show()
print("\n" + "="*60)
print("ATTENTION WEIGHTS INTERPRETATION")
print("="*60)
for i, word in enumerate(sentence):
attended_to = np.argmax(attention_weights[i])
max_weight = attention_weights[i, attended_to]
print(f"'{word}' attends most to '{sentence[attended_to]}' ({max_weight:.3f})")
print("\nEach row sums to 1.0 (softmax normalization)")
print(f" Row 0 sum: {attention_weights[0].sum():.4f}")
print(f" Row 1 sum: {attention_weights[1].sum():.4f}")
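The `mask` argument above is what enables causal (autoregressive) attention, as used in GPT-style decoders: an upper-triangular mask blocks every position from attending to later positions. A standalone sketch (the same attention computation is restated here so the snippet runs on its own):

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention with an optional additive mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask * -1e9       # masked positions become ~ -inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)
    return w @ V, w

seq_len, d = 4, 8
rng = np.random.default_rng(1)
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

# Causal mask: 1 above the diagonal = "future position, do not attend"
causal_mask = np.triu(np.ones((seq_len, seq_len)), k=1)
_, w = attention(Q, K, V, mask=causal_mask)

print(np.round(w, 3))
# Upper triangle is (numerically) zero: position i only sees positions <= i
```

Position 0 can only attend to itself, so its entire weight lands on itself; each later row spreads its weight over strictly earlier-or-equal positions.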
Multi-Head Attention
Multi-Head Attention runs multiple attention mechanisms in parallel. Each "head" can learn different types of relationships (syntax, semantics, long-range dependencies, etc.).
import numpy as np
class MultiHeadAttention:
"""
Multi-Head Attention: Multiple attention mechanisms in parallel.
Each head can attend to different aspects of the input.
"""
def __init__(self, d_model, num_heads):
"""
d_model: Embedding dimension (must be divisible by num_heads)
num_heads: Number of attention heads
"""
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
self.d_model = d_model
self.num_heads = num_heads
self.d_k = d_model // num_heads # Dimension per head
# Projection matrices for Q, K, V (learned parameters)
self.W_Q = np.random.randn(d_model, d_model) * 0.01
self.W_K = np.random.randn(d_model, d_model) * 0.01
self.W_V = np.random.randn(d_model, d_model) * 0.01
# Output projection (combines heads)
self.W_O = np.random.randn(d_model, d_model) * 0.01
def split_heads(self, X):
"""
Split into multiple heads.
X: (seq_len, d_model)
Returns: (num_heads, seq_len, d_k)
"""
seq_len = X.shape[0]
# Reshape: (seq_len, num_heads, d_k)
X = X.reshape(seq_len, self.num_heads, self.d_k)
# Transpose: (num_heads, seq_len, d_k)
return X.transpose(1, 0, 2)
def combine_heads(self, X):
"""
Combine multiple heads back.
X: (num_heads, seq_len, d_k)
Returns: (seq_len, d_model)
"""
# Transpose: (seq_len, num_heads, d_k)
X = X.transpose(1, 0, 2)
seq_len = X.shape[0]
# Reshape: (seq_len, d_model)
return X.reshape(seq_len, self.d_model)
def forward(self, X):
"""
Apply multi-head attention.
X: Input (seq_len, d_model)
Returns: Output (seq_len, d_model), Attention weights per head
"""
# 1. Linear projections
Q = np.dot(X, self.W_Q)
K = np.dot(X, self.W_K)
V = np.dot(X, self.W_V)
# 2. Split into multiple heads
Q_heads = self.split_heads(Q) # (num_heads, seq_len, d_k)
K_heads = self.split_heads(K)
V_heads = self.split_heads(V)
# 3. Apply scaled dot-product attention for each head
head_outputs = []
all_attention_weights = []
for i in range(self.num_heads):
output, attn_weights = scaled_dot_product_attention(
Q_heads[i], K_heads[i], V_heads[i]
)
head_outputs.append(output)
all_attention_weights.append(attn_weights)
# 4. Concatenate heads
head_outputs = np.array(head_outputs) # (num_heads, seq_len, d_k)
concatenated = self.combine_heads(head_outputs) # (seq_len, d_model)
# 5. Final linear projection
output = np.dot(concatenated, self.W_O)
return output, all_attention_weights
# Example usage
print("="*60)
print("MULTI-HEAD ATTENTION")
print("="*60)
d_model = 64
num_heads = 8
seq_len = 4
print(f"Model dimension: {d_model}")
print(f"Number of heads: {num_heads}")
print(f"Dimension per head: {d_model // num_heads}")
# Create multi-head attention
mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
# Input: sentence embeddings
X = np.random.randn(seq_len, d_model)
# Forward pass
output, attention_weights = mha.forward(X)
print(f"\nInput shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Number of attention weight matrices: {len(attention_weights)} (one per head)")
print(f"Each attention weight matrix shape: {attention_weights[0].shape}")
# Visualize attention from different heads
sentence = ["The", "cat", "sat", "down"]
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()
for head_idx in range(num_heads):
ax = axes[head_idx]
im = ax.imshow(attention_weights[head_idx], cmap='viridis', aspect='auto')
ax.set_title(f'Head {head_idx+1}', fontsize=11, fontweight='bold')
ax.set_xticks(range(seq_len))
ax.set_yticks(range(seq_len))
ax.set_xticklabels(sentence, fontsize=9)
ax.set_yticklabels(sentence, fontsize=9)
# Add colorbar
plt.colorbar(im, ax=ax, fraction=0.046)
plt.suptitle('Multi-Head Attention: Different Heads Learn Different Patterns',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nWhy multiple heads?")
print(" - Different heads can focus on different relationships:")
print(" * Head 1: Syntactic dependencies (subject-verb)")
print(" * Head 2: Semantic relationships (word meanings)")
print(" * Head 3: Long-range dependencies")
print(" - Enriches representation with diverse perspectives")
print(" - Empirically improves performance significantly")
Positional Encoding
Since attention has no inherent notion of order (it's permutation-invariant), we must add positional encodings to tell the model where each word is in the sequence.
Why Positional Encoding?
Problem: Attention is Order-Agnostic
- "The cat sat on the mat" and "mat the on sat cat The" would produce identical attention!
- But word order matters: "Dog bites man" ≠ "Man bites dog"
Solution: Add Position Information
- Add position-dependent vectors to embeddings
- Use sine/cosine functions with different frequencies
- Allows model to learn relative positions
Positional Encoding Formula:
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
- PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- pos: position in sequence
- i: dimension index
import numpy as np
import matplotlib.pyplot as plt
def positional_encoding(seq_len, d_model):
"""
Generate positional encodings using sine and cosine functions.
seq_len: Maximum sequence length
d_model: Embedding dimension
Returns: Positional encoding matrix (seq_len, d_model)
"""
PE = np.zeros((seq_len, d_model))
# Position indices
position = np.arange(seq_len).reshape(-1, 1)
# Dimension indices
div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
# Apply sine to even indices
PE[:, 0::2] = np.sin(position * div_term)
# Apply cosine to odd indices
PE[:, 1::2] = np.cos(position * div_term)
return PE
# Generate positional encodings
seq_len = 50
d_model = 128
PE = positional_encoding(seq_len, d_model)
print("="*60)
print("POSITIONAL ENCODING")
print("="*60)
print(f"Sequence length: {seq_len}")
print(f"Model dimension: {d_model}")
print(f"Positional encoding shape: {PE.shape}")
# Visualize
fig, axes = plt.subplots(2, 1, figsize=(12, 10))
# Heatmap of positional encodings
im = axes[0].imshow(PE.T, cmap='RdBu', aspect='auto', vmin=-1, vmax=1)
axes[0].set_xlabel('Position in Sequence', fontsize=12)
axes[0].set_ylabel('Embedding Dimension', fontsize=12)
axes[0].set_title('Positional Encoding Heatmap', fontsize=13, fontweight='bold')
plt.colorbar(im, ax=axes[0], label='Encoding Value')
# Individual position encodings (first 10 positions)
for pos in range(min(10, seq_len)):
axes[1].plot(PE[pos], alpha=0.7, linewidth=1.5, label=f'Position {pos}')
axes[1].set_xlabel('Dimension', fontsize=12)
axes[1].set_ylabel('Encoding Value', fontsize=12)
axes[1].set_title('Positional Encoding Vectors (First 10 Positions)', fontsize=13, fontweight='bold')
axes[1].legend(loc='right', bbox_to_anchor=(1.15, 0.5), fontsize=9)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nKey properties:")
print(f" Range: [{PE.min():.3f}, {PE.max():.3f}]")
print(f" Each position has unique encoding")
print(f" Sine/cosine allows model to learn relative positions")
# Demonstrate uniqueness
print("\nUniqueness check (first 5 positions):")
for i in range(5):
print(f" Position {i}: {PE[i, :8]}") # Show first 8 dimensions
print("\nWhy sine/cosine?")
print(" - Bounded values (between -1 and 1)")
print(" - Unique encoding for each position")
print(" - Model can learn relative positions: PE(pos+k) as function of PE(pos)")
print(" - Generalizes to longer sequences than seen during training")
Complete Transformer Architecture
The full Transformer consists of an Encoder (processes input) and Decoder (generates output). Each has multiple layers with multi-head attention, feed-forward networks, and residual connections.
Transformer Block Structure
Encoder Layer:
- Multi-Head Self-Attention (attend to all positions)
- Add & Norm (residual connection + layer normalization)
- Feed-Forward Network (2 linear layers with ReLU)
- Add & Norm (residual connection + layer normalization)
Decoder Layer:
- Masked Multi-Head Self-Attention (attend only to previous positions)
- Add & Norm
- Multi-Head Cross-Attention (attend to encoder output)
- Add & Norm
- Feed-Forward Network
- Add & Norm
Complete Transformer:
- Input Embedding + Positional Encoding
- N × Encoder Layers (typically 6-12)
- N × Decoder Layers (typically 6-12)
- Output Linear + Softmax
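The decoder's masked self-attention can be sketched with a causal mask. Below is a minimal, self-contained NumPy illustration (separate from the multi-head attention class built earlier): each position may attend only to itself and earlier positions, which is what keeps autoregressive generation from "seeing the future".

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len)))

def masked_attention_weights(scores, mask):
    """Set blocked positions to a large negative value before softmax,
    so their attention weight becomes ~0."""
    scores = np.where(mask == 1, scores, -1e9)
    # Row-wise softmax (numerically stabilized)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 4)   # raw attention scores for 4 positions
weights = masked_attention_weights(scores, causal_mask(4))
print(np.round(weights, 2))
# Each row sums to 1; the upper triangle (future positions) is ~0
```

The same mechanism, applied per head, is what turns the encoder's self-attention into the decoder's masked self-attention.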
import numpy as np
class TransformerEncoderLayer:
"""
Single Transformer Encoder Layer.
Components:
1. Multi-Head Self-Attention
2. Add & Norm (residual + layer norm)
3. Feed-Forward Network
4. Add & Norm
"""
def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
"""
d_model: Model dimension
num_heads: Number of attention heads
d_ff: Dimension of feed-forward network
dropout: Dropout rate
"""
self.d_model = d_model
self.num_heads = num_heads
self.d_ff = d_ff
# Multi-head attention
self.mha = MultiHeadAttention(d_model, num_heads)
# Feed-forward network: d_model → d_ff → d_model
self.ff_W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / d_model)
self.ff_b1 = np.zeros((1, d_ff))
self.ff_W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / d_ff)
self.ff_b2 = np.zeros((1, d_model))
# Layer norm parameters (simplified)
self.gamma1 = np.ones((1, d_model))
self.beta1 = np.zeros((1, d_model))
self.gamma2 = np.ones((1, d_model))
self.beta2 = np.zeros((1, d_model))
def layer_norm(self, X, gamma, beta, epsilon=1e-6):
"""Layer normalization"""
mean = np.mean(X, axis=-1, keepdims=True)
var = np.var(X, axis=-1, keepdims=True)
X_norm = (X - mean) / np.sqrt(var + epsilon)
return gamma * X_norm + beta
def feed_forward(self, X):
"""Feed-forward network with ReLU"""
# First layer
hidden = np.dot(X, self.ff_W1) + self.ff_b1
hidden = np.maximum(0, hidden) # ReLU
# Second layer
output = np.dot(hidden, self.ff_W2) + self.ff_b2
return output
def forward(self, X):
"""
Forward pass through encoder layer.
X: Input (seq_len, d_model)
Returns: Output (seq_len, d_model)
"""
# 1. Multi-head attention
attn_output, _ = self.mha.forward(X)
# 2. Add & Norm (residual connection)
X = self.layer_norm(X + attn_output, self.gamma1, self.beta1)
# 3. Feed-forward network
ff_output = self.feed_forward(X)
# 4. Add & Norm
X = self.layer_norm(X + ff_output, self.gamma2, self.beta2)
return X
# Example: Transformer Encoder
print("="*60)
print("TRANSFORMER ENCODER LAYER")
print("="*60)
d_model = 64
num_heads = 8
d_ff = 256 # Typically 4x d_model
seq_len = 10
encoder_layer = TransformerEncoderLayer(d_model, num_heads, d_ff)
print(f"Configuration:")
print(f" Model dimension (d_model): {d_model}")
print(f" Attention heads: {num_heads}")
print(f" Feed-forward dimension: {d_ff}")
print(f" Sequence length: {seq_len}")
# Input embeddings + positional encoding
embeddings = np.random.randn(seq_len, d_model)
pos_encoding = positional_encoding(seq_len, d_model)
X = embeddings + pos_encoding
print(f"\nInput shape: {X.shape}")
# Forward pass
output = encoder_layer.forward(X)
print(f"Output shape: {output.shape}")
print("\nTransformer Encoder advantages:")
print(" - Parallel processing (all positions at once)")
print(" - Long-range dependencies (direct attention)")
print(" - Residual connections (gradient flow)")
print(" - Layer normalization (training stability)")
# Count parameters
mha_params = encoder_layer.mha.W_Q.size + encoder_layer.mha.W_K.size + \
encoder_layer.mha.W_V.size + encoder_layer.mha.W_O.size
ff_params = encoder_layer.ff_W1.size + encoder_layer.ff_W2.size + d_ff + d_model
ln_params = 4 * d_model # gamma and beta for 2 layer norms
total_params = mha_params + ff_params + ln_params
print(f"\nParameter count (single layer):")
print(f" Multi-head attention: {mha_params:,}")
print(f" Feed-forward network: {ff_params:,}")
print(f" Layer normalization: {ln_params:,}")
print(f" Total: {total_params:,}")
print(f"\nFor a 6-layer transformer: ~{total_params * 6:,} parameters")
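Stacking N encoder layers is just repeated application of the layer above. Here is a minimal, self-contained sketch using a simplified single-head layer (not the MultiHeadAttention class from earlier) to show that the (seq_len, d_model) shape is preserved through the whole stack:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(X, eps=1e-6):
    return (X - X.mean(axis=-1, keepdims=True)) / np.sqrt(X.var(axis=-1, keepdims=True) + eps)

class MiniEncoderLayer:
    """Simplified layer: single-head self-attention + feed-forward,
    each wrapped in a residual connection and layer norm."""
    def __init__(self, d_model, d_ff, rng):
        s = 1.0 / np.sqrt(d_model)   # simple scaled init (illustrative)
        self.Wq = rng.standard_normal((d_model, d_model)) * s
        self.Wk = rng.standard_normal((d_model, d_model)) * s
        self.Wv = rng.standard_normal((d_model, d_model)) * s
        self.W1 = rng.standard_normal((d_model, d_ff)) * s
        self.W2 = rng.standard_normal((d_ff, d_model)) * np.sqrt(1.0 / d_ff)

    def forward(self, X):
        Q, K, V = X @ self.Wq, X @ self.Wk, X @ self.Wv
        attn = softmax(Q @ K.T / np.sqrt(X.shape[-1])) @ V
        X = layer_norm(X + attn)                   # Add & Norm 1
        ff = np.maximum(0, X @ self.W1) @ self.W2  # ReLU feed-forward
        return layer_norm(X + ff)                  # Add & Norm 2

class MiniEncoder:
    """N stacked layers; each refines the representation in place."""
    def __init__(self, num_layers, d_model, d_ff, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [MiniEncoderLayer(d_model, d_ff, rng) for _ in range(num_layers)]

    def forward(self, X):
        for layer in self.layers:
            X = layer.forward(X)
        return X

X = np.random.default_rng(1).standard_normal((10, 64))  # (seq_len, d_model)
out = MiniEncoder(num_layers=6, d_model=64, d_ff=256).forward(X)
print(out.shape)  # (10, 64) -- shape is preserved through the stack
```

Because every layer maps (seq_len, d_model) to (seq_len, d_model), layers can be stacked to any depth without reshaping.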
Transformer Applications
Famous Transformer Models
1. BERT (Bidirectional Encoder Representations from Transformers)
- Architecture: Encoder-only (12-24 layers)
- Training: Masked language modeling + next sentence prediction
- Use: Text classification, question answering, NER
- Innovation: Pre-training on massive text, fine-tune on downstream tasks
2. GPT (Generative Pre-trained Transformer)
- Architecture: Decoder-only (12-96+ layers in GPT-3/4)
- Training: Next-token prediction (language modeling)
- Use: Text generation, completion, ChatGPT
- Innovation: Autoregressive generation, few-shot learning
3. T5 (Text-to-Text Transfer Transformer)
- Architecture: Full encoder-decoder
- Training: All tasks as text-to-text (translation, summarization, etc.)
- Use: Universal text transformation
- Innovation: Unified framework for all NLP tasks
4. Vision Transformer (ViT)
- Architecture: Encoder-only, applied to image patches
- Training: Image classification on ImageNet
- Use: Computer vision tasks (classification, detection)
- Innovation: Transformers can match or beat CNNs on vision tasks (given large-scale pretraining)!
import numpy as np
import matplotlib.pyplot as plt
# Comparison of transformer architectures
def compare_transformer_variants():
"""
Compare different transformer-based models.
"""
models = {
'BERT': {
'architecture': 'Encoder-only',
'layers': 12,
'params': '110M',
'training': 'Masked LM',
'bidirectional': True,
'use_case': 'Understanding'
},
'GPT-3': {
'architecture': 'Decoder-only',
'layers': 96,
'params': '175B',
'training': 'Next token',
'bidirectional': False,
'use_case': 'Generation'
},
'T5': {
'architecture': 'Encoder-Decoder',
'layers': '12+12',
'params': '11B (XXL)',
'training': 'Text-to-text',
'bidirectional': True,
'use_case': 'Translation'
},
'ViT': {
'architecture': 'Encoder-only',
'layers': 12,
'params': '86M',
'training': 'Image patches',
'bidirectional': True,
'use_case': 'Vision'
}
}
print("="*80)
print("TRANSFORMER MODEL COMPARISON")
print("="*80)
print(f"{'Model':<10} {'Architecture':<18} {'Layers':<8} {'Parameters':<12} {'Primary Use':<15}")
print("-"*80)
for name, props in models.items():
print(f"{name:<10} {props['architecture']:<18} {str(props['layers']):<8} "
f"{props['params']:<12} {props['use_case']:<15}")
print("\n" + "="*80)
print("DETAILED CHARACTERISTICS")
print("="*80)
for name, props in models.items():
print(f"\n{name}:")
print(f" Architecture: {props['architecture']}")
print(f" Training objective: {props['training']}")
print(f" Bidirectional: {props['bidirectional']}")
print(f" Best for: {props['use_case']}")
# Visualize model sizes
model_names = list(models.keys())
param_counts = [110, 175000, 11000, 86] # In millions
fig, ax = plt.subplots(figsize=(12, 6))
colors = ['#3B9797', '#BF092F', '#16476A', '#132440']
bars = ax.bar(model_names, param_counts, color=colors, alpha=0.7, edgecolor='black')
ax.set_ylabel('Parameters (Millions)', fontsize=12)
ax.set_title('Transformer Model Sizes', fontsize=14, fontweight='bold')
ax.set_yscale('log')
ax.grid(True, alpha=0.3, axis='y')
# Add value labels
for bar, count in zip(bars, param_counts):
height = bar.get_height()
label = f'{count:,}M' if count < 1000 else f'{count/1000:.0f}B'
ax.text(bar.get_x() + bar.get_width()/2, height,
label, ha='center', va='bottom', fontsize=11, fontweight='bold')
plt.tight_layout()
plt.show()
print("\nChoosing a transformer:")
print(" - Text understanding (classification, QA): BERT-like")
print(" - Text generation (chatbots, completion): GPT-like")
print(" - Sequence-to-sequence (translation): T5, encoder-decoder")
print(" - Computer vision: ViT, CLIP")
compare_transformer_variants()
Transformers Deep Dive Summary
What We Built:
- ✓ Scaled dot-product attention from scratch
- ✓ Multi-head attention mechanism
- ✓ Positional encoding (sine/cosine)
- ✓ Complete Transformer Encoder layer
- ✓ Comparison of famous models (BERT, GPT, T5, ViT)
Key Insights:
- Attention: Learn what to focus on (Q, K, V mechanism)
- Multi-head: Multiple perspectives in parallel
- Positional encoding: Inject sequence order information
- Parallelization: Process all positions simultaneously → fast
- Scalability: Models scale to billions of parameters (GPT-3, GPT-4)
Why Transformers Dominate:
- No sequential bottleneck (unlike RNNs)
- Direct long-range connections
- Highly parallelizable (GPU-friendly)
- Transfer learning (pre-train on huge data, fine-tune)
- Works across modalities (text, images, audio, video)
Next: We'll explore best practices, common pitfalls, and practical tips for training neural networks!
Best Practices and Common Pitfalls
Training neural networks is part art, part science. This section covers practical strategies to improve performance, avoid common mistakes, and debug issues when things go wrong.
Preventing Overfitting
Overfitting occurs when the model memorizes training data but fails to generalize to new data. It's like a student who memorizes answers without understanding concepts—performs well on practice tests but fails on real exams.
Signs of Overfitting
- Training accuracy high (95%+), validation accuracy low (70%)
- Training loss decreases, validation loss increases
- Large gap between training and validation curves
- Model performs perfectly on training set, poorly on new data
Prevention Strategies:
- Dropout: Randomly deactivate neurons during training
- L2 Regularization: Penalize large weights
- Early Stopping: Stop training when validation loss starts increasing
- Data Augmentation: Generate more training samples
- Reduce Model Complexity: Fewer layers/neurons
1. Dropout
Dropout randomly sets a fraction of neurons to zero during each training iteration. This prevents co-adaptation (neurons relying too heavily on specific other neurons) and forces the network to learn robust features.
import numpy as np
import matplotlib.pyplot as plt
class DropoutLayer:
"""
Dropout layer: randomly drop neurons during training.
Prevents overfitting by forcing network to learn redundant representations.
"""
def __init__(self, dropout_rate=0.5):
"""
dropout_rate: Probability of dropping a neuron (0.0 to 1.0)
"""
self.dropout_rate = dropout_rate
self.mask = None
def forward(self, X, training=True):
"""
Apply dropout during training, scale during inference.
X: Input activations
training: If True, apply dropout; if False, just scale
"""
if training:
# Create binary mask: 1 = keep, 0 = drop
self.mask = np.random.binomial(1, 1 - self.dropout_rate, size=X.shape)
# Apply mask and scale (inverted dropout)
return X * self.mask / (1 - self.dropout_rate)
else:
# During inference, keep all neurons (no dropout)
return X
def backward(self, grad_output):
"""
Backprop through dropout: only pass gradients for kept neurons.
"""
return grad_output * self.mask / (1 - self.dropout_rate)
# Demonstrate dropout effect
print("="*60)
print("DROPOUT REGULARIZATION")
print("="*60)
# Simulate activations from a hidden layer
activations = np.random.randn(100, 20) # 100 samples, 20 neurons
dropout = DropoutLayer(dropout_rate=0.5)
print(f"Original activations shape: {activations.shape}")
print(f"Dropout rate: {dropout.dropout_rate}")
# Apply dropout (training mode)
dropped_activations = dropout.forward(activations, training=True)
# Count how many neurons were dropped
dropped_count = np.sum(dropped_activations == 0)
total_count = activations.size
print(f"\nNeurons dropped: {dropped_count} / {total_count} ({dropped_count/total_count*100:.1f}%)")
print(f"Expected: ~{dropout.dropout_rate*100:.0f}%")
# Visualize dropout effect
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Original activations
im1 = axes[0].imshow(activations[:20].T, cmap='RdBu', aspect='auto', vmin=-3, vmax=3)
axes[0].set_title('Original Activations', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Sample')
axes[0].set_ylabel('Neuron')
plt.colorbar(im1, ax=axes[0])
# Dropout mask
im2 = axes[1].imshow(dropout.mask[:20].T, cmap='Greys', aspect='auto', vmin=0, vmax=1)
axes[1].set_title('Dropout Mask\n(White = Kept, Black = Dropped)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Sample')
axes[1].set_ylabel('Neuron')
plt.colorbar(im2, ax=axes[1])
# Dropped activations
im3 = axes[2].imshow(dropped_activations[:20].T, cmap='RdBu', aspect='auto', vmin=-3, vmax=3)
axes[2].set_title('After Dropout\n(~50% neurons zeroed)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('Sample')
axes[2].set_ylabel('Neuron')
plt.colorbar(im3, ax=axes[2])
plt.tight_layout()
plt.show()
print("\nWhy dropout works:")
print(" - Forces network to not rely on specific neurons")
print(" - Each training iteration uses different 'sub-network'")
print(" - Acts like training ensemble of networks")
print(" - During inference, use full network (no dropout)")
print("\nImportant: Always disable dropout during testing!")
2. L2 Regularization (Weight Decay)
L2 regularization adds a penalty term to the loss function proportional to the square of weights. This encourages smaller weights, preventing the model from becoming too complex.
Symbolic L2 Regularization Derivation
import sympy as sp
from sympy import symbols, diff, simplify, summation, IndexedBase, Function
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("L2 REGULARIZATION - SYMBOLIC DERIVATION")
print("="*60)
# Define symbolic variables
lambda_reg = symbols('lambda', positive=True) # Regularization strength
i, j, n = symbols('i j n', integer=True, positive=True)
# Weight matrix
W = IndexedBase('W')
print("\n1. L2 REGULARIZATION FORMULA")
print("-" * 60)
print("Loss with L2 regularization:")
print(" L_total = L_data + λ/2 × ||W||²")
print(" = L_data + λ/2 × Σ w_i²")
print("")
print("Where:")
print(" L_data = original loss (MSE, cross-entropy, etc.)")
print(" λ = regularization strength (hyperparameter)")
print(" ||W||² = sum of squared weights")
print(" Factor 1/2 for cleaner derivatives")
# Simple case: single weight
print("\n2. GRADIENT DERIVATION (Single Weight)")
print("-" * 60)
w = symbols('w', real=True)
L_data = Function('L_{data}')(w) # Data loss as function of w
# Total loss
L_total = L_data + (lambda_reg / 2) * w**2
print(f"L_total = L_data(w) + λ/2 × w²")
# Gradient
grad_L_total = diff(L_total, w)
print(f"\n∂L_total/∂w = ∂L_data/∂w + λw")
print("\nGradient descent update:")
print(" w_new = w - lr × ∂L_total/∂w")
print(" = w - lr × (∂L_data/∂w + λw)")
print(" = w - lr × ∂L_data/∂w - lr × λw")
print(" = (1 - lr×λ)w - lr × ∂L_data/∂w")
print("\nWeight decay interpretation:")
print(f" Weights multiplied by (1 - lr×λ) each update")
print(f" Example: lr=0.01, λ=0.01 → multiply by 0.9999")
print(f" Weights gradually shrink toward zero!")
# Numerical example
print("\n3. NUMERICAL EXAMPLE")
print("-" * 60)
lr_val = 0.1
lambda_vals = [0.0, 0.01, 0.1, 1.0]
w_initial = 2.0
grad_data = 0.5 # Assume gradient from data is 0.5
print(f"Initial weight: w = {w_initial}")
print(f"Data gradient: ∂L_data/∂w = {grad_data}")
print(f"Learning rate: lr = {lr_val}")
print("\nWeight updates for different λ:")
for lambda_val in lambda_vals:
# Regular gradient descent
w_new_no_reg = w_initial - lr_val * grad_data
# With L2 regularization
decay_factor = 1 - lr_val * lambda_val
w_new_with_reg = decay_factor * w_initial - lr_val * grad_data
shrinkage = w_initial - w_new_with_reg
print(f"\n λ = {lambda_val}:")
print(f" No reg: w → {w_new_no_reg:.4f}")
print(f" With L2: w → {w_new_with_reg:.4f}")
print(f" Shrinkage: {shrinkage:.4f}")
# Effect over many iterations
print("\n4. LONG-TERM EFFECT (100 iterations)")
print("-" * 60)
import matplotlib.pyplot as plt
iterations = 100
w_history = {}
for lambda_val in [0.0, 0.01, 0.1]:
w = w_initial
history = [w]
for _ in range(iterations):
# Simplified: assume gradient from data stays constant
decay_factor = 1 - lr_val * lambda_val
w = decay_factor * w - lr_val * grad_data
history.append(w)
w_history[lambda_val] = history
plt.figure(figsize=(10, 6))
for lambda_val, history in w_history.items():
label = f'λ = {lambda_val}'
plt.plot(history, linewidth=2, label=label, marker='o', markersize=3, markevery=10)
plt.axhline(y=0, color='red', linestyle='--', linewidth=1, alpha=0.5, label='Zero')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Weight Value', fontsize=12)
plt.title('L2 Regularization: Weight Decay Over Time', fontsize=14, fontweight='bold')
plt.legend(loc='upper right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Final weights after 100 iterations:")
for lambda_val, history in w_history.items():
print(f" λ = {lambda_val}: w = {history[-1]:.4f}")
print("\nKey insights:")
print(" 1. L2 = weight decay (multiplicative shrinkage)")
print(" 2. Larger λ → stronger regularization → smaller weights")
print(" 3. Prevents overfitting by limiting model complexity")
print(" 4. Equivalent to Gaussian prior on weights (Bayesian view)")
import numpy as np
import matplotlib.pyplot as plt
def l2_regularization_demo():
"""
Demonstrate L2 regularization effect on weights.
"""
# Loss with L2 regularization: L = L_data + λ/2 * ||W||² (so the gradient term is λ * W)
# λ (lambda): regularization strength
lambda_values = [0.0, 0.01, 0.1, 1.0]
# Simulate training with different regularization strengths
epochs = 100
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for idx, lambda_reg in enumerate(lambda_values):
# Initialize weights
weights = np.random.randn(50) * 2.0
weight_history = [weights.copy()]
# Simulate training
for epoch in range(epochs):
# Gradient descent with L2 regularization
# Normally: W = W - lr * grad_data
# With L2: W = W - lr * (grad_data + ? * W)
# Simulate data gradient (random for demo)
grad_data = np.random.randn(50) * 0.1
# L2 gradient = λ * W
grad_l2 = lambda_reg * weights
# Update
lr = 0.1
weights = weights - lr * (grad_data + grad_l2)
weight_history.append(weights.copy())
weight_history = np.array(weight_history)
# Plot weight evolution
ax = axes[idx]
for i in range(min(10, weights.shape[0])):
ax.plot(weight_history[:, i], alpha=0.6, linewidth=1.5)
ax.set_title(f'λ = {lambda_reg}\nFinal ||W||² = {np.sum(weights**2):.2f}',
fontsize=12, fontweight='bold')
ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Weight Value', fontsize=11)
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1)
plt.suptitle('L2 Regularization: Effect on Weight Magnitude', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("="*60)
print("L2 REGULARIZATION (WEIGHT DECAY)")
print("="*60)
print("\nEffect of different λ values:")
print(" λ = 0.0: No regularization → large weights")
print(" λ = 0.01: Mild regularization → moderate weights")
print(" λ = 0.1: Strong regularization → smaller weights")
print(" λ = 1.0: Very strong → weights decay toward zero")
print("\nWhen to use L2:")
print(" - Training loss << validation loss (overfitting)")
print(" - Start with λ = 0.01 or 0.001")
print(" - Tune via validation set performance")
print("\nImplementation:")
print(" loss = data_loss + lambda_reg * np.sum(weights**2)")
print(" grad_weights = grad_data + 2 * lambda_reg * weights")
l2_regularization_demo()
3. Early Stopping
Early stopping monitors validation loss and stops training when it starts increasing, preventing the model from overfitting to the training data.
import numpy as np
import matplotlib.pyplot as plt
class EarlyStopping:
"""
Early stopping: stop training when validation loss stops improving.
"""
def __init__(self, patience=10, min_delta=0.001):
"""
patience: Number of epochs to wait before stopping
min_delta: Minimum change to qualify as improvement
"""
self.patience = patience
self.min_delta = min_delta
self.best_loss = np.inf
self.counter = 0
self.early_stop = False
self.best_epoch = 0
def __call__(self, val_loss, epoch):
"""
Check if training should stop.
val_loss: Current validation loss
epoch: Current epoch number
"""
if val_loss < self.best_loss - self.min_delta:
# Improvement
self.best_loss = val_loss
self.counter = 0
self.best_epoch = epoch
else:
# No improvement
self.counter += 1
if self.counter >= self.patience:
self.early_stop = True
return self.early_stop
# Simulate training with early stopping
def simulate_training_with_early_stopping():
"""
Demonstrate early stopping preventing overfitting.
"""
epochs = 200
# Simulate loss curves
train_losses = []
val_losses = []
# Training loss: steadily decreases
for epoch in range(epochs):
train_loss = 2.0 * np.exp(-0.03 * epoch) + 0.1 + np.random.randn() * 0.02
train_losses.append(train_loss)
# Validation loss: decreases then increases (overfitting after epoch 80)
for epoch in range(epochs):
if epoch < 80:
val_loss = 2.2 * np.exp(-0.025 * epoch) + 0.3 + np.random.randn() * 0.05
else:
# Start overfitting
val_loss = 0.3 + 0.01 * (epoch - 80) + np.random.randn() * 0.05
val_losses.append(val_loss)
# Apply early stopping
early_stopping = EarlyStopping(patience=15, min_delta=0.01)
stopped_epoch = epochs
for epoch in range(epochs):
if early_stopping(val_losses[epoch], epoch):
stopped_epoch = epoch
break
# Visualize
plt.figure(figsize=(12, 6))
plt.plot(train_losses, label='Training Loss', linewidth=2, color='#3B9797')
plt.plot(val_losses, label='Validation Loss', linewidth=2, color='#BF092F')
# Mark best epoch
plt.axvline(x=early_stopping.best_epoch, color='green', linestyle='--',
linewidth=2, label=f'Best Epoch ({early_stopping.best_epoch})')
# Mark stopping epoch
plt.axvline(x=stopped_epoch, color='orange', linestyle='--',
linewidth=2, label=f'Stopped Epoch ({stopped_epoch})')
# Shade overfitting region
plt.axvspan(80, epochs, alpha=0.2, color='red', label='Overfitting Region')
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss', fontsize=12)
plt.title('Early Stopping Prevents Overfitting', fontsize=14, fontweight='bold')
plt.legend(loc='upper right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("="*60)
print("EARLY STOPPING")
print("="*60)
print(f"Best validation loss: {early_stopping.best_loss:.4f} at epoch {early_stopping.best_epoch}")
print(f"Training stopped at epoch: {stopped_epoch}")
print(f"Patience: {early_stopping.patience} epochs")
print("\nBest practice:")
print(" 1. Monitor validation loss every epoch")
print(" 2. Save model weights when validation loss improves")
print(" 3. Stop training after patience epochs without improvement")
print(" 4. Restore best weights (not final weights)")
print("\nTypical patience values:")
print(" - Small datasets: 5-10 epochs")
print(" - Large datasets: 10-20 epochs")
print(" - Very large models: 3-5 epochs")
simulate_training_with_early_stopping()
Hyperparameter Tuning
Hyperparameters (learning rate, batch size, number of layers, etc.) dramatically affect model performance. Systematic tuning is essential.
Key Hyperparameters to Tune
High Priority (tune first):
- Learning Rate: Most important! Range: 0.0001 to 0.1
- Batch Size: 16, 32, 64, 128, 256 (powers of 2)
- Number of Layers: Start shallow (2-3), increase if needed
- Neurons per Layer: 32, 64, 128, 256, 512
Medium Priority:
- Optimizer: Adam (default), SGD+momentum, RMSprop
- Activation Function: ReLU (default), Leaky ReLU, ELU
- Dropout Rate: 0.2 to 0.5 (if using dropout)
- L2 Regularization: 0.001 to 0.1 (if needed)
Low Priority (tune last):
- Weight initialization scheme
- Batch normalization momentum
- Gradient clipping threshold
Tuning Strategies:
- Grid Search: Try all combinations (exhaustive but slow)
- Random Search: Sample randomly (often better than grid)
- Bayesian Optimization: Smart exploration (advanced)
- Manual Tuning: Start with defaults, adjust based on results
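Random search, often the most effective of the strategies above for the effort involved, can be sketched in a few lines. `fake_validation_accuracy` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation score"; note the learning rate is sampled log-uniformly, since its useful range spans orders of magnitude:

```python
import numpy as np

def random_search(evaluate, n_trials=20, seed=0):
    """Randomly sample hyperparameters; keep the best by validation score."""
    rng = np.random.default_rng(seed)
    best_score, best_params = -np.inf, None
    for _ in range(n_trials):
        params = {
            # Learning rate sampled log-uniformly over [1e-4, 1e-1]
            'lr': 10 ** rng.uniform(-4, -1),
            'batch_size': int(rng.choice([16, 32, 64, 128, 256])),
            'dropout': rng.uniform(0.2, 0.5),
        }
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective (illustrative only): pretends accuracy peaks
# near lr = 0.01 and dropout = 0.3
def fake_validation_accuracy(p):
    return 1.0 - 0.05 * (np.log10(p['lr']) + 2) ** 2 - (p['dropout'] - 0.3) ** 2

best, score = random_search(fake_validation_accuracy, n_trials=50)
print(best, score)
```

In practice `evaluate` would train the network and return validation accuracy; everything else stays the same.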
import numpy as np
import matplotlib.pyplot as plt
def learning_rate_comparison():
"""
Demonstrate impact of learning rate on training.
"""
# Simulate training with different learning rates
learning_rates = [0.001, 0.01, 0.1, 1.0]
epochs = 100
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()
for idx, lr in enumerate(learning_rates):
losses = []
# Simulate loss curve for this learning rate
loss = 2.0
for epoch in range(epochs):
if lr < 0.01:
# Too small: slow convergence
loss = 2.0 * np.exp(-0.01 * epoch) + 0.5 + np.random.randn() * 0.05
elif lr < 0.1:
# Good: smooth convergence
loss = 2.0 * np.exp(-0.04 * epoch) + 0.1 + np.random.randn() * 0.02
elif lr < 0.5:
# Too large: oscillation
loss = 0.5 + 0.3 * np.sin(epoch * 0.3) + np.random.randn() * 0.1
else:
# Way too large: divergence
loss = loss * (1.0 + 0.1 * np.random.randn())
losses.append(max(0, loss))
ax = axes[idx]
ax.plot(losses, linewidth=2, color='#3B9797')
# Determine status
if lr < 0.01:
status = "Too Small (Slow)"
color = 'orange'
elif lr < 0.1:
status = "Good (Smooth)"
color = 'green'
elif lr < 0.5:
status = "Too Large (Oscillating)"
color = 'red'
else:
status = "Way Too Large (Diverging)"
color = 'darkred'
ax.set_title(f'Learning Rate = {lr}\n{status}',
fontsize=12, fontweight='bold', color=color)
ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Loss', fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim([0, min(5, max(losses) * 1.1)])
plt.suptitle('Learning Rate Impact on Training', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("="*60)
print("LEARNING RATE TUNING")
print("="*60)
print("\nSymptoms:")
print(" Too small (< 0.001):")
print(" - Training very slow")
print(" - Loss decreases gradually")
print(" - May not converge in reasonable time")
print("\n Too large (> 0.1):")
print(" - Loss oscillates wildly")
print(" - May never converge")
print(" - Can diverge (loss → infinity)")
print("\n Just right (0.001 - 0.01 for Adam):")
print(" - Smooth decrease")
print(" - Converges in reasonable time")
print(" - Stable training")
print("\nFinding good learning rate:")
print(" 1. Start with lr = 0.001 (safe default for Adam)")
print(" 2. If too slow, try 0.01")
print(" 3. If unstable, try 0.0001")
print(" 4. Use learning rate schedules (decay over time)")
print("\nLearning rate schedules:")
print(" - Step decay: Reduce by 10x every N epochs")
print(" - Exponential decay: lr = lr0 * e^(-kt)")
print(" - Cosine annealing: Smooth oscillation")
learning_rate_comparison()
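The schedules listed above can be written directly; these are common textbook forms, with `drop`, `every`, and `k` as illustrative hyperparameter choices:

```python
import numpy as np

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, k=0.05):
    """lr = lr0 * e^(-k * epoch): smooth continuous decay."""
    return lr0 * np.exp(-k * epoch)

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    """Cosine schedule from lr0 down to lr_min over total_epochs."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + np.cos(np.pi * epoch / total_epochs))

lr0 = 0.01
for epoch in [0, 30, 60, 90]:
    print(epoch,
          step_decay(lr0, epoch),
          round(exponential_decay(lr0, epoch), 6),
          round(cosine_annealing(lr0, epoch, total_epochs=90), 6))
```

All three start at lr0 and shrink over time; step decay drops abruptly, the other two decay smoothly.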
Data Preprocessing
Proper data preprocessing is crucial for neural network performance. Raw data often needs normalization, standardization, or augmentation.
Essential Preprocessing Steps
1. Normalization (Scale to [0, 1]):
- X_norm = (X - X_min) / (X_max - X_min)
- Use for: Image pixels, bounded features
2. Standardization (Zero Mean, Unit Variance):
- X_std = (X - μ) / σ
- Use for: Most features, when distribution matters
3. Data Augmentation (Generate More Samples):
- Images: Rotation, flipping, cropping, color jitter
- Text: Synonym replacement, back-translation
- Time series: Jittering, scaling, window slicing
4. Handling Missing Values:
- Mean/median imputation
- Forward/backward fill (time series)
- Use separate "missing" indicator feature
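Step 4 (missing values) can be sketched with mean imputation plus an indicator feature; a minimal NumPy version, assuming NaN marks the missing entries:

```python
import numpy as np

def mean_impute_with_indicator(X):
    """Replace NaNs with the column mean and append a 0/1 'was missing'
    column per feature, so the model can still see where values were absent."""
    X = X.astype(float)
    missing = np.isnan(X).astype(float)    # indicator features
    col_means = np.nanmean(X, axis=0)      # per-column means, ignoring NaNs
    filled = np.where(np.isnan(X), col_means, X)
    return np.hstack([filled, missing])

X = np.array([[1.0,    10.0],
              [np.nan, 20.0],
              [3.0,    np.nan]])
print(mean_impute_with_indicator(X))
# Column 0 NaN -> mean(1, 3) = 2; column 1 NaN -> mean(10, 20) = 15
```

As with scaling, the imputation statistics should be computed on the training set and reused on the test set.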
import numpy as np
import matplotlib.pyplot as plt
def preprocessing_comparison():
"""
Compare different preprocessing techniques.
"""
# Generate sample data (two features with different scales)
np.random.seed(42)
n_samples = 500
# Feature 1: Small range (0 to 10)
feature1 = np.random.randn(n_samples) * 2 + 5
# Feature 2: Large range (1000 to 2000)
feature2 = np.random.randn(n_samples) * 200 + 1500
data = np.column_stack([feature1, feature2])
# 1. Original data
original = data.copy()
# 2. Normalization (min-max scaling to [0, 1])
normalized = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))
# 3. Standardization (zero mean, unit variance)
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
datasets = [
('Original Data\n(Different Scales)', original),
('Normalized to [0, 1]\n(Min-Max Scaling)', normalized),
('Standardized\n(μ=0, σ=1)', standardized)
]
for ax, (title, dataset) in zip(axes, datasets):
ax.scatter(dataset[:, 0], dataset[:, 1], alpha=0.5, s=30, color='#3B9797', edgecolor='black')
ax.set_xlabel('Feature 1', fontsize=11)
ax.set_ylabel('Feature 2', fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.axhline(y=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
ax.axvline(x=0, color='black', linestyle='--', linewidth=1, alpha=0.5)
plt.tight_layout()
plt.show()
print("="*60)
print("DATA PREPROCESSING COMPARISON")
print("="*60)
print("\n1. Original Data:")
print(f" Feature 1: min={original[:, 0].min():.2f}, max={original[:, 0].max():.2f}, "
f"mean={original[:, 0].mean():.2f}, std={original[:, 0].std():.2f}")
print(f" Feature 2: min={original[:, 1].min():.2f}, max={original[:, 1].max():.2f}, "
f"mean={original[:, 1].mean():.2f}, std={original[:, 1].std():.2f}")
print(" Problem: Feature 2 dominates (much larger scale)")
print("\n2. Normalized Data:")
print(f" Feature 1: min={normalized[:, 0].min():.2f}, max={normalized[:, 0].max():.2f}")
print(f" Feature 2: min={normalized[:, 1].min():.2f}, max={normalized[:, 1].max():.2f}")
print(" ✓ Both features in [0, 1] range")
print("\n3. Standardized Data:")
print(f" Feature 1: mean={standardized[:, 0].mean():.4f}, std={standardized[:, 0].std():.4f}")
print(f" Feature 2: mean={standardized[:, 1].mean():.4f}, std={standardized[:, 1].std():.4f}")
print(" ✓ Both features have μ≈0, σ≈1")
print("\nWhen to use which:")
print(" - Normalization: Images (0-255 pixels → 0-1)")
print(" - Standardization: Most features, Gaussian-like data")
print(" - Always preprocess training and test sets the same way!")
print(" - Use training set statistics for test set!")
preprocessing_comparison()
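The last two rules above ("preprocess the same way, use training set statistics") can be captured in a tiny fit/transform helper, in the spirit of scikit-learn's scalers; a minimal sketch:

```python
import numpy as np

class Standardizer:
    """Fit μ and σ on the training set only; reuse them for validation/test."""
    def fit(self, X_train):
        self.mean_ = X_train.mean(axis=0)
        self.std_ = X_train.std(axis=0)
        return self

    def transform(self, X):
        # Same statistics applied to any split -- never refit on test data
        return (X - self.mean_) / self.std_

rng = np.random.default_rng(42)
X_train = rng.normal(loc=1500, scale=200, size=(500, 2))
X_test = rng.normal(loc=1500, scale=200, size=(100, 2))

scaler = Standardizer().fit(X_train)
Z_train = scaler.transform(X_train)  # exactly mean 0, std 1
Z_test = scaler.transform(X_test)    # close to, but not exactly, mean 0 / std 1
print(Z_train.mean(axis=0), Z_test.mean(axis=0))
```

Refitting the scaler on the test set would leak test statistics into evaluation, which is why the fit/transform split matters.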
Batch Normalization
Batch Normalization normalizes activations within each mini-batch during training. This stabilizes learning, allows higher learning rates, and acts as regularization.
Symbolic Batch Normalization Formulas
import sympy as sp
from sympy import symbols, sqrt, summation, IndexedBase, simplify
import numpy as np
import matplotlib.pyplot as plt
print("="*60)
print("BATCH NORMALIZATION - SYMBOLIC FORMULAS")
print("="*60)
# Define symbolic variables
i, m = symbols('i m', integer=True, positive=True) # index, mini-batch size
epsilon = symbols('epsilon', positive=True, real=True) # small constant for stability
# Batch of activations
x = IndexedBase('x') # Input activations x[1], x[2], ..., x[m]
print("\\n1. BATCH NORMALIZATION ALGORITHM")
print("-" * 60)
print("Given: Mini-batch of m activations {x1, x2, ..., x_m}")
print("")
print("Step 1: Compute batch mean")
print(" µ_B = (1/m) × S x_i")
print(" i=1 to m")
# Symbolic mean
mu_B = symbols('mu_B', real=True) # We'll use symbol for mean to keep formulas clean
print("\\nStep 2: Compute batch variance")
print(" s²_B = (1/m) × S (x_i - µ_B)²")
print(" i=1 to m")
sigma_sq_B = symbols('sigma^2_B', positive=True, real=True)
print("\\nStep 3: Normalize")
print(" x^_i = (x_i - µ_B) / v(s²_B + e)")
print("")
print(" Where e (epsilon) prevents division by zero")
# Normalized activation (symbolic)
x_i = symbols('x_i', real=True)
x_hat = (x_i - mu_B) / sqrt(sigma_sq_B + epsilon)
print(f"\\nSymbolic form: x^_i = {x_hat}")
print("\\nStep 4: Scale and shift (learnable parameters)")
print(" y_i = ? × x^_i + ß")
print("")
print(" ? (gamma) = scale parameter (learned)")
print(" ß (beta) = shift parameter (learned)")
gamma, beta = symbols('gamma beta', real=True)
y_i = gamma * x_hat + beta
print(f"\\nFull transformation: y_i = {y_i}")
# Numerical example
print("\\n2. NUMERICAL EXAMPLE")
print("-" * 60)
# Mini-batch of 4 activations
batch = np.array([1.0, 2.0, 3.0, 4.0])
m_val = len(batch)
print(f"Input batch: {batch}")
print(f"Batch size m = {m_val}")
# Step 1: Mean
mu_val = np.mean(batch)
print(f"\\nStep 1 - Mean: µ_B = {mu_val}")
# Step 2: Variance
var_val = np.var(batch)
print(f"Step 2 - Variance: s²_B = {var_val}")
# Step 3: Normalize
epsilon_val = 1e-5
x_norm = (batch - mu_val) / np.sqrt(var_val + epsilon_val)
print(f"Step 3 - Normalized: x^ = {x_norm}")
print(f" Mean of x^: {np.mean(x_norm):.6f} (˜ 0)")
print(f" Std of x^: {np.std(x_norm):.6f} (˜ 1)")
# Step 4: Scale and shift
gamma_val = 2.0
beta_val = 0.5
y_output = gamma_val * x_norm + beta_val
print(f"\\nStep 4 - Scale (?={gamma_val}) and Shift (ß={beta_val}):")
print(f" y = {y_output}")
print(f" Mean of y: {np.mean(y_output):.6f}")
print(f" Std of y: {np.std(y_output):.6f}")
# Gradient formulas
print("\\n3. GRADIENTS FOR BACKPROPAGATION")
print("-" * 60)
# Loss
L = symbols('L', real=True)
# Gradient of loss w.r.t. output
dL_dy = IndexedBase('dL/dy')
print("Given: dL/dy_i (gradient from next layer)")
print("\\nWe need: dL/dx_i, dL/d?, dL/dß")
print("\\nGradient w.r.t. ß (shift):")
print(" dL/dß = S dL/dy_i")
print(" i=1 to m")
print(" (Sum of all incoming gradients)")
print("\\nGradient w.r.t. ? (scale):")
print(" dL/d? = S (dL/dy_i × x^_i)")
print(" i=1 to m")
print(" (Weighted sum by normalized values)")
print("\\nGradient w.r.t. x_i (input):")
print(" dL/dx_i = (?/v(s²_B + e)) × [m×dL/dx^_i - S dL/dx^_j - x^_i×S (dL/dx^_j × x^_j)]")
print(" (Complex! Accounts for dependencies through µ_B and s²_B)")
# Visualization: Effect on distribution
import matplotlib.pyplot as plt
# Generate skewed batch
np.random.seed(42)
batch_large = np.random.exponential(scale=2.0, size=100)
# Before normalization
mean_before = batch_large.mean()
std_before = batch_large.std()
# After normalization
batch_norm = (batch_large - mean_before) / (std_before + 1e-5)
# After scale and shift
gamma_vis = 1.5
beta_vis = 0.3
batch_scaled = gamma_vis * batch_norm + beta_vis
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Before
axes[0].hist(batch_large, bins=20, color='#BF092F', alpha=0.7, edgecolor='black')
axes[0].axvline(mean_before, color='black', linestyle='--', linewidth=2, label=f'μ={mean_before:.2f}')
axes[0].set_title(f'Before BatchNorm\nμ={mean_before:.2f}, σ={std_before:.2f}', fontweight='bold')
axes[0].set_xlabel('Activation Value')
axes[0].set_ylabel('Frequency')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Normalized
axes[1].hist(batch_norm, bins=20, color='#3B9797', alpha=0.7, edgecolor='black')
axes[1].axvline(0, color='black', linestyle='--', linewidth=2, label='μ≈0')
axes[1].set_title('After Normalization\nμ≈0, σ≈1', fontweight='bold')
axes[1].set_xlabel('Normalized Value')
axes[1].set_ylabel('Frequency')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Scaled
mean_after = batch_scaled.mean()
std_after = batch_scaled.std()
axes[2].hist(batch_scaled, bins=20, color='#132440', alpha=0.7, edgecolor='black')
axes[2].axvline(mean_after, color='white', linestyle='--', linewidth=2, label=f'μ={mean_after:.2f}')
axes[2].set_title(f'After Scale & Shift\nγ={gamma_vis}, β={beta_vis}', fontweight='bold')
axes[2].set_xlabel('Final Value')
axes[2].set_ylabel('Frequency')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\\n?? Key insights:")
print(" 1. Normalization ? standardized distribution (µ=0, s=1)")
print(" 2. Scale (?) and shift (ß) ? network can learn optimal distribution")
print(" 3. Reduces internal covariate shift (changing input distributions)")
print(" 4. Allows higher learning rates (gradients more stable)")
print(" 5. Acts as regularization (adds noise through mini-batch statistics)")
import numpy as np
class BatchNormalization:
"""
Batch Normalization layer: normalize activations per mini-batch.
Reduces internal covariate shift, speeds up training.
"""
def __init__(self, num_features, epsilon=1e-5, momentum=0.9):
"""
num_features: Number of features (neurons in layer)
epsilon: Small constant for numerical stability
momentum: Running average momentum
"""
self.num_features = num_features
self.epsilon = epsilon
self.momentum = momentum
# Learnable parameters
self.gamma = np.ones((1, num_features)) # Scale
self.beta = np.zeros((1, num_features)) # Shift
# Running statistics (for inference)
self.running_mean = np.zeros((1, num_features))
self.running_var = np.ones((1, num_features))
def forward(self, X, training=True):
"""
Normalize batch, scale and shift.
X: Input (batch_size, num_features)
training: If True, use batch statistics; if False, use running statistics
"""
if training:
# Compute batch statistics
batch_mean = np.mean(X, axis=0, keepdims=True)
batch_var = np.var(X, axis=0, keepdims=True)
# Normalize
X_norm = (X - batch_mean) / np.sqrt(batch_var + self.epsilon)
# Update running statistics (exponential moving average)
self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch_mean
self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch_var
# Store for backward pass
self.batch_mean = batch_mean
self.batch_var = batch_var
self.X_norm = X_norm
self.X = X
else:
# Use running statistics (inference mode)
X_norm = (X - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
# Scale and shift
out = self.gamma * X_norm + self.beta
return out
# Demonstrate batch normalization
print("="*60)
print("BATCH NORMALIZATION")
print("="*60)
# Simulate activations from a layer (2 mini-batches)
batch_size = 64
num_features = 128
# Batch 1: mean ≈ 5, std ≈ 2
batch1 = np.random.randn(batch_size, num_features) * 2 + 5
# Batch 2: mean ≈ -3, std ≈ 4 (different distribution!)
batch2 = np.random.randn(batch_size, num_features) * 4 - 3
bn = BatchNormalization(num_features)
print(f"Input batch 1 statistics:")
print(f" Mean: {batch1.mean():.3f}, Std: {batch1.std():.3f}")
print(f" Range: [{batch1.min():.3f}, {batch1.max():.3f}]")
# Forward pass (training mode)
normalized1 = bn.forward(batch1, training=True)
print(f"\nAfter batch normalization:")
print(f" Mean: {normalized1.mean():.6f}, Std: {normalized1.std():.6f}")
print(f" Range: [{normalized1.min():.3f}, {normalized1.max():.3f}]")
print(f"\nInput batch 2 statistics:")
print(f" Mean: {batch2.mean():.3f}, Std: {batch2.std():.3f}")
normalized2 = bn.forward(batch2, training=True)
print(f"\nAfter batch normalization:")
print(f" Mean: {normalized2.mean():.6f}, Std: {normalized2.std():.6f}")
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Original distributions
axes[0, 0].hist(batch1.flatten(), bins=50, alpha=0.7, color='#3B9797', edgecolor='black', label='Batch 1')
axes[0, 0].hist(batch2.flatten(), bins=50, alpha=0.7, color='#BF092F', edgecolor='black', label='Batch 2')
axes[0, 0].set_title('Original Activations\n(Different distributions)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Activation Value')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# Normalized distributions
axes[0, 1].hist(normalized1.flatten(), bins=50, alpha=0.7, color='#3B9797', edgecolor='black', label='Batch 1')
axes[0, 1].hist(normalized2.flatten(), bins=50, alpha=0.7, color='#BF092F', edgecolor='black', label='Batch 2')
axes[0, 1].set_title('After Batch Normalization\n(Both normalized to μ≈0, σ≈1)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Activation Value')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# Feature statistics before BN
feature_means_before = np.array([batch1[:, i].mean() for i in range(min(50, num_features))])
axes[1, 0].bar(range(len(feature_means_before)), feature_means_before, color='#3B9797', alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Feature Means Before BN\n(High variance across features)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Feature Index')
axes[1, 0].set_ylabel('Mean')
axes[1, 0].grid(True, alpha=0.3, axis='y')
# Feature statistics after BN
feature_means_after = np.array([normalized1[:, i].mean() for i in range(min(50, num_features))])
axes[1, 1].bar(range(len(feature_means_after)), feature_means_after, color='#BF092F', alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Feature Means After BN\n(All ≈ 0)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Feature Index')
axes[1, 1].set_ylabel('Mean')
axes[1, 1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
print("\n?? Why batch normalization works:")
print(" - Reduces internal covariate shift")
print(" - Allows higher learning rates (faster training)")
print(" - Acts as regularization (slight noise from batch statistics)")
print(" - Reduces sensitivity to initialization")
print("\n?? Best practices:")
print(" - Place after linear/conv layer, before activation")
print(" - Use momentum ˜ 0.9 for running statistics")
print(" - Switch to eval mode during testing!")
Debugging Neural Networks
Common Issues and Solutions
1. Loss is NaN or Infinity:
- Cause: Learning rate too high, numerical instability
- Fix: Reduce learning rate (10x), check for division by zero, add gradient clipping
2. Loss Not Decreasing:
- Cause: Learning rate too low, bad initialization, wrong loss function
- Fix: Increase learning rate, check data preprocessing, verify labels
3. Training Loss Decreases, Validation Loss Increases:
- Cause: Overfitting
- Fix: Add dropout, L2 regularization, early stopping, more data, reduce model size
4. Both Losses High and Not Improving:
- Cause: Underfitting (model too simple)
- Fix: Increase model capacity (more layers/neurons), train longer, reduce regularization
5. Gradients Exploding:
- Cause: Deep networks, high learning rate, unstable activations
- Fix: Gradient clipping, batch normalization, lower learning rate, use residual connections
6. Gradients Vanishing:
- Cause: Deep networks with sigmoid/tanh, poor initialization
- Fix: Use ReLU, batch normalization, residual connections, better initialization (He, Xavier)
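Gradient clipping, recommended above for both NaN losses and exploding gradients, is commonly done by global norm: if the joint L2 norm of all gradients exceeds a threshold, every gradient is rescaled by the same factor so their directions are preserved. A minimal NumPy sketch:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op if already small
    return [g * scale for g in grads], total

# Simulated exploding gradients for a two-parameter model
grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(f"norm before: {norm_before:.1f}, after: {norm_after:.3f}")
```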
Debugging Checklist:
- ✓ Verify data shapes match network expectations
- ✓ Check data preprocessing (normalized? standardized?)
- ✓ Confirm labels are correct and properly encoded
- ✓ Start with a small model, overfit a small batch (sanity check)
- ✓ Visualize activations and gradients (check for dead neurons)
- ✓ Monitor training metrics (loss, accuracy, learning rate)
- ✓ Compare to baseline (random initialization performance)
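The "overfit a small batch" check deserves emphasis: a correctly wired network should drive training loss on a handful of examples close to zero, and failure to do so usually means a bug rather than a hard problem. A self-contained sketch with a hypothetical two-layer network:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))            # a single tiny batch
y = rng.normal(size=(8, 1))            # arbitrary targets

# Two-layer tanh network with plenty of capacity for 8 points
W1 = rng.normal(size=(3, 32)) * 0.5
b1 = np.zeros(32)
W2 = rng.normal(size=(32, 1)) * 0.5
b2 = np.zeros(1)

lr = 0.05
losses = []
for _ in range(5000):
    H = np.tanh(X @ W1 + b1)           # forward
    P = H @ W2 + b2
    losses.append(float(np.mean((P - y) ** 2)))
    dP = 2 * (P - y) / len(X)          # backward: MSE gradient
    dW2, db2 = H.T @ dP, dP.sum(axis=0)
    dZ = (dP @ W2.T) * (1 - H ** 2)    # tanh derivative
    dW1, db1 = X.T @ dZ, dZ.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1     # gradient descent step
    W2 -= lr * dW2; b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.5f}")
```

If the final loss is not near zero on eight points, inspect shapes, the loss function, and the gradient code before scaling up.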
import numpy as np
import matplotlib.pyplot as plt
def debugging_visualization():
"""
Demonstrate common debugging visualizations.
"""
# Simulate training scenarios
epochs = 100
# Scenario 1: Healthy training
train_healthy = 2.0 * np.exp(-0.04 * np.arange(epochs)) + 0.1 + np.random.randn(epochs) * 0.02
val_healthy = 2.2 * np.exp(-0.035 * np.arange(epochs)) + 0.2 + np.random.randn(epochs) * 0.03
# Scenario 2: Overfitting
train_overfit = 2.0 * np.exp(-0.05 * np.arange(epochs)) + 0.05 + np.random.randn(epochs) * 0.01
val_overfit = np.concatenate([
2.2 * np.exp(-0.04 * np.arange(40)) + 0.3,
0.3 + 0.015 * np.arange(60) + np.random.randn(60) * 0.05
])
# Scenario 3: Underfitting
train_underfit = 1.5 + np.random.randn(epochs) * 0.1
val_underfit = 1.6 + np.random.randn(epochs) * 0.15
# Scenario 4: Exploding gradients
train_explode = []
val_explode = []
loss = 1.0
for i in range(epochs):
if i < 30:
loss = loss * 0.9 + np.random.randn() * 0.05
else:
loss = loss * 1.15 + np.random.randn() * 0.5
train_explode.append(max(0, loss))
val_explode.append(max(0, loss * 1.1 + np.random.randn() * 0.3))
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
scenarios = [
('Healthy Training ✓', train_healthy, val_healthy, 'green'),
('Overfitting ⚠️', train_overfit, val_overfit, 'orange'),
('Underfitting ⚠️', train_underfit, val_underfit, 'red'),
('Exploding Gradients ❌', train_explode, val_explode, 'darkred')
]
for ax, (title, train, val, color) in zip(axes.flatten(), scenarios):
ax.plot(train, label='Training Loss', linewidth=2, color='#3B9797')
ax.plot(val, label='Validation Loss', linewidth=2, color='#BF092F')
ax.set_xlabel('Epoch', fontsize=11)
ax.set_ylabel('Loss', fontsize=11)
ax.set_title(title, fontsize=12, fontweight='bold', color=color)
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)
if 'Exploding' in title:
ax.set_ylim([0, min(20, max(max(train), max(val)) * 1.1)])
plt.suptitle('Neural Network Training Scenarios', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("="*60)
print("DEBUGGING NEURAL NETWORKS")
print("="*60)
print("\n1. Healthy Training ?")
print(" - Both losses decrease smoothly")
print(" - Small gap between train and val")
print(" - Action: Continue training, maybe tune hyperparameters")
print("\n2. Overfitting ??")
print(" - Training loss continues decreasing")
print(" - Validation loss increases after initial decrease")
print(" - Action: Add regularization, early stopping, more data")
print("\n3. Underfitting ??")
print(" - Both losses high and flat")
print(" - No improvement over time")
print(" - Action: Increase model size, train longer, reduce regularization")
print("\n4. Exploding Gradients ??")
print(" - Loss increases or becomes NaN")
print(" - Sudden spikes in loss curve")
print(" - Action: Lower learning rate, gradient clipping, batch norm")
print("\n?? First steps when debugging:")
print(" 1. Print data shapes and sample values")
print(" 2. Overfit a single batch (should reach ~0 loss)")
print(" 3. Check gradient magnitudes (should be ~0.001 to 0.1)")
print(" 4. Visualize predictions vs ground truth")
print(" 5. Compare to random baseline")
debugging_visualization()
Best Practices Summary
What We Covered:
- ✓ Overfitting prevention: Dropout, L2 regularization, early stopping
- ✓ Hyperparameter tuning: Learning rate is king, systematic search strategies
- ✓ Data preprocessing: Normalization, standardization, augmentation
- ✓ Batch normalization: Stabilizes training, allows higher learning rates
- ✓ Debugging strategies: Common issues and how to fix them
Quick Reference Guide:
- Start with: Adam optimizer, lr=0.001, batch_size=32, ReLU activations
- If overfitting: Add dropout (0.3-0.5), L2 reg (0.01), early stopping
- If underfitting: More layers/neurons, train longer, reduce regularization
- If unstable: Lower learning rate, add batch norm, gradient clipping
- Always: Normalize data, monitor val loss, save best model, visualize results
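The suggested starting point, Adam with lr=0.001, combines momentum (a running mean of gradients) with per-parameter adaptive step sizes (a running mean of squared gradients). A minimal sketch of one Adam update; the quadratic test problem and the learning rate used in the demo loop are illustrative:

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta (standard defaults)."""
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * grad        # 1st moment
    state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2   # 2nd moment
    m_hat = state['m'] / (1 - beta1 ** state['t'])              # bias correction
    v_hat = state['v'] / (1 - beta2 ** state['t'])
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps)

# Minimize f(x) = x^2 from x = 3; gradient is 2x
x = np.array([3.0])
state = {'t': 0, 'm': np.zeros(1), 'v': np.zeros(1)}
for _ in range(5000):
    x = adam_step(x, 2 * x, state, lr=0.01)
print(x)  # close to 0
```

The bias-correction terms matter early in training, when the running moments are still near their zero initialization and would otherwise make the first steps too small.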
Next: We'll explore real-world applications of neural networks across different domains!
Real-World Applications
Neural networks have transformed numerous industries. This section showcases practical applications across computer vision, natural language processing, time series forecasting, and more.
Computer Vision Applications
Neural networks excel at visual tasks, from simple image classification to complex scene understanding.
Key Computer Vision Applications
1. Image Classification
- Use: Categorize images into predefined classes
- Examples: Medical diagnosis (cancer detection), quality control (defect detection), wildlife monitoring
- Architecture: CNNs (ResNet, EfficientNet, Vision Transformers)
- Accuracy: ~90% top-1 (>99% top-5) on ImageNet; superhuman on many narrow tasks
2. Object Detection
- Use: Locate and classify multiple objects in images
- Examples: Autonomous vehicles, surveillance, retail analytics
- Architecture: YOLO, Faster R-CNN, RetinaNet
- Performance: Real-time detection (30-60 FPS)
3. Semantic Segmentation
- Use: Classify each pixel in an image
- Examples: Medical imaging (tumor segmentation), satellite imagery, augmented reality
- Architecture: U-Net, DeepLab, Mask R-CNN
- Precision: Pixel-level accuracy for surgical planning
4. Face Recognition
- Use: Identify individuals from facial features
- Examples: Security systems, phone unlocking, photo organization
- Architecture: FaceNet, ArcFace, DeepFace
- Accuracy: 99.8% on benchmark datasets
5. Image Generation
- Use: Create realistic synthetic images
- Examples: Art generation (DALL-E, Midjourney), data augmentation, virtual try-on
- Architecture: GANs, Diffusion models (Stable Diffusion)
- Quality: Photorealistic outputs indistinguishable from real photos
import numpy as np
import matplotlib.pyplot as plt
def simulate_image_classification():
"""
Demonstrate image classification pipeline.
Example: Classifying medical images (chest X-rays).
"""
# Simulate CNN feature extraction on medical images
# In reality, you'd use pre-trained models like ResNet
categories = ['Normal', 'Pneumonia', 'COVID-19', 'Tuberculosis']
# Simulate prediction probabilities for 5 test images
predictions = np.array([
[0.92, 0.05, 0.02, 0.01], # Image 1: Normal (confident)
[0.03, 0.89, 0.05, 0.03], # Image 2: Pneumonia (confident)
[0.02, 0.15, 0.78, 0.05], # Image 3: COVID-19 (confident)
[0.25, 0.30, 0.25, 0.20], # Image 4: Uncertain
[0.01, 0.02, 0.03, 0.94], # Image 5: Tuberculosis (confident)
])
ground_truth = [0, 1, 2, 3, 3] # True labels
# Visualize predictions
fig, axes = plt.subplots(1, 5, figsize=(18, 4))
for i, ax in enumerate(axes):
# Simulate X-ray image (random noise for demo)
image = np.random.rand(64, 64) * 0.5 + 0.3
ax.imshow(image, cmap='gray')
ax.axis('off')
# Predicted class
pred_class = np.argmax(predictions[i])
confidence = predictions[i, pred_class]
true_class = ground_truth[i]
# Color: green if correct, red if wrong
color = 'green' if pred_class == true_class else 'red'
title = f"Pred: {categories[pred_class]}\n({confidence*100:.1f}%)"
if pred_class != true_class:
title += f"\nTrue: {categories[true_class]}"
ax.set_title(title, fontsize=10, fontweight='bold', color=color)
plt.suptitle('Medical Image Classification (Chest X-Ray Analysis)',
fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
# Performance metrics
accuracy = np.mean([np.argmax(predictions[i]) == ground_truth[i]
for i in range(len(ground_truth))])
print("="*60)
print("COMPUTER VISION: IMAGE CLASSIFICATION")
print("="*60)
print(f"Task: Chest X-ray diagnosis")
print(f"Classes: {len(categories)}")
print(f"Test samples: {len(predictions)}")
print(f"Accuracy: {accuracy*100:.1f}%")
print("\nPrediction details:")
for i in range(len(predictions)):
pred = np.argmax(predictions[i])
conf = predictions[i, pred]
true = ground_truth[i]
status = "?" if pred == true else "?"
print(f" Image {i+1}: {status} Predicted {categories[pred]} "
f"({conf*100:.1f}%), True: {categories[true]}")
print("\n?? Real-world impact:")
print(" - Early disease detection (saves lives)")
print(" - Radiologist assistance (faster diagnosis)")
print(" - Remote healthcare (underserved areas)")
print(" - 24/7 availability (no fatigue)")
print("\n?? Deployed systems:")
print(" - Google's diabetic retinopathy detection")
print(" - Zebra Medical Vision (radiology AI)")
print(" - PathAI (cancer diagnosis)")
simulate_image_classification()
Natural Language Processing Applications
Neural networks have revolutionized how machines understand and generate human language.
NLP Breakthroughs
1. Machine Translation
- Task: Translate text between languages
- Models: Transformers (Google Translate, DeepL)
- Achievement: Near-human quality for common language pairs
- Impact: Breaking language barriers globally
2. Text Generation
- Task: Generate coherent, contextual text
- Models: GPT-3/4, ChatGPT, Claude
- Achievement: Human-like writing, code generation, creative content
- Impact: Content creation, education, programming assistance
3. Sentiment Analysis
- Task: Determine emotional tone of text
- Models: BERT, RoBERTa, DistilBERT
- Achievement: 90%+ accuracy on product reviews
- Impact: Customer feedback analysis, social media monitoring
4. Question Answering
- Task: Answer questions from text/knowledge base
- Models: BERT-based QA, T5, RAG systems
- Achievement: Superhuman performance on SQuAD benchmark
- Impact: Customer support bots, search engines, virtual assistants
5. Named Entity Recognition (NER)
- Task: Identify entities (people, places, organizations)
- Models: BiLSTM-CRF, BERT-NER
- Achievement: F1 scores >95% on news articles
- Impact: Information extraction, document processing
import numpy as np
import matplotlib.pyplot as plt
def simulate_sentiment_analysis():
"""
Demonstrate sentiment analysis on customer reviews.
"""
# Sample customer reviews
reviews = [
"This product is absolutely amazing! Best purchase ever!",
"Terrible quality. Broke after one day. Very disappointed.",
"It's okay, nothing special. Does the job.",
"Love it! Exceeded my expectations. Highly recommend!",
"Waste of money. Would not recommend to anyone.",
"Pretty good for the price. Happy with my purchase."
]
# Simulate BERT sentiment predictions (positive, neutral, negative)
# In reality, you'd use a pre-trained BERT model
sentiments = np.array([
[0.95, 0.03, 0.02], # Review 1: Very positive
[0.02, 0.05, 0.93], # Review 2: Very negative
[0.15, 0.75, 0.10], # Review 3: Neutral
[0.92, 0.06, 0.02], # Review 4: Very positive
[0.01, 0.04, 0.95], # Review 5: Very negative
[0.70, 0.25, 0.05], # Review 6: Positive
])
sentiment_labels = ['Positive', 'Neutral', 'Negative']
colors = ['#28a745', '#ffc107', '#dc3545'] # Green, yellow, red
# Visualize sentiment distribution
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()
for i, ax in enumerate(axes):
# Bar chart for this review
bars = ax.bar(sentiment_labels, sentiments[i], color=colors, alpha=0.7, edgecolor='black')
ax.set_ylim([0, 1])
ax.set_ylabel('Probability', fontsize=10)
ax.set_title(f'Review {i+1}', fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
# Add percentage labels
for bar, prob in zip(bars, sentiments[i]):
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width()/2, height,
f'{prob*100:.0f}%', ha='center', va='bottom', fontsize=9)
# Add truncated review text
review_text = reviews[i][:40] + "..." if len(reviews[i]) > 40 else reviews[i]
ax.text(0.5, -0.25, f'"{review_text}"', transform=ax.transAxes,
ha='center', fontsize=8, style='italic', wrap=True)
plt.suptitle('Sentiment Analysis on Customer Reviews', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
print("="*60)
print("NLP: SENTIMENT ANALYSIS")
print("="*60)
# Calculate overall sentiment distribution
avg_sentiment = sentiments.mean(axis=0)
print(f"\nAnalyzed {len(reviews)} reviews:")
for i, review in enumerate(reviews):
pred = sentiment_labels[np.argmax(sentiments[i])]
conf = sentiments[i, np.argmax(sentiments[i])]
print(f"\nReview {i+1}: {pred} ({conf*100:.1f}%)")
print(f' "{review}"')
print(f"\n?? Overall sentiment distribution:")
print(f" Positive: {avg_sentiment[0]*100:.1f}%")
print(f" Neutral: {avg_sentiment[1]*100:.1f}%")
print(f" Negative: {avg_sentiment[2]*100:.1f}%")
print("\n?? Business applications:")
print(" - Product review analysis (identify issues)")
print(" - Social media monitoring (brand reputation)")
print(" - Customer support prioritization (urgent issues)")
print(" - Market research (consumer opinions)")
print("\n?? Real implementations:")
print(" - Amazon product review analysis")
print(" - Twitter sentiment tracking")
print(" - Customer feedback dashboards")
simulate_sentiment_analysis()
Time Series and Forecasting
Neural networks predict future values based on historical patterns, crucial for finance, weather, and demand forecasting.
import numpy as np
import matplotlib.pyplot as plt
def simulate_stock_price_forecasting():
"""
Demonstrate time series forecasting with LSTM.
Example: Stock price prediction.
"""
# Generate synthetic stock price data
np.random.seed(42)
days = 200
# Trend + seasonality + noise
trend = np.linspace(100, 150, days)
seasonality = 10 * np.sin(np.arange(days) * 2 * np.pi / 30)
noise = np.random.randn(days) * 3
stock_price = trend + seasonality + noise
# Split into train and test
train_size = 150
train_data = stock_price[:train_size]
test_data = stock_price[train_size:]
# Simulate LSTM predictions (in reality, you'd train an LSTM)
# Predictions have some error but follow the pattern
predictions = test_data + np.random.randn(len(test_data)) * 2
# Visualize
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
# Full time series with train/test split
axes[0].plot(range(train_size), train_data, label='Training Data',
linewidth=2, color='#3B9797')
axes[0].plot(range(train_size, days), test_data, label='Actual Price',
linewidth=2, color='#132440')
axes[0].plot(range(train_size, days), predictions, label='LSTM Predictions',
linewidth=2, color='#BF092F', linestyle='--')
axes[0].axvline(x=train_size, color='orange', linestyle='--',
linewidth=2, label='Train/Test Split')
axes[0].set_xlabel('Day', fontsize=12)
axes[0].set_ylabel('Stock Price ($)', fontsize=12)
axes[0].set_title('Stock Price Forecasting with LSTM', fontsize=13, fontweight='bold')
axes[0].legend(loc='upper left', fontsize=11)
axes[0].grid(True, alpha=0.3)
# Prediction error analysis
errors = predictions - test_data
axes[1].bar(range(len(errors)), errors, color=['red' if e < 0 else 'green' for e in errors],
alpha=0.7, edgecolor='black')
axes[1].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[1].set_xlabel('Day (Test Set)', fontsize=12)
axes[1].set_ylabel('Prediction Error ($)', fontsize=12)
axes[1].set_title('Prediction Errors (Predicted - Actual)', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
# Performance metrics
mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors**2))
mape = np.mean(np.abs(errors / test_data)) * 100
print("="*60)
print("TIME SERIES: STOCK PRICE FORECASTING")
print("="*60)
print(f"Training samples: {train_size}")
print(f"Test samples: {len(test_data)}")
print(f"\nPerformance metrics:")
print(f" Mean Absolute Error (MAE): ${mae:.2f}")
print(f" Root Mean Squared Error (RMSE): ${rmse:.2f}")
print(f" Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
print(f"\nSample predictions:")
for i in range(min(5, len(test_data))):
actual = test_data[i]
pred = predictions[i]
error = pred - actual
print(f" Day {train_size + i + 1}: Actual=${actual:.2f}, "
f"Predicted=${pred:.2f}, Error=${error:+.2f}")
print("\n?? Time series applications:")
print(" - Stock market prediction")
print(" - Demand forecasting (retail inventory)")
print(" - Energy consumption prediction")
print(" - Weather forecasting")
print(" - Traffic prediction")
print("\n?? Industry examples:")
print(" - Walmart: Demand forecasting for 500M+ SKUs")
print(" - Uber: Ride demand prediction (surge pricing)")
print(" - Google: Data center energy optimization")
print(" - Amazon: Inventory management")
print("\n?? Architecture choices:")
print(" - Short sequences (<50 steps): Simple RNN, GRU")
print(" - Long sequences (>50 steps): LSTM, Transformers")
print(" - Multiple variables: Multivariate LSTM")
print(" - Very long sequences: Temporal Convolutional Networks")
simulate_stock_price_forecasting()
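One practical detail the simulation above glosses over: sequence models like LSTMs consume fixed-length windows, so the raw series must first be sliced into (input window, next value) pairs. A minimal sketch (the window length and the stand-in series are arbitrary choices):

```python
import numpy as np

def make_windows(series, window=30, horizon=1):
    """Slice a 1-D series into (samples, window) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])          # past `window` values
        y.append(series[i + window + horizon - 1])  # value `horizon` steps ahead
    return np.array(X), np.array(y)

prices = np.linspace(100, 150, 200)  # stand-in for a 200-day price series
X, y = make_windows(prices, window=30)
print(X.shape, y.shape)  # (170, 30) (170,)
```

When splitting windowed data into train and test sets, split by time (as the demo above does), never randomly, or future information leaks into training.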
Recommendation Systems
Neural networks power personalized recommendations on platforms like Netflix, Amazon, and Spotify.
Recommendation System Approaches
1. Collaborative Filtering
- Idea: Users with similar preferences will like similar items
- Method: Neural matrix factorization, autoencoders
- Example: "Users who liked X also liked Y"
- Challenge: Cold start problem (new users/items)
2. Content-Based Filtering
- Idea: Recommend items similar to what user liked before
- Method: CNNs for image features, transformers for text
- Example: "Because you watched Inception, try Interstellar"
- Advantage: Works for new items
3. Hybrid Systems
- Idea: Combine collaborative + content-based
- Method: Deep neural networks with multiple inputs
- Example: Netflix's recommendation engine
- Performance: Best of both worlds
4. Session-Based Recommendations
- Idea: Predict next action based on current session
- Method: RNNs, GRU4Rec, Transformers
- Example: "Customers who viewed this also viewed..."
- Use case: Anonymous users, short-term interests
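The collaborative-filtering idea above, predicting a rating from learned user and item representations, reduces at inference time to a dot product plus bias terms. A minimal sketch with hypothetical, untrained embedding tables (in a real system these would be learned by minimizing squared error on known ratings):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, dim = 5, 10, 4

# Embedding tables (randomly initialized stand-ins for learned factors)
user_emb = rng.normal(scale=0.1, size=(n_users, dim))
item_emb = rng.normal(scale=0.1, size=(n_items, dim))
user_bias = np.zeros(n_users)
item_bias = np.zeros(n_items)
global_mean = 3.0  # assumed average rating in the dataset

def predict(u, i):
    """Predicted rating = global mean + biases + user-item interaction."""
    return global_mean + user_bias[u] + item_bias[i] + user_emb[u] @ item_emb[i]

# Score every item for user 0 and recommend the top 3
scores = [predict(0, i) for i in range(n_items)]
top3 = np.argsort(scores)[::-1][:3]
print(top3)
```

The bias terms matter in practice: some users rate everything high and some items are broadly popular, and separating those effects lets the embeddings model genuine taste.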
import numpy as np
import matplotlib.pyplot as plt
def simulate_recommendation_system():
"""
Demonstrate collaborative filtering with neural networks.
Example: Movie recommendations.
"""
# Simulate user-item rating matrix (5 users, 10 movies)
# Ratings: 1-5 stars, 0 = not rated
user_ratings = np.array([
[5, 4, 0, 0, 2, 0, 0, 5, 0, 1], # User 1: Likes action (movies 0,1,7)
[4, 5, 0, 0, 1, 0, 0, 4, 0, 2], # User 2: Similar to User 1
[0, 0, 5, 4, 0, 5, 4, 0, 0, 0], # User 3: Likes drama (movies 2,3,5,6)
[0, 0, 4, 5, 0, 4, 5, 0, 0, 0], # User 4: Similar to User 3
[3, 3, 3, 3, 3, 3, 3, 3, 0, 0], # User 5: Rates everything average
])
movie_names = ['Action1', 'Action2', 'Drama1', 'Drama2', 'Horror',
'Drama3', 'Drama4', 'Action3', 'Comedy', 'Horror2']
# Simulate neural network predictions for unrated movies
# (In reality, train a matrix factorization network)
predictions = user_ratings.copy().astype(float)
# Predict ratings for user 1's unrated movies
predictions[0, 2] = 2.5 # Drama1 (different taste)
predictions[0, 3] = 2.3 # Drama2
predictions[0, 4] = 3.0 # Horror
predictions[0, 5] = 2.2 # Drama3
predictions[0, 6] = 2.1 # Drama4
predictions[0, 8] = 3.5 # Comedy
# Visualize ratings and recommendations
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Heatmap of all user ratings
im1 = axes[0].imshow(user_ratings, cmap='YlOrRd', aspect='auto', vmin=0, vmax=5)
axes[0].set_xlabel('Movie', fontsize=12)
axes[0].set_ylabel('User', fontsize=12)
axes[0].set_title('User-Movie Rating Matrix\n(0 = Not Rated)',
fontsize=13, fontweight='bold')
axes[0].set_xticks(range(len(movie_names)))
axes[0].set_xticklabels(movie_names, rotation=45, ha='right', fontsize=9)
axes[0].set_yticks(range(5))
axes[0].set_yticklabels([f'User {i+1}' for i in range(5)])
# Add rating values
for i in range(5):
for j in range(10):
rating = user_ratings[i, j]
if rating > 0:
axes[0].text(j, i, str(rating), ha='center', va='center',
color='white' if rating >= 3 else 'black', fontweight='bold')
plt.colorbar(im1, ax=axes[0], label='Rating (1-5 stars)')
# Recommendations for User 1
user_idx = 0
unrated_movies = np.where(user_ratings[user_idx] == 0)[0]
predicted_ratings = predictions[user_idx, unrated_movies]
# Sort by predicted rating
sorted_indices = np.argsort(predicted_ratings)[::-1]
top_movies = unrated_movies[sorted_indices]
top_ratings = predicted_ratings[sorted_indices]
axes[1].barh([movie_names[i] for i in top_movies], top_ratings,
color='#3B9797', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Predicted Rating', fontsize=12)
axes[1].set_title('Recommendations for User 1\n(Unrated Movies, Sorted by Prediction)',
fontsize=13, fontweight='bold')
axes[1].set_xlim([0, 5])
axes[1].grid(True, alpha=0.3, axis='x')
# Add rating values
for i, (movie, rating) in enumerate(zip(top_movies, top_ratings)):
axes[1].text(rating + 0.1, i, f'{rating:.1f}★',
va='center', fontsize=10, fontweight='bold')
plt.tight_layout()
plt.show()
print("="*60)
print("RECOMMENDATION SYSTEM: COLLABORATIVE FILTERING")
print("="*60)
print(f"\nUser 1's rated movies:")
rated_movies = np.where(user_ratings[0] > 0)[0]
for movie_idx in rated_movies:
print(f" {movie_names[movie_idx]}: {user_ratings[0, movie_idx]}?")
print(f"\nTop 5 recommendations for User 1:")
for i, (movie_idx, rating) in enumerate(zip(top_movies[:5], top_ratings[:5]), 1):
print(f" {i}. {movie_names[movie_idx]}: {rating:.1f}? (predicted)")
print("\n?? How it works:")
print(" 1. Learn user embeddings (user preferences)")
print(" 2. Learn movie embeddings (movie characteristics)")
print(" 3. Predict rating = dot(user_embedding, movie_embedding)")
print(" 4. Recommend highest predicted ratings")
print("\n?? Real-world impact:")
print(" - Netflix: 80% of watched content from recommendations")
print(" - Amazon: 35% of revenue from recommendations")
print(" - YouTube: 70% of watch time from recommendations")
print(" - Spotify: Discover Weekly playlist (personalized)")
print("\n?? Architecture:")
print(" - Input: User ID + Movie ID (one-hot or embeddings)")
print(" - Hidden: Dense layers with ReLU")
print(" - Output: Predicted rating (1-5)")
print(" - Loss: Mean Squared Error between predicted and actual ratings")
simulate_recommendation_system()
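The printed summary above reduces collaborative filtering to a dot product of learned embeddings trained with MSE. As a minimal self-contained sketch of that idea — using a made-up 3×3 rating matrix and hypothetical hyperparameters, not the movie data above — the embeddings can be fit with plain NumPy gradient descent:

```python
import numpy as np

# Minimal sketch of embedding-based rating prediction (illustrative only):
# factorize a tiny rating matrix R ≈ U @ M.T by gradient descent on MSE.
rng = np.random.default_rng(0)
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])           # 0 = not rated
mask = R > 0                              # train only on observed ratings
k, lr = 2, 0.05                           # embedding size, learning rate
U = rng.normal(scale=0.1, size=(3, k))    # user embeddings
M = rng.normal(scale=0.1, size=(3, k))    # movie embeddings

for _ in range(2000):
    err = (U @ M.T - R) * mask            # error on observed entries only
    U -= lr * err @ M                     # gradient of 0.5*MSE w.r.t. U
    M -= lr * err.T @ U                   # ...and w.r.t. M

pred = U @ M.T
print(np.round(pred, 1))
```

The observed entries converge toward the given ratings, while the zero (unrated) cells receive predictions from the learned embeddings — exactly the quantities a recommender ranks.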
Healthcare and Science Applications
Neural networks are revolutionizing medicine and scientific research, from drug discovery to protein folding.
Healthcare AI Breakthroughs
1. Medical Image Analysis
- Cancer Detection: Mammography, skin lesion classification (dermatology)
- Performance: Match or exceed expert radiologists
- Example: Google's lymph node metastasis detection (99% accuracy)
2. Drug Discovery
- Task: Predict molecular properties, design new compounds
- Models: Graph neural networks, transformers for molecules
- Impact: Reduce drug development time from 10+ years to 1-2 years
- Example: Insilico Medicine discovered drug candidates in 46 days
3. Protein Structure Prediction
- Task: Predict 3D protein structure from amino acid sequence
- Model: AlphaFold 2 (DeepMind)
- Achievement: Solved 50-year-old problem, atomic-level accuracy
- Impact: Accelerate understanding of diseases, design therapies
4. Genomics and Personalized Medicine
- Task: Predict disease risk from genetic data
- Models: CNNs for DNA sequences, transformers for gene expression
- Application: Cancer risk assessment, treatment selection
5. Clinical Decision Support
- Task: Assist doctors with diagnosis and treatment plans
- Models: Multi-modal networks (text + imaging + lab results)
- Example: IBM Watson for Oncology
import numpy as np
import matplotlib.pyplot as plt

def simulate_drug_discovery():
    """
    Demonstrate molecular property prediction for drug discovery.
    Example: Predicting drug-likeness and toxicity.
    """
    # Simulate molecules with different properties
    # In reality, you'd use graph neural networks on molecular structures
    molecules = [
        'Aspirin', 'Penicillin', 'Insulin', 'Morphine',
        'Caffeine', 'Nicotine', 'Ethanol', 'Glucose'
    ]
    # Simulated predictions (0-1 scale)
    # Properties: Drug-likeness, Bioavailability, Toxicity, Synthesizability
    properties = np.array([
        [0.85, 0.90, 0.15, 0.95],  # Aspirin: Good drug candidate
        [0.90, 0.75, 0.20, 0.80],  # Penicillin: Good drug
        [0.70, 0.40, 0.10, 0.30],  # Insulin: Low bioavailability (protein)
        [0.75, 0.60, 0.65, 0.70],  # Morphine: High toxicity
        [0.80, 0.85, 0.25, 0.90],  # Caffeine: Good properties
        [0.65, 0.75, 0.70, 0.85],  # Nicotine: High toxicity
        [0.50, 0.95, 0.45, 0.99],  # Ethanol: Moderate toxicity
        [0.40, 0.30, 0.05, 0.95],  # Glucose: Not drug-like
    ])
    property_names = ['Drug-likeness', 'Bioavailability', 'Toxicity', 'Synthesizability']
    # Visualize
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.flatten()
    colors = ['#28a745', '#3B9797', '#dc3545', '#ffc107']
    for i, (prop_name, color) in enumerate(zip(property_names, colors)):
        ax = axes[i]
        values = properties[:, i]
        bars = ax.barh(molecules, values, color=color, alpha=0.7, edgecolor='black')
        ax.set_xlabel('Score (0-1)', fontsize=12)
        ax.set_title(prop_name, fontsize=13, fontweight='bold')
        ax.set_xlim([0, 1])
        ax.grid(True, alpha=0.3, axis='x')
        # Add score labels
        for bar, val in zip(bars, values):
            ax.text(val + 0.02, bar.get_y() + bar.get_height()/2,
                    f'{val:.2f}', va='center', fontsize=10, fontweight='bold')
        # Add threshold line for toxicity
        if prop_name == 'Toxicity':
            ax.axvline(x=0.5, color='red', linestyle='--', linewidth=2,
                       label='Safety Threshold')
            ax.legend(fontsize=9)
    plt.suptitle('Molecular Property Prediction for Drug Discovery',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    print("="*60)
    print("HEALTHCARE: AI-DRIVEN DRUG DISCOVERY")
    print("="*60)
    print("\nMolecular property predictions:")
    for i, mol in enumerate(molecules):
        print(f"\n{mol}:")
        print(f"  Drug-likeness: {properties[i, 0]:.2f} "
              f"({'Good' if properties[i, 0] > 0.7 else 'Poor'})")
        print(f"  Bioavailability: {properties[i, 1]:.2f}")
        print(f"  Toxicity: {properties[i, 2]:.2f} "
              f"({'High' if properties[i, 2] > 0.5 else 'Low'})")
        print(f"  Synthesizability: {properties[i, 3]:.2f}")
    # Identify best drug candidates
    # Good drug: high drug-likeness, high bioavailability, low toxicity, high synth
    drug_score = properties[:, 0] * properties[:, 1] * (1 - properties[:, 2]) * properties[:, 3]
    best_idx = np.argmax(drug_score)
    print(f"\nBest drug candidate: {molecules[best_idx]}")
    print(f"  Overall score: {drug_score[best_idx]:.3f}")
    print("\nHow neural networks help:")
    print("  - Screen millions of molecules in days (vs years)")
    print("  - Predict properties without synthesis")
    print("  - Design novel molecules with desired properties")
    print("  - Optimize existing drugs (reduce side effects)")
    print("\nReal breakthroughs:")
    print("  - AlphaFold: Protein structure prediction (Nobel-worthy)")
    print("  - Insilico Medicine: New drug in 46 days (normally 4+ years)")
    print("  - Atomwise: COVID-19 drug candidates identified in weeks")
    print("  - BenevolentAI: Repurposed existing drugs for new diseases")
    print("\nArchitecture:")
    print("  - Input: Molecular graph (atoms = nodes, bonds = edges)")
    print("  - Model: Graph Neural Networks (GNN) or Transformers")
    print("  - Output: Property predictions (continuous or classification)")
    print("  - Training: Large databases (ChEMBL, PubChem, 100M+ molecules)")

simulate_drug_discovery()
Industry Case Studies
Transformative Industry Applications
1. Autonomous Vehicles (Tesla, Waymo)
- Challenge: Navigate safely in complex environments
- Solution: Multi-camera CNNs + transformers for scene understanding
- Components: Object detection, lane detection, trajectory prediction
- Impact: Billions of autonomous miles driven
2. Fraud Detection (PayPal, Stripe)
- Challenge: Identify fraudulent transactions in real-time
- Solution: Deep learning on transaction patterns
- Techniques: Anomaly detection, graph neural networks
- Impact: >99.9% detection accuracy, billions saved
3. Smart Assistants (Alexa, Siri, Google Assistant)
- Challenge: Understand natural speech, respond intelligently
- Solution: Speech recognition (CNNs/RNNs) + NLU (transformers)
- Capabilities: Multi-turn dialogue, context awareness
- Scale: Billions of queries daily
4. Content Moderation (Facebook, YouTube)
- Challenge: Remove harmful content at scale
- Solution: CNNs for images/video, transformers for text
- Detection: Violence, hate speech, misinformation
- Scale: Billions of posts/videos reviewed daily
5. Predictive Maintenance (GE, Siemens)
- Challenge: Predict equipment failures before they happen
- Solution: Time series models (LSTM) on sensor data
- Benefits: Reduce downtime, optimize maintenance schedules
- Savings: Millions in avoided failures
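The fraud and maintenance systems above learn rich representations, but their core move is the same: score how far an event deviates from normal behavior and flag the extremes. A toy z-score detector on simulated transaction amounts illustrates the idea (the data, seed, and threshold here are all hypothetical, standing in for the deep models described above):

```python
import numpy as np

# Toy anomaly detector: flag transactions far from typical spending.
rng = np.random.default_rng(1)
amounts = rng.normal(50, 10, size=500)        # simulated normal transactions
amounts = np.append(amounts, [400.0, 650.0])  # two injected frauds

mu, sigma = amounts.mean(), amounts.std()
z = np.abs(amounts - mu) / sigma              # standardized deviation
flagged = np.where(z > 4)[0]                  # conservative threshold

print(f"Flagged {len(flagged)} of {len(amounts)} transactions")
print(amounts[flagged])                       # the injected outliers
```

Real systems replace the single "amount" feature with learned embeddings of the full transaction context, but the thresholding-on-a-score structure carries over.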
Real-World Applications Summary
What We Explored:
- ✓ Computer Vision: Image classification, object detection, medical imaging
- ✓ NLP: Translation, sentiment analysis, text generation, question answering
- ✓ Time Series: Stock prediction, demand forecasting, energy optimization
- ✓ Recommendations: Collaborative filtering, content-based, hybrid systems
- ✓ Healthcare: Drug discovery, protein folding, disease diagnosis
- ✓ Industry: Autonomous vehicles, fraud detection, smart assistants
Key Takeaways:
- Neural networks solve problems impossible for traditional algorithms
- Real-world deployment requires careful engineering (data, monitoring, ethics)
- Domain expertise + AI = powerful solutions
- Continuous improvement: models retrained as new data arrives
- Ethical considerations: bias, privacy, transparency
Next: We'll conclude with learning resources and next steps in your neural network journey!
Conclusion and Further Learning
Congratulations! You've completed a comprehensive journey through artificial neural networks, from basic perceptrons to cutting-edge transformers. Let's recap what you've learned and chart your path forward.
What You've Accomplished
Your Learning Journey
Foundations (Sections 1-3):
- ✓ Understanding of biological inspiration and neural network evolution
- ✓ Recognition of classical ML limitations and why ANNs emerged
- ✓ Historical context from perceptron (1958) to modern deep learning
Core Concepts (Sections 4-6):
- ✓ Artificial neuron mechanics: weighted sum + activation
- ✓ Activation functions: Sigmoid, Tanh, ReLU, Leaky ReLU
- ✓ Forward propagation: data flow through layers
- ✓ Loss functions: MSE, Binary Cross-Entropy
- ✓ Backpropagation: gradient computation via chain rule
- ✓ Optimizers: SGD, Momentum, Adam, RMSprop
- ✓ Built complete neural network from scratch (XOR problem)
Architectures (Sections 7-12):
- ✓ Feedforward Networks: Dense layers for tabular data
- ✓ CNNs: Convolution, pooling, feature hierarchies (vision tasks)
- ✓ RNNs: Sequential processing, vanishing gradients, BPTT
- ✓ LSTMs/GRUs: Long-term dependencies, gating mechanisms
- ✓ Autoencoders: Unsupervised learning, dimensionality reduction, denoising
- ✓ GANs: Adversarial training, generative modeling
- ✓ Transformers: Attention mechanism, multi-head attention, positional encoding
Practical Skills (Sections 13-14):
- ✓ Overfitting prevention: Dropout, L2 regularization, early stopping
- ✓ Hyperparameter tuning: Learning rate, batch size, architecture
- ✓ Data preprocessing: Normalization, standardization, augmentation
- ✓ Batch normalization for training stability
- ✓ Debugging strategies for common issues
- ✓ Real-world applications across 6+ domains
Hands-On Experience:
- ✓ Implemented 15+ neural network architectures from scratch
- ✓ Solved 10+ practical problems (XOR, MNIST-like, time series, etc.)
- ✓ Created 50+ visualizations for understanding
- ✓ All code examples copy-paste ready for Jupyter notebooks
Recommended Learning Path
Now that you have a solid foundation, here's a structured path to mastery:
3-Stage Learning Roadmap
Stage 1: Solidify Foundations (1-3 months)
- Practice implementations: Re-implement networks from this guide in PyTorch/TensorFlow
- Kaggle competitions: Start with "Getting Started" competitions
  - Titanic (classification)
  - House Prices (regression)
  - Digit Recognizer (MNIST)
- Math review: Linear algebra, calculus, probability (3Blue1Brown videos)
- Read papers: Start with foundational papers (AlexNet, ResNet, LSTM)
Stage 2: Specialize and Build (3-6 months)
- Choose domain: Computer vision, NLP, reinforcement learning, or time series
- Deep dive courses: Domain-specific courses (Fast.ai, Coursera specializations)
- Build projects: 3-5 substantial projects
  - CV: Custom image classifier, object detector
  - NLP: Sentiment analyzer, text generator, chatbot
  - Time Series: Stock predictor, demand forecaster
- Contribute to open source: Fix bugs, add features to ML libraries
- Kaggle competitions: Move to intermediate competitions, aim for top 10%
Stage 3: Expert Level (6-12+ months)
- Research papers: Read 1-2 papers weekly (arxiv.org, Papers with Code)
- Reproduce papers: Implement cutting-edge techniques from scratch
- Production deployment: Learn MLOps (Docker, Kubernetes, model serving)
- Publish work: Write blog posts, tutorials, or research papers
- Conference talks: Present at meetups or conferences
- Advanced competitions: Kaggle Grandmaster track, winning solutions
Essential Resources
Online Courses
Top-Rated Courses
Beginner-Friendly:
- Fast.ai - Practical Deep Learning for Coders
  - Free, top-down approach
  - Build models from day 1
  - PyTorch-based
  - course.fast.ai
- Andrew Ng - Deep Learning Specialization (Coursera)
  - 5-course series
  - Bottom-up, mathematical approach
  - TensorFlow/Keras
  - coursera.org/specializations/deep-learning
Intermediate/Advanced:
- Stanford CS231n - CNNs for Visual Recognition
  - Free lecture videos + notes
  - Deep dive into computer vision
  - cs231n.stanford.edu
- Stanford CS224n - NLP with Deep Learning
  - Comprehensive NLP coverage
  - Transformers, BERT, GPT
  - web.stanford.edu/class/cs224n/
- MIT 6.S191 - Introduction to Deep Learning
  - Fast-paced, comprehensive
  - Latest research trends
  - introtodeeplearning.com
Books
import matplotlib.pyplot as plt
import numpy as np

def recommend_books():
    """
    Recommended books for neural network learning.
    """
    books = {
        'Beginner': [
            ('Deep Learning with Python', 'François Chollet', 2021, 'Keras creator, hands-on'),
            ('Grokking Deep Learning', 'Andrew Trask', 2019, 'Build from scratch, intuitive'),
            ('Make Your Own Neural Network', 'Tariq Rashid', 2016, 'Simple, visual explanations'),
        ],
        'Intermediate': [
            ('Deep Learning', 'Goodfellow, Bengio, Courville', 2016, 'The "Bible" of DL'),
            ('Hands-On Machine Learning', 'Aurélien Géron', 2022, 'Scikit-Learn, Keras, TF'),
            ('Deep Learning for Coders', 'Jeremy Howard, Sylvain Gugger', 2020, 'Fast.ai approach'),
        ],
        'Advanced': [
            ('Pattern Recognition and ML', 'Christopher Bishop', 2006, 'Mathematical foundations'),
            ('Dive into Deep Learning', 'Zhang et al.', 2023, 'Interactive, comprehensive'),
            ('Understanding Deep Learning', 'Simon Prince', 2023, 'Modern architectures'),
        ],
        'Specialized': [
            ('Computer Vision (Szeliski)', 'Richard Szeliski', 2022, 'CV algorithms'),
            ('Speech and Language Processing', 'Jurafsky & Martin', 2023, 'NLP fundamentals'),
            ('Reinforcement Learning', 'Sutton & Barto', 2018, 'RL bible'),
        ]
    }
    print("="*70)
    print("RECOMMENDED BOOKS FOR NEURAL NETWORKS")
    print("="*70)
    for level, book_list in books.items():
        print(f"\n{level} Level:")
        print("-" * 70)
        for i, (title, author, year, note) in enumerate(book_list, 1):
            print(f"  {i}. \"{title}\"")
            print(f"     Author: {author} ({year})")
            print(f"     Note: {note}")
        print()
    # Visualize reading path
    categories = list(books.keys())
    counts = [len(books[cat]) for cat in categories]
    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ['#28a745', '#3B9797', '#BF092F', '#132440']
    bars = ax.bar(categories, counts, color=colors, alpha=0.7, edgecolor='black', width=0.6)
    ax.set_ylabel('Number of Recommended Books', fontsize=12)
    ax.set_title('Learning Path: Recommended Books by Level', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3, axis='y')
    # Add count labels
    for bar, count in zip(bars, counts):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, height,
                f'{count} books', ha='center', va='bottom', fontsize=11, fontweight='bold')
    plt.tight_layout()
    plt.show()
    print("\nReading strategy:")
    print("  1. Start with ONE beginner book (Deep Learning with Python recommended)")
    print("  2. Implement examples as you read")
    print("  3. Move to intermediate after 3-6 months of practice")
    print("  4. Use advanced books as references, not cover-to-cover")
    print("  5. Specialized books: Pick ONE domain, deep dive")

recommend_books()
Deep Learning Frameworks
Framework Comparison: PyTorch vs TensorFlow
PyTorch
- Pros:
- Pythonic, intuitive API
- Dynamic computation graphs (easier debugging)
- Preferred by researchers (80%+ of papers)
- Excellent for experimentation
- Growing industry adoption
- Cons:
- Deployment slightly more complex
- Smaller ecosystem than TensorFlow (historically)
- Best for: Research, prototyping, learning, CV, NLP
- Get started: pytorch.org/tutorials
TensorFlow / Keras
- Pros:
- Production-ready (TF Serving, TF Lite)
- Keras: Very beginner-friendly
- Strong mobile/edge deployment
- Mature ecosystem (TensorBoard, etc.)
- Google backing
- Cons:
- More verbose than PyTorch
- Debugging can be harder
- Best for: Production deployment, mobile apps, beginners (Keras)
- Get started: tensorflow.org/tutorials
Recommendation:
- Absolute beginners: Start with Keras (simplest API)
- Aiming for research: Learn PyTorch (industry standard for papers)
- Production focus: TensorFlow (better deployment tools)
- Best approach: Learn ONE deeply first, then pick up the other (concepts transfer!)
Other Frameworks Worth Knowing:
- JAX: High-performance, functional approach (Google Brain)
- MXNet: Used by Amazon, efficient distributed training
- Hugging Face: NLP library built on PyTorch/TensorFlow (transformers)
Foundational Papers
Must-Read Papers (Chronological)
Historical Foundations:
- 1986: "Learning representations by back-propagating errors" - Rumelhart, Hinton, Williams
- 1997: "Long Short-Term Memory" - Hochreiter & Schmidhuber
- 1998: "Gradient-Based Learning Applied to Document Recognition" - LeCun et al. (LeNet)
Deep Learning Era:
- 2012: "ImageNet Classification with Deep CNNs" - Krizhevsky et al. (AlexNet)
- 2014: "Generative Adversarial Networks" - Goodfellow et al. (GANs)
- 2015: "Deep Residual Learning" - He et al. (ResNet)
- 2017: "Attention is All You Need" - Vaswani et al. (Transformers)
Recent Breakthroughs:
- 2018: "BERT: Pre-training of Deep Bidirectional Transformers" - Devlin et al.
- 2020: "Language Models are Few-Shot Learners" - Brown et al. (GPT-3)
- 2021: "Highly accurate protein structure prediction with AlphaFold" - Jumper et al.
- 2022: "Photorealistic Text-to-Image Diffusion Models" - Saharia et al. (Imagen)
Where to find papers:
- arXiv.org: Pre-prints, latest research
- Papers with Code: Papers + code implementations
- Google Scholar: Search papers by topic
- Distill.pub: Interactive, visual explanations
Reading strategy:
- Read abstract and conclusion first
- Look at figures and tables
- Skim introduction and related work
- Deep dive into method section
- Try to implement key ideas
Community and Practice
import numpy as np
import matplotlib.pyplot as plt

def community_resources():
    """
    Overview of AI/ML communities and practice platforms.
    """
    communities = {
        'Learning Platforms': [
            ('Kaggle', 'Competitions, datasets, notebooks', '★★★★★'),
            ('Google Colab', 'Free GPUs, Jupyter notebooks', '★★★★★'),
            ('Hugging Face', 'Pre-trained models, datasets', '★★★★★'),
            ('Papers with Code', 'Papers + implementations', '★★★★★'),
        ],
        'Communities': [
            ('r/MachineLearning', 'Reddit: research discussions', '★★★★'),
            ('Towards Data Science', 'Medium: tutorials, articles', '★★★★'),
            ('AI Discord Servers', 'Real-time help, networking', '★★★★'),
            ('Local Meetups', 'In-person networking, talks', '★★★★★'),
        ],
        'YouTube Channels': [
            ('3Blue1Brown', 'Math visualizations', '★★★★★'),
            ('Two Minute Papers', 'Research paper summaries', '★★★★★'),
            ('Yannic Kilcher', 'Paper explanations', '★★★★'),
            ('Sentdex', 'Practical tutorials', '★★★★'),
        ],
        'Podcasts': [
            ('Lex Fridman AI Podcast', 'Deep conversations with experts', '★★★★★'),
            ('The TWIML AI Podcast', 'Weekly AI news, interviews', '★★★★'),
            ('Gradient Dissent', 'Wandb, ML engineering', '★★★★'),
        ],
    }
    print("="*70)
    print("COMMUNITY AND PRACTICE RESOURCES")
    print("="*70)
    for category, resources in communities.items():
        print(f"\n{category}:")
        print("-" * 70)
        for name, description, rating in resources:
            print(f"  • {name:<25} {description:<35} {rating}")
    print("\n" + "="*70)
    print("RECOMMENDED PRACTICE ROUTINE")
    print("="*70)
    routine = {
        'Daily (30-60 min)': [
            'Read 1 ML paper or article',
            'Code for 30 minutes (implement concepts)',
            'Review Kaggle notebooks or tutorials',
        ],
        'Weekly (3-5 hours)': [
            'Work on personal project (2-3 hours)',
            'Kaggle competition or new dataset exploration',
            'Watch 1-2 educational videos (lectures/tutorials)',
            'Write blog post or document learning',
        ],
        'Monthly': [
            'Complete 1 online course module',
            'Attend 1 meetup or webinar',
            'Reproduce 1 research paper',
            'Contribute to open-source ML project',
        ],
    }
    print()
    for period, activities in routine.items():
        print(f"{period}:")
        for activity in activities:
            print(f"  ✓ {activity}")
        print()
    # Visualize skill progression
    months = np.arange(1, 13)
    # Different learning curves (saturating exponentials)
    daily_skill = 100 * (1 - np.exp(-0.3 * months))
    weekly_skill = 100 * (1 - np.exp(-0.15 * months))
    occasional_skill = 100 * (1 - np.exp(-0.08 * months))
    plt.figure(figsize=(12, 6))
    plt.plot(months, daily_skill, linewidth=3, label='With Daily Practice',
             color='#28a745', marker='o', markersize=6)
    plt.plot(months, weekly_skill, linewidth=3, label='With Weekly Practice',
             color='#3B9797', marker='s', markersize=6)
    plt.plot(months, occasional_skill, linewidth=3, label='Occasional Practice',
             color='#BF092F', marker='^', markersize=6)
    plt.xlabel('Months of Learning', fontsize=12)
    plt.ylabel('Skill Level (%)', fontsize=12)
    plt.title('Skill Progression: Impact of Consistent Practice', fontsize=14, fontweight='bold')
    plt.legend(loc='lower right', fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.ylim([0, 105])
    # Add milestones
    plt.axhline(y=50, color='orange', linestyle='--', alpha=0.5)
    plt.text(12.2, 50, 'Job-Ready', fontsize=9, va='center')
    plt.axhline(y=80, color='red', linestyle='--', alpha=0.5)
    plt.text(12.2, 80, 'Expert', fontsize=9, va='center')
    plt.tight_layout()
    plt.show()
    print("Key takeaway:")
    print("  Consistency > Intensity")
    print("  Daily practice (even 30 min) beats weekend marathons!")

community_resources()
Your Next Steps
Action Plan: Start Today
Week 1: Consolidate Foundations
- Re-implement 3 networks from this guide in PyTorch/TensorFlow
- Create a GitHub repository for your implementations
- Join Kaggle, explore "Getting Started" competitions
- Watch 3Blue1Brown's neural network series (4 videos)
Month 1: First Project
- Choose a dataset that interests you (Kaggle, UCI ML Repository)
- Build end-to-end pipeline: data loading → preprocessing → model → evaluation
- Experiment with different architectures and hyperparameters
- Write a blog post documenting your process and learnings
- Share on LinkedIn/Twitter for feedback
Months 2-3: Deepen Knowledge
- Complete Andrew Ng's Deep Learning course OR Fast.ai Part 1
- Read and implement 3 foundational papers (AlexNet, ResNet, LSTM)
- Build 2 more projects in different domains (CV, NLP, or time series)
- Participate in 1 active Kaggle competition
- Contribute to 1 open-source ML project (fix bug, add feature)
Months 4-6: Specialize
- Choose specialization: CV, NLP, RL, or domain-specific (healthcare, finance)
- Take domain-specific course (CS231n for CV, CS224n for NLP)
- Build capstone project: production-ready application
  - Deploy with Streamlit/Gradio for demo
  - Docker containerization
  - CI/CD pipeline
- Network: Attend 2-3 meetups or conferences
- Start building portfolio website
Beyond 6 Months: Career/Research
- Industry Path:
  - Apply for ML Engineer / Data Scientist roles
  - Focus on MLOps: model deployment, monitoring, versioning
  - Learn cloud platforms (AWS SageMaker, GCP AI, Azure ML)
- Research Path:
  - Read 2-3 papers weekly, reproduce cutting-edge results
  - Contribute to top conferences (NeurIPS, ICML, CVPR)
  - Pursue PhD or research positions
- Entrepreneurship Path:
  - Build AI product solving a real problem
  - Validate with users, iterate quickly
  - Launch startup or consulting practice
Final Thoughts
Parting Words
You've taken the first major step.
Neural networks are not magic—they're mathematics, statistics, and clever engineering combined. You now understand the fundamentals that power ChatGPT, self-driving cars, medical AI, and countless other applications transforming our world.
Remember:
- Everyone starts as a beginner. Today's AI researchers struggled with backpropagation once.
- Learning is non-linear. Plateaus are normal. Breakthroughs come when you persist.
- Build, build, build. Theory matters, but practice cements understanding.
- Community is key. Learn together, teach others, ask questions.
- Stay curious. The field evolves rapidly—embrace continuous learning.
The field needs you.
AI is still young. We need diverse perspectives, creative problem-solving, and ethical thinking to ensure AI benefits humanity. Your journey doesn't end here—it's just beginning.
What will you build?
An app that helps doctors diagnose diseases? A model that predicts climate patterns? A system that makes education accessible? The tools are in your hands now.
Go forth and build the future.
"The best way to predict the future is to invent it." — Alan Kay
Thank You for Learning With Us!
Questions? Feedback? Found this helpful?
Share your journey, projects, or questions on social media with #NeuralNetworkGuide
Happy Learning!