Why CNNs Work for Images
Fully connected networks treat every pixel as an independent feature. A modest 28×28 grayscale image means 784 input neurons; connecting them to just 128 hidden neurons already costs over 100,000 parameters in the first layer alone. Scale that to a 224×224 RGB image with 256 hidden neurons and the count explodes to more than 38 million in a single layer.
CNNs leverage three fundamental principles:
- Local connectivity: Each neuron connects only to a small patch of the input
- Weight sharing: The same filter is applied everywhere in the image
- Translation equivariance: If an object moves, the feature map response moves correspondingly
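These properties can be checked directly. Here is a throwaway 1D sketch of equivariance (the helper `cross_correlate_1d` is defined just for this demo): shifting the input shifts the response by the same amount.

```python
import numpy as np

def cross_correlate_1d(signal, kernel):
    """Valid-mode 1D cross-correlation: slide the kernel, multiply, sum."""
    n, k = len(signal), len(kernel)
    return np.array([np.dot(signal[i:i+k], kernel) for i in range(n - k + 1)])

kernel = np.array([-1.0, 0.0, 1.0])   # 1D edge detector
signal = np.zeros(10)
signal[3] = 1.0                        # a single "bump" at position 3

shifted = np.roll(signal, 2)           # same bump, moved to position 5

out = cross_correlate_1d(signal, kernel)
out_shifted = cross_correlate_1d(shifted, kernel)

# The response pattern is identical, just displaced by 2 positions
print(out)
print(out_shifted)
print(np.allclose(out[:-2], out_shifted[2:]))  # True away from the border
```

The same argument carries over to 2D: because one filter is reused at every location, a translated object produces a translated feature map rather than an entirely different activation pattern.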
Let us quantify the parameter savings:
import numpy as np
# Compare parameter counts: Fully Connected vs CNN
# Input: 28x28 grayscale image (like MNIST)
input_size = 28 * 28 # 784 pixels
# Fully Connected Network: first hidden layer with 128 neurons
fc_params = input_size * 128 + 128 # weights + biases
print(f"Fully Connected Layer: {fc_params:,} parameters")
# 784 * 128 + 128 = 100,480 parameters
# CNN: 16 filters of size 3x3, single input channel
# Each filter: 3 * 3 * 1 = 9 weights + 1 bias = 10 params
cnn_filters = 16
kernel_size = 3
channels_in = 1
cnn_params = cnn_filters * (kernel_size * kernel_size * channels_in + 1)
print(f"Conv Layer (16 filters, 3x3): {cnn_params:,} parameters")
# 16 * (9 + 1) = 160 parameters
# Reduction factor
reduction = fc_params / cnn_params
print(f"Parameter reduction: {reduction:.0f}x fewer parameters")
# ~628x fewer parameters!
# For RGB 224x224 image
input_rgb = 224 * 224 * 3 # 150,528
fc_rgb_params = input_rgb * 256 + 256
cnn_rgb_params = 64 * (3 * 3 * 3 + 1) # 64 filters, 3x3, 3 channels
print(f"\nRGB 224x224 - FC layer: {fc_rgb_params:,} parameters")
print(f"RGB 224x224 - Conv layer: {cnn_rgb_params:,} parameters")
print(f"Reduction: {fc_rgb_params / cnn_rgb_params:.0f}x")
flowchart LR
A[Input Image\n28x28x1] --> B[Conv Layer\n3x3, 16 filters]
B --> C[ReLU]
C --> D[Max Pool\n2x2]
D --> E[Conv Layer\n3x3, 32 filters]
E --> F[ReLU]
F --> G[Max Pool\n2x2]
G --> H[Flatten]
H --> I[Dense\n128 units]
I --> J[Output\n10 classes]
Understanding Convolution Operations
A convolution operation slides a small matrix (the kernel or filter) over the input and computes element-wise multiplication followed by summation at each position. Mathematically:
$$(I * K)(i,j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m,n)$$
where $I$ is the input, $K$ is the kernel, and $(i,j)$ is the output position. The kernel has a fixed size (commonly $3 \times 3$ or $5 \times 5$) and “slides” across the entire input.
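To make the formula concrete, here is the sum at a single output position, computed with a hypothetical 3×3 patch and checked in NumPy (an illustration only, not part of the implementation below):

```python
import numpy as np

# One 3x3 patch of an input I, and a 3x3 kernel K
patch = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 1]], dtype=float)
K = np.array([[-1, 0, 1],
              [-1, 0, 1],
              [-1, 0, 1]], dtype=float)

# (I * K)(i, j) = sum over m, n of I(i+m, j+n) * K(m, n)
value = np.sum(patch * K)
print(value)  # each row contributes 0*(-1) + 0*0 + 1*1 = 1, so the total is 3.0
```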
We implement cross-correlation (what deep learning frameworks call “convolution”) without any library functions. True mathematical convolution flips the kernel first, but frameworks use cross-correlation in practice: since the kernel weights are learned, the flip makes no difference.
import numpy as np
def convolve2d(image, kernel):
"""
Perform 2D convolution (cross-correlation) from scratch.
Parameters:
image: 2D numpy array (H, W)
kernel: 2D numpy array (kH, kW)
Returns:
output: 2D numpy array
"""
img_h, img_w = image.shape
k_h, k_w = kernel.shape
# Output dimensions (valid convolution - no padding)
out_h = img_h - k_h + 1
out_w = img_w - k_w + 1
# Initialize output
output = np.zeros((out_h, out_w))
# Slide kernel over input
for i in range(out_h):
for j in range(out_w):
# Extract the patch
patch = image[i:i+k_h, j:j+k_w]
# Element-wise multiply and sum
output[i, j] = np.sum(patch * kernel)
return output
# Create a sample 8x8 image with a vertical edge
image = np.zeros((8, 8))
image[:, 4:] = 1.0 # Right half is white
print("Input image (vertical edge at column 4):")
print(image)
# Vertical edge detection kernel
kernel_vertical = np.array([
[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]
])
# Apply convolution
result = convolve2d(image, kernel_vertical)
print("\nVertical edge detection result:")
print(result)
print(f"\nOutput shape: {result.shape}")
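As noted above, true convolution flips the kernel while frameworks use cross-correlation. The relationship is easy to verify: cross-correlating with a kernel flipped along both axes reproduces mathematical convolution. A self-contained check (both helpers defined only for this demo):

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Valid-mode 2D cross-correlation (what deep learning calls convolution)."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

def true_convolve2d(image, kernel):
    """Mathematical convolution: the kernel indices run backwards."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for m in range(k_h):
                for n in range(k_w):
                    out[i, j] += image[i+m, j+n] * kernel[k_h-1-m, k_w-1-n]
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((6, 6))
kernel = rng.standard_normal((3, 3))

# Convolution == cross-correlation with the kernel flipped in both axes
print(np.allclose(true_convolve2d(image, kernel),
                  cross_correlate2d(image, np.flip(kernel))))  # True
```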
Different kernels detect different features. Here are the most common ones used in image processing:
import numpy as np
def convolve2d(image, kernel):
"""Perform 2D convolution (cross-correlation)."""
img_h, img_w = image.shape
k_h, k_w = kernel.shape
out_h = img_h - k_h + 1
out_w = img_w - k_w + 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
output[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
return output
# Create a more complex test image
np.random.seed(42)
image = np.zeros((10, 10))
image[2:8, 2:8] = 1.0 # White square in center
image[4:6, 4:6] = 0.5 # Gray inner square
# Define common kernels
kernels = {
"Horizontal Edge": np.array([
[-1, -1, -1],
[ 0, 0, 0],
[ 1, 1, 1]
]),
"Vertical Edge": np.array([
[-1, 0, 1],
[-1, 0, 1],
[-1, 0, 1]
]),
"Gaussian Blur (3x3)": np.array([
[1, 2, 1],
[2, 4, 2],
[1, 2, 1]
]) / 16.0,
"Sharpen": np.array([
[ 0, -1, 0],
[-1, 5, -1],
[ 0, -1, 0]
]),
}
# Apply each kernel
for name, kernel in kernels.items():
result = convolve2d(image, kernel)
print(f"{name} kernel:")
print(f" Output shape: {result.shape}")
print(f" Value range: [{result.min():.2f}, {result.max():.2f}]")
print(f" Non-zero elements: {np.count_nonzero(result)}")
print()
flowchart TD
A[Input Image\nH x W] --> B[Extract Patch\nat position i,j]
B --> C[Element-wise Multiply\npatch x kernel]
C --> D[Sum All Elements]
D --> E[Store in Output\nat position i,j]
E --> F{More positions?}
F -->|Yes| B
F -->|No| G[Output Feature Map\nH-kH+1 x W-kW+1]
Stride, Padding, and Output Size
Two parameters control how the kernel traverses the input:
- Stride: How many pixels the kernel moves at each step. Stride 1 moves one pixel at a time; stride 2 skips every other position.
- Padding: Zeros added around the input border. “Same” padding preserves spatial dimensions; “valid” means no padding.
The output size formula is:
$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$
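One corollary worth memorizing: with stride 1 and an odd kernel size, padding (k − 1)/2 yields "same" output size. A quick sanity check of the formula (the `out_size` helper exists just for this sketch):

```python
def out_size(n, k, stride, pad):
    """Convolution output size: floor((n - k + 2*pad) / stride) + 1."""
    return (n - k + 2 * pad) // stride + 1

# With stride 1 and an odd kernel, padding (k - 1) // 2 preserves the input size
for n in (8, 28, 224):
    for k in (1, 3, 5, 7):
        pad = (k - 1) // 2
        assert out_size(n, k, stride=1, pad=pad) == n

# Stride 2 roughly halves the spatial size
print(out_size(28, 3, stride=2, pad=1))  # 14
```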
import numpy as np
def convolve2d_full(image, kernel, stride=1, padding=0):
"""
2D convolution with configurable stride and padding.
Parameters:
image: 2D array (H, W)
kernel: 2D array (kH, kW)
stride: step size for kernel movement
padding: number of zero-padding pixels on each side
Returns:
output: 2D array
"""
# Apply padding
if padding > 0:
image = np.pad(image, padding, mode='constant', constant_values=0)
img_h, img_w = image.shape
k_h, k_w = kernel.shape
# Calculate output dimensions
out_h = (img_h - k_h) // stride + 1
out_w = (img_w - k_w) // stride + 1
output = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
row = i * stride
col = j * stride
patch = image[row:row+k_h, col:col+k_w]
output[i, j] = np.sum(patch * kernel)
return output
def calc_output_size(input_size, kernel_size, stride, padding):
"""Calculate convolution output size."""
return (input_size - kernel_size + 2 * padding) // stride + 1
# Demonstrate different configurations
image = np.random.randn(8, 8)
kernel = np.ones((3, 3)) / 9.0 # Average filter
configs = [
{"stride": 1, "padding": 0, "name": "Valid (no padding, stride 1)"},
{"stride": 1, "padding": 1, "name": "Same (padding 1, stride 1)"},
{"stride": 2, "padding": 0, "name": "Stride 2 (no padding)"},
{"stride": 2, "padding": 1, "name": "Stride 2 (padding 1)"},
]
print("Input shape: 8x8, Kernel: 3x3\n")
print(f"{'Configuration':<32} {'Output Shape':<15} {'Formula'}")
print("-" * 70)
for cfg in configs:
result = convolve2d_full(image, kernel, cfg["stride"], cfg["padding"])
out_size = calc_output_size(8, 3, cfg["stride"], cfg["padding"])
formula = f"(8 - 3 + 2*{cfg['padding']}) / {cfg['stride']} + 1 = {out_size}"
print(f"{cfg['name']:<32} {str(result.shape):<15} {formula}")
Pooling Layers
Pooling reduces the spatial dimensions of feature maps, providing two key benefits:
- Dimensionality reduction: Fewer parameters in subsequent layers
- Translation invariance: Small shifts in the input often produce the same (or nearly the same) pooled output
The two most common pooling operations are max pooling (takes the maximum value in each window) and average pooling (takes the mean).
import numpy as np
def max_pool2d(feature_map, pool_size=2, stride=2):
"""
Max pooling operation.
Parameters:
feature_map: 2D array (H, W)
pool_size: size of the pooling window
stride: step size (typically equals pool_size)
Returns:
pooled: 2D array with reduced dimensions
"""
h, w = feature_map.shape
out_h = (h - pool_size) // stride + 1
out_w = (w - pool_size) // stride + 1
pooled = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
row = i * stride
col = j * stride
window = feature_map[row:row+pool_size, col:col+pool_size]
pooled[i, j] = np.max(window)
return pooled
def avg_pool2d(feature_map, pool_size=2, stride=2):
"""
Average pooling operation.
Parameters:
feature_map: 2D array (H, W)
pool_size: size of the pooling window
stride: step size (typically equals pool_size)
Returns:
pooled: 2D array with reduced dimensions
"""
h, w = feature_map.shape
out_h = (h - pool_size) // stride + 1
out_w = (w - pool_size) // stride + 1
pooled = np.zeros((out_h, out_w))
for i in range(out_h):
for j in range(out_w):
row = i * stride
col = j * stride
window = feature_map[row:row+pool_size, col:col+pool_size]
pooled[i, j] = np.mean(window)
return pooled
# Create a sample feature map
feature_map = np.array([
[1, 3, 2, 4],
[5, 6, 1, 2],
[7, 2, 3, 1],
[4, 8, 5, 6]
], dtype=float)
print("Feature Map (4x4):")
print(feature_map)
# Apply max pooling
max_result = max_pool2d(feature_map, pool_size=2, stride=2)
print(f"\nMax Pooling (2x2, stride 2) -> {max_result.shape}:")
print(max_result)
# Apply average pooling
avg_result = avg_pool2d(feature_map, pool_size=2, stride=2)
print(f"\nAverage Pooling (2x2, stride 2) -> {avg_result.shape}:")
print(avg_result)
# Demonstrate translation invariance
print("\n--- Translation Invariance Demo ---")
original = np.array([
[0, 0, 0, 0, 5, 8],
[0, 0, 0, 0, 7, 3],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]
], dtype=float)
shifted = np.array([
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 5, 8, 0],
[0, 0, 0, 7, 3, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0]
], dtype=float)
pool_orig = max_pool2d(original, pool_size=2, stride=2)
pool_shift = max_pool2d(shifted, pool_size=2, stride=2)
print(f"Original pooled:\n{pool_orig}")
print(f"Shifted pooled:\n{pool_shift}")
print(f"Same top-right value: {pool_orig[0, 2] == pool_shift[0, 2]}")
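A caveat: this invariance is approximate and depends on where the shift lands relative to the pooling grid. A shift that carries a maximum across a window boundary does change the output. A minimal sketch (with a compact reshape-based max pool defined just for this demo):

```python
import numpy as np

def max_pool2d(x, p=2):
    """Non-overlapping p x p max pooling via reshape."""
    h, w = x.shape
    return x[:h - h % p, :w - w % p].reshape(h // p, p, w // p, p).max(axis=(1, 3))

x = np.zeros((4, 4))
x[1, 1] = 1.0                      # maximum inside the top-left 2x2 window
y = np.zeros((4, 4))
y[2, 1] = 1.0                      # shifted down by one: now in a different window

print(max_pool2d(x))
print(max_pool2d(y))
# The 1.0 moves from pooled cell (0, 0) to (1, 0): a one-pixel shift across
# a window boundary changes the pooled output.
```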
Building a Complete CNN from Scratch
Now we combine convolution, activation, pooling, and dense layers into a complete CNN. Each layer implements forward() and backward() methods, enabling end-to-end gradient computation.
import numpy as np
class ConvLayer:
"""Convolutional layer with forward and backward pass."""
def __init__(self, num_filters, kernel_size, input_channels=1):
self.num_filters = num_filters
self.kernel_size = kernel_size
# He initialization (scaled for ReLU: sqrt(2 / fan_in))
scale = np.sqrt(2.0 / (kernel_size * kernel_size * input_channels))
self.filters = np.random.randn(
num_filters, input_channels, kernel_size, kernel_size
) * scale
self.biases = np.zeros(num_filters)
self.input_cache = None
def forward(self, input_data):
"""
Forward pass.
input_data: (channels, H, W) or (H, W) for single channel
"""
if input_data.ndim == 2:
input_data = input_data[np.newaxis, :, :] # Add channel dim
self.input_cache = input_data
channels, h, w = input_data.shape
k = self.kernel_size
out_h = h - k + 1
out_w = w - k + 1
output = np.zeros((self.num_filters, out_h, out_w))
for f in range(self.num_filters):
for i in range(out_h):
for j in range(out_w):
patch = input_data[:, i:i+k, j:j+k]
output[f, i, j] = np.sum(patch * self.filters[f]) + self.biases[f]
return output
def backward(self, d_output, learning_rate):
"""Compute gradients and update filters."""
channels, h, w = self.input_cache.shape
k = self.kernel_size
out_h, out_w = d_output.shape[1], d_output.shape[2]
d_filters = np.zeros_like(self.filters)
d_biases = np.zeros_like(self.biases)
d_input = np.zeros_like(self.input_cache)
for f in range(self.num_filters):
d_biases[f] = np.sum(d_output[f])
for i in range(out_h):
for j in range(out_w):
patch = self.input_cache[:, i:i+k, j:j+k]
d_filters[f] += d_output[f, i, j] * patch
d_input[:, i:i+k, j:j+k] += d_output[f, i, j] * self.filters[f]
self.filters -= learning_rate * d_filters
self.biases -= learning_rate * d_biases
return d_input
class MaxPoolLayer:
"""Max pooling layer with forward and backward pass."""
def __init__(self, pool_size=2):
self.pool_size = pool_size
self.input_cache = None
self.mask_cache = None
def forward(self, input_data):
"""input_data: (channels, H, W)"""
self.input_cache = input_data
c, h, w = input_data.shape
p = self.pool_size
out_h = h // p
out_w = w // p
output = np.zeros((c, out_h, out_w))
self.mask_cache = np.zeros_like(input_data)
for ch in range(c):
for i in range(out_h):
for j in range(out_w):
window = input_data[ch, i*p:(i+1)*p, j*p:(j+1)*p]
max_val = np.max(window)
output[ch, i, j] = max_val
# Store mask for backward pass (note: ties mark every max position)
mask = (window == max_val)
self.mask_cache[ch, i*p:(i+1)*p, j*p:(j+1)*p] = mask
return output
def backward(self, d_output, learning_rate):
"""Route gradients through max positions."""
c, h, w = self.input_cache.shape
p = self.pool_size
out_h, out_w = d_output.shape[1], d_output.shape[2]
d_input = np.zeros_like(self.input_cache)
for ch in range(c):
for i in range(out_h):
for j in range(out_w):
mask = self.mask_cache[ch, i*p:(i+1)*p, j*p:(j+1)*p]
d_input[ch, i*p:(i+1)*p, j*p:(j+1)*p] += d_output[ch, i, j] * mask
return d_input
class ReLULayer:
"""ReLU activation layer."""
def __init__(self):
self.input_cache = None
def forward(self, input_data):
self.input_cache = input_data
return np.maximum(0, input_data)
def backward(self, d_output, learning_rate):
return d_output * (self.input_cache > 0)
class FlattenLayer:
"""Flatten multi-dimensional input to 1D."""
def __init__(self):
self.input_shape = None
def forward(self, input_data):
self.input_shape = input_data.shape
return input_data.flatten()
def backward(self, d_output, learning_rate):
return d_output.reshape(self.input_shape)
class DenseLayer:
"""Fully connected layer."""
def __init__(self, input_size, output_size):
scale = np.sqrt(2.0 / input_size)
self.weights = np.random.randn(input_size, output_size) * scale
self.biases = np.zeros(output_size)
self.input_cache = None
def forward(self, input_data):
self.input_cache = input_data
return input_data @ self.weights + self.biases
def backward(self, d_output, learning_rate):
d_input = d_output @ self.weights.T
d_weights = self.input_cache[:, np.newaxis] @ d_output[np.newaxis, :]
d_biases = d_output
self.weights -= learning_rate * d_weights
self.biases -= learning_rate * d_biases
return d_input
class CNN:
"""Complete CNN combining all layers."""
def __init__(self):
self.layers = [
ConvLayer(num_filters=4, kernel_size=3, input_channels=1),
ReLULayer(),
MaxPoolLayer(pool_size=2),
FlattenLayer(),
DenseLayer(input_size=4 * 3 * 3, output_size=2), # For 8x8 input
]
def forward(self, x):
"""Forward pass through all layers."""
for layer in self.layers:
x = layer.forward(x)
return x
def backward(self, d_loss, learning_rate=0.01):
"""Backward pass through all layers in reverse."""
grad = d_loss
for layer in reversed(self.layers):
grad = layer.backward(grad, learning_rate)
def softmax(self, x):
"""Numerically stable softmax."""
exp_x = np.exp(x - np.max(x))
return exp_x / exp_x.sum()
def cross_entropy_loss(self, probs, label):
"""Cross-entropy loss for single sample."""
return -np.log(probs[label] + 1e-10)
def train_step(self, image, label, learning_rate=0.01):
"""Single training step: forward + backward."""
# Forward
logits = self.forward(image)
probs = self.softmax(logits)
loss = self.cross_entropy_loss(probs, label)
# Compute gradient of loss w.r.t. logits
d_logits = probs.copy()
d_logits[label] -= 1 # Softmax + CE gradient
# Backward
self.backward(d_logits, learning_rate)
return loss, probs
# Quick test
np.random.seed(42)
cnn = CNN()
test_image = np.random.randn(8, 8)
logits = cnn.forward(test_image)
print(f"CNN output (logits): {logits}")
print(f"Probabilities: {cnn.softmax(logits)}")
print(f"Network layers: {len(cnn.layers)}")
print("CNN built successfully!")
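Before trusting a hand-written backward(), it is worth checking it against finite differences. Below is a minimal gradient check on the dense-layer math alone (variables re-defined here so the snippet stands on its own; the same recipe applies to the conv layer):

```python
import numpy as np

np.random.seed(1)

# A tiny dense layer: y = x @ W + b, with loss = sum(y**2) for simplicity
W = np.random.randn(4, 3)
b = np.random.randn(3)
x = np.random.randn(4)

def loss(W):
    y = x @ W + b
    return np.sum(y ** 2)

# Analytic gradient: dL/dW[i, j] = x[i] * 2 * y[j]  (outer product)
y = x @ W + b
d_y = 2 * y
analytic = x[:, None] @ d_y[None, :]

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny, on the order of 1e-9
```

If the maximum discrepancy is not tiny, the analytic backward pass has a bug; this check caught many sign errors while these layers were being written.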
Training the CNN
Let us train our CNN on a synthetic task: classifying 8×8 images as containing either vertical lines or horizontal lines. This is simple enough to train quickly but demonstrates that our convolution layers learn meaningful filters.
import numpy as np
# --- Paste CNN classes from above or assume they are defined ---
# For self-contained execution, here is a minimal re-implementation:
class ConvLayer:
def __init__(self, num_filters, kernel_size, input_channels=1):
self.num_filters = num_filters
self.kernel_size = kernel_size
scale = np.sqrt(2.0 / (kernel_size * kernel_size * input_channels))
self.filters = np.random.randn(num_filters, input_channels, kernel_size, kernel_size) * scale
self.biases = np.zeros(num_filters)
self.input_cache = None
def forward(self, x):
if x.ndim == 2:
x = x[np.newaxis, :, :]
self.input_cache = x
c, h, w = x.shape
k = self.kernel_size
out_h, out_w = h - k + 1, w - k + 1
out = np.zeros((self.num_filters, out_h, out_w))
for f in range(self.num_filters):
for i in range(out_h):
for j in range(out_w):
out[f, i, j] = np.sum(x[:, i:i+k, j:j+k] * self.filters[f]) + self.biases[f]
return out
def backward(self, d_out, lr):
c, h, w = self.input_cache.shape
k = self.kernel_size
oh, ow = d_out.shape[1], d_out.shape[2]
d_f = np.zeros_like(self.filters)
d_input = np.zeros_like(self.input_cache)
for f in range(self.num_filters):
self.biases[f] -= lr * np.sum(d_out[f])
for i in range(oh):
for j in range(ow):
patch = self.input_cache[:, i:i+k, j:j+k]
d_f[f] += d_out[f, i, j] * patch
d_input[:, i:i+k, j:j+k] += d_out[f, i, j] * self.filters[f]
self.filters -= lr * d_f
return d_input
class ReLULayer:
def __init__(self): self.cache = None
def forward(self, x): self.cache = x; return np.maximum(0, x)
def backward(self, d, lr): return d * (self.cache > 0)
class MaxPoolLayer:
def __init__(self, ps=2): self.ps = ps; self.cache = None; self.mask = None
def forward(self, x):
self.cache = x
c, h, w = x.shape
p = self.ps
oh, ow = h // p, w // p
out = np.zeros((c, oh, ow))
self.mask = np.zeros_like(x)
for ch in range(c):
for i in range(oh):
for j in range(ow):
win = x[ch, i*p:(i+1)*p, j*p:(j+1)*p]
mv = np.max(win)
out[ch, i, j] = mv
self.mask[ch, i*p:(i+1)*p, j*p:(j+1)*p] = (win == mv)
return out
def backward(self, d, lr):
c, h, w = self.cache.shape
p = self.ps
oh, ow = d.shape[1], d.shape[2]
di = np.zeros_like(self.cache)
for ch in range(c):
for i in range(oh):
for j in range(ow):
di[ch, i*p:(i+1)*p, j*p:(j+1)*p] += d[ch, i, j] * self.mask[ch, i*p:(i+1)*p, j*p:(j+1)*p]
return di
class FlattenLayer:
def __init__(self): self.shape = None
def forward(self, x): self.shape = x.shape; return x.flatten()
def backward(self, d, lr): return d.reshape(self.shape)
class DenseLayer:
def __init__(self, ni, no):
self.W = np.random.randn(ni, no) * np.sqrt(2.0 / ni)
self.b = np.zeros(no)
self.cache = None
def forward(self, x): self.cache = x; return x @ self.W + self.b
def backward(self, d, lr):
di = d @ self.W.T
self.W -= lr * (self.cache[:, None] @ d[None, :])
self.b -= lr * d
return di
# --- Build CNN and generate data ---
np.random.seed(0)
# Generate synthetic data: vertical (label=0) vs horizontal (label=1) lines
def generate_data(n_samples=200):
images = []
labels = []
for _ in range(n_samples // 2):
# Vertical line
img = np.random.randn(8, 8) * 0.1
col = np.random.randint(1, 7)
img[:, col] = 1.0
images.append(img)
labels.append(0)
# Horizontal line
img = np.random.randn(8, 8) * 0.1
row = np.random.randint(1, 7)
img[row, :] = 1.0
images.append(img)
labels.append(1)
return images, labels
train_images, train_labels = generate_data(200)
test_images, test_labels = generate_data(40)
# Build CNN: Conv(4 filters, 3x3) -> ReLU -> MaxPool(2x2) -> Dense(36->2)
layers = [
ConvLayer(4, 3, 1),
ReLULayer(),
MaxPoolLayer(2),
FlattenLayer(),
DenseLayer(4 * 3 * 3, 2),
]
def softmax(x):
e = np.exp(x - np.max(x))
return e / e.sum()
# Training loop
epochs = 15
lr = 0.005
print("Training CNN on vertical vs horizontal line detection...\n")
for epoch in range(epochs):
total_loss = 0
correct = 0
# Shuffle training data
indices = np.random.permutation(len(train_images))
for idx in indices:
img = train_images[idx]
label = train_labels[idx]
# Forward pass
x = img
for layer in layers:
x = layer.forward(x)
probs = softmax(x)
total_loss += -np.log(probs[label] + 1e-10)
correct += (np.argmax(probs) == label)
# Backward pass
grad = probs.copy()
grad[label] -= 1
for layer in reversed(layers):
grad = layer.backward(grad, lr)
if (epoch + 1) % 3 == 0 or epoch == 0:
acc = correct / len(train_images) * 100
avg_loss = total_loss / len(train_images)
print(f"Epoch {epoch+1:2d} | Loss: {avg_loss:.4f} | Train Acc: {acc:.1f}%")
# Test accuracy
test_correct = 0
for img, label in zip(test_images, test_labels):
x = img
for layer in layers:
x = layer.forward(x)
if np.argmax(x) == label:
test_correct += 1
print(f"\nTest Accuracy: {test_correct / len(test_images) * 100:.1f}%")
# Inspect learned filters
print("\nLearned Conv Filters (4 filters, 3x3):")
conv_layer = layers[0]
for f in range(conv_layer.num_filters):
print(f"\nFilter {f}:")
print(np.round(conv_layer.filters[f, 0], 3))
After training, inspect the learned 3×3 filters. You will typically see filters resembling vertical and horizontal edge detectors — the network discovers the features it needs to solve the task. This is the power of learned representations.
Framework Implementations
Now that you understand exactly what happens inside a convolution layer — the sliding window, element-wise multiplication, gradient routing through max-pool masks — you can appreciate why frameworks exist. Our 100-line implementation works but is orders of magnitude slower than optimized CUDA kernels.
In PyTorch, the same building blocks appear as nn.Conv2d, typically combined with transfer learning on pretrained backbones such as ResNet and with data augmentation pipelines. For tf.keras.layers.Conv2D, see TensorFlow Mastery Part 6: CNNs & Vision. That article covers the Sequential API, callbacks, and TensorFlow-specific optimizations.
Here is a direct comparison — what took us ~30 lines to implement from scratch becomes a single line per layer in PyTorch:
import numpy as np
# What we built from scratch (simplified):
# - Manual loop over output positions
# - Explicit gradient computation
# - No GPU support
# Total: ~30 lines for ConvLayer alone
# PyTorch equivalent (conceptual - requires torch installed):
# import torch
# import torch.nn as nn
#
# model = nn.Sequential(
# nn.Conv2d(1, 16, kernel_size=3, padding=1), # 1 line vs 30
# nn.ReLU(),
# nn.MaxPool2d(2),
# nn.Flatten(),
# nn.Linear(16 * 4 * 4, 2),
# )
#
# # Training is also simplified:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# loss_fn = nn.CrossEntropyLoss()
#
# for images, labels in dataloader:
# loss = loss_fn(model(images), labels)
# loss.backward() # Automatic gradient computation!
# optimizer.step()
# optimizer.zero_grad()
# Key differences:
differences = {
"Speed": "Framework: GPU-accelerated CUDA kernels. Ours: pure Python loops.",
"Gradients": "Framework: automatic differentiation. Ours: manual backward().",
"Optimization": "Framework: Adam, SGD, etc. built-in. Ours: vanilla SGD only.",
"Batching": "Framework: handles batches natively. Ours: single sample at a time.",
"Memory": "Framework: optimized memory management. Ours: stores full input cache.",
}
print("Why use frameworks for production:")
print("=" * 50)
for aspect, explanation in differences.items():
print(f"\n{aspect}:")
print(f" {explanation}")
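Much of the framework speed gap comes from replacing Python loops with one large matrix operation (the im2col idea). Even in pure NumPy, sliding_window_view plus einsum reproduces our loop result while avoiding the inner loops entirely (requires NumPy 1.20+; both helpers are defined just for this comparison):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_loop(image, kernel):
    """Reference loop implementation (valid cross-correlation)."""
    k_h, k_w = kernel.shape
    out = np.zeros((image.shape[0] - k_h + 1, image.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

def conv_vectorized(image, kernel):
    """Same result with one einsum over all patches at once."""
    patches = sliding_window_view(image, kernel.shape)  # (oH, oW, kH, kW)
    return np.einsum('ijmn,mn->ij', patches, kernel)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))

print(np.allclose(conv_loop(image, kernel), conv_vectorized(image, kernel)))  # True
```

Framework implementations go much further (blocked memory layouts, fused kernels, GPU execution), but this is the core trick: turn many small patch products into one big, cache-friendly contraction.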
What’s Next
In this article we built convolutions, pooling, and a complete CNN from the ground up. You now understand:
- How spatial weight sharing cuts parameter counts by orders of magnitude (628× in our 28×28 example)
- The mechanics of kernel sliding, stride, and padding
- How max pooling creates translation invariance
- Full forward and backward passes through convolutional layers
- Why frameworks matter for production but understanding matters for mastery
Next in the Series
In Part 7: RNNs & LSTMs Deep Dive, we tackle sequential data. You will implement recurrent neural networks and Long Short-Term Memory cells from scratch, understanding how gating mechanisms solve the vanishing gradient problem for sequences like text and time series.