Table of Contents

  1. Foundations
  2. Mechanics
  3. Implementation
  4. Frameworks
  5. What’s Next

Part 6: CNNs Deep Dive — Convolutions from Scratch

May 3, 2026 · Wasil Zafar · 40 min read

Implement convolution operations, pooling layers, and a complete CNN from scratch with NumPy — understand every matrix operation before touching any framework.

Why CNNs Work for Images

Fully connected networks treat every pixel as an independent feature. For a modest 28×28 grayscale image, that means 784 input neurons; connecting them to just 128 hidden neurons already gives over 100,000 parameters in the first layer alone. Scale that to a 224×224 RGB image with 256 hidden neurons and the parameter count explodes past 38 million in a single layer.

Key Insight: Images have spatial structure. A horizontal edge in the top-left corner is the same feature as a horizontal edge in the bottom-right. CNNs exploit this by sharing weights across spatial locations, drastically reducing parameters.

CNNs leverage three fundamental principles:

  • Local connectivity: Each neuron connects only to a small patch of the input
  • Weight sharing: The same filter is applied everywhere in the image
  • Translation equivariance: If an object moves, the feature map response moves correspondingly
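
The third principle can be seen directly in code. Here is a minimal sketch (using the same loop-based convolution we develop later in this article): shifting a bright pixel by two positions shifts the feature-map response by exactly the same offset.

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid cross-correlation with explicit loops."""
    k_h, k_w = kernel.shape
    out = np.zeros((image.shape[0] - k_h + 1, image.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

# A single bright pixel, and the same pixel shifted by (2, 2)
img = np.zeros((7, 7)); img[2, 2] = 1.0
shifted = np.zeros((7, 7)); shifted[4, 4] = 1.0

kernel = np.arange(9, dtype=float).reshape(3, 3)

r1 = convolve2d(img, kernel)
r2 = convolve2d(shifted, kernel)

# The response pattern is identical, just moved by the same (2, 2) offset
print(np.allclose(r1[:-2, :-2], r2[2:, 2:]))  # True
```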

Let us quantify the parameter savings:

import numpy as np

# Compare parameter counts: Fully Connected vs CNN
# Input: 28x28 grayscale image (like MNIST)
input_size = 28 * 28  # 784 pixels

# Fully Connected Network: first hidden layer with 128 neurons
fc_params = input_size * 128 + 128  # weights + biases
print(f"Fully Connected Layer: {fc_params:,} parameters")
# 784 * 128 + 128 = 100,480 parameters

# CNN: 16 filters of size 3x3, single input channel
# Each filter: 3 * 3 * 1 = 9 weights + 1 bias = 10 params
cnn_filters = 16
kernel_size = 3
channels_in = 1
cnn_params = cnn_filters * (kernel_size * kernel_size * channels_in + 1)
print(f"Conv Layer (16 filters, 3x3): {cnn_params:,} parameters")
# 16 * (9 + 1) = 160 parameters

# Reduction factor
reduction = fc_params / cnn_params
print(f"Parameter reduction: {reduction:.0f}x fewer parameters")
# ~628x fewer parameters!

# For RGB 224x224 image
input_rgb = 224 * 224 * 3  # 150,528
fc_rgb_params = input_rgb * 256 + 256
cnn_rgb_params = 64 * (3 * 3 * 3 + 1)  # 64 filters, 3x3, 3 channels
print(f"\nRGB 224x224 - FC layer: {fc_rgb_params:,} parameters")
print(f"RGB 224x224 - Conv layer: {cnn_rgb_params:,} parameters")
print(f"Reduction: {fc_rgb_params / cnn_rgb_params:.0f}x")
CNN Architecture Overview
flowchart LR
    A[Input Image\n28x28x1] --> B[Conv Layer\n3x3, 16 filters]
    B --> C[ReLU]
    C --> D[Max Pool\n2x2]
    D --> E[Conv Layer\n3x3, 32 filters]
    E --> F[ReLU]
    F --> G[Max Pool\n2x2]
    G --> H[Flatten]
    H --> I[Dense\n128 units]
    I --> J[Output\n10 classes]
                            

Understanding Convolution Operations

A convolution operation slides a small matrix (the kernel or filter) over the input and computes element-wise multiplication followed by summation at each position. Mathematically:

$$(I * K)(i,j) = \sum_m \sum_n I(i+m, j+n) \cdot K(m,n)$$

where $I$ is the input, $K$ is the kernel, and $(i,j)$ is the output position. The kernel has a fixed size (commonly $3 \times 3$ or $5 \times 5$) and “slides” across the entire input.
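
To make the double sum concrete, here is a single output position computed term by term and checked against NumPy (the values of $I$ and $K$ are arbitrary illustrations):

```python
import numpy as np

# Tiny input and kernel, purely for illustration
I = np.array([[1., 2., 0.],
              [0., 1., 3.],
              [2., 1., 0.]])
K = np.array([[ 1., 0.],
              [-1., 2.]])

# (I * K)(0, 0) = I[0,0]*K[0,0] + I[0,1]*K[0,1] + I[1,0]*K[1,0] + I[1,1]*K[1,1]
manual = 1*1 + 2*0 + 0*(-1) + 1*2   # = 3
via_numpy = np.sum(I[0:2, 0:2] * K)  # same thing as one vectorized expression
print(manual, via_numpy)  # 3 3.0
```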

Implementation: 2D Convolution from Scratch

We implement cross-correlation (what deep learning frameworks call “convolution”) using nothing beyond basic NumPy array operations. True mathematical convolution flips the kernel 180° before sliding it; in practice, frameworks use cross-correlation, because a network that learns its filters can simply learn the flipped version.
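
The flip distinction is easy to verify. A short sketch (the helper names here are ours, not a library API): true convolution is just cross-correlation with a 180°-flipped kernel, so the two differ only for asymmetric kernels.

```python
import numpy as np

def cross_correlate(image, kernel):
    """The "convolution" used throughout this article."""
    k_h, k_w = kernel.shape
    out = np.zeros((image.shape[0] - k_h + 1, image.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

def true_convolve(image, kernel):
    """Mathematical convolution = cross-correlation with a flipped kernel."""
    return cross_correlate(image, np.flip(kernel))

np.random.seed(0)
img = np.random.randn(6, 6)
asymmetric = np.array([[1., 2., 3.],
                       [0., 0., 0.],
                       [0., 0., 0.]])

# The two operations differ for an asymmetric kernel...
print(np.allclose(cross_correlate(img, asymmetric),
                  true_convolve(img, asymmetric)))  # False
# ...but since the network learns the kernel values anyway,
# the distinction is irrelevant in practice.
```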

import numpy as np

def convolve2d(image, kernel):
    """
    Perform 2D convolution (cross-correlation) from scratch.
    
    Parameters:
        image: 2D numpy array (H, W)
        kernel: 2D numpy array (kH, kW)
    
    Returns:
        output: 2D numpy array
    """
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    
    # Output dimensions (valid convolution - no padding)
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1
    
    # Initialize output
    output = np.zeros((out_h, out_w))
    
    # Slide kernel over input
    for i in range(out_h):
        for j in range(out_w):
            # Extract the patch
            patch = image[i:i+k_h, j:j+k_w]
            # Element-wise multiply and sum
            output[i, j] = np.sum(patch * kernel)
    
    return output

# Create a sample 8x8 image with a vertical edge
image = np.zeros((8, 8))
image[:, 4:] = 1.0  # Right half is white
print("Input image (vertical edge at column 4):")
print(image)

# Vertical edge detection kernel
kernel_vertical = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1]
])

# Apply convolution
result = convolve2d(image, kernel_vertical)
print("\nVertical edge detection result:")
print(result)
print(f"\nOutput shape: {result.shape}")

Different kernels detect different features. Here are the most common ones used in image processing:

import numpy as np

def convolve2d(image, kernel):
    """Perform 2D convolution (cross-correlation)."""
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    out_h = img_h - k_h + 1
    out_w = img_w - k_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return output

# Create a more complex test image
np.random.seed(42)
image = np.zeros((10, 10))
image[2:8, 2:8] = 1.0  # White square in center
image[4:6, 4:6] = 0.5  # Gray inner square

# Define common kernels
kernels = {
    "Horizontal Edge": np.array([
        [-1, -1, -1],
        [ 0,  0,  0],
        [ 1,  1,  1]
    ]),
    "Vertical Edge": np.array([
        [-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]
    ]),
    "Gaussian Blur (3x3)": np.array([
        [1, 2, 1],
        [2, 4, 2],
        [1, 2, 1]
    ]) / 16.0,
    "Sharpen": np.array([
        [ 0, -1,  0],
        [-1,  5, -1],
        [ 0, -1,  0]
    ]),
}

# Apply each kernel
for name, kernel in kernels.items():
    result = convolve2d(image, kernel)
    print(f"{name} kernel:")
    print(f"  Output shape: {result.shape}")
    print(f"  Value range: [{result.min():.2f}, {result.max():.2f}]")
    print(f"  Non-zero elements: {np.count_nonzero(result)}")
    print()
Convolution Operation Flow
flowchart TD
    A[Input Image\nH x W] --> B[Extract Patch\nat position i,j]
    B --> C[Element-wise Multiply\npatch x kernel]
    C --> D[Sum All Elements]
    D --> E[Store in Output\nat position i,j]
    E --> F{More positions?}
    F -->|Yes| B
    F -->|No| G[Output Feature Map\nH-kH+1 x W-kW+1]
                            

Stride, Padding, and Output Size

Two parameters control how the kernel traverses the input:

  • Stride: How many pixels the kernel moves at each step. Stride 1 moves one pixel at a time; stride 2 skips every other position.
  • Padding: Zeros added around the input border. “Same” padding preserves spatial dimensions; “valid” means no padding.

The output size formula is:

$$\text{output\_size} = \left\lfloor \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1$$

import numpy as np

def convolve2d_full(image, kernel, stride=1, padding=0):
    """
    2D convolution with configurable stride and padding.
    
    Parameters:
        image: 2D array (H, W)
        kernel: 2D array (kH, kW)
        stride: step size for kernel movement
        padding: number of zero-padding pixels on each side
    
    Returns:
        output: 2D array
    """
    # Apply padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)
    
    img_h, img_w = image.shape
    k_h, k_w = kernel.shape
    
    # Calculate output dimensions
    out_h = (img_h - k_h) // stride + 1
    out_w = (img_w - k_w) // stride + 1
    
    output = np.zeros((out_h, out_w))
    
    for i in range(out_h):
        for j in range(out_w):
            row = i * stride
            col = j * stride
            patch = image[row:row+k_h, col:col+k_w]
            output[i, j] = np.sum(patch * kernel)
    
    return output

def calc_output_size(input_size, kernel_size, stride, padding):
    """Calculate convolution output size."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Demonstrate different configurations
image = np.random.randn(8, 8)
kernel = np.ones((3, 3)) / 9.0  # Average filter

configs = [
    {"stride": 1, "padding": 0, "name": "Valid (no padding, stride 1)"},
    {"stride": 1, "padding": 1, "name": "Same (padding 1, stride 1)"},
    {"stride": 2, "padding": 0, "name": "Stride 2 (no padding)"},
    {"stride": 2, "padding": 1, "name": "Stride 2 (padding 1)"},
]

print("Input shape: 8x8, Kernel: 3x3\n")
print(f"{'Configuration':<32} {'Output Shape':<15} {'Formula'}")
print("-" * 70)

for cfg in configs:
    result = convolve2d_full(image, kernel, cfg["stride"], cfg["padding"])
    out_size = calc_output_size(8, 3, cfg["stride"], cfg["padding"])
    formula = f"(8 - 3 + 2*{cfg['padding']}) / {cfg['stride']} + 1 = {out_size}"
    print(f"{cfg['name']:<32} {str(result.shape):<15} {formula}")
Same Padding Rule: To preserve spatial dimensions with stride 1 and an odd kernel size, use padding = (kernel_size - 1) / 2. For a 3×3 kernel, that means padding = 1. For a 5×5 kernel, padding = 2.
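
Plugging the rule into the output-size formula confirms it for the common odd kernel sizes; a quick sketch:

```python
def calc_output_size(input_size, kernel_size, stride, padding):
    """Standard convolution output-size formula."""
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Same-padding rule: padding = (kernel_size - 1) // 2 preserves size at stride 1
for k in (3, 5, 7):
    pad = (k - 1) // 2
    out = calc_output_size(28, k, stride=1, padding=pad)
    print(f"kernel {k}x{k}: padding {pad} -> output {out}")  # always 28
```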

Pooling Layers

Pooling reduces the spatial dimensions of feature maps, providing two key benefits:

  • Dimensionality reduction: Fewer parameters in subsequent layers
  • Translation invariance: Small shifts in input produce the same pooled output

The two most common pooling operations are max pooling (takes the maximum value in each window) and average pooling (takes the mean).

import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """
    Max pooling operation.
    
    Parameters:
        feature_map: 2D array (H, W)
        pool_size: size of the pooling window
        stride: step size (typically equals pool_size)
    
    Returns:
        pooled: 2D array with reduced dimensions
    """
    h, w = feature_map.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    
    pooled = np.zeros((out_h, out_w))
    
    for i in range(out_h):
        for j in range(out_w):
            row = i * stride
            col = j * stride
            window = feature_map[row:row+pool_size, col:col+pool_size]
            pooled[i, j] = np.max(window)
    
    return pooled

def avg_pool2d(feature_map, pool_size=2, stride=2):
    """
    Average pooling operation.
    
    Parameters:
        feature_map: 2D array (H, W)
        pool_size: size of the pooling window
        stride: step size (typically equals pool_size)
    
    Returns:
        pooled: 2D array with reduced dimensions
    """
    h, w = feature_map.shape
    out_h = (h - pool_size) // stride + 1
    out_w = (w - pool_size) // stride + 1
    
    pooled = np.zeros((out_h, out_w))
    
    for i in range(out_h):
        for j in range(out_w):
            row = i * stride
            col = j * stride
            window = feature_map[row:row+pool_size, col:col+pool_size]
            pooled[i, j] = np.mean(window)
    
    return pooled

# Create a sample feature map
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 3, 1],
    [4, 8, 5, 6]
], dtype=float)

print("Feature Map (4x4):")
print(feature_map)

# Apply max pooling
max_result = max_pool2d(feature_map, pool_size=2, stride=2)
print(f"\nMax Pooling (2x2, stride 2) -> {max_result.shape}:")
print(max_result)

# Apply average pooling
avg_result = avg_pool2d(feature_map, pool_size=2, stride=2)
print(f"\nAverage Pooling (2x2, stride 2) -> {avg_result.shape}:")
print(avg_result)

# Demonstrate translation invariance
print("\n--- Translation Invariance Demo ---")
original = np.array([
    [0, 0, 0, 0, 5, 8],
    [0, 0, 0, 0, 7, 3],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0]
], dtype=float)

shifted = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 5, 8, 0],
    [0, 0, 0, 7, 3, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0]
], dtype=float)

pool_orig = max_pool2d(original, pool_size=2, stride=2)
pool_shift = max_pool2d(shifted, pool_size=2, stride=2)

print(f"Original pooled:\n{pool_orig}")
print(f"Shifted pooled:\n{pool_shift}")
print(f"Same top-right value: {pool_orig[0, 2] == pool_shift[0, 2]}")

Building a Complete CNN from Scratch

Now we combine convolution, activation, pooling, and dense layers into a complete CNN. Each layer implements forward() and backward() methods, enabling end-to-end gradient computation.

import numpy as np

class ConvLayer:
    """Convolutional layer with forward and backward pass."""
    
    def __init__(self, num_filters, kernel_size, input_channels=1):
        self.num_filters = num_filters
        self.kernel_size = kernel_size
        # He initialization (sqrt(2 / fan_in), suited to ReLU)
        scale = np.sqrt(2.0 / (kernel_size * kernel_size * input_channels))
        self.filters = np.random.randn(
            num_filters, input_channels, kernel_size, kernel_size
        ) * scale
        self.biases = np.zeros(num_filters)
        self.input_cache = None
    
    def forward(self, input_data):
        """
        Forward pass.
        input_data: (channels, H, W) or (H, W) for single channel
        """
        if input_data.ndim == 2:
            input_data = input_data[np.newaxis, :, :]  # Add channel dim
        
        self.input_cache = input_data
        channels, h, w = input_data.shape
        k = self.kernel_size
        out_h = h - k + 1
        out_w = w - k + 1
        
        output = np.zeros((self.num_filters, out_h, out_w))
        
        for f in range(self.num_filters):
            for i in range(out_h):
                for j in range(out_w):
                    patch = input_data[:, i:i+k, j:j+k]
                    output[f, i, j] = np.sum(patch * self.filters[f]) + self.biases[f]
        
        return output
    
    def backward(self, d_output, learning_rate):
        """Compute gradients and update filters."""
        channels, h, w = self.input_cache.shape
        k = self.kernel_size
        out_h, out_w = d_output.shape[1], d_output.shape[2]
        
        d_filters = np.zeros_like(self.filters)
        d_biases = np.zeros_like(self.biases)
        d_input = np.zeros_like(self.input_cache)
        
        for f in range(self.num_filters):
            d_biases[f] = np.sum(d_output[f])
            for i in range(out_h):
                for j in range(out_w):
                    patch = self.input_cache[:, i:i+k, j:j+k]
                    d_filters[f] += d_output[f, i, j] * patch
                    d_input[:, i:i+k, j:j+k] += d_output[f, i, j] * self.filters[f]
        
        self.filters -= learning_rate * d_filters
        self.biases -= learning_rate * d_biases
        return d_input


class MaxPoolLayer:
    """Max pooling layer with forward and backward pass."""
    
    def __init__(self, pool_size=2):
        self.pool_size = pool_size
        self.input_cache = None
        self.mask_cache = None
    
    def forward(self, input_data):
        """input_data: (channels, H, W)"""
        self.input_cache = input_data
        c, h, w = input_data.shape
        p = self.pool_size
        out_h = h // p
        out_w = w // p
        
        output = np.zeros((c, out_h, out_w))
        self.mask_cache = np.zeros_like(input_data)
        
        for ch in range(c):
            for i in range(out_h):
                for j in range(out_w):
                    window = input_data[ch, i*p:(i+1)*p, j*p:(j+1)*p]
                    max_val = np.max(window)
                    output[ch, i, j] = max_val
                    # Store mask for backward pass
                    mask = (window == max_val)
                    self.mask_cache[ch, i*p:(i+1)*p, j*p:(j+1)*p] = mask
        
        return output
    
    def backward(self, d_output, learning_rate):
        """Route gradients through max positions."""
        c, h, w = self.input_cache.shape
        p = self.pool_size
        out_h, out_w = d_output.shape[1], d_output.shape[2]
        
        d_input = np.zeros_like(self.input_cache)
        
        for ch in range(c):
            for i in range(out_h):
                for j in range(out_w):
                    mask = self.mask_cache[ch, i*p:(i+1)*p, j*p:(j+1)*p]
                    d_input[ch, i*p:(i+1)*p, j*p:(j+1)*p] += d_output[ch, i, j] * mask
        
        return d_input


class ReLULayer:
    """ReLU activation layer."""
    
    def __init__(self):
        self.input_cache = None
    
    def forward(self, input_data):
        self.input_cache = input_data
        return np.maximum(0, input_data)
    
    def backward(self, d_output, learning_rate):
        return d_output * (self.input_cache > 0)


class FlattenLayer:
    """Flatten multi-dimensional input to 1D."""
    
    def __init__(self):
        self.input_shape = None
    
    def forward(self, input_data):
        self.input_shape = input_data.shape
        return input_data.flatten()
    
    def backward(self, d_output, learning_rate):
        return d_output.reshape(self.input_shape)


class DenseLayer:
    """Fully connected layer."""
    
    def __init__(self, input_size, output_size):
        scale = np.sqrt(2.0 / input_size)
        self.weights = np.random.randn(input_size, output_size) * scale
        self.biases = np.zeros(output_size)
        self.input_cache = None
    
    def forward(self, input_data):
        self.input_cache = input_data
        return input_data @ self.weights + self.biases
    
    def backward(self, d_output, learning_rate):
        d_input = d_output @ self.weights.T
        d_weights = self.input_cache[:, np.newaxis] @ d_output[np.newaxis, :]
        d_biases = d_output
        
        self.weights -= learning_rate * d_weights
        self.biases -= learning_rate * d_biases
        return d_input


class CNN:
    """Complete CNN combining all layers."""
    
    def __init__(self):
        self.layers = [
            ConvLayer(num_filters=4, kernel_size=3, input_channels=1),
            ReLULayer(),
            MaxPoolLayer(pool_size=2),
            FlattenLayer(),
            DenseLayer(input_size=4 * 3 * 3, output_size=2),  # For 8x8 input
        ]
    
    def forward(self, x):
        """Forward pass through all layers."""
        for layer in self.layers:
            x = layer.forward(x)
        return x
    
    def backward(self, d_loss, learning_rate=0.01):
        """Backward pass through all layers in reverse."""
        grad = d_loss
        for layer in reversed(self.layers):
            grad = layer.backward(grad, learning_rate)
    
    def softmax(self, x):
        """Numerically stable softmax."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()
    
    def cross_entropy_loss(self, probs, label):
        """Cross-entropy loss for single sample."""
        return -np.log(probs[label] + 1e-10)
    
    def train_step(self, image, label, learning_rate=0.01):
        """Single training step: forward + backward."""
        # Forward
        logits = self.forward(image)
        probs = self.softmax(logits)
        loss = self.cross_entropy_loss(probs, label)
        
        # Compute gradient of loss w.r.t. logits
        d_logits = probs.copy()
        d_logits[label] -= 1  # Softmax + CE gradient
        
        # Backward
        self.backward(d_logits, learning_rate)
        
        return loss, probs


# Quick test
np.random.seed(42)
cnn = CNN()
test_image = np.random.randn(8, 8)
logits = cnn.forward(test_image)
print(f"CNN output (logits): {logits}")
print(f"Probabilities: {cnn.softmax(logits)}")
print(f"Network layers: {len(cnn.layers)}")
print("CNN built successfully!")
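
Before training, it is worth validating the hand-derived gradients. Below is a sketch of a finite-difference check, using a compact single-channel, single-filter version of the conv layer's forward pass and filter gradient: the analytic gradient should agree with the numerical estimate to within floating-point noise.

```python
import numpy as np

def conv_forward(x, f):
    """Valid cross-correlation, single channel, single filter."""
    k = f.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+k, j:j+k] * f)
    return out

def conv_backward_dfilter(x, f, d_out):
    """Filter gradient, same rule as ConvLayer.backward above."""
    k = f.shape[0]
    d_f = np.zeros_like(f)
    for i in range(d_out.shape[0]):
        for j in range(d_out.shape[1]):
            d_f += d_out[i, j] * x[i:i+k, j:j+k]
    return d_f

np.random.seed(1)
x = np.random.randn(6, 6)
f = np.random.randn(3, 3)
d_out = np.random.randn(4, 4)  # stand-in for the upstream gradient

analytic = conv_backward_dfilter(x, f, d_out)

# Numerical gradient of loss = sum(conv_forward(x, f) * d_out)
# via central differences on each filter weight
eps = 1e-6
numeric = np.zeros_like(f)
for a in range(3):
    for b in range(3):
        f_plus = f.copy();  f_plus[a, b] += eps
        f_minus = f.copy(); f_minus[a, b] -= eps
        numeric[a, b] = (np.sum(conv_forward(x, f_plus) * d_out)
                         - np.sum(conv_forward(x, f_minus) * d_out)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))  # tiny
```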

Training the CNN

Let us train our CNN on a synthetic task: classifying 8×8 images as containing either vertical lines or horizontal lines. This is simple enough to train quickly but demonstrates that our convolution layers learn meaningful filters.

import numpy as np

# --- Paste CNN classes from above or assume they are defined ---
# For self-contained execution, here is a minimal re-implementation:

class ConvLayer:
    def __init__(self, num_filters, kernel_size, input_channels=1):
        self.num_filters = num_filters
        self.kernel_size = kernel_size
        scale = np.sqrt(2.0 / (kernel_size * kernel_size * input_channels))
        self.filters = np.random.randn(num_filters, input_channels, kernel_size, kernel_size) * scale
        self.biases = np.zeros(num_filters)
        self.input_cache = None

    def forward(self, x):
        if x.ndim == 2:
            x = x[np.newaxis, :, :]
        self.input_cache = x
        c, h, w = x.shape
        k = self.kernel_size
        out_h, out_w = h - k + 1, w - k + 1
        out = np.zeros((self.num_filters, out_h, out_w))
        for f in range(self.num_filters):
            for i in range(out_h):
                for j in range(out_w):
                    out[f, i, j] = np.sum(x[:, i:i+k, j:j+k] * self.filters[f]) + self.biases[f]
        return out

    def backward(self, d_out, lr):
        c, h, w = self.input_cache.shape
        k = self.kernel_size
        oh, ow = d_out.shape[1], d_out.shape[2]
        d_f = np.zeros_like(self.filters)
        d_input = np.zeros_like(self.input_cache)
        for f in range(self.num_filters):
            self.biases[f] -= lr * np.sum(d_out[f])
            for i in range(oh):
                for j in range(ow):
                    patch = self.input_cache[:, i:i+k, j:j+k]
                    d_f[f] += d_out[f, i, j] * patch
                    d_input[:, i:i+k, j:j+k] += d_out[f, i, j] * self.filters[f]
        self.filters -= lr * d_f
        return d_input

class ReLULayer:
    def __init__(self): self.cache = None
    def forward(self, x): self.cache = x; return np.maximum(0, x)
    def backward(self, d, lr): return d * (self.cache > 0)

class MaxPoolLayer:
    def __init__(self, ps=2): self.ps = ps; self.cache = None; self.mask = None
    def forward(self, x):
        self.cache = x
        c, h, w = x.shape
        p = self.ps
        oh, ow = h // p, w // p
        out = np.zeros((c, oh, ow))
        self.mask = np.zeros_like(x)
        for ch in range(c):
            for i in range(oh):
                for j in range(ow):
                    win = x[ch, i*p:(i+1)*p, j*p:(j+1)*p]
                    mv = np.max(win)
                    out[ch, i, j] = mv
                    self.mask[ch, i*p:(i+1)*p, j*p:(j+1)*p] = (win == mv)
        return out
    def backward(self, d, lr):
        c, h, w = self.cache.shape
        p = self.ps
        oh, ow = d.shape[1], d.shape[2]
        di = np.zeros_like(self.cache)
        for ch in range(c):
            for i in range(oh):
                for j in range(ow):
                    di[ch, i*p:(i+1)*p, j*p:(j+1)*p] += d[ch, i, j] * self.mask[ch, i*p:(i+1)*p, j*p:(j+1)*p]
        return di

class FlattenLayer:
    def __init__(self): self.shape = None
    def forward(self, x): self.shape = x.shape; return x.flatten()
    def backward(self, d, lr): return d.reshape(self.shape)

class DenseLayer:
    def __init__(self, ni, no):
        self.W = np.random.randn(ni, no) * np.sqrt(2.0 / ni)
        self.b = np.zeros(no)
        self.cache = None
    def forward(self, x): self.cache = x; return x @ self.W + self.b
    def backward(self, d, lr):
        di = d @ self.W.T
        self.W -= lr * (self.cache[:, None] @ d[None, :])
        self.b -= lr * d
        return di

# --- Build CNN and generate data ---
np.random.seed(0)

# Generate synthetic data: vertical (label=0) vs horizontal (label=1) lines
def generate_data(n_samples=200):
    images = []
    labels = []
    for _ in range(n_samples // 2):
        # Vertical line
        img = np.random.randn(8, 8) * 0.1
        col = np.random.randint(1, 7)
        img[:, col] = 1.0
        images.append(img)
        labels.append(0)
        # Horizontal line
        img = np.random.randn(8, 8) * 0.1
        row = np.random.randint(1, 7)
        img[row, :] = 1.0
        images.append(img)
        labels.append(1)
    return images, labels

train_images, train_labels = generate_data(200)
test_images, test_labels = generate_data(40)

# Build CNN: Conv(4 filters, 3x3) -> ReLU -> MaxPool(2x2) -> Dense(36->2)
layers = [
    ConvLayer(4, 3, 1),
    ReLULayer(),
    MaxPoolLayer(2),
    FlattenLayer(),
    DenseLayer(4 * 3 * 3, 2),
]

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Training loop
epochs = 15
lr = 0.005
print("Training CNN on vertical vs horizontal line detection...\n")

for epoch in range(epochs):
    total_loss = 0
    correct = 0
    
    # Shuffle training data
    indices = np.random.permutation(len(train_images))
    
    for idx in indices:
        img = train_images[idx]
        label = train_labels[idx]
        
        # Forward pass
        x = img
        for layer in layers:
            x = layer.forward(x)
        
        probs = softmax(x)
        total_loss += -np.log(probs[label] + 1e-10)
        correct += (np.argmax(probs) == label)
        
        # Backward pass
        grad = probs.copy()
        grad[label] -= 1
        for layer in reversed(layers):
            grad = layer.backward(grad, lr)
    
    if (epoch + 1) % 3 == 0 or epoch == 0:
        acc = correct / len(train_images) * 100
        avg_loss = total_loss / len(train_images)
        print(f"Epoch {epoch+1:2d} | Loss: {avg_loss:.4f} | Train Acc: {acc:.1f}%")

# Test accuracy
test_correct = 0
for img, label in zip(test_images, test_labels):
    x = img
    for layer in layers:
        x = layer.forward(x)
    if np.argmax(x) == label:
        test_correct += 1

print(f"\nTest Accuracy: {test_correct / len(test_images) * 100:.1f}%")

# Inspect learned filters
print("\nLearned Conv Filters (4 filters, 3x3):")
conv_layer = layers[0]
for f in range(conv_layer.num_filters):
    print(f"\nFilter {f}:")
    print(np.round(conv_layer.filters[f, 0], 3))
Observation: What the Filters Learn

After training, inspect the learned 3×3 filters. You will typically see filters resembling vertical and horizontal edge detectors — the network discovers the features it needs to solve the task. This is the power of learned representations.


Framework Implementations

Now that you understand exactly what happens inside a convolution layer — the sliding window, element-wise multiplication, gradient routing through max-pool masks — you can appreciate why frameworks exist. Our 100-line implementation works but is orders of magnitude slower than optimized CUDA kernels.
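
Even without CUDA, you can see the gap by vectorizing our loops. A sketch (assuming NumPy ≥ 1.20 for `sliding_window_view`): extracting every patch as a strided view and reducing them all with one `einsum` call is the same idea, im2col, that frameworks use to turn convolution into a single matrix multiply.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_naive(image, kernel):
    """The loop-based version we built in this article."""
    k_h, k_w = kernel.shape
    out = np.zeros((image.shape[0] - k_h + 1, image.shape[1] - k_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k_h, j:j+k_w] * kernel)
    return out

def conv_vectorized(image, kernel):
    """All patches as a strided view, then one einsum reduction."""
    patches = sliding_window_view(image, kernel.shape)  # (oh, ow, kH, kW), no copy
    return np.einsum('ijkl,kl->ij', patches, kernel)

np.random.seed(0)
img = np.random.randn(64, 64)
ker = np.random.randn(3, 3)

out_naive = conv_naive(img, ker)
out_vec = conv_vectorized(img, ker)
print(np.allclose(out_naive, out_vec))  # True
```

The vectorized version already runs orders of magnitude faster in pure NumPy; frameworks push the same trick down into GPU kernels.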

PyTorch Implementation: For production CNN training with GPU acceleration, automatic differentiation, and pre-trained models, see PyTorch Mastery Part 5: CNNs & Computer Vision. That article covers nn.Conv2d, transfer learning with ResNet, and data augmentation pipelines.
TensorFlow Implementation: For the Keras approach to CNNs with tf.keras.layers.Conv2D, see TensorFlow Mastery Part 6: CNNs & Vision. That article covers the Sequential API, callbacks, and TensorFlow-specific optimizations.

Here is a direct comparison: the convolution layer that took us roughly 30 lines to implement from scratch becomes a single line in PyTorch:

import numpy as np

# What we built from scratch (simplified):
# - Manual loop over output positions
# - Explicit gradient computation
# - No GPU support
# Total: ~30 lines for ConvLayer alone

# PyTorch equivalent (conceptual - requires torch installed):
# import torch
# import torch.nn as nn
#
# model = nn.Sequential(
#     nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 line vs 30
#     nn.ReLU(),
#     nn.MaxPool2d(2),
#     nn.Flatten(),
#     nn.Linear(16 * 4 * 4, 2),
# )
#
# # Training is also simplified:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# loss_fn = nn.CrossEntropyLoss()
#
# for images, labels in dataloader:
#     loss = loss_fn(model(images), labels)
#     loss.backward()       # Automatic gradient computation!
#     optimizer.step()
#     optimizer.zero_grad()

# Key differences:
differences = {
    "Speed": "Framework: GPU-accelerated CUDA kernels. Ours: pure Python loops.",
    "Gradients": "Framework: automatic differentiation. Ours: manual backward().",
    "Optimization": "Framework: Adam, SGD, etc. built-in. Ours: vanilla SGD only.",
    "Batching": "Framework: handles batches natively. Ours: single sample at a time.",
    "Memory": "Framework: optimized memory management. Ours: stores full input cache.",
}

print("Why use frameworks for production:")
print("=" * 50)
for aspect, explanation in differences.items():
    print(f"\n{aspect}:")
    print(f"  {explanation}")

What’s Next

In this article we built convolutions, pooling, and a complete CNN from the ground up. You now understand:

  • How spatial weight sharing reduces parameters by 100× or more
  • The mechanics of kernel sliding, stride, and padding
  • How max pooling creates translation invariance
  • Full forward and backward passes through convolutional layers
  • Why frameworks matter for production but understanding matters for mastery

Next in the Series

In Part 7: RNNs & LSTMs Deep Dive, we tackle sequential data. You will implement recurrent neural networks and Long Short-Term Memory cells from scratch, understanding how gating mechanisms solve the vanishing gradient problem for sequences like text and time series.