
Part 5: CNNs & Computer Vision

May 3, 2026 Wasil Zafar 35 min read

Build convolutional neural networks from scratch in PyTorch — understand convolutions, pooling, batch normalization, train on CIFAR-10, visualize feature maps, and implement ResNet skip connections.

Table of Contents

  1. Why CNNs for Vision?
  2. Convolution Operations
  3. Pooling Layers
  4. BatchNorm & Dropout for CNNs
  5. Building a CNN from Scratch
  6. Training a CNN End-to-End
  7. Understanding Feature Maps
  8. Famous CNN Architectures
  9. Implementing a ResNet Block
  10. Practical Tips for Vision
  11. Conclusion & Next Steps

Why CNNs for Vision?

In the previous parts of this series, we built fully-connected (dense) neural networks that can classify structured data and even simple images. But what happens when you try to feed a real-world photograph — say, a 224 × 224 colour image — into a fully-connected layer? You get a parameter explosion that makes training impractical and almost guarantees overfitting.

The Fully-Connected Problem

Consider a single 224 × 224 RGB image. Flattened, that is 224 × 224 × 3 = 150,528 input values. If your first hidden layer has just 1,000 neurons, you already need over 150 million parameters in that single layer alone. This is wildly inefficient because it ignores two properties of images: (1) images have strong local spatial structure — a pixel's meaning depends mostly on its neighbours, not on pixels on the opposite side of the image; and (2) the same edge or texture can appear anywhere in the image, so we want the network to reuse the same detector across all positions.

The following code demonstrates just how enormous the parameter count becomes with a fully-connected approach to images:

import torch
import torch.nn as nn

# A tiny 32x32 RGB image (like CIFAR-10) flattened
input_size = 32 * 32 * 3  # 3,072 values

# Fully-connected layer with 512 neurons
fc_layer = nn.Linear(input_size, 512)

# Count parameters: weight matrix + bias
total_params = sum(p.numel() for p in fc_layer.parameters())
print(f"Input size: {input_size}")
print(f"FC layer parameters: {total_params:,}")
# FC layer parameters: 1,573,376

# Now imagine a 224x224 RGB image
input_large = 224 * 224 * 3  # 150,528 values
fc_large = nn.Linear(input_large, 512)
total_large = sum(p.numel() for p in fc_large.parameters())
print(f"\n224x224 FC layer parameters: {total_large:,}")
# 224x224 FC layer parameters: 77,070,848

Even with the small 32 × 32 CIFAR-10 images, a single fully-connected layer already has over 1.5 million parameters. Scale up to ImageNet-sized inputs and the numbers become absurd. Convolutional Neural Networks solve this by exploiting two key ideas:

The Two Key Ideas Behind CNNs
  • Local connectivity — each neuron connects only to a small spatial region (the receptive field) instead of every input pixel
  • Weight sharing — the same small filter (kernel) slides across the entire image, so the network learns one set of weights that works everywhere

Together, these reduce parameters by orders of magnitude while preserving — and even enhancing — the network's ability to detect spatial patterns.

The diagram below illustrates how a convolutional layer processes an image compared to a fully-connected layer. Notice how each output neuron in the convolutional layer connects to only a small patch of the input, and the same kernel weights are reused across all positions:

Convolution vs Fully-Connected Comparison
flowchart LR
    subgraph FC["Fully-Connected Layer"]
        direction TB
        A1["Every pixel"] -->|"150K+ weights"| B1["Single neuron"]
    end
    subgraph CNN["Convolutional Layer"]
        direction TB
        A2["3×3 patch"] -->|"9 shared weights"| B2["One output pixel"]
        A3["Next 3×3 patch"] -->|"Same 9 weights"| B3["Next output pixel"]
        A4["Slides across image"] -->|"Same 9 weights"| B4["Full feature map"]
    end
    FC -.->|"Replace with"| CNN
                            

Convolution Operations

The Core Idea (Plain English)

A convolution is just a small magnifying glass sliding over an image, looking for one specific pattern at every position. That's it. The "magnifying glass" is a tiny grid of numbers (the kernel/filter), and "looking for a pattern" means computing a dot product.

The Best Analogy: A Metal Detector at the Beach

Imagine sweeping a metal detector across a beach:

  • The detector = the kernel/filter (a small 3×3 or 5×5 grid of learned weights)
  • Sweeping it across the beach = sliding the kernel over the image (stride)
  • The beep strength at each position = the output feature map (how strongly the pattern is detected there)
  • Multiple detectors = multiple output channels (one for edges, one for curves, one for textures...)

Key insight: one detector, reused everywhere. You don't need a separate detector for each square metre — the same kernel works at every position. That's weight sharing.

Ultra-compressed version:

# Convolution in pseudocode:
for every_position in image:
    output[position] = dot_product(kernel, image_patch_at_position)

# One 3×3 kernel = 9 parameters
# Slides across entire image = detects the SAME pattern EVERYWHERE
# 16 kernels = 16 different patterns detected simultaneously

At the heart of every CNN is the convolution operation. A small matrix of learnable weights — called a kernel or filter — slides across the input image, computing a dot product at each position. The result is a feature map that highlights where specific patterns (edges, corners, textures) appear in the input.

In PyTorch, the workhorse class is nn.Conv2d. Let's explore its parameters one by one:

  • in_channels — number of input channels (3 for RGB, 1 for grayscale, or the depth from a previous conv layer)
  • out_channels — number of filters to learn; each produces one feature map
  • kernel_size — height and width of the sliding window (e.g., 3 means a 3×3 filter)
  • stride — how many pixels the kernel moves at each step (default 1)
  • padding — zero-padding added around the input to control the output spatial size

The following code creates a convolutional layer and passes a single image through it so you can see the input and output shapes:

import torch
import torch.nn as nn

# Create a Conv2d layer:
#   3 input channels (RGB), 16 output channels (filters),
#   3x3 kernel, stride=1, padding=1 (same padding)
conv = nn.Conv2d(in_channels=3, out_channels=16,
                 kernel_size=3, stride=1, padding=1)

# Simulate a batch of 1 RGB image, 32x32 pixels
# Shape: (batch, channels, height, width)
x = torch.randn(1, 3, 32, 32)
print(f"Input shape:  {x.shape}")   # [1, 3, 32, 32]

# Forward pass through the conv layer
output = conv(x)
print(f"Output shape: {output.shape}")  # [1, 16, 32, 32]

# With padding=1 and stride=1, spatial size is preserved!
# We now have 16 feature maps instead of 3 channels

# Count parameters: (in_ch * kernel_h * kernel_w * out_ch) + out_ch (bias)
params = sum(p.numel() for p in conv.parameters())
print(f"Conv2d parameters: {params}")  # (3*3*3*16) + 16 = 448

Notice the dramatic difference: the convolutional layer uses only 448 parameters to process the same 32×32 RGB image that required over 1.5 million with a fully-connected layer. That's the power of weight sharing.

Feature Maps & Receptive Field

Each of the 16 output channels from our conv layer is called a feature map. You can think of each one as a "heat map" that lights up wherever its particular filter detects a matching pattern. Early-layer filters typically learn to detect simple features like horizontal edges, vertical edges, and colour gradients.

The receptive field is the region of the original input that influences a particular output pixel. A single 3×3 conv layer has a receptive field of 3×3. Stack two 3×3 conv layers and the effective receptive field grows to 5×5, because each output neuron in the second layer "looks through" the first layer. This is why deep CNNs can capture increasingly complex patterns.
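
To make that growth concrete, here is a small helper (an illustrative sketch, not a library function) that applies the standard receptive-field recursion: the field grows by (kernel − 1) times the product of all earlier strides.

def effective_receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, in order.
    """
    rf = 1      # receptive field of the input itself
    jump = 1    # product of strides of all earlier layers
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convs (stride 1) see a 5x5 region of the input
print(effective_receptive_field([(3, 1), (3, 1)]))          # 5
# Three stacked 3x3 convs see 7x7
print(effective_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# 3x3 conv, then 2x2 max-pool, then another 3x3 conv
print(effective_receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8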

Calculating Output Sizes

Understanding how convolution changes spatial dimensions is essential for designing architectures. The formula is:

Output Size = ⌊(Input Size + 2 × Padding − Kernel Size) / Stride⌋ + 1

The code below is a handy utility that computes the output size for any combination of parameters, plus demonstrates a few common configurations:

import torch
import torch.nn as nn

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Calculate the output spatial dimension after a Conv2d layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Common configurations on a 32x32 input
configs = [
    {"kernel": 3, "stride": 1, "padding": 0, "label": "3x3, no padding"},
    {"kernel": 3, "stride": 1, "padding": 1, "label": "3x3, same padding"},
    {"kernel": 5, "stride": 1, "padding": 2, "label": "5x5, same padding"},
    {"kernel": 3, "stride": 2, "padding": 1, "label": "3x3, stride 2"},
    {"kernel": 7, "stride": 2, "padding": 3, "label": "7x7, stride 2"},
]

print("Input: 32x32")
print("-" * 45)
for cfg in configs:
    out = conv_output_size(32, cfg["kernel"], cfg["stride"], cfg["padding"])
    print(f"{cfg['label']:25s} → {out}x{out}")

# Verify with actual PyTorch layers
print("\nVerification with PyTorch:")
x = torch.randn(1, 3, 32, 32)
conv_stride2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
print(f"3x3, stride 2 output: {conv_stride2(x).shape}")  # [1, 16, 16, 16]

The key takeaway: padding=1 with a 3×3 kernel preserves spatial dimensions (called "same" padding), while stride=2 halves them. These two knobs — padding and stride — are your primary tools for controlling feature map sizes throughout the network.

Pooling Layers

The Core Idea (Plain English)

Pooling is just zooming out. After your convolution finds edges and textures, pooling shrinks the image so the network can look at bigger patterns without drowning in pixel-level detail.

The Best Analogy: Summarizing a Paragraph

If each pixel is a word:

  • MaxPool = "Keep only the most important word in every 4-word chunk" (aggressive, keeps peaks)
  • AvgPool = "Average the meaning of every 4-word chunk" (smooth, preserves overall tone)
  • AdaptivePool = "Summarize the entire paragraph into exactly N sentences, regardless of input length"

Result: the image gets smaller (fewer pixels) but richer (each remaining pixel represents a larger area).

While convolutions can reduce spatial dimensions via stride, pooling layers provide a more explicit way to downsample feature maps. Pooling reduces the spatial resolution while retaining the most important information, which accomplishes three things: (1) reduces the number of parameters in subsequent layers, (2) increases the effective receptive field, and (3) provides a degree of translation invariance — the network becomes less sensitive to small shifts in the input.

PyTorch provides two main pooling operations. MaxPool2d takes the maximum value in each window — it's aggressive and preserves the strongest activations. AvgPool2d averages all values in the window — it's smoother and preserves more subtle information. Let's compare them side by side:

import torch
import torch.nn as nn

# Create a small 4x4 feature map (1 batch, 1 channel)
x = torch.tensor([[[[1.0, 2.0, 3.0, 4.0],
                     [5.0, 6.0, 7.0, 8.0],
                     [9.0, 10., 11., 12.],
                     [13., 14., 15., 16.]]]])
print(f"Input shape: {x.shape}")  # [1, 1, 4, 4]

# MaxPool2d with 2x2 window, stride 2
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
max_out = max_pool(x)
print(f"\nMaxPool2d output:\n{max_out}")
# [[[ 6.,  8.],
#   [14., 16.]]]

# AvgPool2d with 2x2 window, stride 2
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
avg_out = avg_pool(x)
print(f"\nAvgPool2d output:\n{avg_out}")
# [[[3.5, 5.5],
#   [11.5, 13.5]]]

print(f"\nMax output shape: {max_out.shape}")  # [1, 1, 2, 2]
print(f"Avg output shape: {avg_out.shape}")    # [1, 1, 2, 2]

Both pooling layers reduced the 4×4 input to 2×2 — halving each spatial dimension. MaxPool kept the largest value in each 2×2 region, while AvgPool computed the mean. In practice, MaxPool2d is far more common in classification networks because it preserves the strongest detected features.
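
MaxPool2d is also the source of the translation invariance mentioned earlier. The minimal sketch below shifts a single strong activation by one pixel: as long as the shift stays inside the same 2×2 pooling window, the pooled output does not change at all.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# A single strong activation at position (0, 0) ...
a = torch.zeros(1, 1, 4, 4)
a[0, 0, 0, 0] = 9.0

# ... and the same activation shifted by one pixel to (1, 1)
b = torch.zeros(1, 1, 4, 4)
b[0, 0, 1, 1] = 9.0

# Both positions fall inside the same 2x2 window, so pooling hides the shift
print(pool(a).squeeze())              # [[9., 0.], [0., 0.]]
print(pool(b).squeeze())              # [[9., 0.], [0., 0.]]
print(torch.equal(pool(a), pool(b)))  # True

Larger shifts move the activation into a neighbouring output cell, so the invariance is only approximate, but the tolerance to small shifts compounds across successive pooling stages.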

AdaptiveAvgPool2d

AdaptiveAvgPool2d is a special pooling layer that accepts a target output size rather than a kernel size. No matter how large or small the input, it automatically adjusts the pooling window to produce the exact output dimensions you specify. This is incredibly useful for making your network accept images of any resolution:

import torch
import torch.nn as nn

# AdaptiveAvgPool2d always outputs the target size
adaptive_pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))

# Test with different input sizes
for size in [8, 16, 32, 64]:
    x = torch.randn(1, 256, size, size)
    out = adaptive_pool(x)
    print(f"Input: {size}x{size} → Output: {out.shape}")
    # Always produces [1, 256, 1, 1] regardless of input size

# This is why modern architectures use it before the classifier:
# it collapses spatial dims to 1x1, giving a fixed-size vector
out_flat = adaptive_pool(torch.randn(1, 512, 7, 7)).view(1, -1)
print(f"\nFlattened for classifier: {out_flat.shape}")  # [1, 512]

This is exactly how architectures like ResNet handle the transition from convolutional feature maps to the fully-connected classifier head — AdaptiveAvgPool2d((1, 1)) collapses any spatial dimensions down to 1×1, giving you a fixed-length feature vector regardless of input image size.

Batch Normalization & Dropout for CNNs

As CNNs grow deeper, two problems emerge: internal covariate shift (the distribution of each layer's inputs changes during training, slowing convergence) and overfitting (the model memorizes training data instead of learning general patterns). Batch normalization and dropout are the standard remedies.

BatchNorm2d normalizes each channel's activations across the batch to have zero mean and unit variance, then applies learnable scale (gamma) and shift (beta) parameters. The key benefit is that it allows higher learning rates and acts as a mild regularizer. For convolutional layers, you always use BatchNorm2d (not BatchNorm1d) because it operates on 4D tensors (batch, channels, height, width).

import torch
import torch.nn as nn

# BatchNorm2d: normalizes across (batch, height, width) per channel
bn = nn.BatchNorm2d(num_features=16)  # 16 channels

# Simulate a batch of 4 images, 16 channels, 8x8 spatial
x = torch.randn(4, 16, 8, 8) * 5 + 10  # intentionally shifted and scaled

print(f"Before BN — mean: {x.mean():.2f}, std: {x.std():.2f}")

# Apply batch norm (in training mode by default)
x_bn = bn(x)
print(f"After BN  — mean: {x_bn.mean():.4f}, std: {x_bn.std():.4f}")
# Mean ≈ 0, Std ≈ 1

# BN has 2 learnable parameters per channel: gamma (weight) and beta (bias)
print(f"\nBN parameters: {sum(p.numel() for p in bn.parameters())}")
# 16 (gamma) + 16 (beta) = 32 parameters

Batch normalization re-centres and re-scales each channel's activations, taming the distribution so that subsequent layers always receive well-behaved inputs. This seemingly simple trick often cuts training time in half and stabilizes deeper networks.

Dropout2d

Standard Dropout zeroes individual elements, but for convolutional feature maps this can be ineffective because adjacent pixels in the same channel are highly correlated — if you zero one, its neighbours still carry the same information. Dropout2d solves this by dropping entire channels at once, forcing the network to avoid relying on any single feature map:

import torch
import torch.nn as nn

# Dropout2d: drops entire channels (not individual pixels)
dropout2d = nn.Dropout2d(p=0.25)

x = torch.ones(1, 8, 4, 4)  # 1 batch, 8 channels, 4x4

# Training mode: some channels will be zeroed entirely
dropout2d.train()
x_dropped = dropout2d(x)

# Count how many channels were zeroed
for ch in range(8):
    is_zero = (x_dropped[0, ch] == 0).all().item()
    status = "DROPPED" if is_zero else "kept"
    print(f"Channel {ch}: {status}")

# Eval mode: no dropout applied
dropout2d.eval()
x_eval = dropout2d(x)
print(f"\nEval mode — all channels present: {(x_eval == x).all().item()}")

The standard placement order in a conv block is: Conv2d → BatchNorm2d → ReLU → Dropout2d (though some practitioners skip Dropout2d when BatchNorm already provides sufficient regularization).
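
As a quick reference, here is one way to bundle that ordering into a reusable block. This is just a sketch: the conv_block helper name and the dropout probability are illustrative choices, not a fixed recipe.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, p_drop=0.1):
    """Conv2d, BatchNorm2d, ReLU, Dropout2d: the ordering described above."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Dropout2d(p=p_drop),
    )

block = conv_block(3, 32)
x = torch.randn(8, 3, 32, 32)
print(block(x).shape)  # [8, 32, 32, 32]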

Common Mistake: BN Before vs After Activation

The original BatchNorm paper places BN before the activation: Conv → BN → ReLU. Some recent work suggests placing it after: Conv → ReLU → BN. Both work, but be consistent throughout your network. Mixing the two orderings in different layers can lead to subtle training instabilities.

Building a CNN from Scratch

The Core Idea (Plain English)

A CNN is a two-stage pipeline: first extract features (what's in the image?), then classify (which category?). That's the entire architecture.

The Best Analogy: A Multi-Stage Funnel

Think of a CNN as a funnel that compresses images into decisions:

  • Wide end (input) — raw pixels, lots of spatial detail, few channels (3 for RGB)
  • Each conv block — squeezes space smaller, adds more channels (richer features)
  • Narrow end (output) — no spatial info left, just a compact "what is this?" vector
  • Classifier — reads that vector and picks a class label

Pattern: [3×32×32] → [32×32×32] → [64×16×16] → [128×8×8] → [128] → "cat"
Spatial dims shrink, channel depth grows. That's every CNN.

Now let's put everything together and build a complete CNN for classifying CIFAR-10 images (32×32 colour images across 10 classes). A typical CNN architecture has two parts: a feature extractor (conv blocks that learn spatial features) and a classifier head (fully-connected layers that map features to class predictions).

The following diagram shows the high-level architecture of our CNN. The feature extractor progressively reduces spatial dimensions while increasing channel depth, and the classifier flattens the output into a 10-class prediction:

CNN Architecture for CIFAR-10
flowchart LR
    A["Input 3×32×32"] --> B["Conv Block 1: 32 filters → 32×32×32"]
    B --> C["Conv Block 2: 64 filters → 64×16×16"]
    C --> D["Conv Block 3: 128 filters → 128×8×8"]
    D --> E["AdaptivePool → 128×1×1"]
    E --> F["Flatten → 128"]
    F --> G["FC 128→64, ReLU + Dropout"]
    G --> H["FC 64→10, Logits"]

Each "Conv Block" consists of a Conv2d, BatchNorm2d, ReLU activation, and MaxPool2d. The first block preserves spatial size (padding=1, no pooling reduction), while blocks 2 and 3 each halve the dimensions through pooling. Here is the full implementation:

import torch
import torch.nn as nn

class CIFAR10CNN(nn.Module):
    """A 3-block CNN for CIFAR-10 classification."""

    def __init__(self, num_classes=10):
        super().__init__()

        # Feature extractor: 3 convolutional blocks
        self.features = nn.Sequential(
            # Block 1: 3 → 32 channels, 32x32 → 32x32 (stride 1, no pool)
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),

            # Block 2: 32 → 64 channels, 32x32 → 16x16 (pool halves)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Block 3: 64 → 128 channels, 16x16 → 8x8 (pool halves)
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )

        # Global average pooling → flatten → classifier
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        x = self.features(x)      # [B, 128, 8, 8]
        x = self.pool(x)          # [B, 128, 1, 1]
        x = x.view(x.size(0), -1) # [B, 128]
        x = self.classifier(x)    # [B, 10]
        return x

# Instantiate and inspect
model = CIFAR10CNN()
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")

# Test with a dummy batch
dummy = torch.randn(4, 3, 32, 32)
output = model(dummy)
print(f"Output shape: {output.shape}")  # [4, 10]

Forward Pass Walkthrough

To build intuition about what happens at each stage, let's trace the shape transformations through the network. Understanding these intermediate sizes is crucial for debugging shape mismatches — one of the most common CNN errors:

import torch
import torch.nn as nn

# Recreate model components to trace shapes manually
conv1 = nn.Conv2d(3, 32, 3, padding=1)
bn1 = nn.BatchNorm2d(32)
conv2 = nn.Conv2d(32, 64, 3, padding=1)
bn2 = nn.BatchNorm2d(64)
pool = nn.MaxPool2d(2, 2)
conv3 = nn.Conv2d(64, 128, 3, padding=1)
bn3 = nn.BatchNorm2d(128)
gap = nn.AdaptiveAvgPool2d((1, 1))

x = torch.randn(2, 3, 32, 32)
print(f"Input:          {x.shape}")

x = torch.relu(bn1(conv1(x)))
print(f"After Block 1:  {x.shape}")  # [2, 32, 32, 32]

x = pool(torch.relu(bn2(conv2(x))))
print(f"After Block 2:  {x.shape}")  # [2, 64, 16, 16]

x = pool(torch.relu(bn3(conv3(x))))
print(f"After Block 3:  {x.shape}")  # [2, 128, 8, 8]

x = gap(x)
print(f"After GAP:      {x.shape}")  # [2, 128, 1, 1]

x = x.view(x.size(0), -1)
print(f"After Flatten:  {x.shape}")  # [2, 128]

This trace shows the pattern clearly: spatial dimensions shrink (32 → 16 → 8 → 1) while channel depth grows (3 → 32 → 64 → 128). The network progressively trades spatial resolution for richer feature representations. By the time we reach the classifier, each image is a compact 128-dimensional vector.

Training a CNN End-to-End

Having built the architecture, let's train it on CIFAR-10. This section covers the complete pipeline: loading data with transforms, running the training loop, tracking validation metrics, and visualizing the results. We'll use torchvision for both the dataset and the augmentation pipeline.

First, let's set up the data loaders with appropriate augmentation for training and standard preprocessing for validation:

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data augmentation for training; only normalize for validation
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2470, 0.2435, 0.2616]),
])

val_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4914, 0.4822, 0.4465],
                         std=[0.2470, 0.2435, 0.2616]),
])

# Load CIFAR-10
train_set = datasets.CIFAR10(root='./data', train=True,
                              download=True, transform=train_transform)
val_set = datasets.CIFAR10(root='./data', train=False,
                            download=True, transform=val_transform)

train_loader = DataLoader(train_set, batch_size=128,
                          shuffle=True, num_workers=2)
val_loader = DataLoader(val_set, batch_size=256,
                        shuffle=False, num_workers=2)

print(f"Training samples: {len(train_set):,}")
print(f"Validation samples: {len(val_set):,}")
print(f"Classes: {train_set.classes}")

The augmentation pipeline randomly flips images horizontally, crops with padding (simulating small translations), and slightly jitters brightness and contrast. These transformations artificially increase the diversity of the training data, reducing overfitting. Crucially, we do not augment the validation set — it must reflect real-world conditions to give reliable accuracy estimates.

Training Loop with Validation

The training loop below trains for a configurable number of epochs, evaluates on the validation set after each epoch, and records metrics for plotting. The device variable automatically uses a GPU if available:

import torch
import torch.nn as nn
import torch.optim as optim

# ---- Define CIFAR10CNN inline so this snippet is self-contained ----
class CIFAR10CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(True),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.MaxPool2d(2, 2),
        )
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(True), nn.Dropout(0.5),
            nn.Linear(64, num_classes),
        )
    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).view(x.size(0), -1)
        return self.classifier(x)

# ---- Training setup ----
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CIFAR10CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# Placeholder: replace with actual DataLoaders from the previous snippet
# train_loader = ...
# val_loader = ...

num_epochs = 3  # increase to 20-30 for real training
history = {'train_loss': [], 'val_loss': [], 'val_acc': []}

for epoch in range(num_epochs):
    # --- Training phase ---
    model.train()
    running_loss = 0.0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)

    epoch_train_loss = running_loss / len(train_loader.dataset)
    history['train_loss'].append(epoch_train_loss)

    # --- Validation phase ---
    model.eval()
    val_loss, correct, total = 0.0, 0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            val_loss += criterion(outputs, labels).item() * images.size(0)
            _, predicted = outputs.max(1)
            correct += predicted.eq(labels).sum().item()
            total += labels.size(0)

    epoch_val_loss = val_loss / total
    epoch_val_acc = 100.0 * correct / total
    history['val_loss'].append(epoch_val_loss)
    history['val_acc'].append(epoch_val_acc)

    scheduler.step()
    print(f"Epoch {epoch+1}/{num_epochs} — "
          f"Train Loss: {epoch_train_loss:.4f}, "
          f"Val Loss: {epoch_val_loss:.4f}, "
          f"Val Acc: {epoch_val_acc:.1f}%")

print("\nTraining complete!")

A few key details: weight_decay=1e-4 in Adam adds L2 regularization, StepLR halves the learning rate every 10 epochs to fine-tune convergence, and we call model.eval() during validation to disable dropout and switch batch norm to use running statistics. With 20-30 epochs, this simple architecture should reach roughly 80-85% accuracy on CIFAR-10.

Plotting Loss & Accuracy Curves

Visualizing training dynamics helps diagnose problems like overfitting (training loss keeps falling while validation loss rises) or underfitting (both losses remain high). Here's a reusable plotting snippet:

import matplotlib.pyplot as plt

# Example history data (replace with real values from training)
history = {
    'train_loss': [1.45, 1.12, 0.95, 0.82, 0.73, 0.66, 0.60, 0.55],
    'val_loss':   [1.30, 1.05, 0.92, 0.85, 0.82, 0.80, 0.79, 0.78],
    'val_acc':    [52.0, 62.5, 68.3, 72.1, 74.8, 76.2, 77.5, 78.3],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Loss curves
epochs = range(1, len(history['train_loss']) + 1)
ax1.plot(epochs, history['train_loss'], 'b-o', label='Train Loss', markersize=4)
ax1.plot(epochs, history['val_loss'], 'r-s', label='Val Loss', markersize=4)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training & Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy curve
ax2.plot(epochs, history['val_acc'], 'g-^', label='Val Accuracy', markersize=4)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_curves.png', dpi=150, bbox_inches='tight')
plt.show()
print("Training curves saved to training_curves.png")

If you see the gap between training loss and validation loss widening, that's a sign of overfitting — add more augmentation, increase dropout, or use weight decay. If both losses plateau early at a high value, the model may be too small and you need more capacity (deeper or wider architecture).

Understanding Feature Maps

The Core Idea (Plain English)

If convolution kernels are "detectors," feature maps are the detection reports — heat maps showing where each pattern appears in the image. Visualizing these reveals what the CNN is actually "seeing" at each layer.

The Best Analogy: Different Colored Glasses

Imagine looking at a photo through different tinted glasses:

  • Layer 1 glasses — you see only edges (like night vision highlighting outlines)
  • Layer 2 glasses — you see textures (fur, brick, water ripples)
  • Layer 3 glasses — you see object parts (eyes, wheels, leaves)
  • Deep layer glasses — you see entire objects (faces, cars, dogs)

Each feature map is one pair of glasses. A CNN with 128 filters in a layer = 128 different ways of "seeing" the same image simultaneously.

One of the most powerful ways to understand what a CNN has learned is to visualize its intermediate feature maps — the activations produced by each convolutional layer. Early layers typically detect low-level features like edges and gradients, middle layers capture textures and patterns, and deeper layers respond to high-level object parts.

Visualizing Intermediate Activations

To extract feature maps, we register forward hooks on the layers we're interested in. A hook is a callback function that PyTorch calls every time data passes through that layer, giving us access to the output without modifying the model:

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Simple 2-layer CNN for demonstration
class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, 3, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(8, 16, 3, padding=1)
        self.relu2 = nn.ReLU()

    def forward(self, x):
        x = self.relu1(self.conv1(x))
        x = self.relu2(self.conv2(x))
        return x

model = TinyCNN()
activations = {}

# Register hooks to capture activations
def get_hook(name):
    def hook(module, input, output):
        activations[name] = output.detach()
    return hook

model.relu1.register_forward_hook(get_hook('conv1'))
model.relu2.register_forward_hook(get_hook('conv2'))

# Pass a random "image" through the model
x = torch.randn(1, 3, 32, 32)
_ = model(x)

# Visualize first 4 feature maps from conv1
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for i in range(4):
    axes[i].imshow(activations['conv1'][0, i].numpy(), cmap='viridis')
    axes[i].set_title(f'Conv1 Filter {i}')
    axes[i].axis('off')
plt.suptitle('Feature Maps After Conv Layer 1', fontsize=14)
plt.tight_layout()
plt.show()

print(f"Conv1 activations shape: {activations['conv1'].shape}")
print(f"Conv2 activations shape: {activations['conv2'].shape}")

Even with random weights (untrained), you can see that different filters respond to different parts of the input. After training, these patterns become much more meaningful — some filters become dedicated edge detectors, others respond to specific colours, and deeper filters activate on complex textures or object parts. This "hierarchy of features" is the fundamental insight behind deep learning for vision.

Experiment
Feature Map Exploration

Try training the TinyCNN on a simple dataset (like MNIST) for a few epochs, then visualize the feature maps again. You'll see that Conv1 filters learn to detect edges at different orientations, while Conv2 filters combine those edges into corners, curves, and more complex patterns.


Famous CNN Architectures

The history of CNNs is a fascinating story of increasing depth, clever engineering, and breakthrough performance on ImageNet. Understanding these milestone architectures gives you a toolkit of design patterns you can apply to your own models.

LeNet-5 (1998)

Yann LeCun's pioneering architecture for handwritten digit recognition. Two conv layers followed by three fully-connected layers. Simple, elegant, and the blueprint for everything that followed. It proved that learned features outperform hand-crafted ones for image recognition.

AlexNet (2012)

The architecture that ignited the deep learning revolution by winning ImageNet 2012 with a massive margin. Key innovations: ReLU activation (instead of sigmoid), dropout for regularization, and GPU training. It had 5 conv layers and 3 FC layers, with about 60 million parameters.

VGGNet (2014)

Showed that depth matters. VGG-16 stacked 13 conv layers (all 3×3 kernels) plus 3 FC layers. The key insight: two stacked 3×3 convs have the same receptive field as one 5×5 conv but with fewer parameters and more non-linearity. However, its 138 million parameters made it expensive.
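
That parameter claim is easy to verify. The sketch below compares one 5×5 convolution against two stacked 3×3 convolutions on 64-channel feature maps (the channel count is chosen purely for illustration):

import torch.nn as nn

channels = 64

# One 5x5 conv
conv5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

# Two stacked 3x3 convs: same 5x5 receptive field, plus an extra non-linearity
conv3_stack = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)

p5 = sum(p.numel() for p in conv5.parameters())
p3 = sum(p.numel() for p in conv3_stack.parameters())
print(f"One 5x5 conv:  {p5:,} parameters")  # 102,464
print(f"Two 3x3 convs: {p3:,} parameters")  # 73,856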

GoogLeNet / Inception (2014)

Introduced the Inception module — processing the same input through parallel branches of 1×1, 3×3, and 5×5 convolutions, then concatenating the results. This let the network decide which filter size is most appropriate at each layer. Dramatically more parameter-efficient than VGG.
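
To make the idea concrete, here is a heavily simplified Inception-style module. It is a sketch of the parallel-branch concept only, not the actual GoogLeNet block, which also uses 1×1 convolutions to reduce channels before the larger kernels:

import torch
import torch.nn as nn

class SimpleInceptionBlock(nn.Module):
    """Parallel 1x1, 3x3, and 5x5 branches concatenated along the channel dim."""

    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)

    def forward(self, x):
        # Every branch sees the same input; outputs are stacked channel-wise
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)

block = SimpleInceptionBlock(in_ch=64, branch_ch=32)
x = torch.randn(1, 64, 16, 16)
print(block(x).shape)  # [1, 96, 16, 16] -- three branches of 32 channels each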

ResNet (2015)

The single most influential CNN architecture. Introduced skip connections (residual connections) that allow gradients to flow directly through the network, enabling training of extremely deep networks (50, 101, even 152 layers). Before ResNet, networks deeper than ~20 layers suffered from degradation — adding more layers actually hurt performance. Residual connections solved this by letting each block learn the residual (difference) rather than the full mapping.

Evolution Timeline

Here's a quick reference comparing these architectures and their key contributions. The table shows how CNNs evolved from a handful of layers to hundreds, while error rates on ImageNet plummeted:

# CNN Architecture Evolution — reference table
architectures = [
    {"name": "LeNet-5",    "year": 1998, "layers":  7,  "params": "60K",    "top5_err": "N/A",   "key_idea": "First successful CNN"},
    {"name": "AlexNet",    "year": 2012, "layers":  8,  "params": "60M",    "top5_err": "16.4%", "key_idea": "ReLU, Dropout, GPU training"},
    {"name": "VGGNet-16",  "year": 2014, "layers": 16,  "params": "138M",   "top5_err": "7.3%",  "key_idea": "Stacked 3x3 convolutions"},
    {"name": "GoogLeNet",  "year": 2014, "layers": 22,  "params": "6.8M",   "top5_err": "6.7%",  "key_idea": "Inception module (parallel filters)"},
    {"name": "ResNet-50",  "year": 2015, "layers": 50,  "params": "25.6M",  "top5_err": "3.6%",  "key_idea": "Skip connections"},
    {"name": "ResNet-152", "year": 2015, "layers": 152, "params": "60.2M",  "top5_err": "3.0%",  "key_idea": "Very deep residual networks"},
]

print(f"{'Architecture':<14} {'Year':<6} {'Layers':<8} {'Params':<10} {'Top-5 Err':<10} {'Key Idea'}")
print("-" * 85)
for arch in architectures:
    print(f"{arch['name']:<14} {arch['year']:<6} {arch['layers']:<8} {arch['params']:<10} {arch['top5_err']:<10} {arch['key_idea']}")

The trend is clear: each generation went deeper, but raw depth alone isn't enough. The biggest breakthroughs came from architectural innovations — Inception's parallel branches, ResNet's skip connections — that made depth actually useful instead of harmful.

Why ResNet Changed Everything

Before ResNet, a 56-layer CNN performed worse than a 20-layer CNN on both training and test sets. This wasn't overfitting — it was a deeper problem with gradient flow. Skip connections fix this by providing a "shortcut" for gradients, letting information bypass layers entirely. If a layer has nothing useful to add, the network can simply pass the input through unchanged.

The mathematical insight: instead of learning H(x) directly, each block learns the residual F(x) = H(x) - x and the output is F(x) + x. Learning small residuals is far easier than learning the full transformation from scratch.
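
A small autograd experiment illustrates why the identity path matters for gradient flow. The sketch below uses plain linear layers with tanh activations (deliberately not a real ResNet) and compares the gradient that reaches the input of a 50-layer plain stack with the gradient reaching the same stack wired with skip connections:

import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]

x = torch.randn(1, dim, requires_grad=True)

# Plain stack: each layer replaces the representation entirely
h = x
for layer in layers:
    h = layer(h)
h.sum().backward()
plain_grad = x.grad.norm().item()

# Residual stack: each layer only adds a correction to the running representation
x.grad = None
h = x
for layer in layers:
    h = h + layer(h)
h.sum().backward()
residual_grad = x.grad.norm().item()

print(f"Gradient norm at input (plain stack):    {plain_grad:.2e}")
print(f"Gradient norm at input (residual stack): {residual_grad:.2e}")
# The residual stack delivers a far larger gradient to the input, because
# the identity path contributes a direct derivative of 1 at every block

The exact numbers depend on initialization, but the plain stack's input gradient is typically many orders of magnitude smaller, which is exactly the vanishing-gradient behaviour that residual connections avoid.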

Implementing a ResNet Block

The Core Idea (Plain English)

A residual block says: "Learn only what's NEW, and always keep what you already have." Instead of forcing a layer to learn the complete transformation from input to output, it only learns the difference (residual) — then adds it back to the original input via a skip connection.

The Best Analogy: Highway Bypass

Imagine a highway with optional stops:

  • Skip connection = the highway (input passes through unmodified)
  • Conv layers = the off-ramp/town (processes input, learns something new)
  • Addition = the on-ramp (merges what's new back onto the highway)

If the town has nothing useful to add? The data just stays on the highway unchanged. This means adding layers can never hurt — worst case, a block learns to output zeros and passes input through.

Ultra-compressed version:

# ResNet block = "learn the residual, add it back"
output = input + conv_layers(input)

# Why this works:
# - If conv_layers learns NOTHING useful → output = input (no harm done)
# - If conv_layers learns SOMETHING → output = input + improvement
# - Gradients flow through the "+" directly back to earlier layers (no vanishing!)

Now that we understand the theory, let's implement the core building block of ResNet: the residual block. In its simplest form, a residual block passes the input through two convolutional layers and then adds the original input back to the output. This "shortcut" or "skip connection" is what enables training of very deep networks.

There's one subtlety: if the input and output have different numbers of channels or different spatial dimensions, we can't simply add them. In that case, we use a 1×1 convolution on the shortcut path to match dimensions. This is called a projection shortcut:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block with two 3x3 conv layers."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()

        # Main path: two 3x3 convolutions
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut path: 1x1 conv if dimensions change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        identity = self.shortcut(x)      # Shortcut connection

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        out += identity                    # Add skip connection
        out = self.relu(out)
        return out

# Test: same dimensions (identity shortcut)
block_same = ResidualBlock(64, 64, stride=1)
x = torch.randn(2, 64, 16, 16)
print(f"Same dims — Input: {x.shape}, Output: {block_same(x).shape}")

# Test: changing dimensions (projection shortcut)
block_down = ResidualBlock(64, 128, stride=2)
print(f"Downsample — Input: {x.shape}, Output: {block_down(x).shape}")

When stride=1 and channels don't change, the shortcut is just the identity (pass-through). When we need to downsample (stride=2) or change channels (64 → 128), the projection shortcut ensures the dimensions match for the addition. Note that we set bias=False on conv layers followed by batch norm, since BN already has a learnable bias term.

Building a Mini-ResNet

Let's stack multiple residual blocks into a complete mini-ResNet suitable for CIFAR-10. This architecture follows the same general structure as the original ResNet paper but with fewer layers to keep it manageable:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)
        return self.relu(out)

class MiniResNet(nn.Module):
    """A small ResNet for CIFAR-10 with 3 stages of residual blocks."""

    def __init__(self, num_classes=10):
        super().__init__()

        # Initial convolution
        self.prep = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
        )

        # Stage 1: 32 channels, 32x32
        self.stage1 = nn.Sequential(
            ResidualBlock(32, 32),
            ResidualBlock(32, 32),
        )

        # Stage 2: 64 channels, 16x16
        self.stage2 = nn.Sequential(
            ResidualBlock(32, 64, stride=2),
            ResidualBlock(64, 64),
        )

        # Stage 3: 128 channels, 8x8
        self.stage3 = nn.Sequential(
            ResidualBlock(64, 128, stride=2),
            ResidualBlock(128, 128),
        )

        # Global average pool + classifier
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.prep(x)       # [B, 32, 32, 32]
        x = self.stage1(x)     # [B, 32, 32, 32]
        x = self.stage2(x)     # [B, 64, 16, 16]
        x = self.stage3(x)     # [B, 128, 8, 8]
        x = self.pool(x)       # [B, 128, 1, 1]
        x = x.view(x.size(0), -1)
        return self.fc(x)

# Inspect
model = MiniResNet()
total = sum(p.numel() for p in model.parameters())
print(f"MiniResNet parameters: {total:,}")

# Verify with dummy input
out = model(torch.randn(4, 3, 32, 32))
print(f"Output shape: {out.shape}")  # [4, 10]

This MiniResNet has 6 residual blocks (2 per stage) plus the initial conv and final classifier. Despite being relatively shallow, it should outperform our plain CNN because the skip connections allow gradients to flow more easily during backpropagation, making every layer's contribution more effective.

Experiment
ResNet vs Plain CNN

Train both the CIFAR10CNN and MiniResNet on CIFAR-10 for 30 epochs with the same hyperparameters. Compare their validation accuracy curves. You should observe that MiniResNet converges faster and reaches higher final accuracy. Part of that comes from its larger parameter count, but the faster, more stable convergence is the signature benefit of residual connections.


Practical Tips for Vision Tasks

Building a CNN architecture is only half the battle. The difference between a mediocre model and a high-performing one often comes down to preprocessing, augmentation, and training strategies. Here are the most impactful techniques for real-world vision tasks.

Image Preprocessing Best Practices

Always normalize your images using the dataset's channel-wise mean and standard deviation. For ImageNet-pretrained models, use the ImageNet statistics. For custom datasets, compute them from your training set. Here's a utility to calculate these values from any dataset:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Load dataset with only ToTensor (no normalization yet)
raw_dataset = datasets.CIFAR10(root='./data', train=True,
                                download=True,
                                transform=transforms.ToTensor())
loader = DataLoader(raw_dataset, batch_size=1024,
                    shuffle=False, num_workers=2)

# Compute per-channel mean and std across the dataset
mean = torch.zeros(3)
std = torch.zeros(3)
n_samples = 0

for images, _ in loader:
    batch_size = images.size(0)
    images = images.view(batch_size, 3, -1)  # [B, 3, H*W]
    mean += images.mean(dim=[0, 2]) * batch_size
    # Weight-averaging per-batch stds is an approximation of the true
    # dataset std, but it is accurate enough for normalization purposes
    std += images.std(dim=[0, 2]) * batch_size
    n_samples += batch_size

mean /= n_samples
std /= n_samples

print(f"Dataset mean: [{mean[0]:.4f}, {mean[1]:.4f}, {mean[2]:.4f}]")
print(f"Dataset std:  [{std[0]:.4f}, {std[1]:.4f}, {std[2]:.4f}]")

Using the correct normalization statistics ensures that each channel has approximately zero mean and unit variance, which helps the optimizer converge faster and prevents one channel from dominating the learned features.

Data Augmentation for Different Domains

The choice of augmentation depends heavily on your domain. Medical images shouldn't be flipped if orientation matters; satellite images can be rotated by any angle; natural photos benefit from colour jitter. Here's a comprehensive augmentation pipeline demonstrating several common techniques:

import torch
from torchvision import transforms

# General-purpose augmentation pipeline
general_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3,
                           saturation=0.3, hue=0.1),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.2),  # cutout-style augmentation
])

# Medical imaging: no flips, small rotations, no colour changes
medical_augmentation = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.RandomAffine(degrees=5, translate=(0.05, 0.05)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

# Satellite / aerial: aggressive geometric transforms
aerial_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

print("Augmentation pipelines defined!")
print(f"General:  {len(general_augmentation.transforms)} transforms")
print(f"Medical:  {len(medical_augmentation.transforms)} transforms")
print(f"Aerial:   {len(aerial_augmentation.transforms)} transforms")

The golden rule of augmentation: only apply transforms that preserve the label. Flipping a photo of a cat still produces a cat, so horizontal flips are safe. But flipping an X-ray might change the diagnosis, so be careful. RandomErasing (a.k.a. cutout) randomly masks a rectangle of the image, forcing the network to look at multiple parts rather than relying on one discriminative region.

Transfer Learning Preview

In practice, you rarely train a CNN from scratch on a small dataset. Instead, you start with a model pre-trained on ImageNet (which has learned rich, general-purpose visual features) and fine-tune it on your specific task. This technique — called transfer learning — is covered in depth in Part 8, but here's a taste of how simple it is in PyTorch:

import torch
import torch.nn as nn
from torchvision import models

# Load a pre-trained ResNet-18 (downloads weights automatically)
model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze all feature extraction layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification head for your task (e.g., 5 classes)
num_classes = 5
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new fc layer will be trained
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
print(f"Only {100*trainable/total:.1f}% of parameters need training!")

# Quick test
x = torch.randn(2, 3, 224, 224)
output = model(x)
print(f"Output shape: {output.shape}")  # [2, 5]

With just a few lines, you get access to a model that has already learned to detect edges, textures, shapes, and objects from millions of images. For most real-world applications, transfer learning is the recommended starting point — you'll achieve better accuracy with less data and far less training time than training from scratch.

When to Train from Scratch vs Transfer Learn
  • Train from scratch — your domain is very different from natural images (e.g., spectrograms, medical scans with unusual modalities) and you have a large dataset (100K+ images)
  • Fine-tune all layers — your domain is somewhat different from ImageNet and you have a medium-sized dataset (10K-100K images)
  • Fine-tune last few layers — your domain is similar to natural images and you have a small dataset (1K-10K images); see the sketch after this list
  • Feature extraction only — very small dataset (<1K images); freeze everything, only train the classifier head
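
For the middle options, unfreezing only the last stage is a small extension of the snippet above. The following is a sketch of one reasonable approach; the choice of layer4 as the unfreezing point is illustrative, not a rule:

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights='IMAGENET1K_V1')

# Freeze everything first
for param in model.parameters():
    param.requires_grad = False

# Unfreeze the last residual stage and replace the classifier head
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, 5)  # new head is trainable by default

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")

In practice you would typically also give the unfrozen backbone layers a smaller learning rate than the freshly initialized head, for example via optimizer parameter groups.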

Conclusion & Next Steps

In this part we've covered the full landscape of convolutional neural networks in PyTorch:

  • Why CNNs exist — weight sharing and local connectivity solve the parameter explosion of fully-connected networks on images
  • Convolution mechanicsConv2d parameters, feature maps, receptive fields, and output size calculations
  • PoolingMaxPool2d, AvgPool2d, and AdaptiveAvgPool2d for spatial downsampling
  • RegularizationBatchNorm2d for training stability and Dropout2d for channel-level regularization
  • End-to-end training — data augmentation, training loops, and diagnosing learning curves
  • Feature visualization — using hooks to inspect what each layer has learned
  • Landmark architectures — from LeNet to ResNet, and why skip connections were revolutionary
  • ResNet implementation — building residual blocks and a complete MiniResNet from scratch

Next in the Series

In Part 6: RNNs, LSTMs & Sequences, we'll leave the world of grids and pixels to tackle sequential data — text, time series, and audio. You'll learn how recurrent networks maintain memory across time steps, why LSTMs solve the vanishing gradient problem, and how to build sequence models for real-world tasks.