
PyTorch Deep Learning: Complete Beginner's Guide to Building Neural Networks

January 1, 2026 · Wasil Zafar · 40 min read

Master PyTorch from the ground up—learn tensors, automatic differentiation, neural network construction, training loops, CNNs, RNNs, transfer learning, and production deployment. A comprehensive hands-on guide with executable code examples.

What is PyTorch?

PyTorch is an open-source deep learning framework developed by Facebook's AI Research lab (FAIR). It has rapidly become one of the most popular frameworks in both research and production environments due to its flexibility, ease of use, and Pythonic design.

Unlike static graph frameworks, PyTorch uses dynamic computation graphs (define-by-run), which means the graph is built on-the-fly during forward passes. This makes debugging intuitive and enables complex architectures with varying control flow.
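To make define-by-run concrete, here is a minimal sketch (the tensor and branch logic are illustrative, not from a specific application): ordinary Python control flow decides which operations get recorded, and autograd differentiates whatever actually ran.

import torch

x = torch.randn(4, requires_grad=True)

# The graph is built as this code executes, so a plain Python `if`
# determines which operations get recorded for this particular input
if x.sum() > 0:
    y = (x * 2).sum()
else:
    y = (x ** 2).sum()

y.backward()     # gradients for whichever branch actually ran
print(x.grad)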

Why PyTorch? It combines the flexibility of NumPy with the power of GPU acceleration and automatic differentiation. You write standard Python code, and PyTorch handles gradient computation automatically—making it perfect for rapid prototyping and research.
Key Concepts
  • Tensors: Multi-dimensional arrays (like NumPy) optimized for GPU computation
  • Autograd: Automatic differentiation engine for computing gradients
  • nn.Module: Base class for building neural network layers and models
  • Optimizers: Algorithms (SGD, Adam) that update model parameters
  • DataLoader: Efficient data loading with batching, shuffling, and parallel processing

When to use PyTorch:

  • Building custom neural network architectures with complex control flow
  • Research projects requiring flexibility and rapid experimentation
  • Projects needing GPU acceleration for tensor operations
  • Production deployments with TorchScript for optimized inference
  • Computer vision, NLP, reinforcement learning, or any deep learning task

Installation & Setup

PyTorch installation varies based on your system configuration (CPU vs GPU, CUDA version). Visit pytorch.org for the latest installation commands tailored to your setup.

CPU-Only Installation

# Install PyTorch (CPU version)
pip install torch torchvision torchaudio

# Verify installation
python -c "import torch; print('PyTorch version:', torch.__version__)"

GPU Installation (CUDA-enabled)

# Example: PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify GPU availability
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

Verify Installation

# Import PyTorch and check configuration
import torch
import torchvision
import torchaudio
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Display version and device information
print('PyTorch version:', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('CUDA device count:', torch.cuda.device_count())
if torch.cuda.is_available():
    print('Current CUDA device:', torch.cuda.current_device())
    print('Device name:', torch.cuda.get_device_name(0))
GPU Recommendation: If you have an NVIDIA GPU with CUDA support, installing the GPU version will significantly speed up training. Deep learning benefits massively from parallel GPU computation—training times can be 10-100x faster compared to CPU.
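As a minimal sketch of how device placement typically looks in practice (the model and data here are placeholders), you pick a device once and move both the model and its inputs to it:

import torch
import torch.nn as nn

# Choose GPU if available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)   # move parameters to the device
x = torch.randn(8, 10).to(device)     # move data to the same device

output = model(x)                     # computation runs on that device
print(output.device)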

Tensors: The Foundation

Tensors are the fundamental data structure in PyTorch—multi-dimensional arrays similar to NumPy ndarrays but with GPU acceleration and automatic differentiation support. Everything in PyTorch operates on tensors. Think of a tensor as a generalization: scalars are 0D tensors, vectors are 1D tensors, matrices are 2D tensors, and arrays with more dimensions are higher-order tensors.

Tensor Basics: A tensor is a container for numerical data. The shape describes its dimensions: a tensor of shape [3, 4] has 3 rows and 4 columns. The dtype (data type) specifies what kind of numbers it stores (float32, int64, etc.). The device tells you where it lives (CPU or GPU).

Creating Tensors

PyTorch provides multiple ways to create tensors. Each method has different use cases:

import torch
import numpy as np

# METHOD 1: From a Python list
# torch.tensor() converts Python lists to PyTorch tensors
# Parameters:
#   - data: the actual values (list or nested list)
#   - dtype: data type (torch.int32, torch.float32, torch.float64, etc.)
a = torch.tensor([1, 2, 3], dtype=torch.int32)
print('Tensor a:', a)
# Output: tensor([1, 2, 3], dtype=torch.int32)

# METHOD 2: Create with specific shape and random values
# torch.randn() fills tensor with random values from normal distribution (mean=0, std=1)
# Parameters:
#   - *size: dimensions of the tensor (variadic - can be any number of dimensions)
b = torch.randn(2, 3)  # Creates a 2x3 tensor (2 rows, 3 columns)
print('Tensor b shape:', b.shape)  # torch.Size([2, 3])
print('Tensor b dtype:', b.dtype)  # torch.float32 (default for randn)
print('Tensor b:\n', b)

# METHOD 3: Create with specific values
# torch.zeros() creates a tensor filled with zeros
# torch.ones() creates a tensor filled with ones
zeros = torch.zeros(2, 3)  # 2x3 tensor of all zeros
ones = torch.ones(2, 3)   # 2x3 tensor of all ones
print('Zeros:\n', zeros)
print('Ones:\n', ones)

NumPy Interoperability

PyTorch and NumPy work seamlessly together. You can convert between them easily, which is useful when working with existing NumPy code or libraries.

import torch
import numpy as np

# Convert NumPy array to PyTorch tensor
# torch.from_numpy() is efficient—it shares memory, not copying data
np_array = np.array([[10, 20], [30, 40]], dtype=np.float32)
tensor_from_numpy = torch.from_numpy(np_array)
print('From NumPy:', tensor_from_numpy)
# Output: tensor([[10., 20.], [30., 40.]])

# Convert PyTorch tensor back to NumPy array
# WARNING: This shares memory with the original tensor!
# If you modify one, the other changes too
back_to_numpy = tensor_from_numpy.numpy()
print('Back to NumPy:', back_to_numpy)
print('Type:', type(back_to_numpy))  # <class 'numpy.ndarray'>

# Demonstrate shared memory (they point to same data)
tensor_from_numpy[0, 0] = 99  # Modify tensor
print('Modified NumPy array:', back_to_numpy)  # [[99., 20.], [30., 40.]] - changed!

# If you need a separate copy (not sharing memory), use .clone()
independent_copy = torch.from_numpy(np_array).clone()
independent_copy[0, 0] = 999  # Modifying the copy doesn't affect the original
print('Original unchanged by the copy:', np_array[0, 0])  # Still 99 (only the earlier shared write changed it)

Common Tensor Creation Functions

PyTorch provides helper functions for quickly creating tensors with specific properties. These are essential for initialization and testing:

import torch

# torch.zeros(shape) - all elements are 0
# Use for: initializing bias terms, creating masks
zeros = torch.zeros(3, 4)  # 3x4 matrix of zeros
print('Zeros shape:', zeros.shape)  # torch.Size([3, 4])
print('Zeros:\n', zeros)

# torch.ones(shape) - all elements are 1
# Use for: creating ones masks, initialization
ones = torch.ones(2, 3)
print('Ones:\n', ones)

# torch.empty(shape) - uninitialized values (faster but garbage values)
# Use for: performance-critical code when values will be filled immediately
# WARNING: values are undefined—whatever was in memory!
empty = torch.empty(2, 2)
print('Empty (uninitialized, unpredictable values):\n', empty)

# torch.arange(start, end, step) - like NumPy's range
# Creates tensor with evenly spaced values [start, end)
# Note: end is EXCLUSIVE (not included)
range_tensor = torch.arange(0, 10, 2)  # [0, 2, 4, 6, 8]
print('Range (0 to 10, step 2):', range_tensor)

# torch.linspace(start, end, steps) - evenly spaced values BOTH endpoints included
# Creates exactly 'steps' number of values from start to end
linspace = torch.linspace(0, 1, 5)  # 5 values from 0 to 1
print('Linspace (5 values from 0 to 1):', linspace)  # [0.0, 0.25, 0.5, 0.75, 1.0]

# torch.eye(n) - identity matrix (diagonal ones, rest zeros)
# Use for: initializing attention weights, creating masks
identity = torch.eye(3)  # 3x3 identity matrix
print('Identity (3x3):\n', identity)

# torch.rand(shape) - random values from uniform distribution [0, 1)
# Use for: random initialization, Monte Carlo sampling
rand_uniform = torch.rand(2, 3)  # Random values in [0, 1)
print('Random uniform:\n', rand_uniform)

# torch.randn(shape) - random values from standard normal (Gaussian)
# Use for: neural network weight initialization, simulations
# Distribution: mean=0, std=1 (bell curve)
rand_normal = torch.randn(2, 3)
print('Random normal (mean=0, std=1):\n', rand_normal)

# torch.randint(low, high, shape) - random integers in [low, high)
# Parameters:
#   - low: minimum value (inclusive)
#   - high: maximum value (exclusive)
#   - shape: tuple of output shape
rand_int = torch.randint(0, 10, (3, 3))  # Random integers between 0 and 9
print('Random integers (0-9):\n', rand_int)

Tensor Attributes & Properties

Every tensor has several attributes that describe its properties. Understanding these is crucial for debugging shape mismatches and device placement errors:

import torch

# Create sample tensor to inspect
tensor = torch.randn(2, 3, 4)  # 3D tensor with dimensions 2, 3, 4

# tensor.shape - dimensions of the tensor as a torch.Size object
# torch.Size is like a tuple of integers
print('Shape:', tensor.shape)  # torch.Size([2, 3, 4])
print('Shape as list:', list(tensor.shape))  # [2, 3, 4]

# tensor.size() - same as .shape (alternative method)
print('Size:', tensor.size())  # torch.Size([2, 3, 4])

# tensor.ndim - number of dimensions (rank)
# 1D tensor: ndim=1 (vector)
# 2D tensor: ndim=2 (matrix)
# 3D tensor: ndim=3 (cube)
print('Number of dimensions:', tensor.ndim)  # 3

# tensor.numel() - total number of elements (product of all dimensions)
# Useful for calculating memory usage: memory = numel() * bytes_per_element
total_elements = tensor.numel()  # 2 * 3 * 4 = 24
print('Total elements:', total_elements)  # 24

# tensor.dtype - data type of elements
# Common types: torch.float32 (4 bytes), torch.float64 (8 bytes),
#              torch.int32 (4 bytes), torch.int64 (8 bytes)
print('Data type:', tensor.dtype)  # torch.float32

# tensor.device - where the tensor lives (CPU or GPU)
# device='cpu': tensor is in system RAM
# device='cuda:0': tensor is on first GPU's VRAM
print('Device:', tensor.device)  # cpu or cuda:0

# tensor.requires_grad - whether gradients will be tracked for this tensor
# True: gradients computed during backprop (for parameters)
# False: no gradients (for inputs or fixed tensors)
print('Requires grad:', tensor.requires_grad)  # False by default

# tensor.is_contiguous() - whether elements are laid out consecutively in memory
# Contiguous tensors are faster for operations
# Non-contiguous after transpose, slicing, etc.
print('Is contiguous:', tensor.is_contiguous())  # True for new tensors
Tensor Concepts
  • Shape vs Size: Both give dimensions, but .shape returns torch.Size (like tuple), .size() returns torch.Size too—they're equivalent
  • dtype matters: float32 vs float64 affects memory and precision. Use float32 for speed, float64 for high precision
  • Device placement: Model and data MUST be on same device. Mismatch causes "expected CPU tensor" errors
  • requires_grad: Set True for learnable parameters, False for inputs. Saves memory and computation
  • Memory sharing: from_numpy() shares memory; .clone() makes independent copy


Tensor Operations & Manipulations

PyTorch provides a rich set of operations for manipulating tensors—from basic arithmetic to advanced reshaping and slicing. Most operations have both functional (torch.add()) and method (tensor.add()) forms.
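As a quick illustration of those two equivalent call styles (a minimal sketch with arbitrary values):

import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([10.0, 20.0, 30.0])

# Functional form, method form, and operator form give the same result
print(torch.add(a, b))   # tensor([11., 22., 33.])
print(a.add(b))          # tensor([11., 22., 33.])
print(a + b)             # tensor([11., 22., 33.])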

Reshaping & Views

import torch

# Create 1D tensor with 12 elements
x = torch.arange(12)  # [0, 1, 2, ..., 11]
print('Original:', x)

# Reshape to 2D (3 rows, 4 columns)
# view() returns a new tensor sharing the same data (no copy)
x_2d = x.view(3, 4)
print('Reshaped to 3x4:\n', x_2d)
# tensor([[ 0,  1,  2,  3],
#         [ 4,  5,  6,  7],
#         [ 8,  9, 10, 11]])

# Transpose: swap dimensions
# t() works only for 2D tensors
x_transposed = x_2d.t()
print('Transposed (4x3):\n', x_transposed)

# Automatic dimension inference with -1
# PyTorch calculates the missing dimension
auto_shape = x.view(4, -1)  # -1 means "figure this out" (becomes 3)
print('Auto-shaped (4, 3):', auto_shape.shape)  # torch.Size([4, 3])

# WARNING: view() requires contiguous memory
# If tensor is not contiguous, use reshape() instead
x_permuted = x_2d.permute(1, 0)  # Swap dimensions (now non-contiguous)
try:
    x_view = x_permuted.view(-1)  # This will fail!
except RuntimeError as e:
    print('Error:', str(e)[:50])  # "view size is not compatible..."

# reshape() handles non-contiguous tensors (may copy data)
x_reshaped = x_permuted.reshape(-1)
print('Reshaped from non-contiguous:', x_reshaped.shape)

Visualizing Tensor Reshaping & Indexing

import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Create sample data
x = torch.arange(12).reshape(3, 4)  # 3x4 matrix

# Visualize the original 3x4 tensor
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Original 3x4 matrix with values
ax = axes[0, 0]
im = ax.imshow(x.numpy(), cmap='viridis', aspect='auto')
ax.set_title('Original Tensor (3×4)', fontsize=12, fontweight='bold')
ax.set_xlabel('Columns')
ax.set_ylabel('Rows')

# Add text annotations showing values
for i in range(3):
    for j in range(4):
        text = ax.text(j, i, int(x[i, j]), ha="center", va="center", color="white", fontsize=12, fontweight='bold')

plt.colorbar(im, ax=ax)

# Plot 2: Transposed 4x3 matrix (same data, different shape)
x_t = x.t()
ax = axes[0, 1]
im = ax.imshow(x_t.numpy(), cmap='plasma', aspect='auto')
ax.set_title('Transposed Tensor (4×3)', fontsize=12, fontweight='bold')
ax.set_xlabel('Columns')
ax.set_ylabel('Rows')

for i in range(4):
    for j in range(3):
        text = ax.text(j, i, int(x_t[i, j]), ha="center", va="center", color="white", fontsize=12, fontweight='bold')

plt.colorbar(im, ax=ax)

# Plot 3: Flattened to 1D (12 elements)
ax = axes[1, 0]
x_flat = x.reshape(-1).numpy()
ax.bar(range(len(x_flat)), x_flat, color='steelblue', alpha=0.7)
ax.set_title('Flattened to 1D (12 elements)', fontsize=12, fontweight='bold')
ax.set_xlabel('Index')
ax.set_ylabel('Value')
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, v in enumerate(x_flat):
    ax.text(i, v + 0.3, str(int(v)), ha='center', fontsize=9, fontweight='bold')

# Plot 4: Slicing visualization
ax = axes[1, 1]
im = ax.imshow(x.numpy(), cmap='cool', aspect='auto')

# Highlight a slice [0:2, 1:3] with a rectangle
rect = patches.Rectangle((0.5, -0.5), 2, 2, linewidth=3, edgecolor='red', facecolor='none', linestyle='--')
ax.add_patch(rect)
ax.set_title('Slicing Example: x[0:2, 1:3]', fontsize=12, fontweight='bold')
ax.set_xlabel('Columns')
ax.set_ylabel('Rows')

# Add text annotations
for i in range(3):
    for j in range(4):
        # Highlight the sliced cells x[0:2, 1:3] in a contrasting color
        color = 'red' if (i < 2 and 1 <= j < 3) else 'white'
        text = ax.text(j, i, int(x[i, j]), ha="center", va="center", color=color, fontsize=12, fontweight='bold')

plt.colorbar(im, ax=ax)

plt.tight_layout()
plt.show()

# Demonstrate indexing operations
print('\nTensor Indexing Summary:')
print(f'Original shape: {x.shape}')
print(f'Transposed shape: {x.t().shape}')
print(f'Flattened shape: {x.reshape(-1).shape}')
print(f'Reshaped to 2x6: {x.reshape(2, 6).shape}')
print(f'Slice [0:2, 1:3] shape: {x[0:2, 1:3].shape}')
print(f'Slice values:\n{x[0:2, 1:3]}')

Slicing & Indexing

import torch

# Create 2D tensor
matrix = torch.tensor([[1, 2, 3],
                       [4, 5, 6],
                       [7, 8, 9]])

# Access single element (returns 0D tensor)
element = matrix[1, 2]  # Row 1, Column 2 (value: 6)
print('Element [1, 2]:', element)  # tensor(6)
print('As Python scalar:', element.item())  # 6

# Access entire row
row = matrix[0, :]  # Row 0, all columns
print('First row:', row)  # tensor([1, 2, 3])

# Access entire column
column = matrix[:, 1]  # All rows, column 1
print('Second column:', column)  # tensor([2, 5, 8])

# Slice submatrix
submatrix = matrix[0:2, 1:3]  # Rows 0-1, columns 1-2
print('Submatrix:\n', submatrix)
# tensor([[2, 3],
#         [5, 6]])

# Advanced indexing with lists
rows = matrix[[0, 2], :]  # Select rows 0 and 2
print('Rows 0 and 2:\n', rows)

# Boolean masking
mask = matrix > 5  # Create boolean mask
print('Mask (elements > 5):\n', mask)
filtered = matrix[mask]  # Extract elements where mask is True
print('Filtered values:', filtered)  # tensor([6, 7, 8, 9])

Arithmetic Operations (Element-wise)

import torch

# Create sample tensors
a = torch.tensor([1, 2, 3, 4])
b = torch.tensor([10, 20, 30, 40])

# Element-wise addition
print('Addition:', a + b)  # tensor([11, 22, 33, 44])

# Element-wise multiplication
print('Multiplication:', a * b)  # tensor([10, 40, 90, 160])

# Element-wise division
print('Division:', b / a)  # tensor([10., 10., 10., 10.])

# Power
print('a squared:', a ** 2)  # tensor([1, 4, 9, 16])

# Universal functions (ufuncs)
print('Square root:', torch.sqrt(a.float()))  # tensor([1.0, 1.414, 1.732, 2.0])
print('Exponential:', torch.exp(a.float()))  # tensor([2.718, 7.389, 20.086, 54.598])
print('Sine:', torch.sin(a.float()))  # tensor([0.841, 0.909, 0.141, -0.757])
print('Natural log:', torch.log(a.float()))  # tensor([0.0, 0.693, 1.099, 1.386])

# In-place operations (modify tensor directly, denoted by underscore suffix)
a.add_(10)  # Add 10 to all elements in-place
print('After in-place add:', a)  # tensor([11, 12, 13, 14])

Matrix Operations

import torch

# Matrix multiplication (dot product)
A = torch.tensor([[1, 2], [3, 4]])  # 2x2 matrix
B = torch.tensor([[5, 6], [7, 8]])  # 2x2 matrix

# @ operator for matrix multiplication (recommended)
C = A @ B
print('Matrix multiplication (A @ B):\n', C)
# tensor([[19, 22],
#         [43, 50]])

# Alternative: torch.matmul()
C_alt = torch.matmul(A, B)
print('Same result:', torch.equal(C, C_alt))  # True

# Batch matrix multiplication (3D tensors)
batch_a = torch.randn(10, 3, 4)  # 10 matrices of size 3x4
batch_b = torch.randn(10, 4, 5)  # 10 matrices of size 4x5
batch_c = batch_a @ batch_b  # Result: 10 matrices of size 3x5
print('Batch matmul shape:', batch_c.shape)  # torch.Size([10, 3, 5])

# Element-wise matrix multiplication (Hadamard product)
hadamard = A * B  # NOT matrix multiplication!
print('Element-wise multiplication:\n', hadamard)
# tensor([[ 5, 12],
#         [21, 32]])

Concatenation & Stacking

import torch

# Create sample tensors
x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6], [7, 8]])

# Concatenate along rows (dimension 0)
# Stacks vertically - increases number of rows
concat_rows = torch.cat([x, y], dim=0)
print('Concatenate rows (dim=0):\n', concat_rows)
# tensor([[1, 2],
#         [3, 4],
#         [5, 6],
#         [7, 8]])
print('Shape:', concat_rows.shape)  # torch.Size([4, 2])

# Concatenate along columns (dimension 1)
# Stacks horizontally - increases number of columns
concat_cols = torch.cat([x, y], dim=1)
print('Concatenate columns (dim=1):\n', concat_cols)
# tensor([[1, 2, 5, 6],
#         [3, 4, 7, 8]])
print('Shape:', concat_cols.shape)  # torch.Size([2, 4])

# Stack: adds new dimension
# Creates 3D tensor by stacking 2D tensors
stacked = torch.stack([x, y], dim=0)
print('Stacked (dim=0) shape:', stacked.shape)  # torch.Size([2, 2, 2])
print('Stacked:\n', stacked)
# tensor([[[1, 2],
#          [3, 4]],
#         [[5, 6],
#          [7, 8]]])
Performance Tip: In-place operations (suffix _) modify tensors directly without allocating new memory. Use them carefully—they save memory but can interfere with autograd if the tensor requires gradients. Avoid in-place ops on tensors involved in gradient computation.
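A minimal sketch of the failure mode that tip warns about (the exact error wording may vary between PyTorch versions):

import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

try:
    w.add_(1.0)  # in-place update of a leaf tensor that requires grad
except RuntimeError as e:
    print('In-place op rejected:', str(e)[:60])

# Safe alternatives: an out-of-place op, or updating under no_grad()
w2 = w + 1.0
with torch.no_grad():
    w.add_(1.0)
print(w, w2)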
Practice Exercises

Tensor Operations Exercises

Exercise 1 (Beginner): Create a tensor [1,2,3,4,5,6]. Reshape to 2x3 and 3x2. Compare shapes. Try reshaping to incompatible shape (e.g., 2x4) and observe the error.

Exercise 2 (Beginner): Create a 3x4 matrix. Slice [0:2, 1:3]. Transpose it. Flatten it. Verify the shape after each operation.

Exercise 3 (Intermediate): Create two 2x3 tensors. Concatenate along axis=0 and axis=1. Stack them to create a 3D tensor. Compare all three operations.

Exercise 4 (Intermediate): Create a matrix and use boolean masking to extract elements > 5. Count how many elements satisfy the condition. Use .sum() on the mask.

Challenge (Advanced): Create a 4x4 tensor. Perform multiple operations (reshape, transpose, slice). Use both view() and reshape() and explain the difference in memory sharing.

Autograd: Automatic Differentiation

Autograd is PyTorch's automatic differentiation engine—the magic behind neural network training. It automatically computes gradients (derivatives) of tensor operations, eliminating the need for manual backpropagation calculations. Without Autograd, you'd have to manually compute how changes in weights affect the loss function—a tedious and error-prone task.

How Autograd Works: When you set requires_grad=True on a tensor, PyTorch records every operation. This creates a "computational graph"—a record of how the final output depends on the inputs. Calling .backward() traces backwards through this graph to compute how much each parameter should change to reduce the loss.

Basic Gradient Computation

Gradients measure how a function changes with respect to its inputs. In neural networks, we use gradients to update weights in the direction that reduces loss:

import torch

# Step 1: Create tensor with gradient tracking enabled
# requires_grad=True tells PyTorch: "I want to compute gradients for this"
# Without this flag, no gradients are tracked (saves memory)
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
print('x:', x)
print('x.requires_grad:', x.requires_grad)  # True

# Step 2: Perform operations to build computation graph
# PyTorch records each operation in a graph structure
y = x ** 2  # y = x²  (element-wise squaring)
print('y:', y)  # tensor([1., 4., 9.], grad_fn=<PowBackward0>)
# Note: grad_fn shows HOW this tensor was created

z = y.sum()  # z = sum(y) = x₁² + x₂² + x₃² = 1 + 4 + 9 = 14
print('z:', z)  # tensor(14., grad_fn=<SumBackward0>)

# Step 3: Compute gradients via BACKPROPAGATION
# .backward() starts at z and traces backwards to x
# It computes dz/dx (how z changes with respect to x)
z.backward()

# Step 4: Access computed gradients
# x.grad contains dz/dx for each element of x
print('x.grad:', x.grad)  # tensor([2., 4., 6.])

# Explanation of results:
# If x = [1, 2, 3]:
# z = x₁² + x₂² + x₃² = 1 + 4 + 9 = 14
# dz/dx₁ = 2*x₁ = 2*1 = 2 ✓
# dz/dx₂ = 2*x₂ = 2*2 = 4 ✓
# dz/dx₃ = 2*x₃ = 2*3 = 6 ✓
print('Gradients match mathematical derivative: dz/dx = 2x')

Multiple Backward Passes (Gradient Accumulation)

By default, gradients accumulate (add up) when you call .backward() multiple times. This is useful for training with gradient accumulation but requires careful management:

import torch

# Create a tensor with gradient tracking
# A single call to .backward() computes gradients
# But subsequent .backward() calls ADD to existing gradients
x = torch.tensor([1.0, 2.0], requires_grad=True)

# First computation and backward pass
y1 = (x ** 2).sum()  # y1 = 1² + 2² = 5
y1.backward()  # Compute dy1/dx
print('After first backward:', x.grad)  # tensor([2., 4.])
# Gradients: [2*1, 2*2] = [2, 4]

# Second computation WITHOUT zeroing gradients
# IMPORTANT: Gradients ACCUMULATE (add to existing values)
y2 = (x ** 3).sum()  # y2 = 1³ + 2³ = 9
y2.backward()  # Compute dy2/dx AND ADD to existing gradients
print('After second backward (accumulated):', x.grad)  # tensor([5., 16.])
# Why? [2 + 3, 4 + 12] = [2+3*1², 4+3*2²] = [5, 16]
# This is because gradients ADD up: d(y1+y2)/dx = dy1/dx + dy2/dx

# CRITICAL LESSON: In training loops, you MUST ZERO gradients
# before each new loss computation to prevent accumulation!
x.grad.zero_()  # Reset all gradients to zero
print('After zeroing:', x.grad)  # tensor([0., 0.])

# Now compute fresh gradients for a new batch
y3 = (x ** 2).sum()
y3.backward()
print('Fresh gradients (no accumulation):', x.grad)  # tensor([2., 4.]) - clean slate
Autograd Concepts
  • requires_grad=True: Enable gradient tracking for a tensor. Only needed for parameters, not inputs
  • Computational Graph: Internal record of operations. Enables .backward() to trace back and compute derivatives
  • .backward(): Computes gradients by traversing the graph. Must be called on a scalar (single number), unless you pass an explicit gradient argument (see the sketch after this list)
  • Gradient Accumulation: .grad += new_gradients (adds up). Always call .zero_grad() before new forward pass
  • .detach(): Remove tensor from graph; use when you want values without gradient computation
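A minimal sketch of the scalar rule from the .backward() bullet above (illustrative values):

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                  # y is a vector, not a scalar

try:
    y.backward()            # fails: backward() needs a scalar output
except RuntimeError as e:
    print('Error:', str(e)[:45])

# Either reduce to a scalar first...
y.sum().backward()
print(x.grad)               # tensor([2., 4., 6.])

# ...or pass an explicit gradient (vector-Jacobian product)
x.grad.zero_()
(x ** 2).backward(gradient=torch.ones(3))
print(x.grad)               # tensor([2., 4., 6.])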

Detaching from Computation Graph

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Detach: create a tensor without gradient tracking
# Useful when you want to use values without affecting gradients
y_detached = y.detach()
print('y_detached.requires_grad:', y_detached.requires_grad)  # False

# Operations on detached tensor won't be tracked
z = y_detached * 2
# z.backward() would fail because z has no grad_fn

# Alternative: torch.no_grad() context manager
# Temporarily disables gradient tracking for all operations
with torch.no_grad():
    y_no_grad = x ** 3
    z_no_grad = y_no_grad.sum()
    print('z_no_grad.requires_grad:', z_no_grad.requires_grad)  # False

# Use no_grad during inference to save memory and speed up computation

Gradient Flow Example

import torch

# Simulate a simple neural network computation
# Input → Weight multiplication → Activation → Loss

# Create learnable weights (parameters)
w = torch.randn(3, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Input data (no gradient needed for inputs)
x = torch.randn(5, 3)  # 5 samples, 3 features

# Forward pass: compute predictions
# Linear transformation: y = xW + b
out = x @ w + b  # @ is matrix multiplication
print('Output shape:', out.shape)  # torch.Size([5, 3])

# Compute loss (mean squared error)
# In real training, you'd compare with actual labels
loss = out.pow(2).mean()  # L = mean((out)²)
print('Loss:', loss.item())  # Scalar value

# Backward pass: compute gradients
loss.backward()

# Check gradients
print('w.grad shape:', w.grad.shape)  # torch.Size([3, 3])
print('b.grad shape:', b.grad.shape)  # torch.Size([3])
print('Weight gradient (first 3 values):', w.grad.flatten()[:3])

# These gradients tell us how to adjust w and b to reduce loss
# Optimizers use these gradients to update parameters

Visualizing Gradient Descent and Parameter Updates

import torch
import torch.optim as optim
import matplotlib.pyplot as plt
import numpy as np

# Create a simple 2D loss landscape to visualize gradient descent
# Loss function: L(w1, w2) = (w1-2)² + (w2+1)² (quadratic bowl)
def loss_function(w1, w2):
    return (w1 - 2) ** 2 + (w2 + 1) ** 2

# Create grid for contour plot
w1_range = np.linspace(-2, 5, 100)
w2_range = np.linspace(-4, 2, 100)
W1, W2 = np.meshgrid(w1_range, w2_range)
Z = loss_function(W1, W2)

# Initialize weights at poor location
weights = torch.tensor([-1.5, 1.5], requires_grad=True)
optimizer = optim.SGD([weights], lr=0.1)

# Run gradient descent
iterations = 50
weight_history = [weights.data.clone().numpy()]
loss_history = []

for i in range(iterations):
    # Compute loss
    loss = loss_function(weights[0], weights[1])
    loss_history.append(loss.item())
    
    # Backward pass
    if weights.grad is not None:
        weights.grad.zero_()
    loss.backward()
    
    # Update weights
    optimizer.step()
    
    # Record history
    weight_history.append(weights.data.clone().numpy())

weight_history = np.array(weight_history)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Gradient descent on loss landscape
ax = axes[0]
contour = ax.contour(W1, W2, Z, levels=20, cmap='viridis', alpha=0.6)
ax.clabel(contour, inline=True, fontsize=8)
im = ax.contourf(W1, W2, Z, levels=20, cmap='viridis', alpha=0.3)

# Plot the gradient descent path
ax.plot(weight_history[:, 0], weight_history[:, 1], 'ro-', linewidth=2, markersize=4, label='Optimization path', alpha=0.8)

# Mark start and end
ax.plot(weight_history[0, 0], weight_history[0, 1], 'g*', markersize=20, label='Start', zorder=5)
ax.plot(weight_history[-1, 0], weight_history[-1, 1], 'r*', markersize=20, label='End (minimum)', zorder=5)

# Mark true minimum
ax.plot(2, -1, 'b+', markersize=15, markeredgewidth=3, label='True minimum (2, -1)', zorder=5)

ax.set_xlabel('Weight w₁', fontsize=12, fontweight='bold')
ax.set_ylabel('Weight w₂', fontsize=12, fontweight='bold')
ax.set_title('Gradient Descent on Loss Landscape', fontsize=13, fontweight='bold')
ax.legend(fontsize=10, loc='upper right')
ax.grid(True, alpha=0.3)
plt.colorbar(im, ax=ax, label='Loss Value')

# Plot 2: Loss over iterations
ax = axes[1]
iterations_range = np.arange(len(loss_history))
ax.plot(iterations_range, loss_history, 'b-', linewidth=2.5, label='Training loss')
ax.fill_between(iterations_range, loss_history, alpha=0.3)

ax.set_xlabel('Iteration', fontsize=12, fontweight='bold')
ax.set_ylabel('Loss Value', fontsize=12, fontweight='bold')
ax.set_title('Loss Convergence During Training', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.legend(fontsize=11)

# Add annotations
final_loss = loss_history[-1]
ax.annotate(f'Final loss: {final_loss:.4f}', 
            xy=(len(loss_history)-1, final_loss), 
            xytext=(len(loss_history)-15, final_loss+2),
            fontsize=10, fontweight='bold',
            arrowprops=dict(arrowstyle='->', color='red', lw=2))

plt.tight_layout()
plt.show()

# Print optimization details
print('Gradient Descent Optimization Summary:')
print('='*50)
print(f'Initial weights: w1={weight_history[0, 0]:.3f}, w2={weight_history[0, 1]:.3f}')
print(f'Initial loss: {loss_history[0]:.4f}')
print(f'\nFinal weights: w1={weight_history[-1, 0]:.3f}, w2={weight_history[-1, 1]:.3f}')
print(f'Final loss: {loss_history[-1]:.4f}')
print(f'True minimum: w1=2.0, w2=-1.0 (loss=0.0)')
print(f'\nImprovement: {loss_history[0]:.4f} → {loss_history[-1]:.4f}')
print(f'Percent reduction: {(1 - loss_history[-1]/loss_history[0])*100:.1f}%')
print(f'Iterations: {len(loss_history)}')

# Show weight trajectory
print('\nWeight trajectory (every 5 iterations):')
print('Iter │   w₁    │   w₂    │  Loss   │ Distance to min')
print('─'*50)
for i in range(0, len(weight_history), max(1, len(weight_history)//11)):
    dist = np.sqrt((weight_history[i, 0]-2)**2 + (weight_history[i, 1]+1)**2)
    loss_val = loss_function(weight_history[i, 0], weight_history[i, 1])
    print(f'{i:4d} │ {weight_history[i, 0]:7.3f} │ {weight_history[i, 1]:7.3f} │ {loss_val:7.4f} │ {dist:7.4f}')
How Autograd Works: PyTorch builds a Dynamic Computation Graph during the forward pass. Each operation creates a node storing the operation and its inputs. When you call .backward(), PyTorch traverses this graph in reverse (backpropagation), applying the chain rule to compute gradients for all tensors with requires_grad=True.
Common Autograd Patterns
  • Training: Enable gradients with requires_grad=True for parameters
  • Inference: Disable gradients with torch.no_grad() to save memory
  • Zero Gradients: Call optimizer.zero_grad() before each backward pass
  • Gradient Clipping: Prevent exploding gradients with torch.nn.utils.clip_grad_norm_() (see the sketch after this list)
  • Detach: Use .detach() to break gradient flow when needed
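As a minimal sketch of where gradient clipping fits into a training step (the model, data, and max_norm below are illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()                 # 1. clear old gradients
loss = loss_fn(model(x), y)           # 2. forward pass
loss.backward()                       # 3. compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 4. clip before the update
optimizer.step()                      # 5. update parameters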

Visualizing Gradient Flow Through Layers (Backpropagation)

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Create a 3-layer neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 8)
        self.fc2 = nn.Linear(8, 4)
        self.fc3 = nn.Linear(4, 1)
    
    def forward(self, x):
        self.h1 = self.fc1(x)
        self.a1 = torch.relu(self.h1)
        self.h2 = self.fc2(self.a1)
        self.a2 = torch.relu(self.h2)
        self.out = self.fc3(self.a2)
        return self.out

# Initialize network
net = SimpleNet()

# Create synthetic data
x = torch.randn(32, 10)  # Batch of 32 samples, 10 features
y = torch.randn(32, 1)   # Target values

# Forward pass
output = net(x)
loss = ((output - y) ** 2).mean()

# Backward pass to compute gradients
loss.backward()

# Collect gradient information
gradient_data = {
    'Layer': [],
    'Parameter': [],
    'Mean Gradient': [],
    'Max Gradient': [],
    'Gradient Norm': []
}

for name, param in net.named_parameters():
    if param.grad is not None:
        grad_mean = param.grad.mean().item()
        grad_max = param.grad.max().item()
        grad_norm = param.grad.norm().item()
        
        layer_name = name.split('.')[0]  # Extract layer name (fc1, fc2, fc3)
        param_type = name.split('.')[-1]  # Extract weight or bias
        
        gradient_data['Layer'].append(layer_name)
        gradient_data['Parameter'].append(param_type)
        gradient_data['Mean Gradient'].append(abs(grad_mean))
        gradient_data['Max Gradient'].append(grad_max)
        gradient_data['Gradient Norm'].append(grad_norm)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot 1: Gradient magnitude by layer
ax = axes[0, 0]
layers = gradient_data['Layer']
mean_grads = gradient_data['Mean Gradient']
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
bars = ax.bar(range(len(layers)), mean_grads, color=colors[:len(layers)])

# Add value labels on bars
for i, (bar, val) in enumerate(zip(bars, mean_grads)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
            f'{val:.4f}', ha='center', va='bottom', fontweight='bold', fontsize=10)

ax.set_xticks(range(len(layers)))
ax.set_xticklabels(layers)
ax.set_ylabel('Mean |Gradient|', fontsize=11, fontweight='bold')
ax.set_title('Gradient Magnitude by Layer', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Plot 2: Gradient norm comparison
ax = axes[0, 1]
param_labels = [f"{l.replace('fc', 'Layer ')}\n{p}" for l, p in zip(layers, gradient_data['Parameter'])]
norms = gradient_data['Gradient Norm']
bars = ax.barh(range(len(param_labels)), norms, color=colors[:len(param_labels)])

# Add value labels
for i, (bar, val) in enumerate(zip(bars, norms)):
    ax.text(bar.get_width(), bar.get_y() + bar.get_height()/2,
            f' {val:.3f}', ha='left', va='center', fontweight='bold', fontsize=9)

ax.set_yticks(range(len(param_labels)))
ax.set_yticklabels(param_labels)
ax.set_xlabel('Gradient Norm (L2)', fontsize=11, fontweight='bold')
ax.set_title('Gradient Norms Across Layers', fontsize=12, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Plot 3: Network architecture with gradient flow
ax = axes[1, 0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

# Draw network layers
layer_info = [
    {'name': 'Input\n(10)', 'x': 1, 'grad': None, 'color': '#95E1D3'},
    # Indices 0, 2, 4 pick the weight (not bias) gradients of fc1, fc2, fc3
    {'name': 'Hidden 1\n(8)', 'x': 3.5, 'grad': mean_grads[0] if len(mean_grads) > 0 else 0, 'color': '#F38181'},
    {'name': 'Hidden 2\n(4)', 'x': 6, 'grad': mean_grads[2] if len(mean_grads) > 2 else 0, 'color': '#AA96DA'},
    {'name': 'Output\n(1)', 'x': 8.5, 'grad': mean_grads[4] if len(mean_grads) > 4 else 0, 'color': '#FCBAD3'},
]

for layer in layer_info:
    # Draw layer box
    color = layer['color']
    ax.add_patch(plt.Rectangle((layer['x']-0.6, 3.5), 1.2, 3, 
                                fill=True, facecolor=color, edgecolor='black', linewidth=2))
    ax.text(layer['x'], 5.2, layer['name'], ha='center', va='center', 
            fontweight='bold', fontsize=10)
    
    # Add gradient magnitude if exists
    if layer['grad'] is not None:
        grad_size = max(0.1, min(1, layer['grad'] * 10))  # Scale for visibility
        ax.text(layer['x'], 2.5, f'∇={layer["grad"]:.3f}', ha='center', va='top',
                fontsize=8, style='italic', color='darkred', fontweight='bold')

# Draw forward pass arrows
ax.annotate('', xy=(2.9, 5), xytext=(1.6, 5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='blue', alpha=0.7))
ax.annotate('', xy=(5.4, 5), xytext=(4.1, 5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='blue', alpha=0.7))
ax.annotate('', xy=(7.9, 5), xytext=(6.6, 5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='blue', alpha=0.7))

# Draw backward pass arrows (dashed, reverse)
ax.annotate('', xy=(4.1, 4.5), xytext=(5.4, 4.5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='red', alpha=0.7, linestyle='dashed'))
ax.annotate('', xy=(2.9, 4.5), xytext=(1.6, 4.5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='red', alpha=0.7, linestyle='dashed'))
ax.annotate('', xy=(6.6, 4.5), xytext=(7.9, 4.5),
            arrowprops=dict(arrowstyle='->', lw=2.5, color='red', alpha=0.7, linestyle='dashed'))

# Add labels
ax.text(0.5, 9.2, 'Forward Pass (Blue →)', fontsize=11, fontweight='bold', color='blue')
ax.text(0.5, 8.8, 'Backward Pass (Red ⟶)', fontsize=11, fontweight='bold', color='red', style='italic')
ax.set_title('Gradient Flow: Forward and Backward Pass', fontsize=12, fontweight='bold')

# Plot 4: Gradient distribution histogram
ax = axes[1, 1]
all_gradients = []

for i, (param_name, param) in enumerate(net.named_parameters()):
    if param.grad is not None:
        all_gradients.extend(param.grad.detach().flatten().numpy())

ax.hist(all_gradients, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
ax.axvline(np.mean(all_gradients), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(all_gradients):.4f}')
ax.axvline(np.median(all_gradients), color='green', linestyle='--', linewidth=2, label=f'Median: {np.median(all_gradients):.4f}')

ax.set_xlabel('Gradient Value', fontsize=11, fontweight='bold')
ax.set_ylabel('Frequency', fontsize=11, fontweight='bold')
ax.set_title('Distribution of All Gradients', fontsize=12, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

# Print backpropagation summary
print('Backpropagation Gradient Summary:')
print('='*60)
print(f'Loss value: {loss.item():.4f}\n')
print('Layer-wise Gradient Statistics:')
print('─'*60)
print(f'{"Layer":<15} {"Parameter":<10} {"Mean |∇|":<15} {"Max |∇|":<15}')
print('─'*60)

for layer, param, mean_grad, max_grad in zip(
    gradient_data['Layer'], 
    gradient_data['Parameter'],
    gradient_data['Mean Gradient'],
    gradient_data['Max Gradient']
):
    print(f'{layer:<15} {param:<10} {mean_grad:>12.4f}   {max_grad:>12.4f}')

print('─'*60)
print(f'\nTotal parameters: {sum(p.numel() for p in net.parameters())}')
print(f'Total parameters with gradients: {sum(p.grad.numel() for p in net.parameters() if p.grad is not None)}')
Key Insights on Gradient Flow: Gradients flow differently through layers—earlier layers typically have smaller gradients (vanishing gradient problem) while later layers have larger ones. Batch normalization, residual connections, and proper activation functions help stabilize gradient flow, enabling training of very deep networks.
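The residual connections mentioned above can be sketched in a few lines (a minimal illustrative block, not a specific published architecture): the input is added back to the block's output, giving gradients a short path to earlier layers.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        out = torch.relu(self.fc1(x))
        out = self.fc2(out)
        return torch.relu(out + x)   # skip connection: add the input back

block = ResidualBlock(16)
print(block(torch.randn(4, 16)).shape)  # torch.Size([4, 16])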
Practice Exercises

Autograd & Backpropagation Exercises

Exercise 1 (Beginner): Create x with requires_grad=True. Compute y = x^2 and z = y.sum(). Call z.backward(). Verify gradients match dz/dx = 2x.

Exercise 2 (Beginner): Perform backward pass twice on same tensor. Observe gradient accumulation. Use zero_grad() to reset. Explain why zeroing gradients is necessary.

Exercise 3 (Intermediate): Create a simple computation graph (3+ operations). Call backward() and inspect .grad for each intermediate tensor. Visualize which tensors have gradients.

Exercise 4 (Intermediate): Implement manual gradient descent: compute loss, backward(), update weights using .data -= lr * .grad, zero_grad(). Train on simple data.

Challenge (Advanced): Implement the full training loop with optimizer. Create a 2-layer network, train on synthetic data, plot loss curve. Compare manual gradients with optimizer.

Building Neural Networks with nn.Module

The nn.Module class is the foundation for building neural networks in PyTorch. All models inherit from this base class, which provides essential functionality for parameter management, device transfer, and training/evaluation modes.
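Those three pieces of functionality look like this in practice (a minimal sketch using a single linear layer as the "model"):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)

# Parameter management: nn.Module tracks every registered parameter
for name, p in model.named_parameters():
    print(name, p.shape)

# Device transfer: .to() moves all parameters at once
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Training / evaluation modes (affects layers like Dropout and BatchNorm)
model.train()   # training behavior
model.eval()    # inference behavior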

Basic Neural Network Structure

import torch
import torch.nn as nn

# Define a simple feedforward neural network
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()  # Initialize parent class
        
        # Define layers as instance attributes
        # PyTorch automatically tracks these as parameters
        self.fc1 = nn.Linear(input_size, hidden_size)  # Input → Hidden
        self.fc2 = nn.Linear(hidden_size, output_size)  # Hidden → Output
        self.relu = nn.ReLU()  # Activation function
    
    def forward(self, x):
        # Define forward pass: how data flows through the network
        x = self.fc1(x)      # Apply first linear transformation
        x = self.relu(x)     # Apply activation
        x = self.fc2(x)      # Apply second linear transformation
        return x

# Create model instance
model = SimpleNet(input_size=10, hidden_size=20, output_size=5)
print(model)
# Output shows model architecture:
# SimpleNet(
#   (fc1): Linear(in_features=10, out_features=20, bias=True)
#   (fc2): Linear(in_features=20, out_features=5, bias=True)
#   (relu): ReLU()
# )

# Count total parameters
total_params = sum(p.numel() for p in model.parameters())
print(f'Total parameters: {total_params}')
# fc1: 10*20 + 20 = 220, fc2: 20*5 + 5 = 105, Total: 325

Forward Pass Example

import torch
import torch.nn as nn

# Create model
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = SimpleNet(input_size=10, hidden_size=20, output_size=5)

# Create batch of input data (32 samples, 10 features each)
batch = torch.randn(32, 10)

# Forward pass: model(batch) calls forward() automatically
output = model(batch)
print('Output shape:', output.shape)  # torch.Size([32, 5])
print('Output (first sample):', output[0])  # 5 logits for this sample

nn.Sequential: Quick Model Building

import torch
import torch.nn as nn

# Sequential allows building models without defining forward()
# Layers execute in order automatically
model = nn.Sequential(
    nn.Linear(10, 20),      # Layer 1
    nn.ReLU(),              # Activation 1
    nn.Linear(20, 20),      # Layer 2
    nn.ReLU(),              # Activation 2
    nn.Linear(20, 5)        # Output layer
)

print(model)

# Forward pass
x = torch.randn(32, 10)
output = model(x)
print('Output shape:', output.shape)  # torch.Size([32, 5])

# Access individual layers
print('First layer:', model[0])  # Linear(in_features=10, out_features=20)
print('First layer weights shape:', model[0].weight.shape)  # torch.Size([20, 10])
When to use Sequential vs Custom nn.Module: Use nn.Sequential for simple feed-forward architectures with linear data flow. Use custom nn.Module classes when you need complex control flow, skip connections (ResNet), multiple inputs/outputs, or custom forward logic.
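For example, a custom forward pass can take multiple inputs and merge them, which nn.Sequential cannot express (a minimal sketch; the layer sizes are arbitrary):

import torch
import torch.nn as nn

class TwoInputNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch_a = nn.Linear(10, 16)
        self.branch_b = nn.Linear(4, 16)
        self.head = nn.Linear(32, 2)

    def forward(self, a, b):
        # Process each input separately, then concatenate and classify
        a = torch.relu(self.branch_a(a))
        b = torch.relu(self.branch_b(b))
        return self.head(torch.cat([a, b], dim=1))

model = TwoInputNet()
out = model(torch.randn(8, 10), torch.randn(8, 4))
print(out.shape)  # torch.Size([8, 2])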
Practice Exercises

Neural Network Architecture Exercises

Exercise 1 (Beginner): Create a simple nn.Sequential model with 3 layers: (10→20)→(20→10)→(10→2). Print model architecture. Count total parameters.

Exercise 2 (Beginner): Create custom nn.Module class. Implement __init__ (define layers) and forward (use layers). Pass random input through network.

Exercise 3 (Intermediate): Build a model with multiple paths (split input, process separately, concatenate). Use custom nn.Module with complex forward logic.

Exercise 4 (Intermediate): Create two identical architectures—one with Sequential, one with custom Module. Compare code readability and flexibility.

Challenge (Advanced): Build a network with skip connections or residual blocks. Implement custom forward with layer fusion or dynamic layer selection.

Activation Functions & Initialization

Activation functions introduce non-linearity that allows neural networks to learn complex, non-linear patterns. Without them, stacking linear layers is just matrix multiplication—no more powerful than a single linear layer! Proper weight initialization ensures gradients flow effectively during backpropagation, making the difference between converging quickly and getting stuck in poor local minima.
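A quick numerical check of that claim (a minimal sketch): two stacked linear layers with no activation in between collapse into a single linear map.

import torch
import torch.nn as nn

fc1 = nn.Linear(4, 8, bias=False)
fc2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)

# Two stacked linear layers...
stacked = fc2(fc1(x))

# ...equal one linear layer whose weight is the product of the two
combined_weight = fc2.weight @ fc1.weight          # shape (3, 4)
single = x @ combined_weight.t()

print(torch.allclose(stacked, single, atol=1e-5))  # True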

Understanding Activation Functions

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# ACTIVATION FUNCTION INTUITION:
# f(x) = max(0, x)  # ReLU
# f(x) = 1/(1 + e^-x)  # Sigmoid
# f(x) = (e^x - e^-x)/(e^x + e^-x)  # Tanh
#
# These non-linear transformations enable networks to fit curved boundaries
# Example: Sigmoid squashes output to (0,1) for probabilities

# Example input: 5 sample values
x = torch.linspace(-3, 3, 100)

# ReLU: max(0, x) - most popular, simple
relu = nn.ReLU()
y_relu = relu(x)

# Why ReLU?
print('ReLU (f(x) = max(0, x)):')
print('  - Simple and fast (just a comparison)')
print('  - No vanishing gradient problem (gradient = 1 for x > 0)')
print('  - Biologically inspired (how neurons work)')
print('  - Problems: "Dying ReLU" (neurons output 0 forever)')

# Leaky ReLU: allows small negative slope (fixes dying ReLU)
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
y_leaky = leaky_relu(x)

print('\nLeaky ReLU (f(x) = x if x>0 else 0.01*x):')
print('  - Fixes dying ReLU: allows small negatives')
print('  - Still fast and simple')
print('  - Better gradient flow than ReLU')
print('  - Default negative_slope: 0.01 (PReLU learns the slope, initialized to 0.25)')

# Sigmoid: squashes to (0,1) - used for binary classification output
sigmoid = nn.Sigmoid()
y_sigmoid = sigmoid(x)

print('\nSigmoid (f(x) = 1/(1 + e^-x)):')
print('  - Output range: (0, 1) - perfect for probabilities')
print('  - Problem: vanishing gradient near -∞ or +∞')
print('  - Slow (exponential computation)')
print('  - Use: Binary classification output layer')

# Tanh: squashes to (-1,1) - zero-centered
tanh = nn.Tanh()
y_tanh = tanh(x)

print('\nTanh (f(x) = (e^x - e^-x)/(e^x + e^-x)):')
print('  - Output range: (-1, 1) - zero-centered')
print('  - Better than sigmoid (gradient stronger at center)')
print('  - Still slower and vanishing gradient issues')
print('  - Use: RNN/LSTM hidden states (tanh is the standard cell activation)')

# Test with sample values
print('\nComparison with x = [-2, -1, 0, 1, 2]:')
test_x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f'ReLU:       {relu(test_x)}')
print(f'Leaky ReLU: {leaky_relu(test_x)}')
print(f'Sigmoid:    {sigmoid(test_x)}')
print(f'Tanh:       {tanh(test_x)}')

Visualizing Activation Functions

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Create input range from -3 to 3
x = torch.linspace(-3, 3, 100)
x_np = x.numpy()

# Apply different activation functions
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
sigmoid = nn.Sigmoid()
tanh = nn.Tanh()

y_relu = relu(x).detach().numpy()
y_leaky = leaky_relu(x).detach().numpy()
y_sigmoid = sigmoid(x).detach().numpy()
y_tanh = tanh(x).detach().numpy()

# Create 2x2 subplot grid showing all activation functions
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# ReLU plot
axes[0, 0].plot(x_np, y_relu, 'b-', linewidth=2, label='ReLU')
axes[0, 0].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='-', alpha=0.3)
axes[0, 0].set_title('ReLU: f(x) = max(0, x)', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Input (x)')
axes[0, 0].set_ylabel('Output')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_ylim([-0.5, 3.5])

# Leaky ReLU plot
axes[0, 1].plot(x_np, y_leaky, 'g-', linewidth=2, label='Leaky ReLU')
axes[0, 1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='-', alpha=0.3)
axes[0, 1].set_title('Leaky ReLU: f(x) = max(0.01x, x)', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Input (x)')
axes[0, 1].set_ylabel('Output')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_ylim([-0.1, 3.5])

# Sigmoid plot
axes[1, 0].plot(x_np, y_sigmoid, 'r-', linewidth=2, label='Sigmoid')
axes[1, 0].axhline(y=0.5, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='-', alpha=0.3)
axes[1, 0].set_title('Sigmoid: f(x) = 1/(1 + e^-x)', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Input (x)')
axes[1, 0].set_ylabel('Output')
axes[1, 0].set_ylim([-0.1, 1.1])
axes[1, 0].grid(True, alpha=0.3)

# Tanh plot
axes[1, 1].plot(x_np, y_tanh, 'm-', linewidth=2, label='Tanh')
axes[1, 1].axhline(y=0, color='k', linestyle='-', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='-', alpha=0.3)
axes[1, 1].set_title('Tanh: f(x) = (e^x - e^-x)/(e^x + e^-x)', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Input (x)')
axes[1, 1].set_ylabel('Output')
axes[1, 1].set_ylim([-1.1, 1.1])
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Create comparison plot showing all on same graph (to see differences)
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(x_np, y_relu, 'b-', linewidth=2.5, label='ReLU', alpha=0.8)
ax.plot(x_np, y_leaky, 'g-', linewidth=2.5, label='Leaky ReLU', alpha=0.8)
ax.plot(x_np, y_sigmoid, 'r-', linewidth=2.5, label='Sigmoid', alpha=0.8)
ax.plot(x_np, y_tanh, 'm-', linewidth=2.5, label='Tanh', alpha=0.8)

ax.axhline(y=0, color='k', linestyle='-', alpha=0.2)
ax.axvline(x=0, color='k', linestyle='-', alpha=0.2)
ax.set_xlabel('Input (x)', fontsize=12)
ax.set_ylabel('Output', fontsize=12)
ax.set_title('Comparison of Activation Functions', fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='upper left')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print key properties
print('Activation Function Properties:')
print('┌─────────────┬──────────────┬─────────────┬──────────────┐')
print('│ Function    │ Output Range │ Vanishing?  │ Best Use     │')
print('├─────────────┼──────────────┼─────────────┼──────────────┤')
print('│ ReLU        │ [0, ∞)       │ No (x>0)    │ Hidden Layer │')
print('│ Leaky ReLU  │ (-∞, ∞)      │ No          │ Hidden Layer │')
print('│ Sigmoid     │ (0, 1)       │ Yes (ends)  │ Binary Out   │')
print('│ Tanh        │ (-1, 1)      │ Yes (ends)  │ RNN/LSTM     │')
print('└─────────────┴──────────────┴─────────────┴──────────────┘')

Softmax: Multi-Class Output Activation

import torch
import torch.nn as nn

# SOFTMAX INTUITION:
# For binary classification: Sigmoid
# For multi-class: Softmax (generalization of sigmoid)
#
# Given raw scores (logits): [2.0, 1.0, 0.5]
# Softmax converts to probabilities that sum to 1.0
#
# Formula: softmax(x_i) = e^(x_i) / sum(e^(x_j))

logits = torch.tensor([2.0, 1.0, 0.5])  # Raw model outputs
softmax = nn.Softmax(dim=0)
probs = softmax(logits)

print('Raw logits:', logits)
print('After Softmax:', probs)
print('Sum of probabilities:', probs.sum())  # Always 1.0

# Example: 3-class classification
print('\nExample: Image classifier predicting (Cat, Dog, Bird)')
logits = torch.tensor([[2.5, 1.0, 0.5],   # Sample 1
                       [0.1, 2.3, 1.2],   # Sample 2
                       [1.1, 1.2, 3.4]])  # Sample 3

probs = nn.Softmax(dim=1)(logits)  # dim=1: normalize across classes for each sample (row)
print('Batch of 3 images:')
print(probs)
print('Each row sums to 1.0')

predicted_classes = torch.argmax(probs, dim=1)
print(f'Predictions: {predicted_classes}')  # [0, 1, 2] = [Cat, Dog, Bird]

# WHY SOFTMAX?
print('\nWhy Softmax?')
print('  - Converts raw scores to probabilities')
print('  - Larger logits have exponentially larger probability')
print('  - Probabilities sum to 1.0 (valid probability distribution)')
print('  - Numerically stable (use LogSoftmax in loss for better stability)')
print('  - Used with CrossEntropyLoss for stable gradient flow')

Weight Initialization: Starting the Training Right

import torch
import torch.nn as nn
import math

# WHY WEIGHT INITIALIZATION MATTERS:
# Bad initialization → gradients vanish/explode → slow training or failure
# Good initialization → gradients flow smoothly → fast convergence

# PROBLEM: With random initialization, signals can explode or vanish
# Example: 100 layers deep
# If each layer multiplies by 0.9, final signal = 0.9^100 ≈ 0
# If each layer multiplies by 1.1, final signal = 1.1^100 ≈ 13,800 (explodes)

# SOLUTION: Initialize weights to preserve signal magnitude
# Key insight: weight scale should depend on layer input size

# Create a simple linear layer
layer = nn.Linear(in_features=10, out_features=20)

print('DEFAULT INITIALIZATION:')
print(f'  Weights shape: {layer.weight.shape} (20 × 10)')
print(f'  Weight values (first 3): {layer.weight.data[0, :3]}')
print(f'  Weight std: {layer.weight.data.std():.4f}')
print(f'  Bias values (first 3): {layer.bias.data[:3]}')

# XAVIER (Glorot) INITIALIZATION:
# Good for: sigmoid, tanh
# Std = sqrt(2 / (fan_in + fan_out))
# fan_in = input size, fan_out = output size
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

fan_in = layer.weight.shape[1]  # 10
fan_out = layer.weight.shape[0]  # 20
xavier_std = math.sqrt(2.0 / (fan_in + fan_out))
print(f'\nXAVIER INITIALIZATION (for Sigmoid/Tanh):')
print(f'  fan_in={fan_in}, fan_out={fan_out}')
print(f'  Target std: {xavier_std:.4f}')
print(f'  Actual std: {layer.weight.data.std():.4f}')
print(f'  Weight values (first 3): {layer.weight.data[0, :3]}')
print('  Goal: Preserve signal magnitude through layer')

# HE (Kaiming) INITIALIZATION:
# Good for: ReLU and variants
# Std = sqrt(2 / fan_in)  # Extra factor of 2 for ReLU
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
nn.init.zeros_(layer.bias)

he_std = math.sqrt(2.0 / fan_in)
print(f'\nHE INITIALIZATION (for ReLU):')
print(f'  fan_in={fan_in}')
print(f'  Target std: {he_std:.4f}')
print(f'  Actual std: {layer.weight.data.std():.4f}')
print(f'  Weight values (first 3): {layer.weight.data[0, :3]}')
print('  Goal: Adapt for ReLU which zeros half the values')

# COMPARISON:
print('\nInitialization methods:')
print('  Random: No consideration for layer sizes')
print('  Xavier: Good for sigmoid/tanh (symmetric range)')
print('  He: Better for ReLU (asymmetric zero)')
print('  Others: orthogonal, sparse (rarely needed)')

Batch Normalization: Stabilizing Training

import torch
import torch.nn as nn

# BATCH NORMALIZATION CONCEPT:
# During training, internal layer distributions shift (covariate shift)
# This slows down learning. Batch norm fixes this!
#
# Steps:
# 1. Normalize batch to mean=0, std=1
# 2. Scale and shift with learnable parameters γ, β
# 3. Result: stable distributions, faster training

# WITHOUT batch norm:
model_no_bn = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# WITH batch norm:
model_with_bn = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),      # Normalize 64-d output
    nn.ReLU(),
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),     # Normalize 128-d output
    nn.ReLU(),
    nn.Linear(128, 10)
)

print('Model WITH Batch Normalization:')
for name, module in model_with_bn.named_children():
    print(f'  ({name}): {module}')

# BATCH NORM BEHAVIOR:
print('\nBatch Norm behavior:')
print('  Training mode: normalize by batch statistics')
print('  Eval mode: normalize by running statistics (from training)')
print('  Use model.train() and model.eval() appropriately!')

# EXAMPLE:
batch = torch.randn(32, 64)  # Batch of 32 samples, 64 features
bn = nn.BatchNorm1d(64)

print(f'\nInput batch:')
print(f'  Mean: {batch.mean():.4f}, Std: {batch.std():.4f}')

# Training mode
bn.train()
normalized = bn(batch)
print(f'After BatchNorm (training):')
print(f'  Mean: {normalized.mean():.4f}, Std: {normalized.std():.4f}')

# GUIDELINES:
print('\nWhen to use Batch Norm:')
print('  ✓ Deep networks (>4 layers): speeds up training significantly')
print('  ✓ Large batch sizes (>32): batch statistics more reliable')
print('  ✓ Before activation in CNNs')
print('  ⚠ Small batch sizes (<8): batch statistics unreliable')
print('  ⚠ RNNs/LSTMs: use LayerNorm instead (more stable)')
print('  ⚠ Remember: different behavior in train() vs eval()')
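
To make that last point concrete, here is a short sketch (synthetic data only) showing that eval mode normalizes with the running statistics accumulated during training, so a sample's output no longer depends on the rest of its batch:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)

# Train mode: feed several batches so running_mean / running_var converge
bn.train()
for _ in range(200):
    _ = bn(torch.randn(32, 4) * 3 + 5)   # Synthetic features with mean ≈ 5, std ≈ 3

print('Running mean:', bn.running_mean)   # ≈ 5 for every feature
print('Running var: ', bn.running_var)    # ≈ 9 for every feature

# Eval mode: the stored running statistics are used instead of batch statistics
bn.eval()
sample = torch.full((1, 4), 5.0)          # A sample sitting exactly at the running mean
print('Eval output: ', bn(sample))        # ≈ 0 for every feature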

Visualizing Batch Normalization Effect on Training

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Create two models: with and without batch norm
class NetworkWithoutBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, 64)
        self.fc4 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)
        return x

class NetworkWithBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.bn1 = nn.BatchNorm1d(64)
        self.fc2 = nn.Linear(64, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.fc4 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = torch.relu(self.bn2(self.fc2(x)))
        x = torch.relu(self.bn3(self.fc3(x)))
        x = self.fc4(x)
        return x

# Create data
np.random.seed(42)
torch.manual_seed(42)

X_train = torch.randn(1000, 20)
y_train = torch.randint(0, 10, (1000,))
X_val = torch.randn(200, 20)
y_val = torch.randint(0, 10, (200,))

# Training function
def train_model(model, X, y, X_val, y_val, epochs=50, lr=0.01):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    
    train_losses = []
    val_losses = []
    
    for epoch in range(epochs):
        model.train()
        # Training
        output = model(X)
        loss = loss_fn(output, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
        
        # Validation
        model.eval()
        with torch.no_grad():
            val_output = model(X_val)
            val_loss = loss_fn(val_output, y_val)
            val_losses.append(val_loss.item())
    
    return train_losses, val_losses

# Train both models
print('Training network WITHOUT Batch Norm...')
model_no_bn = NetworkWithoutBN()
train_loss_no_bn, val_loss_no_bn = train_model(model_no_bn, X_train, y_train, X_val, y_val)

print('Training network WITH Batch Norm...')
model_with_bn = NetworkWithBN()
train_loss_with_bn, val_loss_with_bn = train_model(model_with_bn, X_train, y_train, X_val, y_val)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Training convergence comparison
ax = axes[0]
epochs_range = np.arange(len(train_loss_no_bn))

ax.plot(epochs_range, train_loss_no_bn, 'b-', linewidth=2, label='No BN - Training', alpha=0.7)
ax.plot(epochs_range, val_loss_no_bn, 'b--', linewidth=2, label='No BN - Validation', alpha=0.7)
ax.plot(epochs_range, train_loss_with_bn, 'r-', linewidth=2, label='With BN - Training', alpha=0.7)
ax.plot(epochs_range, val_loss_with_bn, 'r--', linewidth=2, label='With BN - Validation', alpha=0.7)

ax.fill_between(epochs_range, train_loss_no_bn, alpha=0.1, color='blue')
ax.fill_between(epochs_range, train_loss_with_bn, alpha=0.1, color='red')

ax.set_xlabel('Epoch', fontsize=12, fontweight='bold')
ax.set_ylabel('Loss', fontsize=12, fontweight='bold')
ax.set_title('Convergence: With vs Without Batch Normalization', fontsize=13, fontweight='bold')
ax.legend(fontsize=10, loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: Final loss comparison
ax = axes[1]
models = ['Without BN', 'With BN']
final_train = [train_loss_no_bn[-1], train_loss_with_bn[-1]]
final_val = [val_loss_no_bn[-1], val_loss_with_bn[-1]]

x = np.arange(len(models))
width = 0.35

bars1 = ax.bar(x - width/2, final_train, width, label='Training Loss', color='skyblue', edgecolor='black')
bars2 = ax.bar(x + width/2, final_val, width, label='Validation Loss', color='coral', edgecolor='black')

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height,
                f'{height:.3f}', ha='center', va='bottom', fontweight='bold', fontsize=10)

ax.set_ylabel('Loss', fontsize=12, fontweight='bold')
ax.set_title('Final Loss Comparison', fontsize=13, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3, axis='y')

# Calculate improvement
improvement_train = ((train_loss_no_bn[-1] - train_loss_with_bn[-1]) / train_loss_no_bn[-1]) * 100
improvement_val = ((val_loss_no_bn[-1] - val_loss_with_bn[-1]) / val_loss_no_bn[-1]) * 100

ax.text(0.5, 0.98, f'Training improvement: {improvement_train:.1f}%\nValidation improvement: {improvement_val:.1f}%',
        transform=ax.transAxes, fontsize=11, fontweight='bold',
        verticalalignment='top', horizontalalignment='center',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

# Print detailed comparison
print('\nBatch Normalization Impact Analysis:')
print('='*70)
print(f'{"Metric":<30} {"Without BN":<20} {"With BN":<20}')
print('='*70)
print(f'{"Initial Train Loss":<30} {train_loss_no_bn[0]:>18.4f}   {train_loss_with_bn[0]:>18.4f}')
print(f'{"Final Train Loss":<30} {train_loss_no_bn[-1]:>18.4f}   {train_loss_with_bn[-1]:>18.4f}')
print(f'{"Loss Reduction %":<30} {((train_loss_no_bn[0]-train_loss_no_bn[-1])/train_loss_no_bn[0]*100):>18.1f}%   {((train_loss_with_bn[0]-train_loss_with_bn[-1])/train_loss_with_bn[0]*100):>18.1f}%')
print('─'*70)
print(f'{"Initial Val Loss":<30} {val_loss_no_bn[0]:>18.4f}   {val_loss_with_bn[0]:>18.4f}')
print(f'{"Final Val Loss":<30} {val_loss_no_bn[-1]:>18.4f}   {val_loss_with_bn[-1]:>18.4f}')
print(f'{"Val Loss Reduction %":<30} {((val_loss_no_bn[0]-val_loss_no_bn[-1])/val_loss_no_bn[0]*100):>18.1f}%   {((val_loss_with_bn[0]-val_loss_with_bn[-1])/val_loss_with_bn[0]*100):>18.1f}%')
print('='*70)

# Analyze learning stability
no_bn_variance = np.var(train_loss_no_bn[5:])  # After initial drop
with_bn_variance = np.var(train_loss_with_bn[5:])

print(f'\nTraining Stability (lower variance = more stable):')
print(f'  Without BN - Loss variance: {no_bn_variance:.6f}')
print(f'  With BN    - Loss variance: {with_bn_variance:.6f}')
print(f'  Improvement: {((no_bn_variance - with_bn_variance) / no_bn_variance * 100):.1f}% more stable')
Batch Normalization Benefits: Batch norm enables higher learning rates and faster convergence by stabilizing the distribution of layer inputs. It also acts as a regularizer, reducing overfitting. The cost is slightly more computation and memory, plus different training/eval behavior.
Activation & Initialization Best Practices
  • Default activation: ReLU for hidden layers, Sigmoid for binary output, Softmax for multi-class output
  • Dying ReLU: Switch to Leaky ReLU if many neurons output 0 consistently
  • Initialization for ReLU: Use He/Kaiming, not Xavier (ReLU changes signal scaling)
  • Batch norm placement: After linear/conv, before activation for best results
  • Batch norm modes: Always use model.train()/model.eval() at appropriate times
  • Layer norm for RNNs: More stable than Batch Norm for sequential data
  • Custom initialization: Use nn.init functions for specific needs, rarely needed in modern practice
Practice Exercises

Activations & Initialization Exercises

Exercise 1 (Beginner): Test different activations (ReLU, Sigmoid, Tanh, LeakyReLU) on same data. Plot output ranges. Explain when to use each.

Exercise 2 (Beginner): Create two models: one with random init, one with proper init (He/Kaiming for ReLU). Compare training convergence on synthetic data.

Exercise 3 (Intermediate): Build model with Batch Norm. Train with model.train() and evaluate with model.eval(). Observe difference in accuracy and stability.

Exercise 4 (Intermediate): Implement Dying ReLU problem: very low learning rate on ReLU network. Observe neurons that output 0. Switch to LeakyReLU and compare.

Challenge (Advanced): Design initialization strategy for deep network (10+ layers). Test convergence with different initialization schemes. Analyze gradient flow at different depths.

Optimizers & Learning Rate Scheduling

Optimizers update model parameters based on computed gradients. PyTorch provides many optimization algorithms, each with different convergence properties and hyperparameters.

Common Optimizers

import torch
import torch.nn as nn
import torch.optim as optim

# Create a simple model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)

# SGD (Stochastic Gradient Descent)
# Simple, requires careful learning rate tuning
optimizer_sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
print('SGD optimizer:', optimizer_sgd)

# Adam (Adaptive Moment Estimation)
# Most popular, adapts learning rate per parameter
# Good default choice for most tasks
optimizer_adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
print('Adam optimizer:', optimizer_adam)

# AdamW (Adam with weight decay fix)
# Better generalization than Adam
optimizer_adamw = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
print('AdamW optimizer:', optimizer_adamw)

# RMSprop (Root Mean Square Propagation)
# Good for RNNs, adapts learning rates
optimizer_rmsprop = optim.RMSprop(model.parameters(), lr=0.001)
print('RMSprop optimizer:', optimizer_rmsprop)

Optimizer Usage Pattern: The Complete Training Loop

import torch
import torch.nn as nn
import torch.optim as optim

# Create model and optimizer
model = nn.Sequential(nn.Linear(10, 5))
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()  # Loss function

# TRAINING LOOP: Standard pattern in PyTorch
for epoch in range(10):
    # STEP 1: ZERO GRADIENTS
    # Gradients accumulate by default (important for multi-task learning)
    # But usually we want fresh gradients each iteration
    # Call zero_grad() to reset all gradients to 0
    optimizer.zero_grad()
    
    # STEP 2: FORWARD PASS
    # Create sample data (batch_size=32, features=10)
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 5)  # True values
    
    # Feed data through model
    outputs = model(inputs)  # Shape: [32, 5]
    
    # STEP 3: COMPUTE LOSS
    # Measure how far predictions are from targets
    loss = criterion(outputs, targets)  # Returns a scalar
    
    # STEP 4: BACKWARD PASS (Backpropagation)
    # Automatically computes ∂loss/∂parameter for all parameters
    # This is the "A" in autograd—automatic differentiation
    loss.backward()
    
    # At this point, every parameter has a .grad attribute
    # containing ∂loss/∂parameter
    # Example: model[0].weight.grad has shape [5, 10]
    
    # STEP 5: UPDATE PARAMETERS (Gradient Descent)
    # Apply optimizer: param = param - learning_rate * gradient
    # For Adam: more complex, uses momentum and adaptive rates
    optimizer.step()
    
    # Print progress
    if epoch % 2 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

# Final weights are now learned from data!

Why optimizer.zero_grad() is Critical

import torch
import torch.nn as nn

# Create simple layer and parameter
layer = nn.Linear(2, 2)
x = torch.tensor([[1.0, 2.0]])
target = torch.tensor([[0.0, 0.0]])

# First backward pass
output1 = layer(x)
loss1 = (output1 ** 2).sum()
loss1.backward()
print('Gradient after first backward:', layer.weight.grad)

# WITHOUT zero_grad(): Gradients accumulate!
output2 = layer(x)
loss2 = (output2 ** 2).sum()
loss2.backward()
print('Gradient after 2nd backward (accumulated):', layer.weight.grad)
# Notice: grad values are LARGER, not fresh

# WITH zero_grad(): Correct pattern
layer.zero_grad()  # Reset gradients to 0
output3 = layer(x)
loss3 = (output3 ** 2).sum()
loss3.backward()
print('Gradient after zero_grad() + backward:', layer.weight.grad)
# Now gradient is computed fresh for this iteration only

# REMEMBER:
# optimizer.zero_grad()  → Reset gradients to 0
# loss.backward()        → Compute new gradients
# optimizer.step()       → Update parameters using gradients
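
Gradient accumulation can also be used deliberately: by delaying zero_grad() and step(), several small batches contribute to one update, which simulates a larger batch when GPU memory is tight. A minimal sketch with synthetic data:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

accumulation_steps = 4            # Effective batch size = 4 × 8 = 32
optimizer.zero_grad()

for step in range(8):             # 8 mini-batches of 8 samples each
    data = torch.randn(8, 10)
    labels = torch.randint(0, 2, (8,))

    loss = criterion(model(data), labels)
    (loss / accumulation_steps).backward()   # Scale so the summed gradient matches one big batch

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()          # Update once every 4 mini-batches
        optimizer.zero_grad()     # Then start accumulating fresh gradients
        print(f'Optimizer step after mini-batch {step + 1}')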

Learning Rate Scheduling

Learning Rate Scheduling: Decay Over Time

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau, CosineAnnealingLR

# Create model and optimizer
model = nn.Sequential(nn.Linear(10, 5))
optimizer = optim.Adam(model.parameters(), lr=0.1)

# ===== STEPLR: Reduce LR by factor every N epochs =====
# Parameters:
#   - step_size: reduce LR every N epochs
#   - gamma: multiply LR by this factor (e.g., 0.1 = divide by 10)
# Schedule: lr(t) = lr_0 * gamma^floor(t / step_size)
scheduler_step = StepLR(optimizer, step_size=30, gamma=0.1)
# This means: Every 30 epochs, multiply LR by 0.1
# Epoch 0-29: lr = 0.1
# Epoch 30-59: lr = 0.01
# Epoch 60+: lr = 0.001

print('Initial LR:', optimizer.param_groups[0]['lr'])  # 0.1

for epoch in range(5):
    # Training code here...
    pass
    
    # Step scheduler at end of epoch
    # Call AFTER training, updates optimizer's learning rate
    scheduler_step.step()
    print(f'Epoch {epoch+1}, LR: {optimizer.param_groups[0]["lr"]:.6f}')

# ===== REDUCELRONPLATEAU: Reduce when metric stops improving =====
# Monitors validation metric, reduces LR when it plateaus
# Parameters:
#   - mode: 'min' for loss (lower is better), 'max' for accuracy
#   - factor: multiply LR by this (0.5 = halve learning rate)
#   - patience: wait this many epochs of no improvement before reducing
#   - min_lr: don't go below this learning rate
optimizer2 = optim.Adam(model.parameters(), lr=0.1)
scheduler_plateau = ReduceLROnPlateau(
    optimizer2, 
    mode='min',           # Monitor loss (lower is better)
    factor=0.5,           # Halve learning rate when triggered
    patience=5,           # Wait 5 epochs of no improvement
    min_lr=0.00001        # Don't reduce below 1e-5
)

# During training, call with validation loss
for epoch in range(10):
    val_loss = 0.5 - epoch * 0.03  # Simulated decreasing loss
    # step() takes the metric value (validation loss)
    # Automatically reduces LR if no improvement for 'patience' epochs
    scheduler_plateau.step(val_loss)
    print(f'Epoch {epoch+1}, Val Loss: {val_loss:.4f}, LR: {optimizer2.param_groups[0]["lr"]:.6f}')

# ===== COSINEANNEALINGLR: Smooth cosine decay =====
# Gradually decreases LR using cosine function
# Nice smooth schedule, often works better than step decay
# Parameters:
#   - T_max: number of epochs for one cosine cycle
#   - eta_min: minimum learning rate (end value)
# Formula: lr(t) = eta_min + (lr_0 - eta_min) * (1 + cos(πt/T_max)) / 2
optimizer3 = optim.Adam(model.parameters(), lr=0.1)
scheduler_cosine = CosineAnnealingLR(
    optimizer3, 
    T_max=10,           # Complete one cosine cycle in 10 epochs
    eta_min=0.001       # Minimum LR at end
)

for epoch in range(10):
    scheduler_cosine.step()
    print(f'Epoch {epoch+1}, LR: {optimizer3.param_groups[0]["lr"]:.6f}')

# LEARNING RATE SCHEDULING GUIDE:
print('\nWhen to use each schedule:')
print('StepLR: Simple, predictable schedule. Use when you know training length.')
print('ReduceLROnPlateau: Adaptive, responds to actual training progress.')
print('CosineAnnealingLR: Smooth, empirically works well for many tasks.')
Optimizer Choice Guide: Start with Adam or AdamW (lr=0.001) as default. Use SGD with momentum (lr=0.01-0.1, momentum=0.9) for computer vision when you need best final performance and have time to tune. Use RMSprop for RNNs. Always enable weight decay for regularization.
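
One refinement of the weight decay advice: decay is usually applied to weight matrices but not to biases or normalization parameters. The split below (by parameter dimensionality) is a common heuristic, shown as a sketch rather than a fixed rule.

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(10, 64), nn.BatchNorm1d(64), nn.ReLU(),
    nn.Linear(64, 2)
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Biases and 1-D parameters (BatchNorm weight/bias) are usually left undecayed
    if param.ndim == 1:
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = optim.AdamW([
    {'params': decay, 'weight_decay': 0.01},
    {'params': no_decay, 'weight_decay': 0.0},
], lr=0.001)

print(f'Decayed params:   {sum(p.numel() for p in decay):,}')
print(f'Undecayed params: {sum(p.numel() for p in no_decay):,}')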

Datasets & DataLoaders

PyTorch's Dataset and DataLoader classes provide efficient data loading with automatic batching, shuffling, and parallel loading. This is essential for training on large datasets.

Creating Custom Dataset

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

# Custom Dataset: must implement __len__ and __getitem__
class CustomDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=10):
        # Initialize data (in practice, load from files here)
        self.data = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, 2, (num_samples,))  # Binary labels
    
    def __len__(self):
        # Return total number of samples
        return len(self.data)
    
    def __getitem__(self, idx):
        # Return one sample (data, label) at index idx
        return self.data[idx], self.labels[idx]

# Create dataset instance
dataset = CustomDataset(num_samples=1000, num_features=10)
print(f'Dataset size: {len(dataset)}')

# Access individual samples
sample_data, sample_label = dataset[0]
print(f'Sample data shape: {sample_data.shape}')  # torch.Size([10])
print(f'Sample label: {sample_label.item()}')  # 0 or 1

DataLoader: Batching & Shuffling

import torch
from torch.utils.data import Dataset, DataLoader

# Custom dataset (same as above)
class CustomDataset(Dataset):
    def __init__(self, num_samples=1000, num_features=10):
        self.data = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, 2, (num_samples,))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = CustomDataset(num_samples=1000, num_features=10)

# Create DataLoader
dataloader = DataLoader(
    dataset,
    batch_size=32,        # Load 32 samples per batch
    shuffle=True,         # Shuffle data every epoch
    num_workers=0,        # Parallel data loading (0 = single process)
    pin_memory=True       # Faster GPU transfer (use with CUDA)
)

print(f'Number of batches: {len(dataloader)}')  # ceil(1000 / 32) = 32 batches (31 full + 1 partial)

# Iterate through batches
for batch_idx, (data, labels) in enumerate(dataloader):
    print(f'Batch {batch_idx}: data shape {data.shape}, labels shape {labels.shape}')
    # data shape: torch.Size([32, 10])
    # labels shape: torch.Size([32])
    
    if batch_idx == 2:  # Show only first 3 batches
        break

Using Built-in Datasets (TorchVision)

import torch
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms

# Define data transformations
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert PIL Image to Tensor
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])

# Load MNIST dataset (downloads automatically on first run)
train_dataset = torchvision.datasets.MNIST(
    root='./data',           # Download location
    train=True,              # Training set
    download=True,           # Download if not present
    transform=transform      # Apply transformations
)

test_dataset = torchvision.datasets.MNIST(
    root='./data',
    train=False,             # Test set
    download=True,
    transform=transform
)

print(f'Training samples: {len(train_dataset)}')  # 60,000
print(f'Test samples: {len(test_dataset)}')      # 10,000

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Examine one batch
images, labels = next(iter(train_loader))
print(f'Image batch shape: {images.shape}')  # torch.Size([64, 1, 28, 28])
print(f'Label batch shape: {labels.shape}')  # torch.Size([64])
DataLoader Best Practices
  • Batch Size: Start with 32 or 64. Larger batches (256+) need higher learning rates. GPU memory limits max batch size.
  • Shuffle: Always shuffle training data. Don't shuffle validation/test data (reproducible evaluation).
  • num_workers: Use 2-4 workers for parallel loading on multi-core CPUs. Set to 0 on Windows to avoid issues.
  • pin_memory: Set True when using GPU—faster data transfer to CUDA.
  • drop_last: Set True to drop incomplete final batch (useful for batch normalization). A loader setup combining these options is sketched below.
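
Putting these settings together, a typical train/validation loader pair might look like the sketch below; the batch_size and num_workers values are illustrative, not prescriptions.

import torch
from torch.utils.data import DataLoader, TensorDataset

train_dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
val_dataset = TensorDataset(torch.randn(200, 10), torch.randint(0, 2, (200,)))

use_cuda = torch.cuda.is_available()

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,          # Shuffle training data every epoch
    num_workers=2,         # Parallel loading (use 0 on Windows, or guard the script with if __name__ == '__main__')
    pin_memory=use_cuda,   # Only useful when transferring batches to a GPU
    drop_last=True         # Keep batch statistics consistent for BatchNorm
)

val_loader = DataLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,         # Deterministic evaluation order
    num_workers=2,
    pin_memory=use_cuda
)

print(f'Train batches: {len(train_loader)}, Val batches: {len(val_loader)}')
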
Practice Exercises

Optimizers & Learning Rate Exercises

Exercise 1 (Beginner): Train same model with SGD, Adam, RMSprop. Compare convergence speed and final accuracy. Plot loss curves.

Exercise 2 (Beginner): Use SGD with different momentum values (0, 0.5, 0.9). Observe impact on training stability and convergence.

Exercise 3 (Intermediate): Implement learning rate scheduling (StepLR, ExponentialLR, CosineAnnealingLR). Train with each schedule and compare results.

Exercise 4 (Intermediate): Create DataLoader with different batch sizes (16, 32, 128). Train and compare convergence. Adjust learning rates appropriately for each.

Challenge (Advanced): Implement learning rate finder: sweep learning rates, track loss, plot curve to find optimal lr. Use one-cycle learning rate scheduler.

The Training Loop

The training loop is where all components come together: model, optimizer, loss function, and data. Understanding this pattern is crucial for successfully training neural networks.

Complete Training Loop Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# 1. Define Dataset
class SimpleDataset(Dataset):
    def __init__(self, num_samples=1000):
        self.data = torch.randn(num_samples, 10)
        self.labels = torch.randint(0, 2, (num_samples,))
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# 2. Create DataLoader
train_dataset = SimpleDataset(num_samples=1000)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# 3. Define Model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 2)  # Binary classification (2 classes)
)

# 4. Define Loss and Optimizer
criterion = nn.CrossEntropyLoss()  # For classification
optimizer = optim.Adam(model.parameters(), lr=0.001)

# 5. Training Loop
num_epochs = 10

for epoch in range(num_epochs):
    model.train()  # Set model to training mode (enables dropout, batch norm)
    running_loss = 0.0
    correct = 0
    total = 0
    
    for batch_idx, (data, labels) in enumerate(train_loader):
        # Zero gradients
        optimizer.zero_grad()
        
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, labels)
        
        # Backward pass
        loss.backward()
        
        # Update weights
        optimizer.step()
        
        # Track metrics
        running_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    # Epoch statistics
    epoch_loss = running_loss / len(train_loader)
    epoch_acc = 100 * correct / total
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_loss:.4f}, Accuracy: {epoch_acc:.2f}%')

Training with Validation

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Dataset (same as above)
class SimpleDataset(Dataset):
    def __init__(self, num_samples=1000):
        self.data = torch.randn(num_samples, 10)
        self.labels = torch.randint(0, 2, (num_samples,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

# Create train and validation sets
train_loader = DataLoader(SimpleDataset(800), batch_size=32, shuffle=True)
val_loader = DataLoader(SimpleDataset(200), batch_size=32, shuffle=False)

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop with validation
for epoch in range(10):
    # TRAINING PHASE
    model.train()
    train_loss = 0.0
    
    for data, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    
    train_loss /= len(train_loader)
    
    # VALIDATION PHASE
    model.eval()  # Set to evaluation mode (disables dropout, batch norm training)
    val_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():  # Disable gradient computation (saves memory)
        for data, labels in val_loader:
            outputs = model(data)
            loss = criterion(outputs, labels)
            val_loss += loss.item()
            
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    val_loss /= len(val_loader)
    val_acc = 100 * correct / total
    
    print(f'Epoch {epoch+1}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%')
Critical Training Steps: Always call optimizer.zero_grad() before backward pass (gradients accumulate by default). Use model.train() for training and model.eval() + torch.no_grad() for validation/testing. This ensures correct behavior of dropout and batch normalization layers.

Model Evaluation & Metrics

Proper evaluation requires setting the model to evaluation mode and computing appropriate metrics. Always evaluate on held-out test data to assess generalization.

Evaluation Pattern

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Simple dataset and model
class TestDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(100, 10)
        self.labels = torch.randint(0, 3, (100,))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

test_loader = DataLoader(TestDataset(), batch_size=32)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 3))

# Evaluation function
def evaluate_model(model, dataloader):
    model.eval()  # Set to evaluation mode
    correct = 0
    total = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():  # Disable gradient tracking
        for data, labels in dataloader:
            outputs = model(data)
            _, predicted = torch.max(outputs, 1)
            
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
            
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    accuracy = 100 * correct / total
    return accuracy, all_preds, all_labels

acc, preds, labels = evaluate_model(model, test_loader)
print(f'Test Accuracy: {acc:.2f}%')
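
Accuracy alone can hide weak classes. Building on the preds and labels lists returned above (3 classes in this toy dataset), here is a minimal sketch of a confusion matrix and per-class accuracy in plain PyTorch:

import torch

num_classes = 3
preds_t = torch.tensor(preds)
labels_t = torch.tensor(labels)

# Confusion matrix: rows = true class, columns = predicted class
confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)
for t, p in zip(labels_t, preds_t):
    confusion[t, p] += 1

print('Confusion matrix (rows=true, cols=pred):')
print(confusion)

# Per-class accuracy = correct predictions / samples of that class
per_class_acc = confusion.diag().float() / confusion.sum(dim=1).clamp(min=1).float()
for c in range(num_classes):
    print(f'Class {c}: {per_class_acc[c] * 100:.1f}% ({confusion[c].sum().item()} samples)')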

Saving & Loading Models

PyTorch provides two approaches: save the entire model or save only the state dictionary (recommended). The state dict contains all learnable parameters.

Save and Load State Dict (Recommended)

import torch
import torch.nn as nn

# Define and train a model
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

# Save only the state dict (parameters)
torch.save(model.state_dict(), 'model_weights.pth')
print('Model state dict saved')

# Load state dict into a NEW model instance
# IMPORTANT: Model architecture must match exactly
model_loaded = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model_loaded.load_state_dict(torch.load('model_weights.pth'))
model_loaded.eval()
print('Model state dict loaded')

# Verify weights match
print('Weights match:', torch.equal(model.state_dict()['0.weight'], 
                                     model_loaded.state_dict()['0.weight']))

Save Entire Model (Alternative)

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5))

# Save entire model (architecture + weights)
torch.save(model, 'entire_model.pth')

# Load entire model
model_loaded = torch.load('entire_model.pth', weights_only=False)  # Unpickling a full model needs weights_only=False (the default changed in PyTorch 2.6)
model_loaded.eval()
print('Entire model loaded')

Save Training Checkpoint

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 5))
optimizer = optim.Adam(model.parameters(), lr=0.001)
epoch = 10
loss = 0.5

# Save complete training state
checkpoint = {
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}
torch.save(checkpoint, 'checkpoint.pth')
print('Checkpoint saved')

# Load checkpoint and resume training
checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch']
last_loss = checkpoint['loss']
print(f'Resumed from epoch {start_epoch}, loss {last_loss:.4f}')

GPU Acceleration & Device Management

Training on GPU can be 10-100x faster than CPU. PyTorch makes GPU usage simple with the .to(device) method.

Basic GPU Usage

import torch
import torch.nn as nn

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

if torch.cuda.is_available():
    print(f'GPU: {torch.cuda.get_device_name(0)}')
    print(f'Memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB')

# Move model to GPU
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
model = model.to(device)
print('Model moved to', device)

# Move data to GPU (must match model device!)
data = torch.randn(32, 10).to(device)
labels = torch.randint(0, 5, (32,)).to(device)

# Forward pass on GPU
outputs = model(data)
print('Output device:', outputs.device)  # cuda:0

Training Loop with GPU

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create data
X_train = torch.randn(1000, 10)
y_train = torch.randint(0, 2, (1000,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32)

# Model, loss, optimizer on GPU
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(5):
    for data, labels in train_loader:
        # Move batch to GPU
        data, labels = data.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(data)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
GPU Best Practices: Always move both model AND data to the same device. Use torch.cuda.empty_cache() to free unused GPU memory. Monitor memory with torch.cuda.memory_allocated(). Use mixed precision training (next section) to reduce memory usage and speed up training.
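
To act on the memory advice above, here is a small helper sketch using the standard torch.cuda memory queries; the 4096×4096 tensor is just an illustrative allocation, and on a CPU-only machine the helper simply reports that no CUDA device is present.

import torch

def print_gpu_memory(tag=''):
    # Report allocated (in use by tensors) vs reserved (held by the caching allocator)
    if not torch.cuda.is_available():
        print(f'{tag}: no CUDA device available')
        return
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    print(f'{tag}: allocated {allocated:.2f} GB, reserved {reserved:.2f} GB')

print_gpu_memory('Before')
if torch.cuda.is_available():
    x = torch.randn(4096, 4096, device='cuda')   # ~67 MB of float32
    print_gpu_memory('After allocating a 4096x4096 tensor')
    del x
    torch.cuda.empty_cache()                     # Return cached blocks to the driver
    print_gpu_memory('After empty_cache()')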

Mixed Precision Training

Mixed precision uses FP16 (16-bit) instead of FP32 (32-bit) for faster training and lower memory usage. PyTorch's automatic mixed precision (AMP) handles this automatically.

Using torch.amp (Autocast + GradScaler)

import torch
import torch.nn as nn
import torch.optim as optim
from torch.amp import GradScaler, autocast  # torch.amp replaces the older, deprecated torch.cuda.amp entry point
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Model and data
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2)).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

X_train = torch.randn(1000, 10)
y_train = torch.randint(0, 2, (1000,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32)

# Create gradient scaler for mixed precision
scaler = GradScaler('cuda')

# Training loop with AMP
for epoch in range(5):
    for data, labels in train_loader:
        data, labels = data.to(device), labels.to(device)
        
        optimizer.zero_grad()
        
        # Wrap forward pass in autocast
        with autocast('cuda'):
            outputs = model(data)
            loss = criterion(outputs, labels)
        
        # Scale loss and backward pass
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
    
    print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

print('Mixed precision training complete - typically 2-3x faster!')

Transfer Learning with Pretrained Models

Transfer learning leverages models pretrained on massive datasets (ImageNet: 1+ million images) to jump-start learning on your task. The core insight: early layers learn universal patterns (edges, shapes), middle layers learn textures, deep layers learn object parts. You reuse these learned representations, only training a small new head for your specific classes. This reduces data requirements by 10-100× and speeds up training dramatically.

Understanding Pretrained Models

import torch
import torchvision.models as models

# POPULAR PRETRAINED MODELS (all trained on ImageNet):
print('Pretrained models available in torchvision:')
print('  ResNet (18, 34, 50, 101, 152): Residual networks, deep')
print('  VGG (11, 13, 16, 19): Simple, large models')
print('  Inception: Complex architecture, high accuracy')
print('  EfficientNet: Better accuracy-efficiency trade-off')
print('  MobileNet: Lightweight, mobile deployment')
print('  DenseNet: Dense connections, feature reuse')

# Load pretrained ResNet-50 (trained on ImageNet)
# First time: downloads weights automatically (~100 MB)
resnet50 = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)  # 'weights=' supersedes the deprecated pretrained=True
print(f'\nResNet-50 architecture: {resnet50}')

# ARCHITECTURE STRUCTURE:
# ResNet-50 last layer: 2048 features → 1000 ImageNet classes
print(f'\nFinal layer: {resnet50.fc}')
print('  Input: 2048 features from average pooling')
print('  Output: 1000 classes (ImageNet)')
print('  We will replace this for our task!')

# Count parameters
total_params = sum(p.numel() for p in resnet50.parameters())
print(f'\nTotal parameters: {total_params:,}')
print('Most parameters in convolutional base (learning ImageNet features)')
print('Only 2048→1000 weights in final layer need retraining')

Using Pretrained Features: Feature Extraction

import torch
import torch.nn as nn
import torchvision.models as models

# TRANSFER LEARNING STRATEGY:
# 1. Load pretrained model
# 2. FREEZE backbone (don't update pretrained weights)
# 3. REMOVE final layer (trained on 1000 ImageNet classes)
# 4. ADD custom head for your task (your number of classes)
# 5. TRAIN ONLY the new head

# Load pretrained ResNet-18
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

print('Original model:')
print(f'  Final layer: {model.fc}')
print(f'  Input features: {model.fc.in_features}')

# STEP 1: Freeze entire model
# requires_grad=False → gradients not computed, weights not updated
for param in model.parameters():
    param.requires_grad = False

print('\nAfter freezing backbone:')
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f'  Trainable: {trainable:,} params')
print(f'  Frozen: {frozen:,} params')

# STEP 2: Replace final layer for your task
# Input to ResNet-18 fc layer: 512 features
# Output: 10 classes (your dataset)
num_classes = 10
num_features = model.fc.in_features  # 512

model.fc = nn.Sequential(
    nn.Linear(num_features, 256),  # Hidden layer
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(256, num_classes)    # Output layer
)

print(f'\nAfter replacing final layer:')
print(f'  New classifier: {model.fc}')

# STEP 3: Unfreeze only the new head
for param in model.fc.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f'\nAfter unfreezing new head:')
print(f'  Trainable: {trainable:,} params (new head only)')
print(f'  Frozen: {frozen:,} params (pretrained backbone)')
print(f'  Ratio: {(frozen/trainable):.0f}x more frozen params')

# STEP 4: Setup optimizer (only for trainable params!)
# Only the new head gets updated
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

print('\nOptimizer: Adam(lr=0.001) for new head only')
print('Training: fit new layer on your data for 5-10 epochs')
print('Result: dramatically less compute than training every weight from scratch!')
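
To make the "train only the new head" step concrete, here is a minimal training loop for the model, optimizer, and criterion built above. The data is a stand-in batch of random 224×224 images; a real run would load an ImageFolder dataset with the usual ImageNet resizing and normalization.

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Stand-in data: 64 RGB images of 224x224 with labels in 0..num_classes-1
fake_images = torch.randn(64, 3, 224, 224)
fake_labels = torch.randint(0, num_classes, (64,))
loader = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=16, shuffle=True)

for epoch in range(2):
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()          # Gradients only flow into model.fc (the backbone is frozen)
        optimizer.step()
        running_loss += loss.item()
    print(f'Epoch {epoch+1}: loss {running_loss / len(loader):.4f}')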

Fine-Tuning: Adapting Pretrained Features

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

# TWO-STAGE TRANSFER LEARNING:
# Stage 1: Train new head with frozen backbone (fast, coarse adjustment)
# Stage 2: Fine-tune backbone with lower LR (slow, fine-grained adjustment)

# Load pretrained model with custom head
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)

# Freeze backbone for stage 1
for param in model.parameters():
    param.requires_grad = False
model.fc.requires_grad_(True)

# STAGE 1: Train new head (5-10 epochs, learning_rate = 1e-3)
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
print('STAGE 1: Training new head with frozen backbone')
print('  Learning rate: 0.001')
print('  Duration: 5-10 epochs (fast)')
print('  Purpose: Coarse adaptation to your data')

# Simulate training
for epoch in range(3):
    print(f'  Epoch {epoch+1}/3: Training new head...')
    # (run one epoch of the standard PyTorch training loop here)

print('\nSTAGE 1 complete. New head learned. Backbone unchanged.')

# STAGE 2: Fine-tune entire model (10-20 epochs, learning_rate = 1e-5)
print('\nSTAGE 2: Fine-tuning entire model')

# Unfreeze entire model
for param in model.parameters():
    param.requires_grad = True

# Use MUCH LOWER learning rate
# Why? Small updates to preserve learned features
# Rule: fine-tuning LR ≈ initial LR / 10-100
optimizer = optim.Adam(model.parameters(), lr=0.00001)  # 1e-5

print('  Learning rate: 0.00001 (100x lower than stage 1!)')
print('  Duration: 10-20 epochs')
print('  Purpose: Subtle adaptation of backbone layers')
print('')
print('Learning rate comparison:')
print('  Training from scratch: 0.001')
print('  Stage 1 (head only): 0.001')
print('  Stage 2 (fine-tune): 0.00001 (0.001 ÷ 100)')

# Simulate training
for epoch in range(5):
    print(f'  Epoch {epoch+1}/5: Fine-tuning entire model...')
    # (run one epoch of the standard PyTorch training loop here)

print('\nFine-tuning complete!')
print('\nWhen to use which strategy:')
print('  Small dataset (<5k images), different domain → Stage 1 only')
print('  Medium dataset (5k-50k), similar domain → Stage 1 + Stage 2')
print('  Large dataset (100k+), same domain → Fine-tune with higher LR')
print('  Very different domain → May need additional preprocessing/augmentation')

Selective Layer Freezing

import torch
import torch.nn as nn
import torchvision.models as models

# ADVANCED: Fine-tune only SOME backbone layers
# Strategy: freeze early layers (general features), tune later layers (domain-specific)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)

# ResNet structure (simplified):
# model.conv1 → layer1 → layer2 → layer3 → layer4 → avgpool → fc
#
# Early layers: learn edges, colors, simple textures (ImageNet-like)
# Middle layers: learn shapes, corners, textures
# Late layers: learn objects, specific patterns (domain-specific)

# Strategy: Freeze early, tune late
print('Selective layer freezing:')

# Option 1: Freeze early layers, tune middle + late
for param in model.conv1.parameters():
    param.requires_grad = False
for param in model.layer1.parameters():
    param.requires_grad = False
# layer2, layer3, layer4, fc will be trained

trainable_layers = []
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable_layers.append(name)

print(f'\nFrozen: conv1, layer1')
print(f'Trainable: {trainable_layers}')

# Option 2: Freeze everything except last layer of each block
for param in model.layer1[:-1].parameters():
    param.requires_grad = False
for param in model.layer2[:-1].parameters():
    param.requires_grad = False
for param in model.layer3[:-1].parameters():
    param.requires_grad = False

print('\nAlternative: Freeze all but last sub-layer in each block')
print('  Tradeoff: Fewer parameters to train, but more adaptation')

# GUIDELINES:
print('\nSelective freezing guidelines:')
print('  Small data, similar domain: Freeze conv1, layer1, layer2')
print('  Medium data: Freeze conv1, layer1 only')
print('  Large data, different domain: Freeze conv1 only')
print('  Very large data: Train everything (no freezing)')
Transfer Learning Best Practices
  • Two-stage training: Stage 1 train new head (high LR), Stage 2 fine-tune backbone (low LR)
  • Learning rate selection: Stage 1 ≈ 1e-3 (normal), Stage 2 ≈ 1e-5 (100× lower) to preserve features
  • Freezing strategy: Start fully frozen, gradually unfreeze from output backwards if needed
  • Model selection: MobileNet for speed, ResNet/EfficientNet for accuracy, match image size to model training
  • When helpful: <5k training images, similar domain (objects, faces, etc.), limited compute
  • When unnecessary: >100k training images, unique domain, unlimited compute resources
  • Batch normalization: Keep BN layers in eval mode during fine-tuning if freezing early layers (prevents distribution shift); see the sketch after this list
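
For the batch normalization bullet, here is a minimal sketch that holds BatchNorm layers in eval mode (running statistics and affine parameters frozen) while the rest of the network fine-tunes. For simplicity it freezes every BatchNorm2d layer; in practice you might restrict this to the frozen early blocks.

import torch.nn as nn
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 10)

def freeze_batchnorm(module):
    # Keep running statistics fixed and stop updating the affine parameters
    if isinstance(module, nn.BatchNorm2d):
        module.eval()
        module.weight.requires_grad = False
        module.bias.requires_grad = False

model.train()                      # Everything else trains normally...
model.apply(freeze_batchnorm)      # ...except the BatchNorm layers
# Note: re-apply freeze_batchnorm after every model.train() call in your training loop

frozen_bn = sum(isinstance(m, nn.BatchNorm2d) and not m.training for m in model.modules())
print(f'BatchNorm layers held in eval mode: {frozen_bn}')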

Convolutional Neural Networks (CNNs)

CNNs are the gold standard for computer vision tasks. They learn spatial hierarchies of features through convolutional layers that apply small filters (kernels) across images. The key insight: sharing parameters across spatial locations reduces the number of learnable weights dramatically compared to fully connected layers.

Understanding Convolution: Filters and Feature Maps

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# CONVOLUTION INTUITION:
# A filter (3x3) slides across an image, computing dot products
# Each position outputs a scalar → creates a feature map

# Example: 3x3 Sobel filter for vertical edge detection
edge_filter = torch.tensor([
    [-1., 0., 1.],
    [-2., 0., 2.],
    [-1., 0., 1.]
], dtype=torch.float32)

print('Vertical edge detection filter (Sobel):')
print(edge_filter)
print('')

# CONV2D PARAMETERS EXPLAINED:
print('nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)')
print('')

# Example: process RGB images (3 channels) with 32 filters
conv_layer = nn.Conv2d(
    in_channels=3,      # RGB image: 3 channels (red, green, blue)
    out_channels=32,    # Learn 32 different filters
    kernel_size=3,      # Each filter is 3x3
    stride=1,           # Move filter by 1 pixel at a time
    padding=1,          # Pad input with zeros (preserves spatial dimensions)
    bias=True           # Add bias term to each filter output
)

print(f'Conv2d layer parameters:')
print(f'  - in_channels: {conv_layer.in_channels} (RGB has 3 channels)')
print(f'  - out_channels: {conv_layer.out_channels} (32 learned filters)')
print(f'  - kernel_size: {conv_layer.kernel_size} (3x3 filter)')
print(f'  - weight shape: {conv_layer.weight.shape}')
print('    (each of the 32 filters has its own 3×3×3 weight tensor spanning the RGB channels)')

# TOTAL PARAMETERS:
# Each of 32 filters: 3×3×3 = 27 weights
# Plus bias: 32
# Total: 32 × 27 + 32 = 896 parameters
params = (3 * 3 * 3 * 32) + 32
print(f'  - Total parameters: {params}')

# TEST FORWARD PASS:
x = torch.randn(4, 3, 32, 32)  # Batch of 4 RGB images (32x32 each)
output = conv_layer(x)
print(f'\nInput shape: {x.shape} (batch, channels, height, width)')
print(f'Output shape: {output.shape} (batch, 32 filters, 32, 32)')
print('  Height/width stay the same: 3x3 kernel with stride=1 and padding=1')

# OUTPUT SIZE FORMULA:
# output_size = (input_size - kernel_size + 2*padding) / stride + 1
def output_spatial_size(input_size, kernel_size, stride, padding):
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(f'\nWith different stride:')
print(f'  stride=1: {output_spatial_size(32, 3, 1, 1)}x{output_spatial_size(32, 3, 1, 1)}')
print(f'  stride=2: {output_spatial_size(32, 3, 2, 1)}x{output_spatial_size(32, 3, 2, 1)}')
print(f'  stride=2 + padding=0: {output_spatial_size(32, 3, 2, 0)}x{output_spatial_size(32, 3, 2, 0)}')

MaxPooling: Downsampling Feature Maps

import torch
import torch.nn as nn

# MaxPooling INTUITION:
# Take the MAXIMUM value in each 2x2 window
# Reduces spatial dimensions while keeping important features

# Example: 4x4 feature map
feature_map = torch.tensor([
    [1., 3., 2., 5.],
    [4., 2., 1., 3.],
    [2., 5., 6., 1.],
    [3., 2., 4., 7.]
], dtype=torch.float32)

print('Original 4x4 feature map:')
print(feature_map)

# MaxPooling2d((2,2)): take max from each 2x2 window
# [[1,3], [4,2]] → max = 4
# [[2,5], [1,3]] → max = 5
# [[2,5], [3,2]] → max = 5
# [[6,1], [4,7]] → max = 7

pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(feature_map.unsqueeze(0).unsqueeze(0))  # Add batch and channel dims
print(f'\nAfter MaxPooling2d(2x2):')
print(pooled.squeeze())

print('\nBenefits of MaxPooling:')
print('  1. Reduces spatial dimensions (4x4 → 2x2, 75% fewer values)')
print('  2. Keeps most important information (max values)')
print('  3. Creates translation invariance (small shifts don\'t matter)')
print('  4. Reduces parameters in subsequent layers (fewer multiplications)')

# STRIDE OPTIONS:
print('\nMaxPooling with different strides:')
pool_stride2 = nn.MaxPool2d(kernel_size=2, stride=2)  # Non-overlapping windows: halves spatial size
pool_stride1 = nn.MaxPool2d(kernel_size=2, stride=1)  # Overlapping windows: shrinks each side by 1
x = torch.randn(1, 32, 32, 32)  # 32 feature maps of 32x32
print(f'  Input: {x.shape}')
print(f'  MaxPool2d(2, stride=2): {pool_stride2(x).shape}')  # [1, 32, 16, 16]
print(f'  MaxPool2d(2, stride=1): {pool_stride1(x).shape}')  # [1, 32, 31, 31]

Visualizing Convolution and MaxPooling

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as patches

# Create a sample image (8x8) with edge patterns
np.random.seed(42)
image = np.array([
    [100, 100, 100, 200, 200, 200, 100, 100],
    [100, 100, 100, 200, 200, 200, 100, 100],
    [100, 100, 100, 200, 200, 200, 100, 100],
    [50,  50,  50,  150, 150, 150, 50,  50],
    [50,  50,  50,  150, 150, 150, 50,  50],
    [200, 200, 200, 100, 100, 100, 200, 200],
    [200, 200, 200, 100, 100, 100, 200, 200],
    [200, 200, 200, 100, 100, 100, 200, 200]
], dtype=np.float32)

image_tensor = torch.from_numpy(image).unsqueeze(0).unsqueeze(0)

# Apply Conv2d - vertical edge detection kernel (Sobel)
conv = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0, bias=False)
conv.weight.data = torch.tensor([[[[-1., 0., 1.],
                                   [-2., 0., 2.],
                                   [-1., 0., 1.]]]], dtype=torch.float32)

feature_map = conv(image_tensor).squeeze().detach().numpy()

# Apply MaxPooling
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled_feature = pool(torch.from_numpy(feature_map).unsqueeze(0).unsqueeze(0)).squeeze().detach().numpy()

# Create comprehensive visualization
fig = plt.figure(figsize=(15, 10))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Row 1: Input and kernel
ax1 = fig.add_subplot(gs[0, 0])
im = ax1.imshow(image, cmap='gray', aspect='auto')
ax1.set_title('Input Image (8×8)\nGrayscale pixel values', fontsize=11, fontweight='bold')
ax1.set_xticks([])
ax1.set_yticks([])
plt.colorbar(im, ax=ax1, fraction=0.046)

ax2 = fig.add_subplot(gs[0, 1])
kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
im = ax2.imshow(kernel, cmap='RdBu_r', aspect='auto', vmin=-3, vmax=3)
ax2.set_title('Convolution Kernel (3×3)\nSobel Edge Detector', fontsize=11, fontweight='bold')
ax2.set_xticks([0, 1, 2])
ax2.set_yticks([0, 1, 2])
for i in range(3):
    for j in range(3):
        ax2.text(j, i, f'{kernel[i, j]:+.0f}', ha='center', va='center', color='white', fontweight='bold', fontsize=10)
plt.colorbar(im, ax=ax2, fraction=0.046)

ax3 = fig.add_subplot(gs[0, 2])
im = ax3.imshow(feature_map, cmap='hot', aspect='auto')
ax3.set_title(f'Feature Map After Conv\n(6×6, detects vertical edges)', fontsize=11, fontweight='bold')
ax3.set_xticks([])
ax3.set_yticks([])
plt.colorbar(im, ax=ax3, fraction=0.046)

# Row 2: MaxPooling visualization
ax4 = fig.add_subplot(gs[1, :2])
im = ax4.imshow(feature_map, cmap='hot', aspect='auto')
ax4.set_title('Feature Map with MaxPooling Windows (2×2)', fontsize=11, fontweight='bold')

# Draw 2x2 pooling regions and mark maximums
for i in range(0, feature_map.shape[0]-1, 2):
    for j in range(0, feature_map.shape[1]-1, 2):
        rect = patches.Rectangle((j-0.5, i-0.5), 2, 2, linewidth=2, edgecolor='red', facecolor='none', linestyle='--')
        ax4.add_patch(rect)
        
        # Find max in this region
        region_max = feature_map[i:i+2, j:j+2].max()
        max_pos = np.unravel_index(feature_map[i:i+2, j:j+2].argmax(), (2, 2))
        max_i, max_j = i + max_pos[0], j + max_pos[1]
        
        # Mark the maximum value with a star
        ax4.plot(max_j, max_i, marker='*', color='yellow', markersize=15, markeredgecolor='white', markeredgewidth=1.5)

ax4.set_xticks(range(feature_map.shape[1]))
ax4.set_yticks(range(feature_map.shape[0]))
plt.colorbar(im, ax=ax4, fraction=0.046)

# MaxPooled result
ax5 = fig.add_subplot(gs[1, 2])
im = ax5.imshow(pooled_feature, cmap='hot', aspect='auto')
ax5.set_title(f'After MaxPooling\n(3×3, 75% size reduction)', fontsize=11, fontweight='bold')
ax5.set_xticks(range(pooled_feature.shape[1]))
ax5.set_yticks(range(pooled_feature.shape[0]))
plt.colorbar(im, ax=ax5, fraction=0.046)

# Row 3: Explanation and summary
ax6 = fig.add_subplot(gs[2, :])
ax6.axis('off')

explanation = '''
HOW CONVOLUTION WORKS:
1. Kernel (3×3 filter) slides across image from left to right, top to bottom
2. At each position: multiply kernel values by image pixels, sum the results
3. Result: single number per position → creates feature map
4. Different kernels detect different features: edges, corners, textures, etc.

HOW MAXPOOLING WORKS:
1. Divide feature map into non-overlapping 2×2 windows
2. Keep the MAXIMUM value from each window
3. Discard the other 3 values
4. Result: smaller feature map with important information preserved

WHY USE THESE LAYERS?
✓ Convolution: Detects local features (hierarchical feature learning)
✓ MaxPooling: Reduces size (fewer parameters, faster computation), provides translation invariance
✓ Together: Create efficient, translation-invariant feature hierarchies for image classification

TYPICAL CNN ARCHITECTURE:
[Conv → ReLU → Pool] → [Conv → ReLU → Pool] → [Dense → Softmax]
  Feature detection    Feature refinement    Classification
'''

ax6.text(0.05, 0.95, explanation, transform=ax6.transAxes, fontsize=10,
         verticalalignment='top', fontfamily='monospace', wrap=True,
         bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.5))

plt.suptitle('Convolutional Neural Network: Convolution + MaxPooling Pipeline', fontsize=14, fontweight='bold', y=0.995)
plt.show()

# Print detailed pooling calculations
print('\nMaxPooling Calculation Example:')
print('Feature Map (6×6) → MaxPool(2×2, stride=2) → Output (3×3)')
print('\nSample window - Top-left 2×2:')
print(feature_map[:2, :2].astype(int))
print(f'Maximum: {feature_map[:2, :2].max():.0f}')
print('\nResult creates 3×3 output with each cell being max of corresponding 2×2 window')

Building a CNN: Conv + ReLU + Pool Pattern

import torch
import torch.nn as nn
import torch.nn.functional as F

# CNN ARCHITECTURE: Standard pattern is [Conv → ReLU → Pool] repeated
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        
        # BLOCK 1: Extract low-level features (edges, corners)
        # Input: 1x28x28 grayscale image (MNIST)
        self.conv1 = nn.Conv2d(
            in_channels=1,    # Grayscale image
            out_channels=32,  # Learn 32 filters detecting different patterns
            kernel_size=3,    # 3x3 filter
            padding=1         # Keep spatial dimensions
        )
        # Output: 32x28x28 (32 feature maps, each 28x28)
        
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Output after pool: 32x14x14
        
        # BLOCK 2: Extract mid-level features (textures, shapes)
        self.conv2 = nn.Conv2d(
            in_channels=32,   # Input: 32 feature maps from conv1
            out_channels=64,  # Increase filter count
            kernel_size=3,
            padding=1
        )
        # Output: 64x14x14
        
        # Output after pool: 64x7x7
        
        # CLASSIFIER: Convert spatial features to class predictions
        # Flatten: 64x7x7 = 3136 features
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.dropout = nn.Dropout(p=0.5)  # Reduce overfitting
        self.fc2 = nn.Linear(128, 10)     # 10 MNIST classes
    
    def forward(self, x):
        # Block 1: Conv → ReLU → Pool
        # Input: (batch, 1, 28, 28)
        x = self.conv1(x)              # (batch, 32, 28, 28)
        x = F.relu(x)                  # Apply activation
        x = self.pool(x)               # (batch, 32, 14, 14)
        
        # Block 2: Conv → ReLU → Pool
        x = self.conv2(x)              # (batch, 64, 14, 14)
        x = F.relu(x)                  # Apply activation
        x = self.pool(x)               # (batch, 64, 7, 7)
        
        # Classifier
        x = x.view(x.size(0), -1)      # Flatten: (batch, 3136)
        x = F.relu(self.fc1(x))        # Dense layer + ReLU
        x = self.dropout(x)            # Dropout during training
        x = self.fc2(x)                # Output: (batch, 10)
        return x

# Create model
model = SimpleCNN()
print('CNN Architecture:')
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'\nTotal parameters: {total_params:,}')
print(f'Trainable parameters: {trainable_params:,}')

# Test forward pass with batch of 4 images
x = torch.randn(4, 1, 28, 28)
output = model(x)
print(f'\nInput: {x.shape} (4 images, 1 channel, 28x28)')
print(f'Output: {output.shape} (4 images, 10 classes)')
print(f'Output probabilities (after softmax):\n{F.softmax(output, dim=1)[:2]}')

Training CNN on MNIST

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# BUILD MODEL
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.5)
    
    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

model = SimpleCNN()

# DATA LOADING with preprocessing
# Normalize MNIST: (pixel - mean) / std
# For MNIST: mean ≈ 0.1307, std ≈ 0.3081
transform = transforms.Compose([
    transforms.ToTensor(),  # Convert PIL image to tensor
    transforms.Normalize((0.1307,), (0.3081,))  # Normalize
])

# Load MNIST training set
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    transform=transform,
    download=True
)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

# Load MNIST test set
test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    transform=transform,
    download=True
)
test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)

print(f'Training samples: {len(train_dataset)}')
print(f'Test samples: {len(test_dataset)}')

# TRAINING SETUP
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

criterion = nn.CrossEntropyLoss()  # Combines LogSoftmax + NLLLoss
optimizer = optim.Adam(model.parameters(), lr=0.001)

# TRAINING LOOP (5 epochs)
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0
    
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Statistics
        train_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
    
    accuracy = 100 * correct / total
    print(f'Epoch {epoch+1}/{num_epochs} - Loss: {train_loss/len(train_loader):.4f}, Accuracy: {accuracy:.2f}%')

# EVALUATION on test set
model.eval()
test_loss = 0.0
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        test_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

test_accuracy = 100 * correct / total
print(f'\nTest Accuracy: {test_accuracy:.2f}%')
CNN Architecture Guidelines:
  • Progressive depth: Conv → ReLU → Pool, increasing filters (32 → 64 → 128) as spatial dimensions decrease
  • Kernel size: 3×3 is standard; 5×5 or 7×7 for large spatial patterns. Two stacked 3×3 layers ≈ one 5×5 layer but fewer parameters
  • Stride: stride=1 for feature extraction, stride≥2 for downsampling (alternative to pooling)
  • Padding: padding='same' preserves dimensions; no padding gradually reduces size
  • Global average pooling (nn.AdaptiveAvgPool2d): Reduces an 8×8×256 feature map to a 256-d vector; far fewer parameters than Flatten + Dense (see the sketch after this list)
  • Dropout: Place before Dense layers; typical rates 0.3-0.5. Reduces co-adaptation, improves generalization
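
A minimal sketch of the global-average-pooling pattern from the list above, using nn.AdaptiveAvgPool2d with output size 1; the 8×8×256 shape mirrors the example in the bullet.

import torch
import torch.nn as nn

# Assume a feature map of shape (batch, 256, 8, 8) coming out of the last conv block
features = torch.randn(4, 256, 8, 8)

gap = nn.AdaptiveAvgPool2d(1)        # Average each 8x8 feature map down to 1x1
pooled = gap(features)               # (4, 256, 1, 1)
vector = pooled.flatten(1)           # (4, 256) - one value per channel

classifier = nn.Linear(256, 10)      # Small classifier head: 256 * 10 weights
logits = classifier(vector)
print(logits.shape)                  # torch.Size([4, 10])

# Compare: Flatten + Dense would need 8*8*256*10 = 163,840 weights vs 2,560 here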

Recurrent Neural Networks (RNNs & LSTMs)

RNNs process sequential data (text, time series, audio) by maintaining hidden states across time steps. The key insight: the same weights are applied at each time step, but the hidden state accumulates information from previous steps. LSTMs (Long Short-Term Memory) solve the vanishing gradient problem of standard RNNs through gating mechanisms that explicitly control what information to remember.

RNN Fundamentals: Processing Sequences

import torch
import torch.nn as nn

# RNN INTUITION:
# For text: "I love deep learning"
# Token sequence: [42, 156, 89, 234]
#
# Processing:
# h0 = initial hidden state (zeros)
# h1 = RNN(word_42, h0)  # Process first word with previous hidden
# h2 = RNN(word_156, h1) # Process second word with updated hidden
# h3 = RNN(word_89, h2)  # And so on...
# h4 = RNN(word_234, h3)
#
# Final h4 contains context of entire sequence!

# SIMPLE RNN LAYER IMPLEMENTATION CONCEPT:
# At each time step t:
# h[t] = tanh(W_ih @ x[t] + W_hh @ h[t-1] + b_h)
# Where:
#   W_ih: weight matrix for input
#   W_hh: weight matrix for hidden state (recurrence)
#   @ : matrix multiplication
#   tanh: activation function

class SimpleRNNCell(nn.Module):
    """Manually implemented RNN cell to understand the math."""
    def __init__(self, input_size, hidden_size):
        super(SimpleRNNCell, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        # Weights
        self.weight_ih = nn.Linear(input_size, hidden_size)  # Input to hidden
        self.weight_hh = nn.Linear(hidden_size, hidden_size) # Hidden to hidden (recurrence)
    
    def forward(self, x, hidden):
        # x: (batch_size, input_size) - current token
        # hidden: (batch_size, hidden_size) - previous hidden state
        
        # Compute new hidden state
        new_hidden = torch.tanh(self.weight_ih(x) + self.weight_hh(hidden))
        return new_hidden

# PYTORCH'S RNN LAYER (easier to use):
rnn_layer = nn.RNN(
    input_size=10,      # Feature dimension per time step
    hidden_size=20,     # Hidden state dimension
    num_layers=1,       # Stack 1 RNN layer
    batch_first=True    # Input format: (batch, seq_len, features)
)

# Example: sequence of 5 tokens, each with 10 features
x = torch.randn(4, 5, 10)  # (batch_size=4, seq_len=5, input_size=10)

# Initialize hidden state to zeros
h0 = torch.zeros(1, 4, 20)  # (num_layers, batch_size, hidden_size)

# Forward pass
outputs, hidden_final = rnn_layer(x, h0)
# outputs: (batch_size=4, seq_len=5, hidden_size=20) - all hidden states
# hidden_final: (num_layers=1, batch_size=4, hidden_size=20) - final hidden state

print(f'Input shape: {x.shape}')
print(f'Output (all hidden states): {outputs.shape}')
print(f'Final hidden state: {hidden_final.shape}')
print('Final hidden state ≈ context of entire sequence')

LSTM: Solving the Vanishing Gradient Problem

import torch
import torch.nn as nn

# VANISHING GRADIENT PROBLEM:
# In standard RNNs, h[t] = tanh(W @ h[t-1] + ...)
# Backprop: dh[t]/dh[t-1] = W^T * tanh'
# Over many time steps: dL/dh[0] = product of many small numbers → approaches 0
# Result: early tokens have almost zero gradient, can't learn long dependencies

# LSTM SOLUTION: Gating mechanisms + cell state
# LSTM cells have:
# - hidden state h[t]: short-term context (changes each step)
# - cell state c[t]: long-term memory (preserved across steps)
#
# Three gates control information flow:
# 1. FORGET GATE: What to discard from previous cell state
# 2. INPUT GATE: What new information to add
# 3. OUTPUT GATE: What hidden state to output

# LSTM EQUATIONS (simplified):
# f[t] = sigmoid(W_f @ [h[t-1], x[t]] + b_f)      # Forget gate
# i[t] = sigmoid(W_i @ [h[t-1], x[t]] + b_i)      # Input gate
# c_candidate = tanh(W_c @ [h[t-1], x[t]] + b_c)  # New cell info
# c[t] = f[t] * c[t-1] + i[t] * c_candidate       # Update cell state
# o[t] = sigmoid(W_o @ [h[t-1], x[t]] + b_o)      # Output gate
# h[t] = o[t] * tanh(c[t])                         # New hidden state
#
# KEY: c[t] = f[t]*c[t-1] + ... uses addition, not multiplication!
# Backprop: dL/dc[t] = dL/dc[t+1] + dL/dh[t]
# Gradient can flow backwards without vanishing

# PYTORCH LSTM:
lstm_layer = nn.LSTM(
    input_size=10,      # Features per time step
    hidden_size=20,     # Hidden/cell state dimension
    num_layers=1,       # Stack 1 LSTM layer
    batch_first=True,   # Input format: (batch, seq_len, features)
    bidirectional=False,# Process left-to-right only
    dropout=0.0         # Dropout between layers (>1 layer)
)

# Example: process sequence
x = torch.randn(4, 5, 10)  # (batch=4, seq_len=5, input=10)

# Initialize hidden and cell states
h0 = torch.zeros(1, 4, 20)  # (num_layers, batch, hidden)
c0 = torch.zeros(1, 4, 20)  # (num_layers, batch, hidden)

# Forward pass: returns outputs, (h_final, c_final)
outputs, (hidden, cell) = lstm_layer(x, (h0, c0))
# outputs: (4, 5, 20) - hidden state at each time step
# hidden: (1, 4, 20) - final hidden state
# cell: (1, 4, 20) - final cell state

print(f'Input: {x.shape}')
print(f'Output: {outputs.shape} (all hidden states)')
print(f'Final hidden state: {hidden.shape}')
print(f'Final cell state: {cell.shape}')

# Key advantage: LSTM "remembers" information from earlier steps
print('\nLSTM advantages:')
print('  - Solves vanishing gradient (addition preserves gradient flow)')
print('  - Long-term memory via cell state (separate from hidden)')
print('  - Gates control what information flows (learnable)')
print('  - Can learn dependencies >100 steps apart')

Building LSTM Text Classifier

import torch
import torch.nn as nn
import torch.optim as optim

class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, num_layers=1):
        super(LSTMTextClassifier, self).__init__()
        
        # LAYER 1: Embedding
        # Convert token IDs → dense vectors
        # Parameters: vocab_size × embed_dim
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # LAYER 2: LSTM
        # Process sequences while maintaining memory
        # Returns outputs and (hidden, cell) tuple
        self.lstm = nn.LSTM(
            input_size=embed_dim,      # Input to LSTM is embedding vectors
            hidden_size=hidden_dim,    # Hidden state dimension
            num_layers=num_layers,     # Stack multiple LSTM layers
            batch_first=True,          # Input: (batch, seq_len, features)
            bidirectional=False,       # Process left-to-right
            dropout=0.3 if num_layers > 1 else 0  # Dropout between LSTM layers
        )
        
        # LAYER 3: Dropout
        self.dropout = nn.Dropout(p=0.3)
        
        # LAYER 4: Output classifier
        # Hidden state → logits
        self.fc = nn.Linear(hidden_dim, num_classes)
    
    def forward(self, x):
        # x: (batch_size, seq_len) token IDs
        # Example: [[5, 42, 156, 0], [89, 234, 0, 0]]
        
        # Embedding: token IDs → vectors
        embedded = self.embedding(x)
        # Shape: (batch_size, seq_len, embed_dim)
        
        # LSTM: process sequence
        lstm_out, (hidden, cell) = self.lstm(embedded)
        # lstm_out: (batch, seq_len, hidden_dim) all hidden states
        # hidden: (num_layers, batch, hidden_dim) final hidden from each layer
        # cell: (num_layers, batch, hidden_dim) final cell state
        
        # Use final hidden state (contains sequence context)
        # For single layer: hidden[0] = final hidden state
        # Shape: (batch_size, hidden_dim)
        final_hidden = hidden[-1]  # Last layer's final hidden state
        
        # Dropout regularization
        dropped = self.dropout(final_hidden)
        
        # Classification
        logits = self.fc(dropped)  # (batch_size, num_classes)
        return logits

# CREATE MODEL:
vocab_size = 5000        # Dictionary size
embed_dim = 128          # Word vector dimension
hidden_dim = 256         # LSTM hidden state dimension
num_classes = 2          # Binary classification (positive/negative)
num_layers = 2           # Stack 2 LSTM layers

model = LSTMTextClassifier(vocab_size, embed_dim, hidden_dim, num_classes, num_layers)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Model parameters: {total_params:,}')
print(f'Trainable: {trainable_params:,}')

# TEST FORWARD PASS:
batch_size = 4
seq_len = 20
x = torch.randint(0, vocab_size, (batch_size, seq_len))  # Random token IDs
output = model(x)
print(f'\nInput: {x.shape} (batch, sequence_length)')
print(f'Output: {output.shape} (batch, num_classes)')
print(f'Output probabilities:\n{torch.softmax(output, dim=1)}')

Bidirectional LSTM: Context from Both Directions

import torch
import torch.nn as nn

# BIDIRECTIONAL LSTM INTUITION:
# Standard LSTM: "I love [?] learning"
# Process left-to-right: I → love → ? → learning
# Hidden state at [?]: context from I, love (past only)
#
# Bidirectional: process BOTH directions
# Left-to-right: I → love → [?] ← learning
# Right-to-left: learning → ? → love → I
# Result: hidden state at [?] contains context from BOTH sides!
#
# Advantage: much better for understanding full context
# Disadvantage: can't be used for real-time prediction (need future tokens)

# PYTORCH BIDIRECTIONAL LSTM:
bilstm = nn.LSTM(
    input_size=100,
    hidden_size=256,
    num_layers=1,
    batch_first=True,
    bidirectional=True,  # Process both directions
    dropout=0.0
)

# Example sequence
x = torch.randn(4, 20, 100)  # (batch=4, seq_len=20, embed_dim=100)

outputs, (hidden, cell) = bilstm(x)
# IMPORTANT: bidirectional output dimensions change!
# outputs: (batch, seq_len, hidden_dim * 2) - concatenate forward and backward
print(f'Input shape: {x.shape}')
print(f'Output shape: {outputs.shape}')
print(f'  (batch=4, seq_len=20, hidden*2=512)')

# hidden shape: (num_layers*2, batch, hidden) - separate for forward and backward
print(f'Hidden shape: {hidden.shape}')
print(f'  (num_layers*2=2, batch=4, hidden=256)')
print('  hidden[0] = forward final state, hidden[1] = backward final state')

# Typically concatenate both directions:
forward_hidden = hidden[0]  # (batch, hidden_dim)
backward_hidden = hidden[1]  # (batch, hidden_dim)
combined = torch.cat([forward_hidden, backward_hidden], dim=1)  # (batch, hidden_dim*2)
print(f'\nCombined hidden state: {combined.shape}')

# In classification, use the concatenated final hidden states (above) or the output at the
# last time step. Note: the backward half of outputs[:, -1, :] has only processed the final
# token, so the concatenated hidden states are usually the better sequence summary.
context = outputs[:, -1, :]  # Output at last time step (forward and backward halves concatenated)
print(f'Final bidirectional context: {context.shape} (batch, hidden*2)')

Training the LSTM Classifier End-to-End

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

# Model from previous example
class LSTMTextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super(LSTMTextClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, (hidden, _) = self.lstm(embedded)
        context = self.dropout(hidden[-1])  # Final hidden state
        logits = self.fc(context)
        return logits

# TRAINING SETUP:
vocab_size = 5000
embed_dim = 100
hidden_dim = 256
num_classes = 2
batch_size = 32
num_epochs = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create model
model = LSTMTextClassifier(vocab_size, embed_dim, hidden_dim, num_classes)
model.to(device)

# Optimizer and loss
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # For multi-class classification

# Synthetic data for demo
X_train = torch.randint(0, vocab_size, (1000, 20))  # 1000 samples, 20 tokens each
y_train = torch.randint(0, num_classes, (1000,))    # Random labels

train_loader = DataLoader(
    list(zip(X_train, y_train)),
    batch_size=batch_size,
    shuffle=True
)

# TRAINING LOOP:
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    correct = 0
    total = 0
    
    for batch_x, batch_y in train_loader:
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)
        
        # Forward pass
        logits = model(batch_x)
        loss = criterion(logits, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Statistics
        total_loss += loss.item()
        _, predicted = torch.max(logits, 1)
        correct += (predicted == batch_y).sum().item()
        total += batch_y.size(0)
    
    accuracy = 100 * correct / total
    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')

print('Training complete!')
RNN vs LSTM vs GRU Comparison:
  • RNN: Basic recurrence, simple but vanishing gradient problem on long sequences (>20 steps)
  • LSTM: Full gating mechanism (forget, input, output gates), cell state preserves gradient, best for long sequences (>100 steps)
  • GRU: Simplified LSTM (reset and update gates only), faster, 25-30% fewer parameters, good for medium-length sequences (see the GRU sketch after this list)
  • Bidirectional: Process sequence both directions; better accuracy but can't be used for real-time prediction
  • When to use: short sequences (<20 steps) → RNN, medium (20-100) → GRU, long (>100) or accuracy-critical tasks → LSTM or Transformer
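
For comparison with the LSTM example above, here is a minimal nn.GRU sketch using the same toy shapes. The interface is the same, but a GRU returns only a hidden state (no separate cell state):

import torch
import torch.nn as nn

gru_layer = nn.GRU(
    input_size=10,      # Features per time step
    hidden_size=20,     # Hidden state dimension
    num_layers=1,
    batch_first=True    # Input format: (batch, seq_len, features)
)

x = torch.randn(4, 5, 10)           # (batch=4, seq_len=5, input=10)
h0 = torch.zeros(1, 4, 20)          # (num_layers, batch, hidden)

outputs, hidden = gru_layer(x, h0)  # No cell state, unlike LSTM
print(f'Output: {outputs.shape}')        # (4, 5, 20) - hidden state at each step
print(f'Final hidden: {hidden.shape}')   # (1, 4, 20)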

Embeddings for NLP

Embeddings convert discrete tokens (word IDs: 42, 156, 89) into continuous dense vectors that capture semantic relationships. Word "love" (ID 42) and "like" (ID 156) get similar vectors because they mean similar things. This is crucial for NLP—neural networks don't understand integers, but they understand vectors with proximity meaning.

Embedding Fundamentals: Tokens to Vectors

import torch
import torch.nn as nn
import numpy as np

# EMBEDDING INTUITION:
# Raw text: "I love deep learning"
# After tokenization: [42, 156, 89, 234]
# These are just integers—no semantic meaning!
#
# Embedding: token ID → learnable dense vector
# Word 42 ("love") → [0.2, -0.5, 0.1, ..., 0.7] (128-d vector)
# Word 156 ("like") → [0.19, -0.52, 0.11, ..., 0.68] (similar to "love"!)
# Word 89 ("deep") → [0.8, 0.3, -0.2, ..., -0.1] (different from love/like)

# EMBEDDING LAYER:
# A lookup table: vocabulary_size × embedding_dim
# Initialize with random values, learn during training

vocab_size = 10000        # 10,000 different words
embedding_dim = 128       # Each word → 128-dimensional vector

embedding = nn.Embedding(vocab_size, embedding_dim)
print(f'Embedding weight matrix shape: {embedding.weight.shape}')
print('  Shape: (vocab_size=10000, embedding_dim=128)')
print(f'  Total parameters: {10000 * 128:,} (all learnable!)')

# LOOKUP OPERATION:
# Give it a token ID, get back its vector
word_id = torch.tensor([42])  # Word with ID 42
word_vector = embedding(word_id)
print(f'\nToken ID 42 → vector shape: {word_vector.shape}')
print(f'Vector values (first 5): {word_vector[0, :5]}')

# BATCH LOOKUP:
# Tokenized sentence: "I love deep learning"
# Token IDs: [123, 42, 89, 234]
sentence_ids = torch.tensor([[123, 42, 89, 234, 0, 0]])  # 0 = padding for fixed length
sentence_vectors = embedding(sentence_ids)
print(f'\nSentence embeddings shape: {sentence_vectors.shape}')
print('  (batch_size=1, sequence_length=6, embedding_dim=128)')
print('  Now each token is a 128-d vector!')

# THE LEARNING PROCESS:
print('\nHow embeddings are learned:')
print('1. Initialize: random vectors for each word')
print('2. Train on downstream task (e.g., sentiment classification)')
print('3. Backprop updates embedding vectors via gradient descent')
print('4. Similar words naturally learn similar vectors')
print('5. After training: embeddings capture semantic relationships')
print('   Example: vector("king") - vector("man") + vector("woman") ≈ vector("queen")')

Embedding Layer Parameters

import torch
import torch.nn as nn

# CREATING EMBEDDING LAYER:
embedding = nn.Embedding(
    num_embeddings=10000,    # Vocabulary size: 10,000 different words
    embedding_dim=128,       # Output vector dimension
    padding_idx=0,           # Token ID 0 = padding (always zero vector)
    max_norm=None,           # Optional: clip vectors to max length
    norm_type=2.0,           # L2 norm (distance metric)
    scale_grad_by_freq=False # If True, scale gradients by inverse token frequency in the mini-batch
)

print('Embedding layer created:')
print(f'  vocab_size=10000: can embed 10,000 different words')
print(f'  embedding_dim=128: each word → 128-d vector')
print(f'  padding_idx=0: token 0 stays [0, 0, ..., 0]')
print(f'  Parameters: 10,000 × 128 = {10000*128:,}')

# PADDING IN PRACTICE:
print('\nPadding example:')
sentence = torch.tensor([[5, 42, 89, 0, 0]])  # 3 real tokens padded to length 5 with 2 padding tokens (ID 0)
embeddings = embedding(sentence)
print(f'Input: {sentence}')
print(f'Output shape: {embeddings.shape}')
print(f'Padding token (0) embedding: {embedding(torch.tensor([0]))}')
print('  All zeros! Padding has no semantic meaning')

# CONTROLLING EMBEDDING SIZE:
print('\nEmbedding dimension selection:')
print('  Small datasets, few words: 32-64 dimensions')
print('  Medium datasets, 1-10k words: 64-256 dimensions')
print('  Large datasets, 100k+ words: 256-768 dimensions')
print('  Rule of thumb: embedding_dim ≈ sqrt(vocab_size)')
vocab = 10000
suggested_dim = int(vocab ** 0.5)
print(f'  For vocab_size={vocab}: suggested dim ≈ {suggested_dim}')

# MAX_NORM: Clip embeddings to max length
embedding_clipped = nn.Embedding(
    num_embeddings=10000,
    embedding_dim=128,
    max_norm=1.0  # All vectors have length ≤ 1.0
)
print('\nmax_norm=1.0: clips every embedding vector to length ≤ 1.0')
print('  Benefits: Prevents embeddings from growing unbounded')
print('  Use for: High-dimensional embeddings, similarity-sensitive tasks')

Pretrained Embeddings: Word2Vec, GloVe, FastText

import torch
import torch.nn as nn
import numpy as np

# PRETRAINED EMBEDDINGS CONCEPT:
# Instead of training embeddings from scratch, use knowledge from large corpora
# Word2Vec, GloVe, FastText trained on billions of words
# Pre-learned semantic relationships: vector("king") - vector("man") + vector("woman") ≈ vector("queen")

# PRETRAINED OPTIONS:
print('Popular pretrained embeddings:')
print('  Word2Vec (Google, 2013): 300-d, trained on Google News')
print('  GloVe (Stanford, 2014): 50, 100, 200, 300-d, trained on Wikipedia+Gigaword')
print('  FastText (Facebook, 2016): 300-d, handles misspellings (char n-grams)')
print('  BERT, GPT (modern): 768-4096-d, context-dependent, expensive')

# SIMULATING PRETRAINED EMBEDDINGS:
vocab_size = 10000
embedding_dim = 300

# Create embedding layer
embedding = nn.Embedding(vocab_size, embedding_dim)

# LOADING PRETRAINED WEIGHTS (example):
# In practice, download from official source
# For demo: create synthetic pretrained weights
pretrained_weights = torch.randn(vocab_size, embedding_dim) / np.sqrt(embedding_dim)
embedding.weight.data.copy_(pretrained_weights)
print(f'\nPretrained weights loaded: {embedding.weight.shape}')

# FREEZING vs FINE-TUNING:
print('\nOptions for using pretrained embeddings:')

# Option 1: FROZEN (keep as-is)
embedding.weight.requires_grad = False
print('1. Frozen: embedding.weight.requires_grad = False')
print('   Pros: Preserves semantic relationships, few parameters')
print('   Cons: Can\'t adapt to your domain')
print('   Use when: Limited data, domain similar to pretraining')

# Option 2: FINE-TUNING (allow updates)
embedding.weight.requires_grad = True
print('2. Fine-tuning: embedding.weight.requires_grad = True')
print('   Pros: Adapts to your domain-specific vocabulary')
print('   Cons: More parameters, risk of overfitting')
print('   Use when: Sufficient data, domain-specific terms matter')

# COMPARISON: Training from scratch vs pretrained
print('\nTraining from scratch vs Pretrained:')
print('From scratch:')
print('  - 10,000 × 300 = 3M parameters to learn')
print('  - Need hundreds of thousands of sentences')
print('  - Time: weeks on GPU')
print('Pretrained (frozen):')
print('  - 0 trainable parameters in the embedding layer')
print('  - Need only thousands of sentences')
print('  - Time: hours on GPU')
print('Pretrained (fine-tuned):')
print('  - 3M parameters to update (small gradients)')
print('  - Need tens of thousands of sentences')
print('  - Time: days on GPU')

Practical Embedding Example with LSTM

import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes):
        super(SentimentClassifier, self).__init__()
        
        # EMBEDDING LAYER
        self.embedding = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=embedding_dim,
            padding_idx=0  # Don't learn padding token
        )
        
        # LSTM LAYER
        self.lstm = nn.LSTM(
            input_size=embedding_dim,   # Input from embeddings
            hidden_size=hidden_dim,
            batch_first=True,
            bidirectional=True
        )
        
        # OUTPUT CLASSIFIER
        self.fc = nn.Linear(hidden_dim * 2, num_classes)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len) - integer IDs
        # Example: [[5, 42, 156, 0], [89, 234, 0, 0]]
        
        # Step 1: Convert token IDs to embedding vectors
        # (batch, seq_len) → (batch, seq_len, embedding_dim)
        embedded = self.embedding(token_ids)
        
        # Step 2: Process sequence with LSTM
        # Returns hidden state containing sequence context
        lstm_out, (hidden, _) = self.lstm(embedded)
        
        # Step 3: Concatenate the final forward and backward hidden states
        # For this single-layer BiLSTM, hidden has shape (2, batch, hidden_dim):
        # hidden[-2] = forward final state, hidden[-1] = backward final state
        context = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, hidden_dim * 2)
        context = self.dropout(context)
        
        # Step 4: Classify
        logits = self.fc(context)
        return logits

# CREATE MODEL:
vocab_size = 10000
embedding_dim = 100      # 100-dimensional word vectors
hidden_dim = 256         # LSTM hidden state
num_classes = 2          # Positive/Negative sentiment

model = SentimentClassifier(vocab_size, embedding_dim, hidden_dim, num_classes)

# Parameter breakdown:
embedding_params = vocab_size * embedding_dim
lstm_params = 2 * 4 * ((embedding_dim + hidden_dim) * hidden_dim + 2 * hidden_dim)  # 2 directions × 4 gates × (W_ih + W_hh + biases)
fc_params = hidden_dim * 2 * num_classes

print(f'Model parameter breakdown:')
print(f'  Embedding: {vocab_size} × {embedding_dim} = {embedding_params:,}')
print(f'  LSTM: ~{lstm_params:,} (bidirectional)')
print(f'  FC: {hidden_dim*2} × {num_classes} = {fc_params:,}')
print(f'  Total: {sum(p.numel() for p in model.parameters()):,}')

# INFERENCE EXAMPLE:
batch_size = 4
seq_len = 20
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

output = model(token_ids)
probs = torch.softmax(output, dim=1)

print(f'\nInference:')
print(f'  Input shape: {token_ids.shape} (batch of 4, max 20 tokens)')
print(f'  Output shape: {output.shape} (batch of 4, 2 classes)')
print(f'  Probabilities:\n{probs}')
print(f'  Predictions: {torch.argmax(probs, dim=1)}')
Embedding Best Practices
  • Dimension selection: Start with 64-128 for small datasets, 256-512 for large; follow rule of thumb ≈ sqrt(vocab_size)
  • Pretrained vs random: Pretrained significantly better for <10k training sentences; beneficial even with more data
  • Padding handling: Always set padding_idx=0 to prevent padding tokens from affecting predictions
  • Freezing strategy: Freeze if domain-independent task, fine-tune if domain-specific terminology matters
  • Large vocabularies: Consider subword models (BPE, WordPiece) to reduce vocabulary size while preserving morphology
  • Out-of-vocabulary: Use <UNK> token for unknown words; some frameworks (FastText) handle this better
  • Similarity metrics: Use cosine similarity for comparing embeddings; training places similar words close together (see the sketch below)
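
A minimal sketch of the cosine-similarity check from the last bullet, using the built-in F.cosine_similarity. The token IDs 42 and 156 are the hypothetical "love"/"like" examples used earlier; the value is random at initialization and only becomes meaningful after training.

import torch
import torch.nn as nn
import torch.nn.functional as F

embedding = nn.Embedding(10000, 128)

# Look up two hypothetical token IDs
vec_a = embedding(torch.tensor([42]))    # (1, 128)
vec_b = embedding(torch.tensor([156]))   # (1, 128)

# Cosine similarity: values near 1 mean the vectors point in similar directions
similarity = F.cosine_similarity(vec_a, vec_b, dim=1)
print(f'Cosine similarity: {similarity.item():.4f}')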

Advanced Training Techniques

Gradient Clipping

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop with gradient clipping
data = torch.randn(32, 10)
labels = torch.randint(0, 5, (32,))

optimizer.zero_grad()
outputs = model(data)
loss = criterion(outputs, labels)
loss.backward()

# Clip gradients to prevent exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
print('Gradients clipped to max norm 1.0')

Learning Rate Scheduling

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

model = nn.Linear(10, 5)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# StepLR: Decay LR by gamma every step_size epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.1)

# Training loop
for epoch in range(30):
    # ... training code ...
    
    # Update learning rate
    scheduler.step()
    print(f'Epoch {epoch}, LR: {optimizer.param_groups[0]["lr"]:.6f}')

# ReduceLROnPlateau: Reduce LR when metric plateaus
scheduler_plateau = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

for epoch in range(30):
    # ... training code ...
    val_loss = 0.5  # Example validation loss
    
    # Reduce LR if validation loss doesn't improve
    scheduler_plateau.step(val_loss)

Early Stopping

import torch
import torch.nn as nn
import torch.optim as optim

class EarlyStopping:
    def __init__(self, patience=7, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
    
    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss > self.best_loss - self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss
            self.counter = 0

# Usage
model = nn.Linear(10, 5)
optimizer = optim.Adam(model.parameters())
early_stopping = EarlyStopping(patience=10)

for epoch in range(100):
    # ... training code ...
    val_loss = 0.5  # Example validation loss
    
    early_stopping(val_loss)
    if early_stopping.early_stop:
        print(f'Early stopping at epoch {epoch}')
        break

Custom Layers & Loss Functions

Creating Custom Layers

import torch
import torch.nn as nn

class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features, use_bias=True):
        super(CustomLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        # Initialize weights and bias
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if use_bias:
            self.bias = nn.Parameter(torch.randn(out_features))
        else:
            self.register_parameter('bias', None)
    
    def forward(self, x):
        # Custom linear transformation: y = xW^T + b
        output = torch.matmul(x, self.weight.t())
        if self.bias is not None:
            output += self.bias
        return output

# Use custom layer
layer = CustomLinear(10, 5)
x = torch.randn(32, 10)
output = layer(x)
print(f'Output shape: {output.shape}')  # torch.Size([32, 5])

Custom Activation Function

import torch
import torch.nn as nn

class Swish(nn.Module):
    """Swish activation: x * sigmoid(x)"""
    def forward(self, x):
        return x * torch.sigmoid(x)

# Use custom activation
model = nn.Sequential(
    nn.Linear(10, 20),
    Swish(),  # Custom activation
    nn.Linear(20, 5)
)

x = torch.randn(32, 10)
output = model(x)
print(f'Output shape: {output.shape}')  # torch.Size([32, 5])

Custom Loss Function

import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    """Focal Loss for handling class imbalance"""
    def __init__(self, alpha=1, gamma=2):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.ce_loss = nn.CrossEntropyLoss(reduction='none')
    
    def forward(self, inputs, targets):
        ce_loss = self.ce_loss(inputs, targets)
        pt = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return focal_loss.mean()

# Use custom loss
model = nn.Linear(10, 5)
criterion = FocalLoss(alpha=1, gamma=2)

outputs = model(torch.randn(32, 10))
targets = torch.randint(0, 5, (32,))
loss = criterion(outputs, targets)
print(f'Focal loss: {loss.item():.4f}')

Model Interpretability (Grad-CAM)

Grad-CAM (Gradient-weighted Class Activation Mapping) visualizes which parts of an image a CNN focuses on for predictions.

Implementing Grad-CAM

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradCAM:
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None
        
        # Register hooks
        target_layer.register_forward_hook(self.save_activation)
        target_layer.register_full_backward_hook(self.save_gradient)  # register_backward_hook is deprecated
    
    def save_activation(self, module, input, output):
        self.activations = output
    
    def save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0]
    
    def generate_cam(self, input_image, target_class):
        # Forward pass
        output = self.model(input_image)
        
        # Backward pass for target class
        self.model.zero_grad()
        class_loss = output[0, target_class]
        class_loss.backward()
        
        # Get gradients and activations
        gradients = self.gradients.detach()
        activations = self.activations.detach()
        
        # Global average pooling of gradients
        weights = torch.mean(gradients, dim=(2, 3), keepdim=True)
        
        # Weighted combination of activation maps
        cam = torch.sum(weights * activations, dim=1, keepdim=True)
        cam = F.relu(cam)  # Only positive influence
        
        # Normalize
        cam = cam - cam.min()
        cam = cam / cam.max()
        
        return cam

# Example usage with ResNet
import torchvision.models as models

model = models.resnet18(pretrained=True)
model.eval()

# Get last convolutional layer
target_layer = model.layer4[-1].conv2

# Create Grad-CAM
grad_cam = GradCAM(model, target_layer)

# Generate CAM for an image
image = torch.randn(1, 3, 224, 224)  # Example image
cam = grad_cam.generate_cam(image, target_class=243)  # 243 = "bull mastiff"
print(f'CAM shape: {cam.shape}')  # torch.Size([1, 1, 7, 7])
Interpretability Tools: For production use, consider libraries like captum (PyTorch's official interpretability library) which provides Grad-CAM, Integrated Gradients, and many other attribution methods out of the box.
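
As a rough sketch of what the captum workflow can look like (assuming captum is installed via pip install captum; check its documentation for current signatures), Integrated Gradients attribution on a pretrained ResNet might be wired up like this:

import torch
import torchvision.models as models
from captum.attr import IntegratedGradients

model = models.resnet18(pretrained=True)
model.eval()

ig = IntegratedGradients(model)
image = torch.randn(1, 3, 224, 224, requires_grad=True)

# Attribute the prediction for class 243 back to the input pixels
attributions = ig.attribute(image, target=243)
print(f'Attribution map shape: {attributions.shape}')  # Same shape as the input image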

Transformer Architectures (nn.Transformer)

Transformers have revolutionized deep learning, particularly in NLP and increasingly in computer vision. PyTorch's nn.Transformer module provides an efficient, ready-to-use implementation of the complete Transformer architecture with multi-head self-attention, feed-forward networks, and layer normalization.

Transformer Power: Transformers use self-attention to process sequences in parallel (unlike RNNs which process sequentially), enabling better training efficiency and longer-range dependency capture. Each token can directly attend to every other token in the sequence.

Key Components: The Transformer consists of an encoder and decoder stack, each with multi-head attention, feed-forward networks, and residual connections.

import torch
import torch.nn as nn

# Create a Transformer model
batch_size = 32
seq_length = 10
d_model = 512
num_heads = 8
num_layers = 6

transformer = nn.Transformer(
    d_model=d_model,
    nhead=num_heads,
    num_encoder_layers=num_layers,
    num_decoder_layers=num_layers,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True
)

# Input: (batch_size, seq_length, d_model)
src = torch.randn(batch_size, seq_length, d_model)
tgt = torch.randn(batch_size, seq_length, d_model)

# Forward pass
output = transformer(src, tgt)
print(f"Output shape: {output.shape}")  # [32, 10, 512]

Creating a Custom Transformer-based Model:

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=100, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        
        # Create position encodings
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                             (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x):
        # x shape: (batch, seq_length, d_model)
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

class TransformerSequenceModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, 
                 num_layers=6, dim_feedforward=2048, max_seq_length=100):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=0.1,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(self.transformer, num_layers=num_layers)
        self.output_layer = nn.Linear(d_model, vocab_size)
    
    def forward(self, x):
        # x shape: (batch, seq_length)
        embedded = self.embedding(x)  # (batch, seq_length, d_model)
        pos_encoded = self.pos_encoding(embedded)  # Add positional info
        encoded = self.encoder(pos_encoded)  # (batch, seq_length, d_model)
        output = self.output_layer(encoded)  # (batch, seq_length, vocab_size)
        return output

# Create and use the model
vocab_size = 10000
model = TransformerSequenceModel(vocab_size, d_model=512, num_heads=8, num_layers=6)

# Forward pass
batch_size = 16
seq_length = 20
input_ids = torch.randint(0, vocab_size, (batch_size, seq_length))
output = model(input_ids)
print(f"Output shape: {output.shape}")  # [16, 20, 10000]

Masking for Autoregressive Generation: For sequence-to-sequence tasks, we need to prevent the model from attending to future tokens during training.

import torch
import torch.nn as nn

def create_causal_mask(seq_length, device):
    """Create a mask preventing attention to future tokens"""
    mask = torch.triu(torch.ones(seq_length, seq_length, device=device) * float('-inf'), diagonal=1)
    return mask

# Use mask in Transformer
transformer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
src = torch.randn(32, 20, 512)  # (batch, seq_length, d_model)

# Create causal mask
mask = create_causal_mask(20, src.device)

# Forward with mask
output = transformer(src, src_mask=mask)
print(f"Output shape: {output.shape}")  # [32, 20, 512]
Transformer Components
  • Multi-Head Attention: Multiple representation subspaces allow the model to attend to information from different representation subspaces at different positions
  • Feed-Forward Networks: Two linear layers with ReLU activation applied to each position separately and identically
  • Positional Encoding: Sine and cosine functions at different frequencies encode token positions since attention is position-agnostic
  • Layer Normalization: Applied before (pre-norm) or after (post-norm) sublayers to stabilize training
  • Residual Connections: Skip connections around each sublayer enable training of very deep networks (see the sketch below)
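
To make the residual-connection and normalization bullets concrete, here is a minimal hand-written pre-norm encoder block; nn.TransformerEncoderLayer implements the equivalent internally, so this is only an illustrative sketch:

import torch
import torch.nn as nn

class PreNormEncoderBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, dim_feedforward),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_feedforward, d_model),
        )

    def forward(self, x):
        # Residual connection around pre-normalized self-attention
        normed = self.norm1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out
        # Residual connection around pre-normalized feed-forward network
        x = x + self.ffn(self.norm2(x))
        return x

block = PreNormEncoderBlock()
x = torch.randn(8, 20, 512)
print(block(x).shape)  # torch.Size([8, 20, 512])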

Transformer Decoder for Sequence Generation:

import torch
import torch.nn as nn

class TransformerDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, 
                 num_layers=6, dim_feedforward=2048, max_seq_length=100):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)  # PositionalEncoding from the earlier example
        
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=0.1,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.output_layer = nn.Linear(d_model, vocab_size)
    
    def forward(self, tgt, memory, tgt_mask=None, memory_mask=None):
        # tgt shape: (batch, tgt_seq_length) - integer token IDs (embedded below)
        # memory shape: (batch, src_seq_length, d_model)
        
        tgt_embedded = self.embedding(tgt)
        tgt_encoded = self.pos_encoding(tgt_embedded)
        
        decoded = self.decoder(
            tgt_encoded, 
            memory,
            tgt_mask=tgt_mask,
            memory_mask=memory_mask
        )
        output = self.output_layer(decoded)
        return output

# Create encoder-decoder architecture
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6
)
decoder = TransformerDecoder(vocab_size=10000, d_model=512, num_heads=8, num_layers=6)

# Encode source sequence
src = torch.randn(16, 30, 512)  # (batch, src_len, d_model)
memory = encoder(src)

# Decode with target and encoded source
tgt = torch.randint(0, 10000, (16, 20))  # (batch, tgt_len)
output = decoder(tgt, memory)
print(f"Decoder output shape: {output.shape}")  # [16, 20, 10000]

Vision Transformers (ViT)

Vision Transformers apply the Transformer architecture directly to image patches, treating them like sequence tokens. This approach has achieved state-of-the-art results on image classification while being more efficient to train on large datasets compared to CNNs.

ViT Innovation: Instead of using convolutional filters, ViT divides an image into non-overlapping patches, embeds them linearly, and processes them with standard Transformers. This enables the model to learn global dependencies from the start, rather than building them up through multiple convolutional layers.

Basic Vision Transformer Implementation:

import torch
import torch.nn as nn
import math

class VisionTransformer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, num_classes=1000, 
                 d_model=768, num_heads=12, num_layers=12, dim_feedforward=3072):
        super().__init__()
        
        # Patch embedding
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size  # 3 channels * patch_size * patch_size
        
        self.patch_embedding = nn.Linear(patch_dim, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        
        # Positional embeddings
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, d_model))
        
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=0.1,
            batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        
        # Classification head
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, num_classes)
        
        self.patch_size = patch_size
    
    def forward(self, x):
        # x shape: (batch, 3, 224, 224)
        batch_size = x.shape[0]
        
        # Split into non-overlapping patches and flatten each one
        # (batch, 3, H, W) → (batch, num_patches, patch_dim)
        p = self.patch_size
        h, w = x.shape[2] // p, x.shape[3] // p
        x = x.reshape(batch_size, 3, h, p, w, p)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(batch_size, h * w, 3 * p * p)
        
        x = self.patch_embedding(x)  # (batch, num_patches, d_model)
        
        # Add class token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # (batch, num_patches+1, d_model)
        
        # Add positional embeddings
        x = x + self.pos_embedding
        
        # Transformer encoder
        x = self.transformer(x)  # (batch, num_patches+1, d_model)
        
        # Classification from [CLS] token
        x = self.norm(x[:, 0])  # Take [CLS] token
        x = self.head(x)  # (batch, num_classes)
        
        return x

# Create and use ViT
vit = VisionTransformer(image_size=224, patch_size=16, num_classes=1000, d_model=768, num_heads=12, num_layers=12)

# Forward pass
images = torch.randn(8, 3, 224, 224)  # (batch, channels, height, width)
logits = vit(images)
print(f"Output shape: {logits.shape}")  # [8, 1000]
print(f"Total parameters: {sum(p.numel() for p in vit.parameters()) / 1e6:.1f}M")

Using Pretrained Vision Transformers from torchvision:

import torch
import torchvision.models as models
from torchvision.transforms import Compose, Resize, Normalize, ToTensor

# Load pretrained ViT-Base model
vit = models.vit_b_16(pretrained=True, progress=True)

# Freeze all pretrained parameters (the new classification head added below stays trainable)
for param in vit.parameters():
    param.requires_grad = False

# Replace classification head for custom task
num_classes = 10
vit.heads.head = torch.nn.Linear(768, num_classes)

# Prepare input
transform = Compose([
    Resize((224, 224)),
    ToTensor(),
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Forward pass
image = torch.randn(1, 3, 224, 224)
output = vit(image)
print(f"Output shape: {output.shape}")  # [1, 10]

Patch-level Feature Extraction: Vision Transformers learn rich patch-level representations useful for detection and segmentation tasks.

import torch
import torch.nn as nn
import torchvision.models as models

class ViTPatchFeatures(nn.Module):
    def __init__(self, vit_model):
        super().__init__()
        self.vit = vit_model
    
    def forward(self, x):
        # Remove classification head
        batch_size = x.shape[0]
        
        # Embed patches
        x = self.vit._process_input(x)
        n, _, c = x.shape
        
        # Expand the class token to the full batch
        batch_class_token = self.vit.class_token.expand(batch_size, -1, -1)
        x = torch.cat([batch_class_token, x], dim=1)
        
        # Apply transformer
        x = self.vit.encoder(x)
        
        # Return patch features (excluding class token)
        return x[:, 1:, :]  # (batch, num_patches, d_model)

# Extract patch features for custom downstream tasks
vit = models.vit_b_16(pretrained=True)
patch_extractor = ViTPatchFeatures(vit)

# Get patch-level features
images = torch.randn(4, 3, 224, 224)
patch_features = patch_extractor(images)
print(f"Patch features shape: {patch_features.shape}")  # [4, 196, 768]

# Use for object detection or segmentation
# Each patch can be processed independently with a detection head
detection_head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Linear(256, 4)  # x, y, width, height
)

detections = detection_head(patch_features)
print(f"Detections shape: {detections.shape}")  # [4, 196, 4]
ViT Advantages
  • Global Receptive Field: Each patch can attend to all other patches from the first layer, unlike CNNs which need multiple layers
  • Parallelizable: No sequential convolutions; all patches processed in parallel
  • Scalable: Scales better to very large models and datasets compared to CNNs
  • Transfer Learning: Pretrained on ImageNet-21k, achieves excellent results with minimal fine-tuning
  • Interpretability: Attention weights show which patches the model focuses on for predictions

Fine-tuning Vision Transformers for Custom Tasks:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import torchvision.models as models

# Setup
vit = models.vit_b_16(pretrained=True)

# Replace head for binary classification
vit.heads.head = nn.Linear(768, 2)

# Freeze the pretrained backbone, keep the new classification head trainable
for param in vit.parameters():
    param.requires_grad = False
for param in vit.heads.head.parameters():
    param.requires_grad = True

# Training setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vit = vit.to(device)

optimizer = optim.AdamW(vit.heads.head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Create dummy data
X = torch.randn(100, 3, 224, 224)
y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=8)

# Training loop
for epoch in range(3):
    total_loss = 0
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)
        
        # Forward
        outputs = vit(images)
        loss = criterion(outputs, labels)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
    
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(dataloader):.4f}")

Advanced Attention Mechanisms

PyTorch's nn.MultiheadAttention module is highly flexible and can be customized for various attention patterns. Understanding how to use and modify attention mechanisms is crucial for building state-of-the-art models.

Attention is All You Need: The attention mechanism computes a weighted sum of values based on the similarity between queries and keys. Multi-head attention applies this operation multiple times in parallel, capturing different types of relationships.
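
As a minimal sketch of that weighted sum, PyTorch 2.0+ exposes the fused helper F.scaled_dot_product_attention; older versions need the manual formula shown in the custom attention example later in this section.

import torch
import torch.nn.functional as F

# One attention head: 2 sequences of 5 tokens, 64-dimensional queries/keys/values
query = torch.randn(2, 5, 64)
key = torch.randn(2, 5, 64)
value = torch.randn(2, 5, 64)

# softmax(Q @ K^T / sqrt(d_k)) @ V, computed in one fused call
out = F.scaled_dot_product_attention(query, key, value)
print(out.shape)  # torch.Size([2, 5, 64])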

Standard Multi-Head Attention:

import torch
import torch.nn as nn

# Create multi-head attention
d_model = 512
num_heads = 8

mha = nn.MultiheadAttention(
    embed_dim=d_model,
    num_heads=num_heads,
    dropout=0.1,
    batch_first=True  # Expect (batch, seq_len, d_model) instead of (seq_len, batch, d_model)
)

# Prepare inputs
batch_size = 32
seq_length = 20

query = torch.randn(batch_size, seq_length, d_model)
key = torch.randn(batch_size, seq_length, d_model)
value = torch.randn(batch_size, seq_length, d_model)

# Forward pass
attn_output, attn_weights = mha(query, key, value)

print(f"Attention output shape: {attn_output.shape}")  # [32, 20, 512]
print(f"Attention weights shape: {attn_weights.shape}")  # [32, 20, 20]

Cross-Attention (Query from one sequence, Key/Value from another):

import torch
import torch.nn as nn

# Setup
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Encoder output (context for cross-attention)
encoder_output = torch.randn(16, 30, 512)  # (batch, src_len, d_model)

# Decoder query
decoder_query = torch.randn(16, 20, 512)  # (batch, tgt_len, d_model)

# Cross-attention: Query from decoder, Key/Value from encoder
cross_attn_output, cross_attn_weights = mha(
    query=decoder_query,
    key=encoder_output,
    value=encoder_output
)

print(f"Cross-attention output shape: {cross_attn_output.shape}")  # [16, 20, 512]
print(f"Cross-attention weights shape: {cross_attn_weights.shape}")  # [16, 20, 30]

Attention with Masking (Causal and Padding Masks):

import torch
import torch.nn as nn

def create_causal_mask(seq_length, device):
    """Prevent attending to future positions"""
    mask = torch.triu(torch.ones(seq_length, seq_length, device=device) * float('-inf'), diagonal=1)
    return mask

def create_padding_mask(seq_lengths, max_length, device):
    """Mask padded positions"""
    mask = torch.arange(max_length, device=device).unsqueeze(0) < seq_lengths.unsqueeze(1)
    return mask

# Setup
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Data
batch_size = 16
seq_length = 20

query = torch.randn(batch_size, seq_length, 512)
key = torch.randn(batch_size, seq_length, 512)
value = torch.randn(batch_size, seq_length, 512)

# Create causal mask (for autoregressive generation)
causal_mask = create_causal_mask(seq_length, query.device)

# Forward with causal mask
attn_output, attn_weights = mha(query, key, value, attn_mask=causal_mask)

print(f"Output shape: {attn_output.shape}")  # [16, 20, 512]

# For padding mask, typical usage with variable length sequences
seq_lengths = torch.tensor([20, 18, 15, 20, 19, 17, 20, 16, 19, 20, 18, 17, 20, 19, 15, 20])
padding_mask = create_padding_mask(seq_lengths, seq_length, query.device)

# Padding mask needs to be inverted (False for valid, True for padding)
key_padding_mask = ~padding_mask

attn_output, _ = mha(query, key, value, key_padding_mask=key_padding_mask)
print(f"Output with padding mask: {attn_output.shape}")  # [16, 20, 512]

Custom Attention Layer with Scaled Dot-Product:

import torch
import torch.nn as nn
import math

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super().__init__()
        self.d_k = d_k
    
    def forward(self, query, key, value, mask=None):
        # Scaled dot product
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        # Apply mask if provided
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply softmax
        attention_weights = torch.softmax(scores, dim=-1)
        
        # Apply dropout for regularization
        attention_weights = torch.nn.functional.dropout(attention_weights, p=0.1, training=self.training)
        
        # Weighted sum of values
        output = torch.matmul(attention_weights, value)
        
        return output, attention_weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.linear_q = nn.Linear(d_model, d_model)
        self.linear_k = nn.Linear(d_model, d_model)
        self.linear_v = nn.Linear(d_model, d_model)
        self.linear_out = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention(self.d_k)
    
    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        
        # Linear transformations and reshape for multiple heads
        Q = self.linear_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.linear_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.linear_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Apply attention
        attn_output, attn_weights = self.attention(Q, K, V, mask)
        
        # Concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # Final linear transformation
        output = self.linear_out(attn_output)
        
        return output, attn_weights

# Use custom attention
custom_mha = MultiHeadAttention(d_model=512, num_heads=8)

query = torch.randn(32, 20, 512)
key = torch.randn(32, 20, 512)
value = torch.randn(32, 20, 512)

output, weights = custom_mha(query, key, value)
print(f"Custom MHA output shape: {output.shape}")  # [32, 20, 512]

Sparse Attention (Local Windows): For long sequences, standard attention becomes prohibitively expensive (O(n²) complexity). Sparse attention patterns such as local windows reduce this cost.

import torch
import torch.nn as nn

class SparseAttention(nn.Module):
    def __init__(self, d_model, num_heads, window_size=64):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.window_size = window_size
        self.d_k = d_model // num_heads
        
        self.linear_q = nn.Linear(d_model, d_model)
        self.linear_k = nn.Linear(d_model, d_model)
        self.linear_v = nn.Linear(d_model, d_model)
        self.linear_out = nn.Linear(d_model, d_model)
    
    def forward(self, query, key, value):
        batch_size = query.shape[0]
        seq_length = query.shape[1]
        
        Q = self.linear_q(query).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        K = self.linear_k(key).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        V = self.linear_v(value).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
        # Local attention window
        window_size = self.window_size
        scores = torch.zeros(batch_size, self.num_heads, seq_length, seq_length, device=query.device)
        
        for i in range(seq_length):
            start = max(0, i - window_size // 2)
            end = min(seq_length, i + window_size // 2 + 1)
            
            # Attention only within window
            local_scores = torch.matmul(Q[:, :, i:i+1, :], K[:, :, start:end, :].transpose(-2, -1)) / (self.d_k ** 0.5)
            scores[:, :, i, start:end] = torch.softmax(local_scores.squeeze(2), dim=-1)
        
        # Apply sparse attention
        attn_output = torch.matmul(scores, V)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        output = self.linear_out(attn_output)
        
        return output

# Use sparse attention
sparse_attn = SparseAttention(d_model=512, num_heads=8, window_size=64)

# Long sequence
long_query = torch.randn(4, 2048, 512)  # 2048 tokens
output = sparse_attn(long_query, long_query, long_query)
print(f"Sparse attention output shape: {output.shape}")  # [4, 2048, 512]
print(f"Computational complexity: O(n * window_size) instead of O(n²)")
Attention Variants
  • Self-Attention: Query, Key, Value from same source; captures internal dependencies
  • Cross-Attention: Query from different source than Key/Value; for encoder-decoder fusion
  • Causal Attention: Masked to prevent attending to future tokens; for autoregressive generation
  • Sparse Attention: Local window attention; reduces O(n²) to O(n*w) for long sequences
  • Linear Attention: Kernel-based approximation; O(n) complexity but lower quality
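
As a standalone illustration of the causal variant above (a minimal sketch, not part of the MultiHeadAttention class defined earlier): build a lower-triangular mask with torch.tril and apply it to the raw scores before the softmax, so position i can only attend to positions up to i. The toy shapes and the masked_fill convention are assumptions for this example.

import torch
import torch.nn.functional as F

# Toy dimensions for the sketch
batch_size, num_heads, seq_length, d_k = 2, 4, 6, 16

Q = torch.randn(batch_size, num_heads, seq_length, d_k)
K = torch.randn(batch_size, num_heads, seq_length, d_k)
V = torch.randn(batch_size, num_heads, seq_length, d_k)

# Lower-triangular causal mask: True = allowed, False = future position (blocked)
causal_mask = torch.tril(torch.ones(seq_length, seq_length)).bool()

# Scaled dot-product scores, then block future positions with -inf before softmax
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
scores = scores.masked_fill(~causal_mask, float('-inf'))
weights = F.softmax(scores, dim=-1)

output = torch.matmul(weights, V)
print(output.shape)  # torch.Size([2, 4, 6, 16])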

Deployment with TorchScript

TorchScript converts PyTorch models into an intermediate representation that can run independently of Python, enabling production deployment in C++.

Tracing a Model

import torch
import torch.nn as nn

# Define a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(10, 5)
    
    def forward(self, x):
        return torch.relu(self.linear(x))

model = SimpleModel()
model.eval()

# Create example input
example_input = torch.randn(1, 10)

# Trace the model
traced_model = torch.jit.trace(model, example_input)

# Save traced model
traced_model.save('model_traced.pt')
print('Model traced and saved')

# Load and use traced model
loaded_model = torch.jit.load('model_traced.pt')
output = loaded_model(example_input)
print(f'Output: {output}')

Scripting a Model (for Control Flow)

import torch
import torch.nn as nn

class ModelWithControlFlow(nn.Module):
    def __init__(self):
        super(ModelWithControlFlow, self).__init__()
        self.linear = nn.Linear(10, 5)
    
    def forward(self, x):
        # Control flow - use scripting instead of tracing
        if x.sum() > 0:
            return torch.relu(self.linear(x))
        else:
            return torch.sigmoid(self.linear(x))

model = ModelWithControlFlow()
model.eval()

# Script the model (handles control flow)
scripted_model = torch.jit.script(model)

# Save scripted model
scripted_model.save('model_scripted.pt')
print('Model scripted and saved')

# Load and use
loaded_model = torch.jit.load('model_scripted.pt')
output = loaded_model(torch.randn(1, 10))
print(f'Output: {output}')

Optimizing for Mobile Deployment

import torch
import torch.nn as nn
from torch.utils.mobile_optimizer import optimize_for_mobile

# Create and trace model
model = nn.Sequential(
    nn.Linear(10, 20),
    nn.ReLU(),
    nn.Linear(20, 5)
)
model.eval()

example = torch.randn(1, 10)
traced_model = torch.jit.trace(model, example)

# Optimize for mobile
mobile_model = optimize_for_mobile(traced_model)

# Save optimized model
mobile_model._save_for_lite_interpreter('model_mobile.ptl')
print('Model optimized and saved for mobile deployment')

# Model can now be loaded in iOS/Android apps
Production Deployment Options: For web services, use TorchServe (official PyTorch serving framework) or ONNX Runtime. For edge devices, use TorchScript with mobile optimization. For maximum performance, convert to ONNX and use TensorRT (NVIDIA GPUs).
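
As a hedged sketch of the ONNX route mentioned above: torch.onnx.export writes a model plus an example input to an .onnx file that ONNX Runtime or TensorRT can consume. The model, file name, and dynamic batch axis below are illustrative choices, and depending on your PyTorch version the export may require the onnx package to be installed.

import torch
import torch.nn as nn

# Small illustrative model (mirrors the traced example above)
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU())
model.eval()

example_input = torch.randn(1, 10)

# Export to ONNX with a dynamic batch dimension
torch.onnx.export(
    model,
    example_input,
    'model.onnx',
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch'}, 'output': {0: 'batch'}},
)
print('Exported model.onnx')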

Best Practices & Common Pitfalls

Essential Best Practices

Model Development

  • Start simple: Begin with a small model, ensure it overfits on a tiny dataset (proves implementation works), then scale up.
  • Normalize inputs: Always normalize/standardize input data (mean=0, std=1) for faster convergence.
  • Batch normalization: Add batch norm layers after linear/conv layers for more stable training (see the sketch after this list).
  • Dropout: Use dropout (0.2-0.5) to prevent overfitting, especially in fully connected layers.
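
As a concrete example of the layer ordering suggested above, here is a minimal sketch with placeholder sizes: each hidden block applies linear, then batch norm, then the activation, then dropout.

import torch.nn as nn

# Linear -> BatchNorm -> ReLU -> Dropout per hidden block (sizes are placeholders)
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 10),
)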

Training

  • Learning rate: Most important hyperparameter. Start with 0.001 for Adam, 0.01-0.1 for SGD. Use learning rate scheduling.
  • Gradient clipping: Prevent exploding gradients with torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0), called after loss.backward() and before optimizer.step().
  • Early stopping: Stop training when validation loss stops improving (patience ~10 epochs); see the sketch after this list.
  • Checkpointing: Save model checkpoints every N epochs to avoid losing progress.
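
A minimal sketch of early stopping combined with checkpointing the best model so far. train_one_epoch() and evaluate() are hypothetical helpers standing in for your own training and validation code; the patience value and file name are placeholders.

import torch

best_val_loss = float('inf')
patience = 10
epochs_without_improvement = 0

for epoch in range(100):
    train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    val_loss = evaluate(model, val_loader)                        # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        torch.save(model.state_dict(), 'best_model.pt')  # checkpoint the best weights
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f'Early stopping at epoch {epoch}')
            break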

Common Pitfalls

  • Forgetting .eval(): Always set model to eval mode during validation/testing to disable dropout and batch norm training behavior.
  • Not zeroing gradients: Always call optimizer.zero_grad() before backward pass—gradients accumulate by default!
  • Device mismatch: Ensure model and data are on the same device (both CPU or both GPU).
  • Wrong loss function: Use CrossEntropyLoss for classification (includes softmax), MSELoss for regression.
  • Data leakage: Never include test data in training. Don't normalize train and test together: fit on train, transform test (see the sketch after this list).
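
As a concrete illustration of the last point, a minimal sketch that fits normalization statistics on the training split only and reuses them for the test split (the tensors are placeholders):

import torch

X_train = torch.randn(800, 20)  # placeholder training features
X_test = torch.randn(200, 20)   # placeholder test features

# Fit the statistics on the training split only
mean = X_train.mean(dim=0)
std = X_train.std(dim=0)

# Apply the *same* statistics to both splits (never refit on test data)
X_train_norm = (X_train - mean) / (std + 1e-8)
X_test_norm = (X_test - mean) / (std + 1e-8)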

Debugging

  • Check shapes: Print tensor shapes frequently. Most bugs are shape mismatches.
  • Overfit small batch: Ensure the model can overfit 1-2 batches (loss → 0). If not, there's a bug in the implementation; see the sketch after this list.
  • Gradient checks: Verify gradients are flowing after backward(): print([p.grad.abs().mean() for p in model.parameters() if p.grad is not None])
  • Learning rate finder: Plot loss vs learning rate to find optimal LR range.
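
A minimal sketch of the "overfit a small batch" sanity check: train repeatedly on one fixed batch and confirm the loss approaches zero. The model and data are placeholders.

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One fixed batch (placeholder data)
x = torch.randn(16, 20)
y = torch.randint(0, 10, (16,))

for step in range(500):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

print(f'Final loss on the single batch: {loss.item():.4f}')  # should be near zero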

Performance Optimization

  • GPU utilization: Use larger batch sizes to maximize GPU usage. Monitor with nvidia-smi.
  • Mixed precision: Enable AMP for 2-3x speedup on modern GPUs (Volta, Turing, Ampere and newer); see the sketch after this list.
  • DataLoader workers: Use num_workers=4-8 for parallel data loading on multi-core CPUs.
  • Pin memory: Enable pin_memory=True in DataLoader when using GPU for faster transfers.
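
A minimal sketch of automatic mixed precision with torch.cuda.amp (newer releases expose the same API under torch.amp). The model, data, and loss are placeholders, and without a CUDA device the enabled flag simply turns AMP off.

import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
use_amp = device.type == 'cuda'

model = nn.Linear(20, 10).to(device)
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 20, device=device)          # placeholder batch
y = torch.randint(0, 10, (32,), device=device)  # placeholder labels

for step in range(10):
    optimizer.zero_grad()
    # Run the forward pass in reduced precision where supported
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = criterion(model(x), y)
    # Scale the loss so small float16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
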
Best Practices Production Ready Debugging
Next Steps: You now have a solid foundation in PyTorch! Practice by implementing classic papers (ResNet, LSTM), participate in Kaggle competitions, or build your own projects. The PyTorch documentation and community forums are excellent resources for continued learning. Remember: the best way to learn deep learning is by doing—start building!

PyTorch Complete Reference

Comprehensive cheat sheet for essential PyTorch APIs, code patterns, and best practices.

Quick Start Code Patterns

1. Build & Train a Model

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(64, 10)
)
opt = optim.Adam(model.parameters())

x_batch = torch.randn(32, 20)           # placeholder feature batch
y_batch = torch.randint(0, 10, (32,))   # placeholder class labels

for epoch in range(10):
    opt.zero_grad()
    pred = model(x_batch)
    loss = F.cross_entropy(pred, y_batch)
    loss.backward()
    opt.step()

2. Custom Model Class

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(20, 64)
        self.fc2 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

model = MyModel()
device = torch.device('cuda')
model = model.to(device)

3. Training Loop with Validation

for epoch in range(epochs):
    model.train()  # switch back to train mode at the start of every epoch
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)

        opt.zero_grad()
        pred = model(x_batch)
        loss = criterion(pred, y_batch)
        loss.backward()
        opt.step()

    # Validation (evaluate() is your own helper that loops over val_loader)
    model.eval()
    with torch.no_grad():
        val_loss = evaluate(val_loader)

4. Save & Load Checkpoint

# Save
checkpoint = {
    'epoch': epoch,
    'model': model.state_dict(),
    'optimizer': opt.state_dict(),
    'loss': loss
}
torch.save(checkpoint, 'model.pt')

# Load
checkpoint = torch.load('model.pt')
model.load_state_dict(checkpoint['model'])
opt.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']

Architecture & Hyperparameter Decisions

Activation Functions (When to Use):

  • F.relu() - Hidden layers (default, fast)
  • F.sigmoid() - Binary classification output probability
  • F.softmax() - Multi-class output probabilities (not needed before F.cross_entropy, which applies log-softmax internally)
  • F.tanh() - RNNs, when [-1, 1] bounds are needed
  • F.leaky_relu() - Fix "dead" ReLU neurons
  • F.elu() - Smoother alternative to ReLU

Loss Functions (By Task):

  • F.cross_entropy() - Multi-class classification (functional form; expects raw logits)
  • nn.CrossEntropyLoss() - Multi-class classification (module form; supports class weights)
  • F.binary_cross_entropy() - Binary classification on probabilities; prefer F.binary_cross_entropy_with_logits() on raw logits (see the sketch after this list)
  • F.mse_loss() / F.l1_loss() - Regression (MSE / MAE)
  • F.nll_loss() - Negative log likelihood (expects log-probabilities, e.g. from F.log_softmax)
  • F.kl_div() - KL divergence
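
A small sketch contrasting the two most common cases above: F.cross_entropy takes raw logits plus integer class indices, while F.binary_cross_entropy_with_logits takes one raw logit per sample plus float 0/1 targets. The shapes are placeholders.

import torch
import torch.nn.functional as F

# Multi-class: raw logits + integer class labels (no softmax needed)
logits = torch.randn(8, 5)           # batch of 8, 5 classes
targets = torch.randint(0, 5, (8,))
loss_multiclass = F.cross_entropy(logits, targets)

# Binary: one raw logit per sample + float 0/1 targets
binary_logits = torch.randn(8)
binary_targets = torch.randint(0, 2, (8,)).float()
loss_binary = F.binary_cross_entropy_with_logits(binary_logits, binary_targets)

print(loss_multiclass.item(), loss_binary.item())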

Optimizers (When to Use):

  • optim.Adam - ⭐ Default, works for most problems
  • optim.AdamW - Better for Transformers & vision
  • optim.SGD - Stable with momentum, good with schedules
  • optim.RMSprop - Good for RNNs
  • LAMB - Large-batch training (not in torch.optim; available from third-party packages)

Regularization & Overfitting:

  • nn.Dropout(p=0.3-0.5) - Randomly zero activations during training
  • weight_decay - L2 regularization set in the optimizer
  • nn.BatchNorm2d(num_features) - Stabilize and speed up training
  • LambdaLR scheduler - Custom LR schedule (see the sketch below)
  • torch.nn.utils.clip_grad_norm_() - Gradient clipping
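
A minimal sketch combining weight_decay (L2 regularization) with a LambdaLR schedule; the learning rate, decay factor, and epoch count are placeholder choices.

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(20, 10)  # placeholder model

# weight_decay applies L2 regularization inside the optimizer update
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# LambdaLR multiplies the base LR by the returned factor each epoch (5% decay per epoch here)
scheduler = optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda epoch: 0.95 ** epoch)

for epoch in range(30):
    # ... run one epoch of training here (forward, backward) ...
    optimizer.step()   # normally called inside the batch loop after loss.backward()
    scheduler.step()   # advance the LR schedule once per epoch

print(optimizer.param_groups[0]['lr'])  # base LR scaled by 0.95 ** 30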

API Reference Tables

Detailed function reference for PyTorch APIs.

Tensor Creation

  • torch.tensor([1, 2, 3]) - Create from list
  • torch.zeros((3, 4)) - Zeros tensor
  • torch.ones((2, 3)) - Ones tensor
  • torch.randn(5, 3) - Normal distribution
  • torch.rand(4, 4) - Uniform [0, 1)
  • torch.arange(10) - Range 0-9
  • torch.linspace(0, 1, 10) - Evenly spaced
  • torch.eye(3) - Identity matrix

Tensor Operations

  • x.shape - Get dimensions
  • x.reshape(2, 6) - Reshape tensor
  • x.T - Transpose
  • x.to(device) - Move to device
  • x.clone() - Copy tensor
  • x.requires_grad_(True) - Track gradients
  • x.detach() - Stop tracking
  • x.item() - Get scalar value

Neural Network Layers

  • nn.Linear(in_features, out_features) - Fully connected
  • nn.Conv2d(in_channels, out_channels, 3) - 2D convolution
  • nn.ReLU() - ReLU activation
  • nn.Sigmoid() - Sigmoid activation
  • nn.Tanh() - Tanh activation
  • nn.BatchNorm2d(64) - Batch normalization
  • nn.Dropout(0.5) - Dropout regularization
  • nn.LSTM(input_size, hidden_size) - LSTM layer

Model Training

  • nn.Sequential(...) - Stack layers
  • model.parameters() - Get all weights
  • optim.Adam(params) - Adam optimizer
  • F.cross_entropy(out, y) - Classification loss
  • F.mse_loss(out, y) - Regression loss
  • loss.backward() - Backpropagation
  • optimizer.step() - Update weights
  • model.train() / eval() - Training/eval mode

Activation Functions

  • F.relu(x) - Rectified Linear Unit
  • F.leaky_relu(x) - Leaky ReLU
  • F.sigmoid(x) - Sigmoid [0, 1]
  • F.tanh(x) - Tanh [-1, 1]
  • F.softmax(x, dim=1) - Softmax probabilities
  • F.gelu(x) - GELU (Transformers)
  • F.elu(x) - Exponential Linear Unit
  • torch.nn.functional - All functions in F

Utilities & Debugging
torch.cuda.is_available()Check GPU support
device = torch.device('cuda')Set GPU device
torch.manual_seed(42)Reproducibility
torch.no_grad()Disable gradients
model.state_dict()Get all weights
torch.save(model, 'path')Save model
torch.load('path')Load model
summary(model, (C, H, W))Model summary