Part 4: Datasets, DataLoaders & Data Pipelines

May 3, 2026 Wasil Zafar 25 min read

Build custom datasets, harness DataLoader for efficient batching, apply transforms and augmentation, handle imbalanced data, and stream from any source — the complete guide to PyTorch data pipelines.

Table of Contents

  1. Why Data Pipelines Matter
  2. torch.utils.data.Dataset
  3. Custom Dataset Classes
  4. DataLoader Deep Dive
  5. torchvision.transforms
  6. Data Augmentation Strategies
  7. Built-in Datasets
  8. Real-World Data Challenges
  9. IterableDataset
  10. Conclusion & Next Steps

Why Data Pipelines Matter

A common misconception among deep learning beginners is that the GPU is always the bottleneck. In reality, data loading is often the true performance killer. If your GPU finishes a forward-backward pass in 10 milliseconds but waits 50 milliseconds for the next batch to arrive from disk, your expensive hardware sits idle over 80% of the time (50 ms out of every 60 ms). This is the data loading problem.

PyTorch solves this with a carefully designed data pipeline consisting of three core abstractions:

  • Dataset — defines what your data is and how to access individual samples
  • Transforms — defines how to preprocess, augment, and normalise each sample
  • DataLoader — defines how to batch, shuffle, and deliver samples to the GPU efficiently

Key Insight: A well-tuned data pipeline keeps the GPU fed with pre-processed batches so it never idles. The goal is zero GPU wait time — the next batch should always be ready before the current training step finishes.

The Data Pipeline Architecture

The following diagram shows how data flows from raw storage through the PyTorch pipeline to the GPU. Notice that the DataLoader uses multiple worker processes to prepare batches in parallel while the GPU is busy training on the current batch.

PyTorch Data Pipeline Flow
flowchart LR
    A["Raw Data (Disk / DB / API)"] --> B["Dataset __getitem__()"]
    B --> C["Transforms (Resize, Normalise, Augment)"]
    C --> D["DataLoader (Batch, Shuffle, Workers)"]
    D --> E["pin_memory (Page-Locked RAM)"]
    E --> F["GPU (Training)"]

Each component in this pipeline is independently configurable. You can swap out transforms, change batch sizes, or replace the dataset entirely — all without touching your training loop.

torch.utils.data.Dataset

At the heart of PyTorch's data system is the Dataset class. A map-style dataset is any object that implements two methods: __len__() (returns the total number of samples) and __getitem__(idx) (returns the sample at the given index). That's it — just two methods and you can feed any data format into PyTorch's training pipeline.

Think of a Dataset like a Python list: you can ask how long it is and grab items by index. The DataLoader later handles batching and shuffling on top of this simple interface.
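The contract is small enough to sketch in full. The class below (a hypothetical ListDataset, not part of PyTorch) wraps a plain Python list of (features, label) pairs:

```python
import torch
from torch.utils.data import Dataset

class ListDataset(Dataset):
    """Minimal map-style dataset: wraps a list of (features, label) pairs."""
    def __init__(self, pairs):
        self.pairs = pairs

    def __len__(self):
        # Total number of samples
        return len(self.pairs)

    def __getitem__(self, idx):
        # Return the sample at the given index as tensors
        x, y = self.pairs[idx]
        return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

ds = ListDataset([([1.0, 2.0], 0.0), ([3.0, 4.0], 1.0)])
print(len(ds))   # 2
print(ds[1])     # (tensor([3., 4.]), tensor(1.))
```

Anything that satisfies these two methods — a list wrapper, a CSV reader, an HDF5 handle — plugs into the rest of the pipeline unchanged.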

Creating a Dataset from NumPy Arrays

The simplest way to create a Dataset is to wrap NumPy arrays (or PyTorch tensors) with TensorDataset. This is ideal when your entire dataset fits in memory. The following example creates a small regression dataset, wraps it, and accesses individual samples by index — exactly the way a DataLoader will later access them during training.

import torch
from torch.utils.data import TensorDataset

# Create sample features and labels
X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
                   [7.0, 8.0], [9.0, 10.0]], dtype=torch.float32)
y = torch.tensor([3.0, 7.0, 11.0, 15.0, 19.0], dtype=torch.float32)

# Wrap in TensorDataset
dataset = TensorDataset(X, y)

# Access like a list
print(f"Dataset length: {len(dataset)}")        # 5
print(f"First sample:   {dataset[0]}")           # (tensor([1., 2.]), tensor(3.))
print(f"Last sample:    {dataset[-1]}")          # (tensor([9., 10.]), tensor(19.))

# Iterate over the dataset
for features, label in dataset:
    print(f"  Features: {features.tolist()}, Label: {label.item()}")

TensorDataset simply stores your tensors and returns tuples on indexing. It is a convenience wrapper — behind the scenes it does the same thing as writing __getitem__ and __len__ yourself, but saves you the boilerplate when your data is already in tensor form.

Creating a Dataset from a CSV File

Real-world data rarely comes pre-loaded as tensors. More often you have CSV files, JSON files, or databases. To handle these, you subclass Dataset and implement the two required methods yourself. The pattern below reads a CSV file with pandas and converts rows to tensors on-the-fly inside __getitem__.

import torch
from torch.utils.data import Dataset
import pandas as pd
import io

# Simulate a CSV file (in practice, use pd.read_csv('data.csv'))
csv_data = """age,income,purchased
25,40000,0
30,50000,0
35,60000,1
40,80000,1
45,90000,1
"""

class CSVDataset(Dataset):
    def __init__(self, csv_string):
        self.df = pd.read_csv(io.StringIO(csv_string))
        self.features = self.df[['age', 'income']].values
        self.labels = self.df['purchased'].values

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.float32)
        return x, y

# Create and test the dataset
dataset = CSVDataset(csv_data)
print(f"Dataset size: {len(dataset)}")
print(f"Sample 0: features={dataset[0][0].tolist()}, label={dataset[0][1].item()}")
print(f"Sample 4: features={dataset[4][0].tolist()}, label={dataset[4][1].item()}")

Notice that the CSV is read once in __init__, but individual rows are converted to tensors only when __getitem__ is called. This is the typical pattern: do heavy I/O once at initialisation, do lightweight per-sample processing at access time.

Custom Dataset Classes

While TensorDataset and simple CSV wrappers work for small experiments, production datasets need more flexibility. A custom Dataset class lets you handle arbitrary data formats, apply transforms, and control exactly how data is loaded. The skeleton is always the same: subclass Dataset, implement __init__, __len__, and __getitem__.

The following example builds a more realistic dataset class that accepts optional transforms, which is the standard pattern used throughout PyTorch's ecosystem. The transform is applied inside __getitem__ so that every sample goes through the same preprocessing pipeline automatically.

import torch
from torch.utils.data import Dataset
import numpy as np

class CustomSensorDataset(Dataset):
    """Dataset for time-series sensor readings with optional transforms."""

    def __init__(self, num_samples=100, seq_length=10, transform=None):
        # Simulate sensor data: num_samples sequences of seq_length readings
        np.random.seed(42)
        self.data = np.random.randn(num_samples, seq_length).astype(np.float32)
        # Binary labels: 1 if mean reading > 0, else 0
        self.labels = (self.data.mean(axis=1) > 0).astype(np.float32)
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = torch.tensor(self.data[idx])
        label = torch.tensor(self.labels[idx])

        # Apply transform if provided
        if self.transform:
            sample = self.transform(sample)

        return sample, label

# Use the dataset without transforms
dataset = CustomSensorDataset(num_samples=50, seq_length=8)
print(f"Dataset size: {len(dataset)}")
sample, label = dataset[0]
print(f"Sample shape: {sample.shape}")   # torch.Size([8])
print(f"Label: {label.item()}")

# Define a simple normalisation transform
def normalize(x):
    return (x - x.mean()) / (x.std() + 1e-8)

# Use with transform
dataset_normed = CustomSensorDataset(num_samples=50, seq_length=8, transform=normalize)
sample_n, label_n = dataset_normed[0]
print(f"\nNormalised sample mean: {sample_n.mean().item():.4f}")   # ~0
print(f"Normalised sample std:  {sample_n.std().item():.4f}")     # ~1

The key takeaway is the transform parameter. By accepting transforms as arguments, your Dataset becomes composable — you can mix and match different preprocessing pipelines without modifying the dataset code itself.

Lazy Loading vs Preloading

Design Decision
Lazy Loading vs Preloading — Which Should You Use?

Preloading reads all data into memory during __init__. This is fast at access time but uses a lot of RAM. Best for small-to-medium datasets that fit in memory (under ~8 GB).

Lazy loading reads each sample from disk inside __getitem__. This uses minimal RAM but adds I/O latency per sample. Best for large datasets (ImageNet, video data) where storing everything in memory is impossible.

Hybrid approach: preload metadata (file paths, labels) in __init__, load actual data (images, audio) lazily in __getitem__. This is the most common pattern in practice.
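The hybrid pattern can be sketched as follows. The LazyFileDataset below is illustrative: it preloads only file paths in __init__ and defers the actual torch.load to __getitem__ (here the "large" on-disk dataset is simulated with a handful of generated .pt files):

```python
import os, glob, tempfile
import torch
from torch.utils.data import Dataset

# Simulate an on-disk dataset with a few per-sample .pt files
data_dir = tempfile.mkdtemp()
for i in range(4):
    torch.save({'x': torch.randn(8), 'y': float(i % 2)},
               os.path.join(data_dir, f'sample_{i}.pt'))

class LazyFileDataset(Dataset):
    """Hybrid pattern: cheap metadata preloaded, heavy data loaded lazily."""
    def __init__(self, root):
        # Preload metadata only — just file paths, not the data itself
        self.paths = sorted(glob.glob(os.path.join(root, '*.pt')))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Heavy I/O happens here, once per access
        record = torch.load(self.paths[idx])
        return record['x'], torch.tensor(record['y'])

ds = LazyFileDataset(data_dir)
print(len(ds))            # 4
x, y = ds[0]
print(x.shape, y.item())  # torch.Size([8]) 0.0
```

With real image data the only change is swapping torch.load for an image decoder (e.g. PIL) inside __getitem__; the metadata-first structure stays the same.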


DataLoader Deep Dive

The DataLoader sits between your Dataset and your training loop. It handles four critical responsibilities: batching (grouping samples into mini-batches), shuffling (randomising order each epoch), parallel loading (using multiple worker processes), and memory pinning (for faster CPU-to-GPU transfers). Without a DataLoader, you would need to write all of this boilerplate yourself every time.

The following example demonstrates the core DataLoader parameters. We create a small dataset and observe how batching, shuffling, and drop_last affect what the training loop receives.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Create a dataset with 7 samples (intentionally not divisible by batch_size)
X = torch.arange(7, dtype=torch.float32).unsqueeze(1)  # Shape: (7, 1)
y = torch.arange(7, dtype=torch.float32)                # Shape: (7,)
dataset = TensorDataset(X, y)

# Basic DataLoader — batch_size=3, no shuffling
loader = DataLoader(dataset, batch_size=3, shuffle=False)

print("=== Without drop_last (default) ===")
for batch_idx, (features, labels) in enumerate(loader):
    print(f"Batch {batch_idx}: features={features.squeeze().tolist()}, "
          f"labels={labels.tolist()}, size={len(labels)}")

# With drop_last=True — drops the incomplete last batch
loader_drop = DataLoader(dataset, batch_size=3, shuffle=False, drop_last=True)

print("\n=== With drop_last=True ===")
for batch_idx, (features, labels) in enumerate(loader_drop):
    print(f"Batch {batch_idx}: features={features.squeeze().tolist()}, "
          f"labels={labels.tolist()}, size={len(labels)}")

Notice that with 7 samples and a batch size of 3, the last batch only has 1 sample. Setting drop_last=True discards this incomplete batch, which is often desirable during training because some operations (like batch normalisation) behave differently with very small batches. During validation, you typically keep drop_last=False to evaluate every sample.
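The batch count follows directly from this: len(loader) is ceil(N / batch_size) without drop_last and floor(N / batch_size) with it. A quick sanity check on the same 7-sample setup:

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.arange(7, dtype=torch.float32).unsqueeze(1)
y = torch.arange(7, dtype=torch.float32)
dataset = TensorDataset(X, y)

n, b = len(dataset), 3
loader = DataLoader(dataset, batch_size=b)                       # keeps partial batch
loader_drop = DataLoader(dataset, batch_size=b, drop_last=True)  # drops partial batch

print(len(loader), math.ceil(n / b))   # 3 3
print(len(loader_drop), n // b)        # 2 2
```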

num_workers and pin_memory

Two DataLoader parameters have an outsized impact on training speed but are often left at their defaults. num_workers controls how many subprocess workers prepare batches in parallel, and pin_memory allocates batches in page-locked memory for faster GPU transfers. The right values depend on your hardware — too many workers waste CPU cores, too few starve the GPU.

import torch
from torch.utils.data import DataLoader, TensorDataset
import time

# Create a moderately large dataset
X = torch.randn(10000, 64)
y = torch.randint(0, 10, (10000,))
dataset = TensorDataset(X, y)

# Measure iteration time with different num_workers
for workers in [0, 2]:
    loader = DataLoader(dataset, batch_size=64, shuffle=True,
                        num_workers=workers, pin_memory=False)

    start = time.time()
    for batch_x, batch_y in loader:
        pass  # Simulate consuming the batch
    elapsed = time.time() - start
    print(f"num_workers={workers}: {elapsed:.3f}s to iterate full dataset")

# Demonstrate pin_memory flag
loader_pinned = DataLoader(dataset, batch_size=64, shuffle=True,
                           num_workers=0, pin_memory=True)
print(f"\npin_memory=True loader created (speeds up .to(device) transfers)")
print(f"Total batches: {len(loader_pinned)}")

Rule of Thumb: Set num_workers to 2–4 × number of GPUs. On a single GPU machine with 8 CPU cores, try num_workers=4. Set pin_memory=True whenever training on GPU. On Windows, num_workers > 0 requires the if __name__ == '__main__' guard.
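Pinned memory pays off when paired with non_blocking=True on the transfer, which lets the host-to-device copy overlap with GPU compute. A sketch of the usual loop pattern (it degrades gracefully to CPU when no GPU is available):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 64), torch.randint(0, 10, (256,)))

use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

# pin_memory only helps (and only applies) when transferring to a GPU
loader = DataLoader(dataset, batch_size=64, pin_memory=use_cuda)

for batch_x, batch_y in loader:
    # non_blocking=True makes the host-to-device copy asynchronous
    # when the source tensor lives in pinned (page-locked) memory
    batch_x = batch_x.to(device, non_blocking=True)
    batch_y = batch_y.to(device, non_blocking=True)
    # ... forward/backward pass here ...

print(f"Last batch on: {batch_x.device}")
```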

Custom collate_fn

By default, the DataLoader stacks samples along a new batch dimension using torch.stack. This works perfectly when all samples have the same shape. But what if your samples have different sizes — variable-length text, differently-sized images, or nested structures? That is where collate_fn comes in. It is a function that receives a list of individual samples and returns a properly formatted batch.

import torch
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

class VariableLengthDataset(Dataset):
    """Simulates text data where each sample has a different sequence length."""
    def __init__(self):
        self.sequences = [
            torch.tensor([1, 2, 3]),
            torch.tensor([4, 5]),
            torch.tensor([6, 7, 8, 9]),
            torch.tensor([10]),
            torch.tensor([11, 12, 13, 14, 15]),
        ]
        self.labels = torch.tensor([0, 1, 0, 1, 0])

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return self.sequences[idx], self.labels[idx]

def pad_collate(batch):
    """Pads variable-length sequences to the length of the longest in the batch."""
    sequences, labels = zip(*batch)
    # pad_sequence pads with 0 by default, batch_first=True gives (B, max_len)
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    lengths = torch.tensor([len(s) for s in sequences])
    labels = torch.stack(labels)
    return padded, labels, lengths

dataset = VariableLengthDataset()
loader = DataLoader(dataset, batch_size=3, shuffle=False, collate_fn=pad_collate)

for padded_seqs, labels, lengths in loader:
    print(f"Padded batch shape: {padded_seqs.shape}")
    print(f"Sequences:\n{padded_seqs}")
    print(f"Labels:  {labels.tolist()}")
    print(f"Lengths: {lengths.tolist()}")
    print("---")

The custom pad_collate function uses pad_sequence to pad all sequences to the length of the longest one in the current batch, and returns the original lengths so you can later mask the padded positions. This is the standard pattern for handling text data in NLP models like RNNs and Transformers.

torchvision.transforms

Transforms are the glue between raw data and model-ready tensors. The torchvision.transforms module provides a library of composable preprocessing operations for images. You chain them together using Compose, which creates a single callable pipeline that applies each transform in sequence. Every image dataset in practice uses transforms for resizing, normalisation, and augmentation.

The following example demonstrates the most common transforms and how to compose them into a pipeline. It uses the classic transforms API; the newer torchvision.transforms.v2 API offers the same operations with identical usage for these basics, plus extras such as support for bounding boxes and segmentation masks.

import torch
from torchvision import transforms

# Define a transform pipeline for training data
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),           # Resize to 256x256
    transforms.RandomCrop(224),              # Random 224x224 crop
    transforms.RandomHorizontalFlip(p=0.5),  # Flip horizontally 50% of the time
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # Random colour changes
    transforms.ToTensor(),                   # Convert PIL Image to tensor [0, 1]
    transforms.Normalize(                    # Normalise with ImageNet stats
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Define a simpler pipeline for validation (no augmentation)
val_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),              # Deterministic center crop
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    ),
])

# Show what the pipelines contain
print("Training transforms:")
for i, t in enumerate(train_transform.transforms):
    print(f"  {i+1}. {t}")

print(f"\nValidation transforms:")
for i, t in enumerate(val_transform.transforms):
    print(f"  {i+1}. {t}")

Training vs Validation Transforms

Critical Distinction: Training transforms include random augmentation (RandomCrop, RandomHorizontalFlip, ColorJitter) to make the model robust. Validation transforms must be deterministic (CenterCrop, no flips) so that evaluation results are reproducible. Using random augmentation during validation gives inconsistent metrics and can mask overfitting.

The key transforms you should know:

  • ToTensor() — converts a PIL Image or NumPy array to a float32 tensor in the range [0, 1] with shape (C, H, W)
  • Normalize(mean, std) — normalises each channel. ImageNet stats are mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
  • Resize(size) — resizes the image to the given size
  • RandomCrop(size) — crops a random region (training only)
  • CenterCrop(size) — crops the centre region (validation/test)
  • RandomHorizontalFlip(p) — flips horizontally with probability p

When composing transforms, the order matters: resizing and cropping should come before ToTensor(), and Normalize() should always be last because it operates on tensor values.

Data Augmentation Strategies

Data augmentation is one of the most powerful regularisation techniques in deep learning. Instead of collecting more data (expensive and slow), you artificially expand your training set by applying random transformations to existing samples. Each epoch, the model sees slightly different versions of the same images — different crops, flips, colour variations — which forces it to learn features rather than memorise specific pixels.

Why It Works
Augmentation as Implicit Regularisation

Without augmentation, a model trained on 1,000 cat images sees the same 1,000 images every epoch. With random crops, flips, and colour jitter, each image produces thousands of unique variations. The model can no longer memorise exact pixel patterns and must learn general features (ears, whiskers, fur texture) that are invariant to position, orientation, and lighting.

Empirically, augmentation can shrink the gap between training and validation accuracy by as much as 5–15 percentage points. It is effectively "free data" — you get the benefit of a larger dataset without actually collecting more samples.


Visualising Augmented Samples

It is always a good idea to visually inspect what your augmentation pipeline does to your data. The following example creates a synthetic image and applies the same augmentation pipeline multiple times to show the random variation. In practice, you would do this with a real image from your dataset to verify that augmentations look reasonable.

import torch
from torchvision import transforms

# Create a synthetic "image" tensor (3 channels, 64x64)
torch.manual_seed(42)
fake_image = torch.rand(3, 64, 64)  # Random pixels in [0, 1]

# Define augmentation pipeline (operates on tensors)
augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=56, scale=(0.8, 1.0)),
])

# Apply the same augmentation 5 times — each produces different output
print("Original image shape:", fake_image.shape)
for i in range(5):
    augmented = augmentation(fake_image)
    pixel_diff = (augmented.mean() - fake_image[:, :56, :56].mean()).abs().item()
    print(f"  Augmented #{i+1}: shape={augmented.shape}, "
          f"mean pixel change={pixel_diff:.4f}")

Each call to the augmentation pipeline produces a different result because the transforms are stochastic. During training, this means your model never sees the exact same image twice — it always gets a fresh, slightly modified version.

Built-in Datasets

PyTorch's ecosystem provides dozens of ready-to-use datasets through torchvision.datasets, torchaudio.datasets, and torchtext.datasets. These are invaluable for learning, prototyping, and benchmarking — they download automatically, handle file parsing, and integrate seamlessly with DataLoader. The most commonly used ones include MNIST (handwritten digits), CIFAR-10 (small colour images), and ImageNet (1.2 million high-resolution images).

Using CIFAR-10

CIFAR-10 contains 60,000 32×32 colour images across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). It is the go-to dataset for quickly testing image classification models. The following example downloads CIFAR-10, applies transforms, wraps it in a DataLoader, and inspects the first batch — the exact workflow you would use to start any computer vision project.

import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define transforms
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # CIFAR-10 mean
                         (0.2470, 0.2435, 0.2616)),   # CIFAR-10 std
])

# Download and load CIFAR-10 training set
train_dataset = datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Inspect the first batch
images, labels = next(iter(train_loader))
print(f"Batch images shape: {images.shape}")    # [32, 3, 32, 32]
print(f"Batch labels shape: {labels.shape}")    # [32]
print(f"Label values: {labels[:10].tolist()}")

# Class names
classes = ('airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck')
print(f"First 5 labels: {[classes[l] for l in labels[:5].tolist()]}")
print(f"\nTotal training samples: {len(train_dataset)}")
print(f"Total batches per epoch: {len(train_loader)}")

The download=True flag downloads the dataset the first time you run this code. On subsequent runs it uses the cached copy in the ./data directory. The normalisation values above are the per-channel mean and standard deviation computed over the entire CIFAR-10 training set — using these (rather than ImageNet values) gives slightly better results because they match the actual data distribution.

Other popular built-in datasets follow the same pattern:

  • datasets.MNIST — 70,000 grayscale handwritten digits (28×28)
  • datasets.FashionMNIST — 70,000 grayscale fashion items (28×28)
  • datasets.ImageFolder — loads images from a directory tree where each subdirectory is a class
  • datasets.ImageNet — 1.2 million images, 1000 classes (requires manual download)

Handling Real-World Data Challenges

Textbook datasets are clean, balanced, and uniformly sized. Real-world data is messy. Classes are imbalanced (99% negative, 1% positive in fraud detection), sequences have different lengths (tweets vs essays), and data comes from multiple modalities (images + text + metadata). PyTorch provides specialised tools for each of these challenges.

Imbalanced Datasets with WeightedRandomSampler

When one class vastly outnumbers another, the model learns to always predict the majority class because that minimises the loss. WeightedRandomSampler fixes this by oversampling minority classes — samples from rare classes are drawn more frequently, so each batch contains a roughly equal mix of all classes. This is often more effective than manually duplicating minority samples.

import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler
import collections

# Create imbalanced dataset: 900 class-0, 100 class-1
torch.manual_seed(42)
X = torch.randn(1000, 10)
y = torch.cat([torch.zeros(900), torch.ones(100)]).long()

dataset = TensorDataset(X, y)

# Calculate class weights (inverse of class frequency)
class_counts = collections.Counter(y.tolist())
print(f"Class distribution: {dict(class_counts)}")

# Weight per sample: each sample gets the inverse frequency of its class
weights = [1.0 / class_counts[label.item()] for label in y]
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

# Create loader with sampler (cannot use shuffle=True with sampler)
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

# Check class distribution across a few batches
batch_labels = []
for batch_x, batch_y in loader:
    batch_labels.extend(batch_y.tolist())

resampled_dist = collections.Counter(batch_labels)
print(f"Resampled distribution: {dict(resampled_dist)}")
print(f"Class 0 ratio: {resampled_dist[0] / len(batch_labels):.2%}")
print(f"Class 1 ratio: {resampled_dist[1] / len(batch_labels):.2%}")

After resampling, both classes appear roughly equally in each epoch, even though the original dataset is 90/10 imbalanced. The replacement=True parameter means minority samples can be drawn multiple times per epoch. Note that you cannot use shuffle=True together with a sampler — the sampler controls the ordering instead.
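An alternative (or complement) to resampling is to keep the natural data order and weight the loss instead, so errors on the rare class cost more. A sketch using the weight argument of nn.CrossEntropyLoss with the same 90/10 split:

```python
import torch
import torch.nn as nn

# Same 90/10 imbalance as above
y = torch.cat([torch.zeros(900), torch.ones(100)]).long()

# Inverse-frequency class weights, normalised so they average to 1
counts = torch.bincount(y).float()            # tensor([900., 100.])
weights = counts.sum() / (len(counts) * counts)
print(weights)                                # tensor([0.5556, 5.0000])

# Mistakes on class 1 now contribute ~9x more to the loss
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)
targets = torch.randint(0, 2, (8,))
loss = criterion(logits, targets)
print(loss.item() > 0)                        # True
```

Whether sampling or loss weighting works better is problem-dependent; it is common to try both and compare validation metrics on the minority class.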

Variable-Length Sequences

We already saw pad_collate earlier, but let us look at a more complete example that includes attention masks. When padding sequences, you need to tell the model which positions are real tokens and which are padding — otherwise the model will treat padding as meaningful input. Attention masks are binary tensors where 1 means "real token" and 0 means "padding".

import torch
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

class TextDataset(Dataset):
    """Simulates tokenised text with variable-length sequences."""
    def __init__(self):
        # Each "sentence" has a different number of tokens
        self.texts = [
            torch.tensor([101, 2003, 2023, 102]),         # 4 tokens
            torch.tensor([101, 1996, 4937, 2003, 102]),   # 5 tokens
            torch.tensor([101, 2002, 102]),                # 3 tokens
            torch.tensor([101, 2054, 2003, 2115, 2171, 102]),  # 6 tokens
        ]
        self.labels = torch.tensor([1, 0, 1, 0])

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def collate_with_masks(batch):
    """Pads sequences and creates attention masks."""
    texts, labels = zip(*batch)
    lengths = [len(t) for t in texts]

    # Pad sequences
    padded = pad_sequence(texts, batch_first=True, padding_value=0)

    # Create attention masks: 1 for real tokens, 0 for padding
    attention_mask = torch.zeros_like(padded)
    for i, length in enumerate(lengths):
        attention_mask[i, :length] = 1

    return padded, torch.stack(labels), attention_mask

dataset = TextDataset()
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate_with_masks)

for padded, labels, mask in loader:
    print(f"Padded shape: {padded.shape}")
    print(f"Tokens:\n{padded}")
    print(f"Attention mask:\n{mask}")
    print(f"Labels: {labels.tolist()}")
    print("---")

The attention mask has the same shape as the padded tensor. Real tokens are marked with 1, and padding positions are 0. When you pass this mask to a Transformer or RNN, the model knows to ignore the padded positions during attention computation and loss calculation.
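The same mask is also useful for pooling. To average token embeddings while ignoring padding, multiply by the mask and divide by the true lengths instead of calling .mean() directly; the sketch below uses a hypothetical embedding dimension of 4:

```python
import torch

# Mask for a padded batch (B=2, L=5): 1 = real token, 0 = padding
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]], dtype=torch.float32)

# Hypothetical per-token embeddings: (B, L, D)
emb = torch.randn(2, 5, 4)

# Zero out padded positions, then divide by the number of real tokens
masked = emb * mask.unsqueeze(-1)                           # (B, L, D)
pooled = masked.sum(dim=1) / mask.sum(dim=1, keepdim=True)  # (B, D)

print(pooled.shape)  # torch.Size([2, 4])

# Sanity check: row 0 should equal the plain mean of its 3 real tokens
expected = emb[0, :3].mean(dim=0)
print(torch.allclose(pooled[0], expected))  # True
```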

IterableDataset

Map-style datasets assume you know the total size upfront and can access any sample by index. But some data sources are sequential streams: log files being written in real-time, data coming from a network socket, results from a paginated API, or a dataset too large to index. For these cases, PyTorch provides IterableDataset, which only requires you to implement __iter__ — no __len__ or __getitem__ needed.

import torch
from torch.utils.data import DataLoader, IterableDataset

class CountingDataset(IterableDataset):
    """Streams numbers from start to end, simulating a data source
    where the total size may not be known upfront."""

    def __init__(self, start=0, end=20):
        self.start = start
        self.end = end

    def __iter__(self):
        for i in range(self.start, self.end):
            # Simulate reading from a file/API/database
            x = torch.tensor([float(i)])
            y = torch.tensor(float(i * 2))  # Simple target: double the input
            yield x, y

# Create an IterableDataset
stream_dataset = CountingDataset(start=0, end=10)

# DataLoader works with IterableDataset — but shuffle is not supported
loader = DataLoader(stream_dataset, batch_size=3)

print("Streaming batches from IterableDataset:")
for batch_x, batch_y in loader:
    print(f"  X: {batch_x.squeeze().tolist()}, Y: {batch_y.tolist()}")

Notice that we use yield instead of return — this makes __iter__ a Python generator that produces samples one at a time without holding the entire dataset in memory. The DataLoader still handles batching, but shuffling is not supported: passing shuffle=True with an IterableDataset raises a ValueError, because you cannot randomly access items in a stream.
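One subtlety: with num_workers > 0, every worker process receives its own copy of the IterableDataset, so a naive __iter__ emits the full stream once per worker and duplicates every sample. The usual fix, adapted from the pattern in the PyTorch documentation, is to shard the range inside __iter__ using get_worker_info():

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedCountingDataset(IterableDataset):
    """Splits the stream across workers so each sample is emitted exactly once."""
    def __init__(self, start=0, end=20):
        self.start = start
        self.end = end

    def __iter__(self):
        worker_info = get_worker_info()
        if worker_info is None:
            # Single-process loading: yield the entire range
            iter_start, iter_end = self.start, self.end
        else:
            # Split the range into contiguous chunks, one per worker
            per_worker = -(-(self.end - self.start) // worker_info.num_workers)  # ceil div
            iter_start = self.start + worker_info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        for i in range(iter_start, iter_end):
            yield torch.tensor([float(i)])

loader = DataLoader(ShardedCountingDataset(0, 10), batch_size=4)
values = [v.item() for batch in loader for v in batch]
print(sorted(values))  # each of 0.0 .. 9.0 appears exactly once
```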

When to Use IterableDataset vs Dataset

Comparison
Map-Style Dataset vs IterableDataset
Feature           | Map-Style Dataset             | IterableDataset
------------------|-------------------------------|------------------------------
Access pattern    | Random (by index)             | Sequential (iteration)
Required methods  | __len__, __getitem__          | __iter__
Shuffling         | Supported                     | Not supported
Size known        | Yes                           | Not required
Best for          | Files on disk, in-memory data | Streams, APIs, huge datasets

In practice, you will use map-style Dataset 90% of the time. IterableDataset is for specialised scenarios: processing multi-terabyte datasets stored across multiple shards, reading from a real-time data stream, or interfacing with databases that support only cursor-based iteration.

Putting It All Together — End-to-End Pipeline

Let us build a complete, production-quality data pipeline that combines everything we have learned: a custom Dataset class with transforms, train/validation splitting, DataLoader configuration, and a training loop that consumes the data. This is the pattern you will use in virtually every PyTorch project.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split
import numpy as np

# --- Custom Dataset ---
class SyntheticImageDataset(Dataset):
    """Simulates an image classification dataset with transforms."""
    def __init__(self, num_samples=500, num_classes=5, img_size=16, transform=None):
        np.random.seed(0)
        self.images = np.random.randn(num_samples, 3, img_size, img_size).astype(np.float32)
        self.labels = np.random.randint(0, num_classes, num_samples)
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = torch.tensor(self.images[idx])
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        if self.transform:
            image = self.transform(image)
        return image, label

# --- Transforms ---
def train_transform(x):
    # Simple normalisation (in practice, use torchvision.transforms)
    return (x - x.mean()) / (x.std() + 1e-8)

# --- Dataset and Splits ---
full_dataset = SyntheticImageDataset(num_samples=500, transform=train_transform)
train_size = int(0.8 * len(full_dataset))
val_size = len(full_dataset) - train_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

# --- DataLoaders ---
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# --- Simple Model ---
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 16 * 16, 64),
    nn.ReLU(),
    nn.Linear(64, 5),
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# --- Training Loop ---
for epoch in range(3):
    model.train()
    train_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss, correct = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            outputs = model(images)
            val_loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(1) == labels).sum().item()

    print(f"Epoch {epoch+1}/3 — "
          f"Train Loss: {train_loss/len(train_loader):.4f}, "
          f"Val Loss: {val_loss/len(val_loader):.4f}, "
          f"Val Acc: {correct/len(val_dataset):.2%}")

This example ties together every concept from this article: a custom Dataset with transforms, random_split for train/validation partitioning, separate DataLoaders with appropriate settings (shuffle for training, no shuffle for validation), and a training loop that iterates over batches. This is the standard PyTorch data pipeline template.
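When a GPU is available, the template above needs only one addition: a device transfer per batch. The following sketch (using a small stand-in dataset and model, not the exact ones defined above) shows the idiomatic pattern — note the non_blocking flag, which pairs with the pin_memory setting discussed later in this article:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Fall back to CPU when no GPU is present, so the sketch runs anywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tiny stand-in dataset and model so the example is self-contained
X = torch.randn(64, 3 * 16 * 16)
y = torch.randint(0, 5, (64,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True,
                    pin_memory=(device.type == "cuda"))

model = nn.Sequential(nn.Linear(3 * 16 * 16, 64),
                      nn.ReLU(),
                      nn.Linear(64, 5)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for images, labels in loader:
    # non_blocking=True lets the host-to-device copy overlap with compute;
    # it only has an effect when the source tensors are in pinned memory
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

print(f"Trained one epoch on {device}")
```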

ImageFolder — Loading Images from Directories

One of the most convenient built-in datasets is ImageFolder. If your images are organised in directories where each subdirectory name is the class label, ImageFolder automatically discovers classes, assigns integer labels, and loads images — no custom Dataset needed. This is the standard layout for most image classification tasks.

from torchvision import datasets, transforms
import os
import tempfile
from PIL import Image

# Create a temporary directory structure to demonstrate ImageFolder
# In practice, you'd point this at your actual image directory
tmp_dir = tempfile.mkdtemp()
for cls_name in ['cats', 'dogs']:
    cls_dir = os.path.join(tmp_dir, cls_name)
    os.makedirs(cls_dir, exist_ok=True)
    # Create 3 dummy images per class
    for i in range(3):
        img = Image.new('RGB', (64, 64), color=(i * 40, 100, 200))
        img.save(os.path.join(cls_dir, f'{cls_name}_{i}.png'))

# Load with ImageFolder — it automatically discovers classes
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder(root=tmp_dir, transform=transform)

print(f"Classes found: {dataset.classes}")         # ['cats', 'dogs']
print(f"Class-to-index: {dataset.class_to_idx}")   # {'cats': 0, 'dogs': 1}
print(f"Total images: {len(dataset)}")             # 6

# Access a sample
image, label = dataset[0]
print(f"Image shape: {image.shape}")               # [3, 32, 32]
print(f"Label: {label} ({dataset.classes[label]})")

# Clean up
import shutil
shutil.rmtree(tmp_dir)

The expected directory structure for ImageFolder is:

root/
├── cats/
│   ├── cat_001.jpg
│   ├── cat_002.jpg
│   └── ...
├── dogs/
│   ├── dog_001.jpg
│   ├── dog_002.jpg
│   └── ...
└── birds/
    ├── bird_001.jpg
    └── ...

Each subdirectory becomes a class, and the directory name becomes the class label. This is the simplest way to get started with image classification — just organise your files and let PyTorch do the rest.

Splitting Datasets — random_split and Subset

You rarely train on your entire dataset. Standard practice is to split data into training, validation, and test sets. PyTorch provides random_split for random partitioning and Subset for index-based partitioning. The key advantage of these tools is that they create views of the original dataset without copying data, so memory usage stays constant.

import torch
from torch.utils.data import TensorDataset, random_split, Subset

# Create a dataset with 1000 samples
X = torch.randn(1000, 5)
y = torch.randint(0, 3, (1000,))
dataset = TensorDataset(X, y)

# Method 1: random_split (most common)
train_set, val_set, test_set = random_split(
    dataset,
    [700, 150, 150],                    # Sizes must sum to len(dataset)
    generator=torch.Generator().manual_seed(42)  # Reproducible split
)
print(f"random_split sizes: train={len(train_set)}, "
      f"val={len(val_set)}, test={len(test_set)}")

# Method 2: Subset with explicit indices (for stratified or custom splits)
indices = list(range(len(dataset)))
train_indices = indices[:700]
val_indices = indices[700:850]
test_indices = indices[850:]

train_sub = Subset(dataset, train_indices)
val_sub = Subset(dataset, val_indices)
test_sub = Subset(dataset, test_indices)
print(f"Subset sizes: train={len(train_sub)}, "
      f"val={len(val_sub)}, test={len(test_sub)}")

# Verify: accessing a Subset sample returns the correct original sample
original_sample = dataset[700]
subset_sample = val_sub[0]
print(f"\nOriginal[700] == ValSubset[0]: "
      f"{torch.equal(original_sample[0], subset_sample[0])}")

Use random_split when you want a simple random partition. Use Subset when you need control over exactly which indices go into each split — for example, when implementing stratified splitting where each split maintains the class distribution of the original dataset.
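The stratified case mentioned above can be sketched in a few lines of plain PyTorch: shuffle the indices within each class, cut each class at the same ratio, and concatenate. (In practice, sklearn's train_test_split with stratify=y is a common alternative.)

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Imbalanced dataset: 70% class 0, 30% class 1
X = torch.randn(1000, 5)
y = torch.cat([torch.zeros(700), torch.ones(300)]).long()
dataset = TensorDataset(X, y)

# Stratified 80/20 split: cut each class separately at the same ratio
g = torch.Generator().manual_seed(42)
train_idx, val_idx = [], []
for cls in y.unique():
    cls_idx = (y == cls).nonzero(as_tuple=True)[0]
    # Shuffle within the class before cutting
    cls_idx = cls_idx[torch.randperm(len(cls_idx), generator=g)]
    cut = int(0.8 * len(cls_idx))
    train_idx += cls_idx[:cut].tolist()
    val_idx += cls_idx[cut:].tolist()

train_set = Subset(dataset, train_idx)
val_set = Subset(dataset, val_idx)

# Both splits preserve the original 70/30 class balance
val_labels = y[val_idx]
print(f"val class-0 fraction: {(val_labels == 0).float().mean():.2f}")  # 0.70
```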

DataLoader Performance Tuning

Getting the DataLoader settings right can mean the difference between a training run that takes 2 hours and one that takes 8 hours. The following example demonstrates how to benchmark different configurations and find the optimal settings for your hardware. The key parameters to tune are batch_size, num_workers, and pin_memory.

import torch
from torch.utils.data import DataLoader, TensorDataset
import time

# Create a dataset that simulates realistic workload
X = torch.randn(5000, 3, 32, 32)  # 5000 "images"
y = torch.randint(0, 10, (5000,))
dataset = TensorDataset(X, y)

# Benchmark function
def benchmark_loader(batch_size, num_workers, pin_memory, n_epochs=2):
    loader = DataLoader(dataset, batch_size=batch_size,
                        shuffle=True, num_workers=num_workers,
                        pin_memory=pin_memory)
    start = time.time()
    for epoch in range(n_epochs):
        for batch_x, batch_y in loader:
            pass  # Simulate consuming the batch
    elapsed = time.time() - start
    return elapsed

# Test different configurations. Note: before adding num_workers > 0,
# wrap the benchmark calls in `if __name__ == "__main__":` on Windows/macOS,
# which spawn worker processes by re-importing the script
configs = [
    (32, 0, False),
    (64, 0, False),
    (128, 0, False),
    (64, 2, False),
    (64, 0, True),
]

print(f"{'Batch':>6} {'Workers':>8} {'Pinned':>7} {'Time (s)':>10}")
print("-" * 35)
for bs, nw, pm in configs:
    t = benchmark_loader(bs, nw, pm)
    print(f"{bs:>6} {nw:>8} {str(pm):>7} {t:>10.3f}")
Performance Tuning Checklist:
  • Batch size: Larger batches utilise GPU parallelism better. Start with 32 or 64, increase until GPU memory is ~80% full
  • num_workers: Start with 2×num_GPUs, increase until CPU utilisation plateaus. Set to 0 for debugging
  • pin_memory: Always True when training on GPU. No effect on CPU-only training
  • prefetch_factor: Default is 2 (each worker prefetches 2 batches). Increase for high-latency I/O (network storage)
  • persistent_workers: Set True to keep workers alive between epochs (avoids re-spawn overhead)
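The last two checklist items, prefetch_factor and persistent_workers, did not appear in the benchmark above, so here is a sketch of a DataLoader wired with all of them together. Both settings only take effect when num_workers > 0:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# A loader configured per the checklist above
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,                          # two background worker processes
    pin_memory=torch.cuda.is_available(),   # only useful when training on GPU
    prefetch_factor=4,                      # each worker keeps 4 batches ready (default: 2)
    persistent_workers=True,                # reuse workers across epochs
)

# On Windows/macOS, workers are spawned by re-importing the script,
# so iteration must live under the __main__ guard
if __name__ == "__main__":
    for epoch in range(2):  # workers survive between these epochs
        n_batches = sum(1 for _ in loader)
    print(f"{n_batches} batches per epoch")  # 16 (ceil(1000 / 64))
```

With persistent_workers=False (the default), the worker processes are torn down and re-spawned at every epoch boundary, which is noticeable overhead for short epochs.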

Conclusion & Next Steps

Data pipelines are the foundation of every PyTorch project. In this article, we covered the entire journey from raw data to GPU-ready batches:

  • Dataset: Implement __len__ and __getitem__ for any data format — NumPy, CSV, images, or custom sources
  • Custom Datasets: Accept optional transforms, choose between lazy loading and preloading based on dataset size
  • DataLoader: Handles batching, shuffling, parallel loading (num_workers), and memory pinning (pin_memory)
  • Transforms: Use Compose to chain preprocessing. Random augmentation for training, deterministic for validation
  • Augmentation: Free regularisation — random crops, flips, and colour changes prevent overfitting
  • Built-in Datasets: MNIST, CIFAR-10, ImageFolder for quick prototyping
  • Imbalanced Data: WeightedRandomSampler ensures balanced class representation
  • Variable-Length Sequences: Custom collate_fn with pad_sequence and attention masks
  • IterableDataset: For streaming data sources where random access is impossible

Next in the Series

In Part 5: CNNs & Computer Vision, we'll build convolutional neural networks from scratch, understand feature maps and pooling, implement image classifiers, and explore architectures like ResNet — all using the data pipelines you just learned.