What Is Transfer Learning?
Imagine you already speak English fluently and now want to learn Dutch. You don't start from zero — you already understand grammar concepts, sentence structure, and even share vocabulary roots. You transfer what you know to accelerate learning the new language. Transfer learning in deep learning works exactly the same way: instead of training a neural network from scratch on your specific task, you start with a model that has already learned useful representations from a massive dataset and adapt it to your problem.
Why Not Train From Scratch?
Training a deep neural network from scratch requires three things most practitioners don't have: millions of labeled examples, hundreds of GPU-hours, and extensive hyperparameter tuning expertise. Models like ResNet-50 were trained on ImageNet's 1.2 million images for days on multiple GPUs. Models like BERT were trained on billions of words of text. Transfer learning lets you benefit from all that training with a fraction of the data and compute.
Universal Feature Hierarchy
Deep convolutional networks learn features in a hierarchy — early layers detect simple patterns like edges and color gradients, middle layers combine these into textures and repeated patterns, and later layers recognize complex structures like eyes, wheels, or faces. Research has shown that these early and middle features are remarkably universal across different visual tasks. A model trained to classify dogs and cats has learned edge detectors that are equally useful for classifying medical X-rays or satellite imagery.
The following diagram illustrates how features become increasingly task-specific as you move deeper into a network, and why transfer learning works by reusing the general-purpose lower layers.
flowchart LR
A["Layer 1-2\nEdges & Gradients\n(Very General)"] --> B["Layer 3-4\nTextures & Patterns\n(General)"]
B --> C["Layer 5-6\nObject Parts\n(Moderately Specific)"]
C --> D["Final Layers\nTask-Specific Classes\n(Very Specific)"]
style A fill:#3B9797,color:#fff,stroke:#132440
style B fill:#16476A,color:#fff,stroke:#132440
style C fill:#132440,color:#fff,stroke:#132440
style D fill:#BF092F,color:#fff,stroke:#132440
This hierarchy is the reason transfer learning works: the general features in early layers (edges, textures) are useful for virtually any visual task. Only the final classifier layer needs to be task-specific. By reusing those general layers and replacing just the head, you get a powerful model with very little data.
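You can inspect these general first-layer features yourself. The following sketch (assuming torchvision is installed) pulls the 64 RGB kernels out of a pretrained ResNet-50's first convolution; plotted as small images, most of them resemble oriented edge and color-blob detectors.
from torchvision import models
from torchvision.models import ResNet50_Weights
# Load a pretrained ResNet-50 and grab its first conv layer's kernels
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
filters = model.conv1.weight.detach()  # shape: [64, 3, 7, 7]
print(f"First-layer kernels: {tuple(filters.shape)}")
# Normalize each kernel to [0, 1] so it can be displayed as a tiny RGB image
f_min = filters.amin(dim=(1, 2, 3), keepdim=True)
f_max = filters.amax(dim=(1, 2, 3), keepdim=True)
displayable = (filters - f_min) / (f_max - f_min + 1e-8)
print(f"Ready to plot (e.g., with matplotlib): {tuple(displayable.shape)}")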
Feature Extraction vs Fine-Tuning
There are two main strategies for transfer learning, and choosing between them is one of the most important decisions you'll make. Both start with a pretrained model, but they differ in how much of that model you allow to change during training.
Feature Extraction treats the pretrained model as a fixed feature extractor. You freeze all the pretrained layers (preventing their weights from changing) and only train a new classifier head that you attach on top. This is fast, requires very little data, and is far less prone to overfitting — but the features are fixed and might not be optimal for your specific task.
Fine-Tuning starts with the pretrained weights but allows some or all layers to update during training. This lets the model adapt its learned features to your specific domain. It's more powerful but requires more data and careful hyperparameter tuning to avoid destroying the pretrained knowledge.
When to Use Each Approach
Feature Extraction vs Fine-Tuning
| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Dataset size | Small (100-1,000 samples) | Medium-Large (1,000+) |
| Domain similarity | Similar to ImageNet/pretraining data | Different from pretraining data |
| Compute budget | Limited (minutes) | Moderate (hours) |
| Overfitting risk | Very low | Higher (needs regularization) |
| Accuracy ceiling | Good | Best possible |
Decision Flowchart
Use the following flowchart when deciding your transfer learning strategy. The key factors are dataset size and how similar your target domain is to the pretrained model's training data (usually ImageNet for vision or large text corpora for NLP).
flowchart TD
A["Start: Have a new task"] --> B{"Large dataset?\n(10,000+ samples)"}
B -->|Yes| C{"Similar to pretrained\ndomain?"}
B -->|No| D{"Similar to pretrained\ndomain?"}
C -->|Yes| E["Fine-tune all layers\nLow LR, standard augmentation"]
C -->|No| F["Fine-tune from middle\nFreeze early layers, higher LR"]
D -->|Yes| G["Feature extraction\nFreeze backbone, train head only"]
D -->|No| H["Feature extraction + careful fine-tune\nStart frozen, gradually unfreeze"]
style A fill:#132440,color:#fff,stroke:#132440
style E fill:#3B9797,color:#fff,stroke:#132440
style F fill:#16476A,color:#fff,stroke:#132440
style G fill:#3B9797,color:#fff,stroke:#132440
style H fill:#BF092F,color:#fff,stroke:#132440
The following code demonstrates the fundamental operation behind both strategies — freezing and unfreezing parameters. When you freeze a parameter, you set requires_grad = False, telling PyTorch's autograd to skip computing gradients for that parameter. This means it won't be updated during backpropagation.
import torch
import torch.nn as nn
# Simple model to demonstrate freezing
model = nn.Sequential(
nn.Linear(100, 64), # Layer 0: pretrained
nn.ReLU(),
nn.Linear(64, 32), # Layer 2: pretrained
nn.ReLU(),
nn.Linear(32, 10) # Layer 4: new classifier head
)
# FREEZE all layers — feature extraction mode
for param in model.parameters():
param.requires_grad = False
# UNFREEZE only the last layer — train new head only
for param in model[4].parameters():
param.requires_grad = True
# Count trainable vs frozen parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({100*trainable/total:.1f}%)")
# Trainable: 330 / 8,874 (3.7%)
Notice how only 3.7% of parameters are trainable — the new classifier head. The frozen backbone acts as a fixed feature extractor, converting inputs into a 32-dimensional representation that the trainable head then classifies. This is the essence of feature extraction.
torchvision Pretrained Models
PyTorch's torchvision library provides a comprehensive model zoo — a collection of well-known architectures pretrained on ImageNet (1.2 million images, 1,000 classes). Think of it as a library of pre-built engines: you pick the one that best fits your car (task) rather than building an engine from scratch.
The Weights Enum API
Since torchvision 0.13, the recommended way to load pretrained models is using the Weights enum API. This replaces the old pretrained=True parameter with explicit weight versions, so you always know exactly which checkpoint you're using and what preprocessing transforms it expects.
import torch
from torchvision import models
from torchvision.models import (
ResNet50_Weights, EfficientNet_B0_Weights,
MobileNet_V3_Small_Weights, VGG16_Weights
)
# Load ResNet-50 with ImageNet-1K pretrained weights
resnet = models.resnet50(weights=ResNet50_Weights.DEFAULT)
print(f"ResNet-50 params: {sum(p.numel() for p in resnet.parameters()):,}")
# Load EfficientNet-B0 — smaller, faster, often just as accurate
effnet = models.efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
print(f"EfficientNet-B0 params: {sum(p.numel() for p in effnet.parameters()):,}")
# Load MobileNet-V3 — optimized for mobile/edge deployment
mobilenet = models.mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.DEFAULT)
print(f"MobileNet-V3 params: {sum(p.numel() for p in mobilenet.parameters()):,}")
# Load VGG-16 — classic architecture, larger but well-understood
vgg = models.vgg16(weights=VGG16_Weights.DEFAULT)
print(f"VGG-16 params: {sum(p.numel() for p in vgg.parameters()):,}")
Each model offers different tradeoffs between accuracy, speed, and size. ResNet-50 is the workhorse (~25M parameters, strong accuracy). EfficientNet-B0 achieves similar accuracy with fewer parameters (~5.3M). MobileNet-V3 is the smallest and fastest, ideal for edge deployment. VGG-16 is the oldest but its simplicity makes it easy to understand (~138M parameters).
Inspecting a Pretrained Architecture
Before modifying a pretrained model, you need to understand its structure — specifically what the classifier head looks like so you can replace it. Every torchvision model ends with a classifier designed for ImageNet's 1,000 classes. You'll replace this with your own head targeting your number of classes.
import torch
from torchvision import models
from torchvision.models import ResNet50_Weights
# Load and inspect ResNet-50
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
# The final classifier is model.fc (fully connected)
print("ResNet-50 classifier head:")
print(model.fc)
# Output: Linear(in_features=2048, out_features=1000, bias=True)
# For EfficientNet, the classifier is model.classifier
from torchvision.models import EfficientNet_B0_Weights
effnet = models.efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
print("\nEfficientNet-B0 classifier head:")
print(effnet.classifier)
# Output: Sequential(Dropout, Linear(in_features=1280, out_features=1000))
# You can also see all named children (top-level modules)
print("\nResNet-50 top-level modules:")
for name, module in model.named_children():
num_params = sum(p.numel() for p in module.parameters())
print(f" {name}: {num_params:,} params")
This inspection reveals the key information you need: ResNet-50's classifier is model.fc with 2,048 input features, while EfficientNet's is model.classifier with 1,280 input features. You'll replace these to match your number of target classes.
The attribute name of the classifier head varies by architecture: .fc (ResNet, Inception), .classifier (DenseNet, EfficientNet, MobileNet, VGG), or .heads (torchvision Vision Transformers). Always inspect with print(model) or model.named_children() before modifying.
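As a concrete illustration, here's what head replacement looks like under two of these conventions (a short sketch; in_features is read from the existing layer rather than hard-coded):
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights, EfficientNet_B0_Weights
num_classes = 5
# ResNet family: the head lives at model.fc
resnet = models.resnet50(weights=ResNet50_Weights.DEFAULT)
resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
# EfficientNet family: the head is the last layer inside model.classifier
effnet = models.efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT)
effnet.classifier[-1] = nn.Linear(effnet.classifier[-1].in_features, num_classes)
print(resnet.fc, effnet.classifier[-1], sep="\n")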
Feature Extraction Pipeline
The Core Idea (Plain English)
Feature extraction is the simplest form of transfer learning: freeze everything the pretrained model knows, and only teach it your new classification task. It's like hiring a pre-trained photographer and only teaching them which photos to sort into which folder.
The Best Analogy: A Camera + New Lens
Think of a pretrained model as an expensive camera body:
- Backbone (frozen) = the camera body — already knows how to capture light, detect edges, recognize shapes
- Classifier head (trained) = the lens you swap out — changes what the camera "looks for" (cats vs dogs vs cars)
- Freeze = don't open the camera body — just change the lens on top
Result: you often get 90%+ accuracy with just a few hundred images and minutes of training, because the "camera" already knows what objects look like.
Ultra-compressed version:
# Feature extraction in 4 lines:
model = load_pretrained("resnet50") # 1. Get a trained model
freeze(model) # 2. Lock ALL weights
model.head = new_classifier(num_classes) # 3. Replace the final layer
train(model.head, my_small_dataset) # 4. Train ONLY the new head
Feature extraction is the simplest and safest form of transfer learning. The process has three steps: (1) load a pretrained model, (2) freeze all its parameters, and (3) replace the classifier head with a new one matching your number of classes. Only the new head gets trained — the rest of the model acts as a powerful, fixed feature extractor.
Load, Freeze, and Replace Classifier Head
Let's walk through a complete feature extraction pipeline using ResNet-50 to classify a custom dataset with 5 classes. We freeze the entire pretrained backbone, replace the 1,000-class head with a 5-class head, and set up the optimizer to only update the new head's parameters.
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
# Step 1: Load pretrained ResNet-50
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
# Step 2: Freeze ALL parameters
for param in model.parameters():
param.requires_grad = False
# Step 3: Replace the classifier head for 5 classes
num_classes = 5
num_features = model.fc.in_features # 2048
model.fc = nn.Sequential(
nn.Dropout(0.3),
nn.Linear(num_features, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, num_classes)
)
# New head parameters are trainable by default (requires_grad=True)
# Step 4: Set up optimizer — only train the new head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
# Verify: count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
frozen = total - trainable
print(f"Frozen: {frozen:,} params (backbone)")
print(f"Trainable: {trainable:,} params (new head)")
print(f"Ratio: {100*trainable/total:.2f}% trainable")
The new classifier head has roughly 526,000 trainable parameters — about 2% of the model — compared to the 23.5 million frozen backbone parameters. This makes training fast and far less prone to overfitting, even with tiny datasets.
Complete Training Loop for Feature Extraction
Here's a complete training loop demonstrating feature extraction in action. We generate synthetic data to keep the example self-contained, but in practice you'd use a real dataset with DataLoader. Notice how the model is set to eval() mode for the backbone (to freeze batch norm statistics) and we only optimize model.fc.parameters().
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
# Setup: pretrained ResNet-50 with frozen backbone
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
for param in model.parameters():
param.requires_grad = False
model.fc = nn.Sequential(
nn.Linear(2048, 256),
nn.ReLU(),
nn.Dropout(0.3),
nn.Linear(256, 5)
)
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Only optimize the new classifier head
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# Simulate training with synthetic data (replace with real DataLoader).
# Keep the frozen backbone in eval() so BatchNorm uses its pretrained
# running stats; put only the new head in train() to enable its Dropout.
model.eval()
model.fc.train()
for epoch in range(3):
# Synthetic batch: 8 RGB images of 224x224, 5 classes
images = torch.randn(8, 3, 224, 224).to(device)
labels = torch.randint(0, 5, (8,)).to(device)
optimizer.zero_grad()
outputs = model(images)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
_, predicted = outputs.max(1)
accuracy = (predicted == labels).float().mean()
print(f"Epoch {epoch+1}/3 — Loss: {loss.item():.4f}, Acc: {accuracy:.2%}")
Even with random synthetic data, the model trains quickly because only the small head is being updated. With real data, feature extraction typically converges in just a few epochs because the backbone features are already highly informative.
Fine-Tuning Pipeline
The Core Idea (Plain English)
Fine-tuning goes beyond feature extraction: instead of just teaching the model a new task, you let it adapt its existing knowledge to your specific domain. This is more powerful but riskier — you need to be careful not to "erase" what it already knows.
The Best Analogy: Retraining a Surgeon for a New Specialty
Think of fine-tuning as retraining a general surgeon to specialize in orthopedics:
- Basic skills (early layers) — anatomy, sterile technique, suturing — barely need updating (very low learning rate)
- General surgery skills (middle layers) — need some adaptation for bone work (medium learning rate)
- Specialty knowledge (final layers) — completely new procedures to learn (high learning rate)
Key principle: deeper layers change less, newer layers change more. This is called "discriminative learning rates" — the most important fine-tuning technique.
Ultra-compressed version:
# Fine-tuning = different learning rates for different depths
optimizer = Adam([
{"params": early_layers, "lr": 1e-5}, # barely change (general features)
{"params": middle_layers, "lr": 1e-4}, # adapt somewhat
{"params": classifier, "lr": 1e-3}, # change freely (task-specific)
])
When feature extraction doesn't give you enough accuracy — typically when your target domain differs significantly from ImageNet — fine-tuning lets you update the pretrained weights themselves. The key challenge is balancing adaptation (learning new features) against catastrophic forgetting (destroying useful pretrained features). The solution: use much smaller learning rates for pretrained layers than for the new head.
Discriminative Learning Rates
Discriminative learning rates (also called differential or per-layer learning rates) assign different learning rates to different parts of the model. The intuition is simple: early layers have learned general features that are already useful, so they should change slowly. Later layers and the new head need to adapt more, so they get higher learning rates. This technique was popularized by the ULMFiT paper and is now standard practice.
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
# Load pretrained model and replace head
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(2048, 5)
# Group parameters with different learning rates
# Early layers (conv1, bn1, layer1, layer2): very low LR
# Later layers (layer3, layer4): medium LR
# New classifier head (fc): high LR
param_groups = [
{"params": list(model.conv1.parameters()) +
list(model.bn1.parameters()) +
list(model.layer1.parameters()) +
list(model.layer2.parameters()),
"lr": 1e-5,
"name": "early_layers"},
{"params": list(model.layer3.parameters()) +
list(model.layer4.parameters()),
"lr": 1e-4,
"name": "later_layers"},
{"params": model.fc.parameters(),
"lr": 1e-3,
"name": "classifier_head"}
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)
# Print learning rates per group
for i, group in enumerate(optimizer.param_groups):
num_params = sum(p.numel() for p in group["params"])
print(f"Group '{group['name']}': LR={group['lr']}, Params={num_params:,}")
This creates a 100x difference between the early layers (1e-5) and the classifier head (1e-3). The early layers' edge and texture detectors change almost imperceptibly, while the classifier head learns rapidly. This gradient of learning rates is one of the most effective techniques for fine-tuning.
Gradual Unfreezing Strategy
Gradual unfreezing takes discriminative learning rates one step further. Instead of unfreezing everything at once, you start by training only the new head (feature extraction), then progressively unfreeze deeper layers epoch by epoch. This gives each layer a chance to adapt without disrupting the layers below it.
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
# Load pretrained model with new head
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(2048, 5)
# Start fully frozen except head
for param in model.parameters():
param.requires_grad = False
for param in model.fc.parameters():
param.requires_grad = True
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Define layer groups for gradual unfreezing (deepest to shallowest)
layer_groups = [
("layer4", model.layer4),
("layer3", model.layer3),
("layer2", model.layer2),
("layer1", model.layer1),
]
def unfreeze_group(group_name, group_module):
    """Unfreeze a layer group; the optimizer is rebuilt afterwards."""
    for param in group_module.parameters():
        param.requires_grad = True
    trainable = sum(p.numel() for p in group_module.parameters())
    print(f"  Unfroze {group_name}: {trainable:,} params")
# Simulate gradual unfreezing schedule
total_epochs = 10
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(total_epochs):
    # Unfreeze the next layer group every 2 epochs, deepest first
    # (layer4 at printed epoch 3, layer3 at epoch 5, and so on)
    group_idx = epoch // 2 - 1
    if epoch % 2 == 0 and 0 <= group_idx < len(layer_groups):
        name, module = layer_groups[group_idx]
        unfreeze_group(name, module)
        # Rebuild the optimizer over all trainable params; the low uniform LR
        # protects the newly unfrozen pretrained layers
        optimizer = torch.optim.Adam(
            filter(lambda p: p.requires_grad, model.parameters()),
            lr=1e-4
        )
# Synthetic training step
images = torch.randn(4, 3, 224, 224).to(device)
labels = torch.randint(0, 5, (4,)).to(device)
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Epoch {epoch+1}/{total_epochs} — Loss: {loss.item():.4f}, "
f"Trainable params: {trainable:,}")
This strategy is particularly effective when your target domain is very different from ImageNet. By unfreezing gradually, the model has time to stabilize after each unfreezing step, preventing the catastrophic forgetting that can happen when too many pretrained parameters change at once.
Hugging Face Transformers Integration
Transfer learning isn't limited to computer vision. The Hugging Face transformers library is the de facto standard for pretrained NLP models, providing thousands of models for text classification, question answering, translation, and generation. The AutoModel and AutoTokenizer classes make it trivial to load any pretrained model with just its name.
Loading a Pretrained BERT Model
The following example shows how to load a pretrained BERT model and its tokenizer. The tokenizer converts raw text into the token IDs that the model expects, handling subword splitting, special tokens ([CLS], [SEP]), and padding automatically. Think of the tokenizer as the translator between human-readable text and the model's numerical language.
from transformers import AutoTokenizer, AutoModel
import torch
# Load pretrained BERT and its tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Tokenize some text
text = "Transfer learning is incredibly powerful for NLP tasks."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Input IDs:", inputs["input_ids"].shape)
print("Tokens:", tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# Forward pass — get contextual embeddings
with torch.no_grad():
outputs = model(**inputs)
# outputs.last_hidden_state: [batch, seq_len, hidden_dim=768]
print(f"\nOutput shape: {outputs.last_hidden_state.shape}")
# The [CLS] token embedding (first token) is commonly used for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(f"[CLS] embedding shape: {cls_embedding.shape}") # [1, 768]
The [CLS] token's embedding captures a summary of the entire input sequence. In classification tasks, we typically add a linear layer on top of this embedding to predict class labels — exactly the same pattern as replacing ResNet's .fc layer, but for text instead of images.
Fine-Tuning BERT for Text Classification
Here's a complete example of building a text classifier on top of BERT. We wrap BERT in a custom module, add a classification head, and set up the training with discriminative learning rates — just like we did with ResNet. The BERT backbone gets a small learning rate to preserve its language knowledge, while the new classification head gets a higher rate.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
class BertClassifier(nn.Module):
"""BERT with a classification head for sentiment analysis."""
def __init__(self, model_name="bert-base-uncased", num_classes=3, dropout=0.3):
super().__init__()
self.bert = AutoModel.from_pretrained(model_name)
hidden_size = self.bert.config.hidden_size # 768 for bert-base
self.classifier = nn.Sequential(
nn.Dropout(dropout),
nn.Linear(hidden_size, 256),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(256, num_classes)
)
def forward(self, input_ids, attention_mask):
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
cls_output = outputs.last_hidden_state[:, 0, :] # [CLS] token
return self.classifier(cls_output)
# Create model
model = BertClassifier(num_classes=3)
# Discriminative learning rates
optimizer = torch.optim.AdamW([
{"params": model.bert.parameters(), "lr": 2e-5}, # BERT: low LR
{"params": model.classifier.parameters(), "lr": 1e-3} # Head: high LR
], weight_decay=0.01)
# Tokenize a sample batch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
texts = ["Great product!", "Terrible experience.", "It was okay."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
# Forward pass
model.eval()
with torch.no_grad():
logits = model(inputs["input_ids"], inputs["attention_mask"])
predictions = torch.argmax(logits, dim=1)
print(f"Logits shape: {logits.shape}") # [3, 3]
print(f"Predictions: {predictions.tolist()}") # e.g., [0, 2, 1]
This pattern — pretrained backbone with a custom classification head and discriminative learning rates — works for any Hugging Face model: DistilBERT, RoBERTa, ALBERT, GPT-2, and more. The key is always the same: low learning rate for pretrained weights, higher learning rate for new parameters.
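For example, swapping in DistilBERT takes a single change — only the checkpoint name differs, since AutoModel resolves the architecture (a sketch reusing the BertClassifier class above):
# Same wrapper, smaller/faster backbone — only the model name changes
distil = BertClassifier(model_name="distilbert-base-uncased", num_classes=3)
print(f"{sum(p.numel() for p in distil.parameters()):,} parameters")  # ~66M vs BERT's ~110M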
Saving and Loading Fine-Tuned Models
After fine-tuning, you need to save both the model weights and the tokenizer so you can reload the model later for inference. It's important to save the entire model state (including the custom classification head) and the tokenizer together.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
import os
# Define the same architecture
class BertClassifier(nn.Module):
def __init__(self, model_name="bert-base-uncased", num_classes=3, dropout=0.3):
super().__init__()
self.bert = AutoModel.from_pretrained(model_name)
hidden_size = self.bert.config.hidden_size
self.classifier = nn.Sequential(
nn.Dropout(dropout),
nn.Linear(hidden_size, 256),
nn.ReLU(),
nn.Dropout(dropout),
nn.Linear(256, num_classes)
)
def forward(self, input_ids, attention_mask):
outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
return self.classifier(outputs.last_hidden_state[:, 0, :])
# Create and "train" model (pretend we fine-tuned it)
model = BertClassifier(num_classes=3)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ---- Save ----
save_dir = "./my_finetuned_bert"
os.makedirs(save_dir, exist_ok=True)
torch.save(model.state_dict(), os.path.join(save_dir, "model.pt"))
tokenizer.save_pretrained(save_dir)
print(f"Model saved to {save_dir}/")
# ---- Load ----
loaded_model = BertClassifier(num_classes=3)
loaded_model.load_state_dict(torch.load(os.path.join(save_dir, "model.pt"),
weights_only=True))
loaded_model.eval()
loaded_tokenizer = AutoTokenizer.from_pretrained(save_dir)
# Verify it works
inputs = loaded_tokenizer("Transfer learning rocks!", return_tensors="pt",
padding=True, truncation=True)
with torch.no_grad():
logits = loaded_model(inputs["input_ids"], inputs["attention_mask"])
print(f"Loaded model prediction: class {logits.argmax(dim=1).item()}")
Saving both the state_dict and the tokenizer together ensures reproducibility. When loading, you recreate the same architecture first, then load the trained weights into it. This two-step process (define architecture → load weights) is the standard PyTorch pattern for model persistence.
Domain Adaptation
Domain adaptation is the art of transferring knowledge when your target task looks significantly different from the pretraining data. Classifying X-ray images with an ImageNet model, detecting defects on factory assembly lines, or analyzing satellite imagery — these all involve a large domain gap. The pretrained features are still useful (edges are edges everywhere), but the later layers need substantial adaptation.
Learning Rate Warmup
When fine-tuning on a very different domain, starting with a high learning rate can immediately destroy pretrained features before the model has a chance to adapt. Learning rate warmup solves this by starting with a near-zero learning rate and linearly increasing it over the first few hundred steps. This gives the model a gentle start, allowing the gradients to stabilize before applying full-strength updates.
import torch
import torch.nn as nn
import math
class WarmupCosineScheduler:
"""Linear warmup followed by cosine annealing."""
def __init__(self, optimizer, warmup_steps, total_steps):
self.optimizer = optimizer
self.warmup_steps = warmup_steps
self.total_steps = total_steps
self.base_lrs = [group["lr"] for group in optimizer.param_groups]
self.current_step = 0
def step(self):
self.current_step += 1
if self.current_step <= self.warmup_steps:
# Linear warmup: 0 -> base_lr
scale = self.current_step / self.warmup_steps
else:
# Cosine decay: base_lr -> 0
progress = (self.current_step - self.warmup_steps) / \
(self.total_steps - self.warmup_steps)
scale = 0.5 * (1 + math.cos(math.pi * progress))
for group, base_lr in zip(self.optimizer.param_groups, self.base_lrs):
group["lr"] = base_lr * scale
def get_lr(self):
return [group["lr"] for group in self.optimizer.param_groups]
# Demo: visualize the learning rate schedule
model = nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=100, total_steps=1000)
# Print LR at key points
for step in range(1000):
scheduler.step()
if step in [0, 49, 99, 250, 500, 750, 999]:
print(f"Step {step+1:4d}: LR = {scheduler.get_lr()[0]:.6f}")
The warmup phase (steps 1-100) gradually increases the learning rate from near-zero to the target value. After warmup, cosine annealing smoothly decreases the rate, allowing the model to make progressively finer adjustments. This schedule is used in almost every modern fine-tuning recipe, from BERT to GPT.
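If you're training with Hugging Face transformers anyway, the library ships an equivalent warmup-plus-cosine schedule, so the hand-rolled class above isn't required (a minimal sketch):
import torch
from transformers import get_cosine_schedule_with_warmup
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Same shape as the custom scheduler: 100 warmup steps, 1,000 total steps
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000)
for step in range(1000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()  # advance the LR schedule once per optimizer step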
Progressive Fine-Tuning Recipe
When adapting ImageNet models to highly specialized domains (medical imaging, satellite data, microscopy), follow this proven recipe (a code sketch follows the list):
- Phase 1 (5 epochs): Freeze backbone, train only the new head — establishes a good starting point.
- Phase 2 (5 epochs): Unfreeze last 2 residual blocks with LR=1e-5, keep head at LR=1e-3 — adapts high-level features.
- Phase 3 (10 epochs): Unfreeze all layers with warmup, LR=1e-6 for early layers, 1e-5 for later, 1e-4 for head — full adaptation with protection.
- Throughout: Use aggressive data augmentation and early stopping based on validation loss.
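A minimal sketch of how those phase transitions might be wired up, assuming a ResNet-50 with a new model.fc head as in the earlier examples (epoch counts and learning rates follow the recipe above; the inner training loop is elided):
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 5)
def configure_phase(model, phase):
    """Freeze/unfreeze parameters for the given phase and build its optimizer."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.fc.parameters():
        p.requires_grad = True
    if phase == 1:    # Phase 1: head only
        groups = [{"params": model.fc.parameters(), "lr": 1e-3}]
    elif phase == 2:  # Phase 2: + last two residual stages
        late = list(model.layer3.parameters()) + list(model.layer4.parameters())
        for p in late:
            p.requires_grad = True
        groups = [{"params": late, "lr": 1e-5},
                  {"params": model.fc.parameters(), "lr": 1e-3}]
    else:             # Phase 3: everything, with tiered LRs
        for p in model.parameters():
            p.requires_grad = True
        early = [model.conv1, model.bn1, model.layer1, model.layer2]
        groups = [{"params": [p for m in early for p in m.parameters()], "lr": 1e-6},
                  {"params": list(model.layer3.parameters()) +
                             list(model.layer4.parameters()), "lr": 1e-5},
                  {"params": model.fc.parameters(), "lr": 1e-4}]
    return torch.optim.AdamW(groups, weight_decay=1e-4)
for phase, epochs in [(1, 5), (2, 5), (3, 10)]:
    optimizer = configure_phase(model, phase)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Phase {phase}: {epochs} epochs, {trainable:,} trainable params")
    # ... run `epochs` epochs of training here (add LR warmup in phase 3) ...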
Data Augmentation for Transfer Learning
Data augmentation is your best friend when fine-tuning with small datasets. By applying random transformations (flips, rotations, color jitter, crops) to training images, you effectively multiply your dataset size and force the model to learn transformation-invariant features. This is especially important for transfer learning where your dataset may be 100-1,000x smaller than ImageNet.
Standard Augmentation Pipeline
When using ImageNet-pretrained models, you must normalize your images with ImageNet's mean and standard deviation. These values are baked into the pretrained weights — the model expects inputs centered around these statistics. Using different normalization values will produce garbage outputs because the activations will be outside the range the model was trained on.
from torchvision import transforms
# ImageNet normalization values — ALWAYS use these with ImageNet-pretrained models
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
# Training transforms: augmentation + normalization
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomRotation(15),
transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
transforms.ToTensor(),
transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
# Validation/Test transforms: NO augmentation, just resize + normalize
val_transform = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
print("Training pipeline:")
for i, t in enumerate(train_transform.transforms):
print(f" {i+1}. {t.__class__.__name__}")
print("\nValidation pipeline:")
for i, t in enumerate(val_transform.transforms):
print(f" {i+1}. {t.__class__.__name__}")
Notice the critical difference between training and validation transforms. Training uses random augmentation to create variety, while validation uses deterministic transforms (resize + center crop) to ensure consistent evaluation. Never apply random augmentation to validation or test data — you want repeatable, comparable results.
Forgetting ImageNet normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) when using ImageNet-pretrained models is one of the most common transfer learning bugs. The model will train but accuracy will be significantly lower because the activations are in the wrong range. Always normalize!
Using Weights Transforms (The Modern Way)
The torchvision Weights API provides the exact transforms used during pretraining, so you never have to remember normalization values manually. This is the safest approach because it guarantees your preprocessing matches what the model was trained with.
from torchvision import models, transforms
from torchvision.models import ResNet50_Weights, EfficientNet_B0_Weights
# The Weights object includes the exact preprocessing transforms
resnet_weights = ResNet50_Weights.DEFAULT
resnet_preprocess = resnet_weights.transforms()
print("ResNet-50 preprocessing:")
print(resnet_preprocess)
effnet_weights = EfficientNet_B0_Weights.DEFAULT
effnet_preprocess = effnet_weights.transforms()
print("\nEfficientNet-B0 preprocessing:")
print(effnet_preprocess)
# Use the weights transforms for validation, add augmentation for training
train_transform = transforms.Compose([
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ColorJitter(0.2, 0.2, 0.2),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
])
print("\nCustom training transform ready!")
print(f"Steps: {len(train_transform.transforms)}")
The weights.transforms() method returns the exact preprocessing pipeline (resize, crop, normalize) that the model expects for inference. For training, you add your own augmentation steps before the normalization. This eliminates any guesswork about the correct input format.
Practical Tips & Common Mistakes
Transfer learning is powerful but full of subtle pitfalls that can silently degrade performance. After years of practitioners sharing their failures and successes, a set of best practices has emerged. Here are the most important ones.
Common Pitfalls to Avoid
The following list summarizes the most frequent mistakes practitioners make with transfer learning, along with their solutions. If you're getting unexpectedly poor results, check these first.
Top Transfer Learning Mistakes
| Mistake | Symptom | Fix |
|---|---|---|
| LR too high for backbone | Accuracy spikes then crashes | Use 1e-5 for pretrained, 1e-3 for head |
| No ImageNet normalization | Model trains but poor accuracy | Apply mean/std normalization |
| Forgot to freeze BatchNorm | Unstable training with small batches | Use model.eval() or freeze BN layers |
| Too little augmentation | Overfitting on small datasets | Add RandomCrop, Flip, ColorJitter |
| Wrong classifier replacement | Shape mismatch errors | Check in_features before replacing |
| Training full model on tiny data | Massive overfitting | Use feature extraction, not fine-tuning |
Handling BatchNorm During Fine-Tuning
BatchNorm layers maintain running statistics (mean and variance) of the data they've seen during training. When fine-tuning with small batches, these statistics can be wildly inaccurate, causing poor performance. The fix is to freeze BatchNorm layers even when you're fine-tuning the rest of the model. Here's how to do it properly.
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(2048, 5)
def freeze_batchnorm(model):
"""Freeze all BatchNorm layers — keep pretrained running stats."""
for module in model.modules():
if isinstance(module, (nn.BatchNorm2d, nn.BatchNorm1d)):
module.eval() # Use running stats, not batch stats
for param in module.parameters():
param.requires_grad = False
# Apply after calling model.train() — train() flips BatchNorm layers back
# to training mode, so re-apply this every time you switch to train mode
freeze_batchnorm(model)
# Count frozen BN layers
bn_count = sum(1 for m in model.modules()
if isinstance(m, (nn.BatchNorm2d, nn.BatchNorm1d)))
bn_params = sum(p.numel() for m in model.modules()
if isinstance(m, (nn.BatchNorm2d, nn.BatchNorm1d))
for p in m.parameters())
print(f"Frozen {bn_count} BatchNorm layers ({bn_params:,} params)")
print(f"These layers will use pretrained running statistics")
By calling module.eval() on BatchNorm layers, they use their stored running statistics instead of computing new ones from the current batch. This is especially important when your batch size is small (e.g., 8-16 images for medical imaging), where batch statistics would be very noisy and unreliable.
Model Selection Guide
Choosing the right pretrained model depends on your constraints. Here's a quick reference for common scenarios, along with the code to compare model sizes and computational costs.
import torch
from torchvision import models
from torchvision.models import (
ResNet50_Weights, ResNet18_Weights,
EfficientNet_B0_Weights, MobileNet_V3_Small_Weights
)
# Compare popular architectures
architectures = {
"ResNet-18": models.resnet18(weights=ResNet18_Weights.DEFAULT),
"ResNet-50": models.resnet50(weights=ResNet50_Weights.DEFAULT),
"EfficientNet": models.efficientnet_b0(weights=EfficientNet_B0_Weights.DEFAULT),
"MobileNet-V3": models.mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.DEFAULT),
}
print(f"{'Model':<16} {'Params':>10} {'Size (MB)':>10}")
print("-" * 38)
for name, model in architectures.items():
params = sum(p.numel() for p in model.parameters())
size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6
print(f"{name:<16} {params:>10,} {size_mb:>9.1f}")
As a rule of thumb: start with ResNet-50 if you have a GPU and want the best accuracy-to-effort ratio. Use EfficientNet-B0 if you need similar accuracy with fewer parameters. Use MobileNet-V3 when deploying to mobile or edge devices. Use ResNet-18 for quick prototyping or when GPU memory is limited.
When NOT to Use Transfer Learning
Transfer learning isn't always the answer. Here are scenarios where training from scratch might actually be better.
- Your data is radically different from ImageNet/pretraining data (e.g., spectrograms, molecular structures, radar signals) — the pretrained features may be irrelevant.
- You have a massive dataset (1M+ samples) — you have enough data to learn good features from scratch, and task-specific features may outperform generic ones.
- Your input modality doesn't match — 1-channel grayscale, multi-spectral imagery, or 3D volumetric data may not benefit from RGB-trained models (though see the first-conv adaptation sketch after this list).
- Latency/size constraints are extreme — you might need a custom tiny architecture that's fundamentally different from standard pretrained models.
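That said, a modality mismatch is sometimes workable. A common trick for single-channel inputs (a sketch of one option, not a universal fix) is to rebuild the first convolution with one input channel, initializing it from the mean of the pretrained RGB kernels so the learned edge detectors survive:
import torch
import torch.nn as nn
from torchvision import models
from torchvision.models import ResNet50_Weights
model = models.resnet50(weights=ResNet50_Weights.DEFAULT)
old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
new_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Average the RGB kernels into a single grayscale kernel
    new_conv.weight.copy_(old_conv.weight.mean(dim=1, keepdim=True))
model.conv1 = new_conv
x = torch.randn(2, 1, 224, 224)  # e.g., a batch of grayscale X-ray images
print(model(x).shape)  # torch.Size([2, 1000]) — rest of the network unchanged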
Parameter-Efficient Fine-Tuning with LoRA
The Core Idea (Plain English)
Full fine-tuning updates millions of parameters. LoRA asks: "What if the change we need is actually very small and simple?" Instead of modifying the entire weight matrix, LoRA learns a tiny "patch" that gets added on top — like a Post-It note stuck to a textbook rather than rewriting the whole book.
The Best Analogy: A Guitar Effect Pedal
Think of LoRA as an effects pedal for a guitar amplifier:
- The amplifier (frozen model) — a powerful, expensive piece of equipment that stays untouched
- The effects pedal (LoRA matrices) — a tiny, cheap device that modifies the signal slightly
- Combining both — the original amp sound + a small learned adjustment = perfectly adapted tone
- Swap pedals — you can have different LoRA adapters for different tasks, all sharing the same frozen base model
Instead of training 16 million parameters (one full 4096×4096 weight matrix), you train just 65K (two small matrices). That's a 256× reduction per layer — small enough to fine-tune billion-parameter models on a single GPU.
Ultra-compressed version:
# Standard fine-tuning: update the ENTIRE weight matrix (expensive)
W_new = W_old + gradient_updates # millions of params changed
# LoRA: freeze W, learn a tiny low-rank "patch"
W_new = W_frozen + B @ A # B is [d×r], A is [r×d], r=8
# Only 2 × d × r trainable params instead of d × d
# For d=4096, r=8: 65K instead of 16.7M (256× smaller!)
Fine-tuning a full model (even just unfreezing a few layers) updates millions or billions of parameters. LoRA (Low-Rank Adaptation), introduced by Hu et al. (2021), offers a revolutionary alternative: freeze the entire pretrained model and inject tiny trainable low-rank matrices that adapt the model's behavior with 0.1–1% of the original parameters. This makes it possible to fine-tune billion-parameter models on a single GPU.
How LoRA Works
The key insight is that the weight updates during fine-tuning have low intrinsic rank — the change matrix $\Delta W$ can be approximated by the product of two much smaller matrices. Instead of updating the full weight matrix $W \in \mathbb{R}^{d \times d}$, LoRA decomposes the update into:
$$W' = W + \Delta W = W + BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with rank $r \ll d$ (typically $r = 4, 8, 16$). The original weight $W$ stays frozen — only $A$ and $B$ are trained. For a layer with $d = 4096$, the full weight has $4096^2 = 16.7M$ parameters, while LoRA with $r = 8$ has only $2 \times 4096 \times 8 = 65K$ parameters — a 256× reduction.
flowchart LR
A["Input x"] --> W["Frozen W\n(d × d)"]
A --> LA["A matrix\n(r × d)\ntrainable"]
LA --> LB["B matrix\n(d × r)\ntrainable"]
W --> ADD["+"]
LB --> ADD
ADD --> Y["Output"]
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
"""Linear layer with LoRA adaptation."""
def __init__(self, original_layer, rank=8, alpha=16):
super().__init__()
self.original = original_layer
self.original.weight.requires_grad = False # Freeze original weights
if self.original.bias is not None:
self.original.bias.requires_grad = False
d_in = original_layer.in_features
d_out = original_layer.out_features
# LoRA matrices: A (down-projection) and B (up-projection)
self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(d_out, rank)) # initialized to zero!
# Scaling factor (alpha / rank) controls the magnitude of LoRA's contribution
self.scaling = alpha / rank
def forward(self, x):
# Original frozen forward pass + LoRA adaptation
original_output = self.original(x)
lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return original_output + lora_output
# Demo: apply LoRA to a pretrained linear layer
pretrained_layer = nn.Linear(512, 512)
lora_layer = LoRALinear(pretrained_layer, rank=8, alpha=16)
# Count parameters
total_original = sum(p.numel() for p in pretrained_layer.parameters())
trainable_lora = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
frozen_lora = sum(p.numel() for p in lora_layer.parameters() if not p.requires_grad)
print(f"Original layer params: {total_original:,}")
print(f"LoRA trainable params: {trainable_lora:,}")
print(f"LoRA frozen params: {frozen_lora:,}")
print(f"Reduction: {total_original / trainable_lora:.1f}x fewer trainable params")
# Forward pass works identically at init (B is zeros, so LoRA contribution is 0)
x = torch.randn(4, 512)
original_out = pretrained_layer(x)
lora_out = lora_layer(x)
print(f"\nOutputs match at init: {torch.allclose(original_out, lora_out, atol=1e-6)}")
Applying LoRA to a Full Model
In practice, LoRA is typically applied to the attention layers (Q, K, V projections and output projection) of a Transformer. Here's how to replace all linear layers in a model with LoRA-adapted versions:
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
"""Linear layer with LoRA adaptation."""
def __init__(self, original_layer, rank=8, alpha=16):
super().__init__()
self.original = original_layer
self.original.weight.requires_grad = False
if self.original.bias is not None:
self.original.bias.requires_grad = False
d_in = original_layer.in_features
d_out = original_layer.out_features
self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(d_out, rank))
self.scaling = alpha / rank
def forward(self, x):
return self.original(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
def apply_lora(model, rank=8, alpha=16, target_modules=None):
"""Replace target Linear layers with LoRA-adapted versions."""
for name, module in model.named_modules():
if isinstance(module, nn.Linear):
# Only apply to specified modules (e.g., attention layers)
if target_modules and not any(t in name for t in target_modules):
continue
# Replace the module
parent_name = '.'.join(name.split('.')[:-1])
child_name = name.split('.')[-1]
parent = model.get_submodule(parent_name) if parent_name else model
setattr(parent, child_name, LoRALinear(module, rank=rank, alpha=alpha))
return model
# Example: simple Transformer-like model
class MiniTransformer(nn.Module):
def __init__(self, d_model=256, num_heads=4):
super().__init__()
self.attn_qkv = nn.Linear(d_model, 3 * d_model)
self.attn_out = nn.Linear(d_model, d_model)
self.ffn = nn.Sequential(
nn.Linear(d_model, d_model * 4),
nn.GELU(),
nn.Linear(d_model * 4, d_model),
)
def forward(self, x):
# Simplified (no actual attention computation)
qkv = self.attn_qkv(x)
out = self.attn_out(qkv[..., :256])
return self.ffn(out) + x
model = MiniTransformer(d_model=256)
total_before = sum(p.numel() for p in model.parameters())
# Apply LoRA only to attention layers
model = apply_lora(model, rank=8, target_modules=['attn_qkv', 'attn_out'])
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_after = sum(p.numel() for p in model.parameters())
frozen = total_after - trainable
print(f"Total parameters: {total_after:,}")
print(f"Trainable (LoRA): {trainable:,}")
print(f"Frozen (pretrained): {frozen:,}")
print(f"Trainable fraction: {trainable/total_after*100:.1f}%")
Use the peft library from Hugging Face for production LoRA. It handles targeting specific modules, merging weights for inference (eliminating the runtime overhead), and supports variants like QLoRA (LoRA + 4-bit quantization) that can fine-tune 65B-parameter models on a single 48GB GPU. Install with pip install peft and wrap any Hugging Face model with get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"])).
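A minimal peft sketch (assumes pip install peft transformers; note that target module names vary by architecture — BERT's attention projections are named query and value, while LLaMA-style models use q_proj/v_proj):
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model
base = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["query", "value"], task_type="SEQ_CLS")
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% trainable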
Conclusion & Next Steps
Transfer learning is arguably the most practical technique in modern deep learning. By reusing pretrained models, you can achieve state-of-the-art results with a fraction of the data, compute, and expertise that training from scratch requires. The key concepts to remember:
- Feature Extraction — freeze everything, train only the new head — best for small datasets similar to pretraining data
- Fine-Tuning — unfreeze some or all layers with discriminative learning rates — best when you need to adapt features to a different domain
- Discriminative LRs — lower learning rates for pretrained layers (1e-5), higher for new layers (1e-3) — prevents catastrophic forgetting
- Gradual Unfreezing — progressively unfreeze deeper layers during training — stabilizes learning for domain adaptation
- ImageNet Normalization — always apply mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] with ImageNet models
- Hugging Face — same principles apply to NLP with AutoModel/AutoTokenizer
Next in the Series
In Part 9: Deployment & Production, we'll take your trained models from notebooks to the real world — TorchScript, ONNX export, model optimization, serving with FastAPI, and deploying on edge devices.