What Is PyTorch?
PyTorch is an open-source deep learning framework developed by Meta AI (formerly Facebook AI Research). Since its release in 2016, it has become the dominant framework in research and is rapidly gaining ground in production environments. If you're learning deep learning today, PyTorch is the best place to start.
At its core, PyTorch provides two fundamental capabilities:
- Tensor computation — like NumPy, but with powerful GPU acceleration
- Automatic differentiation — a system called autograd that computes gradients for you, making neural network training almost effortless
What sets PyTorch apart is its dynamic computation graph (define-by-run): the graph is built as your code executes, so you can use ordinary Python control flow (if, for, while) directly in your models, making debugging intuitive with standard tools like print() and pdb.
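A tiny sketch of define-by-run in action: a plain Python loop and if statement decide the shape of the graph, and autograd still tracks gradients through whichever path actually ran.

```python
import torch

# Ordinary Python control flow builds the graph as it runs
x = torch.tensor(2.0, requires_grad=True)
y = x
for _ in range(3):       # plain Python loop
    if y < 100:          # plain Python if, evaluated eagerly
        y = y * y        # the graph grows differently per branch taken
print("y =", y.item())               # 256.0 (x squared three times: x**8)
y.backward()
print("dy/dx =", x.grad.item())      # 1024.0, i.e. 8·x⁷ at x = 2
```

Because each branch is evaluated eagerly, you could drop a print() or a breakpoint anywhere inside the loop and inspect intermediate tensors directly.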
PyTorch vs TensorFlow
Both are powerful frameworks, but they have different philosophies:
| Feature | PyTorch | TensorFlow |
|---|---|---|
| Graph type | Dynamic (define-by-run) | Static by default (eager mode optional) |
| Debugging | Standard Python debugger | Requires tf.debugging tools |
| Research adoption | ~75% of papers at NeurIPS/ICML | Declining in research |
| Production deployment | TorchServe, ONNX, TorchScript | TF Serving, TFLite, TF.js |
| API style | Pythonic, object-oriented | Functional (Keras) or low-level |
| Learning curve | Gentle — feels like NumPy | Steeper — multiple API layers |
The PyTorch Ecosystem
PyTorch isn't just a tensor library — it's an entire ecosystem of specialized tools:
flowchart TD
A["PyTorch Core
Tensors + Autograd"] --> B["torchvision
Image Models & Datasets"]
A --> C["torchaudio
Audio Processing"]
A --> D["torchtext
NLP Utilities"]
A --> E["torch.distributed
Multi-GPU Training"]
A --> F["TorchServe
Model Serving"]
A --> G["PyTorch Lightning
High-Level Training"]
A --> H["Hugging Face
Transformers & LLMs"]
A --> I["ONNX
Model Export"]
style A fill:#132440,stroke:#3B9797,color:#ffffff
style B fill:#16476A,stroke:#3B9797,color:#ffffff
style C fill:#16476A,stroke:#3B9797,color:#ffffff
style D fill:#16476A,stroke:#3B9797,color:#ffffff
style E fill:#16476A,stroke:#3B9797,color:#ffffff
style F fill:#3B9797,stroke:#132440,color:#ffffff
style G fill:#3B9797,stroke:#132440,color:#ffffff
style H fill:#3B9797,stroke:#132440,color:#ffffff
style I fill:#3B9797,stroke:#132440,color:#ffffff
Installation & Setup
PyTorch installation depends on your hardware. The official install selector at pytorch.org generates the exact command for your system.
CPU-Only Install
If you don't have an NVIDIA GPU, install the CPU-only version:
# CPU-only install (works on any machine)
pip install torch torchvision torchaudio
GPU Install (CUDA)
For NVIDIA GPU acceleration, specify your CUDA version:
# GPU install with CUDA 12.1 (check your CUDA version with: nvcc --version or nvidia-smi) Refer to https://pytorch.org/get-started/locally
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# GPU install with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Verifying Your Installation
Once installed, run this quick check to confirm everything works. It will report your PyTorch version, whether a GPU is available, and create a tiny test tensor:
import torch
# Check PyTorch version
print("PyTorch version:", torch.__version__)
# Check if CUDA (GPU) is available
print("CUDA available:", torch.cuda.is_available())
# If CUDA is available, print GPU info
if torch.cuda.is_available():
    print("GPU device:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)
    print("Number of GPUs:", torch.cuda.device_count())
else:
    print("Running on CPU — all code in this tutorial works on CPU too!")
# Quick tensor test
x = torch.tensor([1.0, 2.0, 3.0])
print("Test tensor:", x)
print("Tensor device:", x.device)
Tensor Fundamentals
Tensors are the fundamental data structure in PyTorch — think of them as multi-dimensional arrays on steroids. If you've used NumPy's ndarray, tensors will feel immediately familiar, but with two superpowers: GPU acceleration and automatic differentiation.
| Dimensions | Mathematical Name | Example |
|---|---|---|
| 0 | Scalar | A single number: 42 |
| 1 | Vector | A list of numbers: [1, 2, 3] |
| 2 | Matrix | A table of numbers: rows × columns |
| 3 | 3D Tensor | A cube: e.g., color image (H × W × C) |
| 4 | 4D Tensor | A batch of images (N × C × H × W) |
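The table maps directly onto code. A quick sketch creating one tensor per row and checking its dimensionality:

```python
import torch

# One tensor per row of the table, verified via .ndim and .shape
scalar = torch.tensor(42)               # 0-D: a single number
vector = torch.tensor([1, 2, 3])        # 1-D: a list of numbers
matrix = torch.ones(2, 3)               # 2-D: rows × columns
image = torch.zeros(224, 224, 3)        # 3-D: H × W × C
batch = torch.zeros(8, 3, 224, 224)     # 4-D: N × C × H × W
for t in (scalar, vector, matrix, image, batch):
    print(f"{t.ndim}-D tensor, shape {tuple(t.shape)}")
```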
Creating Tensors
PyTorch offers many factory functions for creating tensors. Let's explore the most common ones:
import torch
# From a Python list
a = torch.tensor([1, 2, 3, 4])
print("From list:", a) # tensor([1, 2, 3, 4])
# From a nested list (2D tensor / matrix)
b = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
print("Matrix shape:", b.shape) # torch.Size([2, 3])
# Scalar (0-dimensional tensor)
s = torch.tensor(3.14)
print("Scalar:", s) # tensor(3.1400)
print("Scalar value:", s.item()) # 3.14 — extracts Python number
Beyond converting Python data, PyTorch provides factory functions for creating tensors filled with specific values. These are the workhorses you will use to initialize weights, create masks, and set up placeholder arrays:
import torch
# Zeros and ones
zeros = torch.zeros(3, 4) # 3×4 matrix of zeros
ones = torch.ones(2, 3, 5) # 2×3×5 tensor of ones
print("Zeros shape:", zeros.shape) # torch.Size([3, 4])
print("Ones shape:", ones.shape) # torch.Size([2, 3, 5])
# Full — fill with any value
sevens = torch.full((2, 3), 7.0)
print("Full:\n", sevens)
# Identity matrix
eye = torch.eye(4)
print("Identity:\n", eye)
Random tensors are critical for neural network weight initialization. PyTorch supports several random distributions, each suited to different scenarios:
import torch
# Random tensors — essential for initializing neural network weights
uniform = torch.rand(3, 3) # Uniform [0, 1)
normal = torch.randn(3, 3) # Standard normal (mean=0, std=1)
randint = torch.randint(0, 10, (2, 4)) # Random integers in [0, 10)
print("Uniform:\n", uniform)
print("Normal:\n", normal)
print("Random ints:\n", randint)
# Random permutation — useful for shuffling data
perm = torch.randperm(8)
print("Permutation:", perm) # e.g., tensor([5, 2, 0, 7, 3, 1, 6, 4])
Finally, PyTorch includes sequence generators (analogous to Python's range() and NumPy's linspace()) and "like" functions that clone the shape and dtype of an existing tensor:
import torch
# Sequences — like Python's range() and NumPy's linspace()
arange = torch.arange(0, 10, 2) # Start, stop, step
linspace = torch.linspace(0, 1, 5) # Start, end, num_points
print("Arange:", arange) # tensor([0, 2, 4, 6, 8])
print("Linspace:", linspace) # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])
# Create a tensor with the same shape as another
template = torch.randn(3, 4)
zeros_like = torch.zeros_like(template)
ones_like = torch.ones_like(template)
rand_like = torch.rand_like(template)
print("zeros_like shape:", zeros_like.shape) # torch.Size([3, 4])
Data Types
PyTorch tensors have explicit data types. Choosing the right dtype affects memory usage and computation speed:
import torch
# Default dtypes
int_tensor = torch.tensor([1, 2, 3])
float_tensor = torch.tensor([1.0, 2.0, 3.0])
print("Int dtype:", int_tensor.dtype) # torch.int64
print("Float dtype:", float_tensor.dtype) # torch.float32
# Explicit dtype specification
x = torch.tensor([1.0, 2.0], dtype=torch.float16) # Half precision
y = torch.tensor([1.0, 2.0], dtype=torch.float32) # Single precision (default)
z = torch.tensor([1.0, 2.0], dtype=torch.float64) # Double precision
b = torch.tensor([True, False], dtype=torch.bool) # Boolean
print("float16:", x.dtype, "— uses", x.element_size(), "bytes per element")
print("float32:", y.dtype, "— uses", y.element_size(), "bytes per element")
print("float64:", z.dtype, "— uses", z.element_size(), "bytes per element")
# Casting between types
converted = int_tensor.float() # int64 → float32
print("Converted:", converted.dtype)
# Also works with .to()
as_half = float_tensor.to(torch.float16)
print("Half precision:", as_half)
Tip: stick with torch.float32 for training. Using float16 (half precision) saves memory and can be faster on modern GPUs, but requires careful handling of numerical stability. We'll cover mixed-precision training in Part 3.
NumPy Interoperability
PyTorch and NumPy can share memory, allowing zero-copy conversions. But this comes with a critical caveat:
import torch
import numpy as np
# NumPy → PyTorch (shares memory by default!)
np_array = np.array([1.0, 2.0, 3.0])
tensor_shared = torch.from_numpy(np_array)
print("Before modification:")
print(" NumPy:", np_array)
print(" Tensor:", tensor_shared)
# Modifying one changes the other! (shared memory)
np_array[0] = 999.0
print("\nAfter modifying NumPy array:")
print(" NumPy:", np_array) # [999. 2. 3.]
print(" Tensor:", tensor_shared) # tensor([999., 2., 3.]) — also changed!
# Safe conversion — use .clone() to break shared memory
np_array2 = np.array([10.0, 20.0, 30.0])
tensor_safe = torch.from_numpy(np_array2).clone()
np_array2[0] = -1.0
print("\nWith .clone() — independent copy:")
print(" NumPy:", np_array2) # [-1. 20. 30.]
print(" Tensor:", tensor_safe) # tensor([10., 20., 30.]) — unchanged!
Converting in the other direction — from a PyTorch tensor back to a NumPy array — uses the .numpy() method. The same shared-memory behavior applies (for CPU tensors), and GPU tensors must be moved to CPU first:
import torch
import numpy as np
# PyTorch → NumPy
tensor = torch.tensor([4.0, 5.0, 6.0])
np_from_tensor = tensor.numpy() # Shares memory (CPU tensors only)
print("Tensor → NumPy:", np_from_tensor)
print("Type:", type(np_from_tensor)) # numpy.ndarray
# GPU tensors must be moved to CPU first
# tensor_gpu.cpu().numpy() # This is the pattern for GPU tensors
Tensor Attributes
Every tensor carries metadata that tells you its shape, data type, which device it lives on, and whether gradients are being tracked. Inspecting these properties is how you debug shape mismatches — the single most common category of PyTorch errors:
import torch
t = torch.randn(3, 4, 5)
# Shape and dimensions
print("Shape:", t.shape) # torch.Size([3, 4, 5])
print("Size:", t.size()) # torch.Size([3, 4, 5]) — same as .shape
print("Dimensions:", t.ndim) # 3
print("Total elements:", t.numel()) # 60 (3 × 4 × 5)
# Data type and device
print("Dtype:", t.dtype) # torch.float32
print("Device:", t.device) # cpu
print("Requires grad:", t.requires_grad) # False
# Memory layout
print("Is contiguous:", t.is_contiguous()) # True
print("Stride:", t.stride()) # (20, 5, 1)
print("Element size:", t.element_size(), "bytes") # 4 bytes (float32)
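As a small illustration of stride and contiguity: transposing swaps the strides without moving any data, which is why the result stops being contiguous and why view() refuses it.

```python
import torch

# Transposing changes strides, not data, so the result is non-contiguous
m = torch.randn(3, 4)
print("Original stride:", m.stride())        # (4, 1)
mt = m.T
print("Transposed stride:", mt.stride())     # (1, 4)
print("Contiguous?", mt.is_contiguous())     # False
# view() requires contiguous memory; reshape() copies when it must
try:
    mt.view(12)
except RuntimeError:
    print("view() failed: call .contiguous() first, or use .reshape()")
print(mt.reshape(12).shape)                  # torch.Size([12])
```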
Tensor Operations
PyTorch supports hundreds of operations on tensors. Let's cover the most important ones you'll use every day.
Element-wise Operations
Element-wise operations apply the same function to every element independently. They are the building blocks of neural network computations — activation functions, scaling, and normalization are all element-wise under the hood:
import torch
a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])
# Arithmetic — operates element by element
print("Add:", a + b) # tensor([11., 22., 33., 44.])
print("Subtract:", a - b) # tensor([-9., -18., -27., -36.])
print("Multiply:", a * b) # tensor([10., 40., 90., 160.]) ← Hadamard product
print("Divide:", a / b) # tensor([0.1000, 0.1000, 0.1000, 0.1000])
print("Power:", a ** 2) # tensor([ 1., 4., 9., 16.])
# Mathematical functions
print("Sqrt:", torch.sqrt(a)) # tensor([1.0000, 1.4142, 1.7321, 2.0000])
print("Exp:", torch.exp(a)) # tensor([ 2.7183, 7.3891, 20.0855, 54.5981])
print("Log:", torch.log(a)) # tensor([0.0000, 0.6931, 1.0986, 1.3863])
print("Sin:", torch.sin(a)) # tensor([0.8415, 0.9093, 0.1411, -0.7568])
print("Abs:", torch.abs(torch.tensor([-3.0, -1.0, 2.0]))) # tensor([3., 1., 2.])
In-Place Operations: The Underscore Convention
PyTorch uses a trailing underscore _ convention for in-place operations. These modify the tensor directly instead of creating a new one, saving memory — but they can break autograd computation graphs, so use them carefully.
import torch
x = torch.tensor([1.0, 2.0, 3.0])
print("Original:", x)
# In-place addition (modifies x directly)
x.add_(10)
print("After add_(10):", x) # tensor([11., 12., 13.])
# In-place multiplication
x.mul_(2)
print("After mul_(2):", x) # tensor([22., 24., 26.])
# In-place clamp
x.clamp_(min=23, max=25)
print("After clamp_:", x) # tensor([23., 24., 25.])
# Comparison: out-of-place creates a NEW tensor
y = torch.tensor([1.0, 2.0, 3.0])
z = y.add(10) # y is unchanged, z is new
print("y unchanged:", y) # tensor([1., 2., 3.])
print("z is new:", z) # tensor([11., 12., 13.])
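To see why caution is warranted, here is a small sketch of an in-place edit breaking a backward pass. exp() saves its output for gradient computation, and mul_() invalidates that saved value, so autograd refuses to proceed:

```python
import torch

# Why in-place ops can break autograd: exp() reuses its own output in
# the backward pass, and an in-place edit corrupts that saved tensor.
x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)       # autograd keeps y around for the backward pass
y.mul_(2)              # in-place edit bumps y's version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print("Autograd error:", e)
```

PyTorch detects this via a version counter on each tensor: if a saved tensor was modified after being recorded, backward() raises rather than silently computing wrong gradients.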
Matrix Multiplication
Matrix multiplication is the backbone of neural networks. PyTorch provides multiple ways to perform it:
import torch
A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0],
                  [7.0, 8.0]])
# All three are equivalent for 2D matrix multiplication
result1 = A @ B # Python operator (preferred)
result2 = torch.matmul(A, B) # Function form
result3 = torch.mm(A, B) # Explicit 2D matrix multiply
print("A @ B:\n", result1)
# tensor([[19., 22.],
# [43., 50.]])
# Verify all methods give the same result
print("All equal:", torch.equal(result1, result2) and torch.equal(result2, result3))
# Matrix-vector multiplication
v = torch.tensor([1.0, 2.0])
print("A @ v:", A @ v) # tensor([ 5., 11.])
# Dot product (1D vectors only)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
print("Dot product:", torch.dot(x, y)) # tensor(32.) — 1*4 + 2*5 + 3*6
In real models, you rarely multiply single matrices. Batch matrix multiplication applies the same operation across many matrices in parallel — this is how attention heads work in Transformers:
import torch
# Batch matrix multiplication — essential for transformer models
# Shape: (batch_size, rows, cols)
batch_A = torch.randn(8, 3, 4) # 8 matrices, each 3×4
batch_B = torch.randn(8, 4, 5) # 8 matrices, each 4×5
batch_result = torch.bmm(batch_A, batch_B)
print("Batch matmul shape:", batch_result.shape) # torch.Size([8, 3, 5])
# torch.matmul handles batches automatically too
batch_result2 = torch.matmul(batch_A, batch_B)
print("Same result:", torch.allclose(batch_result, batch_result2)) # True
Concatenation & Stacking
You will frequently need to combine tensors. torch.cat() joins tensors along an existing dimension, while torch.stack() creates a new dimension. The difference matters when you're assembling batches, merging feature maps, or averaging predictions from an ensemble:
import torch
a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])
# cat — joins along an EXISTING dimension
cat_rows = torch.cat([a, b], dim=0) # Stack vertically
cat_cols = torch.cat([a, b], dim=1) # Stack horizontally
print("Cat dim=0 (rows):\n", cat_rows) # Shape: [4, 2]
print("Cat dim=1 (cols):\n", cat_cols) # Shape: [2, 4]
# stack — joins along a NEW dimension
stacked = torch.stack([a, b], dim=0)
print("Stack dim=0:\n", stacked) # Shape: [2, 2, 2]
print("Stack shape:", stacked.shape)
# Practical example: stacking predictions from multiple models
pred1 = torch.tensor([0.9, 0.1])
pred2 = torch.tensor([0.8, 0.2])
pred3 = torch.tensor([0.7, 0.3])
ensemble = torch.stack([pred1, pred2, pred3])
avg_pred = ensemble.mean(dim=0)
print("Ensemble average:", avg_pred) # Average across 3 models
Broadcasting
Broadcasting automatically expands tensor dimensions for element-wise operations between tensors of different shapes, following NumPy's broadcasting rules:
import torch
# Scalar broadcast
a = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print("a + 10:\n", a + 10) # Scalar 10 broadcasts to every element
# Vector broadcast (add bias to each row)
bias = torch.tensor([100.0, 200.0, 300.0]) # Shape: [3]
print("a + bias:\n", a + bias)
# tensor([[101., 202., 303.],
# [104., 205., 306.]])
# Column vector broadcast (add per-row scaling)
scale = torch.tensor([[10.0],
                      [20.0]]) # Shape: [2, 1]
print("a * scale:\n", a * scale)
# tensor([[ 10., 20., 30.],
# [ 80., 100., 120.]])
# Broadcasting rules:
# 1. Align dimensions from the right
# 2. Dimensions must be equal, or one must be 1
# 3. Missing dimensions are treated as 1
print("\nBroadcasting shapes:")
print(" [2, 3] + [3] → [2, 3]") # bias added to each row
print(" [2, 3] * [2, 1] → [2, 3]") # scale applied per-row
print(" [4, 1] + [1, 5] → [4, 5]") # outer product-like
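A short sketch of rule 2 failing and being fixed: shapes [2, 3] and [2] cannot broadcast (3 vs 2 after right-alignment), but reshaping the second operand to [2, 1] makes them compatible:

```python
import torch

# Incompatible trailing dimensions make broadcasting fail loudly
a = torch.zeros(2, 3)
b = torch.zeros(2)
try:
    _ = a + b          # aligns [2, 3] with [2]: 3 vs 2, no dim is 1
except RuntimeError as e:
    print("Broadcast error:", e)
# Fix: make b a column vector [2, 1] so it broadcasts per-row
print((a + b.unsqueeze(1)).shape)   # torch.Size([2, 3])
```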
Reshaping & Indexing
View vs Reshape
Both view() and reshape() change the shape of a tensor without changing its data. The key difference is how they handle contiguous memory:
import torch
x = torch.arange(12)
print("Original:", x) # tensor([ 0, 1, 2, ..., 11])
print("Shape:", x.shape) # torch.Size([12])
# view — requires contiguous memory, returns a VIEW (shared data)
v = x.view(3, 4)
print("View (3×4):\n", v)
# reshape — works even on non-contiguous tensors (may copy data)
r = x.reshape(4, 3)
print("Reshape (4×3):\n", r)
# Use -1 to let PyTorch infer one dimension
auto = x.view(2, -1) # -1 becomes 6 (12 / 2)
print("Auto shape:", auto.shape) # torch.Size([2, 6])
# view shares memory — modifying one changes the other
v[0, 0] = 999
print("After modifying view, original:", x[0]) # 999
# flatten — shorthand for reshaping to 1D
matrix = torch.randn(3, 4)
flat = matrix.flatten()
print("Flattened:", flat.shape) # torch.Size([12])
Transpose & Permute
.T and .transpose() swap two dimensions, while .permute() lets you reorder all dimensions at once. Permute is especially important for image data: most image libraries store pixels as (Height, Width, Channels), but PyTorch convolutions expect (Channels, Height, Width):
import torch
# Transpose — swap two dimensions
m = torch.tensor([[1, 2, 3],
                  [4, 5, 6]])
print("Original shape:", m.shape) # [2, 3]
print("Transposed:\n", m.T) # [3, 2]
print("Transpose:\n", m.transpose(0, 1)) # Same as .T for 2D
# Permute — reorder ALL dimensions (essential for image data)
# Image: (batch, height, width, channels) → (batch, channels, height, width)
img = torch.randn(8, 224, 224, 3) # 8 images, 224×224, RGB
print("Original (NHWC):", img.shape) # [8, 224, 224, 3]
img_pytorch = img.permute(0, 3, 1, 2) # Move channels to dim 1
print("PyTorch (NCHW):", img_pytorch.shape) # [8, 3, 224, 224]
Squeeze & Unsqueeze
unsqueeze() inserts a size-1 dimension at a specified position, while squeeze() removes size-1 dimensions. These are used constantly to add or remove batch dimensions when switching between single-sample inference and batch processing:
import torch
# unsqueeze — add a dimension of size 1
x = torch.tensor([1, 2, 3])
print("Original:", x.shape) # [3]
print("Unsqueeze(0):", x.unsqueeze(0).shape) # [1, 3] — row vector
print("Unsqueeze(1):", x.unsqueeze(1).shape) # [3, 1] — column vector
# squeeze — remove dimensions of size 1
y = torch.randn(1, 3, 1, 4)
print("Before squeeze:", y.shape) # [1, 3, 1, 4]
print("Squeeze all:", y.squeeze().shape) # [3, 4]
print("Squeeze dim 0:", y.squeeze(0).shape) # [3, 1, 4]
# Common pattern: add batch dimension for single inference
single_image = torch.randn(3, 224, 224) # C × H × W
batched = single_image.unsqueeze(0) # 1 × C × H × W
print("Single image:", single_image.shape) # [3, 224, 224]
print("Batched:", batched.shape) # [1, 3, 224, 224]
Advanced Indexing
PyTorch supports the same slicing syntax as NumPy, plus boolean masking and fancy indexing with integer lists. Masking is how you filter data by condition (e.g., keeping only predictions above a threshold), and fancy indexing is how DataLoaders select random samples from a dataset:
import torch
t = torch.tensor([[10, 20, 30],
                  [40, 50, 60],
                  [70, 80, 90]])
# Basic slicing (same as NumPy)
print("Row 0:", t[0]) # tensor([10, 20, 30])
print("Col 1:", t[:, 1]) # tensor([20, 50, 80])
print("Block:", t[0:2, 1:3]) # tensor([[20, 30], [50, 60]])
# Boolean (mask) indexing — filter elements
scores = torch.tensor([85.0, 92.0, 67.0, 78.0, 95.0])
passed = scores > 80
print("Mask:", passed) # tensor([True, True, False, False, True])
print("Passing scores:", scores[passed]) # tensor([85., 92., 95.])
# Fancy indexing with lists
indices = [0, 2, 4]
print("Selected:", scores[indices]) # tensor([85., 67., 95.])
# where — conditional selection
result = torch.where(scores > 80, scores, torch.tensor(0.0))
print("Where result:", result) # tensor([85., 92., 0., 0., 95.])
Autograd: Automatic Differentiation
Autograd is PyTorch's automatic differentiation engine — the magic that makes training neural networks possible. Instead of manually computing derivatives (gradients), you define the forward computation and PyTorch automatically computes all gradients for you via reverse-mode differentiation (backpropagation).
Here's how it works conceptually:
flowchart LR
X["x
(input)"] --> MUL["×"]
W["w
(weight)
requires_grad=True"] --> MUL
MUL --> ADD["+"]
B["b
(bias)
requires_grad=True"] --> ADD
ADD --> Y["y = wx + b
(output)"]
Y --> LOSS["loss = (y - target)²"]
LOSS -->|".backward()"| GW["∂loss/∂w"]
LOSS -->|".backward()"| GB["∂loss/∂b"]
style X fill:#3B9797,stroke:#132440,color:#fff
style W fill:#BF092F,stroke:#132440,color:#fff
style B fill:#BF092F,stroke:#132440,color:#fff
style MUL fill:#132440,stroke:#3B9797,color:#fff
style ADD fill:#132440,stroke:#3B9797,color:#fff
style Y fill:#16476A,stroke:#3B9797,color:#fff
style LOSS fill:#16476A,stroke:#3B9797,color:#fff
style GW fill:#3B9797,stroke:#132440,color:#fff
style GB fill:#3B9797,stroke:#132440,color:#fff
When you call .backward(), PyTorch walks this graph in reverse, computing gradients using the chain rule at each node.
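The flowchart translates to just a few lines of code, with gradients that match the chain rule worked by hand:

```python
import torch

# The diagram in code: one weight, one bias, squared-error loss
x = torch.tensor(2.0)                      # input (no gradient needed)
w = torch.tensor(3.0, requires_grad=True)  # weight
b = torch.tensor(1.0, requires_grad=True)  # bias
target = torch.tensor(10.0)
y = w * x + b                              # forward: y = 3·2 + 1 = 7
loss = (y - target) ** 2                   # loss = (7 − 10)² = 9
loss.backward()                            # reverse walk of the graph
print("∂loss/∂w =", w.grad.item())         # 2(y − t)·x = −12
print("∂loss/∂b =", b.grad.item())         # 2(y − t)   = −6
```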
Backward Pass & Gradients
The .backward() method triggers backpropagation — it walks through the computation graph in reverse and fills in the .grad attribute of every leaf tensor that has requires_grad=True. Let’s see this with a simple polynomial and then a multi-variable linear model:
import torch
# Simple example: y = x² + 3x + 1, find dy/dx at x=2
x = torch.tensor(2.0, requires_grad=True)
# Forward pass — builds the computation graph
y = x**2 + 3*x + 1
print("y =", y.item()) # y = 2² + 3(2) + 1 = 11
# Backward pass — computes gradients
y.backward()
# dy/dx = 2x + 3, at x=2 → dy/dx = 7
print("dy/dx =", x.grad.item()) # 7.0
Autograd shines with multi-variable functions too. In the next example, we compute partial derivatives of a loss with respect to multiple parameters — exactly what happens inside every neural network during training:
import torch
# Multi-variable example: linear regression forward pass
# y = w1*x1 + w2*x2 + b
w1 = torch.tensor(3.0, requires_grad=True)
w2 = torch.tensor(-2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x1, x2 = torch.tensor(4.0), torch.tensor(5.0)
target = torch.tensor(10.0)
# Forward pass
y = w1 * x1 + w2 * x2 + b # 3*4 + (-2)*5 + 1 = 3
loss = (y - target) ** 2 # (3 - 10)² = 49
print(f"Prediction: {y.item():.1f}")
print(f"Loss: {loss.item():.1f}")
# Backward pass
loss.backward()
# Gradients: ∂loss/∂w = 2(y - target) * ∂y/∂w
print(f"∂loss/∂w1 = {w1.grad.item():.1f}") # 2*(3-10)*4 = -56
print(f"∂loss/∂w2 = {w2.grad.item():.1f}") # 2*(3-10)*5 = -70
print(f"∂loss/∂b = {b.grad.item():.1f}") # 2*(3-10)*1 = -14
Gradient Control
During inference (prediction time) you don't need gradients — disabling them saves memory and speeds up computation. PyTorch provides torch.no_grad() as a context manager and .detach() for individual tensors:
import torch
# torch.no_grad() — disable gradient tracking (for inference)
w = torch.tensor(5.0, requires_grad=True)
# During training: gradients tracked
y = w * 3
print("Grad fn:", y.grad_fn) # <MulBackward0>
# During inference: no gradients needed (saves memory & computation)
with torch.no_grad():
    y_infer = w * 3
print("Grad fn:", y_infer.grad_fn) # None
print("Requires grad:", y_infer.requires_grad) # False
The .detach() method removes a single tensor from the computation graph, and there is a critical gotcha about gradient accumulation — PyTorch adds new gradients to old ones by default instead of replacing them. You must manually zero gradients between training iterations:
import torch
# detach() — remove a tensor from the computation graph
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2
# Detach y from the graph — useful for using model outputs as inputs elsewhere
y_detached = y.detach()
print("y requires_grad:", y.requires_grad) # True
print("y_detached requires_grad:", y_detached.requires_grad) # False
# CRITICAL: Gradients accumulate! You must zero them between iterations
w = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    loss = (w * 3) ** 2
    loss.backward()
    print(f"Iteration {i}: w.grad = {w.grad.item()}")
# Without zeroing: 36, 72, 108 (accumulates!)
# Fix: zero gradients before each backward pass
w = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    if w.grad is not None:
        w.grad.zero_() # Zero the gradient!
    loss = (w * 3) ** 2
    loss.backward()
    print(f"Iteration {i} (zeroed): w.grad = {w.grad.item()}")
# Correct: 36, 36, 36
In a real training loop, you'll call optimizer.zero_grad() before loss.backward() to do this for every parameter at once. We'll formalize this in Part 3.
Higher-Order Gradients
Sometimes you need the gradient of a gradient — the second derivative (Hessian). Pass create_graph=True to torch.autograd.grad() so the first gradient itself becomes a differentiable computation:
import torch
# Computing second derivatives (Hessian elements)
x = torch.tensor(3.0, requires_grad=True)
# f(x) = x³ → f'(x) = 3x² → f''(x) = 6x
y = x ** 3
# First derivative: create_graph=True keeps the graph for further differentiation
grad1 = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"f(3) = {y.item()}") # 27
print(f"f'(3) = {grad1.item()}") # 27 (3 * 3²)
# Second derivative
grad2 = torch.autograd.grad(grad1, x)[0]
print(f"f''(3) = {grad2.item()}") # 18 (6 * 3)
GPU Acceleration
GPUs can perform thousands of parallel computations simultaneously, making them ideal for the matrix operations at the heart of deep learning. A single modern GPU can speed up training by 10-100× compared to a CPU.
CUDA Basics
CUDA is NVIDIA's parallel computing platform. PyTorch uses it transparently — you just tell tensors which device to live on. Let's start by checking what hardware is available:
import torch
# Check GPU availability
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
# Set default device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
# Create tensor directly on GPU
if torch.cuda.is_available():
    gpu_tensor = torch.randn(3, 3, device="cuda")
    print("GPU tensor device:", gpu_tensor.device)
else:
    cpu_tensor = torch.randn(3, 3)
    print("CPU tensor device:", cpu_tensor.device)
To move existing tensors between CPU and GPU, use .to(device). One critical rule: all tensors in an operation must be on the same device. If you try to multiply a CPU tensor with a GPU tensor, PyTorch will raise a RuntimeError:
import torch
# Moving tensors between devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Create on CPU, then move to GPU
x = torch.randn(1000, 1000)
print("Before .to():", x.device) # cpu
x_device = x.to(device)
print("After .to():", x_device.device) # cuda:0 (if GPU available)
# Operations between tensors MUST be on the same device
y = torch.randn(1000, 1000, device=device)
# This works — both on same device
result = x_device @ y
print("Result device:", result.device)
# Moving back to CPU (e.g., for NumPy conversion or plotting)
result_cpu = result.cpu()
print("Back on CPU:", result_cpu.device)
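For completeness, here is a small sketch of the mismatch error itself. It can only be reproduced when a GPU is actually present, so the example guards on availability:

```python
import torch

# The device-mismatch RuntimeError, reproducible only with a GPU
if torch.cuda.is_available():
    cpu_t = torch.randn(2, 2)                 # lives on CPU
    gpu_t = torch.randn(2, 2, device="cuda")  # lives on GPU
    try:
        _ = cpu_t @ gpu_t                     # cross-device op
    except RuntimeError as e:
        print("RuntimeError:", e)
else:
    print("CPU-only machine: every tensor already shares one device.")
```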
Device-Agnostic Code
The best practice is to write code that works on both CPU and GPU without changes:
import torch
# Device-agnostic pattern — use this in all your projects
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")
# Create data on the right device
X = torch.randn(64, 10, device=device) # Features
W = torch.randn(10, 5, device=device) # Weights
b = torch.zeros(5, device=device) # Bias
# Forward pass — works on CPU or GPU
output = X @ W + b
print("Output shape:", output.shape) # [64, 5]
print("Output device:", output.device)
# When saving results or converting to NumPy
result_np = output.detach().cpu().numpy()
print("NumPy result shape:", result_np.shape)
GPU Memory Management
GPU memory is limited (typically 8-24 GB on consumer cards). When training large models, you need to monitor and manage it. PyTorch caches memory for reuse, so memory_reserved may be larger than memory_allocated. Use torch.cuda.empty_cache() to release unused cached memory so other GPU applications can use it:
import torch
if torch.cuda.is_available():
    # Check GPU memory
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e6:.1f} MB")
    # Allocate a large tensor
    big = torch.randn(10000, 10000, device="cuda")
    print(f"\nAfter allocation: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
    # Free memory
    del big
    torch.cuda.empty_cache()
    print(f"After cleanup: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
else:
    print("GPU memory management is only relevant when CUDA is available.")
    print("All code in this tutorial works perfectly on CPU.")
CPU vs GPU Benchmark
Matrix Multiplication: CPU vs GPU Speed Comparison
This benchmark demonstrates the massive speedup GPUs provide for large matrix operations. On a typical NVIDIA GPU, you'll see 10-50× faster computation for large matrices.
import torch
import time
def benchmark_matmul(size, device, num_runs=10):
    """Benchmark matrix multiplication on a given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    # Warm-up run
    _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()
    # Timed runs
    start = time.time()
    for _ in range(num_runs):
        _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.time() - start) / num_runs
    return elapsed
# Run benchmark
sizes = [256, 1024, 4096]
cpu_device = torch.device("cpu")
print(f"{'Size':>6} | {'CPU (ms)':>10} | {'GPU (ms)':>10} | {'Speedup':>8}")
print("-" * 50)
for size in sizes:
    cpu_time = benchmark_matmul(size, cpu_device) * 1000
    if torch.cuda.is_available():
        gpu_device = torch.device("cuda")
        gpu_time = benchmark_matmul(size, gpu_device) * 1000
        speedup = cpu_time / gpu_time
        print(f"{size:>6} | {cpu_time:>10.2f} | {gpu_time:>10.2f} | {speedup:>7.1f}×")
    else:
        print(f"{size:>6} | {cpu_time:>10.2f} | {'N/A':>10} | {'N/A':>8}")
Exercises
Test your understanding with these challenges. Try to solve them before looking at the solutions:
Exercise 1: Build a 5×5 multiplication table using torch.arange and broadcasting.
import torch
# Solution: Multiplication table using broadcasting
rows = torch.arange(1, 6).unsqueeze(1) # Column vector [5, 1]
cols = torch.arange(1, 6).unsqueeze(0) # Row vector [1, 5]
table = rows * cols # Broadcasting → [5, 5]
print("Multiplication table:\n", table)
Exercise 2: Use autograd to compute the derivative of f(x) = sin(x²) at x = π, then verify the result by hand.
import torch
import math
# Solution: Gradient of sin(x²) at x = π
x = torch.tensor(math.pi, requires_grad=True)
y = torch.sin(x ** 2)
y.backward()
print(f"f(π) = sin(π²) = {y.item():.6f}")
print(f"f'(π) = 2π·cos(π²) = {x.grad.item():.6f}")
# Manual verification
manual = 2 * math.pi * math.cos(math.pi ** 2)
print(f"Manual calculation: {manual:.6f}")
print(f"Match: {abs(x.grad.item() - manual) < 1e-5}")
Exercise 3: Multiply two random 3×3 matrices, transpose and flatten the result, then find the index of its maximum value.
import torch
# Solution: Chaining tensor operations
torch.manual_seed(42) # For reproducibility
A = torch.randn(3, 3)
B = torch.randn(3, 3)
result = (A @ B).T.flatten()
max_idx = result.argmax()
print("A:\n", A)
print("B:\n", B)
print("(A @ B)^T flattened:", result)
print(f"Max value: {result[max_idx]:.4f} at index {max_idx.item()}")
Conclusion & Next Steps
You've built a solid foundation in PyTorch's three pillars:
- Tensors — the multi-dimensional arrays that hold all data and model parameters
- Autograd — the automatic differentiation engine that computes gradients for backpropagation
- GPU acceleration — the device management system for high-performance computation
These three concepts underpin everything in PyTorch. Every neural network, every training loop, every deployment — they all start here.
Key habits to carry forward: use .clone() when converting from NumPy to avoid shared-memory bugs, prefer @ for matrix multiplication, and use torch.no_grad() during inference.
Next in the Series
In Part 2: Building Neural Networks, we'll use these tensor and autograd skills to construct neural networks with nn.Module, define layers, activation functions, and loss functions — and see how PyTorch's object-oriented design makes building complex architectures intuitive.