
Part 1: Tensors, Autograd & GPU Fundamentals

May 3, 2026 · Wasil Zafar · 35 min read

Build a rock-solid foundation in PyTorch — from creating and manipulating tensors, to harnessing automatic differentiation with autograd, to accelerating everything on the GPU.

Table of Contents

  1. What Is PyTorch?
  2. Installation & Setup
  3. Tensor Fundamentals
  4. Tensor Operations
  5. Reshaping & Indexing
  6. Autograd: Automatic Differentiation
  7. GPU Acceleration
  8. Exercises
  9. Conclusion & Next Steps

What Is PyTorch?

PyTorch is an open-source deep learning framework developed by Meta AI (formerly Facebook AI Research). Since its release in 2016, it has become the dominant framework in research and is rapidly gaining ground in production environments. If you're learning deep learning today, PyTorch is the best place to start.

At its core, PyTorch provides two fundamental capabilities:

  • Tensor computation — like NumPy, but with powerful GPU acceleration
  • Automatic differentiation — a system called autograd that computes gradients for you, making neural network training almost effortless
Key Insight: PyTorch uses dynamic computation graphs (define-by-run). Unlike static graph frameworks, you build the computation graph on-the-fly as your code executes — just like regular Python. This means you can use standard Python control flow (if, for, while) directly in your models, making debugging intuitive with standard tools like print() and pdb.
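
To see define-by-run in action, here is a minimal sketch (the values are chosen purely for illustration): an ordinary Python loop and if statement participate directly in a differentiable computation, and autograd still recovers the correct gradient.

import torch

x = torch.tensor(2.0, requires_grad=True)

y = x
for _ in range(3):       # plain Python loop
    if y < 100:          # plain Python branch, evaluated eagerly
        y = y * x        # the graph grows as each line executes

# After the loop: y = x**4, so dy/dx = 4x³ = 32 at x = 2
y.backward()
print(x.grad)            # tensor(32.)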

PyTorch vs TensorFlow

Both are powerful frameworks, but they have different philosophies:

| Feature | PyTorch | TensorFlow |
| --- | --- | --- |
| Graph type | Dynamic (define-by-run) | Static in TF 1.x; eager by default since TF 2.x |
| Debugging | Standard Python debugger | Standard tools in eager mode; tf.debugging for graph mode |
| Research adoption | ~75% of papers at NeurIPS/ICML | Declining in research |
| Production deployment | TorchServe, ONNX, TorchScript | TF Serving, TFLite, TF.js |
| API style | Pythonic, object-oriented | Functional (Keras) or low-level |
| Learning curve | Gentle — feels like NumPy | Steeper — multiple API layers |

The PyTorch Ecosystem

PyTorch isn't just a tensor library — it's an entire ecosystem of specialized tools:

PyTorch Ecosystem Overview

flowchart TD
    A["PyTorch Core<br>Tensors + Autograd"] --> B["torchvision<br>Image Models & Datasets"]
    A --> C["torchaudio<br>Audio Processing"]
    A --> D["torchtext<br>NLP Utilities"]
    A --> E["torch.distributed<br>Multi-GPU Training"]
    A --> F["TorchServe<br>Model Serving"]
    A --> G["PyTorch Lightning<br>High-Level Training"]
    A --> H["Hugging Face<br>Transformers & LLMs"]
    A --> I["ONNX<br>Model Export"]
    style A fill:#132440,stroke:#3B9797,color:#ffffff
    style B fill:#16476A,stroke:#3B9797,color:#ffffff
    style C fill:#16476A,stroke:#3B9797,color:#ffffff
    style D fill:#16476A,stroke:#3B9797,color:#ffffff
    style E fill:#16476A,stroke:#3B9797,color:#ffffff
    style F fill:#3B9797,stroke:#132440,color:#ffffff
    style G fill:#3B9797,stroke:#132440,color:#ffffff
    style H fill:#3B9797,stroke:#132440,color:#ffffff
    style I fill:#3B9797,stroke:#132440,color:#ffffff

Installation & Setup

PyTorch installation depends on your hardware. The official installer at pytorch.org generates the exact command for your system.

CPU-Only Install

If you don't have an NVIDIA GPU, install the CPU-only version:

# CPU-only install (works on any machine)
pip install torch torchvision torchaudio

GPU Install (CUDA)

For NVIDIA GPU acceleration, specify your CUDA version:

# GPU install with CUDA 12.1
# Check your CUDA version with: nvcc --version or nvidia-smi
# See https://pytorch.org/get-started/locally for the exact command
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# GPU install with CUDA 11.8
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Verifying Your Installation

Once installed, run this quick check to confirm everything works. It will report your PyTorch version, whether a GPU is available, and create a tiny test tensor:

import torch

# Check PyTorch version
print("PyTorch version:", torch.__version__)

# Check if CUDA (GPU) is available
print("CUDA available:", torch.cuda.is_available())

# If CUDA is available, print GPU info
if torch.cuda.is_available():
    print("GPU device:", torch.cuda.get_device_name(0))
    print("CUDA version:", torch.version.cuda)
    print("Number of GPUs:", torch.cuda.device_count())
else:
    print("Running on CPU — all code in this tutorial works on CPU too!")

# Quick tensor test
x = torch.tensor([1.0, 2.0, 3.0])
print("Test tensor:", x)
print("Tensor device:", x.device)
Don't have a GPU? No problem. Every code example in this series works on CPU. GPU acceleration makes training faster, but is never required for learning. Google Colab also provides free GPU access if you want to experiment.

Tensor Fundamentals

Tensors are the fundamental data structure in PyTorch — think of them as multi-dimensional arrays on steroids. If you've used NumPy's ndarray, tensors will feel immediately familiar, but with two superpowers: GPU acceleration and automatic differentiation.

| Dimensions | Mathematical Name | Example |
| --- | --- | --- |
| 0 | Scalar | A single number: 42 |
| 1 | Vector | A list of numbers: [1, 2, 3] |
| 2 | Matrix | A table of numbers: rows × columns |
| 3 | 3D Tensor | A cube: e.g., a color image (H × W × C) |
| 4 | 4D Tensor | A batch of images (N × C × H × W) |
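
As a quick illustration of the table above, here is a small sketch that creates one tensor of each rank and inspects its .ndim and .shape:

import torch

scalar = torch.tensor(42)              # 0D: a single number
vector = torch.tensor([1, 2, 3])       # 1D: a list of numbers
matrix = torch.ones(2, 3)              # 2D: rows × columns
image  = torch.zeros(224, 224, 3)      # 3D: a color image (H × W × C)
batch  = torch.zeros(32, 3, 224, 224)  # 4D: a batch of images (N × C × H × W)

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image", image), ("batch", batch)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")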

Creating Tensors

PyTorch offers many factory functions for creating tensors. Let's walk through the ones you'll use most often, starting with conversion from Python data:

import torch

# From a Python list
a = torch.tensor([1, 2, 3, 4])
print("From list:", a)            # tensor([1, 2, 3, 4])

# From a nested list (2D tensor / matrix)
b = torch.tensor([[1, 2, 3],
                   [4, 5, 6]])
print("Matrix shape:", b.shape)   # torch.Size([2, 3])

# Scalar (0-dimensional tensor)
s = torch.tensor(3.14)
print("Scalar:", s)               # tensor(3.1400)
print("Scalar value:", s.item())  # 3.14 — extracts Python number

Beyond converting Python data, PyTorch provides factory functions for creating tensors filled with specific values. These are the workhorses you will use to initialize weights, create masks, and set up placeholder arrays:

import torch

# Zeros and ones
zeros = torch.zeros(3, 4)          # 3×4 matrix of zeros
ones = torch.ones(2, 3, 5)         # 2×3×5 tensor of ones
print("Zeros shape:", zeros.shape)  # torch.Size([3, 4])
print("Ones shape:", ones.shape)    # torch.Size([2, 3, 5])

# Full — fill with any value
sevens = torch.full((2, 3), 7.0)
print("Full:\n", sevens)

# Identity matrix
eye = torch.eye(4)
print("Identity:\n", eye)

Random tensors are critical for neural network weight initialization. PyTorch supports several random distributions, each suited to different scenarios:

import torch

# Random tensors — essential for initializing neural network weights
uniform = torch.rand(3, 3)        # Uniform [0, 1)
normal = torch.randn(3, 3)        # Standard normal (mean=0, std=1)
randint = torch.randint(0, 10, (2, 4))  # Random integers in [0, 10)

print("Uniform:\n", uniform)
print("Normal:\n", normal)
print("Random ints:\n", randint)

# Random permutation — useful for shuffling data
perm = torch.randperm(8)
print("Permutation:", perm)  # e.g., tensor([5, 2, 0, 7, 3, 1, 6, 4])

Finally, PyTorch includes sequence generators (analogous to Python's range() and NumPy's linspace()) and "like" functions that clone the shape and dtype of an existing tensor:

import torch

# Sequences — like Python's range() and NumPy's linspace()
arange = torch.arange(0, 10, 2)      # Start, stop, step
linspace = torch.linspace(0, 1, 5)   # Start, end, num_points

print("Arange:", arange)       # tensor([0, 2, 4, 6, 8])
print("Linspace:", linspace)   # tensor([0.0000, 0.2500, 0.5000, 0.7500, 1.0000])

# Create a tensor with the same shape as another
template = torch.randn(3, 4)
zeros_like = torch.zeros_like(template)
ones_like = torch.ones_like(template)
rand_like = torch.rand_like(template)
print("zeros_like shape:", zeros_like.shape)  # torch.Size([3, 4])

Data Types

PyTorch tensors have explicit data types. Choosing the right dtype affects memory usage and computation speed:

import torch

# Default dtypes
int_tensor = torch.tensor([1, 2, 3])
float_tensor = torch.tensor([1.0, 2.0, 3.0])
print("Int dtype:", int_tensor.dtype)      # torch.int64
print("Float dtype:", float_tensor.dtype)  # torch.float32

# Explicit dtype specification
x = torch.tensor([1.0, 2.0], dtype=torch.float16)   # Half precision
y = torch.tensor([1.0, 2.0], dtype=torch.float32)   # Single precision (default)
z = torch.tensor([1.0, 2.0], dtype=torch.float64)   # Double precision
b = torch.tensor([True, False], dtype=torch.bool)    # Boolean

print("float16:", x.dtype, "— uses", x.element_size(), "bytes per element")
print("float32:", y.dtype, "— uses", y.element_size(), "bytes per element")
print("float64:", z.dtype, "— uses", z.element_size(), "bytes per element")

# Casting between types
converted = int_tensor.float()   # int64 → float32
print("Converted:", converted.dtype)

# Also works with .to()
as_half = float_tensor.to(torch.float16)
print("Half precision:", as_half)
Important: Neural networks typically use torch.float32 for training. Using float16 (half precision) saves memory and can be faster on modern GPUs, but requires careful handling of numerical stability. We'll cover mixed-precision training in Part 3.
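
To make the stability caveat concrete, here is a minimal sketch: float16 can only represent finite values up to roughly 65,504, so a number that is unremarkable in float32 overflows to infinity in half precision.

import torch

x32 = torch.tensor(70000.0, dtype=torch.float32)
x16 = x32.to(torch.float16)   # float16's largest finite value is ~65504

print(x32)   # tensor(70000.)
print(x16)   # tensor(inf, dtype=torch.float16): overflow!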

NumPy Interoperability

PyTorch and NumPy can share memory, allowing zero-copy conversions. But this comes with a critical caveat:

import torch
import numpy as np

# NumPy → PyTorch (shares memory by default!)
np_array = np.array([1.0, 2.0, 3.0])
tensor_shared = torch.from_numpy(np_array)

print("Before modification:")
print("  NumPy:", np_array)
print("  Tensor:", tensor_shared)

# Modifying one changes the other! (shared memory)
np_array[0] = 999.0
print("\nAfter modifying NumPy array:")
print("  NumPy:", np_array)           # [999.  2.  3.]
print("  Tensor:", tensor_shared)     # tensor([999.,  2.,  3.]) — also changed!

# Safe conversion — use .clone() to break shared memory
np_array2 = np.array([10.0, 20.0, 30.0])
tensor_safe = torch.from_numpy(np_array2).clone()
np_array2[0] = -1.0
print("\nWith .clone() — independent copy:")
print("  NumPy:", np_array2)          # [-1. 20. 30.]
print("  Tensor:", tensor_safe)       # tensor([10., 20., 30.]) — unchanged!

Converting in the other direction — from a PyTorch tensor back to a NumPy array — uses the .numpy() method. The same shared-memory behavior applies (for CPU tensors), and GPU tensors must be moved to CPU first:

import torch
import numpy as np

# PyTorch → NumPy
tensor = torch.tensor([4.0, 5.0, 6.0])
np_from_tensor = tensor.numpy()  # Shares memory (CPU tensors only)
print("Tensor → NumPy:", np_from_tensor)
print("Type:", type(np_from_tensor))  # numpy.ndarray

# GPU tensors must be moved to CPU first
# tensor_gpu.cpu().numpy()  # This is the pattern for GPU tensors

Tensor Attributes

Every tensor carries metadata that tells you its shape, data type, which device it lives on, and whether gradients are being tracked. Inspecting these properties is how you debug shape mismatches — the single most common category of PyTorch errors:

import torch

t = torch.randn(3, 4, 5)

# Shape and dimensions
print("Shape:", t.shape)            # torch.Size([3, 4, 5])
print("Size:", t.size())            # torch.Size([3, 4, 5]) — same as .shape
print("Dimensions:", t.ndim)        # 3
print("Total elements:", t.numel()) # 60 (3 × 4 × 5)

# Data type and device
print("Dtype:", t.dtype)              # torch.float32
print("Device:", t.device)            # cpu
print("Requires grad:", t.requires_grad)  # False

# Memory layout
print("Is contiguous:", t.is_contiguous())  # True
print("Stride:", t.stride())               # (20, 5, 1)
print("Element size:", t.element_size(), "bytes")  # 4 bytes (float32)

Tensor Operations

PyTorch supports hundreds of operations on tensors. Let's cover the most important ones you'll use every day.

Element-wise Operations

Element-wise operations apply the same function to every element independently. They are the building blocks of neural network computations — activation functions, scaling, and normalization are all element-wise under the hood:

import torch

a = torch.tensor([1.0, 2.0, 3.0, 4.0])
b = torch.tensor([10.0, 20.0, 30.0, 40.0])

# Arithmetic — operates element by element
print("Add:", a + b)           # tensor([11., 22., 33., 44.])
print("Subtract:", a - b)     # tensor([-9., -18., -27., -36.])
print("Multiply:", a * b)     # tensor([10., 40., 90., 160.])  ← Hadamard product
print("Divide:", a / b)       # tensor([0.1000, 0.1000, 0.1000, 0.1000])
print("Power:", a ** 2)        # tensor([ 1.,  4.,  9., 16.])

# Mathematical functions
print("Sqrt:", torch.sqrt(a))     # tensor([1.0000, 1.4142, 1.7321, 2.0000])
print("Exp:", torch.exp(a))       # tensor([ 2.7183,  7.3891, 20.0855, 54.5981])
print("Log:", torch.log(a))       # tensor([0.0000, 0.6931, 1.0986, 1.3863])
print("Sin:", torch.sin(a))       # tensor([0.8415, 0.9093, 0.1411, -0.7568])
print("Abs:", torch.abs(torch.tensor([-3.0, -1.0, 2.0])))  # tensor([3., 1., 2.])

In-Place Operations: The Underscore Convention

PyTorch uses a trailing underscore _ convention for in-place operations. These modify the tensor directly instead of creating a new one, saving memory — but they can break autograd computation graphs, so use them carefully.

import torch

x = torch.tensor([1.0, 2.0, 3.0])
print("Original:", x)

# In-place addition (modifies x directly)
x.add_(10)
print("After add_(10):", x)    # tensor([11., 12., 13.])

# In-place multiplication
x.mul_(2)
print("After mul_(2):", x)     # tensor([22., 24., 26.])

# In-place clamp
x.clamp_(min=23, max=25)
print("After clamp_:", x)      # tensor([23., 24., 25.])

# Comparison: out-of-place creates a NEW tensor
y = torch.tensor([1.0, 2.0, 3.0])
z = y.add(10)    # y is unchanged, z is new
print("y unchanged:", y)   # tensor([1., 2., 3.])
print("z is new:", z)      # tensor([11., 12., 13.])

Matrix Multiplication

Matrix multiplication is the backbone of neural networks. PyTorch provides multiple ways to perform it:

import torch

A = torch.tensor([[1.0, 2.0],
                   [3.0, 4.0]])

B = torch.tensor([[5.0, 6.0],
                   [7.0, 8.0]])

# All three are equivalent for 2D matrix multiplication
result1 = A @ B                    # Python operator (preferred)
result2 = torch.matmul(A, B)      # Function form
result3 = torch.mm(A, B)          # Explicit 2D matrix multiply

print("A @ B:\n", result1)
# tensor([[19., 22.],
#         [43., 50.]])

# Verify all methods give the same result
print("All equal:", torch.equal(result1, result2) and torch.equal(result2, result3))

# Matrix-vector multiplication
v = torch.tensor([1.0, 2.0])
print("A @ v:", A @ v)     # tensor([ 5., 11.])

# Dot product (1D vectors only)
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
print("Dot product:", torch.dot(x, y))  # tensor(32.) — 1*4 + 2*5 + 3*6

In real models, you rarely multiply single matrices. Batch matrix multiplication applies the same operation across many matrices in parallel — this is how attention heads work in Transformers:

import torch

# Batch matrix multiplication — essential for transformer models
# Shape: (batch_size, rows, cols)
batch_A = torch.randn(8, 3, 4)   # 8 matrices, each 3×4
batch_B = torch.randn(8, 4, 5)   # 8 matrices, each 4×5

batch_result = torch.bmm(batch_A, batch_B)
print("Batch matmul shape:", batch_result.shape)  # torch.Size([8, 3, 5])

# torch.matmul handles batches automatically too
batch_result2 = torch.matmul(batch_A, batch_B)
print("Same result:", torch.allclose(batch_result, batch_result2))  # True

Concatenation & Stacking

You will frequently need to combine tensors. torch.cat() joins tensors along an existing dimension, while torch.stack() creates a new dimension. The difference matters when you're assembling batches, merging feature maps, or averaging predictions from an ensemble:

import torch

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])

# cat — joins along an EXISTING dimension
cat_rows = torch.cat([a, b], dim=0)     # Stack vertically
cat_cols = torch.cat([a, b], dim=1)     # Stack horizontally
print("Cat dim=0 (rows):\n", cat_rows)   # Shape: [4, 2]
print("Cat dim=1 (cols):\n", cat_cols)   # Shape: [2, 4]

# stack — joins along a NEW dimension
stacked = torch.stack([a, b], dim=0)
print("Stack dim=0:\n", stacked)          # Shape: [2, 2, 2]
print("Stack shape:", stacked.shape)

# Practical example: stacking predictions from multiple models
pred1 = torch.tensor([0.9, 0.1])
pred2 = torch.tensor([0.8, 0.2])
pred3 = torch.tensor([0.7, 0.3])
ensemble = torch.stack([pred1, pred2, pred3])
avg_pred = ensemble.mean(dim=0)
print("Ensemble average:", avg_pred)  # Average across 3 models

Broadcasting

Broadcasting automatically expands tensor dimensions for element-wise operations between tensors of different shapes, following NumPy's broadcasting rules:

import torch

# Scalar broadcast
a = torch.tensor([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
print("a + 10:\n", a + 10)   # Scalar 10 broadcasts to every element

# Vector broadcast (add bias to each row)
bias = torch.tensor([100.0, 200.0, 300.0])  # Shape: [3]
print("a + bias:\n", a + bias)
# tensor([[101., 202., 303.],
#         [104., 205., 306.]])

# Column vector broadcast (add per-row scaling)
scale = torch.tensor([[10.0],
                       [20.0]])  # Shape: [2, 1]
print("a * scale:\n", a * scale)
# tensor([[ 10.,  20.,  30.],
#         [ 80., 100., 120.]])

# Broadcasting rules:
# 1. Align dimensions from the right
# 2. Dimensions must be equal, or one must be 1
# 3. Missing dimensions are treated as 1
print("\nBroadcasting shapes:")
print("  [2, 3] + [3]    → [2, 3]")     # bias added to each row
print("  [2, 3] * [2, 1] → [2, 3]")     # scale applied per-row
print("  [4, 1] + [1, 5] → [4, 5]")     # outer product-like

Reshaping & Indexing

View vs Reshape

Both view() and reshape() change the shape of a tensor without changing its data. The key difference is about contiguous memory:

import torch

x = torch.arange(12)
print("Original:", x)           # tensor([ 0,  1,  2, ..., 11])
print("Shape:", x.shape)        # torch.Size([12])

# view — requires contiguous memory, returns a VIEW (shared data)
v = x.view(3, 4)
print("View (3×4):\n", v)

# reshape — works even on non-contiguous tensors (may copy data)
r = x.reshape(4, 3)
print("Reshape (4×3):\n", r)

# Use -1 to let PyTorch infer one dimension
auto = x.view(2, -1)   # -1 becomes 6 (12 / 2)
print("Auto shape:", auto.shape)  # torch.Size([2, 6])

# view shares memory — modifying one changes the other
v[0, 0] = 999
print("After modifying view, original:", x[0])  # 999

# flatten — shorthand for reshaping to 1D
matrix = torch.randn(3, 4)
flat = matrix.flatten()
print("Flattened:", flat.shape)  # torch.Size([12])

Transpose & Permute

.T and .transpose() swap two dimensions, while .permute() lets you reorder all dimensions at once. Permute is especially important for image data: most image libraries store pixels as (Height, Width, Channels), but PyTorch convolutions expect (Channels, Height, Width):

import torch

# Transpose — swap two dimensions
m = torch.tensor([[1, 2, 3],
                   [4, 5, 6]])
print("Original shape:", m.shape)   # [2, 3]
print("Transposed:\n", m.T)        # [3, 2]
print("Transpose:\n", m.transpose(0, 1))  # Same as .T for 2D

# Permute — reorder ALL dimensions (essential for image data)
# Image: (batch, height, width, channels) → (batch, channels, height, width)
img = torch.randn(8, 224, 224, 3)    # 8 images, 224×224, RGB
print("Original (NHWC):", img.shape)  # [8, 224, 224, 3]

img_pytorch = img.permute(0, 3, 1, 2)  # Move channels to dim 1
print("PyTorch (NCHW):", img_pytorch.shape)  # [8, 3, 224, 224]

Squeeze & Unsqueeze

unsqueeze() inserts a size-1 dimension at a specified position, while squeeze() removes size-1 dimensions. These are used constantly to add or remove batch dimensions when switching between single-sample inference and batch processing:

import torch

# unsqueeze — add a dimension of size 1
x = torch.tensor([1, 2, 3])
print("Original:", x.shape)         # [3]
print("Unsqueeze(0):", x.unsqueeze(0).shape)  # [1, 3] — row vector
print("Unsqueeze(1):", x.unsqueeze(1).shape)  # [3, 1] — column vector

# squeeze — remove dimensions of size 1
y = torch.randn(1, 3, 1, 4)
print("Before squeeze:", y.shape)             # [1, 3, 1, 4]
print("Squeeze all:", y.squeeze().shape)      # [3, 4]
print("Squeeze dim 0:", y.squeeze(0).shape)   # [3, 1, 4]

# Common pattern: add batch dimension for single inference
single_image = torch.randn(3, 224, 224)         # C × H × W
batched = single_image.unsqueeze(0)              # 1 × C × H × W
print("Single image:", single_image.shape)       # [3, 224, 224]
print("Batched:", batched.shape)                 # [1, 3, 224, 224]

Advanced Indexing

PyTorch supports the same slicing syntax as NumPy, plus boolean masking and fancy indexing with integer lists. Masking is how you filter data by condition (e.g., keeping only predictions above a threshold), and fancy indexing is how DataLoaders select random samples from a dataset:

import torch

t = torch.tensor([[10, 20, 30],
                   [40, 50, 60],
                   [70, 80, 90]])

# Basic slicing (same as NumPy)
print("Row 0:", t[0])           # tensor([10, 20, 30])
print("Col 1:", t[:, 1])       # tensor([20, 50, 80])
print("Block:", t[0:2, 1:3])   # tensor([[20, 30], [50, 60]])

# Boolean (mask) indexing — filter elements
scores = torch.tensor([85.0, 92.0, 67.0, 78.0, 95.0])
passed = scores > 80
print("Mask:", passed)                      # tensor([True, True, False, False, True])
print("Passing scores:", scores[passed])    # tensor([85., 92., 95.])

# Fancy indexing with lists
indices = [0, 2, 4]
print("Selected:", scores[indices])  # tensor([85., 67., 95.])

# where — conditional selection
result = torch.where(scores > 80, scores, torch.tensor(0.0))
print("Where result:", result)  # tensor([85., 92.,  0.,  0., 95.])

Autograd: Automatic Differentiation

Autograd is PyTorch's automatic differentiation engine — the magic that makes training neural networks possible. Instead of manually computing derivatives (gradients), you define the forward computation and PyTorch automatically computes all gradients for you via reverse-mode differentiation (backpropagation).

Why this matters: A modern neural network like GPT-3 has 175 billion parameters. Computing gradients for all of them by hand is impossible. Autograd does it automatically, efficiently, and correctly — every single time.

Here's how it works conceptually:

Autograd Computational Graph

flowchart LR
    X["x<br>(input)"] --> MUL["×"]
    W["w<br>(weight)<br>requires_grad=True"] --> MUL
    MUL --> ADD["+"]
    B["b<br>(bias)<br>requires_grad=True"] --> ADD
    ADD --> Y["y = wx + b<br>(output)"]
    Y --> LOSS["loss = (y - target)²"]
    LOSS -->|".backward()"| GW["∂loss/∂w"]
    LOSS -->|".backward()"| GB["∂loss/∂b"]
    style X fill:#3B9797,stroke:#132440,color:#fff
    style W fill:#BF092F,stroke:#132440,color:#fff
    style B fill:#BF092F,stroke:#132440,color:#fff
    style MUL fill:#132440,stroke:#3B9797,color:#fff
    style ADD fill:#132440,stroke:#3B9797,color:#fff
    style Y fill:#16476A,stroke:#3B9797,color:#fff
    style LOSS fill:#16476A,stroke:#3B9797,color:#fff
    style GW fill:#3B9797,stroke:#132440,color:#fff
    style GB fill:#3B9797,stroke:#132440,color:#fff

When you call .backward(), PyTorch walks this graph in reverse, computing gradients using the chain rule at each node.

Backward Pass & Gradients

The .backward() method triggers backpropagation — it walks through the computation graph in reverse and fills in the .grad attribute of every leaf tensor that has requires_grad=True. Let’s see this with a simple polynomial and then a multi-variable linear model:

import torch

# Simple example: y = x² + 3x + 1, find dy/dx at x=2
x = torch.tensor(2.0, requires_grad=True)

# Forward pass — builds the computation graph
y = x**2 + 3*x + 1
print("y =", y.item())  # y = 2² + 3(2) + 1 = 11

# Backward pass — computes gradients
y.backward()

# dy/dx = 2x + 3, at x=2 → dy/dx = 7
print("dy/dx =", x.grad.item())  # 7.0

Autograd shines with multi-variable functions too. In the next example, we compute partial derivatives of a loss with respect to multiple parameters — exactly what happens inside every neural network during training:

import torch

# Multi-variable example: linear regression forward pass
# y = w1*x1 + w2*x2 + b
w1 = torch.tensor(3.0, requires_grad=True)
w2 = torch.tensor(-2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

x1, x2 = torch.tensor(4.0), torch.tensor(5.0)
target = torch.tensor(10.0)

# Forward pass
y = w1 * x1 + w2 * x2 + b    # 3*4 + (-2)*5 + 1 = 3
loss = (y - target) ** 2       # (3 - 10)² = 49

print(f"Prediction: {y.item():.1f}")
print(f"Loss: {loss.item():.1f}")

# Backward pass
loss.backward()

# Gradients: ∂loss/∂w = 2(y - target) * ∂y/∂w
print(f"∂loss/∂w1 = {w1.grad.item():.1f}")  # 2*(3-10)*4 = -56
print(f"∂loss/∂w2 = {w2.grad.item():.1f}")  # 2*(3-10)*5 = -70
print(f"∂loss/∂b  = {b.grad.item():.1f}")   # 2*(3-10)*1 = -14

Gradient Control

During inference (prediction time) you don't need gradients — disabling them saves memory and speeds up computation. PyTorch provides torch.no_grad() as a context manager and .detach() for individual tensors:

import torch

# torch.no_grad() — disable gradient tracking (for inference)
w = torch.tensor(5.0, requires_grad=True)

# During training: gradients tracked
y = w * 3
print("Grad fn:", y.grad_fn)  # <MulBackward0>

# During inference: no gradients needed (saves memory & computation)
with torch.no_grad():
    y_infer = w * 3
    print("Grad fn:", y_infer.grad_fn)          # None
    print("Requires grad:", y_infer.requires_grad)  # False

The .detach() method removes a single tensor from the computation graph, and there is a critical gotcha about gradient accumulation — PyTorch adds new gradients to old ones by default instead of replacing them. You must manually zero gradients between training iterations:

import torch

# detach() — remove a tensor from the computation graph
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

# Detach y from the graph — useful for using model outputs as inputs elsewhere
y_detached = y.detach()
print("y requires_grad:", y.requires_grad)            # True
print("y_detached requires_grad:", y_detached.requires_grad)  # False

# CRITICAL: Gradients accumulate! You must zero them between iterations
w = torch.tensor(2.0, requires_grad=True)

for i in range(3):
    loss = (w * 3) ** 2
    loss.backward()
    print(f"Iteration {i}: w.grad = {w.grad.item()}")
    # Without zeroing: 36, 72, 108 (accumulates!)

# Fix: zero gradients before each backward pass
w = torch.tensor(2.0, requires_grad=True)
for i in range(3):
    if w.grad is not None:
        w.grad.zero_()   # Zero the gradient!
    loss = (w * 3) ** 2
    loss.backward()
    print(f"Iteration {i} (zeroed): w.grad = {w.grad.item()}")
    # Correct: 36, 36, 36
Common Bug Alert: Forgetting to zero gradients is one of the most common PyTorch mistakes. In a training loop, always call optimizer.zero_grad() before loss.backward(). We'll formalize this in Part 3.

Higher-Order Gradients

Sometimes you need the gradient of a gradient — the second derivative (Hessian). Pass create_graph=True to torch.autograd.grad() so the first gradient itself becomes a differentiable computation:

import torch

# Computing second derivatives (Hessian elements)
x = torch.tensor(3.0, requires_grad=True)

# f(x) = x³ → f'(x) = 3x² → f''(x) = 6x
y = x ** 3

# First derivative: create_graph=True keeps the graph for further differentiation
grad1 = torch.autograd.grad(y, x, create_graph=True)[0]
print(f"f(3) = {y.item()}")       # 27
print(f"f'(3) = {grad1.item()}")  # 27 (3 * 3²)

# Second derivative
grad2 = torch.autograd.grad(grad1, x)[0]
print(f"f''(3) = {grad2.item()}")  # 18 (6 * 3)

GPU Acceleration

GPUs can perform thousands of parallel computations simultaneously, making them ideal for the matrix operations at the heart of deep learning. A single modern GPU can speed up training by 10-100× compared to a CPU.

CUDA Basics

CUDA is NVIDIA's parallel computing platform. PyTorch uses it transparently — you just tell tensors which device to live on. Let's start by checking what hardware is available:

import torch

# Check GPU availability
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())

# Set default device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create tensor directly on GPU
if torch.cuda.is_available():
    gpu_tensor = torch.randn(3, 3, device="cuda")
    print("GPU tensor device:", gpu_tensor.device)
else:
    cpu_tensor = torch.randn(3, 3)
    print("CPU tensor device:", cpu_tensor.device)

To move existing tensors between CPU and GPU, use .to(device). One critical rule: all tensors in an operation must be on the same device. If you try to multiply a CPU tensor with a GPU tensor, PyTorch will raise a RuntimeError:

import torch

# Moving tensors between devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create on CPU, then move to GPU
x = torch.randn(1000, 1000)
print("Before .to():", x.device)  # cpu

x_device = x.to(device)
print("After .to():", x_device.device)  # cuda:0 (if GPU available)

# Operations between tensors MUST be on the same device
y = torch.randn(1000, 1000, device=device)

# This works — both on same device
result = x_device @ y
print("Result device:", result.device)

# Moving back to CPU (e.g., for NumPy conversion or plotting)
result_cpu = result.cpu()
print("Back on CPU:", result_cpu.device)

Device-Agnostic Code

The best practice is to write code that works on both CPU and GPU without changes:

import torch

# Device-agnostic pattern — use this in all your projects
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

# Create data on the right device
X = torch.randn(64, 10, device=device)   # Features
W = torch.randn(10, 5, device=device)    # Weights
b = torch.zeros(5, device=device)        # Bias

# Forward pass — works on CPU or GPU
output = X @ W + b
print("Output shape:", output.shape)   # [64, 5]
print("Output device:", output.device)

# When saving results or converting to NumPy
result_np = output.detach().cpu().numpy()
print("NumPy result shape:", result_np.shape)

GPU Memory Management

GPU memory is limited (typically 8-24 GB on consumer cards). When training large models, you need to monitor and manage it. PyTorch caches memory for reuse, so memory_reserved may be larger than memory_allocated. Use torch.cuda.empty_cache() to release unused cached memory back to the OS:

import torch

if torch.cuda.is_available():
    # Check GPU memory
    print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_mem / 1e9:.1f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e6:.1f} MB")

    # Allocate a large tensor
    big = torch.randn(10000, 10000, device="cuda")
    print(f"\nAfter allocation: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")

    # Free memory
    del big
    torch.cuda.empty_cache()
    print(f"After cleanup: {torch.cuda.memory_allocated(0) / 1e6:.1f} MB")
else:
    print("GPU memory management is only relevant when CUDA is available.")
    print("All code in this tutorial works perfectly on CPU.")

CPU vs GPU Benchmark

Matrix Multiplication: CPU vs GPU Speed Comparison

This benchmark demonstrates the massive speedup GPUs provide for large matrix operations. On a typical NVIDIA GPU, you'll see 10-50× faster computation for large matrices.

import torch
import time

def benchmark_matmul(size, device, num_runs=10):
    """Benchmark matrix multiplication on a given device."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)

    # Warm-up run
    _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()

    # Timed runs
    start = time.time()
    for _ in range(num_runs):
        _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize()
    elapsed = (time.time() - start) / num_runs
    return elapsed

# Run benchmark
sizes = [256, 1024, 4096]
cpu_device = torch.device("cpu")

print(f"{'Size':>6} | {'CPU (ms)':>10} | {'GPU (ms)':>10} | {'Speedup':>8}")
print("-" * 50)

for size in sizes:
    cpu_time = benchmark_matmul(size, cpu_device) * 1000

    if torch.cuda.is_available():
        gpu_device = torch.device("cuda")
        gpu_time = benchmark_matmul(size, gpu_device) * 1000
        speedup = cpu_time / gpu_time
        print(f"{size:>6} | {cpu_time:>10.2f} | {gpu_time:>10.2f} | {speedup:>7.1f}×")
    else:
        print(f"{size:>6} | {cpu_time:>10.2f} | {'N/A':>10} | {'N/A':>8}")

Exercises

Test your understanding with these challenges. Try to solve them before looking at the solutions:

Exercise 1: Create a 5×5 tensor where each element is the product of its 1-based row and column indices (a multiplication table). Hint: use torch.arange and broadcasting.

import torch

# Solution: Multiplication table using broadcasting
rows = torch.arange(1, 6).unsqueeze(1)   # Column vector [5, 1]
cols = torch.arange(1, 6).unsqueeze(0)   # Row vector [1, 5]
table = rows * cols                       # Broadcasting → [5, 5]
print("Multiplication table:\n", table)

Exercise 2: Compute the gradient of $f(x) = \sin(x^2)$ at $x = \pi$ using autograd. Verify the answer manually: $f'(x) = 2x\cos(x^2)$.

import torch
import math

# Solution: Gradient of sin(x²) at x = π
x = torch.tensor(math.pi, requires_grad=True)

y = torch.sin(x ** 2)
y.backward()

print(f"f(π)  = sin(π²) = {y.item():.6f}")
print(f"f'(π) = 2π·cos(π²) = {x.grad.item():.6f}")

# Manual verification
manual = 2 * math.pi * math.cos(math.pi ** 2)
print(f"Manual calculation: {manual:.6f}")
print(f"Match: {abs(x.grad.item() - manual) < 1e-5}")

Exercise 3: Create two random 3×3 matrices. Compute their product, transpose the result, flatten it, and find the index of the maximum element.

import torch

# Solution: Chaining tensor operations
torch.manual_seed(42)  # For reproducibility
A = torch.randn(3, 3)
B = torch.randn(3, 3)

result = (A @ B).T.flatten()
max_idx = result.argmax()

print("A:\n", A)
print("B:\n", B)
print("(A @ B)^T flattened:", result)
print(f"Max value: {result[max_idx]:.4f} at index {max_idx.item()}")

Conclusion & Next Steps

You've built a solid foundation in PyTorch's three pillars:

  • Tensors — the multi-dimensional arrays that hold all data and model parameters
  • Autograd — the automatic differentiation engine that computes gradients for backpropagation
  • GPU acceleration — the device management system for high-performance computation

These three concepts underpin everything in PyTorch. Every neural network, every training loop, every deployment — they all start here.

Key takeaways: Always write device-agnostic code. Remember to zero gradients in training loops. Use .clone() when converting from NumPy to avoid shared memory bugs. Prefer @ for matrix multiplication. Use torch.no_grad() during inference.

Next in the Series

In Part 2: Building Neural Networks, we'll use these tensor and autograd skills to construct neural networks with nn.Module, define layers, activation functions, and loss functions — and see how PyTorch's object-oriented design makes building complex architectures intuitive.