Why Tensors Matter
A tensor is a multidimensional array with a shape. Scalars have shape $()$, vectors have shape $(d)$, matrices have shape $(m,n)$, and model batches usually have higher-rank shapes such as $(B,T,d)$ for batch size, token length, and embedding dimension.
| Object | Typical Shape | AI Meaning |
|---|---|---|
| Embedding batch | $(B,T,d)$ | Token vectors for a mini-batch |
| Attention logits | $(B,h,T,T)$ | Pairwise token scores per head |
| Image batch | $(B,C,H,W)$ | Channels, height, width |
| Gradient tensor | same as parameter | Direction that lowers loss |
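These shapes are easy to poke at interactively. A minimal NumPy sketch, using made-up toy sizes for $B$, $T$, $d$, and the head count $h$:

```python
import numpy as np

B, T, d, h = 2, 4, 8, 2  # toy sizes: batch, tokens, embedding dim, heads

emb = np.zeros((B, T, d))        # embedding batch
logits = np.zeros((B, h, T, T))  # attention logits, one T x T grid per head
imgs = np.zeros((B, 3, 32, 32))  # image batch in (B, C, H, W) layout

print(emb.shape, logits.shape, imgs.shape)
# (2, 4, 8) (2, 2, 4, 4) (2, 3, 32, 32)
```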
Shapes & Broadcasting
Broadcasting lets arrays with compatible shapes interact without manually copying values. A bias vector $b \in \mathbb{R}^d$ can be added to every row of $X \in \mathbb{R}^{B \times d}$ because the bias is broadcast along the batch axis.
```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # shape (2, 3)
b = np.array([0.1, 0.2, 0.3])    # shape (3,)

Y = X + b  # b broadcasts to shape (2, 3)
print("Y shape:", Y.shape)
print(Y)
```
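Broadcasting compares shapes from the trailing axis backward: two dimensions are compatible when they are equal or when one of them is 1. A per-row factor therefore needs an explicit axis of size 1, as in this small sketch:

```python
import numpy as np

X = np.arange(6.0).reshape(2, 3)        # shape (2, 3)
row_scale = np.array([[10.0], [20.0]])  # shape (2, 1)

print(X * row_scale)  # (2, 1) stretches along columns to (2, 3)
# A plain np.array([10.0, 20.0]) would fail here: (2,) does not align with (2, 3).
```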
Computational Graphs
A computational graph records how values are produced. If $z = (xw + b)^2$, the graph stores multiplication, addition, and square nodes. Autodiff applies the chain rule backward through this graph.
```mermaid
flowchart LR
    X[x] --> M[Multiply]
    W[w] --> M
    M --> A[Add Bias]
    B[b] --> A
    A --> S[Square]
    S --> L[Loss]
    L -. gradients .-> S
    S -. gradients .-> A
    A -. gradients .-> M
    M -. gradients .-> X
    M -. gradients .-> W
```
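To make the dotted gradient arrows concrete, here is the chain rule for $z=(xw+b)^2$ applied by hand and checked against a finite difference; the numeric values are arbitrary:

```python
# Hand-applied chain rule for z = (x*w + b)**2.
x, w, b = 2.0, -3.0, 0.5

u = x * w + b  # Multiply, then Add Bias
z = u ** 2     # Square

dz_du = 2 * u        # backward through Square
dz_dx = dz_du * w    # backward through Multiply
dz_dw = dz_du * x
dz_db = dz_du * 1.0  # backward through Add Bias

# Finite-difference check on dz/dx.
eps = 1e-6
numeric = (((x + eps) * w + b) ** 2 - z) / eps
print(dz_dx, numeric)  # agree to ~6 digits
```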
Vector-Jacobian Products
For a function $y=f(x)$ with Jacobian $J=\frac{\partial y}{\partial x}$, reverse-mode autodiff does not usually materialize $J$. It propagates a vector-Jacobian product, where $\bar{y} = \partial L / \partial y$ is the incoming gradient treated as a row vector:

$$\bar{x} = \bar{y}\,J$$

This is efficient when the final output is a scalar, as with a loss function. A model may have billions of parameters, but the loss is one number, so reverse mode computes all parameter gradients in a single backward pass.
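The payoff is easiest to see on an elementwise function, where the full Jacobian is a large diagonal matrix but the VJP is a single elementwise multiply. A small NumPy sketch (sizes and values are placeholders):

```python
import numpy as np

n = 5
x = np.linspace(-2.0, 2.0, n)
y = np.tanh(x)

# Full Jacobian of elementwise tanh: an n x n diagonal matrix.
J = np.diag(1.0 - y**2)

y_bar = np.ones(n)  # incoming gradient dL/dy, e.g. from a sum() loss

x_bar_full = y_bar @ J            # builds and multiplies n^2 entries
x_bar_vjp = y_bar * (1.0 - y**2)  # what reverse mode does: O(n)

print(np.allclose(x_bar_full, x_bar_vjp))  # True
```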
Mini Autodiff Engine
This engine is intentionally small, but it shows the core idea: every operation stores a backward function that knows how to push gradients from its output to its inputs.
```python
import math

class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, children=(), backward=lambda: None):
        self.data = float(data)
        self.grad = 0.0
        self.children = list(children)
        self._backward = backward

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def backward():
            # d(ab)/da = b, d(ab)/db = a; accumulate via the chain rule.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = backward
        return out

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def backward():
            # Addition passes the gradient through unchanged.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad

        out._backward = backward
        return out

x = Value(2.0)
w = Value(-3.0)
b = Value(0.5)
y = (x * w + b).tanh()

y.grad = 1.0  # seed: dy/dy = 1
# Manual reverse pass for this specific graph: tanh -> add -> mul.
for node in [y, y.children[0], y.children[0].children[0]]:
    node._backward()

print("y:", y.data)
print("dy/dx:", x.grad, "dy/dw:", w.grad, "dy/db:", b.grad)
```
Practice Exercises
Trace a Transformer Tensor
Start with $X \in \mathbb{R}^{B \times T \times d}$. If $W_Q \in \mathbb{R}^{d \times d_k}$, what is the shape of $Q=XW_Q$? If $K$ is produced the same way from a key projection, what is the shape of $QK^\top$ after batching?
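If you want to check your answer empirically, NumPy's `@` operator broadcasts matrix multiplication over leading batch axes. A toy-sized sketch; `W_K` here is an assumed key projection built the same way as $W_Q$:

```python
import numpy as np

B, T, d, d_k = 2, 5, 8, 4  # toy sizes
X = np.zeros((B, T, d))
W_Q = np.zeros((d, d_k))
W_K = np.zeros((d, d_k))  # assumed: key projection, analogous to W_Q

Q = X @ W_Q                        # matmul broadcasts over the batch axis
K = X @ W_K
scores = Q @ K.transpose(0, 2, 1)  # batched Q K^T

print(Q.shape, scores.shape)
```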