
Tensor Calculus & Automatic Differentiation

April 30, 2026 · Wasil Zafar · 22 min read

Tensors are the data structure of deep learning, and automatic differentiation is the engine that trains the models built from them. This extension connects matrix calculus from the core series to the computational graphs used by PyTorch, JAX, and TensorFlow.

Table of Contents

  1. Why Tensors Matter
  2. Shapes & Broadcasting
  3. Computational Graphs
  4. Vector-Jacobian Products
  5. Mini Autodiff Engine
  6. Practice Exercises

Extension Track: This page builds on Part 7 (linear algebra), Part 8 (calculus), and Part 9 (loss functions). It is the bridge from hand-derived gradients to production autodiff libraries.

Why Tensors Matter

A tensor is a multidimensional array with a shape. Scalars have shape $()$, vectors have shape $(d)$, matrices have shape $(m,n)$, and model batches usually have higher-rank shapes such as $(B,T,d)$ for batch size, token length, and embedding dimension.

| Object | Typical Shape | AI Meaning |
| --- | --- | --- |
| Embedding batch | $(B,T,d)$ | Token vectors for a mini-batch |
| Attention logits | $(B,h,T,T)$ | Pairwise token scores per head |
| Image batch | $(B,C,H,W)$ | Channels, height, width |
| Gradient tensor | same as parameter | Direction that lowers loss |
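
To make the table concrete, here is a small NumPy sketch that builds tensors with these shapes; the sizes are made up for illustration.

import numpy as np

B, T, d = 4, 16, 32            # illustrative batch size, sequence length, embedding dim
X = np.zeros((B, T, d))        # embedding batch, shape (B, T, d)
W_Q = np.zeros((d, d))         # a square projection, shape (d, d)
Q = X @ W_Q                    # matmul acts on the last axis, keeping shape (B, T, d)
print("X:", X.shape, "Q:", Q.shape)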

Shapes & Broadcasting

Broadcasting lets arrays with compatible shapes interact without manually copying values. A bias vector $b \in \mathbb{R}^d$ can be added to every row of $X \in \mathbb{R}^{B \times d}$ because the bias is broadcast along the batch axis.

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3)
b = np.array([0.1, 0.2, 0.3])          # shape (3,)
Y = X + b                              # b broadcasts to shape (2, 3)
print("Y shape:", Y.shape)
print(Y)
Gradient rule: if a tensor was broadcast in the forward pass, its gradient must be reduced along the broadcast axes in the backward pass. That is why bias gradients often use sum(axis=0).
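
A minimal sketch of that rule, reusing the broadcast addition above with a made-up upstream gradient of ones:

import numpy as np

X = np.ones((2, 3))
b = np.array([0.1, 0.2, 0.3])
Y = X + b                      # forward: b is broadcast along the batch axis

dY = np.ones_like(Y)           # pretend upstream gradient of the loss w.r.t. Y
dX = dY                        # X was not broadcast, so its gradient keeps shape (2, 3)
db = dY.sum(axis=0)            # b was broadcast, so reduce over the batch axis -> shape (3,)
print("dX shape:", dX.shape, "db shape:", db.shape)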

Computational Graphs

A computational graph records how values are produced. If $z = (xw + b)^2$, the graph stores multiplication, addition, and square nodes. Autodiff applies the chain rule backward through this graph.

Reverse-Mode Autodiff Flow
flowchart LR
    X[x] --> M[Multiply]
    W[w] --> M
    M --> A[Add Bias]
    B[b] --> A
    A --> S[Square]
    S --> L[Loss]
    L -. gradients .-> S
    S -. gradients .-> A
    A -. gradients .-> M
    M -. gradients .-> X
    M -. gradients .-> W
        
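PyTorch records exactly this kind of graph when operations run on tensors that require gradients. A minimal sketch for the same $z = (xw + b)^2$ example, assuming PyTorch is installed:

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(-3.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

z = (x * w + b) ** 2           # forward pass builds the graph
z.backward()                   # reverse pass applies the chain rule through it

# dz/dx = 2(xw+b)w, dz/dw = 2(xw+b)x, dz/db = 2(xw+b)
print(x.grad, w.grad, b.grad)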

Vector-Jacobian Products

For a function $y=f(x)$ with Jacobian $J=\frac{\partial y}{\partial x}$, reverse-mode autodiff does not usually materialize $J$. It propagates a vector-Jacobian product:

$$\bar{x} = \bar{y}J$$

This is efficient when the output is scalar, as with a loss function. A model may have billions of parameters, but the loss is one number, so reverse mode computes all parameter gradients in one backward pass.
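
As a concrete sketch, take an elementwise $\tanh$ applied to a long vector: its full Jacobian would be a large diagonal matrix, but the VJP is just an elementwise product. The vector length here is made up for illustration.

import numpy as np

x = np.random.default_rng(0).normal(size=1_000)
y = np.tanh(x)                   # J = dy/dx would be a 1000 x 1000 diagonal matrix

y_bar = np.ones_like(y)          # upstream gradient from a scalar loss
x_bar = y_bar * (1.0 - y ** 2)   # VJP: x-bar = y-bar J, computed without ever building J
print(x_bar.shape)               # (1000,)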

Mini Autodiff Engine

This engine is intentionally tiny, but it shows the core idea: every operation stores a backward function that pushes gradients to its inputs.

import math

class Value:
    """A scalar that remembers how it was produced, for reverse-mode autodiff."""

    def __init__(self, data, children=(), backward=lambda: None):
        self.data = float(data)
        self.grad = 0.0
        self.children = list(children)
        self._backward = backward

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward
        return out

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = backward
        return out

# Build y = tanh(x * w + b) and backpropagate through it.
x = Value(2.0)
w = Value(-3.0)
b = Value(0.5)
y = (x * w + b).tanh()

# Seed the output gradient, then call each node's backward function in
# reverse order through the graph (tanh -> add -> multiply).
y.grad = 1.0
for node in [y, y.children[0], y.children[0].children[0]]:
    node._backward()

print("y:", y.data)
print("dy/dx:", x.grad, "dy/dw:", w.grad, "dy/db:", b.grad)

Practice Exercises

Exercise (Shapes): Trace a Transformer Tensor

Start with $X \in \mathbb{R}^{B \times T \times d}$. If $W_Q \in \mathbb{R}^{d \times d_k}$, what is the shape of $Q=XW_Q$? Defining $K=XW_K$ with $W_K \in \mathbb{R}^{d \times d_k}$, what is the shape of $QK^\top$ (transposed over the last two axes) after batching?