Why Tensors Matter
A tensor is a multidimensional array with a shape. Scalars have shape $()$, vectors have shape $(d)$, matrices have shape $(m,n)$, and model batches usually have higher-rank shapes such as $(B,T,d)$ for batch size, token length, and embedding dimension.
| Object | Typical Shape | AI Meaning |
|---|---|---|
| Embedding batch | $(B,T,d)$ | Token vectors for a mini-batch |
| Attention logits | $(B,h,T,T)$ | Pairwise token scores per head |
| Image batch | $(B,C,H,W)$ | Channels, height, width |
| Gradient tensor | same as parameter | Direction that lowers loss |
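These shapes are easy to poke at interactively. A minimal NumPy sketch, using made-up toy sizes for $B$, $T$, $d$, and the head count $h$:

```python
import numpy as np

B, T, d, h = 2, 4, 8, 2  # toy sizes: batch, tokens, embedding dim, heads

emb = np.zeros((B, T, d))        # embedding batch
logits = np.zeros((B, h, T, T))  # attention logits, one T x T grid per head
imgs = np.zeros((B, 3, 32, 32))  # image batch in (B, C, H, W) layout

print(emb.shape, logits.shape, imgs.shape)
# (2, 4, 8) (2, 2, 4, 4) (2, 3, 32, 32)
```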
Shapes & Broadcasting
Broadcasting lets arrays with compatible shapes interact without manually copying values. A bias vector $b \in \mathbb{R}^d$ can be added to every row of $X \in \mathbb{R}^{B \times d}$ because the bias is broadcast along the batch axis.
```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # shape (2, 3)
b = np.array([0.1, 0.2, 0.3])    # shape (3,)

Y = X + b  # b broadcasts to shape (2, 3)
print("Y shape:", Y.shape)
print(Y)
```
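Broadcasting compares shapes from the trailing axis backward: two dimensions are compatible when they are equal or when one of them is 1. A per-row factor therefore needs an explicit axis of size 1, as in this small sketch:

```python
import numpy as np

X = np.arange(6.0).reshape(2, 3)        # shape (2, 3)
row_scale = np.array([[10.0], [20.0]])  # shape (2, 1)

print(X * row_scale)  # (2, 1) stretches along columns to (2, 3)
# A plain np.array([10.0, 20.0]) would fail here: (2,) does not align with (2, 3).
```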
Computational Graphs
A computational graph records how values are produced. If $z = (xw + b)^2$, the graph stores multiplication, addition, and square nodes. Autodiff applies the chain rule backward through this graph.
```mermaid
flowchart LR
    X[x] --> M[Multiply]
    W[w] --> M
    M --> A[Add Bias]
    B[b] --> A
    A --> S[Square]
    S --> L[Loss]
    L -. gradients .-> S
    S -. gradients .-> A
    A -. gradients .-> M
    M -. gradients .-> X
    M -. gradients .-> W
```
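To make the dotted gradient arrows concrete, here is the chain rule for $z=(xw+b)^2$ applied by hand and checked against a finite difference; the numeric values are arbitrary:

```python
# Hand-applied chain rule for z = (x*w + b)**2.
x, w, b = 2.0, -3.0, 0.5

u = x * w + b  # Multiply, then Add Bias
z = u ** 2     # Square

dz_du = 2 * u        # backward through Square
dz_dx = dz_du * w    # backward through Multiply
dz_dw = dz_du * x
dz_db = dz_du * 1.0  # backward through Add Bias

# Finite-difference check on dz/dx.
eps = 1e-6
numeric = (((x + eps) * w + b) ** 2 - z) / eps
print(dz_dx, numeric)  # agree to ~6 digits
```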
Vector-Jacobian Products
For a function $y=f(x)$ with Jacobian $J=\frac{\partial y}{\partial x}$, reverse-mode autodiff does not usually materialize $J$. It propagates a vector-Jacobian product, where $\bar{y} = \partial L / \partial y$ is the incoming gradient treated as a row vector:

$$\bar{x} = \bar{y}\,J$$

This is efficient when the final output is a scalar, as with a loss function. A model may have billions of parameters, but the loss is one number, so reverse mode computes all parameter gradients in a single backward pass.
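The payoff is easiest to see on an elementwise function, where the full Jacobian is a large diagonal matrix but the VJP is a single elementwise multiply. A small NumPy sketch (sizes and values are placeholders):

```python
import numpy as np

n = 5
x = np.linspace(-2.0, 2.0, n)
y = np.tanh(x)

# Full Jacobian of elementwise tanh: an n x n diagonal matrix.
J = np.diag(1.0 - y**2)

y_bar = np.ones(n)  # incoming gradient dL/dy, e.g. from a sum() loss

x_bar_full = y_bar @ J            # builds and multiplies n^2 entries
x_bar_vjp = y_bar * (1.0 - y**2)  # what reverse mode does: O(n)

print(np.allclose(x_bar_full, x_bar_vjp))  # True
```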
Mini Autodiff Engine
This engine is intentionally small, but it shows the core idea: every operation stores a backward function that knows how to push gradients from its output to its inputs.
```python
import math

class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, children=(), backward=lambda: None):
        self.data = float(data)
        self.grad = 0.0
        self.children = list(children)
        self._backward = backward

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def backward():
            # d(ab)/da = b, d(ab)/db = a; accumulate via the chain rule.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = backward
        return out

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def backward():
            # Addition passes the gradient through unchanged.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad

        out._backward = backward
        return out

x = Value(2.0)
w = Value(-3.0)
b = Value(0.5)
y = (x * w + b).tanh()

y.grad = 1.0  # seed: dy/dy = 1
# Manual reverse pass for this specific graph: tanh -> add -> mul.
for node in [y, y.children[0], y.children[0].children[0]]:
    node._backward()

print("y:", y.data)
print("dy/dx:", x.grad, "dy/dw:", w.grad, "dy/db:", b.grad)
```
Practice Exercises
Trace a Transformer Tensor
Start with $X \in \mathbb{R}^{B \times T \times d}$. If $W_Q \in \mathbb{R}^{d \times d_k}$, what is the shape of $Q=XW_Q$? If $K$ is produced the same way from a key projection, what is the shape of $QK^\top$ after batching?
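If you want to check your answer empirically, NumPy's `@` operator broadcasts matrix multiplication over leading batch axes. A toy-sized sketch; `W_K` here is an assumed key projection built the same way as $W_Q$:

```python
import numpy as np

B, T, d, d_k = 2, 5, 8, 4  # toy sizes
X = np.zeros((B, T, d))
W_Q = np.zeros((d, d_k))
W_K = np.zeros((d, d_k))  # assumed: key projection, analogous to W_Q

Q = X @ W_Q                        # matmul broadcasts over the batch axis
K = X @ W_K
scores = Q @ K.transpose(0, 2, 1)  # batched Q K^T

print(Q.shape, scores.shape)
```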