
Transformer & LLM Math

April 30, 2026 · Wasil Zafar · 26 min read

Transformers are built from matrix multiplication, normalization, masking, and probability distributions. This page explains the math that turns token IDs into contextual language model predictions.

Table of Contents

  1. Tokens & Embeddings
  2. Scaled Dot-Product Attention
  3. Causal Masking
  4. Residuals & Layer Norm
  5. Decoding & Sampling
  6. KV Cache Complexity

Core formula: $$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}+M\right)V$$ where $M$ is often a causal mask.

Tokens & Embeddings

An LLM begins with token IDs. An embedding matrix $E \in \mathbb{R}^{|V| \times d}$ maps each token ID to a vector in $\mathbb{R}^d$. The model does not see words directly; it sees rows of this matrix.

import numpy as np

vocab_size, d_model = 8, 4
token_ids = np.array([2, 5, 1])
# Toy embedding matrix: one row per vocabulary entry, d_model columns
E = np.arange(vocab_size * d_model).reshape(vocab_size, d_model) / 10
# Embedding lookup is just row indexing into E
X = E[token_ids]
print("Embedding shape:", X.shape)
print(X)

Scaled Dot-Product Attention

Attention compares queries to keys using dot products. If $Q,K,V \in \mathbb{R}^{T \times d_k}$, then $QK^\top \in \mathbb{R}^{T \times T}$ contains every token-to-token compatibility score.

Single Attention Head
flowchart LR
    X["Token embeddings X"] --> Q["Q = X Wq"]
    X --> K["K = X Wk"]
    X --> V["V = X Wv"]
    Q --> S["QK transpose / sqrt(dk)"]
    K --> S
    S --> P["softmax scores"]
    V --> O["P V"]
        
import numpy as np

np.random.seed(7)
T, d_model, d_k = 4, 6, 3
X = np.random.randn(T, d_model)
# Learned projections into query, key, and value spaces
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv
# (T, T) compatibility scores, scaled by sqrt(d_k) to keep their variance stable
scores = (Q @ K.T) / np.sqrt(d_k)
# Row-wise softmax (subtracting the row max for numerical stability)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ V
print("scores:", scores.shape, "output:", out.shape)

Causal Masking

Autoregressive models must not attend to future tokens. A causal mask sets future logits to a very negative value before softmax, making their probability nearly zero.

import numpy as np

T = 5
logits = np.arange(T * T, dtype=float).reshape(T, T)
# True strictly above the diagonal = positions in the future
mask = np.triu(np.ones((T, T)), k=1).astype(bool)
masked_logits = logits.copy()
# A very negative logit drives the softmax probability toward zero
masked_logits[mask] = -1e9
print(masked_logits)
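
As a quick check of that claim, a row-wise softmax over masked logits (all-zero logits here for simplicity) spreads each row's probability uniformly over positions up to and including itself and gives essentially zero to future positions:

import numpy as np

T = 5
logits = np.zeros((T, T))
logits[np.triu(np.ones((T, T)), k=1).astype(bool)] = -1e9  # mask the future
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs = probs / probs.sum(axis=-1, keepdims=True)
print(probs.round(3))  # row t is uniform over positions 0..t, ~0 elsewhere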

Residuals & Layer Norm

Residual connections preserve a direct path for information and gradients: $x_{l+1}=x_l+F(x_l)$. Layer normalization stabilizes each token representation by normalizing across hidden dimensions.

$$\text{LayerNorm}(x)=\gamma\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$$
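
A minimal numpy sketch of one residual step with layer normalization (the sublayer $F$ is just a stand-in linear map, and gamma, beta, eps are scalar defaults here; real models learn per-dimension gamma and beta). It uses the pre-norm placement $x_{l+1}=x_l+F(\text{LayerNorm}(x_l))$ common in modern decoders; the original Transformer applied the norm after the residual.

import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each token vector across its hidden dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

np.random.seed(0)
T, d_model = 4, 6
x = np.random.randn(T, d_model)
W = np.random.randn(d_model, d_model) / np.sqrt(d_model)  # stand-in for a sublayer F
x_next = x + layer_norm(x) @ W  # pre-norm residual: x_{l+1} = x_l + F(LayerNorm(x_l))
print("per-token mean after LayerNorm:", layer_norm(x).mean(axis=-1).round(6))
print("residual output shape:", x_next.shape)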

Decoding & Sampling

The final hidden state is projected to vocabulary logits. Temperature rescales the logits before the softmax: $p_i=\text{softmax}(z/\tau)_i$. Lower $\tau$ sharpens the distribution toward the highest-scoring tokens; higher $\tau$ flattens it and makes sampling more exploratory.
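
A small sketch of temperature sampling over toy logits (the logit values and the sample_with_temperature helper are illustrative, not from any particular model):

import numpy as np

def sample_with_temperature(logits, tau=1.0, rng=None):
    # Divide logits by tau before the softmax: tau < 1 sharpens, tau > 1 flattens
    rng = np.random.default_rng(0) if rng is None else rng
    z = logits / tau
    p = np.exp(z - z.max())
    p = p / p.sum()
    return rng.choice(len(logits), p=p), p

logits = np.array([2.0, 1.0, 0.2, -1.0])
for tau in (0.5, 1.0, 2.0):
    token, p = sample_with_temperature(logits, tau)
    print(f"tau={tau}: probs={p.round(3)}, sampled token {token}")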

KV Cache Complexity

Without caching, every generated token recomputes keys and values for all previous tokens, so decoding $T$ tokens costs $O(T^2)$ projection work on top of the attention itself. A KV cache stores the previous $K,V$ tensors, so each step only computes the new token's query, key, and value, appends one key/value pair, and attends over the cached sequence. The trade-off is memory: the cache grows linearly with sequence length (and with the number of layers and heads).
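
A minimal single-head sketch of step-by-step decoding with a KV cache (the decode_step helper and the toy dimensions are illustrative; real implementations cache per layer and per head, in batches):

import numpy as np

np.random.seed(1)
d_model, d_k = 6, 3
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]

def decode_step(x_new, K_cache, V_cache):
    # x_new: hidden state of the newest token only, shape (d_model,)
    q = x_new @ Wq
    K_cache = np.vstack([K_cache, x_new @ Wk])  # append one new key
    V_cache = np.vstack([V_cache, x_new @ Wv])  # append one new value
    scores = (K_cache @ q) / np.sqrt(d_k)       # scores against every cached key
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V_cache, K_cache, V_cache        # attention output for the new token

K_cache = np.zeros((0, d_k))
V_cache = np.zeros((0, d_k))
for step in range(4):
    out, K_cache, V_cache = decode_step(np.random.randn(d_model), K_cache, V_cache)
print("cached keys:", K_cache.shape, "new-token output:", out.shape)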