Tokens & Embeddings
An LLM begins with token IDs. An embedding matrix $E \in \mathbb{R}^{|V| \times d}$ maps each token ID to a vector in $\mathbb{R}^d$. The model does not see words directly; it sees rows of this matrix.
import numpy as np

vocab_size, d_model = 8, 4           # tiny vocabulary, 4-dimensional embeddings
token_ids = np.array([2, 5, 1])      # a sequence of three token IDs
E = np.arange(vocab_size * d_model).reshape(vocab_size, d_model) / 10  # toy embedding matrix
X = E[token_ids]                     # row lookup: one d_model vector per token
print("Embedding shape:", X.shape)   # (3, 4)
print(X)
Scaled Dot-Product Attention
Attention compares queries to keys using dot products. If $Q,K,V \in \mathbb{R}^{T \times d_k}$, then $QK^\top \in \mathbb{R}^{T \times T}$ contains every token-to-token compatibility score.
flowchart LR
X["Token embeddings X"] --> Q["Q = X Wq"]
X --> K["K = X Wk"]
X --> V["V = X Wv"]
Q --> S["QK transpose / sqrt(dk)"]
K --> S
S --> P["softmax scores"]
V --> O["P V"]
import numpy as np

np.random.seed(7)
T, d_model, d_k = 4, 6, 3
X = np.random.randn(T, d_model)                    # token embeddings
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # project to queries, keys, values
scores = (Q @ K.T) / np.sqrt(d_k)                  # (T, T) compatibility matrix
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
weights = weights / weights.sum(axis=-1, keepdims=True)        # each row sums to 1
out = weights @ V                                  # weighted mixture of value vectors
print("scores:", scores.shape, "output:", out.shape)
Causal Masking
Autoregressive models must not attend to future tokens. A causal mask sets future logits to a very negative value before softmax, making their probability nearly zero.
import numpy as np

T = 5
logits = np.arange(T * T, dtype=float).reshape(T, T)  # dummy attention logits
mask = np.triu(np.ones((T, T)), k=1).astype(bool)     # True strictly above the diagonal (future positions)
masked_logits = logits.copy()
masked_logits[mask] = -1e9                            # future logits become effectively -inf
print(masked_logits)
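Running the same softmax as in the attention example over the masked logits confirms the claim: every entry above the diagonal becomes (numerically) zero, so no token attends to its future.

weights = np.exp(masked_logits - masked_logits.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(3))   # upper triangle is 0: no attention to future tokens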
Residuals & Layer Norm
Residual connections preserve a direct path for information and gradients: $x_{l+1}=x_l+F(x_l)$. Layer normalization stabilizes each token representation by normalizing across hidden dimensions.
$$\text{LayerNorm}(x)=\gamma\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$$
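A minimal NumPy sketch of one residual-plus-LayerNorm step. The sublayer $F$ here is a stand-in (an arbitrary random linear map) purely for illustration; a real block would use attention or an MLP.

import numpy as np

np.random.seed(0)
T, d = 4, 6
x = np.random.randn(T, d)                      # token representations
W = np.random.randn(d, d) / np.sqrt(d)         # stand-in sublayer: F(x) = x @ W
gamma, beta, eps = np.ones(d), np.zeros(d), 1e-5

def layer_norm(h):
    mu = h.mean(axis=-1, keepdims=True)        # mean over hidden dims, per token
    var = h.var(axis=-1, keepdims=True)        # variance over hidden dims, per token
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

y = layer_norm(x + x @ W)                      # residual add, then normalize
print(y.mean(axis=-1).round(6))                # ~0 per token
print(y.std(axis=-1).round(3))                 # ~1 per token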
Decoding & Sampling
The final hidden state is projected to vocabulary logits. Temperature rescales the logits before the softmax: $p_i=\text{softmax}(z/\tau)_i = e^{z_i/\tau}/\sum_j e^{z_j/\tau}$. Lower $\tau$ sharpens the distribution; higher $\tau$ makes sampling more exploratory.
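A small sketch of temperature scaling on toy logits (the logit values are chosen arbitrarily):

import numpy as np

logits = np.array([2.0, 1.0, 0.2, -1.0])       # toy vocabulary logits

def softmax(z):
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

for tau in (0.5, 1.0, 2.0):
    p = softmax(logits / tau)                  # temperature-scaled distribution
    print(f"tau={tau}: {p.round(3)}")
next_id = np.random.choice(len(logits), p=p)   # sample one token ID from the last distribution
print("sampled id:", next_id)

At $\tau=0.5$ almost all mass sits on the argmax; at $\tau=2.0$ the distribution is visibly flatter.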
KV Cache Complexity
Without caching, every generated token recomputes keys and values for all previous tokens, so step $t$ redoes $O(t)$ projections and a full generation pays $O(T^2)$ projection work. A KV cache stores the previous $K,V$ tensors so each step computes only the new token's query, key, and value and appends one key/value pair; the per-step attention lookup against the cache still costs $O(t)$, and the trade-off is cache memory that grows linearly with sequence length per layer and head.
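A minimal single-head sketch of the caching loop. The shapes and weights are toy values, and the Wq/Wk/Wv naming is reused from the attention example above; inputs stand in for hidden states of newly generated tokens.

import numpy as np

np.random.seed(1)
d_model, d_k = 6, 3
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
K_cache, V_cache = [], []                      # one (1, d_k) row per generated token

def decode_step(x_new):
    q = x_new @ Wq                             # only the new token's query
    K_cache.append(x_new @ Wk)                 # append one new key ...
    V_cache.append(x_new @ Wv)                 # ... and one new value
    K, V = np.vstack(K_cache), np.vstack(V_cache)
    scores = (q @ K.T) / np.sqrt(d_k)          # (1, t): new token vs. the whole prefix
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

for _ in range(4):                             # simulate four decoding steps
    out = decode_step(np.random.randn(1, d_model))
print("cached keys:", len(K_cache), "output:", out.shape)

Because the query always comes from the newest token, causality holds by construction: the cache only ever contains past positions, so no explicit mask is needed during decoding.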