Tokens & Embeddings
An LLM begins with token IDs. An embedding matrix $E \in \mathbb{R}^{|V| \times d}$ maps each token ID to a vector in $\mathbb{R}^d$. The model does not see words directly; it sees rows of this matrix.
import numpy as np

vocab_size, d_model = 8, 4           # tiny vocabulary, 4-dimensional embeddings
token_ids = np.array([2, 5, 1])      # a sequence of three token IDs
E = np.arange(vocab_size * d_model).reshape(vocab_size, d_model) / 10  # toy embedding matrix
X = E[token_ids]                     # row lookup: one d_model vector per token
print("Embedding shape:", X.shape)   # (3, 4)
print(X)
Scaled Dot-Product Attention
Attention compares queries to keys using dot products. If $Q,K,V \in \mathbb{R}^{T \times d_k}$, then $QK^\top \in \mathbb{R}^{T \times T}$ contains every token-to-token compatibility score.
flowchart LR
X["Token embeddings X"] --> Q["Q = X Wq"]
X --> K["K = X Wk"]
X --> V["V = X Wv"]
Q --> S["QK transpose / sqrt(dk)"]
K --> S
S --> P["softmax scores"]
V --> O["P V"]
import numpy as np

np.random.seed(7)
T, d_model, d_k = 4, 6, 3
X = np.random.randn(T, d_model)                    # token embeddings
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # project to queries, keys, values
scores = (Q @ K.T) / np.sqrt(d_k)                  # (T, T) compatibility matrix
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable softmax
weights = weights / weights.sum(axis=-1, keepdims=True)        # each row sums to 1
out = weights @ V                                  # weighted mixture of value vectors
print("scores:", scores.shape, "output:", out.shape)
Causal Masking
Autoregressive models must not attend to future tokens. A causal mask sets future logits to a very negative value before softmax, making their probability nearly zero.
import numpy as np

T = 5
logits = np.arange(T * T, dtype=float).reshape(T, T)  # dummy attention logits
mask = np.triu(np.ones((T, T)), k=1).astype(bool)     # True strictly above the diagonal (future positions)
masked_logits = logits.copy()
masked_logits[mask] = -1e9                            # future logits become effectively -inf
print(masked_logits)
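Running the same softmax as in the attention example over the masked logits confirms the claim: every entry above the diagonal becomes (numerically) zero, so no token attends to its future.

weights = np.exp(masked_logits - masked_logits.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(weights.round(3))   # upper triangle is 0: no attention to future tokens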
Residuals & Layer Norm
Residual connections preserve a direct path for information and gradients: $x_{l+1}=x_l+F(x_l)$. Layer normalization stabilizes each token representation by normalizing across hidden dimensions.
$$\text{LayerNorm}(x)=\gamma\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$$
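A minimal NumPy sketch of one residual-plus-LayerNorm step. The sublayer $F$ here is a stand-in (an arbitrary random linear map) purely for illustration; a real block would use attention or an MLP.

import numpy as np

np.random.seed(0)
T, d = 4, 6
x = np.random.randn(T, d)                      # token representations
W = np.random.randn(d, d) / np.sqrt(d)         # stand-in sublayer: F(x) = x @ W
gamma, beta, eps = np.ones(d), np.zeros(d), 1e-5

def layer_norm(h):
    mu = h.mean(axis=-1, keepdims=True)        # mean over hidden dims, per token
    var = h.var(axis=-1, keepdims=True)        # variance over hidden dims, per token
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

y = layer_norm(x + x @ W)                      # residual add, then normalize
print(y.mean(axis=-1).round(6))                # ~0 per token
print(y.std(axis=-1).round(3))                 # ~1 per token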
Decoding & Sampling
The final hidden state is projected to vocabulary logits. Temperature rescales the logits before the softmax: $p_i=\text{softmax}(z/\tau)_i = e^{z_i/\tau}/\sum_j e^{z_j/\tau}$. Lower $\tau$ sharpens the distribution; higher $\tau$ makes sampling more exploratory.
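A small sketch of temperature scaling on toy logits (the logit values are chosen arbitrarily):

import numpy as np

logits = np.array([2.0, 1.0, 0.2, -1.0])       # toy vocabulary logits

def softmax(z):
    e = np.exp(z - z.max())                    # shift for numerical stability
    return e / e.sum()

for tau in (0.5, 1.0, 2.0):
    p = softmax(logits / tau)                  # temperature-scaled distribution
    print(f"tau={tau}: {p.round(3)}")
next_id = np.random.choice(len(logits), p=p)   # sample one token ID from the last distribution
print("sampled id:", next_id)

At $\tau=0.5$ almost all mass sits on the argmax; at $\tau=2.0$ the distribution is visibly flatter.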
KV Cache Complexity
Without caching, every generated token recomputes keys and values for all previous tokens, so step $t$ redoes $O(t)$ projections and a full generation pays $O(T^2)$ projection work. A KV cache stores the previous $K,V$ tensors so each step computes only the new token's query, key, and value and appends one key/value pair; the per-step attention lookup against the cache still costs $O(t)$, and the trade-off is cache memory that grows linearly with sequence length per layer and head.
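A minimal single-head sketch of the caching loop. The shapes and weights are toy values, and the Wq/Wk/Wv naming is reused from the attention example above; inputs stand in for hidden states of newly generated tokens.

import numpy as np

np.random.seed(1)
d_model, d_k = 6, 3
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
K_cache, V_cache = [], []                      # one (1, d_k) row per generated token

def decode_step(x_new):
    q = x_new @ Wq                             # only the new token's query
    K_cache.append(x_new @ Wk)                 # append one new key ...
    V_cache.append(x_new @ Wv)                 # ... and one new value
    K, V = np.vstack(K_cache), np.vstack(V_cache)
    scores = (q @ K.T) / np.sqrt(d_k)          # (1, t): new token vs. the whole prefix
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

for _ in range(4):                             # simulate four decoding steps
    out = decode_step(np.random.randn(1, d_model))
print("cached keys:", len(K_cache), "output:", out.shape)

Because the query always comes from the newest token, causality holds by construction: the cache only ever contains past positions, so no explicit mask is needed during decoding.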