Language Models & Autoregressive Generation
A language model is, at its core, a probability distribution over sequences of tokens. Given a sequence of preceding words (or tokens), a language model predicts what comes next. This is the fundamental task behind GPT-2, ChatGPT, and all modern large language models — they learn to predict the next token in a sequence, one token at a time.
The probability of a sequence is decomposed using the chain rule of probability:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, \ldots, x_{t-1})$$

Each factor $P(x_t | x_{<t})$ is what the model learns to predict. During generation, we sample from this distribution token by token — this is called autoregressive generation. The model generates one token, appends it to the context, and then predicts the next token conditioned on everything so far.
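As a quick illustration of this factorization, here is a small sketch (toy logits, not a real model) that scores a sequence by summing per-step next-token log-probabilities:
import torch
import torch.nn.functional as F
# Toy example: log P(sequence) = sum of log P(x_t | x_<t)
torch.manual_seed(0)
vocab_size, seq_len = 10, 4
tokens = torch.tensor([2, 7, 1, 4])        # a toy "sentence" of token ids
logits = torch.randn(seq_len, vocab_size)  # pretend per-step model outputs
log_probs = F.log_softmax(logits, dim=-1)  # log P(. | x_<t) at each step
per_step = log_probs[torch.arange(seq_len), tokens]
print("Per-step log-probs:", [f"{v:.3f}" for v in per_step.tolist()])
print(f"Sequence log-probability: {per_step.sum().item():.3f}")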
Autoregressive vs Masked Language Models
There are two major paradigms for training language models. Autoregressive models (GPT family) predict tokens left-to-right, seeing only past context. Masked language models (BERT family) mask random tokens and predict them given both left and right context. This fundamental difference determines what each model excels at:
GPT (Decoder-only, autoregressive): Sees only past tokens → Excellent for generation tasks (writing, code completion, chat). Uses a causal mask to prevent attending to future positions.
BERT (Encoder-only, masked): Sees all tokens (bidirectional) → Excellent for understanding tasks (classification, NER, QA). Cannot generate text naturally because it was trained to fill in blanks, not predict sequences.
Why decoder-only wins for generation: Because it's trained to predict the next token given only prior context, the generation procedure (sampling token by token) perfectly matches the training objective. There's no train-test mismatch.
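The difference in training objective is easy to see on a single sentence. Below is a minimal sketch (made-up token ids and a hypothetical [MASK] id) of how each paradigm builds its training targets:
import torch
# The same toy sentence as token ids
tokens = torch.tensor([5, 12, 7, 3, 9])
# Autoregressive (GPT-style): input is the sequence, target is the sequence shifted by one
ar_input, ar_target = tokens[:-1], tokens[1:]
print("AR input :", ar_input.tolist())
print("AR target:", ar_target.tolist())
# Masked LM (BERT-style): hide random positions and predict only those
MASK_ID = 99  # hypothetical mask token id for illustration
torch.manual_seed(0)
mask = torch.rand(len(tokens)) < 0.3
mlm_input = tokens.clone()
mlm_input[mask] = MASK_ID
mlm_target = torch.where(mask, tokens, torch.full_like(tokens, -100))  # -100 = ignored by cross-entropy
print("MLM input :", mlm_input.tolist())
print("MLM target:", mlm_target.tolist())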
Let's demonstrate the autoregressive generation concept with a simple example. We'll show how probabilities are computed and how tokens are selected step by step:
import torch
import torch.nn.functional as F
# Simulate a tiny language model's output
# Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "ran"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
vocab_size = len(vocab)
# Simulate logits (raw model output) for next-token prediction
# Given context "the cat", model outputs logits for each vocab token
torch.manual_seed(42)
logits = torch.randn(vocab_size) # Raw scores from model
# Convert to probabilities via softmax
probs = F.softmax(logits, dim=-1)
print("=== Autoregressive Next-Token Prediction ===")
print(f"Context: 'the cat'")
print(f"\nToken probabilities:")
for token, prob in zip(vocab, probs):
bar = "█" * int(prob * 50)
print(f" {token:6s}: {prob:.4f} {bar}")
# Sample from the distribution (stochastic generation)
sampled_idx = torch.multinomial(probs, num_samples=1)
print(f"\nSampled next token: '{vocab[sampled_idx.item()]}'")
# Greedy decoding (deterministic - pick highest probability)
greedy_idx = torch.argmax(probs)
print(f"Greedy next token: '{vocab[greedy_idx.item()]}'")
print(f"\nAutoregressive = repeat this for each new token!")
This snippet illustrates the core idea: the model outputs a probability distribution over the vocabulary, and we select the next token from that distribution. The entire art of text generation lies in how we select from this distribution — greedy, temperature-scaled, top-k, or nucleus sampling (we'll implement all of these later).
GPT-2 Architecture Overview
GPT-2, released by OpenAI in 2019, is a decoder-only Transformer. Unlike the original Transformer paper (which has both encoder and decoder), GPT-2 uses only the decoder stack with causal (masked) self-attention. The key architectural decisions that distinguish GPT-2 from the original Transformer are:
- Pre-norm: LayerNorm is applied before attention and FFN (not after), which stabilizes training for deep models
- Learned positional embeddings: Instead of fixed sinusoidal encodings, positions are learned parameters (the two approaches are contrasted in the sketch after this list)
- GELU activation: The feed-forward network uses GELU instead of ReLU
- Weight tying: The token embedding matrix is reused as the output projection
- No encoder-decoder cross-attention: Only self-attention within the decoder
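To make the positional-embedding choice concrete, here is a minimal sketch (illustrative, not GPT-2's actual code) comparing a learned position table with fixed sinusoidal encodings:
import math
import torch
import torch.nn as nn
block_size, d_model = 8, 16
# Learned positions (GPT-2 style): a trainable lookup table
learned_pos = nn.Embedding(block_size, d_model)
# Fixed sinusoidal positions (original Transformer): no trainable parameters
pos = torch.arange(block_size).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
sinusoidal = torch.zeros(block_size, d_model)
sinusoidal[:, 0::2] = torch.sin(pos * div)
sinusoidal[:, 1::2] = torch.cos(pos * div)
print(f"Learned: {sum(p.numel() for p in learned_pos.parameters())} trainable params")
print(f"Sinusoidal: 0 trainable params, shape {tuple(sinusoidal.shape)}")
With these choices in mind, the overall data flow through the model looks like this: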
flowchart TD
A[Input Tokens] --> B[Token Embedding]
A --> C[Position Embedding]
B --> D["+"]
C --> D
D --> E[Dropout]
E --> F["GPT Block 1"]
F --> G["GPT Block 2"]
G --> H["..."]
H --> I["GPT Block N"]
I --> J[Final LayerNorm]
J --> K["Linear Head (vocab_size)"]
K --> L["Softmax → P(next token)"]
subgraph block["Single GPT Block"]
direction TB
B1["LayerNorm"] --> B2["Multi-Head Causal Attention"]
B2 --> B3["+ Residual"]
B3 --> B4["LayerNorm"]
B4 --> B5["Feed-Forward (GELU)"]
B5 --> B6["+ Residual"]
end
Pre-Norm vs Post-Norm
The original Transformer uses post-norm: the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$. GPT-2 uses pre-norm: $x + \text{Sublayer}(\text{LayerNorm}(x))$. This seemingly small change makes a huge difference for training stability. With pre-norm, the residual path is completely clean — gradients can flow directly from the loss to any layer without passing through normalizations. This is a large part of why deep stacks like the 48-layer GPT-2 XL train stably and are far less sensitive to learning rate warmup.
import torch
import torch.nn as nn
# Comparison: Pre-Norm vs Post-Norm residual connections
class PostNormBlock(nn.Module):
"""Original Transformer style: normalize AFTER residual addition."""
def __init__(self, d_model):
super().__init__()
self.norm = nn.LayerNorm(d_model)
self.linear = nn.Linear(d_model, d_model)
def forward(self, x):
# Post-norm: LayerNorm(x + Sublayer(x))
return self.norm(x + torch.relu(self.linear(x)))
class PreNormBlock(nn.Module):
"""GPT-2 style: normalize BEFORE the sublayer."""
def __init__(self, d_model):
super().__init__()
self.norm = nn.LayerNorm(d_model)
self.linear = nn.Linear(d_model, d_model)
def forward(self, x):
# Pre-norm: x + Sublayer(LayerNorm(x))
# The residual path (x) is completely clean!
return x + torch.relu(self.linear(self.norm(x)))
# Demonstrate gradient flow difference
d_model = 128
x = torch.randn(1, 10, d_model, requires_grad=True)
# Stack 12 pre-norm blocks (like GPT-2)
blocks = nn.Sequential(*[PreNormBlock(d_model) for _ in range(12)])
output = blocks(x)
loss = output.sum()
loss.backward()
print(f"Input gradient norm after 12 Pre-Norm blocks: {x.grad.norm():.4f}")
print(f"Gradient is well-preserved through clean residual paths!")
print(f"\nPre-norm advantage: gradients bypass sublayers via residual")
print(f"Post-norm problem: gradients must pass through LayerNorm at every layer")
Notice how with pre-norm, the input tensor $x$ has a direct additive path through every block. The gradient of the loss with respect to any intermediate representation has a term that's simply 1 (from the identity skip connection), plus additional gradient terms from the sublayers. This prevents gradients from vanishing in deep networks.
Tokenization
Before text enters GPT-2, it must be converted into integers — this is tokenization. GPT-2 uses Byte Pair Encoding (BPE), a subword tokenization algorithm that sits between character-level and word-level approaches. The GPT-2 vocabulary has exactly 50,257 tokens (256 byte tokens + 50,000 BPE merges + 1 special end-of-text token).
BPE works by starting with individual characters, then iteratively merging the most frequent pair of adjacent tokens. For example, "th" and "e" merge into "the" if they appear together frequently enough. This creates a vocabulary that efficiently handles common words as single tokens while splitting rare words into subword pieces.
Character-Level Tokenizer for Our Mini Model
For our mini GPT-2 that we'll train on Shakespeare, we'll use a simple character-level tokenizer. This keeps things simple while demonstrating the same principles. Each unique character in the training text becomes a token:
import torch
# Character-level tokenizer for our mini GPT-2
# This is what we'll use to train on Shakespeare
class CharTokenizer:
"""Simple character-level tokenizer for demonstration."""
def __init__(self, text):
# Get all unique characters in the text
self.chars = sorted(list(set(text)))
self.vocab_size = len(self.chars)
# Create mappings
self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
def encode(self, text):
"""Convert string to list of integers."""
return [self.char_to_idx[ch] for ch in text]
def decode(self, indices):
"""Convert list of integers back to string."""
return ''.join([self.idx_to_char[i] for i in indices])
# Example with Shakespeare-like text
sample_text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles."""
tokenizer = CharTokenizer(sample_text)
print(f"Vocabulary size: {tokenizer.vocab_size} unique characters")
print(f"Characters: {''.join(tokenizer.chars)}")
# Encode and decode
encoded = tokenizer.encode("To be")
decoded = tokenizer.decode(encoded)
print(f"\n'To be' → {encoded}")
print(f"{encoded} → '{decoded}'")
# Convert to tensor for model input
tokens_tensor = torch.tensor(tokenizer.encode(sample_text[:50]))
print(f"\nFirst 50 chars as tensor: shape {tokens_tensor.shape}")
print(f"Token values: {tokens_tensor[:10].tolist()}...")
Production GPT-2 uses BPE; the pretrained tokenizer is available today through the tiktoken library. Before touching the real tokenizer, let's simulate the BPE merge process to see how it differs from our character-level approach:
import torch
# Demonstrating BPE tokenization concepts
# (tiktoken requires: pip install tiktoken)
# Simulate BPE behavior without the library
# BPE starts with characters and merges frequent pairs
def simple_bpe_demo(text, num_merges=5):
"""Demonstrate BPE merge process step by step."""
# Start: each character is its own token
tokens = list(text)
print(f"Original: {len(tokens)} character tokens")
print(f"Text: '{text}'")
print(f"\nBPE Merge Process:")
for i in range(num_merges):
# Count all adjacent pairs
pairs = {}
for j in range(len(tokens) - 1):
pair = (tokens[j], tokens[j+1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Find most frequent pair
best_pair = max(pairs, key=pairs.get)
merged = best_pair[0] + best_pair[1]
# Merge all occurrences
new_tokens = []
j = 0
while j < len(tokens):
if j < len(tokens) - 1 and (tokens[j], tokens[j+1]) == best_pair:
new_tokens.append(merged)
j += 2
else:
new_tokens.append(tokens[j])
j += 1
tokens = new_tokens
print(f" Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' → "
f"'{merged}' (freq={pairs[best_pair]}) → {len(tokens)} tokens")
return tokens
# Run BPE demo
text = "the cat sat on the mat the cat"
final_tokens = simple_bpe_demo(text)
print(f"\nFinal tokens: {final_tokens}")
print(f"\nGPT-2 BPE vocab: 50,257 tokens (256 bytes + 50,000 merges + 1 EoT)")
print(f"Average English word ≈ 1.3 tokens in GPT-2's tokenizer")
The BPE algorithm learns which character sequences appear most frequently in the training corpus and creates tokens for them. The result is that common words like "the", "and", "is" become single tokens, while rare words get broken into subword units. For example, "tokenization" might become ["token", "ization"] — both are meaningful subparts.
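If you have the tiktoken package installed (pip install tiktoken), you can inspect the real GPT-2 tokenizer directly. A short sketch — the exact splits it prints depend on the learned merges:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
for word in ["the", "tokenization", "Shakespeare"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")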
Causal (Masked) Self-Attention
The core innovation that makes GPT-2 autoregressive is causal self-attention. In standard self-attention, every token can attend to every other token. In causal attention, each token can only attend to itself and tokens that came before it. This is enforced by a causal mask — a lower-triangular matrix that sets future positions to $-\infty$ before the softmax.
The causal attention formula is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where the mask $M$ is defined as $M_{ij} = 0$ if $j \leq i$ (allowed to attend) and $M_{ij} = -\infty$ if $j > i$ (blocked). Since $e^{-\infty} = 0$, future positions contribute zero weight after softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F
def causal_self_attention(x, d_k):
"""
Implement causal (masked) self-attention from scratch.
Args:
x: Input tensor of shape (batch, seq_len, d_model)
d_k: Dimension of keys (for scaling)
Returns:
Attention output of shape (batch, seq_len, d_model)
"""
batch, seq_len, d_model = x.shape
# Project to Q, K, V (in practice, use nn.Linear)
W_q = torch.randn(d_model, d_k) * 0.02
W_k = torch.randn(d_model, d_k) * 0.02
W_v = torch.randn(d_model, d_k) * 0.02
Q = x @ W_q # (batch, seq_len, d_k)
K = x @ W_k # (batch, seq_len, d_k)
V = x @ W_v # (batch, seq_len, d_k)
# Compute attention scores
scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5) # (batch, seq_len, seq_len)
# Create causal mask: lower triangular = 1, upper = 0
mask = torch.tril(torch.ones(seq_len, seq_len)) # Lower triangular
# Apply mask: set future positions to -infinity
scores = scores.masked_fill(mask == 0, float('-inf'))
# Softmax over last dimension (the key dimension)
attn_weights = F.softmax(scores, dim=-1)
# Weighted sum of values
output = attn_weights @ V # (batch, seq_len, d_k)
return output, attn_weights
# Demonstrate causal attention
torch.manual_seed(42)
batch_size, seq_len, d_model, d_k = 1, 5, 32, 32
x = torch.randn(batch_size, seq_len, d_model)
output, weights = causal_self_attention(x, d_k)
print("=== Causal Self-Attention ===")
print(f"Input shape: {list(x.shape)} (batch, seq_len, d_model)")
print(f"Output shape: {list(output.shape)}")
print(f"\nAttention weights (notice the triangular pattern):")
print(f"Token 0 attends to: {weights[0, 0].tolist()}")
print(f"Token 1 attends to: {[f'{w:.3f}' for w in weights[0, 1].tolist()]}")
print(f"Token 4 attends to: {[f'{w:.3f}' for w in weights[0, 4].tolist()]}")
print(f"\nRow sums (should all be 1.0): {weights[0].sum(dim=-1).tolist()}")
print(f"Upper triangle (should be 0): {weights[0].triu(diagonal=1).sum():.6f}")
Notice the triangular structure of the attention weights: token 0 can only attend to itself (weight 1.0), token 1 attends to tokens 0 and 1, and token 4 attends to all five tokens. The upper-triangular entries are exactly zero after softmax because we filled them with $-\infty$. This is what makes GPT autoregressive — during training, all positions are processed in parallel (unlike RNNs), but each position only has access to past information.
Multi-Head Causal Attention
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs multiple attention heads in parallel, each with its own Q, K, V projections, allowing the model to simultaneously attend to different aspects (syntax, semantics, position, etc.). The outputs are concatenated and projected back to the model dimension:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each $\text{head}_i = \text{CausalAttention}(XW_i^Q, XW_i^K, XW_i^V)$.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadCausalAttention(nn.Module):
"""
Multi-head causal self-attention for GPT-2.
Each head independently computes causal attention over a subspace
of the embedding dimension. Results are concatenated and projected.
"""
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads # Dimension per head
# Combined QKV projection (more efficient than separate)
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
# Output projection
self.out_proj = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
batch, seq_len, _ = x.shape
# Project to Q, K, V simultaneously
qkv = self.qkv_proj(x) # (batch, seq_len, 3 * d_model)
q, k, v = qkv.chunk(3, dim=-1) # Each: (batch, seq_len, d_model)
# Reshape for multi-head: (batch, n_heads, seq_len, d_k)
q = q.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention with causal mask
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
# Causal mask
mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
# Apply attention to values
out = attn @ v # (batch, n_heads, seq_len, d_k)
# Concatenate heads: (batch, seq_len, d_model)
out = out.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)
# Final projection
return self.out_proj(out)
# Demonstrate multi-head attention
torch.manual_seed(42)
d_model, n_heads = 128, 4
mha = MultiHeadCausalAttention(d_model, n_heads)
x = torch.randn(2, 10, d_model) # batch=2, seq_len=10
output = mha(x)
print("=== Multi-Head Causal Attention ===")
print(f"d_model={d_model}, n_heads={n_heads}, d_k={d_model // n_heads}")
print(f"Input: {list(x.shape)} (batch, seq_len, d_model)")
print(f"Output: {list(output.shape)} (same shape - residual-friendly)")
print(f"\nParameter count:")
print(f" QKV projection: {d_model} × {3*d_model} = {d_model * 3 * d_model:,}")
print(f" Output projection: {d_model} × {d_model} = {d_model * d_model:,}")
print(f" Total: {sum(p.numel() for p in mha.parameters()):,} parameters")
The key efficiency trick is the combined QKV projection — instead of three separate linear layers for Q, K, and V, we use one large linear layer that outputs all three concatenated. We then split into chunks and reshape for parallel head computation. After attention, heads are concatenated and a final projection mixes information across heads.
The GPT Block
A single GPT block combines multi-head causal attention with a feed-forward network, both wrapped in pre-norm residual connections. The data flow through one block is:
- LayerNorm the input
- Pass through Multi-Head Causal Attention
- Add residual (skip connection from before LayerNorm)
- LayerNorm the result
- Pass through Feed-Forward Network (MLP with GELU)
- Add residual (skip connection from before second LayerNorm)
flowchart LR
X[Input x] --> LN1[LayerNorm]
LN1 --> MHA[Multi-Head Causal Attention]
MHA --> ADD1["+"]
X --> ADD1
ADD1 --> LN2[LayerNorm]
LN2 --> FFN["FFN (GELU)"]
FFN --> ADD2["+"]
ADD1 --> ADD2
ADD2 --> OUT[Output]
Feed-Forward Network (MLP)
The feed-forward network in GPT-2 is a simple two-layer MLP with a GELU activation and a 4× expansion factor. If the model dimension is $d_{\text{model}}$, the hidden dimension is $4 \times d_{\text{model}}$. The formula is:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

GELU (Gaussian Error Linear Unit) is smoother than ReLU — it doesn't have the hard cutoff at zero. Instead, it softly gates values based on their magnitude: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard normal CDF. In practice, we use the approximation $\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$.
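As a quick sanity check, here is a small sketch comparing PyTorch's exact (erf-based) GELU with the tanh approximation given above:
import math
import torch
import torch.nn.functional as F
x = torch.linspace(-3, 3, 7)
exact = F.gelu(x)  # erf-based GELU (PyTorch default)
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
print("x     :", [f"{v:+.2f}" for v in x.tolist()])
print("exact :", [f"{v:+.4f}" for v in exact.tolist()])
print("approx:", [f"{v:+.4f}" for v in approx.tolist()])
print(f"max |difference|: {(exact - approx).abs().max().item():.2e}")
With the activation in hand, here are the feed-forward network and the full GPT block: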
import torch
import torch.nn as nn
import torch.nn.functional as F
class FeedForward(nn.Module):
"""GPT-2 Feed-Forward Network with GELU activation and 4x expansion."""
def __init__(self, d_model, dropout=0.1):
super().__init__()
self.fc1 = nn.Linear(d_model, 4 * d_model) # Expand
self.fc2 = nn.Linear(4 * d_model, d_model) # Project back
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# GELU activation between the two linear layers
x = F.gelu(self.fc1(x))
x = self.dropout(x)
x = self.fc2(x)
return x
class GPTBlock(nn.Module):
"""
A single GPT-2 transformer block with pre-norm residual connections.
Flow: x → LN → MHA → +residual → LN → FFN → +residual
"""
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = MultiHeadCausalAttention(d_model, n_heads, dropout)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = FeedForward(d_model, dropout)
def forward(self, x):
# Pre-norm + attention + residual
x = x + self.attn(self.ln1(x))
# Pre-norm + FFN + residual
x = x + self.ffn(self.ln2(x))
return x
# Need MultiHeadCausalAttention from earlier
class MultiHeadCausalAttention(nn.Module):
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
self.out_proj = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv_proj(x)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
mask = torch.tril(torch.ones(T, T, device=x.device))
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
out = attn @ v
out = out.transpose(1, 2).contiguous().view(B, T, self.d_model)
return self.out_proj(out)
# Build and test a single GPT block
torch.manual_seed(42)
block = GPTBlock(d_model=256, n_heads=4, dropout=0.1)
x = torch.randn(2, 20, 256) # batch=2, seq_len=20, d_model=256
output = block(x)
print("=== GPT-2 Block ===")
print(f"Input: {list(x.shape)}")
print(f"Output: {list(output.shape)} (same shape — stackable!)")
print(f"\nBlock parameters: {sum(p.numel() for p in block.parameters()):,}")
print(f" LayerNorm 1: {2 * 256:,}")
print(f" MHA: {sum(p.numel() for p in block.attn.parameters()):,}")
print(f" LayerNorm 2: {2 * 256:,}")
print(f" FFN: {sum(p.numel() for p in block.ffn.parameters()):,}")
The GPT block preserves the input shape exactly — this is essential because we stack multiple blocks sequentially. Each block refines the representations: the attention sub-layer lets tokens gather information from relevant past tokens, while the FFN sub-layer processes each position independently, adding computational depth. The residual connections ensure that information from earlier layers is always available.
Assembling GPT-2 Mini
Now we assemble the full GPT-2 model. The complete architecture stacks token embeddings + position embeddings → N transformer blocks → final LayerNorm → linear output head. Here's the full implementation with configurable hyperparameters:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadCausalAttention(nn.Module):
"""Multi-head causal self-attention for GPT-2."""
def __init__(self, d_model, n_heads, dropout=0.1, block_size=512):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
self.out_proj = nn.Linear(d_model, d_model)
self.attn_dropout = nn.Dropout(dropout)
self.resid_dropout = nn.Dropout(dropout)
# Register causal mask as buffer (not a parameter)
self.register_buffer("mask", torch.tril(
torch.ones(block_size, block_size)).view(1, 1, block_size, block_size))
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv_proj(x)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
return self.resid_dropout(self.out_proj(out))
class FeedForward(nn.Module):
"""FFN with GELU and 4x expansion."""
def __init__(self, d_model, dropout=0.1):
super().__init__()
self.fc1 = nn.Linear(d_model, 4 * d_model)
self.fc2 = nn.Linear(4 * d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
return self.dropout(self.fc2(F.gelu(self.fc1(x))))
class GPTBlock(nn.Module):
"""Pre-norm transformer block."""
def __init__(self, d_model, n_heads, dropout=0.1, block_size=512):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = MultiHeadCausalAttention(d_model, n_heads, dropout, block_size)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = FeedForward(d_model, dropout)
def forward(self, x):
x = x + self.attn(self.ln1(x))
x = x + self.ffn(self.ln2(x))
return x
class GPT2Mini(nn.Module):
"""
GPT-2 Mini: A complete language model.
Architecture: Token Emb + Pos Emb → N Blocks → LayerNorm → Linear Head
Uses weight tying: embedding weights = output projection weights.
"""
def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=6,
block_size=512, dropout=0.1):
super().__init__()
self.block_size = block_size
# Token and position embeddings
self.token_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(block_size, d_model)
self.dropout = nn.Dropout(dropout)
# Transformer blocks
self.blocks = nn.Sequential(*[
GPTBlock(d_model, n_heads, dropout, block_size)
for _ in range(n_layers)
])
# Final layer norm (pre-norm: applied before the head)
self.ln_f = nn.LayerNorm(d_model)
# Output head (projects back to vocabulary)
self.head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying: share embedding weights with output
self.head.weight = self.token_emb.weight
# Initialize weights
self.apply(self._init_weights)
def _init_weights(self, module):
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(self, idx, targets=None):
"""
Args:
idx: Token indices, shape (batch, seq_len)
targets: Target token indices for loss computation
Returns:
logits: Shape (batch, seq_len, vocab_size)
loss: Cross-entropy loss (if targets provided)
"""
B, T = idx.shape
assert T <= self.block_size, f"Sequence length {T} exceeds block_size {self.block_size}"
# Create position indices
pos = torch.arange(0, T, device=idx.device).unsqueeze(0) # (1, T)
# Embed tokens and positions
tok_emb = self.token_emb(idx) # (B, T, d_model)
pos_emb = self.pos_emb(pos) # (1, T, d_model)
x = self.dropout(tok_emb + pos_emb)
# Pass through transformer blocks
x = self.blocks(x)
# Final layer norm + output projection
x = self.ln_f(x)
logits = self.head(x) # (B, T, vocab_size)
# Compute loss if targets provided
loss = None
if targets is not None:
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)),
targets.view(-1)
)
return logits, loss
# Create our mini GPT-2
model = GPT2Mini(
vocab_size=65, # ~65 chars in Shakespeare
d_model=256, # Embedding dimension
n_heads=4, # Attention heads
n_layers=6, # Transformer blocks
block_size=128, # Max sequence length
dropout=0.1
)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print("=== GPT-2 Mini Architecture ===")
print(f"Total parameters: {total_params:,}")
print(f"\nArchitecture:")
print(f" vocab_size: 65 (characters)")
print(f" d_model: 256")
print(f" n_heads: 4 (d_k = 64 per head)")
print(f" n_layers: 6")
print(f" block_size: 128")
print(f" FFN hidden: {4 * 256} (4× expansion)")
# Test forward pass
dummy_input = torch.randint(0, 65, (4, 32)) # batch=4, seq_len=32
logits, _ = model(dummy_input)
print(f"\nForward pass:")
print(f" Input: {list(dummy_input.shape)} (batch, seq_len)")
print(f" Output: {list(logits.shape)} (batch, seq_len, vocab_size)")
# Test with targets (training mode)
targets = torch.randint(0, 65, (4, 32))
_, loss = model(dummy_input, targets)
print(f" Loss: {loss.item():.4f} (random init ≈ ln(65) = {torch.log(torch.tensor(65.0)):.4f})")
Weight Tying
Weight tying is an elegant trick where the token embedding matrix (shape [vocab_size, d_model]) is reused as the output projection (shape [d_model, vocab_size]). The intuition: if two words have similar embeddings, they should also have similar output probabilities in similar contexts. This reduces parameters significantly (especially with large vocabularies) and acts as a form of regularization. The next-token probability becomes:

$$P(x_{t+1} \mid x_{\le t}) = \text{softmax}(W_e h_t)$$
where $h_t$ is the hidden state at position $t$ and $W_e$ is the shared embedding matrix. The output logit for token $i$ is simply the dot product between the hidden state and that token's embedding vector — tokens whose embeddings are "close" to the hidden state get high probability.
self.head.weight = self.token_emb.weight — they're literally the same tensor in memory. When backpropagation updates one, it updates both. This saves vocab_size × d_model parameters. For GPT-2 (vocab=50,257, d_model=768), that's ~38.6M parameters saved!
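A minimal sketch (toy sizes, independent of the model above) that verifies the tied layers really share one tensor:
import torch.nn as nn
vocab_size, d_model = 100, 32
token_emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = token_emb.weight  # tie: both modules now hold the same Parameter
print("Same tensor:", head.weight is token_emb.weight)
unique = {id(p) for p in list(token_emb.parameters()) + list(head.parameters())}
print("Unique parameter tensors:", len(unique))   # 1, not 2
print("Parameters saved:", vocab_size * d_model)  # vocab_size × d_model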
Training on Shakespeare
Now let's train our mini GPT-2 on Shakespeare's complete works. We'll create a character-level dataset, implement the training loop with cross-entropy loss, and watch the model progress from random gibberish to recognizable English. The loss we minimize is the average negative log-likelihood of the correct next token:
$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P(x_t | x_{<t})$$

First, let's set up the dataset. We'll download Shakespeare's works and create training sequences by sampling random windows of text:
import torch
from torch.utils.data import Dataset, DataLoader
class TextDataset(Dataset):
"""
Character-level text dataset for language modeling.
Each sample is a (input, target) pair where target is input shifted by 1.
Example: "Hello" → input="Hell", target="ello"
"""
def __init__(self, text, block_size, tokenizer_encode):
self.block_size = block_size
# Encode entire text to integers
self.data = torch.tensor(tokenizer_encode(text), dtype=torch.long)
print(f"Dataset: {len(self.data):,} tokens, "
f"{len(self.data) // block_size:,} possible sequences")
def __len__(self):
return len(self.data) - self.block_size
def __getitem__(self, idx):
# Get a window of block_size + 1 tokens
chunk = self.data[idx : idx + self.block_size + 1]
x = chunk[:-1] # Input: tokens 0..T-1
y = chunk[1:] # Target: tokens 1..T (shifted by 1)
return x, y
# Simulate Shakespeare-like text for demonstration
# (In practice: download from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)
shakespeare_sample = """ROMEO: O, she doth teach the torches to burn bright!
It seems she hangs upon the cheek of night
Like a rich jewel in an Ethiope's ear;
Beauty too rich for use, for earth too dear!
So shows a snowy dove trooping with crows,
As yonder lady o'er her fellows shows.
The measure done, I'll watch her place of stand,
And, touching hers, make blessed my rude hand.
Did my heart love till now? forswear it, sight!
For I ne'er saw true beauty till this night.
JULIET: O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO: Shall I hear more, or shall I speak at this?
""" * 50 # Repeat to have enough data
# Build tokenizer
chars = sorted(list(set(shakespeare_sample)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [char_to_idx[c] for c in s]
decode = lambda l: ''.join([idx_to_char[i] for i in l])
print(f"Vocabulary: {vocab_size} characters")
print(f"Text length: {len(shakespeare_sample):,} characters")
# Create dataset
block_size = 128
dataset = TextDataset(shakespeare_sample, block_size, encode)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Verify shapes
x_batch, y_batch = next(iter(dataloader))
print(f"\nBatch shapes:")
print(f" Input: {list(x_batch.shape)} (batch, block_size)")
print(f" Target: {list(y_batch.shape)} (batch, block_size)")
print(f"\nExample (first 40 chars):")
print(f" Input: '{decode(x_batch[0][:40].tolist())}'")
print(f" Target: '{decode(y_batch[0][:40].tolist())}'")
print(f" (Target is input shifted by 1 position)")
The target is simply the input shifted by one position. If the input is "ROMEO: O, she doth", the target is "OMEO: O, she doth " — at every position, the model predicts the character that comes next. This is the self-supervised objective that requires no human labeling — the text itself provides supervision.
Training Loop with Learning Rate Schedule
Modern LLM training uses learning rate warmup followed by cosine decay. During warmup, we linearly increase the LR from 0 to the peak value — this prevents early instabilities when gradients are noisy. Then cosine decay smoothly reduces the LR to a minimum value, allowing the model to converge to a sharper minimum:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# Training configuration
config = {
'vocab_size': 50, # Simplified for demo
'd_model': 128,
'n_heads': 4,
'n_layers': 4,
'block_size': 64,
'dropout': 0.1,
'learning_rate': 3e-4,
'warmup_steps': 100,
'max_steps': 1000,
'min_lr': 3e-5,
}
def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
"""Cosine learning rate schedule with linear warmup."""
if step < warmup_steps:
# Linear warmup
return max_lr * (step / warmup_steps)
elif step > max_steps:
return min_lr
else:
# Cosine decay
decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
return min_lr + coeff * (max_lr - min_lr)
# Compute the schedule over all steps (could be plotted); print a few samples below
steps = list(range(1200))
lrs = [get_lr(s, config['warmup_steps'], config['max_steps'],
config['learning_rate'], config['min_lr']) for s in steps]
print("=== Learning Rate Schedule ===")
print(f"Warmup: 0 → {config['learning_rate']} over {config['warmup_steps']} steps")
print(f"Decay: {config['learning_rate']} → {config['min_lr']} (cosine) over "
f"{config['max_steps'] - config['warmup_steps']} steps")
print(f"\nSample LR values:")
for s in [0, 50, 100, 250, 500, 750, 1000]:
lr = get_lr(s, config['warmup_steps'], config['max_steps'],
config['learning_rate'], config['min_lr'])
print(f" Step {s:4d}: LR = {lr:.6f}")
Now let's put together the complete training loop. This shows how to train our mini GPT-2 model and periodically generate text to monitor progress:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# --- Minimal model definition (self-contained) ---
class MiniGPT(nn.Module):
"""Minimal GPT for training demonstration."""
def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=64):
super().__init__()
self.block_size = block_size
self.token_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(block_size, d_model)
# Simplified transformer blocks
self.blocks = nn.ModuleList()
for _ in range(n_layers):
self.blocks.append(nn.ModuleDict({
'ln1': nn.LayerNorm(d_model),
'ln2': nn.LayerNorm(d_model),
'attn_proj': nn.Linear(d_model, 3 * d_model),
'attn_out': nn.Linear(d_model, d_model),
'ff1': nn.Linear(d_model, 4 * d_model),
'ff2': nn.Linear(4 * d_model, d_model),
}))
self.ln_f = nn.LayerNorm(d_model)
self.head = nn.Linear(d_model, vocab_size, bias=False)
self.head.weight = self.token_emb.weight # Weight tying
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))
def forward(self, idx, targets=None):
B, T = idx.shape
x = self.token_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
for block in self.blocks:
# Attention with causal mask
h = block['ln1'](x)
qkv = block['attn_proj'](h)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
scores = scores.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
x = x + block['attn_out'](out)
# FFN
h = block['ln2'](x)
x = x + block['ff2'](F.gelu(block['ff1'](h)))
logits = self.head(self.ln_f(x))
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
# --- Training setup ---
text = "To be or not to be that is the question " * 200
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
data = torch.tensor(encode(text), dtype=torch.long)
block_size = 64
model = MiniGPT(vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Training on {len(data):,} tokens\n")
# Training loop
model.train()
for step in range(301):
# Random batch of sequences
ix = torch.randint(0, len(data) - block_size - 1, (16,))
xb = torch.stack([data[i:i+block_size] for i in ix])
yb = torch.stack([data[i+1:i+block_size+1] for i in ix])
# Forward + backward
logits, loss = model(xb, yb)
optimizer.zero_grad()
loss.backward()
# Gradient clipping (prevents exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
if step % 100 == 0:
print(f"Step {step:4d} | Loss: {loss.item():.4f}")
# Generate sample text
model.eval()
context = torch.tensor([encode("To be")], dtype=torch.long)
with torch.no_grad():
for _ in range(40):
logits, _ = model(context[:, -block_size:])
probs = F.softmax(logits[:, -1, :], dim=-1)
next_token = torch.multinomial(probs, 1)
context = torch.cat([context, next_token], dim=1)
generated = decode(context[0].tolist())
print(f" Generated: '{generated[:60]}...'")
model.train()
print("\nTraining complete! Loss should decrease from ~ln(vocab) to < 1.0")
As training progresses, you'll see the loss decrease from approximately $\ln(\text{vocab\_size})$ (random guessing) toward values below 1.0. The generated text will evolve from random characters → recognizable words → grammatically coherent phrases. With enough data and training, character-level models can produce surprisingly fluent text.
Text Generation Strategies
Once trained, how we sample from the model dramatically affects output quality. The raw model outputs logits $z_i$ for each vocabulary token. We convert these to probabilities and then select a token. The simplest approach is greedy decoding (always pick the highest probability), but this produces repetitive, boring text. Better strategies introduce controlled randomness:
Temperature sampling scales the logits before softmax. Temperature $\tau$ controls randomness:
$$P_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$

- $\tau = 1.0$: Standard sampling (model's natural distribution)
- $\tau < 1.0$: Sharper distribution → more deterministic (confident picks)
- $\tau > 1.0$: Flatter distribution → more random (creative but risky)
- $\tau \to 0$: Equivalent to greedy decoding
import torch
import torch.nn.functional as F
def generate_with_temperature(logits, temperature=1.0):
"""
Apply temperature scaling to logits before sampling.
Args:
logits: Raw model output, shape (vocab_size,)
temperature: Controls randomness (0=greedy, 1=standard, >1=creative)
Returns:
Sampled token index
"""
if temperature == 0:
# Greedy: deterministic
return torch.argmax(logits)
# Scale logits by temperature
scaled_logits = logits / temperature
probs = F.softmax(scaled_logits, dim=-1)
# Sample from the distribution
return torch.multinomial(probs, num_samples=1).item()
# Demonstrate temperature effects
torch.manual_seed(42)
vocab = ["the", "a", "cat", "dog", "sat", "ran", "on", "mat"]
# Simulate model output: "the" has highest logit
logits = torch.tensor([3.0, 1.0, 2.0, 1.5, 0.5, 0.3, 0.8, 0.2])
print("=== Temperature Sampling ===")
print(f"Raw logits: {logits.tolist()}")
print(f"Vocab: {vocab}\n")
for temp in [0.0, 0.3, 0.7, 1.0, 1.5, 2.0]:
if temp == 0:
probs = torch.zeros_like(logits)
probs[torch.argmax(logits)] = 1.0
else:
probs = F.softmax(logits / temp, dim=-1)
prob_str = ' '.join([f'{p:.3f}' for p in probs.tolist()])
top_token = vocab[torch.argmax(probs)]
print(f"τ={temp:.1f}: [{prob_str}] → top='{top_token}' (p={probs.max():.3f})")
print(f"\nLow τ → peaked distribution (always 'the')")
print(f"High τ → flat distribution (random word)")
Temperature alone doesn't prevent the model from occasionally selecting very unlikely tokens. Top-k sampling restricts selection to only the k most probable tokens, while top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p:
Top-k & Nucleus (Top-p) Sampling
Let's implement both strategies. Top-k zeroes out all logits outside the top k candidates before sampling, while top-p sorts tokens by probability and includes only enough to exceed the cumulative threshold. In practice, top-p adapts automatically — it selects fewer tokens when the model is confident and more when uncertain:
import torch
import torch.nn.functional as F
def top_k_sampling(logits, k=10, temperature=1.0):
"""
Top-k sampling: only consider the k highest-probability tokens.
This prevents the model from ever selecting very unlikely tokens,
which can cause incoherent text.
"""
# Apply temperature
scaled = logits / temperature
# Find the k-th largest value
top_k_values, top_k_indices = torch.topk(scaled, k)
# Set everything below top-k to -infinity
filtered = torch.full_like(scaled, float('-inf'))
filtered.scatter_(0, top_k_indices, top_k_values)
# Sample from filtered distribution
probs = F.softmax(filtered, dim=-1)
return torch.multinomial(probs, 1).item()
def top_p_sampling(logits, p=0.9, temperature=1.0):
"""
Nucleus (top-p) sampling: keep smallest set of tokens with cumulative
probability >= p.
Advantage over top-k: adapts to the shape of the distribution.
- Sharp distribution (confident): fewer tokens selected
- Flat distribution (uncertain): more tokens selected
"""
# Apply temperature
scaled = logits / temperature
probs = F.softmax(scaled, dim=-1)
# Sort by probability (descending)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
# Compute cumulative probability
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
# Remove tokens with cumulative probability above threshold
# Keep at least one token (shift right by 1 so first token is never removed)
sorted_mask = cumulative_probs - sorted_probs > p
sorted_probs[sorted_mask] = 0.0
# Normalize remaining probabilities
sorted_probs = sorted_probs / sorted_probs.sum()
# Sample from filtered distribution
sampled_idx = torch.multinomial(sorted_probs, 1).item()
return sorted_indices[sampled_idx].item()
# Compare all generation strategies
torch.manual_seed(42)
vocab = ["once", "upon", "a", "time", "there", "was", "king", "queen",
"dragon", "castle", "magic", "forest", "dark", "bright", "ancient"]
logits = torch.tensor([2.5, 1.8, 1.2, 2.0, 1.5, 1.0, 0.8, 0.6,
0.3, 0.4, 0.2, 0.5, 0.1, -0.2, -0.5])
print("=== Generation Strategy Comparison ===")
print(f"Vocabulary: {len(vocab)} tokens\n")
# Greedy
greedy_idx = torch.argmax(logits)
print(f"1. Greedy: '{vocab[greedy_idx]}' (always picks highest prob)")
# Temperature sampling (run multiple times)
torch.manual_seed(0)
temp_samples = [vocab[top_k_sampling(logits, k=15, temperature=0.7)] for _ in range(5)]
print(f"2. Temperature: {temp_samples} (τ=0.7)")
# Top-k
torch.manual_seed(0)
topk_samples = [vocab[top_k_sampling(logits, k=5, temperature=1.0)] for _ in range(5)]
print(f"3. Top-k (k=5): {topk_samples}")
# Top-p (nucleus)
torch.manual_seed(0)
topp_samples = [vocab[top_p_sampling(logits, p=0.85, temperature=1.0)] for _ in range(5)]
print(f"4. Top-p (p=.85): {topp_samples}")
print(f"\nKey differences:")
print(f" Greedy: deterministic, repetitive")
print(f" Top-k: fixed candidate pool (may be too many or few)")
print(f" Top-p: adaptive pool (responds to model confidence)")
In practice, modern language models use a combination: temperature scaling plus top-p filtering works well for most applications. Chat assistants are commonly run with temperature around 0.7 and top-p around 0.95 for balanced creative output, while lower temperatures (0.1-0.3) suit factual or code tasks where you want near-deterministic answers.
Let's implement a complete generation function that supports all strategies in a single interface:
import torch
import torch.nn.functional as F
@torch.no_grad()
def generate(model, context, max_new_tokens, temperature=1.0, top_k=None, top_p=None):
"""
Complete text generation function with all strategies.
Args:
model: Trained GPT model
context: Starting token indices, shape (1, seq_len)
max_new_tokens: Number of tokens to generate
temperature: Sampling temperature (0=greedy)
top_k: If set, only sample from top-k tokens
top_p: If set, use nucleus sampling with this threshold
Returns:
Generated token indices including the original context
"""
model.eval()
for _ in range(max_new_tokens):
# Crop context to block_size if needed
ctx = context[:, -model.block_size:]
# Forward pass
logits, _ = model(ctx)
logits = logits[:, -1, :] # Only the last position matters
if temperature == 0:
# Greedy decoding
next_token = torch.argmax(logits, dim=-1, keepdim=True)
else:
# Apply temperature
logits = logits / temperature
# Apply top-k filtering
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float('-inf')
# Apply top-p (nucleus) filtering
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens above threshold (keep at least 1)
sorted_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) > top_p
sorted_logits[sorted_mask] = float('-inf')
                # Restore the filtered logits to the original vocabulary order
                logits = sorted_logits.gather(1, sorted_indices.argsort(dim=1))
# Sample from distribution
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
# Append to context
context = torch.cat([context, next_token], dim=1)
return context
# Example usage (with a dummy model for illustration)
print("=== Complete Generation Function ===")
print("Usage examples:")
print(" generate(model, ctx, 100, temperature=0) # Greedy")
print(" generate(model, ctx, 100, temperature=0.8) # Temperature")
print(" generate(model, ctx, 100, temperature=0.8, top_k=40) # Top-k + temp")
print(" generate(model, ctx, 100, temperature=0.9, top_p=0.95) # Nucleus")
print("")
print("Recommended settings:")
print(" Creative writing: temperature=0.9, top_p=0.95")
print(" Code generation: temperature=0.2, top_p=0.95")
print(" Factual Q&A: temperature=0.0 (greedy)")
print(" Brainstorming: temperature=1.2, top_k=50")
The generate function processes one token at a time in a loop. At each step, it runs the full model forward pass on the current context, extracts logits for only the last position, applies the sampling strategy, and appends the result. The context grows by one token each iteration until we've generated the desired number of tokens.
Scaling Laws & GPT-2 Sizes
GPT-2 was released in four sizes, each demonstrating that bigger models learn better representations. Each step up roughly doubles or triples the parameter count, and quality improves steadily across language benchmarks from Small to XL:
| Variant | Parameters | Layers | d_model | Heads |
|---|---|---|---|---|
| GPT-2 Small | 124M | 12 | 768 | 12 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 |
| GPT-2 Large | 774M | 36 | 1280 | 20 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 |
Chinchilla Scaling Laws
DeepMind's Chinchilla paper (2022) revealed that most LLMs were undertrained. The compute-optimal strategy is to scale both model size and training data equally. The key finding: for a given compute budget $C$, the optimal model size $N$ and dataset size $D$ satisfy $N \propto C^{0.5}$ and $D \propto C^{0.5}$. This means a 70B parameter model should be trained on ~1.4 trillion tokens (not 300B tokens as was common practice).
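A rough worked example of that relationship (rule-of-thumb arithmetic using $C \approx 6ND$ and $D \approx 20N$, not the paper's exact fitted constants):
import math
# From C = 6*N*D and D = 20*N, the compute-optimal size is N = sqrt(C / 120)
for compute in [1e21, 1e23, 1e25]:  # illustrative FLOP budgets
    n_opt = math.sqrt(compute / 120)
    d_opt = 20 * n_opt
    print(f"C={compute:.0e} FLOPs -> N ≈ {n_opt/1e9:.1f}B params, D ≈ {d_opt/1e9:.0f}B tokens")
Now let's compute exact parameter counts for each GPT-2 variant and check them against the 20-tokens-per-parameter heuristic: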
import torch
import math
def compute_gpt2_params(n_layers, d_model, vocab_size=50257, block_size=1024):
"""
Calculate exact parameter count for a GPT-2 variant.
Parameters come from:
- Token embeddings: vocab_size × d_model
- Position embeddings: block_size × d_model
- Per block:
- LayerNorm (×2): 2 × 2 × d_model
- QKV projection: d_model × 3 × d_model + 3 × d_model (bias)
- Output projection: d_model × d_model + d_model
- FFN up: d_model × 4 × d_model + 4 × d_model
- FFN down: 4 × d_model × d_model + d_model
- Final LayerNorm: 2 × d_model
- Output head: weight-tied (0 extra params)
"""
# Embeddings
token_emb = vocab_size * d_model
pos_emb = block_size * d_model
# Per block
ln_params = 2 * (2 * d_model) # 2 LayerNorms × (weight + bias)
attn_qkv = d_model * 3 * d_model + 3 * d_model # QKV + bias
attn_out = d_model * d_model + d_model # Output proj + bias
ffn_up = d_model * 4 * d_model + 4 * d_model
ffn_down = 4 * d_model * d_model + d_model
block_params = ln_params + attn_qkv + attn_out + ffn_up + ffn_down
# Final LN
final_ln = 2 * d_model
# Total (head is weight-tied)
total = token_emb + pos_emb + n_layers * block_params + final_ln
return total
# Compute for all GPT-2 variants
variants = [
("GPT-2 Small", 12, 768),
("GPT-2 Medium", 24, 1024),
("GPT-2 Large", 36, 1280),
("GPT-2 XL", 48, 1600),
]
print("=== GPT-2 Parameter Counts (Exact) ===\n")
for name, layers, d_model in variants:
params = compute_gpt2_params(layers, d_model)
print(f"{name:15s}: {params:>12,} params ({params/1e6:.1f}M)")
# Chinchilla optimal training tokens
print(f"\n=== Chinchilla Scaling Law ===")
print(f"Rule: Optimal tokens ≈ 20 × parameter count")
print()
for name, layers, d_model in variants:
params = compute_gpt2_params(layers, d_model)
optimal_tokens = 20 * params
actual_tokens = 10_000_000_000 # GPT-2 was trained on ~10B tokens
print(f"{name:15s}: Optimal={optimal_tokens/1e9:.1f}B tokens, "
f"Actual=10B tokens, "
f"{'Under-trained!' if actual_tokens < optimal_tokens else 'OK'}")
This calculation shows that, by Chinchilla's rule of thumb, the larger GPT-2 variants (Large and XL) were undertrained — XL should have seen roughly 3× more tokens than our assumed ~10B — while the smaller variants already exceed the 20-tokens-per-parameter threshold. This insight drove the development of models like LLaMA (Meta), which are smaller but trained on much more data, achieving similar performance to larger undertrained models.
Loading Pretrained GPT-2
While building from scratch teaches us everything about the architecture, in practice we use pretrained models via Hugging Face's transformers library. Let's load GPT-2 and generate text:
import torch
# Using Hugging Face transformers to load pretrained GPT-2
# pip install transformers
# --- OPTION 1: Using the pipeline (simplest) ---
print("=== Loading Pretrained GPT-2 ===")
print("(Requires: pip install transformers)\n")
# Demonstrate the API (conceptual - requires transformers installed)
code_example = """
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load pretrained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# Encode input text
prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
# Generate text
with torch.no_grad():
output = model.generate(
input_ids,
max_new_tokens=50,
temperature=0.8,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
# Decode and print
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
"""
print(code_example)
# Show model architecture details
print("GPT-2 Small Architecture:")
print(f" Parameters: 124,439,808")
print(f" Layers: 12")
print(f" Hidden size: 768")
print(f" Attention heads: 12")
print(f" Context window: 1024 tokens")
print(f" Vocabulary: 50,257 (BPE)")
print(f" Training data: WebText (~40GB of internet text)")
The Hugging Face model is architecturally identical to what we built from scratch — the same pre-norm blocks, causal attention, GELU FFN, and weight tying. The difference is that it has been trained on roughly 40GB of internet text (WebText), which gives it strong general language capabilities.
Fine-Tuning on Custom Text
Fine-tuning adapts a pretrained model to a specific domain or style. We take the pretrained GPT-2 weights and continue training on our custom dataset with a small learning rate. This is much faster than training from scratch because the model already understands language — it just needs to adapt its style:
import torch
import torch.nn.functional as F
# Conceptual fine-tuning code for GPT-2
# (Requires: pip install transformers datasets)
finetune_code = """
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments
from transformers import Trainer, TextDataset, DataCollatorForLanguageModeling
# 1. Load pretrained model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Add padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
# 2. Prepare your custom dataset
def tokenize_function(examples):
return tokenizer(examples['text'], truncation=True, max_length=512)
# 3. Fine-tuning configuration
training_args = TrainingArguments(
output_dir='./gpt2-finetuned',
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch = 16
learning_rate=5e-5, # Small LR for fine-tuning!
warmup_steps=100,
weight_decay=0.01,
logging_steps=50,
save_steps=500,
fp16=True, # Mixed precision for speed
)
# 4. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
"""
print("=== Fine-Tuning Pretrained GPT-2 ===\n")
print(finetune_code)
# Key fine-tuning tips
print("\n=== Fine-Tuning Best Practices ===")
tips = [
("Learning rate", "5e-5 to 2e-5 (10-100x smaller than pretraining)"),
("Epochs", "2-5 (avoid overfitting to small datasets)"),
("Batch size", "16-64 (use gradient accumulation if GPU-limited)"),
("Warmup", "5-10% of total steps"),
("Early stopping", "Monitor validation loss, stop when it increases"),
("Data size", "Minimum ~1000 examples for noticeable style shift"),
("Freezing", "Optionally freeze early layers, only train last N"),
]
for key, value in tips:
print(f" {key:15s}: {value}")
Fine-tuning is remarkably sample-efficient — even a few thousand examples of your target domain can shift GPT-2's writing style noticeably. Common use cases include: generating code documentation, writing in a specific author's style, creating domain-specific content (medical, legal, technical), or adapting to a particular format (emails, tweets, poetry).
Complete Training Example
Let's close by consolidating the pieces into a reusable, self-contained reference: a clean configuration dataclass with a rough compute estimate, followed by one more step-by-step walkthrough of the generation pipeline:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
@dataclass
class GPTConfig:
"""GPT-2 model configuration."""
vocab_size: int = 256 # Byte-level for simplicity
block_size: int = 128 # Maximum context length
n_layers: int = 6 # Number of transformer blocks
n_heads: int = 4 # Number of attention heads
d_model: int = 256 # Embedding dimension
dropout: float = 0.1 # Dropout rate
bias: bool = True # Use bias in Linear layers
# Print configuration
config = GPTConfig()
print("=== GPT-2 Mini Configuration ===")
for field in config.__dataclass_fields__:
print(f" {field:12s}: {getattr(config, field)}")
# Estimate training FLOPs with the ~6·N-per-token rule (forward + backward)
# N ≈ 12 · n_layers · d_model² + vocab_size · d_model
params_est = (12 * config.n_layers * config.d_model**2
              + config.vocab_size * config.d_model)
flops_per_token = 6 * params_est
print(f"\n ~Parameters: {params_est:,}")
print(f" ~FLOPS/token: {flops_per_token:,}")
print(f" ~FLOPS/sequence: {flops_per_token * config.block_size:,}")
The dataclass configuration pattern makes it easy to experiment with different model sizes. You can scale from our mini model (256 hidden, 6 layers) to GPT-2 Small (768 hidden, 12 layers) by simply changing the config. The FLOPS estimation follows the rule of thumb: approximately $6N$ FLOPS per token for a model with $N$ parameters (accounting for both forward and backward passes).
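Applying the same rule of thumb to GPT-2 Small (assuming roughly 10B training tokens, as estimated earlier) gives a ballpark training budget:
# Rough arithmetic only: ~6 FLOPs per parameter per token (forward + backward)
n_params = 124_000_000        # GPT-2 Small
tokens_seen = 10_000_000_000  # assumed ~10B training tokens
train_flops = 6 * n_params * tokens_seen
print(f"~{train_flops:.2e} FLOPs (~{train_flops / 1e18:.1f} exaFLOPs) of training compute")
Finally, let's trace the autoregressive generation pipeline end to end: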
import torch
import torch.nn as nn
import torch.nn.functional as F
# Complete generation pipeline: encode → model → decode → sample → repeat
def demonstrate_generation_pipeline():
"""Show the full autoregressive generation loop step by step."""
# Mini vocabulary for clarity
vocab = list("abcdefghijklmnopqrstuvwxyz .,!?\n")
vocab_size = len(vocab)
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for i, ch in enumerate(vocab)}
print("=== Autoregressive Generation Pipeline ===\n")
# Simulate a trained model's behavior
torch.manual_seed(42)
prompt = "the cat "
context = [stoi[c] for c in prompt if c in stoi]
print(f"Prompt: '{prompt}'")
print(f"Encoded: {context}")
print(f"\nGeneration steps:")
generated = list(prompt)
for step in range(10):
# In reality: run model forward pass here
# model_output = model(torch.tensor([context]))
# logits = model_output[:, -1, :]
# Simulate model output (biased toward common characters)
logits = torch.randn(vocab_size) * 0.5
# Bias toward spaces and common letters
for ch in "satone ":
if ch in stoi:
logits[stoi[ch]] += 1.5
# Apply temperature + top-k
temperature = 0.8
top_k = 10
scaled = logits / temperature
topk_vals, topk_idx = torch.topk(scaled, top_k)
filtered = torch.full_like(scaled, float('-inf'))
filtered.scatter_(0, topk_idx, topk_vals)
probs = F.softmax(filtered, dim=-1)
# Sample
next_idx = torch.multinomial(probs, 1).item()
next_char = itos[next_idx]
# Append and continue
context.append(next_idx)
generated.append(next_char)
print(f" Step {step+1}: context_len={len(context)}, "
f"sampled='{next_char}' (idx={next_idx}, p={probs[next_idx]:.3f})")
print(f"\nFull generated text: '{''.join(generated)}'")
print(f"\nKey: Each step uses ALL previous tokens as context!")
demonstrate_generation_pipeline()
This demonstrates the fundamental loop of autoregressive generation: at each step, the entire context (including all previously generated tokens) feeds into the model, and only the logits at the final position determine the next token. This is why generation is inherently sequential — you can't parallelize it because each token depends on all previous ones.