Language Models & Autoregressive Generation
A language model is, at its core, a probability distribution over sequences of tokens. Given a sequence of preceding words (or tokens), a language model predicts what comes next. This is the fundamental task behind GPT-2, ChatGPT, and all modern large language models — they learn to predict the next token in a sequence, one token at a time.
The probability of a sequence is decomposed using the chain rule of probability:
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, \ldots, x_{t-1})$$

Each factor $P(x_t | x_{<t})$ is what the model learns to predict. During generation, we sample from this distribution token by token — this is called autoregressive generation. The model generates one token, appends it to the context, and then predicts the next token conditioned on everything so far.
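As a quick illustration of this factorization, here is a small sketch (toy logits, not a real model) that scores a sequence by summing per-step next-token log-probabilities:
import torch
import torch.nn.functional as F
# Toy example: log P(sequence) = sum of log P(x_t | x_<t)
torch.manual_seed(0)
vocab_size, seq_len = 10, 4
tokens = torch.tensor([2, 7, 1, 4])        # a toy "sentence" of token ids
logits = torch.randn(seq_len, vocab_size)  # pretend per-step model outputs
log_probs = F.log_softmax(logits, dim=-1)  # log P(. | x_<t) at each step
per_step = log_probs[torch.arange(seq_len), tokens]
print("Per-step log-probs:", [f"{v:.3f}" for v in per_step.tolist()])
print(f"Sequence log-probability: {per_step.sum().item():.3f}")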
Autoregressive vs Masked Language Models
There are two major paradigms for training language models. Autoregressive models (GPT family) predict tokens left-to-right, seeing only past context. Masked language models (BERT family) mask random tokens and predict them given both left and right context. This fundamental difference determines what each model excels at:
GPT (Decoder-only, autoregressive): Sees only past tokens → Excellent for generation tasks (writing, code completion, chat). Uses a causal mask to prevent attending to future positions.
BERT (Encoder-only, masked): Sees all tokens (bidirectional) → Excellent for understanding tasks (classification, NER, QA). Cannot generate text naturally because it was trained to fill in blanks, not predict sequences.
Why decoder-only wins for generation: Because it's trained to predict the next token given only prior context, the generation procedure (sampling token by token) perfectly matches the training objective. There's no train-test mismatch.
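The difference in training objective is easy to see on a single sentence. Below is a minimal sketch (made-up token ids and a hypothetical [MASK] id) of how each paradigm builds its training targets:
import torch
# The same toy sentence as token ids
tokens = torch.tensor([5, 12, 7, 3, 9])
# Autoregressive (GPT-style): input is the sequence, target is the sequence shifted by one
ar_input, ar_target = tokens[:-1], tokens[1:]
print("AR input :", ar_input.tolist())
print("AR target:", ar_target.tolist())
# Masked LM (BERT-style): hide random positions and predict only those
MASK_ID = 99  # hypothetical mask token id for illustration
torch.manual_seed(0)
mask = torch.rand(len(tokens)) < 0.3
mlm_input = tokens.clone()
mlm_input[mask] = MASK_ID
mlm_target = torch.where(mask, tokens, torch.full_like(tokens, -100))  # -100 = ignored by cross-entropy
print("MLM input :", mlm_input.tolist())
print("MLM target:", mlm_target.tolist())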
Let's demonstrate the autoregressive generation concept with a simple example. We'll show how probabilities are computed and how tokens are selected step by step:
import torch
import torch.nn.functional as F
# Simulate a tiny language model's output
# Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "ran"]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
vocab_size = len(vocab)
# Simulate logits (raw model output) for next-token prediction
# Given context "the cat", model outputs logits for each vocab token
torch.manual_seed(42)
logits = torch.randn(vocab_size) # Raw scores from model
# Convert to probabilities via softmax
probs = F.softmax(logits, dim=-1)
print("=== Autoregressive Next-Token Prediction ===")
print(f"Context: 'the cat'")
print(f"\nToken probabilities:")
for token, prob in zip(vocab, probs):
bar = "█" * int(prob * 50)
print(f" {token:6s}: {prob:.4f} {bar}")
# Sample from the distribution (stochastic generation)
sampled_idx = torch.multinomial(probs, num_samples=1)
print(f"\nSampled next token: '{vocab[sampled_idx.item()]}'")
# Greedy decoding (deterministic - pick highest probability)
greedy_idx = torch.argmax(probs)
print(f"Greedy next token: '{vocab[greedy_idx.item()]}'")
print(f"\nAutoregressive = repeat this for each new token!")
This snippet illustrates the core idea: the model outputs a probability distribution over the vocabulary, and we select the next token from that distribution. The entire art of text generation lies in how we select from this distribution — greedy, temperature-scaled, top-k, or nucleus sampling (we'll implement all of these later).
GPT-2 Architecture Overview
GPT-2, released by OpenAI in 2019, is a decoder-only Transformer. Unlike the original Transformer paper (which has both encoder and decoder), GPT-2 uses only the decoder stack with causal (masked) self-attention. The key architectural decisions that distinguish GPT-2 from the original Transformer are:
- Pre-norm: LayerNorm is applied before attention and FFN (not after), which stabilizes training for deep models
- Learned positional embeddings: Instead of fixed sinusoidal encodings, positions are learned parameters (the two approaches are contrasted in the sketch after this list)
- GELU activation: The feed-forward network uses GELU instead of ReLU
- Weight tying: The token embedding matrix is reused as the output projection
- No encoder-decoder cross-attention: Only self-attention within the decoder
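To make the positional-embedding choice concrete, here is a minimal sketch (illustrative, not GPT-2's actual code) comparing a learned position table with fixed sinusoidal encodings:
import math
import torch
import torch.nn as nn
block_size, d_model = 8, 16
# Learned positions (GPT-2 style): a trainable lookup table
learned_pos = nn.Embedding(block_size, d_model)
# Fixed sinusoidal positions (original Transformer): no trainable parameters
pos = torch.arange(block_size).unsqueeze(1).float()
div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
sinusoidal = torch.zeros(block_size, d_model)
sinusoidal[:, 0::2] = torch.sin(pos * div)
sinusoidal[:, 1::2] = torch.cos(pos * div)
print(f"Learned: {sum(p.numel() for p in learned_pos.parameters())} trainable params")
print(f"Sinusoidal: 0 trainable params, shape {tuple(sinusoidal.shape)}")
With these choices in mind, the overall data flow through the model looks like this: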
flowchart TD
A[Input Tokens] --> B[Token Embedding]
A --> C[Position Embedding]
B --> D["+"]
C --> D
D --> E[Dropout]
E --> F["GPT Block 1"]
F --> G["GPT Block 2"]
G --> H["..."]
H --> I["GPT Block N"]
I --> J[Final LayerNorm]
J --> K["Linear Head (vocab_size)"]
K --> L["Softmax → P(next token)"]
subgraph block["Single GPT Block"]
direction TB
B1["LayerNorm"] --> B2["Multi-Head Causal Attention"]
B2 --> B3["+ Residual"]
B3 --> B4["LayerNorm"]
B4 --> B5["Feed-Forward (GELU)"]
B5 --> B6["+ Residual"]
end
Pre-Norm vs Post-Norm
The original Transformer uses post-norm: the output of each sub-layer is $\text{LayerNorm}(x + \text{Sublayer}(x))$. GPT-2 uses pre-norm: $x + \text{Sublayer}(\text{LayerNorm}(x))$. This seemingly small change makes a huge difference for training stability. With pre-norm, the residual path is completely clean — gradients can flow directly from the loss to any layer without passing through normalizations. This is a large part of why deep stacks like the 48-layer GPT-2 XL train stably and are far less sensitive to learning rate warmup.
import torch
import torch.nn as nn
# Comparison: Pre-Norm vs Post-Norm residual connections
class PostNormBlock(nn.Module):
"""Original Transformer style: normalize AFTER residual addition."""
def __init__(self, d_model):
super().__init__()
self.norm = nn.LayerNorm(d_model)
self.linear = nn.Linear(d_model, d_model)
def forward(self, x):
# Post-norm: LayerNorm(x + Sublayer(x))
return self.norm(x + torch.relu(self.linear(x)))
class PreNormBlock(nn.Module):
"""GPT-2 style: normalize BEFORE the sublayer."""
def __init__(self, d_model):
super().__init__()
self.norm = nn.LayerNorm(d_model)
self.linear = nn.Linear(d_model, d_model)
def forward(self, x):
# Pre-norm: x + Sublayer(LayerNorm(x))
# The residual path (x) is completely clean!
return x + torch.relu(self.linear(self.norm(x)))
# Demonstrate gradient flow difference
d_model = 128
x = torch.randn(1, 10, d_model, requires_grad=True)
# Stack 12 pre-norm blocks (like GPT-2)
blocks = nn.Sequential(*[PreNormBlock(d_model) for _ in range(12)])
output = blocks(x)
loss = output.sum()
loss.backward()
print(f"Input gradient norm after 12 Pre-Norm blocks: {x.grad.norm():.4f}")
print(f"Gradient is well-preserved through clean residual paths!")
print(f"\nPre-norm advantage: gradients bypass sublayers via residual")
print(f"Post-norm problem: gradients must pass through LayerNorm at every layer")
Notice how with pre-norm, the input tensor $x$ has a direct additive path through every block. The gradient of the loss with respect to any intermediate representation has a term that's simply 1 (from the identity skip connection), plus additional gradient terms from the sublayers. This prevents gradients from vanishing in deep networks.
Tokenization
Before text enters GPT-2, it must be converted into integers — this is tokenization. GPT-2 uses Byte Pair Encoding (BPE), a subword tokenization algorithm that sits between character-level and word-level approaches. The GPT-2 vocabulary has exactly 50,257 tokens (256 byte tokens + 50,000 BPE merges + 1 special end-of-text token).
BPE works by starting with individual characters, then iteratively merging the most frequent pair of adjacent tokens. For example, "th" and "e" merge into "the" if they appear together frequently enough. This creates a vocabulary that efficiently handles common words as single tokens while splitting rare words into subword pieces.
Character-Level Tokenizer for Our Mini Model
For our mini GPT-2 that we'll train on Shakespeare, we'll use a simple character-level tokenizer. This keeps things simple while demonstrating the same principles. Each unique character in the training text becomes a token:
import torch
# Character-level tokenizer for our mini GPT-2
# This is what we'll use to train on Shakespeare
class CharTokenizer:
"""Simple character-level tokenizer for demonstration."""
def __init__(self, text):
# Get all unique characters in the text
self.chars = sorted(list(set(text)))
self.vocab_size = len(self.chars)
# Create mappings
self.char_to_idx = {ch: i for i, ch in enumerate(self.chars)}
self.idx_to_char = {i: ch for i, ch in enumerate(self.chars)}
def encode(self, text):
"""Convert string to list of integers."""
return [self.char_to_idx[ch] for ch in text]
def decode(self, indices):
"""Convert list of integers back to string."""
return ''.join([self.idx_to_char[i] for i in indices])
# Example with Shakespeare-like text
sample_text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles."""
tokenizer = CharTokenizer(sample_text)
print(f"Vocabulary size: {tokenizer.vocab_size} unique characters")
print(f"Characters: {''.join(tokenizer.chars)}")
# Encode and decode
encoded = tokenizer.encode("To be")
decoded = tokenizer.decode(encoded)
print(f"\n'To be' → {encoded}")
print(f"{encoded} → '{decoded}'")
# Convert to tensor for model input
tokens_tensor = torch.tensor(tokenizer.encode(sample_text[:50]))
print(f"\nFirst 50 chars as tensor: shape {tokens_tensor.shape}")
print(f"Token values: {tokens_tensor[:10].tolist()}...")
Production GPT-2 uses BPE; the pretrained tokenizer is available today through the tiktoken library. Before touching the real tokenizer, let's simulate the BPE merge process to see how it differs from our character-level approach:
import torch
# Demonstrating BPE tokenization concepts
# (tiktoken requires: pip install tiktoken)
# Simulate BPE behavior without the library
# BPE starts with characters and merges frequent pairs
def simple_bpe_demo(text, num_merges=5):
"""Demonstrate BPE merge process step by step."""
# Start: each character is its own token
tokens = list(text)
print(f"Original: {len(tokens)} character tokens")
print(f"Text: '{text}'")
print(f"\nBPE Merge Process:")
for i in range(num_merges):
# Count all adjacent pairs
pairs = {}
for j in range(len(tokens) - 1):
pair = (tokens[j], tokens[j+1])
pairs[pair] = pairs.get(pair, 0) + 1
if not pairs:
break
# Find most frequent pair
best_pair = max(pairs, key=pairs.get)
merged = best_pair[0] + best_pair[1]
# Merge all occurrences
new_tokens = []
j = 0
while j < len(tokens):
if j < len(tokens) - 1 and (tokens[j], tokens[j+1]) == best_pair:
new_tokens.append(merged)
j += 2
else:
new_tokens.append(tokens[j])
j += 1
tokens = new_tokens
print(f" Merge {i+1}: '{best_pair[0]}' + '{best_pair[1]}' → "
f"'{merged}' (freq={pairs[best_pair]}) → {len(tokens)} tokens")
return tokens
# Run BPE demo
text = "the cat sat on the mat the cat"
final_tokens = simple_bpe_demo(text)
print(f"\nFinal tokens: {final_tokens}")
print(f"\nGPT-2 BPE vocab: 50,257 tokens (256 bytes + 50,000 merges + 1 EoT)")
print(f"Average English word ≈ 1.3 tokens in GPT-2's tokenizer")
The BPE algorithm learns which character sequences appear most frequently in the training corpus and creates tokens for them. The result is that common words like "the", "and", "is" become single tokens, while rare words get broken into subword units. For example, "tokenization" might become ["token", "ization"] — both are meaningful subparts.
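If you have the tiktoken package installed (pip install tiktoken), you can inspect the real GPT-2 tokenizer directly. A short sketch — the exact splits it prints depend on the learned merges:
import tiktoken
enc = tiktoken.get_encoding("gpt2")
for word in ["the", "tokenization", "Shakespeare"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")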
Causal (Masked) Self-Attention
The core innovation that makes GPT-2 autoregressive is causal self-attention. In standard self-attention, every token can attend to every other token. In causal attention, each token can only attend to itself and tokens that came before it. This is enforced by a causal mask — a lower-triangular matrix that sets future positions to $-\infty$ before the softmax.
The causal attention formula is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

where the mask $M$ is defined as $M_{ij} = 0$ if $j \leq i$ (allowed to attend) and $M_{ij} = -\infty$ if $j > i$ (blocked). Since $e^{-\infty} = 0$, future positions contribute zero weight after softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F
def causal_self_attention(x, d_k):
"""
Implement causal (masked) self-attention from scratch.
Args:
x: Input tensor of shape (batch, seq_len, d_model)
d_k: Dimension of keys (for scaling)
Returns:
Attention output of shape (batch, seq_len, d_model)
"""
batch, seq_len, d_model = x.shape
# Project to Q, K, V (in practice, use nn.Linear)
W_q = torch.randn(d_model, d_k) * 0.02
W_k = torch.randn(d_model, d_k) * 0.02
W_v = torch.randn(d_model, d_k) * 0.02
Q = x @ W_q # (batch, seq_len, d_k)
K = x @ W_k # (batch, seq_len, d_k)
V = x @ W_v # (batch, seq_len, d_k)
# Compute attention scores
scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5) # (batch, seq_len, seq_len)
# Create causal mask: lower triangular = 1, upper = 0
mask = torch.tril(torch.ones(seq_len, seq_len)) # Lower triangular
# Apply mask: set future positions to -infinity
scores = scores.masked_fill(mask == 0, float('-inf'))
# Softmax over last dimension (the key dimension)
attn_weights = F.softmax(scores, dim=-1)
# Weighted sum of values
output = attn_weights @ V # (batch, seq_len, d_k)
return output, attn_weights
# Demonstrate causal attention
torch.manual_seed(42)
batch_size, seq_len, d_model, d_k = 1, 5, 32, 32
x = torch.randn(batch_size, seq_len, d_model)
output, weights = causal_self_attention(x, d_k)
print("=== Causal Self-Attention ===")
print(f"Input shape: {list(x.shape)} (batch, seq_len, d_model)")
print(f"Output shape: {list(output.shape)}")
print(f"\nAttention weights (notice the triangular pattern):")
print(f"Token 0 attends to: {weights[0, 0].tolist()}")
print(f"Token 1 attends to: {[f'{w:.3f}' for w in weights[0, 1].tolist()]}")
print(f"Token 4 attends to: {[f'{w:.3f}' for w in weights[0, 4].tolist()]}")
print(f"\nRow sums (should all be 1.0): {weights[0].sum(dim=-1).tolist()}")
print(f"Upper triangle (should be 0): {weights[0].triu(diagonal=1).sum():.6f}")
Notice the triangular structure of the attention weights: token 0 can only attend to itself (weight 1.0), token 1 attends to tokens 0 and 1, and token 4 attends to all five tokens. The upper-triangular entries are exactly zero after softmax because we filled them with $-\infty$. This is what makes GPT autoregressive — during training, all positions are processed in parallel (unlike RNNs), but each position only has access to past information.
Multi-Head Causal Attention
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs multiple attention heads in parallel, each with its own Q, K, V projections, allowing the model to simultaneously attend to different aspects (syntax, semantics, position, etc.). The outputs are concatenated and projected back to the model dimension:
$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each $\text{head}_i = \text{CausalAttention}(XW_i^Q, XW_i^K, XW_i^V)$.
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadCausalAttention(nn.Module):
"""
Multi-head causal self-attention for GPT-2.
Each head independently computes causal attention over a subspace
of the embedding dimension. Results are concatenated and projected.
"""
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads # Dimension per head
# Combined QKV projection (more efficient than separate)
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
# Output projection
self.out_proj = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
batch, seq_len, _ = x.shape
# Project to Q, K, V simultaneously
qkv = self.qkv_proj(x) # (batch, seq_len, 3 * d_model)
q, k, v = qkv.chunk(3, dim=-1) # Each: (batch, seq_len, d_model)
# Reshape for multi-head: (batch, n_heads, seq_len, d_k)
q = q.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(batch, seq_len, self.n_heads, self.d_k).transpose(1, 2)
# Scaled dot-product attention with causal mask
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
# Causal mask
mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
# Apply attention to values
out = attn @ v # (batch, n_heads, seq_len, d_k)
# Concatenate heads: (batch, seq_len, d_model)
out = out.transpose(1, 2).contiguous().view(batch, seq_len, self.d_model)
# Final projection
return self.out_proj(out)
# Demonstrate multi-head attention
torch.manual_seed(42)
d_model, n_heads = 128, 4
mha = MultiHeadCausalAttention(d_model, n_heads)
x = torch.randn(2, 10, d_model) # batch=2, seq_len=10
output = mha(x)
print("=== Multi-Head Causal Attention ===")
print(f"d_model={d_model}, n_heads={n_heads}, d_k={d_model // n_heads}")
print(f"Input: {list(x.shape)} (batch, seq_len, d_model)")
print(f"Output: {list(output.shape)} (same shape - residual-friendly)")
print(f"\nParameter count:")
print(f" QKV projection: {d_model} × {3*d_model} = {d_model * 3 * d_model:,}")
print(f" Output projection: {d_model} × {d_model} = {d_model * d_model:,}")
print(f" Total: {sum(p.numel() for p in mha.parameters()):,} parameters")
The key efficiency trick is the combined QKV projection — instead of three separate linear layers for Q, K, and V, we use one large linear layer that outputs all three concatenated. We then split into chunks and reshape for parallel head computation. After attention, heads are concatenated and a final projection mixes information across heads.
The GPT Block
A single GPT block combines multi-head causal attention with a feed-forward network, both wrapped in pre-norm residual connections. The data flow through one block is:
- LayerNorm the input
- Pass through Multi-Head Causal Attention
- Add residual (skip connection from before LayerNorm)
- LayerNorm the result
- Pass through Feed-Forward Network (MLP with GELU)
- Add residual (skip connection from before second LayerNorm)
flowchart LR
X[Input x] --> LN1[LayerNorm]
LN1 --> MHA[Multi-Head Causal Attention]
MHA --> ADD1["+"]
X --> ADD1
ADD1 --> LN2[LayerNorm]
LN2 --> FFN["FFN (GELU)"]
FFN --> ADD2["+"]
ADD1 --> ADD2
ADD2 --> OUT[Output]
Feed-Forward Network (MLP)
The feed-forward network in GPT-2 is a simple two-layer MLP with a GELU activation and a 4× expansion factor. If the model dimension is $d_{\text{model}}$, the hidden dimension is $4 \times d_{\text{model}}$. The formula is:
$$\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2$$

GELU (Gaussian Error Linear Unit) is smoother than ReLU — it doesn't have the hard cutoff at zero. Instead, it softly gates values based on their magnitude: $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard normal CDF. In practice, we use the approximation $\text{GELU}(x) \approx 0.5x(1 + \tanh[\sqrt{2/\pi}(x + 0.044715x^3)])$.
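As a quick sanity check, here is a small sketch comparing PyTorch's exact (erf-based) GELU with the tanh approximation given above:
import math
import torch
import torch.nn.functional as F
x = torch.linspace(-3, 3, 7)
exact = F.gelu(x)  # erf-based GELU (PyTorch default)
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
print("x     :", [f"{v:+.2f}" for v in x.tolist()])
print("exact :", [f"{v:+.4f}" for v in exact.tolist()])
print("approx:", [f"{v:+.4f}" for v in approx.tolist()])
print(f"max |difference|: {(exact - approx).abs().max().item():.2e}")
With the activation in hand, here are the feed-forward network and the full GPT block: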
import torch
import torch.nn as nn
import torch.nn.functional as F
class FeedForward(nn.Module):
"""GPT-2 Feed-Forward Network with GELU activation and 4x expansion."""
def __init__(self, d_model, dropout=0.1):
super().__init__()
self.fc1 = nn.Linear(d_model, 4 * d_model) # Expand
self.fc2 = nn.Linear(4 * d_model, d_model) # Project back
self.dropout = nn.Dropout(dropout)
def forward(self, x):
# GELU activation between the two linear layers
x = F.gelu(self.fc1(x))
x = self.dropout(x)
x = self.fc2(x)
return x
class GPTBlock(nn.Module):
"""
A single GPT-2 transformer block with pre-norm residual connections.
Flow: x → LN → MHA → +residual → LN → FFN → +residual
"""
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = MultiHeadCausalAttention(d_model, n_heads, dropout)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = FeedForward(d_model, dropout)
def forward(self, x):
# Pre-norm + attention + residual
x = x + self.attn(self.ln1(x))
# Pre-norm + FFN + residual
x = x + self.ffn(self.ln2(x))
return x
# Need MultiHeadCausalAttention from earlier
class MultiHeadCausalAttention(nn.Module):
def __init__(self, d_model, n_heads, dropout=0.1):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
self.out_proj = nn.Linear(d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv_proj(x)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
mask = torch.tril(torch.ones(T, T, device=x.device))
scores = scores.masked_fill(mask == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.dropout(attn)
out = attn @ v
out = out.transpose(1, 2).contiguous().view(B, T, self.d_model)
return self.out_proj(out)
# Build and test a single GPT block
torch.manual_seed(42)
block = GPTBlock(d_model=256, n_heads=4, dropout=0.1)
x = torch.randn(2, 20, 256) # batch=2, seq_len=20, d_model=256
output = block(x)
print("=== GPT-2 Block ===")
print(f"Input: {list(x.shape)}")
print(f"Output: {list(output.shape)} (same shape — stackable!)")
print(f"\nBlock parameters: {sum(p.numel() for p in block.parameters()):,}")
print(f" LayerNorm 1: {2 * 256:,}")
print(f" MHA: {sum(p.numel() for p in block.attn.parameters()):,}")
print(f" LayerNorm 2: {2 * 256:,}")
print(f" FFN: {sum(p.numel() for p in block.ffn.parameters()):,}")
The GPT block preserves the input shape exactly — this is essential because we stack multiple blocks sequentially. Each block refines the representations: the attention sub-layer lets tokens gather information from relevant past tokens, while the FFN sub-layer processes each position independently, adding computational depth. The residual connections ensure that information from earlier layers is always available.
Assembling GPT-2 Mini
Now we assemble the full GPT-2 model. The complete architecture stacks token embeddings + position embeddings → N transformer blocks → final LayerNorm → linear output head. Here's the full implementation with configurable hyperparameters:
import torch
import torch.nn as nn
import torch.nn.functional as F
class MultiHeadCausalAttention(nn.Module):
"""Multi-head causal self-attention for GPT-2."""
def __init__(self, d_model, n_heads, dropout=0.1, block_size=512):
super().__init__()
self.d_model = d_model
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.qkv_proj = nn.Linear(d_model, 3 * d_model)
self.out_proj = nn.Linear(d_model, d_model)
self.attn_dropout = nn.Dropout(dropout)
self.resid_dropout = nn.Dropout(dropout)
# Register causal mask as buffer (not a parameter)
self.register_buffer("mask", torch.tril(
torch.ones(block_size, block_size)).view(1, 1, block_size, block_size))
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv_proj(x)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = q @ k.transpose(-2, -1) / (self.d_k ** 0.5)
scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
attn = self.attn_dropout(attn)
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, C)
return self.resid_dropout(self.out_proj(out))
class FeedForward(nn.Module):
"""FFN with GELU and 4x expansion."""
def __init__(self, d_model, dropout=0.1):
super().__init__()
self.fc1 = nn.Linear(d_model, 4 * d_model)
self.fc2 = nn.Linear(4 * d_model, d_model)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
return self.dropout(self.fc2(F.gelu(self.fc1(x))))
class GPTBlock(nn.Module):
"""Pre-norm transformer block."""
def __init__(self, d_model, n_heads, dropout=0.1, block_size=512):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.attn = MultiHeadCausalAttention(d_model, n_heads, dropout, block_size)
self.ln2 = nn.LayerNorm(d_model)
self.ffn = FeedForward(d_model, dropout)
def forward(self, x):
x = x + self.attn(self.ln1(x))
x = x + self.ffn(self.ln2(x))
return x
class GPT2Mini(nn.Module):
"""
GPT-2 Mini: A complete language model.
Architecture: Token Emb + Pos Emb → N Blocks → LayerNorm → Linear Head
Uses weight tying: embedding weights = output projection weights.
"""
def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=6,
block_size=512, dropout=0.1):
super().__init__()
self.block_size = block_size
# Token and position embeddings
self.token_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(block_size, d_model)
self.dropout = nn.Dropout(dropout)
# Transformer blocks
self.blocks = nn.Sequential(*[
GPTBlock(d_model, n_heads, dropout, block_size)
for _ in range(n_layers)
])
# Final layer norm (pre-norm: applied before the head)
self.ln_f = nn.LayerNorm(d_model)
# Output head (projects back to vocabulary)
self.head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying: share embedding weights with output
self.head.weight = self.token_emb.weight
# Initialize weights
self.apply(self._init_weights)
def _init_weights(self, module):
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(self, idx, targets=None):
"""
Args:
idx: Token indices, shape (batch, seq_len)
targets: Target token indices for loss computation
Returns:
logits: Shape (batch, seq_len, vocab_size)
loss: Cross-entropy loss (if targets provided)
"""
B, T = idx.shape
assert T <= self.block_size, f"Sequence length {T} exceeds block_size {self.block_size}"
# Create position indices
pos = torch.arange(0, T, device=idx.device).unsqueeze(0) # (1, T)
# Embed tokens and positions
tok_emb = self.token_emb(idx) # (B, T, d_model)
pos_emb = self.pos_emb(pos) # (1, T, d_model)
x = self.dropout(tok_emb + pos_emb)
# Pass through transformer blocks
x = self.blocks(x)
# Final layer norm + output projection
x = self.ln_f(x)
logits = self.head(x) # (B, T, vocab_size)
# Compute loss if targets provided
loss = None
if targets is not None:
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)),
targets.view(-1)
)
return logits, loss
# Create our mini GPT-2
model = GPT2Mini(
vocab_size=65, # ~65 chars in Shakespeare
d_model=256, # Embedding dimension
n_heads=4, # Attention heads
n_layers=6, # Transformer blocks
block_size=128, # Max sequence length
dropout=0.1
)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print("=== GPT-2 Mini Architecture ===")
print(f"Total parameters: {total_params:,}")
print(f"\nArchitecture:")
print(f" vocab_size: 65 (characters)")
print(f" d_model: 256")
print(f" n_heads: 4 (d_k = 64 per head)")
print(f" n_layers: 6")
print(f" block_size: 128")
print(f" FFN hidden: {4 * 256} (4× expansion)")
# Test forward pass
dummy_input = torch.randint(0, 65, (4, 32)) # batch=4, seq_len=32
logits, _ = model(dummy_input)
print(f"\nForward pass:")
print(f" Input: {list(dummy_input.shape)} (batch, seq_len)")
print(f" Output: {list(logits.shape)} (batch, seq_len, vocab_size)")
# Test with targets (training mode)
targets = torch.randint(0, 65, (4, 32))
_, loss = model(dummy_input, targets)
print(f" Loss: {loss.item():.4f} (random init ≈ ln(65) = {torch.log(torch.tensor(65.0)):.4f})")
Weight Tying
Weight tying is an elegant trick where the token embedding matrix (shape [vocab_size, d_model]) is reused as the output projection (shape [d_model, vocab_size]). The intuition: if two words have similar embeddings, they should also have similar output probabilities in similar contexts. This reduces parameters significantly (especially with large vocabularies) and acts as a form of regularization. The next-token probability becomes:

$$P(x_{t+1} \mid x_{\le t}) = \text{softmax}(W_e h_t)$$
where $h_t$ is the hidden state at position $t$ and $W_e$ is the shared embedding matrix. The output logit for token $i$ is simply the dot product between the hidden state and that token's embedding vector — tokens whose embeddings are "close" to the hidden state get high probability.
self.head.weight = self.token_emb.weight — they're literally the same tensor in memory. When backpropagation updates one, it updates both. This saves vocab_size × d_model parameters. For GPT-2 (vocab=50,257, d_model=768), that's ~38.6M parameters saved!
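A minimal sketch (toy sizes, independent of the model above) that verifies the tied layers really share one tensor:
import torch.nn as nn
vocab_size, d_model = 100, 32
token_emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = token_emb.weight  # tie: both modules now hold the same Parameter
print("Same tensor:", head.weight is token_emb.weight)
unique = {id(p) for p in list(token_emb.parameters()) + list(head.parameters())}
print("Unique parameter tensors:", len(unique))   # 1, not 2
print("Parameters saved:", vocab_size * d_model)  # vocab_size × d_model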
Training on Shakespeare
Now let's train our mini GPT-2 on Shakespeare's complete works. We'll create a character-level dataset, implement the training loop with cross-entropy loss, and watch the model progress from random gibberish to recognizable English. The loss we minimize is the average negative log-likelihood of the correct next token:
$$\mathcal{L} = -\frac{1}{T}\sum_{t=1}^{T} \log P(x_t | x_{<t})$$

First, let's set up the dataset. We'll download Shakespeare's works and create training sequences by sampling random windows of text:
import torch
from torch.utils.data import Dataset, DataLoader
class TextDataset(Dataset):
"""
Character-level text dataset for language modeling.
Each sample is a (input, target) pair where target is input shifted by 1.
Example: "Hello" → input="Hell", target="ello"
"""
def __init__(self, text, block_size, tokenizer_encode):
self.block_size = block_size
# Encode entire text to integers
self.data = torch.tensor(tokenizer_encode(text), dtype=torch.long)
print(f"Dataset: {len(self.data):,} tokens, "
f"{len(self.data) // block_size:,} possible sequences")
def __len__(self):
return len(self.data) - self.block_size
def __getitem__(self, idx):
# Get a window of block_size + 1 tokens
chunk = self.data[idx : idx + self.block_size + 1]
x = chunk[:-1] # Input: tokens 0..T-1
y = chunk[1:] # Target: tokens 1..T (shifted by 1)
return x, y
# Simulate Shakespeare-like text for demonstration
# (In practice: download from https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt)
shakespeare_sample = """ROMEO: O, she doth teach the torches to burn bright!
It seems she hangs upon the cheek of night
Like a rich jewel in an Ethiope's ear;
Beauty too rich for use, for earth too dear!
So shows a snowy dove trooping with crows,
As yonder lady o'er her fellows shows.
The measure done, I'll watch her place of stand,
And, touching hers, make blessed my rude hand.
Did my heart love till now? forswear it, sight!
For I ne'er saw true beauty till this night.
JULIET: O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
ROMEO: Shall I hear more, or shall I speak at this?
""" * 50 # Repeat to have enough data
# Build tokenizer
chars = sorted(list(set(shakespeare_sample)))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [char_to_idx[c] for c in s]
decode = lambda l: ''.join([idx_to_char[i] for i in l])
print(f"Vocabulary: {vocab_size} characters")
print(f"Text length: {len(shakespeare_sample):,} characters")
# Create dataset
block_size = 128
dataset = TextDataset(shakespeare_sample, block_size, encode)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
# Verify shapes
x_batch, y_batch = next(iter(dataloader))
print(f"\nBatch shapes:")
print(f" Input: {list(x_batch.shape)} (batch, block_size)")
print(f" Target: {list(y_batch.shape)} (batch, block_size)")
print(f"\nExample (first 40 chars):")
print(f" Input: '{decode(x_batch[0][:40].tolist())}'")
print(f" Target: '{decode(y_batch[0][:40].tolist())}'")
print(f" (Target is input shifted by 1 position)")
The target is simply the input shifted by one position. If the input is "ROMEO: O, she doth", the target is "OMEO: O, she doth " — at every position, the model predicts the character that comes next. This is the self-supervised objective that requires no human labeling — the text itself provides supervision.
Training Loop with Learning Rate Schedule
Modern LLM training uses learning rate warmup followed by cosine decay. During warmup, we linearly increase the LR from 0 to the peak value — this prevents early instabilities when gradients are noisy. Then cosine decay smoothly reduces the LR to a minimum value, allowing the model to converge to a sharper minimum:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# Training configuration
config = {
'vocab_size': 50, # Simplified for demo
'd_model': 128,
'n_heads': 4,
'n_layers': 4,
'block_size': 64,
'dropout': 0.1,
'learning_rate': 3e-4,
'warmup_steps': 100,
'max_steps': 1000,
'min_lr': 3e-5,
}
def get_lr(step, warmup_steps, max_steps, max_lr, min_lr):
"""Cosine learning rate schedule with linear warmup."""
if step < warmup_steps:
# Linear warmup
return max_lr * (step / warmup_steps)
elif step > max_steps:
return min_lr
else:
# Cosine decay
decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
return min_lr + coeff * (max_lr - min_lr)
# Compute the schedule over all steps (could be plotted); print a few samples below
steps = list(range(1200))
lrs = [get_lr(s, config['warmup_steps'], config['max_steps'],
config['learning_rate'], config['min_lr']) for s in steps]
print("=== Learning Rate Schedule ===")
print(f"Warmup: 0 → {config['learning_rate']} over {config['warmup_steps']} steps")
print(f"Decay: {config['learning_rate']} → {config['min_lr']} (cosine) over "
f"{config['max_steps'] - config['warmup_steps']} steps")
print(f"\nSample LR values:")
for s in [0, 50, 100, 250, 500, 750, 1000]:
lr = get_lr(s, config['warmup_steps'], config['max_steps'],
config['learning_rate'], config['min_lr'])
print(f" Step {s:4d}: LR = {lr:.6f}")
Now let's put together the complete training loop. This shows how to train our mini GPT-2 model and periodically generate text to monitor progress:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
# --- Minimal model definition (self-contained) ---
class MiniGPT(nn.Module):
"""Minimal GPT for training demonstration."""
def __init__(self, vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=64):
super().__init__()
self.block_size = block_size
self.token_emb = nn.Embedding(vocab_size, d_model)
self.pos_emb = nn.Embedding(block_size, d_model)
# Simplified transformer blocks
self.blocks = nn.ModuleList()
for _ in range(n_layers):
self.blocks.append(nn.ModuleDict({
'ln1': nn.LayerNorm(d_model),
'ln2': nn.LayerNorm(d_model),
'attn_proj': nn.Linear(d_model, 3 * d_model),
'attn_out': nn.Linear(d_model, d_model),
'ff1': nn.Linear(d_model, 4 * d_model),
'ff2': nn.Linear(4 * d_model, d_model),
}))
self.ln_f = nn.LayerNorm(d_model)
self.head = nn.Linear(d_model, vocab_size, bias=False)
self.head.weight = self.token_emb.weight # Weight tying
self.n_heads = n_heads
self.d_k = d_model // n_heads
self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))
def forward(self, idx, targets=None):
B, T = idx.shape
x = self.token_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
for block in self.blocks:
# Attention with causal mask
h = block['ln1'](x)
qkv = block['attn_proj'](h)
q, k, v = qkv.chunk(3, dim=-1)
q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
scores = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
scores = scores.masked_fill(self.mask[:T, :T] == 0, float('-inf'))
attn = F.softmax(scores, dim=-1)
out = (attn @ v).transpose(1, 2).contiguous().view(B, T, -1)
x = x + block['attn_out'](out)
# FFN
h = block['ln2'](x)
x = x + block['ff2'](F.gelu(block['ff1'](h)))
logits = self.head(self.ln_f(x))
loss = None
if targets is not None:
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
# --- Training setup ---
text = "To be or not to be that is the question " * 200
chars = sorted(set(text))
vocab_size = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
data = torch.tensor(encode(text), dtype=torch.long)
block_size = 64
model = MiniGPT(vocab_size, d_model=128, n_heads=4, n_layers=4, block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Training on {len(data):,} tokens\n")
# Training loop
model.train()
for step in range(301):
# Random batch of sequences
ix = torch.randint(0, len(data) - block_size - 1, (16,))
xb = torch.stack([data[i:i+block_size] for i in ix])
yb = torch.stack([data[i+1:i+block_size+1] for i in ix])
# Forward + backward
logits, loss = model(xb, yb)
optimizer.zero_grad()
loss.backward()
# Gradient clipping (prevents exploding gradients)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
if step % 100 == 0:
print(f"Step {step:4d} | Loss: {loss.item():.4f}")
# Generate sample text
model.eval()
context = torch.tensor([encode("To be")], dtype=torch.long)
with torch.no_grad():
for _ in range(40):
logits, _ = model(context[:, -block_size:])
probs = F.softmax(logits[:, -1, :], dim=-1)
next_token = torch.multinomial(probs, 1)
context = torch.cat([context, next_token], dim=1)
generated = decode(context[0].tolist())
print(f" Generated: '{generated[:60]}...'")
model.train()
print("\nTraining complete! Loss should decrease from ~ln(vocab) to < 1.0")
As training progresses, you'll see the loss decrease from approximately $\ln(\text{vocab\_size})$ (random guessing) toward values below 1.0. The generated text will evolve from random characters → recognizable words → grammatically coherent phrases. With enough data and training, character-level models can produce surprisingly fluent text.
Text Generation Strategies
Once trained, how we sample from the model dramatically affects output quality. The raw model outputs logits $z_i$ for each vocabulary token. We convert these to probabilities and then select a token. The simplest approach is greedy decoding (always pick the highest probability), but this produces repetitive, boring text. Better strategies introduce controlled randomness:
Temperature sampling scales the logits before softmax. Temperature $\tau$ controls randomness:
$$P_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}$$

- $\tau = 1.0$: Standard sampling (model's natural distribution)
- $\tau < 1.0$: Sharper distribution → more deterministic (confident picks)
- $\tau > 1.0$: Flatter distribution → more random (creative but risky)
- $\tau \to 0$: Equivalent to greedy decoding
import torch
import torch.nn.functional as F
def generate_with_temperature(logits, temperature=1.0):
"""
Apply temperature scaling to logits before sampling.
Args:
logits: Raw model output, shape (vocab_size,)
temperature: Controls randomness (0=greedy, 1=standard, >1=creative)
Returns:
Sampled token index
"""
if temperature == 0:
# Greedy: deterministic
return torch.argmax(logits)
# Scale logits by temperature
scaled_logits = logits / temperature
probs = F.softmax(scaled_logits, dim=-1)
# Sample from the distribution
return torch.multinomial(probs, num_samples=1).item()
# Demonstrate temperature effects
torch.manual_seed(42)
vocab = ["the", "a", "cat", "dog", "sat", "ran", "on", "mat"]
# Simulate model output: "the" has highest logit
logits = torch.tensor([3.0, 1.0, 2.0, 1.5, 0.5, 0.3, 0.8, 0.2])
print("=== Temperature Sampling ===")
print(f"Raw logits: {logits.tolist()}")
print(f"Vocab: {vocab}\n")
for temp in [0.0, 0.3, 0.7, 1.0, 1.5, 2.0]:
if temp == 0:
probs = torch.zeros_like(logits)
probs[torch.argmax(logits)] = 1.0
else:
probs = F.softmax(logits / temp, dim=-1)
prob_str = ' '.join([f'{p:.3f}' for p in probs.tolist()])
top_token = vocab[torch.argmax(probs)]
print(f"τ={temp:.1f}: [{prob_str}] → top='{top_token}' (p={probs.max():.3f})")
print(f"\nLow τ → peaked distribution (always 'the')")
print(f"High τ → flat distribution (random word)")
Temperature alone doesn't prevent the model from occasionally selecting very unlikely tokens. Top-k sampling restricts selection to only the k most probable tokens, while top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p:
Top-k & Nucleus (Top-p) Sampling
Let's implement both strategies. Top-k zeroes out all logits outside the top k candidates before sampling, while top-p sorts tokens by probability and includes only enough to exceed the cumulative threshold. In practice, top-p adapts automatically — it selects fewer tokens when the model is confident and more when uncertain:
import torch
import torch.nn.functional as F
def top_k_sampling(logits, k=10, temperature=1.0):
"""
Top-k sampling: only consider the k highest-probability tokens.
This prevents the model from ever selecting very unlikely tokens,
which can cause incoherent text.
"""
# Apply temperature
scaled = logits / temperature
# Find the k-th largest value
top_k_values, top_k_indices = torch.topk(scaled, k)
# Set everything below top-k to -infinity
filtered = torch.full_like(scaled, float('-inf'))
filtered.scatter_(0, top_k_indices, top_k_values)
# Sample from filtered distribution
probs = F.softmax(filtered, dim=-1)
return torch.multinomial(probs, 1).item()
def top_p_sampling(logits, p=0.9, temperature=1.0):
"""
Nucleus (top-p) sampling: keep smallest set of tokens with cumulative
probability >= p.
Advantage over top-k: adapts to the shape of the distribution.
- Sharp distribution (confident): fewer tokens selected
- Flat distribution (uncertain): more tokens selected
"""
# Apply temperature
scaled = logits / temperature
probs = F.softmax(scaled, dim=-1)
# Sort by probability (descending)
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
# Compute cumulative probability
cumulative_probs = torch.cumsum(sorted_probs, dim=-1)
# Remove tokens with cumulative probability above threshold
# Keep at least one token (shift right by 1 so first token is never removed)
sorted_mask = cumulative_probs - sorted_probs > p
sorted_probs[sorted_mask] = 0.0
# Normalize remaining probabilities
sorted_probs = sorted_probs / sorted_probs.sum()
# Sample from filtered distribution
sampled_idx = torch.multinomial(sorted_probs, 1).item()
return sorted_indices[sampled_idx].item()
# Compare all generation strategies
torch.manual_seed(42)
vocab = ["once", "upon", "a", "time", "there", "was", "king", "queen",
"dragon", "castle", "magic", "forest", "dark", "bright", "ancient"]
logits = torch.tensor([2.5, 1.8, 1.2, 2.0, 1.5, 1.0, 0.8, 0.6,
0.3, 0.4, 0.2, 0.5, 0.1, -0.2, -0.5])
print("=== Generation Strategy Comparison ===")
print(f"Vocabulary: {len(vocab)} tokens\n")
# Greedy
greedy_idx = torch.argmax(logits)
print(f"1. Greedy: '{vocab[greedy_idx]}' (always picks highest prob)")
# Temperature sampling (run multiple times)
torch.manual_seed(0)
temp_samples = [vocab[top_k_sampling(logits, k=15, temperature=0.7)] for _ in range(5)]
print(f"2. Temperature: {temp_samples} (τ=0.7)")
# Top-k
torch.manual_seed(0)
topk_samples = [vocab[top_k_sampling(logits, k=5, temperature=1.0)] for _ in range(5)]
print(f"3. Top-k (k=5): {topk_samples}")
# Top-p (nucleus)
torch.manual_seed(0)
topp_samples = [vocab[top_p_sampling(logits, p=0.85, temperature=1.0)] for _ in range(5)]
print(f"4. Top-p (p=.85): {topp_samples}")
print(f"\nKey differences:")
print(f" Greedy: deterministic, repetitive")
print(f" Top-k: fixed candidate pool (may be too many or few)")
print(f" Top-p: adaptive pool (responds to model confidence)")
In practice, modern language models use a combination: temperature scaling plus top-p filtering works well for most applications. Chat assistants are commonly run with temperature around 0.7 and top-p around 0.95 for balanced creative output, while lower temperatures (0.1-0.3) suit factual or code tasks where you want near-deterministic answers.
Let's implement a complete generation function that supports all strategies in a single interface:
import torch
import torch.nn.functional as F
@torch.no_grad()
def generate(model, context, max_new_tokens, temperature=1.0, top_k=None, top_p=None):
"""
Complete text generation function with all strategies.
Args:
model: Trained GPT model
context: Starting token indices, shape (1, seq_len)
max_new_tokens: Number of tokens to generate
temperature: Sampling temperature (0=greedy)
top_k: If set, only sample from top-k tokens
top_p: If set, use nucleus sampling with this threshold
Returns:
Generated token indices including the original context
"""
model.eval()
for _ in range(max_new_tokens):
# Crop context to block_size if needed
ctx = context[:, -model.block_size:]
# Forward pass
logits, _ = model(ctx)
logits = logits[:, -1, :] # Only the last position matters
if temperature == 0:
# Greedy decoding
next_token = torch.argmax(logits, dim=-1, keepdim=True)
else:
# Apply temperature
logits = logits / temperature
# Apply top-k filtering
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float('-inf')
# Apply top-p (nucleus) filtering
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens above threshold (keep at least 1)
sorted_mask = cumulative_probs - F.softmax(sorted_logits, dim=-1) > top_p
sorted_logits[sorted_mask] = float('-inf')
                # Restore the filtered logits to the original vocabulary order
                logits = sorted_logits.gather(1, sorted_indices.argsort(dim=1))
# Sample from distribution
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
# Append to context
context = torch.cat([context, next_token], dim=1)
return context
# Example usage (with a dummy model for illustration)
print("=== Complete Generation Function ===")
print("Usage examples:")
print(" generate(model, ctx, 100, temperature=0) # Greedy")
print(" generate(model, ctx, 100, temperature=0.8) # Temperature")
print(" generate(model, ctx, 100, temperature=0.8, top_k=40) # Top-k + temp")
print(" generate(model, ctx, 100, temperature=0.9, top_p=0.95) # Nucleus")
print("")
print("Recommended settings:")
print(" Creative writing: temperature=0.9, top_p=0.95")
print(" Code generation: temperature=0.2, top_p=0.95")
print(" Factual Q&A: temperature=0.0 (greedy)")
print(" Brainstorming: temperature=1.2, top_k=50")
The generate function processes one token at a time in a loop. At each step, it runs the full model forward pass on the current context, extracts logits for only the last position, applies the sampling strategy, and appends the result. The context grows by one token each iteration until we've generated the desired number of tokens.
Scaling Laws & GPT-2 Sizes
GPT-2 was released in four sizes, each demonstrating that bigger models learn better representations. Each step up roughly doubles or triples the parameter count, and quality improves steadily across language benchmarks from Small to XL:
| Variant | Parameters | Layers | d_model | Heads |
|---|---|---|---|---|
| GPT-2 Small | 124M | 12 | 768 | 12 |
| GPT-2 Medium | 355M | 24 | 1024 | 16 |
| GPT-2 Large | 774M | 36 | 1280 | 20 |
| GPT-2 XL | 1.5B | 48 | 1600 | 25 |
Chinchilla Scaling Laws
DeepMind's Chinchilla paper (2022) revealed that most LLMs were undertrained. The compute-optimal strategy is to scale both model size and training data equally. The key finding: for a given compute budget $C$, the optimal model size $N$ and dataset size $D$ satisfy $N \propto C^{0.5}$ and $D \propto C^{0.5}$. This means a 70B parameter model should be trained on ~1.4 trillion tokens (not 300B tokens as was common practice).
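A rough worked example of that relationship (rule-of-thumb arithmetic using $C \approx 6ND$ and $D \approx 20N$, not the paper's exact fitted constants):
import math
# From C = 6*N*D and D = 20*N, the compute-optimal size is N = sqrt(C / 120)
for compute in [1e21, 1e23, 1e25]:  # illustrative FLOP budgets
    n_opt = math.sqrt(compute / 120)
    d_opt = 20 * n_opt
    print(f"C={compute:.0e} FLOPs -> N ≈ {n_opt/1e9:.1f}B params, D ≈ {d_opt/1e9:.0f}B tokens")
Now let's compute exact parameter counts for each GPT-2 variant and check them against the 20-tokens-per-parameter heuristic: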
import torch
import math
def compute_gpt2_params(n_layers, d_model, vocab_size=50257, block_size=1024):
"""
Calculate exact parameter count for a GPT-2 variant.
Parameters come from:
- Token embeddings: vocab_size × d_model
- Position embeddings: block_size × d_model
- Per block:
- LayerNorm (×2): 2 × 2 × d_model
- QKV projection: d_model × 3 × d_model + 3 × d_model (bias)
- Output projection: d_model × d_model + d_model
- FFN up: d_model × 4 × d_model + 4 × d_model
- FFN down: 4 × d_model × d_model + d_model
- Final LayerNorm: 2 × d_model
- Output head: weight-tied (0 extra params)
"""
# Embeddings
token_emb = vocab_size * d_model
pos_emb = block_size * d_model
# Per block
ln_params = 2 * (2 * d_model) # 2 LayerNorms × (weight + bias)
attn_qkv = d_model * 3 * d_model + 3 * d_model # QKV + bias
attn_out = d_model * d_model + d_model # Output proj + bias
ffn_up = d_model * 4 * d_model + 4 * d_model
ffn_down = 4 * d_model * d_model + d_model
block_params = ln_params + attn_qkv + attn_out + ffn_up + ffn_down
# Final LN
final_ln = 2 * d_model
# Total (head is weight-tied)
total = token_emb + pos_emb + n_layers * block_params + final_ln
return total
# Compute for all GPT-2 variants
variants = [
("GPT-2 Small", 12, 768),
("GPT-2 Medium", 24, 1024),
("GPT-2 Large", 36, 1280),
("GPT-2 XL", 48, 1600),
]
print("=== GPT-2 Parameter Counts (Exact) ===\n")
for name, layers, d_model in variants:
params = compute_gpt2_params(layers, d_model)
print(f"{name:15s}: {params:>12,} params ({params/1e6:.1f}M)")
# Chinchilla optimal training tokens
print(f"\n=== Chinchilla Scaling Law ===")
print(f"Rule: Optimal tokens ≈ 20 × parameter count")
print()
for name, layers, d_model in variants:
params = compute_gpt2_params(layers, d_model)
optimal_tokens = 20 * params
actual_tokens = 10_000_000_000 # GPT-2 was trained on ~10B tokens
print(f"{name:15s}: Optimal={optimal_tokens/1e9:.1f}B tokens, "
f"Actual=10B tokens, "
f"{'Under-trained!' if actual_tokens < optimal_tokens else 'OK'}")
This calculation shows that, by Chinchilla's rule of thumb, the larger GPT-2 variants (Large and XL) were undertrained — XL should have seen roughly 3× more tokens than our assumed ~10B — while the smaller variants already exceed the 20-tokens-per-parameter threshold. This insight drove the development of models like LLaMA (Meta), which are smaller but trained on much more data, achieving similar performance to larger undertrained models.
Loading Pretrained GPT-2
While building from scratch teaches us everything about the architecture, in practice we use pretrained models via Hugging Face's transformers library. Let's load GPT-2 and generate text:
import torch
# Using Hugging Face transformers to load pretrained GPT-2
# pip install transformers
# --- OPTION 1: Using the pipeline (simplest) ---
print("=== Loading Pretrained GPT-2 ===")
print("(Requires: pip install transformers)\n")
# Demonstrate the API (conceptual - requires transformers installed)
code_example = """
from transformers import GPT2LMHeadModel, GPT2Tokenizer
# Load pretrained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
# Encode input text
prompt = "The future of artificial intelligence is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
# Generate text
with torch.no_grad():
output = model.generate(
input_ids,
max_new_tokens=50,
temperature=0.8,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
# Decode and print
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
"""
print(code_example)
# Show model architecture details
print("GPT-2 Small Architecture:")
print(f" Parameters: 124,439,808")
print(f" Layers: 12")
print(f" Hidden size: 768")
print(f" Attention heads: 12")
print(f" Context window: 1024 tokens")
print(f" Vocabulary: 50,257 (BPE)")
print(f" Training data: WebText (~40GB of internet text)")
The Hugging Face model is architecturally identical to what we built from scratch — the same pre-norm blocks, causal attention, GELU FFN, and weight tying. The difference is that it has been trained on roughly 40GB of internet text (WebText), which gives it strong general language capabilities.
Fine-Tuning on Custom Text
Fine-tuning adapts a pretrained model to a specific domain or style. We take the pretrained GPT-2 weights and continue training on our custom dataset with a small learning rate. This is much faster than training from scratch because the model already understands language — it just needs to adapt its style:
import torch
import torch.nn.functional as F
# Conceptual fine-tuning code for GPT-2
# (Requires: pip install transformers datasets)
finetune_code = """
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments
from transformers import Trainer, TextDataset, DataCollatorForLanguageModeling
# 1. Load pretrained model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
# Add padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
# 2. Prepare your custom dataset
def tokenize_function(examples):
return tokenizer(examples['text'], truncation=True, max_length=512)
# 3. Fine-tuning configuration
training_args = TrainingArguments(
output_dir='./gpt2-finetuned',
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch = 16
learning_rate=5e-5, # Small LR for fine-tuning!
warmup_steps=100,
weight_decay=0.01,
logging_steps=50,
save_steps=500,
fp16=True, # Mixed precision for speed
)
# 4. Train
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
"""
print("=== Fine-Tuning Pretrained GPT-2 ===\n")
print(finetune_code)
# Key fine-tuning tips
print("\n=== Fine-Tuning Best Practices ===")
tips = [
("Learning rate", "5e-5 to 2e-5 (10-100x smaller than pretraining)"),
("Epochs", "2-5 (avoid overfitting to small datasets)"),
("Batch size", "16-64 (use gradient accumulation if GPU-limited)"),
("Warmup", "5-10% of total steps"),
("Early stopping", "Monitor validation loss, stop when it increases"),
("Data size", "Minimum ~1000 examples for noticeable style shift"),
("Freezing", "Optionally freeze early layers, only train last N"),
]
for key, value in tips:
print(f" {key:15s}: {value}")
Fine-tuning is remarkably sample-efficient — even a few thousand examples of your target domain can shift GPT-2's writing style noticeably. Common use cases include: generating code documentation, writing in a specific author's style, creating domain-specific content (medical, legal, technical), or adapting to a particular format (emails, tweets, poetry).
Complete Training Example
Let's close by consolidating the pieces into a reusable, self-contained reference: a clean configuration dataclass with a rough compute estimate, followed by one more step-by-step walkthrough of the generation pipeline:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass
@dataclass
class GPTConfig:
"""GPT-2 model configuration."""
vocab_size: int = 256 # Byte-level for simplicity
block_size: int = 128 # Maximum context length
n_layers: int = 6 # Number of transformer blocks
n_heads: int = 4 # Number of attention heads
d_model: int = 256 # Embedding dimension
dropout: float = 0.1 # Dropout rate
bias: bool = True # Use bias in Linear layers
# Print configuration
config = GPTConfig()
print("=== GPT-2 Mini Configuration ===")
for field in config.__dataclass_fields__:
print(f" {field:12s}: {getattr(config, field)}")
# Estimate training FLOPs with the ~6·N-per-token rule (forward + backward)
# N ≈ 12 · n_layers · d_model² + vocab_size · d_model
params_est = (12 * config.n_layers * config.d_model**2
              + config.vocab_size * config.d_model)
flops_per_token = 6 * params_est
print(f"\n ~Parameters: {params_est:,}")
print(f" ~FLOPS/token: {flops_per_token:,}")
print(f" ~FLOPS/sequence: {flops_per_token * config.block_size:,}")
The dataclass configuration pattern makes it easy to experiment with different model sizes. You can scale from our mini model (256 hidden, 6 layers) to GPT-2 Small (768 hidden, 12 layers) by simply changing the config. The FLOPS estimation follows the rule of thumb: approximately $6N$ FLOPS per token for a model with $N$ parameters (accounting for both forward and backward passes).
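Applying the same rule of thumb to GPT-2 Small (assuming roughly 10B training tokens, as estimated earlier) gives a ballpark training budget:
# Rough arithmetic only: ~6 FLOPs per parameter per token (forward + backward)
n_params = 124_000_000        # GPT-2 Small
tokens_seen = 10_000_000_000  # assumed ~10B training tokens
train_flops = 6 * n_params * tokens_seen
print(f"~{train_flops:.2e} FLOPs (~{train_flops / 1e18:.1f} exaFLOPs) of training compute")
Finally, let's trace the autoregressive generation pipeline end to end: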
import torch
import torch.nn as nn
import torch.nn.functional as F
# Complete generation pipeline: encode → model → decode → sample → repeat
def demonstrate_generation_pipeline():
"""Show the full autoregressive generation loop step by step."""
# Mini vocabulary for clarity
vocab = list("abcdefghijklmnopqrstuvwxyz .,!?\n")
vocab_size = len(vocab)
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for i, ch in enumerate(vocab)}
print("=== Autoregressive Generation Pipeline ===\n")
# Simulate a trained model's behavior
torch.manual_seed(42)
prompt = "the cat "
context = [stoi[c] for c in prompt if c in stoi]
print(f"Prompt: '{prompt}'")
print(f"Encoded: {context}")
print(f"\nGeneration steps:")
generated = list(prompt)
for step in range(10):
# In reality: run model forward pass here
# model_output = model(torch.tensor([context]))
# logits = model_output[:, -1, :]
# Simulate model output (biased toward common characters)
logits = torch.randn(vocab_size) * 0.5
# Bias toward spaces and common letters
for ch in "satone ":
if ch in stoi:
logits[stoi[ch]] += 1.5
# Apply temperature + top-k
temperature = 0.8
top_k = 10
scaled = logits / temperature
topk_vals, topk_idx = torch.topk(scaled, top_k)
filtered = torch.full_like(scaled, float('-inf'))
filtered.scatter_(0, topk_idx, topk_vals)
probs = F.softmax(filtered, dim=-1)
# Sample
next_idx = torch.multinomial(probs, 1).item()
next_char = itos[next_idx]
# Append and continue
context.append(next_idx)
generated.append(next_char)
print(f" Step {step+1}: context_len={len(context)}, "
f"sampled='{next_char}' (idx={next_idx}, p={probs[next_idx]:.3f})")
print(f"\nFull generated text: '{''.join(generated)}'")
print(f"\nKey: Each step uses ALL previous tokens as context!")
demonstrate_generation_pipeline()
This demonstrates the fundamental loop of autoregressive generation: at each step, the entire context (including all previously generated tokens) feeds into the model, and only the logits at the final position determine the next token. This is why generation is inherently sequential — you can't parallelize it because each token depends on all previous ones.