AI in the Wild, Part 8 of 24: Large Language Models

Series outline:
1. AI & ML Landscape Overview: Paradigms, ecosystem map, real-world applications at a glance
2. ML Foundations for Practitioners: Supervised learning, bias-variance, model evaluation
3. Natural Language Processing: Tokenization, embeddings, transformers, semantic search
4. Computer Vision in the Real World: CNNs, ViTs, detection, segmentation, deployment patterns
5. Recommender Systems: Collaborative filtering, content-based, two-tower models
6. Reinforcement Learning Applications: Q-learning, policy gradients, RLHF, real-world deployments
7. Conversational AI & Chatbots: Dialogue systems, intent detection, RAG, production bots
8. Large Language Models: Architecture, scaling laws, capabilities, limitations (this article)
9. Prompt Engineering & In-Context Learning: Chain-of-thought, few-shot, structured outputs, prompt patterns
10. Fine-tuning, RLHF & Model Alignment: LoRA, instruction tuning, DPO, alignment techniques
11. Generative AI Applications: Diffusion models, GANs, image/audio/video generation
12. Multimodal AI: Vision-language models, audio-text, cross-modal retrieval
13. AI Agents & Agentic Workflows: Tool use, planning, memory, multi-agent orchestration
14. AI in Healthcare & Life Sciences: Diagnostics, drug discovery, clinical NLP, regulatory landscape
15. AI in Finance & Fraud Detection: Credit scoring, anomaly detection, algorithmic trading
16. AI in Autonomous Systems & Robotics: Perception, planning, control, sim-to-real transfer
17. AI Security & Adversarial Robustness: Adversarial attacks, poisoning, model extraction, defences
18. Explainable AI & Interpretability: SHAP, LIME, attention, mechanistic interpretability
19. AI Ethics & Bias Mitigation: Fairness metrics, dataset auditing, debiasing techniques
20. MLOps & Model Deployment: CI/CD for ML, feature stores, monitoring, drift detection
21. Edge AI & On-Device Intelligence: Quantization, pruning, TFLite, CoreML, embedded inference
22. AI Infrastructure, Hardware & Scaling: GPUs, TPUs, distributed training, memory hierarchy
23. Responsible AI Governance: Risk frameworks, model cards, auditing, organisational practice
24. AI Policy, Regulation & Future Directions: EU AI Act, global frameworks, emerging risks, what's next
About This Article
This article builds a complete understanding of large language models from the ground up — covering the transformer architecture that powers them, the scaling laws that govern their training, the emergent capabilities that surprise their creators, the competitive landscape of open and closed models, and the engineering required to deploy them reliably in production. Code examples, model comparison tables, and benchmark references are included throughout.
Level: Advanced. Topics: LLMs, Architecture, Deployment Engineering
Transformer Architecture
The decoder-only transformer is the dominant architecture for modern large language models. Its core components are: a tokeniser (converting raw text into integer token IDs), an embedding layer (mapping token IDs to dense vectors), a stack of transformer blocks (each containing multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation around each sub-layer), and a language modelling head (a linear projection from the final hidden state to vocabulary logits). The training objective is causal language modelling: at each position in the sequence, predict the next token given all preceding tokens, maximising the log-probability of the target token. This objective is simple enough to scale to arbitrary data and compute — no labels, no task-specific objectives, no reward signal — yet produces systems capable of in-context learning, instruction following, and coherent multi-thousand-word generation.
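To make the stack concrete, the sketch below implements one pre-norm decoder block in NumPy: causal self-attention followed by a feed-forward network, each wrapped in a residual connection. It is a deliberately minimal illustration, not a training-ready implementation: a single attention head instead of many, a ReLU feed-forward in place of SwiGLU, standard LayerNorm in place of RMSNorm, and arbitrary small weight shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv, Wo):
    # single-head causal self-attention (multi-head splits d into H slices)
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), 1), -1e9, scores)  # mask future
    return (softmax(scores) @ v) @ Wo

def decoder_block(x, p):
    # pre-norm residual sub-layers, as in modern decoder-only stacks
    x = x + causal_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]  # ReLU FFN (SwiGLU in practice)

rng = np.random.default_rng(0)
d, d_ff, T = 16, 64, 5
p = {name: rng.normal(0, 0.05, (d, d)) for name in ("Wq", "Wk", "Wv", "Wo")}
p["W1"], p["W2"] = rng.normal(0, 0.05, (d, d_ff)), rng.normal(0, 0.05, (d_ff, d))
x = rng.normal(size=(T, d))
out = decoder_block(x, p)
```

Stacking N such blocks over an embedding layer and topping them with a linear vocabulary projection yields the full decoder-only architecture described above.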
The lineage traces from GPT-2 (1.5B parameters, trained on 40GB of web text, 2019) through GPT-3 (175B parameters, trained on 300B tokens, 2020) to the current generation of frontier models operating at hundreds of billions to over a trillion effective parameters. Each generation retained the same fundamental architecture while incorporating engineering improvements: GPT-3 established that scale alone dramatically expanded capability without architectural changes; the subsequent generation introduced Grouped-Query Attention (GQA), which reduces KV cache memory by sharing key and value heads across multiple query heads; SwiGLU activation functions in feed-forward layers, which empirically improve training efficiency; RMS layer normalisation in place of the original layer norm; and Rotary Positional Embeddings (RoPE), which encode position more efficiently and generalise better to long contexts. These innovations are now standard across all leading open-weight architectures including LLaMA 3, Mistral, Gemma, and Qwen.
Key Insight: Parameter count alone is a poor proxy for model capability. A well-trained 7B model on high-quality, carefully curated data — like Mistral 7B or LLaMA 3 8B — routinely outperforms poorly trained models several times its size. Data quality and the training recipe matter as much as raw scale, which is why Chinchilla-optimal training has displaced the race to maximise parameter counts.
Attention Mechanisms & Multi-Head Self-Attention
Self-attention is the mechanism that allows each token in the sequence to aggregate information from all previous tokens. For each token, three vectors are computed: a query (what information am I looking for?), a key (what information do I contain?), and a value (what information should I pass forward?). The attention weights between token pairs are computed as the scaled dot product of queries and keys, normalised by a softmax, and then used to compute a weighted sum of value vectors. Multi-head attention runs this operation in parallel across H independent heads with different learned projections, allowing the model to simultaneously attend to multiple aspects of the context — syntactic structure, semantic relationships, coreference — before concatenating and projecting the results.
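The per-head computation can be written out directly in NumPy. The reshape/transpose steps below are the whole of the "multi-head" mechanism: H independent query/key/value projections attending in parallel over dh = d/H dimensions each, then concatenated and projected. Shapes and weights are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    T, d = x.shape
    dh = d // n_heads
    def split(W):
        # project, then carve the model dimension into H independent heads
        return (x @ W).reshape(T, n_heads, dh).transpose(1, 0, 2)   # (H, T, dh)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)                 # scaled dot product
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores = np.where(mask, -1e9, scores)                           # causal mask
    out = softmax(scores) @ v                                       # weighted sums of values
    return out.transpose(1, 0, 2).reshape(T, d) @ Wo                # concat heads, project

rng = np.random.default_rng(1)
T, d, H = 6, 32, 4
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
y = multi_head_attention(rng.normal(size=(T, d)), Wq, Wk, Wv, Wo, n_heads=H)
```

Production implementations fuse these steps into batched tensor operations, but the algebra is exactly this.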
The key architectural innovations for efficient scaling are Flash Attention (Dao et al., 2022), which computes attention in fused GPU kernels using tiling to keep intermediate values in fast SRAM rather than slower HBM, reducing memory bandwidth by 5–20x for long sequences; Grouped-Query Attention (GQA), which groups multiple query heads to share a single key-value head, reducing the KV cache size (and thus inference memory) proportionally to the grouping factor; and Multi-Query Attention (MQA), an extreme version of GQA where all query heads share one key-value head. These optimisations are specifically targeted at inference efficiency: the KV cache — which stores key and value vectors for all previous tokens to avoid recomputation during autoregressive decoding — is the primary memory bottleneck for serving long-context requests at scale.
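A back-of-the-envelope calculator shows why GQA matters for serving. The sketch below assumes LLaMA-3-8B-like shapes (32 layers, head dimension 128, 32 query heads) purely for illustration; the KV cache stores one key and one value vector per layer, per KV head, per token, at 2 bytes each in fp16/bf16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # 2x: one cache for keys, one for values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Full multi-head attention: every query head keeps its own KV head
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# GQA with 8 KV heads shared across the 32 query heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {full_mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB per 8K-token request")
# → MHA: 4.0 GiB, GQA: 1.0 GiB
```

The 4x reduction is exactly the 32-to-8 KV-head grouping factor, and it multiplies across every concurrent request a server holds in memory.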
Positional Encoding & Context Windows
Self-attention is permutation-invariant by construction — the attention computation treats tokens identically regardless of their position in the sequence. Positional encodings inject order information by modifying the token representations before or during attention. Early models used learned absolute positional embeddings (a lookup table of position-specific vectors), which worked well within training-length sequences but generalised poorly to longer inputs. Rotary Positional Embeddings (RoPE) encode position by rotating query and key vectors in the attention computation, with the rotation angle proportional to the relative distance between token pairs. This relative encoding scheme generalises better to longer sequences and has become the standard across all modern open-weight architectures.
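A minimal NumPy sketch of the rotary idea, using the half-split pairing found in LLaMA-style implementations: each pair of dimensions is rotated by an angle proportional to the token's position, and the payoff is that dot products between rotated queries and keys depend only on the relative offset between positions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each dimension pair of row t by angle t * freq (rotary embedding)."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)    # one rotation frequency per dim pair
    angles = np.outer(np.arange(T), freqs)       # (T, half): position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=(1, 8)), (10, 1))    # same query content at every position
k = np.tile(rng.normal(size=(1, 8)), (10, 1))    # same key content at every position
rq, rk = rope(q), rope(k)
# positions (2, 5) and (4, 7) share offset 3, so their attention logits match
```

This relative-offset property is what the test of position pairs demonstrates, and it is the reason RoPE extrapolates more gracefully than absolute position tables.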
Context window length is governed by the O(n²) attention complexity: doubling the context length quadruples attention memory and compute. Flash Attention reduces the constant factor substantially but does not change the asymptotic complexity. Techniques that break the O(n²) barrier include sliding window attention (Mistral 7B), where each token attends only to a local window of preceding tokens rather than the full history, reducing complexity to O(n·w) where w is the window size; and sparse attention patterns. Context extension beyond training length uses techniques like YaRN (position interpolation with adjusted RoPE scaling), which adjusts the effective frequency basis of RoPE to allow extrapolation to longer sequences at inference time without retraining. The practical ceiling for reliable long-context retrieval is lower than the nominal context window suggests, due to the "lost in the middle" phenomenon: retrieval accuracy for information positioned in the middle of a long context degrades significantly relative to information at the start or end of the context.
Scaling Laws & Training
Scaling laws are empirical relationships that govern how language model performance improves as model size, dataset size, and compute budget increase. The foundational work by Kaplan et al. (2020) from OpenAI demonstrated that test loss decreases as a smooth power law with each of these three factors, enabling practitioners to predict model performance at untested scales from smaller training runs. This predictability is what justifies the enormous compute investments required for frontier model training: the capability gains are not speculative but empirically forecasted. The Kaplan guidance suggested scaling model size faster than data size for a given compute budget, which led to the trend of very large models trained on relatively modest datasets — most famously GPT-3 at 175B parameters trained on 300B tokens.
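The power-law form is simple enough to state in a few lines. The sketch below uses the parameter-scaling fit reported by Kaplan et al. (2020), L(N) = (N_c / N)^alpha with alpha ≈ 0.076 and N_c ≈ 8.8e13, purely as an illustration of how smooth and extrapolatable the predicted curve is; real forecasting fits these constants to an organisation's own training runs.

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted test loss (nats/token) as a pure function of parameter count."""
    return (n_c / n_params) ** alpha

for n in (1.5e9, 13e9, 175e9, 1e12):
    print(f"{n:9.1e} params -> predicted loss {kaplan_loss(n):.3f}")
```

The monotone, smooth decrease is the whole point: a fitted curve from small runs predicts the loss of a run two orders of magnitude larger.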
Chinchilla & Compute-Optimal Training
Kaplan's scaling laws made the case for compute investment, but they left deployers with an immediate question: which axis of scale delivers the most return for a given budget? A frontier model trained with ten times the compute may score meaningfully better on benchmarks while costing ten times as much per token to serve. For most production applications, the relevant axis is not maximum capability but capability-per-dollar at the query volume and latency requirements of the specific deployment. This reframes model selection from "which model scores highest on MMLU?" to "which model achieves acceptable quality on my task at acceptable cost and latency?", a fundamentally different question that requires task-specific benchmarking rather than consulting public leaderboards.
The Chinchilla paper (Hoffmann et al., 2022) revised Kaplan's guidance substantially. By training over 400 models at diverse size-data combinations while holding total compute fixed, DeepMind found that Kaplan had underestimated the returns to data relative to parameters. The Chinchilla-optimal recipe allocates approximately 20 training tokens per parameter: a 7B model should train on ~140B tokens, a 70B model on ~1.4T tokens. By this analysis, GPT-3's 175B parameters trained on 300B tokens was significantly undertrained — the same compute could have produced a smaller but substantially better-performing model trained on more data.
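The 20-tokens-per-parameter rule combines with the standard C ≈ 6·N·D estimate of training FLOPs into a two-line allocator. This is a rule-of-thumb sketch of the Chinchilla recipe, not the paper's full parametric fit.

```python
def chinchilla_optimal(compute_flops):
    # C ≈ 6·N·D and D ≈ 20·N  =>  C ≈ 120·N², so solve for N
    n_params = (compute_flops / 120) ** 0.5
    return n_params, 20 * n_params

n, d = chinchilla_optimal(5.9e21)   # roughly the budget of a 7B / 140B-token run
print(f"optimal model: {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Running the same function at GPT-3's training budget shows the paper's point directly: the compute-optimal model is far smaller than 175B parameters and trained on far more than 300B tokens.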
The post-Chinchilla generation of models — LLaMA 1 and 2, Mistral, Gemma, Gemma 2, Qwen — all reflect this shift, training smaller models on dramatically larger datasets. But practitioners quickly discovered the distinction between training-optimal and inference-optimal compute allocation: a Chinchilla-optimal 65B model costs far more per inference call than an "overtrained" 7B model with equivalent performance on many tasks. LLaMA 2's 7B model was deliberately trained on 2T tokens — far beyond the Chinchilla optimum — because the inference cost savings at scale dwarf the training compute overhead. This trade-off is now a central design variable in any serious model training decision.
Training Data & Curation
Modern LLM training corpora are assembled from multiple source types: web crawl data (Common Crawl being the dominant source, often 50–70% of total tokens after filtering), digitised books and academic papers (providing higher-quality, long-form language and domain-specific knowledge), code repositories (GitHub being the primary source, producing code capabilities and structured reasoning), curated datasets (Wikipedia, Stack Exchange, specialised corpora for science, law, and medicine), and increasingly synthetic data generated by stronger models. The mixture ratios — how much of each source to include — have outsized effects on downstream capability: higher proportions of code data improve mathematical reasoning even on non-coding benchmarks; academic paper inclusion improves scientific knowledge and citation accuracy.
Data quality processing is as important as source selection. Near-deduplication (removing near-identical documents using MinHash locality-sensitive hashing) is essential — training on duplicated data produces models that recite memorised text rather than generalising, and wastes compute on redundant updates. Quality filtering removes low-quality web pages using heuristics (length, punctuation density, repetitive n-gram ratios) or classifier-based scoring trained on high-quality seed data. Toxicity filtering removes harmful content to reduce safety risks in the base model. Benchmark contamination — the inadvertent inclusion of evaluation benchmark questions in training data — is a significant concern that inflates benchmark scores and makes model comparisons unreliable; reputable labs maintain contamination detection pipelines but contamination cannot be entirely eliminated from web-crawl data.
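A toy version of MinHash near-deduplication fits in a few lines. Real pipelines use banded locality-sensitive hashing over millions of documents; this sketch only shows the core estimator, with seeded md5 standing in for a family of independent hash functions, and invented example strings.

```python
import hashlib

def shingles(text, n=5):
    """Set of word n-grams; near-duplicates share most of these."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    # each seeded hash's minimum over the shingle set is one sample of the Jaccard estimator
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # fraction of matching signature slots approximates shingle-set Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc = "the quick brown fox jumps over the lazy dog and then runs far away into the quiet hills"
near_dup = doc.replace("quiet", "silent")
unrelated = "scaling laws describe how test loss falls as model size data and compute grow together"

sim_dup = estimated_jaccard(minhash_signature(doc), minhash_signature(near_dup))
sim_diff = estimated_jaccard(minhash_signature(doc), minhash_signature(unrelated))
```

Documents whose estimated Jaccard exceeds a threshold (commonly around 0.8) are clustered and all but one representative dropped before training.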
Case Study
Grounding GPT-4 in Legal Documents: LexisNexis's RAG Pipeline for Case Research
LexisNexis's archive — over 140 years of case law, statutory text, and legal commentary spanning hundreds of jurisdictions — represents one of the most demanding knowledge-grounding challenges for LLMs. The development of Lexis+ AI, a RAG-based legal research assistant, required solving retrieval problems that off-the-shelf vector search could not handle. Legal queries involve precise statutory citations, Latin legal maxims, and domain-specific terminology that general embedding models encode poorly. The team built a hybrid retrieval pipeline: dense bi-encoder retrieval for semantic similarity combined with BM25 for exact keyword and citation matching, re-ranked by a cross-encoder fine-tuned on legal document pairs. This reduced the proportion of answers lacking supporting citations from approximately 18% with pure dense retrieval to under 3%.
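One standard way to fuse a dense ranking with a BM25 ranking before re-ranking — shown here as a generic illustration, not LexisNexis's actual implementation — is reciprocal rank fusion: each list contributes a score that decays with rank, so documents ranked highly by either retriever surface early. The case_* document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids; k damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["case_42", "case_7", "case_99"]   # semantic-similarity order
bm25 = ["case_7", "case_13", "case_42"]    # exact keyword/citation order
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # case_7 first: it ranks high on both lists
```

The fused list then goes to the cross-encoder, which only has to re-score a short candidate set rather than the whole corpus.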
The generation prompt required extensive iteration. Early prompts that asked the model to "answer based on the following cases" produced responses mixing retrieved holdings with GPT-4's parametric legal knowledge — often subtly, in ways attorneys only caught on careful review. The team adopted a strict grounding regime: the generation prompt instructs the model to cite a specific retrieved document for every factual claim, to flag any claim it cannot support from retrieved context, and to acknowledge jurisdictional scope limits. Automated faithfulness scoring — comparing each sentence of the generated answer against retrieved documents using a fine-tuned NLI classifier — was integrated into the serving pipeline. The case demonstrates that for high-stakes professional domains, RAG quality is determined by evaluation and guardrail infrastructure as much as by retrieval or generation quality.
Emergent Capabilities
Emergence in LLMs refers to qualitatively new capabilities appearing at certain scale thresholds, essentially absent in smaller models. The Wei et al. (2022) survey documented emergent abilities across 137 BIG-Bench tasks: capabilities including multi-digit arithmetic, word unscrambling, and chain-of-thought reasoning appeared sharply above approximately 50–100B parameters. This non-linearity has significant practical implications: if you evaluate a model class at 7B parameters and observe near-zero performance on a task, you cannot conclude the task is impossible at larger scales. However, Schaeffer et al. (2023) challenged this framing, showing that many apparent emergent phenomena vanish when tasks are scored with smoother, more granular metrics rather than binary exact-match accuracy — suggesting some "emergence" is an artefact of evaluation methodology rather than a genuine phase transition in model capability. The practical takeaway is balanced: empirical evaluation at multiple scales remains the most reliable planning tool, and practitioners should be cautious about both assuming smooth scaling and assuming hard capability thresholds.
Several emergent capabilities have direct practical implications that practitioners should understand. Tool use and function calling — the ability to decide which external tool to invoke, generate correctly formatted tool calls, and incorporate tool results into subsequent reasoning — is an emergent capability that first appears reliably in models at approximately 8B+ parameters with strong instruction tuning. Below this threshold, models frequently generate malformed function calls, invoke tools for the wrong reason, or fail to incorporate tool results correctly. Calibration — the alignment between a model's confidence and its actual accuracy — also improves with scale: larger models are better calibrated in the sense that their token probabilities more accurately reflect their actual uncertainty, making confidence-based filtering (routing low-confidence outputs for human review) more reliable. Multilingual transfer — the ability to reason in a language under-represented in the training data by leveraging reasoning chains in higher-resource languages — is another emergent phenomenon that makes frontier models practically usable in low-resource language markets even without dedicated language-specific training data.
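Calibration is what makes a simple confidence gate workable. The sketch below assumes you can obtain per-token log-probabilities from your serving stack (most APIs and inference servers expose them); the 0.8 threshold and the routing labels are illustrative placeholders to be tuned on held-out data.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalised sequence probability: exp(mean log-prob) of emitted tokens."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route_for_review(token_logprobs, threshold=0.8):
    # send low-confidence generations to a human rather than auto-accepting them
    return "auto_accept" if sequence_confidence(token_logprobs) >= threshold else "human_review"

confident = [-0.05] * 12                      # model strongly prefers each emitted token
uncertain = [-0.9, -1.2, -0.4, -2.0, -0.7]    # several near-coin-flip choices
```

With a well-calibrated model, the threshold maps directly onto an expected error rate for the auto-accepted bucket; with a poorly calibrated one it does not, which is why this pattern only became practical at larger scales.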
In-Context Learning & Few-Shot
In-context learning (ICL) is the emergent ability to learn a new task from examples provided in the prompt at inference time, without any weight updates. Zero-shot ICL provides only a task description; few-shot ICL includes k labelled input-output examples that demonstrate the desired behaviour. The dominant hypothesis for why ICL works is that autoregressive pre-training on diverse data implicitly encodes a meta-learning algorithm — the model has seen so many task patterns that it has learned to identify and execute the pattern implied by prompt examples. ICL performance is sensitive to example selection (diverse, representative, correctly formatted examples outperform arbitrary ones), example ordering (later examples receive more attention weight), and label format (the model is surprisingly robust to label noise but sensitive to format consistency). Dynamic few-shot selection — choosing the most semantically similar examples from a curated pool at inference time using a retriever — consistently outperforms static selection and is the recommended approach for production pipelines.
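Dynamic few-shot selection needs only an embedder and a similarity ranking. The sketch below substitutes a toy bag-of-words embedding for a real sentence-embedding model so it stays self-contained; the example pool and query strings are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # toy bag-of-words "embedding"; production systems use a sentence encoder
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_few_shot(query, pool, k=2):
    """Pick the k pool examples most similar to the incoming query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return ranked[:k]

pool = [
    {"input": "customer asks for refund of broken blender", "label": "refund"},
    {"input": "customer asks where the shipped package is", "label": "shipping"},
    {"input": "customer asks to reset a forgotten password", "label": "account"},
    {"input": "customer requests refund for duplicate charge", "label": "refund"},
]
shots = select_few_shot("customer wants refund for damaged order", pool)
```

The selected examples are then formatted into the prompt ahead of the query, so each request gets demonstrations tailored to its own content.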
Chain-of-Thought & Reasoning
Chain-of-thought (CoT) prompting is the discovery that instructing a model to generate intermediate reasoning steps before its final answer substantially improves performance on multi-step reasoning tasks. Manual CoT (Wei et al., 2022) provides few-shot examples with worked-out step-by-step reasoning; zero-shot CoT (Kojima et al., 2022) simply appends "Let's think step by step." to the prompt with equivalent effect on many benchmarks. The mechanistic rationale is that intermediate tokens provide additional context that the attention mechanism can use when computing the final answer, reducing the probability of shortcut reasoning. CoT is most effective for arithmetic, logical deduction, and multi-hop reasoning; it provides minimal benefit for simple factual retrieval or classification. Its key limitation is that models can produce confident-looking reasoning chains that contain embedded errors, particularly for precise arithmetic. Programme-of-thought prompting — instructing the model to write executable Python code that solves the problem, then running the code externally — is a more reliable alternative for quantitative tasks, delegating computation to an interpreter rather than requiring the LLM to perform it in prose.
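A programme-of-thought loop is short: ask the model for executable code, run it, and read off the result. In this sketch the "generated" code is a hard-coded string standing in for a real model response, and exec is used bare for brevity; production systems run model-written code in a sandboxed subprocess, never in-process.

```python
# Stand-in for a model response to: "Write Python that computes the order total."
generated_code = """
prices = [19.99, 5.50, 3.25]
quantities = [3, 10, 4]
total = sum(p * q for p, q in zip(prices, quantities))
"""

def run_program_of_thought(code):
    namespace = {}
    exec(code, namespace)       # delegate the arithmetic to the interpreter
    return namespace["total"]

answer = run_program_of_thought(generated_code)
```

The interpreter cannot produce a plausible-but-wrong sum the way prose reasoning can, which is the entire appeal for quantitative tasks.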
The LLM Landscape
The LLM landscape has stratified into two broad categories: closed frontier models available only via API, and open-weight models whose parameters are publicly downloadable. The choice between them is a genuine engineering decision with significant implications for cost, latency, data privacy, and customisability — not simply a choice between "better" and "worse". The capability gap between the two categories has narrowed dramatically since 2023: leading open-weight models now match or exceed GPT-3.5-class performance on most benchmarks, and for specific well-defined tasks, fine-tuned open-weight models can outperform general-purpose frontier APIs.
Open vs. Closed Models
Closed frontier models — GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google) — offer the highest general-purpose capability, strong safety alignment from extensive RLHF, and the convenience of fully managed infrastructure. The tradeoffs are real: API pricing compounds at scale; data passes through third-party infrastructure, which may be prohibited under GDPR, HIPAA, or enterprise data residency requirements; base weights are inaccessible, precluding fine-tuning; and the provider controls the model version schedule, meaning silent capability changes can cause production regressions. Open-weight models — Meta's LLaMA 3 (8B and 70B), Mistral 7B, Mixtral 8x7B (a sparse mixture-of-experts model that activates roughly 13B of its 47B total parameters per token), Google's Gemma and Gemma 2, Alibaba's Qwen 2.5, and DeepSeek — offer downloadable weights, self-hostable inference, and full fine-tuning capability. The cost advantages at scale are substantial: self-hosted inference on 100M tokens per day at the 7B scale costs a fraction of equivalent API usage, at the cost of infrastructure engineering and reliability management.
Many production systems use a hybrid architecture: closed frontier APIs for low-volume, high-complexity queries (complex multi-step reasoning, ambiguous edge cases, high-stakes decisions) and self-hosted open-weight models for high-volume, well-defined tasks (classification, extraction, summarisation at scale). This approach optimises both cost and capability. Key model selection criteria beyond cost are: first-token latency requirements (sub-100ms is achievable with small self-hosted models but rarely with external APIs); data privacy constraints (regulated industries typically require self-hosted or region-specific deployment); fine-tuning needs (domain adaptation or style alignment requiring persistent weight modification); and compliance requirements (some jurisdictions require auditable model versioning, which API providers cannot always guarantee).
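A hybrid deployment usually starts with a routing function. The heuristic below — keyword and length based, with invented model names and thresholds — is only a sketch of the pattern; real routers are often small trained classifiers, and the routing signal should be validated against labelled traffic.

```python
def route_query(query, history_turns=0):
    """Send well-defined, high-volume work to a cheap local model and
    open-ended reasoning to a frontier API. All thresholds are placeholders."""
    hard_markers = ("why", "explain", "compare", "trade-off", "design")
    well_defined = ("classify", "extract", "summarize", "translate")
    q = query.lower()
    if any(m in q for m in well_defined) and len(q.split()) < 200:
        return "local-llama-3-8b"       # hypothetical self-hosted deployment name
    if any(m in q for m in hard_markers) or history_turns > 5:
        return "frontier-api"           # hypothetical managed-API alias
    return "local-llama-3-8b"
```

Even a crude router like this captures the economics: if 90% of traffic is classification and extraction, 90% of tokens never touch frontier pricing.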
Benchmarks & Evaluation
The canonical LLM benchmark suite covers distinct capability dimensions: MMLU (Massive Multitask Language Understanding) tests factual knowledge across 57 academic subjects; HumanEval tests code generation via functional correctness on Python programming problems; GSM8K evaluates grade-school mathematics reasoning; HellaSwag tests commonsense NLI via sentence completion; TruthfulQA probes factual accuracy by testing whether models propagate popular misconceptions; MATH covers competition-level mathematics; and BIG-Bench Hard focuses on tasks where even frontier models remain below human performance. No single benchmark captures all relevant capability dimensions, and evaluating across the full suite is the minimum standard for a meaningful model comparison.
The Chatbot Arena (LMSYS) approach produces the most trustworthy general-purpose capability rankings: real users submit any query to two anonymised models, evaluate which response they prefer, and the results aggregate into Elo ratings over millions of comparisons. Because queries come from real users with genuine tasks, contamination is not a concern and the evaluation distribution reflects actual use. The canonical limitation of all public benchmarks is the Goodhart's Law problem: once a benchmark becomes a widely cited capability signal, model developers optimise for it — through benchmark-specific data inclusion or fine-tuning — inflating scores without improving general capability. Practitioners should always supplement public benchmarks with task-specific evaluation on datasets reflecting their actual production query distribution.
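The Elo aggregation behind Arena-style rankings reduces to a one-line update per pairwise vote: the winner takes rating points in proportion to how unexpected the win was. K = 32 is a conventional choice; tie handling is omitted for brevity.

```python
def elo_update(r_a, r_b, winner, k=32):
    """Apply one pairwise comparison; winner is 'a' or 'b'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))   # win probability implied by ratings
    score_a = 1.0 if winner == "a" else 0.0
    delta = k * (score_a - expected_a)                      # big upset -> big transfer
    return r_a + delta, r_b - delta

ra, rb = elo_update(1000.0, 1000.0, "a")   # evenly matched: winner gains k/2
```

Aggregated over millions of votes, these updates converge to a stable ranking in which rating gaps translate to predicted win rates.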
The economics of the LLM market are shifting rapidly in practitioners' favour. Competition between providers has driven frontier model API prices down by approximately 80% over two years. Open-weight models that matched GPT-3.5-class performance required 70B parameters in 2023; by 2025, 7B and 8B models achieve the same benchmark results due to improved training data, better instruction tuning, and advances in the scaling efficiency of smaller, longer-trained models. Mixture-of-experts (MoE) architectures — Mixtral 8x7B, DeepSeek-V2, and Qwen2.5-MoE — provide the parameter capacity of large dense models at the inference cost of smaller ones by activating only a subset of experts per token. For practitioners, this landscape means: the right model for most production tasks is almost always smaller and cheaper than intuition suggests; benchmark the actual deployment task, not general capability; and factor in total cost of ownership (API cost + infra cost + engineering cost) when making model selection decisions, not capability scores alone.
Production Warning: Deploying an LLM without red-teaming it for your specific use case is the most common source of costly post-launch incidents. A model that tops MMLU may have systematic failure modes on your domain-specific query types — generating legally problematic content, leaking system prompt information, or producing confident errors on edge cases your eval set did not cover. Red-team before launch, not after.
Code: Local LLM Inference with HuggingFace
The following example demonstrates production-ready local inference with Llama 3.1 8B Instruct using HuggingFace Transformers, including the correct instruction template format, bfloat16 precision for efficiency, and automatic device placement across available GPUs. Understanding the generation hyperparameters — temperature, top_p, max_new_tokens — is essential for controlling output quality and cost.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Local LLM inference — Llama 3.1 8B Instruct
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # auto-distributes across available GPUs
)

# Instruction-formatted prompt
messages = [
    {"role": "system", "content": "You are a data analysis expert."},
    {"role": "user", "content": "Explain the difference between RMSE and MAE in plain English."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
Memory Estimate: LLaMA 3.1 8B in bfloat16 requires ~16GB of GPU VRAM. For 4-bit quantised inference (GPTQ or AWQ), the same model fits in ~5GB, enabling deployment on consumer GPUs. Pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained to enable 4-bit loading with minimal quality loss (<2% degradation on most benchmarks).
Code: High-Throughput Serving with vLLM
For production serving at scale, vLLM's PagedAttention algorithm provides 2–4x higher throughput than naive HuggingFace inference by managing KV cache memory in discrete pages — similar to virtual memory paging in an OS — rather than pre-allocating the full maximum context length per request. The following example demonstrates batched inference with tensor parallelism across multiple GPUs and the key configuration parameters that govern throughput.
from vllm import LLM, SamplingParams

# vLLM: PagedAttention for efficient KV cache management
# 2-4x higher throughput than naive HuggingFace inference
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,       # distribute across 2 GPUs
    gpu_memory_utilization=0.85,
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# Batch inference — PagedAttention handles variable-length sequences efficiently
prompts = [
    "Summarize this earnings report in 3 bullet points: [REPORT TEXT]",
    "Extract all dates from: 'The contract runs from Jan 2024 to Dec 2025'",
    "Classify sentiment: 'The product quality is good but delivery was terrible'",
]
outputs = llm.generate(prompts, sampling_params)  # batched, continuous batching

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}\n")

# Throughput: ~2,000 tokens/sec on 2x A100 80GB vs ~500 tokens/sec naive
Throughput Benchmarking: To measure your actual serving throughput, run vLLM's built-in benchmark: python -m vllm.entrypoints.openai.api_server --model [model_id] to start the server, then use benchmark_serving.py to fire concurrent requests at different batch sizes. Throughput typically scales linearly with batch size up to ~16 concurrent requests, then plateaus as GPU memory becomes the bottleneck.
Code: Token Counting & Context Window Management
Token budgeting is a critical production concern. Exceeding a model's context window raises an error; approaching it degrades generation quality. The following code demonstrates reliable token counting with tiktoken and a pragmatic strategy for gracefully truncating long documents while preserving both the system instructions (at the start) and the most recent context (at the end) — the two regions that receive the most model attention.
import tiktoken  # OpenAI's tokenizer

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def truncate_to_context_window(text: str, max_tokens: int = 100_000,
                               model: str = "gpt-4o") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Truncate from the middle to preserve the start (system instructions)
    # and the end (recent context)
    keep_start = max_tokens * 2 // 3
    keep_end = max_tokens // 3
    preserved = tokens[:keep_start] + tokens[-keep_end:]
    return enc.decode(preserved)

# Real-world example: summarizing a 500-page document
# (long_document is assumed to be loaded elsewhere)
doc_tokens = count_tokens_openai(long_document)
print(f"Document: {doc_tokens:,} tokens")  # e.g., 150,000 tokens
print(f"Fits in GPT-4o context: {doc_tokens <= 128_000}")      # False at 150K (128K window)
print(f"Fits in Claude 3.5 context: {doc_tokens <= 200_000}")  # True (200K context)
# For documents > 200K tokens: chunking + summarize-then-synthesize strategy
Code: Context Budget Management
The following implementation provides a reusable context budget allocator that dynamically partitions a model's context window across system prompt, conversation history, retrieved context, and user input. It prevents hard context limit errors, enables predictable cost control, and applies principled truncation strategies to each component when limits are approached.
```python
import tiktoken
from dataclasses import dataclass
from typing import Optional

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def count_messages_tokens(messages: list[dict]) -> int:
    """Count tokens in OpenAI chat messages format."""
    total = 0
    for msg in messages:
        total += 4  # overhead per message
        total += count_tokens(msg.get("content", ""))
        total += count_tokens(msg.get("role", ""))
    return total + 3  # reply priming overhead

@dataclass
class ContextBudget:
    """Allocate token budget across prompt components."""
    model_context_limit: int = 128_000
    safety_margin: float = 0.05    # 5% reserve
    response_buffer: int = 2_000   # reserved for model output
    # Proportional budget allocations
    system_frac: float = 0.10      # 10% for system prompt
    history_frac: float = 0.25     # 25% for conversation history
    retrieval_frac: float = 0.50   # 50% for retrieved context
    # Remainder goes to user input

    @property
    def usable_tokens(self) -> int:
        return int(self.model_context_limit * (1 - self.safety_margin)) - self.response_buffer

    def budgets(self) -> dict[str, int]:
        u = self.usable_tokens
        return {
            'system': int(u * self.system_frac),
            'history': int(u * self.history_frac),
            'retrieval': int(u * self.retrieval_frac),
            'input': u - int(u * (self.system_frac + self.history_frac + self.retrieval_frac)),
        }

def truncate_text(text: str, budget: int) -> str:
    """Truncate text to a token budget (token-level, not character-level)."""
    tokens = enc.encode(text)
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])

def truncate_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep most recent messages within budget (preserve recency)."""
    result, used = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg.get("content", "")) + 4
        if used + tokens > budget:
            break
        result.insert(0, msg)
        used += tokens
    return result

def truncate_retrieval(chunks: list[str], budget: int) -> list[str]:
    """Trim retrieved chunks from the end (lowest relevance last)."""
    result, used = [], 0
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if used + tokens > budget:
            break
        result.append(chunk)
        used += tokens
    return result

def build_context(system_prompt: str, history: list[dict],
                  retrieved_chunks: list[str], user_input: str,
                  budget: Optional[ContextBudget] = None) -> list[dict]:
    """Build a context-safe message list for the OpenAI Chat API."""
    budget = budget or ContextBudget()
    b = budget.budgets()
    # Truncate each component to its budget
    sys = truncate_text(system_prompt, b['system'])
    hist = truncate_history(history, b['history'])
    ctx = truncate_retrieval(retrieved_chunks, b['retrieval'])
    ctx_text = "\n\n".join(ctx) if ctx else ""
    sys_with_context = f"{sys}\n\nContext:\n{ctx_text}" if ctx_text else sys
    messages = [{"role": "system", "content": sys_with_context}]
    messages.extend(hist)
    messages.append({"role": "user", "content": user_input})
    total = count_messages_tokens(messages)
    print(f"[Context] total={total:,} | limit={budget.usable_tokens:,} | "
          f"history={len(hist)} turns | chunks={len(ctx)}")
    return messages

# Usage
history = [{"role": "user", "content": "What's your return policy?"},
           {"role": "assistant", "content": "30 days for all items."}]
chunks = ["Our return policy covers 30 days for all items with receipt.",
          "Electronics returns require original packaging.",
          "Shipping costs for returns are the customer's responsibility."]
msgs = build_context(
    system_prompt="You are a helpful customer support assistant.",
    history=history,
    retrieved_chunks=chunks,
    user_input="Can I return a laptop without the original box?",
)
# prints e.g.: [Context] total=287 | limit=119,600 | history=2 turns | chunks=3
```
Production Pattern: Log the token counts from every production API call — system, history, retrieval, input, and output tokens separately. Tracking these over time reveals whether context usage is growing (e.g., accumulating history in long sessions), which component is driving cost, and when you're approaching limits before hard errors occur. Set up an alert when any call exceeds 80% of the model's context limit.
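The alerting rule described above can be sketched in a few lines. This is a minimal illustration using Python's standard logging module; the `TokenUsage` record and its field names are hypothetical, standing in for whatever per-call accounting your serving layer produces.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm.tokens")

# Hypothetical per-call usage record; the fields mirror the components
# discussed above, not any specific provider's response schema.
@dataclass
class TokenUsage:
    system: int
    history: int
    retrieval: int
    input: int
    output: int

    @property
    def total(self) -> int:
        return self.system + self.history + self.retrieval + self.input + self.output

def log_usage(usage: TokenUsage, context_limit: int = 128_000,
              alert_frac: float = 0.80) -> bool:
    """Log per-component token counts; return True if the alert threshold fired."""
    logger.info("tokens system=%d history=%d retrieval=%d input=%d output=%d total=%d",
                usage.system, usage.history, usage.retrieval,
                usage.input, usage.output, usage.total)
    if usage.total > alert_frac * context_limit:
        logger.warning("call used %.0f%% of the %d-token context limit",
                       100 * usage.total / context_limit, context_limit)
        return True
    return False

# A call well under the limit does not fire the alert
assert log_usage(TokenUsage(500, 3_000, 8_000, 200, 900)) is False
# A call above 80% of a 128K limit does
assert log_usage(TokenUsage(2_000, 30_000, 70_000, 1_000, 2_000)) is True
```

In production these records would feed a metrics backend rather than a logger, but the per-component breakdown and the 80% threshold carry over unchanged.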
LLM Model Comparison (Mid-2025)
The following comparison covers the major LLMs available in mid-2025. Model capabilities evolve rapidly — always verify against current benchmarks and provider documentation before making architectural decisions. The "API Cost" column reflects approximate pricing at time of writing; costs typically decrease over time as competition intensifies.
| Model | Provider | Context Window | Open-Weight | Strengths | API Cost / 1M tokens (approx.) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K tokens | No | Best general reasoning, coding, multimodal (vision + audio) | $5 input / $15 output |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | No | Long-context, instruction following, coding, reduced hallucination | $3 input / $15 output |
| Gemini 1.5 Pro | Google | 1M tokens | No | Massive context window, multimodal, audio/video understanding | $3.50 input / $10.50 output |
| LLaMA 3.1 70B | Meta | 128K tokens | Yes | Near GPT-4 class open-weight; fine-tunable; data privacy | Self-hosted ~$0.50–$1.00 (infra cost) |
| Mistral Large | Mistral AI | 32K tokens | Partial | Strong reasoning, European data residency, efficient inference | $2 input / $6 output |
Model Selection Heuristic: For most enterprise applications, start with GPT-4o-mini or Claude 3 Haiku (the smaller, cheaper variants) for high-volume tasks and GPT-4o or Claude 3.5 Sonnet for complex reasoning tasks. Only migrate to self-hosted open-weight models when your monthly API bill exceeds ~$5K/month or when data residency requirements are non-negotiable.
LLM Evaluation Benchmarks
Understanding what each benchmark actually measures — and what it does not — is essential for interpreting model comparison charts. Every benchmark has a specific failure mode or limitation; no single benchmark should be treated as a comprehensive capability signal.
| Benchmark | What It Tests | Weakness / Limitation | Typical Frontier Score |
| --- | --- | --- | --- |
| MMLU | Factual knowledge across 57 academic subjects (science, law, medicine, etc.) | Multiple choice; susceptible to contamination; doesn't test reasoning depth | 85–92% (GPT-4o class) |
| HumanEval | Python code generation — functional correctness on 164 programming problems | Only Python; simple algorithmic tasks; easier than real codebases | 85–90% (GPT-4o, Claude 3.5) |
| GSM8K | Grade-school math word problems — multi-step arithmetic reasoning | Simple enough that near-saturation achieved; doesn't test hard math | 95%+ (frontier models) |
| MATH | Competition-level mathematics (AMC/AIME difficulty) | Requires symbolic manipulation; still hard for all models | 60–78% (best models) |
| HellaSwag | Commonsense reasoning — completing physical situation descriptions plausibly | Near-saturated by frontier models; marginal differentiation | 95%+ (frontier models) |
| TruthfulQA | Factual accuracy — whether models avoid propagating popular misconceptions | Fixed question set; models can be fine-tuned to score well without being truthful | 60–80% (varies widely) |
| Chatbot Arena | Head-to-head human preference ratings across real user queries (Elo ranking) | Biased toward verbosity and formatting; doesn't isolate specific capability dimensions | Elo ~1250–1320 (top models, Jan 2025) |
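Chatbot Arena's ratings come from pairwise human votes. As a rough illustration of how such ratings move, here is the classic sequential Elo update rule; the live leaderboard fits a statistical model over all votes at once rather than updating sequentially, but the intuition is the same.

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update: score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Equal ratings, A wins: A gains k/2 = 16 points, B loses 16
a, b = elo_update(1000.0, 1000.0, 1.0)
assert (a, b) == (1016.0, 984.0)

# An upset (much lower-rated model wins) moves ratings far more
a, b = elo_update(1000.0, 1300.0, 1.0)
assert a - 1000.0 > 16.0
```

The 400-point scale means a 400-Elo gap corresponds to roughly 10:1 expected win odds, which is why the ~70-point spread among top models reflects fairly close head-to-head preference rates.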
Deployment Engineering
The gap between a capable model and a reliable production system is primarily an engineering problem. A model that scores well on benchmarks can still be unusable in production if it has unacceptable latency, unpredictable throughput under load, no observability into its failures, or no guardrails against the adversarial and out-of-distribution inputs that production traffic inevitably contains. Deployment engineering encompasses the serving infrastructure, optimisation techniques, monitoring systems, and safety layers that transform a model checkpoint into a production service.
The key deployment concerns are: time-to-first-token (TTFT), which governs user-perceived responsiveness and must typically be below 1 second for interactive applications; throughput (tokens generated per second per GPU), which determines cost at scale; reliability (uptime and graceful degradation under load); and observability (logging of inputs, outputs, latency, token counts, and error rates for debugging and improvement). For self-hosted deployments, the serving stack choices are: vLLM (optimised for high-throughput continuous batching with PagedAttention for efficient KV cache management), Hugging Face Text Generation Inference (TGI, production-grade with built-in quantisation support), and Ollama (lightweight local development and testing). For cloud-hosted deployments, managed endpoints from OpenAI, Anthropic, Google, Together AI, and Fireworks AI provide various throughput and SLA guarantees.
Inference Optimization & Serving
Several key techniques reduce inference cost and latency for self-hosted LLMs. Continuous batching (PagedAttention in vLLM) processes multiple requests simultaneously, dynamically allocating GPU memory in pages rather than reserving the full maximum context length per request — improving GPU utilisation by 2–4x over naive batching. Speculative decoding uses a smaller draft model to generate candidate token sequences that a larger target model then verifies in parallel, reducing end-to-end latency for the target model by 2–3x on typical text generation tasks. Quantisation reduces model precision from 32-bit or 16-bit floating point to 4-bit or 8-bit integers (GPTQ, AWQ, and GGUF formats), reducing memory requirements by 2–4x with typically less than 2% quality degradation on most tasks, enabling much larger models to run on a given GPU configuration.
The memory hierarchy is the primary physical constraint in LLM serving. GPU High Bandwidth Memory (HBM) — typically 40–80GB on data centre GPUs — must hold the model weights, KV cache for active requests, and activations for in-flight generation. A 70B parameter model in 16-bit precision requires ~140GB of GPU memory, requiring multi-GPU tensor parallelism to serve. Quantised to 4-bit (AWQ), the same model requires ~35GB, fitting on a single high-end GPU. KV cache size scales linearly with sequence length and batch size: at 128K context length with a large batch, KV cache can exceed model weight memory. Tokens per second per dollar (or tokens per second per GPU-hour) is the key cost metric for self-hosted deployments and should be the primary optimisation target after quality requirements are met.
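The memory arithmetic above is easy to reproduce. A sketch, assuming a LLaMA-3-70B-like shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); exact figures vary by architecture.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory to hold model weights at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

def kv_cache_gb(seq_len: int, batch: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# 70B weights: ~140GB at fp16, ~35GB at 4-bit (the figures quoted above)
assert round(weight_memory_gb(70e9, 16)) == 140
assert round(weight_memory_gb(70e9, 4)) == 35

# KV cache for this shape at 128K context, batch 8, fp16:
cache = kv_cache_gb(128_000, 8, 80, 8, 128)
# At long context and large batch, the KV cache alone exceeds the 140GB of weights
assert cache > weight_memory_gb(70e9, 16)
```

This is why long-context serving is often KV-cache-bound rather than weight-bound, and why paged KV cache management (as in vLLM) matters so much for throughput.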
Hallucination & Safety Guardrails
Hallucination is the most commercially consequential LLM failure mode: the generation of plausible-sounding but factually incorrect content. Open-domain hallucination (fabricating facts from parametric knowledge) and closed-domain hallucination (contradicting provided context) require different mitigations. For open-domain hallucination, RAG is the primary mitigation: grounding responses in retrieved, verified content and instructing the model to express uncertainty when evidence is insufficient. For closed-domain hallucination, automated faithfulness verification — using an NLI classifier or LLM judge to check whether each claim in the response is supported by the provided context — is essential for any application where factual accuracy is consequential. Temperature reduction and self-consistency (majority voting over multiple generations) also reduce hallucination rates at the cost of increased inference time.
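Self-consistency can be implemented with a few lines around any sampling loop. A minimal sketch, assuming the sampled answers have already been collected and that exact-match normalisation is sufficient (real systems often need semantic matching rather than string equality).

```python
from collections import Counter

def self_consistent_answer(generations: list[str]) -> tuple[str, float]:
    """Majority vote over multiple sampled generations.

    Returns (answer, agreement). Low agreement is itself a useful
    hallucination-risk signal worth logging or thresholding on.
    """
    normalized = [g.strip().lower() for g in generations]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

# e.g. five sampled answers to the same factual question
samples = ["1969", "1969", "1969 ", "1968", "1969"]
answer, agreement = self_consistent_answer(samples)
assert answer == "1969" and agreement == 0.8
```

An application can then abstain or escalate when agreement falls below a threshold, trading the cost of N generations for a calibrated confidence signal.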
Safety guardrails for production LLM systems typically operate at three layers. Input guardrails: classifiers or rule-based filters that screen incoming requests for harmful content, prompt injection attempts, or policy-violating queries before they reach the model. Output guardrails: classifiers applied to the model's response before delivery, checking for harmful content, policy violations, or anomalous patterns. Behavioural monitoring: tracking output distributions over time to detect model drift, unusual response patterns, or systematic failures that individual request classifiers miss. All three layers are necessary in production; relying on any single layer as the sole safety control is insufficient. Monitoring dashboards that surface hallucination rates, safety filter trigger rates, and topic distribution of production traffic are essential operational tools for any team running LLMs in a regulated or high-stakes environment.
Multi-region and multi-provider deployment strategies address the availability and compliance requirements of global enterprise applications. Data residency regulations — GDPR, the EU AI Act, and equivalents in India, China, Brazil, and Saudi Arabia — may require that certain data never leave a specific geographic region. Multi-region deployment uses provider-specific regional API endpoints (AWS Bedrock regions, Azure OpenAI regional deployments, Google Vertex AI zones) to ensure data locality. For high-availability architectures, a primary provider in the target region with a geographically co-located secondary provider and an on-premise fallback for critical workloads provides three tiers of resilience. Latency routing — directing each user's request to the lowest-latency endpoint at the time of the request using DNS-level geolocation — improves response times by 50–200ms for globally distributed user bases. Implementing multi-region LLM deployment correctly requires provider-specific evaluation of data processing agreements, subprocessor lists, and audit controls — not just infrastructure configuration — to satisfy regulatory compliance obligations.
Fine-Tuning & Model Adaptation
Fine-tuning adapts a pre-trained model's weights using a smaller, domain-specific dataset, updating the model to specialise in a target task, domain, or response style beyond what prompting alone can achieve. The decision between prompt engineering and fine-tuning is one of the most consequential choices in any LLM application. Fine-tuning incurs significant upfront cost — data curation, training compute, evaluation infrastructure, model versioning, and ongoing retraining cadence — but can produce consistent performance improvements, smaller effective model sizes (a fine-tuned 7B can match a general-purpose 70B on a well-defined task), and fundamentally lower inference cost per task. Prompt engineering is faster to iterate, requires no compute investment, and is appropriate for a wide range of tasks — but has ceiling effects, is vulnerable to prompt sensitivity, and cannot encode knowledge that the base model genuinely lacks.
Fine-Tuning Approach Quick Reference
| Approach | GPU Memory (7B model) | Trainable Params | Quality vs. Full FT | Best For |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | ~140GB (fp16) | 100% | Baseline | Maximum quality when compute budget allows; large datasets (>100K examples) |
| LoRA (rank 16) | ~16GB (fp16) | <0.3% | 97–99% of full FT | Most production fine-tuning scenarios; good balance of quality and efficiency |
| QLoRA (4-bit + LoRA) | ~5GB | <0.3% (fp16 adapters) | 96–98% of full FT | Consumer GPU fine-tuning; teams without data centre access; rapid experimentation |
| IA3 / Prompt Tuning | ~14GB (fp16) | <0.01% | 90–95% of full FT | Extreme parameter efficiency; soft prompt learning; few-shot style adaptation |
| Continued Pre-Training | ~140GB (full) or ~5GB (QLoRA) | All (or LoRA adapters) | Varies by domain gap | Domain injection (medical, legal, scientific); significant vocabulary gap from base model |
When to Fine-Tune vs. Prompt-Engineer
The practical decision tree for fine-tuning starts with two questions. First: does the task require knowledge or capabilities absent from the base model, or does it merely require directing existing capabilities? If a task requires the model to answer questions about your company's internal processes, product specifications, or domain-specific terminology that post-dates the training cutoff, RAG or fine-tuning on that content is necessary; prompting alone cannot inject information the model does not have. If the task requires a specific output format, response style, or persona that the model consistently deviates from despite careful prompting, fine-tuning the format into the weights is more reliable than repeating format instructions in every request. Second: is the performance gain worth the operational overhead? Fine-tuned models must be versioned, served as distinct endpoints, re-evaluated after base model updates, and retrained when task requirements change — all costs with no equivalent in prompt-based solutions that use managed API endpoints.
The most productive fine-tuning scenarios are: (a) consistent output format — teaching the model to always produce responses in a specific JSON schema, XML format, or structured report template that few-shot prompting fails to enforce reliably; (b) domain-specific vocabulary — adapting to dense technical terminology (medical, legal, financial) that the base model encodes imprecisely; (c) response style alignment — calibrating the model to a brand voice, level of formality, or response length that differs substantially from its default; (d) task-specific capability — teaching a new task type for which there is substantial labelled data and where task performance is the primary value driver. Scenarios where fine-tuning is typically not worth the investment include: tasks where careful prompting already achieves acceptable performance; tasks with rapidly changing requirements (fine-tuning lags prompting for fast iteration); and tasks requiring general world knowledge, where the base model's breadth is an asset.
LoRA & Parameter-Efficient Fine-Tuning
Parameter-efficient fine-tuning (PEFT) methods update only a small fraction of model parameters rather than the full weight matrix, dramatically reducing training memory requirements and enabling fine-tuning on consumer or single-GPU hardware. Low-Rank Adaptation (LoRA; Hu et al., 2022) is the dominant PEFT approach: it freezes the original weight matrices and adds small trainable rank-decomposition matrices to the attention layers. A rank-16 LoRA adapter for a 7B model adds approximately 17M trainable parameters — less than 0.3% of the 7B base weights — but achieves fine-tuning quality within 1–3% of full fine-tuning on most tasks. The adapter matrices can be saved separately (typically a few hundred MB) and merged with the base weights at inference time, adding no serving latency.
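The parameter arithmetic behind that ~17M figure is straightforward. A sketch, assuming a LLaMA-2-7B-like shape (32 layers, hidden size 4096) with rank-16 adapters on all four attention projections (query, key, value, output), each a 4096×4096 matrix.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter for a d_out x d_in weight adds B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

# LLaMA-2-7B-like attention: q/k/v/o projections are all 4096x4096, 32 layers
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # four projections per layer
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable parameters")   # ~16.8M, under 0.3% of 7B
assert total == 16_777_216
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the base model, which is why rank is tuned for task fit rather than memory.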
Quantised LoRA (QLoRA; Dettmers et al., 2023) takes PEFT further: it quantises the frozen base model to 4-bit precision during training, then fine-tunes the LoRA adapters in 16-bit. This reduces the GPU memory required to fine-tune a 7B model from approximately 16GB (16-bit LoRA) to approximately 5GB (QLoRA), enabling fine-tuning on a single RTX 3090 or 4090. The quality trade-off versus full fine-tuning is minimal (typically <1%) for most tasks. The Hugging Face PEFT and trl (Transformer Reinforcement Learning) libraries make QLoRA fine-tuning accessible with under 30 lines of configuration code. For teams without dedicated ML infrastructure, QLoRA on a single GPU is the practical entry point for domain adaptation of open-weight models.
Choosing LoRA hyperparameters is more art than science but follows empirically established heuristics. Rank (r): higher rank captures more task-specific information but risks overfitting on small datasets; 8–32 covers most production scenarios, with 64–128 reserved for large datasets or tasks requiring substantial style shift. Alpha (the LoRA scaling factor): typically set to 2x the rank. Target modules: applying LoRA to both query and value attention projections is standard; applying to all linear layers (including MLP) provides marginal improvement at 2x the adapter size. Learning rate: QLoRA fine-tuning typically requires lower learning rates (1e-4 to 2e-4) than full fine-tuning; the warmup ratio should be 3–5% of total steps. Dataset size: for format alignment or style tasks, 1,000–5,000 high-quality examples are often sufficient; for knowledge injection, 10,000–50,000 examples are typically needed to achieve reliable generalisation.
Instruction Tuning & Chat Alignment
Base pre-trained LLMs are next-token predictors trained to continue text — they are not inherently helpful assistants. Instruction tuning transforms a base model into a chat model by fine-tuning it on a large dataset of (instruction, response) pairs spanning diverse task types, teaching the model to follow instructions, maintain conversation format, and provide helpful, well-formatted answers. The original InstructGPT paper (Ouyang et al., 2022) demonstrated that RLHF — using human preference rankings to train a reward model, then using the reward model to update the LLM via PPO reinforcement learning — substantially improved instruction following and reduced harmful outputs relative to supervised instruction tuning alone. This three-stage recipe (pre-training → supervised instruction tuning → RLHF) is the foundation of every chat-aligned model from GPT-4 to Claude to Gemini.
Direct Preference Optimization (DPO; Rafailov et al., 2023) has become a widely-used alternative to RLHF that eliminates the need for a separate reward model training stage, directly optimising on preference data pairs (chosen vs. rejected responses) in a single fine-tuning pass. DPO produces comparable alignment quality to RLHF on most benchmarks with simpler implementation and less training instability. For practitioners building domain-specific chat models, DPO is currently the recommended approach for preference alignment after initial instruction tuning. Constructing high-quality preference data — expert-rated pairs that capture the preference distinctions relevant to your specific application — is more important than the choice between RLHF and DPO for achieving good alignment results.
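The DPO objective itself is compact enough to write out. A per-pair sketch with scalar log-probabilities standing in for the sums over response tokens; `beta` controls how far the policy is allowed to drift from the reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * policy margin over the reference).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is 0 and the loss is log 2
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2)) < 1e-12
# Preferring the chosen response more than the reference does lowers the loss
assert dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2)
```

Minimising this over a preference dataset pushes the policy to widen the chosen-over-rejected margin, which is exactly what the RLHF reward-model-plus-PPO pipeline achieves in two stages.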
Continued pre-training is a distinct technique from instruction tuning — it extends the base model's pre-training on a domain-specific corpus without using labelled instruction pairs, teaching the model new vocabulary, facts, and reasoning patterns from raw domain text. It is appropriate when the target domain is sufficiently different from the base model's training distribution that fine-tuning alone produces poor grounding: medical literature using clinical terminology, legal codes using jurisdiction-specific conventions, financial filings using regulatory language, or scientific papers using domain-specific notation. Continued pre-training typically requires more compute than instruction fine-tuning (longer training runs over large unlabelled corpora), but produces a fundamentally more capable base model for the domain that responds better to both prompting and subsequent instruction tuning than starting directly from a general-purpose base. The recommended approach for most teams is to evaluate whether domain-specific RAG over existing models achieves acceptable performance before committing to continued pre-training, as the engineering investment is significantly higher.
Case Study
Fine-Tuning Llama for Medical Triage at Scale: Lessons from a Hospital Network Deployment
A hospital network deploying an LLM to support nurse triage documentation — summarising patient intake notes and suggesting ICD-10 coding categories — faced a challenge that illustrates the fine-tuning decision calculus clearly. The base Llama 3.1 8B model, prompted with carefully engineered instructions, achieved approximately 71% agreement with expert coders on ICD-10 category selection and consistently produced summaries that mixed clinical and lay terminology inconsistently. Increasing to a 70B model raised coding agreement to 78% but made the per-query cost prohibitive at the required volume of 40,000 notes per month. The team chose QLoRA fine-tuning of the 8B model on a dataset of 15,000 annotated triage notes and coding pairs, producing a model that achieved 84% coding agreement — exceeding the 70B baseline — with inference cost 9x lower than the prompted 70B approach.
The key lessons from the deployment: (a) data quality dominated data quantity — the initial 15,000-example dataset contained labelling inconsistencies from 8 different coders, and cleaning it to 11,000 high-consistency examples improved final performance more than the additional 4,000 noisy examples; (b) domain-specific evaluation was non-negotiable — MMLU scores and general coding benchmarks were uncorrelated with the deployment metric (ICD-10 agreement rate with expert coders), confirming that task-specific held-out evaluation is the only reliable performance signal; (c) the model needed continuous retraining — coding category updates and clinical terminology evolution meant the model required quarterly fine-tuning updates to maintain performance, making the full MLOps pipeline (data versioning, automated evaluation, staged rollout) as important as the fine-tuning methodology itself.
LoRA · Domain Adaptation · PEFT
Context Window Management
The context window is the most fundamental constraint in LLM application design. Every input token costs money, contributes latency, and competes for the model's attention budget. Exceeding the context limit raises a hard error; approaching it degrades retrieval quality and coherence. Effective context management — deciding what information to include, how to represent it compactly, and how to handle inputs that exceed window limits — is one of the most practically important skills in LLM engineering. It determines whether a system can handle enterprise-scale documents, long conversation histories, and complex multi-document queries reliably.
Understanding token economics is the starting point. At typical API pricing, a 128K-token context window filled to capacity costs $0.16–$0.64 per call depending on the model (at $1.25–$5 per million input tokens). For a high-volume application making 100K calls per day, context management decisions directly drive costs of thousands of dollars per day. Token counting before each API call — using provider tokenizers (tiktoken for OpenAI models, the Anthropic tokenization library, etc.) — prevents hard errors and enables dynamic context truncation. A context budget allocation approach — reserving fixed token budgets for the system prompt, conversation history, retrieved context, and output — provides a structured framework for managing context across all query types in a multi-purpose application.
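The cost figures above follow directly from token counts and per-million-token prices; a sketch:

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Per-call API cost from token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6

# A 128K-token context filled to capacity, at the $1.25-$5/M input price range
assert abs(call_cost_usd(128_000, 0, 1.25, 0) - 0.16) < 1e-9
assert abs(call_cost_usd(128_000, 0, 5.00, 0) - 0.64) < 1e-9

# At 100K calls/day averaging 20K input tokens each at $1.25/M:
daily = 100_000 * call_cost_usd(20_000, 0, 1.25, 0)
print(f"${daily:,.0f}/day")  # → $2,500/day
```

Running this arithmetic per query type makes the payoff of context trimming concrete: every thousand tokens shaved from the average prompt saves a fixed, predictable dollar amount per day.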
Context Window Comparison (Major Models)
| Model | Context Window | Effective Retrieval Range | Practical Implication |
| --- | --- | --- | --- |
| GPT-4o / GPT-4o-mini | 128K tokens (~96K words) | ~100K reliable | Handles most enterprise documents; long reports fit natively |
| Claude 3.5 Sonnet | 200K tokens (~150K words) | ~180K reliable | Full book or large codebase in single context; strong long-document performance |
| Gemini 1.5 Pro | 1M tokens (~750K words) | ~500K reliable | Multi-document corpora; video transcripts; very large codebases |
| LLaMA 3.1 8B/70B | 128K tokens | ~80K reliable (self-hosted) | Competitive with GPT-4o range; KV cache memory at 128K is significant constraint |
| Mistral 7B (sliding window) | Theoretically unlimited (SWA) | ~32K reliable | Sliding window attention limits effective recall despite nominal unbounded window |
Chunking & Retrieval Strategies
Chunking strategy is the first critical design decision in any RAG pipeline. How documents are split into retrievable units determines both retrieval quality (can the relevant information be found?) and generation quality (is the context coherent enough for the LLM to reason over?). Fixed-size token chunking — splitting every N tokens regardless of content structure — is the simplest approach and performs reasonably well as a baseline, but frequently splits sentences, paragraphs, or logical units mid-thought. With overlap (e.g., 128-token overlap on 512-token chunks), it ensures that information near chunk boundaries appears in at least one chunk fully, at the cost of some redundancy. For structured documents with natural section boundaries (headers, article breaks, code blocks), section-aware chunking that respects semantic boundaries consistently outperforms fixed-size chunking by 8–15% on retrieval precision.
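Fixed-size chunking with overlap can be sketched independently of any particular tokenizer by operating on an already-tokenized sequence:

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 512,
                       overlap: int = 128) -> list[list]:
    """Fixed-size chunking with overlap.

    Each chunk starts (chunk_size - overlap) tokens after the previous
    one, so content near a boundary appears in full in two chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
    # Drop a trailing chunk that is fully contained in the previous one
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

tokens = list(range(1000))  # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=128)
assert [len(c) for c in chunks] == [512, 512, 232]
# The 128-token overlap: the last 128 tokens of chunk 0 open chunk 1
assert chunks[0][-128:] == chunks[1][:128]
```

In a real pipeline the token list would come from the embedding model's own tokenizer, so chunk sizes line up with the retriever's token limits rather than the LLM's.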
Semantic chunking — using embedding similarity to detect topic shifts and creating chunk boundaries at natural semantic transitions — produces the most coherent chunks but requires the most compute. Small-to-big chunking stores small chunks (128 tokens) for retrieval precision but returns their parent sections (512–1024 tokens) as context for generation, combining the retrieval granularity of small chunks with the contextual coherence of larger units. This "child-retrieval, parent-context" pattern is one of the most reliable improvements to RAG quality and adds minimal implementation complexity once a parent-child chunk relationship is stored in the index metadata. Sentence window retrieval is a variant: retrieve at the sentence level for precision, but expand to the surrounding 2–3 sentences before injection for coherence.
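The child-retrieval, parent-context pattern reduces to a metadata lookup at query time. A minimal sketch; the `Chunk` record and parent store here are illustrative stand-ins for whatever metadata your vector index holds.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str   # stored in the index metadata alongside the embedding
    text: str

def expand_to_parents(retrieved: list[Chunk],
                      parents: dict[str, str]) -> list[str]:
    """Child-retrieval, parent-context: small chunks are retrieved for
    precision, then each hit's parent section is returned as generation
    context (deduplicated, retrieval order preserved)."""
    seen, context = set(), []
    for chunk in retrieved:
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            context.append(parents[chunk.parent_id])
    return context

parents = {"sec-1": "Section 1 full text...", "sec-2": "Section 2 full text..."}
hits = [Chunk("c3", "sec-1", "small chunk A"),
        Chunk("c9", "sec-2", "small chunk B"),
        Chunk("c4", "sec-1", "small chunk C")]  # same parent as c3
assert expand_to_parents(hits, parents) == ["Section 1 full text...",
                                            "Section 2 full text..."]
```

The deduplication step matters: multiple small hits inside one section should contribute that section's text once, not several overlapping copies.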
Hybrid search — combining dense vector retrieval with sparse BM25 keyword matching — consistently outperforms either approach alone, with typical improvements of 5–15% in top-5 recall. The combination matters because dense retrieval captures semantic similarity ("affordable flights" matching "cheap airfare") while sparse retrieval captures lexical precision (exact product codes, technical terms, proper nouns that embeddings compress poorly). Reciprocal Rank Fusion (RRF) is the standard algorithm for combining dense and sparse ranked lists: it assigns each document a score proportional to 1/(k + rank) in each list, then sums scores across lists. Cross-encoder reranking — applying a more expensive but more accurate model to the top-20 retrieved candidates to select the final top-5 — adds 100–300ms latency but improves precision by 10–20%, making it worth the latency cost for high-value applications where context quality directly affects answer accuracy.
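RRF takes only a few lines to implement; a sketch using the conventional smoothing constant k = 60:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]],
                           k: int = 60) -> list[str]:
    """RRF: each document scores the sum of 1/(k + rank) over every
    ranked list it appears in (rank is 1-based); k=60 is the
    conventional smoothing constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d5"]   # dense (vector) ranking
sparse = ["d1", "d3", "d2"]  # sparse (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
# Documents appearing in both lists rise to the top
assert fused[:2] == ["d1", "d2"]
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse retrievers, which is why it is the default fusion method in most hybrid search stacks.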
Long-Document Processing Patterns
Processing documents that exceed the context window requires one of three architectural patterns, each with distinct trade-offs. The map-reduce pattern processes long documents in chunks: a "map" LLM call processes each chunk independently (extracting key information, summarising, or answering a partial question), and a "reduce" call synthesises the chunk-level results into a final answer. Map-reduce is highly parallelisable — all map calls can run concurrently — but suffers from the boundary problem: information that spans two chunks may not be captured correctly by either map call. With 128-token overlap between chunks, this is mitigated but not eliminated.
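The map-reduce pattern can be sketched with stub functions standing in for the LLM calls; in production, `map_call` and `reduce_call` would be API requests with summarisation prompts.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs standing in for LLM calls (hypothetical, for illustration only)
def map_call(chunk: str) -> str:
    return f"summary({chunk})"

def reduce_call(partials: list[str]) -> str:
    return " + ".join(partials)

def map_reduce(chunks: list[str], max_workers: int = 8) -> str:
    """Map-reduce over document chunks: the map calls run concurrently,
    then a single reduce call synthesises the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_call, chunks))  # preserves chunk order
    return reduce_call(partials)

assert map_reduce(["A", "B", "C"]) == "summary(A) + summary(B) + summary(C)"
```

The thread pool is the whole point of the pattern: with N chunks, wall-clock time is roughly one map call plus one reduce call, not N sequential calls.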
The iterative refinement pattern processes chunks sequentially: the first chunk produces an initial answer or summary, and each subsequent chunk updates and refines it. This captures cross-chunk relationships better than map-reduce but is sequential (no parallelism) and accumulates errors that early incorrect inferences introduce into later refinement steps. For summarisation tasks over very long documents, iterative refinement often produces higher-quality output than map-reduce because context from earlier sections informs the interpretation of later sections — a key insight that appears early in a report shapes how subsequent details are understood. The hierarchical summarisation pattern addresses the same problem differently: first summarise each section individually, then summarise the section summaries, creating a multi-level hierarchy that preserves both local detail and global structure. This is particularly effective for structured enterprise documents (financial reports, regulatory filings, technical specifications) where section boundaries carry semantic significance.
Implementation Pattern
The Context Budget Pattern for Multi-Purpose Applications
Applications handling diverse query types — some requiring long system prompts, others large knowledge base contexts, others long conversation histories — benefit from a context budget allocator that partitions the available token window dynamically per query type. A practical implementation defines per-query-type budgets as percentages: system prompt (10–15%), conversation history (20–30%), retrieved context (40–55%), input query (5–10%), response buffer (15–20%). Before each API call, the budget allocator counts current token usage in each slot, truncates or compresses each slot to stay within its budget, and verifies total token count against the model's context limit with a 5% safety margin. For conversation history truncation, a sliding window preserving the most recent N turns outperforms fixed-token truncation because recency matters more than token coverage for coherent conversation. For retrieved context truncation, reranker scores provide the optimal truncation criterion — cut the lowest-scoring chunks first until the context fits the budget.
Context Management
Token Budget
RAG
LLM Application Architecture Patterns
Production LLM applications are rarely a single model call — they are systems composed of multiple LLM calls, retrieval steps, tool invocations, validation layers, and orchestration logic. Understanding the canonical architectural patterns for composing these components helps practitioners make sound design decisions rather than reinventing solutions to well-understood problems. The patterns covered here — routing, chaining, caching, fallbacks, and function calling — appear in virtually every mature LLM application, from enterprise automation systems to consumer AI products.
The orchestration layer — the code that sequences LLM calls and other operations — is where most of the complexity in LLM applications lives. Frameworks like LangChain, LlamaIndex, and DSPy provide reusable abstractions for common patterns. LangChain's chain and agent abstractions handle sequential and conditional LLM call pipelines with built-in prompt templating and output parsing. LlamaIndex focuses on data connectors and query engines for RAG pipelines over diverse data sources. DSPy takes a different philosophy: rather than manually writing prompts, it compiles high-level program specifications into optimised prompts using a compilation pipeline that maximises a task-specific metric. The choice between frameworks and custom orchestration depends on team familiarity, the complexity of the pipeline, and the degree of control needed over individual components.
Routing & Chaining
Routing dispatches queries to different models, prompts, or processing pipelines based on the query's characteristics. The most common routing strategy is complexity routing: classify each incoming query as simple or complex, direct simple queries to a fast, cheap model (GPT-4o-mini, Claude Haiku) and complex queries to a frontier model (GPT-4o, Claude 3.5 Sonnet). A binary complexity classifier trained on historical query-outcome pairs achieves this efficiently. Task routing sends different task types to task-optimised models: a coding query routes to a code-specialised model, a document summarisation query routes to a context-length-optimised model, a creative writing query routes to a model known for fluent prose. Well-implemented routing achieves 40–70% cost reduction with less than 2% quality degradation measured on aggregate production metrics.
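The shape of a complexity router is simple even when the classifier is not. The sketch below uses a crude heuristic (query length plus reasoning keywords) in place of the trained binary classifier described above; the model names are illustrative:

```python
# Heuristic stand-in for a trained complexity classifier. In production this
# would be a small model trained on historical query-outcome pairs.
CHEAP, FRONTIER = "gpt-4o-mini", "gpt-4o"

REASONING_MARKERS = ("why", "explain", "compare", "step by step", "prove")

def route(query: str) -> str:
    """Return the model a query should be dispatched to."""
    q = query.lower()
    long_query = len(q.split()) > 30          # long queries tend to be complex
    reasoning = any(m in q for m in REASONING_MARKERS)
    return FRONTIER if (long_query or reasoning) else CHEAP
```

The application layer then simply calls the returned model. Measuring routed traffic against the frontier-only baseline is what validates the cost/quality claim for your own workload.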
Chaining sequences multiple LLM calls where the output of one step feeds into the next. The canonical examples are: a summarise-then-answer chain (first compress long documents, then answer questions over the summary), a draft-then-critique chain (one call produces a draft, a second call critically evaluates it and suggests improvements), and a decompose-then-solve chain (one call breaks a complex query into sub-questions, subsequent calls solve each sub-question, a final call synthesises the answers). Chains introduce compounding error risk: if an early step produces a subtly incorrect output, downstream steps have no access to the original information and cannot detect or recover from the error. Chain design should minimise the number of sequential steps, validate intermediate outputs against schemas or classifiers before they feed into subsequent steps, and maintain access to the original input at every stage for context retrieval.
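The decompose-then-solve chain, with the two mitigations above (validate intermediate output, keep the original input available at every stage), might look like this sketch. The `llm` callable and prompts are placeholders:

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # stand-in for a real model call

def decompose_then_solve(question: str, llm: LLM) -> str:
    # Step 1: decompose. We expect a JSON list of sub-questions and VALIDATE it
    # before it feeds the next step, rather than trusting the model's output.
    raw = llm(f"Break this into sub-questions as a JSON list:\n{question}")
    try:
        subs = json.loads(raw)
        assert isinstance(subs, list) and subs
    except (json.JSONDecodeError, AssertionError):
        subs = [question]  # validation failed: degrade to a single-step answer

    # Step 2: solve each sub-question. The ORIGINAL question stays in context
    # so downstream steps can recover from a poor decomposition.
    answers = [
        llm(f"Original question: {question}\nSub-question: {s}\nAnswer:")
        for s in subs
    ]

    # Step 3: synthesise the partial answers into one response.
    return llm(f"Question: {question}\nPartial answers:\n" + "\n".join(answers))
```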
Function calling (structured tool use) is the architectural pattern for connecting LLMs to external systems: databases, APIs, search engines, calculators, code interpreters. The model is given a schema of available functions with their parameters, and rather than generating text, it generates a structured function call that the application layer executes, returning the result to the model for incorporation into the final response. This pattern eliminates hallucination about current facts (the model can call a real-time data API), enables precise computation (the model delegates arithmetic to a calculator), and provides a structured interface for system integrations. All major frontier model APIs (OpenAI function calling, Anthropic tool use, Gemini function declarations) support this natively. The key design practice is writing precise, concise function descriptions that clearly distinguish between functions — ambiguous descriptions cause the model to select the wrong function in multi-tool environments.
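The application-layer half of this pattern — the schema handed to the model and the dispatcher that executes the structured call it returns — can be sketched as below. The schema shape mirrors the OpenAI tools format; `get_weather` is a hypothetical stand-in for a real API call:

```python
import json

# Real implementations the application layer executes on the model's behalf.
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"  # stand-in for a live weather API

TOOLS = {"get_weather": get_weather}

# Schema given to the model (shape mirrors the OpenAI tools format). Note the
# precise description: it says when to use the tool, not just what it does.
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city. Use only for weather queries.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute the structured call the model generated instead of free text."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # providers return args as JSON text
    return fn(**args)
```

The dispatched result is then appended to the conversation as a tool message so the model can incorporate it into its final answer.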
Caching, Fallbacks & Resilience
LLM API calls are expensive, latency-variable, and occasionally fail. Production architectures must handle all three realities. Semantic caching stores responses keyed by embedding similarity: when an incoming query's embedding is within a configurable cosine distance threshold of a cached query, the cached response is returned without an API call. For customer support applications, this captures 15–25% of traffic (frequently-asked question patterns), reducing cost proportionally. Cache invalidation must be handled carefully: responses grounded in facts that change (product pricing, availability, policies) must be invalidated when the underlying data changes, not just on TTL expiry. Exact-match caching (keying on normalised query text) is simpler to implement and appropriate for templated queries where canonicalisation is reliable.
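A semantic cache reduces to an embedding lookup with a similarity threshold. The sketch below uses a toy bag-of-words embedding so it is self-contained; a production system would use a real embedding model and a vector index:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" for illustration only.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a query is close enough to a cached one."""
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller makes the real API call, then put()s

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))
```

A real implementation would also attach invalidation metadata to each entry (e.g. which pricing or policy records a response depends on) so changes to the underlying data can evict entries rather than relying on TTL alone.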
Fallback chains handle API failures gracefully. The standard pattern routes requests to an alternative model when the primary API returns a 429 (rate limit) or 5xx error: primary=GPT-4o → fallback-1=Claude 3.5 Sonnet → fallback-2=Gemini 1.5 Pro → fallback-3=cached response or graceful degradation. Cross-provider fallback adds resilience to provider-level outages but requires prompt variants for each provider's API format. Retry with exponential backoff handles transient errors: 3 retries with delays of 1s, 2s, 4s catches the vast majority of transient API failures without hammering a struggling service. Circuit breakers — temporary fallback to an alternative path after N consecutive failures — prevent cascading failures when a provider has a prolonged outage. Building a load balancer that distributes traffic across multiple API provider accounts or regions provides both rate limit mitigation and multi-provider resilience in a single layer.
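Retry-with-backoff and a provider fallback chain compose naturally; a minimal sketch, with zero-argument callables standing in for provider API calls:

```python
import time

def call_with_retry(fn, retries=3, base_delay=1.0):
    """Retry a transient failure with exponential backoff (1s, 2s, 4s by default)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)

def call_with_fallbacks(providers, retries=3, base_delay=1.0):
    """providers: ordered list of (name, callable). First success wins."""
    for name, fn in providers:
        try:
            return name, call_with_retry(fn, retries, base_delay)
        except Exception:
            continue  # exhausted retries on this provider; step down the chain
    raise RuntimeError("all providers failed")
```

A circuit breaker would wrap each provider entry with a failure counter that skips it entirely for a cooldown period after N consecutive failures, so a struggling provider is not retried on every request.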
Observability for LLM application architectures requires tracking at the component level, not just the endpoint level. For a chain with three LLM calls, tracking only the overall request latency and success rate provides too little signal to diagnose where failures occur. Component-level instrumentation — measuring input token count, output token count, latency, cost, and error rate per LLM call in the chain — enables root cause analysis. Distributed tracing (OpenTelemetry is the standard; LangSmith and Langfuse provide LLM-specific tracing) captures the full execution tree of a complex pipeline in a single trace, making it possible to see exactly which retrieval call returned low-quality chunks, which LLM call produced an unexpected response format, and which fallback was triggered. LLM-specific tracing should capture the full prompt (system + history + context + input) and completion for every call, with appropriate PII redaction applied before storage, enabling engineers to replay any production request exactly as it occurred for debugging.
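Component-level instrumentation can start as a simple decorator before graduating to a full tracing backend. In this sketch token counts are approximated with a whitespace split, and `METRICS` stands in for an export to a system like OpenTelemetry:

```python
import functools
import time

METRICS = []  # stand-in for an export to a tracing/metrics backend

def traced(component: str):
    """Record per-component latency, rough token counts, and success for each call."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            start = time.perf_counter()
            error, out = None, ""
            try:
                out = fn(prompt, *args, **kwargs)
                return out
            except Exception as e:
                error = e
                raise
            finally:
                METRICS.append({
                    "component": component,
                    "latency_s": time.perf_counter() - start,
                    "input_tokens": len(prompt.split()),   # rough proxy
                    "output_tokens": len(out.split()),
                    "ok": error is None,
                })
        return wrapper
    return deco
```

Decorating each LLM call in a chain with a distinct component name gives exactly the per-step signal the paragraph above argues for: a three-call chain produces three metric records per request, not one.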
Architecture Principle: Design LLM applications to degrade gracefully, not fail catastrophically. A user who receives a cached answer from 6 hours ago when the live API is down is better served than one who receives a 503 error. Define explicit degradation levels — live LLM response, semantic cache hit, static FAQ answer, human escalation — and implement logic to step down through them rather than treating API unavailability as a fatal failure.
Practice Exercises
These exercises build progressively from simple API usage through to production-scale throughput benchmarking. Each exercise is designed to generate concrete, measurable results that deepen your intuition about LLM behaviour.
Beginner
Exercise 1: Temperature Effects on Summarisation
Use the OpenAI API to summarise 5 news articles (300–500 words each). Summarise each article once at temperature=0.0, temperature=0.5, and temperature=1.0 (15 API calls total). For each temperature setting: how consistent is the summary style across the 5 articles? If you re-run the same article at the same temperature, how much do the outputs diverge between runs? Which temperature produces the most factually accurate summaries? Which produces the most engaging prose? Document your observations in a structured table comparing the three temperature settings.
Intermediate
Exercise 2: Token Counting and Prompt Efficiency
Using tiktoken, count the tokens in 10 of your own prompts. For each prompt, calculate: tokens per instruction word (efficiency ratio), and the proportion of the total token count consumed by the system prompt vs. the user query. Design a refactored version of each prompt that achieves the same task specification using fewer tokens. What techniques reduce token count most: removing redundant phrases, using shorter synonyms, switching from examples to schema descriptions? Measure the accuracy of the refactored prompts vs. the originals on 5 test inputs each.
Intermediate
Exercise 3: GPT-4o-mini vs GPT-4o Capability Comparison
Design 10 test tasks spanning: simple factual lookup (2 tasks), multi-step reasoning (3 tasks), code generation (2 tasks), creative writing (1 task), and domain-specific knowledge (2 tasks). Run each task on both models at temperature=0.1. Rate each response on a 1–5 scale for accuracy and quality. For which task categories does the mini model match or exceed the full model? For which does it fail significantly? Calculate the cost difference and estimate the breakeven — at what quality gap is it worth paying for the full model?
Advanced
Exercise 4: vLLM Throughput Benchmarking
Set up vLLM locally (requires a GPU with at least 16GB VRAM) with an open-weight model (LLaMA 3.1 8B or Mistral 7B). Use the built-in benchmark_serving.py script to measure throughput (tokens/sec) at concurrent batch sizes of 1, 4, 16, and 64 requests. Plot the throughput curve. At what batch size does throughput plateau? What is the P95 time-to-first-token at each batch size? Compare: what is the GPU utilisation (measured via nvidia-smi) at each batch size? This experiment reveals the relationship between concurrency, memory pressure, and serving cost.
LLM Model Card Generator
Model cards are the standard documentation artefact for AI models — capturing intended use, training data, performance metrics, known limitations, and ethics considerations. Use the form below to generate a structured model card for any LLM-based system you are building or deploying.
Conclusion & Next Steps
Large language models represent a qualitative shift in what software systems can do with natural language. The decoder-only transformer architecture, trained with the simple objective of next-token prediction at massive scale, produces systems with emergent capabilities that were not designed in — in-context learning, chain-of-thought reasoning, and coherent long-form generation that generalise across domains. Chinchilla scaling laws provide actionable guidance for compute allocation, and the distinction between training-optimal and inference-optimal models is now a central design variable for any team training or deploying LLMs at scale.
Understanding LLM internals — how attention works, what context window constraints mean, why hallucination is structural rather than incidental, and what benchmark scores do and do not measure — is the prerequisite for making sound engineering decisions about every aspect of LLM application development: which model to use, how to design prompts, when to fine-tune, and how to build evaluation infrastructure that provides reliable signal. The next article in this series moves to the practitioner's primary interface with LLMs: prompt engineering, where the techniques of zero-shot, few-shot, chain-of-thought, and structured output design translate the capabilities described here into reliable production behaviour.
Next in the Series
In Part 9: Prompt Engineering & In-Context Learning, we cover the systematic techniques — zero-shot, few-shot, chain-of-thought, tree-of-thought, and structured outputs — that make LLMs reliable and controllable in production pipelines.
Continue This Series
Part 3: Natural Language Processing
Tokenisation, embeddings, the transformer architecture, and semantic search — the NLP foundations that LLMs are built upon.
Part 9: Prompt Engineering & In-Context Learning
Chain-of-thought prompting, few-shot learning, structured outputs, and the prompt patterns that extract maximum performance without any weight updates.
Part 10: Fine-tuning, RLHF & Model Alignment
LoRA, instruction tuning, DPO, and the alignment techniques that turn raw pre-trained LLMs into safe, useful assistants.