AI in the Wild, Part 8 of 24: Large Language Models

Series outline:
1. AI & ML Landscape Overview: Paradigms, ecosystem map, real-world applications at a glance
2. ML Foundations for Practitioners: Supervised learning, bias-variance, model evaluation
3. Natural Language Processing: Tokenization, embeddings, transformers, semantic search
4. Computer Vision in the Real World: CNNs, ViTs, detection, segmentation, deployment patterns
5. Recommender Systems: Collaborative filtering, content-based, two-tower models
6. Reinforcement Learning Applications: Q-learning, policy gradients, RLHF, real-world deployments
7. Conversational AI & Chatbots: Dialogue systems, intent detection, RAG, production bots
8. Large Language Models: Architecture, scaling laws, capabilities, limitations (this article)
9. Prompt Engineering & In-Context Learning: Chain-of-thought, few-shot, structured outputs, prompt patterns
10. Fine-tuning, RLHF & Model Alignment: LoRA, instruction tuning, DPO, alignment techniques
11. Generative AI Applications: Diffusion models, GANs, image/audio/video generation
12. Multimodal AI: Vision-language models, audio-text, cross-modal retrieval
13. AI Agents & Agentic Workflows: Tool use, planning, memory, multi-agent orchestration
14. AI in Healthcare & Life Sciences: Diagnostics, drug discovery, clinical NLP, regulatory landscape
15. AI in Finance & Fraud Detection: Credit scoring, anomaly detection, algorithmic trading
16. AI in Autonomous Systems & Robotics: Perception, planning, control, sim-to-real transfer
17. AI Security & Adversarial Robustness: Adversarial attacks, poisoning, model extraction, defences
18. Explainable AI & Interpretability: SHAP, LIME, attention, mechanistic interpretability
19. AI Ethics & Bias Mitigation: Fairness metrics, dataset auditing, debiasing techniques
20. MLOps & Model Deployment: CI/CD for ML, feature stores, monitoring, drift detection
21. Edge AI & On-Device Intelligence: Quantization, pruning, TFLite, CoreML, embedded inference
22. AI Infrastructure, Hardware & Scaling: GPUs, TPUs, distributed training, memory hierarchy
23. Responsible AI Governance: Risk frameworks, model cards, auditing, organisational practice
24. AI Policy, Regulation & Future Directions: EU AI Act, global frameworks, emerging risks, what's next
About This Article
This article builds a complete understanding of large language models from the ground up — covering the transformer architecture that powers them, the scaling laws that govern their training, the emergent capabilities that surprise their creators, the competitive landscape of open and closed models, and the engineering required to deploy them reliably in production. Code examples, model comparison tables, and benchmark references are included throughout.
Level: Advanced. Topics: LLMs, Architecture, Deployment Engineering
Transformer Architecture
The decoder-only transformer is the dominant architecture for modern large language models. Its core components are: a tokeniser (converting raw text into integer token IDs), an embedding layer (mapping token IDs to dense vectors), a stack of transformer blocks (each containing multi-head self-attention followed by a position-wise feed-forward network, with residual connections and layer normalisation around each sub-layer), and a language modelling head (a linear projection from the final hidden state to vocabulary logits). The training objective is causal language modelling: at each position in the sequence, predict the next token given all preceding tokens, maximising the log-probability of the target token. This objective is simple enough to scale to arbitrary data and compute — no labels, no task-specific objectives, no reward signal — yet produces systems capable of in-context learning, instruction following, and coherent multi-thousand-word generation.
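To make the stack concrete, the sketch below implements one pre-norm decoder block in NumPy: causal self-attention followed by a feed-forward network, each wrapped in a residual connection. It is a deliberately minimal illustration, not a training-ready implementation: a single attention head instead of many, a ReLU feed-forward in place of SwiGLU, standard LayerNorm in place of RMSNorm, and arbitrary small weight shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalise each token vector to zero mean, unit variance
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv, Wo):
    # single-head causal self-attention (multi-head splits d into H slices)
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), 1), -1e9, scores)  # mask future
    return (softmax(scores) @ v) @ Wo

def decoder_block(x, p):
    # pre-norm residual sub-layers, as in modern decoder-only stacks
    x = x + causal_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]  # ReLU FFN (SwiGLU in practice)

rng = np.random.default_rng(0)
d, d_ff, T = 16, 64, 5
p = {name: rng.normal(0, 0.05, (d, d)) for name in ("Wq", "Wk", "Wv", "Wo")}
p["W1"], p["W2"] = rng.normal(0, 0.05, (d, d_ff)), rng.normal(0, 0.05, (d_ff, d))
x = rng.normal(size=(T, d))
out = decoder_block(x, p)
```

Stacking N such blocks over an embedding layer and topping them with a linear vocabulary projection yields the full decoder-only architecture described above.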
The lineage traces from GPT-2 (1.5B parameters, trained on 40GB of web text, 2019) through GPT-3 (175B parameters, trained on 300B tokens, 2020) to the current generation of frontier models operating at hundreds of billions to over a trillion effective parameters. Each generation retained the same fundamental architecture while incorporating engineering improvements: GPT-3 established that scale alone dramatically expanded capability without architectural changes; the subsequent generation introduced Grouped-Query Attention (GQA), which reduces KV cache memory by sharing key and value heads across multiple query heads; SwiGLU activation functions in feed-forward layers, which empirically improve training efficiency; RMS layer normalisation in place of the original layer norm; and Rotary Positional Embeddings (RoPE), which encode position more efficiently and generalise better to long contexts. These innovations are now standard across all leading open-weight architectures including LLaMA 3, Mistral, Gemma, and Qwen.
Key Insight: Parameter count alone is a poor proxy for model capability. A well-trained 7B model on high-quality, carefully curated data — like Mistral 7B or LLaMA 3 8B — routinely outperforms poorly trained models several times its size. Data quality and the training recipe matter as much as raw scale, which is why Chinchilla-optimal training has displaced the race to maximise parameter counts.
Attention Mechanisms & Multi-Head Self-Attention
Self-attention is the mechanism that allows each token in the sequence to aggregate information from all previous tokens. For each token, three vectors are computed: a query (what information am I looking for?), a key (what information do I contain?), and a value (what information should I pass forward?). The attention weights between token pairs are computed as the scaled dot product of queries and keys, normalised by a softmax, and then used to compute a weighted sum of value vectors. Multi-head attention runs this operation in parallel across H independent heads with different learned projections, allowing the model to simultaneously attend to multiple aspects of the context — syntactic structure, semantic relationships, coreference — before concatenating and projecting the results.
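The per-head computation can be written out directly in NumPy. The reshape/transpose steps below are the whole of the "multi-head" mechanism: H independent query/key/value projections attending in parallel over dh = d/H dimensions each, then concatenated and projected. Shapes and weights are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    T, d = x.shape
    dh = d // n_heads
    def split(W):
        # project, then carve the model dimension into H independent heads
        return (x @ W).reshape(T, n_heads, dh).transpose(1, 0, 2)   # (H, T, dh)
    q, k, v = split(Wq), split(Wk), split(Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)                 # scaled dot product
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    scores = np.where(mask, -1e9, scores)                           # causal mask
    out = softmax(scores) @ v                                       # weighted sums of values
    return out.transpose(1, 0, 2).reshape(T, d) @ Wo                # concat heads, project

rng = np.random.default_rng(1)
T, d, H = 6, 32, 4
Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
y = multi_head_attention(rng.normal(size=(T, d)), Wq, Wk, Wv, Wo, n_heads=H)
```

Production implementations fuse these steps into batched tensor operations, but the algebra is exactly this.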
The key architectural innovations for efficient scaling are Flash Attention (Dao et al., 2022), which computes attention in fused GPU kernels using tiling to keep intermediate values in fast SRAM rather than slower HBM, reducing memory bandwidth by 5–20x for long sequences; Grouped-Query Attention (GQA), which groups multiple query heads to share a single key-value head, reducing the KV cache size (and thus inference memory) proportionally to the grouping factor; and Multi-Query Attention (MQA), an extreme version of GQA where all query heads share one key-value head. These optimisations are specifically targeted at inference efficiency: the KV cache — which stores key and value vectors for all previous tokens to avoid recomputation during autoregressive decoding — is the primary memory bottleneck for serving long-context requests at scale.
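A back-of-the-envelope calculator shows why GQA matters for serving. The sketch below assumes LLaMA-3-8B-like shapes (32 layers, head dimension 128, 32 query heads) purely for illustration; the KV cache stores one key and one value vector per layer, per KV head, per token, at 2 bytes each in fp16/bf16.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # 2x: one cache for keys, one for values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Full multi-head attention: every query head keeps its own KV head
full_mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# GQA with 8 KV heads shared across the 32 query heads
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"MHA: {full_mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB per 8K-token request")
# → MHA: 4.0 GiB, GQA: 1.0 GiB
```

The 4x reduction is exactly the 32-to-8 KV-head grouping factor, and it multiplies across every concurrent request a server holds in memory.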
Positional Encoding & Context Windows
Self-attention is permutation-invariant by construction — the attention computation treats tokens identically regardless of their position in the sequence. Positional encodings inject order information by modifying the token representations before or during attention. Early models used learned absolute positional embeddings (a lookup table of position-specific vectors), which worked well within training-length sequences but generalised poorly to longer inputs. Rotary Positional Embeddings (RoPE) encode position by rotating query and key vectors in the attention computation, with the rotation angle proportional to the relative distance between token pairs. This relative encoding scheme generalises better to longer sequences and has become the standard across all modern open-weight architectures.
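A minimal NumPy sketch of the rotary idea, using the half-split pairing found in LLaMA-style implementations: each pair of dimensions is rotated by an angle proportional to the token's position, and the payoff is that dot products between rotated queries and keys depend only on the relative offset between positions.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each dimension pair of row t by angle t * freq (rotary embedding)."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)    # one rotation frequency per dim pair
    angles = np.outer(np.arange(T), freqs)       # (T, half): position * frequency
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=(1, 8)), (10, 1))    # same query content at every position
k = np.tile(rng.normal(size=(1, 8)), (10, 1))    # same key content at every position
rq, rk = rope(q), rope(k)
# positions (2, 5) and (4, 7) share offset 3, so their attention logits match
```

This relative-offset property is what the test of position pairs demonstrates, and it is the reason RoPE extrapolates more gracefully than absolute position tables.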
Context window length is governed by the O(n²) attention complexity: doubling the context length quadruples attention memory and compute. Flash Attention reduces the constant factor substantially but does not change the asymptotic complexity. Techniques that break the O(n²) barrier include sliding window attention (Mistral 7B), where each token attends only to a local window of preceding tokens rather than the full history, reducing complexity to O(n·w) where w is the window size; and sparse attention patterns. Context extension beyond training length uses techniques like YaRN (position interpolation with adjusted RoPE scaling), which adjusts the effective frequency basis of RoPE to allow extrapolation to longer sequences at inference time without retraining. The practical ceiling for reliable long-context retrieval is lower than the nominal context window suggests, due to the "lost in the middle" phenomenon: retrieval accuracy for information positioned in the middle of a long context degrades significantly relative to information at the start or end of the context.
Scaling Laws & Training
Scaling laws are empirical relationships that govern how language model performance improves as model size, dataset size, and compute budget increase. The foundational work by Kaplan et al. (2020) from OpenAI demonstrated that test loss decreases as a smooth power law with each of these three factors, enabling practitioners to predict model performance at untested scales from smaller training runs. This predictability is what justifies the enormous compute investments required for frontier model training: the capability gains are not speculative but empirically forecasted. The Kaplan guidance suggested scaling model size faster than data size for a given compute budget, which led to the trend of very large models trained on relatively modest datasets — most famously GPT-3 at 175B parameters trained on 300B tokens.
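The power-law form is simple enough to state in a few lines. The sketch below uses the parameter-scaling fit reported by Kaplan et al. (2020), L(N) = (N_c / N)^alpha with alpha ≈ 0.076 and N_c ≈ 8.8e13, purely as an illustration of how smooth and extrapolatable the predicted curve is; real forecasting fits these constants to an organisation's own training runs.

```python
def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Predicted test loss (nats/token) as a pure function of parameter count."""
    return (n_c / n_params) ** alpha

for n in (1.5e9, 13e9, 175e9, 1e12):
    print(f"{n:9.1e} params -> predicted loss {kaplan_loss(n):.3f}")
```

The monotone, smooth decrease is the whole point: a fitted curve from small runs predicts the loss of a run two orders of magnitude larger.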
Chinchilla & Compute-Optimal Training
Kaplan's scaling laws made the case for compute investment, but they left deployers with an immediate question: which axis of scale delivers the most return for a given budget? A frontier model trained with ten times the compute may score meaningfully better on benchmarks while costing ten times as much per token to serve. For most production applications, the relevant axis is not maximum capability but capability-per-dollar at the query volume and latency requirements of the specific deployment. This reframes model selection from "which model scores highest on MMLU?" to "which model achieves acceptable quality on my task at acceptable cost and latency?", a fundamentally different question that requires task-specific benchmarking rather than consulting public leaderboards.
The Chinchilla paper (Hoffmann et al., 2022) revised Kaplan's guidance substantially. By training over 400 models at diverse size-data combinations while holding total compute fixed, DeepMind found that Kaplan had underestimated the returns to data relative to parameters. The Chinchilla-optimal recipe allocates approximately 20 training tokens per parameter: a 7B model should train on ~140B tokens, a 70B model on ~1.4T tokens. By this analysis, GPT-3's 175B parameters trained on 300B tokens was significantly undertrained — the same compute could have produced a smaller but substantially better-performing model trained on more data.
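The 20-tokens-per-parameter rule combines with the standard C ≈ 6·N·D estimate of training FLOPs into a two-line allocator. This is a rule-of-thumb sketch of the Chinchilla recipe, not the paper's full parametric fit.

```python
def chinchilla_optimal(compute_flops):
    # C ≈ 6·N·D and D ≈ 20·N  =>  C ≈ 120·N², so solve for N
    n_params = (compute_flops / 120) ** 0.5
    return n_params, 20 * n_params

n, d = chinchilla_optimal(5.9e21)   # roughly the budget of a 7B / 140B-token run
print(f"optimal model: {n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")
```

Running the same function at GPT-3's training budget shows the paper's point directly: the compute-optimal model is far smaller than 175B parameters and trained on far more than 300B tokens.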
The post-Chinchilla generation of models — LLaMA 1 and 2, Mistral, Gemma, Gemma 2, Qwen — all reflect this shift, training smaller models on dramatically larger datasets. But practitioners quickly discovered the distinction between training-optimal and inference-optimal compute allocation: a Chinchilla-optimal 65B model costs far more per inference call than an "overtrained" 7B model with equivalent performance on many tasks. LLaMA 2's 7B model was deliberately trained on 2T tokens — far beyond the Chinchilla optimum — because the inference cost savings at scale dwarf the training compute overhead. This trade-off is now a central design variable in any serious model training decision.
Training Data & Curation
Modern LLM training corpora are assembled from multiple source types: web crawl data (Common Crawl being the dominant source, often 50–70% of total tokens after filtering), digitised books and academic papers (providing higher-quality, long-form language and domain-specific knowledge), code repositories (GitHub being the primary source, producing code capabilities and structured reasoning), curated datasets (Wikipedia, Stack Exchange, specialised corpora for science, law, and medicine), and increasingly synthetic data generated by stronger models. The mixture ratios — how much of each source to include — have outsized effects on downstream capability: higher proportions of code data improve mathematical reasoning even on non-coding benchmarks; academic paper inclusion improves scientific knowledge and citation accuracy.
Data quality processing is as important as source selection. Near-deduplication (removing near-identical documents using MinHash locality-sensitive hashing) is essential — training on duplicated data produces models that recite memorised text rather than generalising, and wastes compute on redundant updates. Quality filtering removes low-quality web pages using heuristics (length, punctuation density, repetitive n-gram ratios) or classifier-based scoring trained on high-quality seed data. Toxicity filtering removes harmful content to reduce safety risks in the base model. Benchmark contamination — the inadvertent inclusion of evaluation benchmark questions in training data — is a significant concern that inflates benchmark scores and makes model comparisons unreliable; reputable labs maintain contamination detection pipelines but contamination cannot be entirely eliminated from web-crawl data.
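A toy version of MinHash near-deduplication fits in a few lines. Real pipelines use banded locality-sensitive hashing over millions of documents; this sketch only shows the core estimator, with seeded md5 standing in for a family of independent hash functions, and invented example strings.

```python
import hashlib

def shingles(text, n=5):
    """Set of word n-grams; near-duplicates share most of these."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    # each seeded hash's minimum over the shingle set is one sample of the Jaccard estimator
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # fraction of matching signature slots approximates shingle-set Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc = "the quick brown fox jumps over the lazy dog and then runs far away into the quiet hills"
near_dup = doc.replace("quiet", "silent")
unrelated = "scaling laws describe how test loss falls as model size data and compute grow together"

sim_dup = estimated_jaccard(minhash_signature(doc), minhash_signature(near_dup))
sim_diff = estimated_jaccard(minhash_signature(doc), minhash_signature(unrelated))
```

Documents whose estimated Jaccard exceeds a threshold (commonly around 0.8) are clustered and all but one representative dropped before training.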
Case Study
Grounding GPT-4 in Legal Documents: LexisNexis's RAG Pipeline for Case Research
LexisNexis's archive — over 140 years of case law, statutory text, and legal commentary spanning hundreds of jurisdictions — represents one of the most demanding knowledge-grounding challenges for LLMs. The development of Lexis+ AI, a RAG-based legal research assistant, required solving retrieval problems that off-the-shelf vector search could not handle. Legal queries involve precise statutory citations, Latin legal maxims, and domain-specific terminology that general embedding models encode poorly. The team built a hybrid retrieval pipeline: dense bi-encoder retrieval for semantic similarity combined with BM25 for exact keyword and citation matching, re-ranked by a cross-encoder fine-tuned on legal document pairs. This reduced the proportion of answers lacking supporting citations from approximately 18% with pure dense retrieval to under 3%.
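One standard way to fuse a dense ranking with a BM25 ranking before re-ranking — shown here as a generic illustration, not LexisNexis's actual implementation — is reciprocal rank fusion: each list contributes a score that decays with rank, so documents ranked highly by either retriever surface early. The case_* document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of doc ids; k damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["case_42", "case_7", "case_99"]   # semantic-similarity order
bm25 = ["case_7", "case_13", "case_42"]    # exact keyword/citation order
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)  # case_7 first: it ranks high on both lists
```

The fused list then goes to the cross-encoder, which only has to re-score a short candidate set rather than the whole corpus.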
The generation prompt required extensive iteration. Early prompts that asked the model to "answer based on the following cases" produced responses mixing retrieved holdings with GPT-4's parametric legal knowledge — often subtly, in ways attorneys only caught on careful review. The team adopted a strict grounding regime: the generation prompt instructs the model to cite a specific retrieved document for every factual claim, to flag any claim it cannot support from retrieved context, and to acknowledge jurisdictional scope limits. Automated faithfulness scoring — comparing each sentence of the generated answer against retrieved documents using a fine-tuned NLI classifier — was integrated into the serving pipeline. The case demonstrates that for high-stakes professional domains, RAG quality is determined by evaluation and guardrail infrastructure as much as by retrieval or generation quality.
Emergent Capabilities
Emergence in LLMs refers to qualitatively new capabilities appearing at certain scale thresholds, essentially absent in smaller models. The Wei et al. (2022) survey documented emergent abilities across 137 BIG-Bench tasks: capabilities including multi-digit arithmetic, word unscrambling, and chain-of-thought reasoning appeared sharply above approximately 50–100B parameters. This non-linearity has significant practical implications: if you evaluate a model class at 7B parameters and observe near-zero performance on a task, you cannot conclude the task is impossible at larger scales. However, Schaeffer et al. (2023) challenged this framing, showing that many apparent emergent phenomena vanish when tasks are scored with smoother, more granular metrics rather than binary exact-match accuracy — suggesting some "emergence" is an artefact of evaluation methodology rather than a genuine phase transition in model capability. The practical takeaway is balanced: empirical evaluation at multiple scales remains the most reliable planning tool, and practitioners should be cautious about both assuming smooth scaling and assuming hard capability thresholds.
Several emergent capabilities have direct practical implications that practitioners should understand. Tool use and function calling — the ability to decide which external tool to invoke, generate correctly formatted tool calls, and incorporate tool results into subsequent reasoning — is an emergent capability that first appears reliably in models at approximately 8B+ parameters with strong instruction tuning. Below this threshold, models frequently generate malformed function calls, invoke tools for the wrong reason, or fail to incorporate tool results correctly. Calibration — the alignment between a model's confidence and its actual accuracy — also improves with scale: larger models are better calibrated in the sense that their token probabilities more accurately reflect their actual uncertainty, making confidence-based filtering (routing low-confidence outputs for human review) more reliable. Multilingual transfer — the ability to reason in a language under-represented in the training data by leveraging reasoning chains in higher-resource languages — is another emergent phenomenon that makes frontier models practically usable in low-resource language markets even without dedicated language-specific training data.
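Calibration is what makes a simple confidence gate workable. The sketch below assumes you can obtain per-token log-probabilities from your serving stack (most APIs and inference servers expose them); the 0.8 threshold and the routing labels are illustrative placeholders to be tuned on held-out data.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalised sequence probability: exp(mean log-prob) of emitted tokens."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def route_for_review(token_logprobs, threshold=0.8):
    # send low-confidence generations to a human rather than auto-accepting them
    return "auto_accept" if sequence_confidence(token_logprobs) >= threshold else "human_review"

confident = [-0.05] * 12                      # model strongly prefers each emitted token
uncertain = [-0.9, -1.2, -0.4, -2.0, -0.7]    # several near-coin-flip choices
```

With a well-calibrated model, the threshold maps directly onto an expected error rate for the auto-accepted bucket; with a poorly calibrated one it does not, which is why this pattern only became practical at larger scales.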
In-Context Learning & Few-Shot
In-context learning (ICL) is the emergent ability to learn a new task from examples provided in the prompt at inference time, without any weight updates. Zero-shot ICL provides only a task description; few-shot ICL includes k labelled input-output examples that demonstrate the desired behaviour. The dominant hypothesis for why ICL works is that autoregressive pre-training on diverse data implicitly encodes a meta-learning algorithm — the model has seen so many task patterns that it has learned to identify and execute the pattern implied by prompt examples. ICL performance is sensitive to example selection (diverse, representative, correctly formatted examples outperform arbitrary ones), example ordering (later examples receive more attention weight), and label format (the model is surprisingly robust to label noise but sensitive to format consistency). Dynamic few-shot selection — choosing the most semantically similar examples from a curated pool at inference time using a retriever — consistently outperforms static selection and is the recommended approach for production pipelines.
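Dynamic few-shot selection needs only an embedder and a similarity ranking. The sketch below substitutes a toy bag-of-words embedding for a real sentence-embedding model so it stays self-contained; the example pool and query strings are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    # toy bag-of-words "embedding"; production systems use a sentence encoder
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_few_shot(query, pool, k=2):
    """Pick the k pool examples most similar to the incoming query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["input"])), reverse=True)
    return ranked[:k]

pool = [
    {"input": "customer asks for refund of broken blender", "label": "refund"},
    {"input": "customer asks where the shipped package is", "label": "shipping"},
    {"input": "customer asks to reset a forgotten password", "label": "account"},
    {"input": "customer requests refund for duplicate charge", "label": "refund"},
]
shots = select_few_shot("customer wants refund for damaged order", pool)
```

The selected examples are then formatted into the prompt ahead of the query, so each request gets demonstrations tailored to its own content.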
Chain-of-Thought & Reasoning
Chain-of-thought (CoT) prompting is the discovery that instructing a model to generate intermediate reasoning steps before its final answer substantially improves performance on multi-step reasoning tasks. Manual CoT (Wei et al., 2022) provides few-shot examples with worked-out step-by-step reasoning; zero-shot CoT (Kojima et al., 2022) simply appends "Let's think step by step." to the prompt with equivalent effect on many benchmarks. The mechanistic rationale is that intermediate tokens provide additional context that the attention mechanism can use when computing the final answer, reducing the probability of shortcut reasoning. CoT is most effective for arithmetic, logical deduction, and multi-hop reasoning; it provides minimal benefit for simple factual retrieval or classification. Its key limitation is that models can produce confident-looking reasoning chains that contain embedded errors, particularly for precise arithmetic. Programme-of-thought prompting — instructing the model to write executable Python code that solves the problem, then running the code externally — is a more reliable alternative for quantitative tasks, delegating computation to an interpreter rather than requiring the LLM to perform it in prose.
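A programme-of-thought loop is short: ask the model for executable code, run it, and read off the result. In this sketch the "generated" code is a hard-coded string standing in for a real model response, and exec is used bare for brevity; production systems run model-written code in a sandboxed subprocess, never in-process.

```python
# Stand-in for a model response to: "Write Python that computes the order total."
generated_code = """
prices = [19.99, 5.50, 3.25]
quantities = [3, 10, 4]
total = sum(p * q for p, q in zip(prices, quantities))
"""

def run_program_of_thought(code):
    namespace = {}
    exec(code, namespace)       # delegate the arithmetic to the interpreter
    return namespace["total"]

answer = run_program_of_thought(generated_code)
```

The interpreter cannot produce a plausible-but-wrong sum the way prose reasoning can, which is the entire appeal for quantitative tasks.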
The LLM Landscape
The LLM landscape has stratified into two broad categories: closed frontier models available only via API, and open-weight models whose parameters are publicly downloadable. The choice between them is a genuine engineering decision with significant implications for cost, latency, data privacy, and customisability — not simply a choice between "better" and "worse". The capability gap between the two categories has narrowed dramatically since 2023: leading open-weight models now match or exceed GPT-3.5-class performance on most benchmarks, and for specific well-defined tasks, fine-tuned open-weight models can outperform general-purpose frontier APIs.
Open vs. Closed Models
Closed frontier models — GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google) — offer the highest general-purpose capability, strong safety alignment from extensive RLHF, and the convenience of fully managed infrastructure. The tradeoffs are real: API pricing compounds at scale; data passes through third-party infrastructure, which may be prohibited under GDPR, HIPAA, or enterprise data residency requirements; base weights are inaccessible, precluding fine-tuning; and the provider controls the model version schedule, meaning silent capability changes can cause production regressions. Open-weight models — Meta's LLaMA 3 (8B and 70B), Mistral 7B, Mixtral 8x7B (a sparse mixture-of-experts model that activates roughly 13B of its 47B total parameters per token), Google's Gemma and Gemma 2, Alibaba's Qwen 2.5, and DeepSeek — offer downloadable weights, self-hostable inference, and full fine-tuning capability. The cost advantages at scale are substantial: self-hosted inference on 100M tokens per day at the 7B scale costs a fraction of equivalent API usage, at the cost of infrastructure engineering and reliability management.
Many production systems use a hybrid architecture: closed frontier APIs for low-volume, high-complexity queries (complex multi-step reasoning, ambiguous edge cases, high-stakes decisions) and self-hosted open-weight models for high-volume, well-defined tasks (classification, extraction, summarisation at scale). This approach optimises both cost and capability. Key model selection criteria beyond cost are: first-token latency requirements (sub-100ms is achievable with small self-hosted models but rarely with external APIs); data privacy constraints (regulated industries typically require self-hosted or region-specific deployment); fine-tuning needs (domain adaptation or style alignment requiring persistent weight modification); and compliance requirements (some jurisdictions require auditable model versioning, which API providers cannot always guarantee).
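A hybrid deployment usually starts with a routing function. The heuristic below — keyword and length based, with invented model names and thresholds — is only a sketch of the pattern; real routers are often small trained classifiers, and the routing signal should be validated against labelled traffic.

```python
def route_query(query, history_turns=0):
    """Send well-defined, high-volume work to a cheap local model and
    open-ended reasoning to a frontier API. All thresholds are placeholders."""
    hard_markers = ("why", "explain", "compare", "trade-off", "design")
    well_defined = ("classify", "extract", "summarize", "translate")
    q = query.lower()
    if any(m in q for m in well_defined) and len(q.split()) < 200:
        return "local-llama-3-8b"       # hypothetical self-hosted deployment name
    if any(m in q for m in hard_markers) or history_turns > 5:
        return "frontier-api"           # hypothetical managed-API alias
    return "local-llama-3-8b"
```

Even a crude router like this captures the economics: if 90% of traffic is classification and extraction, 90% of tokens never touch frontier pricing.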
Benchmarks & Evaluation
The canonical LLM benchmark suite covers distinct capability dimensions: MMLU (Massive Multitask Language Understanding) tests factual knowledge across 57 academic subjects; HumanEval tests code generation via functional correctness on Python programming problems; GSM8K evaluates grade-school mathematics reasoning; HellaSwag tests commonsense NLI via sentence completion; TruthfulQA probes factual accuracy by testing whether models propagate popular misconceptions; MATH covers competition-level mathematics; and BIG-Bench Hard focuses on tasks where even frontier models remain below human performance. No single benchmark captures all relevant capability dimensions, and evaluating across the full suite is the minimum standard for a meaningful model comparison.
The Chatbot Arena (LMSYS) approach produces the most trustworthy general-purpose capability rankings: real users submit any query to two anonymised models, evaluate which response they prefer, and the results aggregate into Elo ratings over millions of comparisons. Because queries come from real users with genuine tasks, contamination is not a concern and the evaluation distribution reflects actual use. The canonical limitation of all public benchmarks is the Goodhart's Law problem: once a benchmark becomes a widely cited capability signal, model developers optimise for it — through benchmark-specific data inclusion or fine-tuning — inflating scores without improving general capability. Practitioners should always supplement public benchmarks with task-specific evaluation on datasets reflecting their actual production query distribution.
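The Elo aggregation behind Arena-style rankings reduces to a one-line update per pairwise vote: the winner takes rating points in proportion to how unexpected the win was. K = 32 is a conventional choice; tie handling is omitted for brevity.

```python
def elo_update(r_a, r_b, winner, k=32):
    """Apply one pairwise comparison; winner is 'a' or 'b'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))   # win probability implied by ratings
    score_a = 1.0 if winner == "a" else 0.0
    delta = k * (score_a - expected_a)                      # big upset -> big transfer
    return r_a + delta, r_b - delta

ra, rb = elo_update(1000.0, 1000.0, "a")   # evenly matched: winner gains k/2
```

Aggregated over millions of votes, these updates converge to a stable ranking in which rating gaps translate to predicted win rates.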
The economics of the LLM market are shifting rapidly in practitioners' favour. Competition between providers has driven frontier model API prices down by approximately 80% over two years. Open-weight models that matched GPT-3.5-class performance required 70B parameters in 2023; by 2025, 7B and 8B models achieve the same benchmark results due to improved training data, better instruction tuning, and advances in the scaling efficiency of smaller, longer-trained models. Mixture-of-experts (MoE) architectures — Mixtral 8x7B, DeepSeek-V2, and Qwen2.5-MoE — provide the parameter capacity of large dense models at the inference cost of smaller ones by activating only a subset of experts per token. For practitioners, this landscape means: the right model for most production tasks is almost always smaller and cheaper than intuition suggests; benchmark the actual deployment task, not general capability; and factor in total cost of ownership (API cost + infra cost + engineering cost) when making model selection decisions, not capability scores alone.
Production Warning: Deploying an LLM without red-teaming it for your specific use case is the most common source of costly post-launch incidents. A model that tops MMLU may have systematic failure modes on your domain-specific query types — generating legally problematic content, leaking system prompt information, or producing confident errors on edge cases your eval set did not cover. Red-team before launch, not after.
Code: Local LLM Inference with HuggingFace
The following example demonstrates production-ready local inference with Llama 3.1 8B Instruct using HuggingFace Transformers, including the correct instruction template format, bfloat16 precision for efficiency, and automatic device placement across available GPUs. Understanding the generation hyperparameters — temperature, top_p, max_new_tokens — is essential for controlling output quality and cost.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Local LLM inference — Llama 3.1 8B Instruct
model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # auto-distributes across available GPUs
)

# Instruction-formatted prompt
messages = [
    {"role": "system", "content": "You are a data analysis expert."},
    {"role": "user", "content": "Explain the difference between RMSE and MAE in plain English."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )

response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
Memory Estimate: LLaMA 3.1 8B in bfloat16 requires ~16GB of GPU VRAM. For 4-bit quantised inference (GPTQ or AWQ), the same model fits in ~5GB, enabling deployment on consumer GPUs. Pass quantization_config=BitsAndBytesConfig(load_in_4bit=True) to from_pretrained to enable 4-bit loading with minimal quality loss (<2% degradation on most benchmarks).
Code: High-Throughput Serving with vLLM
For production serving at scale, vLLM's PagedAttention algorithm provides 2–4x higher throughput than naive HuggingFace inference by managing KV cache memory in discrete pages — similar to virtual memory paging in an OS — rather than pre-allocating the full maximum context length per request. The following example demonstrates batched inference with tensor parallelism across multiple GPUs and the key configuration parameters that govern throughput.
from vllm import LLM, SamplingParams

# vLLM: PagedAttention for efficient KV cache management
# 2-4x higher throughput than naive HuggingFace inference
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,       # distribute across 2 GPUs
    gpu_memory_utilization=0.85,
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# Batch inference — PagedAttention handles variable-length sequences efficiently
prompts = [
    "Summarize this earnings report in 3 bullet points: [REPORT TEXT]",
    "Extract all dates from: 'The contract runs from Jan 2024 to Dec 2025'",
    "Classify sentiment: 'The product quality is good but delivery was terrible'",
]
outputs = llm.generate(prompts, sampling_params)  # batched, continuous batching

for output in outputs:
    print(f"Prompt: {output.prompt[:50]}...")
    print(f"Response: {output.outputs[0].text[:100]}\n")

# Throughput: ~2,000 tokens/sec on 2x A100 80GB vs ~500 tokens/sec naive
Throughput Benchmarking: To measure your actual serving throughput, run vLLM's built-in benchmark: python -m vllm.entrypoints.openai.api_server --model [model_id] to start the server, then use benchmark_serving.py to fire concurrent requests at different batch sizes. Throughput typically scales linearly with batch size up to ~16 concurrent requests, then plateaus as GPU memory becomes the bottleneck.
Code: Token Counting & Context Window Management
Token budgeting is a critical production concern. Exceeding a model's context window raises an error; approaching it degrades generation quality. The following code demonstrates reliable token counting with tiktoken and a pragmatic strategy for gracefully truncating long documents while preserving both the system instructions (at the start) and the most recent context (at the end) — the two regions that receive the most model attention.
import tiktoken  # OpenAI's tokenizer

def count_tokens_openai(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def truncate_to_context_window(text: str, max_tokens: int = 100_000,
                               model: str = "gpt-4o") -> str:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Truncate from the middle to preserve the start (system instructions)
    # and the end (recent context)
    keep_start = max_tokens * 2 // 3
    keep_end = max_tokens // 3
    preserved = tokens[:keep_start] + tokens[-keep_end:]
    return enc.decode(preserved)

# Real-world example: summarizing a 500-page document
# (long_document is assumed to be loaded elsewhere)
doc_tokens = count_tokens_openai(long_document)
print(f"Document: {doc_tokens:,} tokens")  # e.g., 150,000 tokens
print(f"Fits in GPT-4o context: {doc_tokens <= 128_000}")      # False at 150K (128K window)
print(f"Fits in Claude 3.5 context: {doc_tokens <= 200_000}")  # True (200K context)
# For documents > 200K tokens: chunking + summarize-then-synthesize strategy
Code: Context Budget Management
The following implementation provides a reusable context budget allocator that dynamically partitions a model's context window across system prompt, conversation history, retrieved context, and user input. It prevents hard context limit errors, enables predictable cost control, and applies principled truncation strategies to each component when limits are approached.
```python
import tiktoken
from dataclasses import dataclass
from typing import Optional

enc = tiktoken.encoding_for_model("gpt-4o")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def count_messages_tokens(messages: list[dict]) -> int:
    """Count tokens in OpenAI chat messages format."""
    total = 0
    for msg in messages:
        total += 4  # overhead per message
        total += count_tokens(msg.get("content", ""))
        total += count_tokens(msg.get("role", ""))
    return total + 3  # reply priming overhead

@dataclass
class ContextBudget:
    """Allocate token budget across prompt components."""
    model_context_limit: int = 128_000
    safety_margin: float = 0.05    # 5% reserve
    response_buffer: int = 2_000   # reserved for model output
    # Proportional budget allocations
    system_frac: float = 0.10      # 10% for system prompt
    history_frac: float = 0.25     # 25% for conversation history
    retrieval_frac: float = 0.50   # 50% for retrieved context
    # Remainder goes to user input

    @property
    def usable_tokens(self) -> int:
        return int(self.model_context_limit * (1 - self.safety_margin)) - self.response_buffer

    def budgets(self) -> dict[str, int]:
        u = self.usable_tokens
        return {
            'system': int(u * self.system_frac),
            'history': int(u * self.history_frac),
            'retrieval': int(u * self.retrieval_frac),
            'input': u - int(u * (self.system_frac + self.history_frac + self.retrieval_frac)),
        }

def truncate_text(text: str, budget: int) -> str:
    """Truncate text to a token budget (token-level, not character-level)."""
    tokens = enc.encode(text)
    return text if len(tokens) <= budget else enc.decode(tokens[:budget])

def truncate_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep most recent messages within budget (preserve recency)."""
    result, used = [], 0
    for msg in reversed(messages):
        tokens = count_tokens(msg.get("content", "")) + 4
        if used + tokens > budget:
            break
        result.insert(0, msg)
        used += tokens
    return result

def truncate_retrieval(chunks: list[str], budget: int) -> list[str]:
    """Trim retrieved chunks from the end (lowest relevance last)."""
    result, used = [], 0
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if used + tokens > budget:
            break
        result.append(chunk)
        used += tokens
    return result

def build_context(system_prompt: str, history: list[dict],
                  retrieved_chunks: list[str], user_input: str,
                  budget: Optional[ContextBudget] = None) -> list[dict]:
    """Build a context-safe message list for the OpenAI Chat API."""
    budget = budget or ContextBudget()
    b = budget.budgets()
    # Truncate each component to its budget
    sys = truncate_text(system_prompt, b['system'])
    hist = truncate_history(history, b['history'])
    ctx = truncate_retrieval(retrieved_chunks, b['retrieval'])
    ctx_text = "\n\n".join(ctx) if ctx else ""
    sys_with_context = f"{sys}\n\nContext:\n{ctx_text}" if ctx_text else sys
    messages = [{"role": "system", "content": sys_with_context}]
    messages.extend(hist)
    messages.append({"role": "user", "content": user_input})
    total = count_messages_tokens(messages)
    print(f"[Context] total={total:,} | limit={budget.usable_tokens:,} | "
          f"history={len(hist)} turns | chunks={len(ctx)}")
    return messages

# Usage
history = [{"role": "user", "content": "What's your return policy?"},
           {"role": "assistant", "content": "30 days for all items."}]
chunks = ["Our return policy covers 30 days for all items with receipt.",
          "Electronics returns require original packaging.",
          "Shipping costs for returns are the customer's responsibility."]
msgs = build_context(
    system_prompt="You are a helpful customer support assistant.",
    history=history,
    retrieved_chunks=chunks,
    user_input="Can I return a laptop without the original box?",
)
# prints e.g.: [Context] total=287 | limit=119,600 | history=2 turns | chunks=3
```
Production Pattern: Log the token counts from every production API call — system, history, retrieval, input, and output tokens separately. Tracking these over time reveals whether context usage is growing (e.g., accumulating history in long sessions), which component is driving cost, and when you're approaching limits before hard errors occur. Set up an alert when any call exceeds 80% of the model's context limit.
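The alerting rule described above can be sketched in a few lines. This is a minimal illustration using Python's standard logging module; the `TokenUsage` record and its field names are hypothetical, standing in for whatever per-call accounting your serving layer produces.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm.tokens")

# Hypothetical per-call usage record; the fields mirror the components
# discussed above, not any specific provider's response schema.
@dataclass
class TokenUsage:
    system: int
    history: int
    retrieval: int
    input: int
    output: int

    @property
    def total(self) -> int:
        return self.system + self.history + self.retrieval + self.input + self.output

def log_usage(usage: TokenUsage, context_limit: int = 128_000,
              alert_frac: float = 0.80) -> bool:
    """Log per-component token counts; return True if the alert threshold fired."""
    logger.info("tokens system=%d history=%d retrieval=%d input=%d output=%d total=%d",
                usage.system, usage.history, usage.retrieval,
                usage.input, usage.output, usage.total)
    if usage.total > alert_frac * context_limit:
        logger.warning("call used %.0f%% of the %d-token context limit",
                       100 * usage.total / context_limit, context_limit)
        return True
    return False

# A call well under the limit does not fire the alert
assert log_usage(TokenUsage(500, 3_000, 8_000, 200, 900)) is False
# A call above 80% of a 128K limit does
assert log_usage(TokenUsage(2_000, 30_000, 70_000, 1_000, 2_000)) is True
```

In production these records would feed a metrics backend rather than a logger, but the per-component breakdown and the 80% threshold carry over unchanged.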
LLM Model Comparison (Mid-2025)
The following comparison covers the major LLMs available in mid-2025. Model capabilities evolve rapidly — always verify against current benchmarks and provider documentation before making architectural decisions. The "API Cost" column reflects approximate pricing at time of writing; costs typically decrease over time as competition intensifies.
| Model | Provider | Context Window | Open-Weight | Strengths | API Cost / 1M tokens (approx.) |
| --- | --- | --- | --- | --- | --- |
| GPT-4o | OpenAI | 128K tokens | No | Best general reasoning, coding, multimodal (vision + audio) | $5 input / $15 output |
| Claude 3.5 Sonnet | Anthropic | 200K tokens | No | Long-context, instruction following, coding, reduced hallucination | $3 input / $15 output |
| Gemini 1.5 Pro | Google | 1M tokens | No | Massive context window, multimodal, audio/video understanding | $3.50 input / $10.50 output |
| LLaMA 3.1 70B | Meta | 128K tokens | Yes | Near GPT-4 class open-weight; fine-tunable; data privacy | Self-hosted ~$0.50–$1.00 (infra cost) |
| Mistral Large | Mistral AI | 32K tokens | Partial | Strong reasoning, European data residency, efficient inference | $2 input / $6 output |
Model Selection Heuristic: For most enterprise applications, start with GPT-4o-mini or Claude 3 Haiku (the smaller, cheaper variants) for high-volume tasks and GPT-4o or Claude 3.5 Sonnet for complex reasoning tasks. Only migrate to self-hosted open-weight models when your monthly API bill exceeds ~$5K/month or when data residency requirements are non-negotiable.
LLM Evaluation Benchmarks
Understanding what each benchmark actually measures — and what it does not — is essential for interpreting model comparison charts. Every benchmark has a specific failure mode or limitation; no single benchmark should be treated as a comprehensive capability signal.
| Benchmark | What It Tests | Weakness / Limitation | Typical Frontier Score |
| --- | --- | --- | --- |
| MMLU | Factual knowledge across 57 academic subjects (science, law, medicine, etc.) | Multiple choice; susceptible to contamination; doesn't test reasoning depth | 85–92% (GPT-4o class) |
| HumanEval | Python code generation — functional correctness on 164 programming problems | Only Python; simple algorithmic tasks; easier than real codebases | 85–90% (GPT-4o, Claude 3.5) |
| GSM8K | Grade-school math word problems — multi-step arithmetic reasoning | Simple enough that near-saturation achieved; doesn't test hard math | 95%+ (frontier models) |
| MATH | Competition-level mathematics (AMC/AIME difficulty) | Requires symbolic manipulation; still hard for all models | 60–78% (best models) |
| HellaSwag | Commonsense reasoning — completing physical situation descriptions plausibly | Near-saturated by frontier models; marginal differentiation | 95%+ (frontier models) |
| TruthfulQA | Factual accuracy — whether models avoid propagating popular misconceptions | Fixed question set; models can be fine-tuned to score well without being truthful | 60–80% (varies widely) |
| Chatbot Arena | Head-to-head human preference ratings across real user queries (Elo ranking) | Biased toward verbosity and formatting; doesn't isolate specific capability dimensions | Elo ~1250–1320 (top models, Jan 2025) |
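Chatbot Arena's ratings come from pairwise human votes. As a rough illustration of how such ratings move, here is the classic sequential Elo update rule; the live leaderboard fits a statistical model over all votes at once rather than updating sequentially, but the intuition is the same.

```python
def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update: score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Equal ratings, A wins: A gains k/2 = 16 points, B loses 16
a, b = elo_update(1000.0, 1000.0, 1.0)
assert (a, b) == (1016.0, 984.0)

# An upset (much lower-rated model wins) moves ratings far more
a, b = elo_update(1000.0, 1300.0, 1.0)
assert a - 1000.0 > 16.0
```

The 400-point scale means a 400-Elo gap corresponds to roughly 10:1 expected win odds, which is why the ~70-point spread among top models reflects fairly close head-to-head preference rates.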
Deployment Engineering
The gap between a capable model and a reliable production system is primarily an engineering problem. A model that scores well on benchmarks can still be unusable in production if it has unacceptable latency, unpredictable throughput under load, no observability into its failures, or no guardrails against the adversarial and out-of-distribution inputs that production traffic inevitably contains. Deployment engineering encompasses the serving infrastructure, optimisation techniques, monitoring systems, and safety layers that transform a model checkpoint into a production service.
The key deployment concerns are: time-to-first-token (TTFT), which governs user-perceived responsiveness and must typically be below 1 second for interactive applications; throughput (tokens generated per second per GPU), which determines cost at scale; reliability (uptime and graceful degradation under load); and observability (logging of inputs, outputs, latency, token counts, and error rates for debugging and improvement). For self-hosted deployments, the serving stack choices are: vLLM (optimised for high-throughput continuous batching with PagedAttention for efficient KV cache management), Hugging Face Text Generation Inference (TGI, production-grade with built-in quantisation support), and Ollama (lightweight local development and testing). For cloud-hosted deployments, managed endpoints from OpenAI, Anthropic, Google, Together AI, and Fireworks AI provide various throughput and SLA guarantees.
Inference Optimization & Serving
Several key techniques reduce inference cost and latency for self-hosted LLMs. Continuous batching (PagedAttention in vLLM) processes multiple requests simultaneously, dynamically allocating GPU memory in pages rather than reserving the full maximum context length per request — improving GPU utilisation by 2–4x over naive batching. Speculative decoding uses a smaller draft model to generate candidate token sequences that a larger target model then verifies in parallel, reducing end-to-end latency for the target model by 2–3x on typical text generation tasks. Quantisation reduces model precision from 32-bit or 16-bit floating point to 4-bit or 8-bit integers (GPTQ, AWQ, and GGUF formats), reducing memory requirements by 2–4x with typically less than 2% quality degradation on most tasks, enabling much larger models to run on a given GPU configuration.
The memory hierarchy is the primary physical constraint in LLM serving. GPU High Bandwidth Memory (HBM) — typically 40–80GB on data centre GPUs — must hold the model weights, KV cache for active requests, and activations for in-flight generation. A 70B parameter model in 16-bit precision requires ~140GB of GPU memory, requiring multi-GPU tensor parallelism to serve. Quantised to 4-bit (AWQ), the same model requires ~35GB, fitting on a single high-end GPU. KV cache size scales linearly with sequence length and batch size: at 128K context length with a large batch, KV cache can exceed model weight memory. Tokens per second per dollar (or tokens per second per GPU-hour) is the key cost metric for self-hosted deployments and should be the primary optimisation target after quality requirements are met.
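The memory arithmetic above is easy to reproduce. A sketch, assuming a LLaMA-3-70B-like shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); exact figures vary by architecture.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory to hold model weights at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

def kv_cache_gb(seq_len: int, batch: int, n_layers: int,
                n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

# 70B weights: ~140GB at fp16, ~35GB at 4-bit (the figures quoted above)
assert round(weight_memory_gb(70e9, 16)) == 140
assert round(weight_memory_gb(70e9, 4)) == 35

# KV cache for this shape at 128K context, batch 8, fp16:
cache = kv_cache_gb(128_000, 8, 80, 8, 128)
# At long context and large batch, the KV cache alone exceeds the 140GB of weights
assert cache > weight_memory_gb(70e9, 16)
```

This is why long-context serving is often KV-cache-bound rather than weight-bound, and why paged KV cache management (as in vLLM) matters so much for throughput.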
Hallucination & Safety Guardrails
Hallucination is the most commercially consequential LLM failure mode: the generation of plausible-sounding but factually incorrect content. Open-domain hallucination (fabricating facts from parametric knowledge) and closed-domain hallucination (contradicting provided context) require different mitigations. For open-domain hallucination, RAG is the primary mitigation: grounding responses in retrieved, verified content and instructing the model to express uncertainty when evidence is insufficient. For closed-domain hallucination, automated faithfulness verification — using an NLI classifier or LLM judge to check whether each claim in the response is supported by the provided context — is essential for any application where factual accuracy is consequential. Temperature reduction and self-consistency (majority voting over multiple generations) also reduce hallucination rates at the cost of increased inference time.
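Self-consistency can be implemented with a few lines around any sampling loop. A minimal sketch, assuming the sampled answers have already been collected and that exact-match normalisation is sufficient (real systems often need semantic matching rather than string equality).

```python
from collections import Counter

def self_consistent_answer(generations: list[str]) -> tuple[str, float]:
    """Majority vote over multiple sampled generations.

    Returns (answer, agreement). Low agreement is itself a useful
    hallucination-risk signal worth logging or thresholding on.
    """
    normalized = [g.strip().lower() for g in generations]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

# e.g. five sampled answers to the same factual question
samples = ["1969", "1969", "1969 ", "1968", "1969"]
answer, agreement = self_consistent_answer(samples)
assert answer == "1969" and agreement == 0.8
```

An application can then abstain or escalate when agreement falls below a threshold, trading the cost of N generations for a calibrated confidence signal.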
Safety guardrails for production LLM systems typically operate at three layers. Input guardrails: classifiers or rule-based filters that screen incoming requests for harmful content, prompt injection attempts, or policy-violating queries before they reach the model. Output guardrails: classifiers applied to the model's response before delivery, checking for harmful content, policy violations, or anomalous patterns. Behavioural monitoring: tracking output distributions over time to detect model drift, unusual response patterns, or systematic failures that individual request classifiers miss. All three layers are necessary in production; relying on any single layer as the sole safety control is insufficient. Monitoring dashboards that surface hallucination rates, safety filter trigger rates, and topic distribution of production traffic are essential operational tools for any team running LLMs in a regulated or high-stakes environment.
Multi-region and multi-provider deployment strategies address the availability and compliance requirements of global enterprise applications. Data residency regulations — GDPR, the EU AI Act, and equivalents in India, China, Brazil, and Saudi Arabia — may require that certain data never leave a specific geographic region. Multi-region deployment uses provider-specific regional API endpoints (AWS Bedrock regions, Azure OpenAI regional deployments, Google Vertex AI zones) to ensure data locality. For high-availability architectures, a primary provider in the target region with a geographically co-located secondary provider and an on-premise fallback for critical workloads provides three tiers of resilience. Latency routing — directing each user's request to the lowest-latency endpoint at the time of the request using DNS-level geolocation — improves response times by 50–200ms for globally distributed user bases. Implementing multi-region LLM deployment correctly requires provider-specific evaluation of data processing agreements, subprocessor lists, and audit controls — not just infrastructure configuration — to satisfy regulatory compliance obligations.
Fine-Tuning & Model Adaptation
Fine-tuning adapts a pre-trained model's weights using a smaller, domain-specific dataset, updating the model to specialise in a target task, domain, or response style beyond what prompting alone can achieve. The decision between prompt engineering and fine-tuning is one of the most consequential choices in any LLM application. Fine-tuning incurs significant upfront cost — data curation, training compute, evaluation infrastructure, model versioning, and ongoing retraining cadence — but can produce consistent performance improvements, smaller effective model sizes (a fine-tuned 7B can match a general-purpose 70B on a well-defined task), and fundamentally lower inference cost per task. Prompt engineering is faster to iterate, requires no compute investment, and is appropriate for a wide range of tasks — but has ceiling effects, is vulnerable to prompt sensitivity, and cannot encode knowledge that the base model genuinely lacks.
Fine-Tuning Approach Quick Reference
| Approach | GPU Memory (7B model) | Trainable Params | Quality vs. Full FT | Best For |
| --- | --- | --- | --- | --- |
| Full Fine-Tuning | ~140GB (fp16) | 100% | Baseline | Maximum quality when compute budget allows; large datasets (>100K examples) |
| LoRA (rank 16) | ~16GB (fp16) | <0.3% | 97–99% of full FT | Most production fine-tuning scenarios; good balance of quality and efficiency |
| QLoRA (4-bit + LoRA) | ~5GB | <0.3% (fp16 adapters) | 96–98% of full FT | Consumer GPU fine-tuning; teams without data centre access; rapid experimentation |
| IA3 / Prompt Tuning | ~14GB (fp16) | <0.01% | 90–95% of full FT | Extreme parameter efficiency; soft prompt learning; few-shot style adaptation |
| Continued Pre-Training | ~140GB (full) or ~5GB (QLoRA) | All (or LoRA adapters) | Varies by domain gap | Domain injection (medical, legal, scientific); significant vocabulary gap from base model |
When to Fine-Tune vs. Prompt-Engineer
The practical decision tree for fine-tuning starts with two questions. First: does the task require knowledge or capabilities absent from the base model, or does it merely require directing existing capabilities? If a task requires the model to answer questions about your company's internal processes, product specifications, or domain-specific terminology that post-dates the training cutoff, RAG or fine-tuning on that content is necessary; prompting alone cannot inject information the model does not have. If the task requires a specific output format, response style, or persona that the model consistently deviates from despite careful prompting, fine-tuning the format into the weights is more reliable than repeating format instructions in every request. Second: is the performance gain worth the operational overhead? Fine-tuned models must be versioned, served as distinct endpoints, re-evaluated after base model updates, and retrained when task requirements change — all costs with no equivalent in prompt-based solutions that use managed API endpoints.
The most productive fine-tuning scenarios are: (a) consistent output format — teaching the model to always produce responses in a specific JSON schema, XML format, or structured report template that few-shot prompting fails to enforce reliably; (b) domain-specific vocabulary — adapting to dense technical terminology (medical, legal, financial) that the base model encodes imprecisely; (c) response style alignment — calibrating the model to a brand voice, level of formality, or response length that differs substantially from its default; (d) task-specific capability — teaching a new task type for which there is substantial labelled data and where task performance is the primary value driver. Scenarios where fine-tuning is typically not worth the investment include: tasks where careful prompting already achieves acceptable performance; tasks with rapidly changing requirements (fine-tuning lags prompting for fast iteration); and tasks requiring general world knowledge, where the base model's breadth is an asset.
LoRA & Parameter-Efficient Fine-Tuning
Parameter-efficient fine-tuning (PEFT) methods update only a small fraction of model parameters rather than the full weight matrix, dramatically reducing training memory requirements and enabling fine-tuning on consumer or single-GPU hardware. Low-Rank Adaptation (LoRA; Hu et al., 2022) is the dominant PEFT approach: it freezes the original weight matrices and adds small trainable rank-decomposition matrices to the attention layers. A rank-16 LoRA adapter for a 7B model adds approximately 17M trainable parameters — less than 0.3% of the 7B base weights — but achieves fine-tuning quality within 1–3% of full fine-tuning on most tasks. The adapter matrices can be saved separately (typically a few hundred MB) and merged with the base weights at inference time, adding no serving latency.
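The parameter arithmetic behind that ~17M figure is straightforward. A sketch, assuming a LLaMA-2-7B-like shape (32 layers, hidden size 4096) with rank-16 adapters on all four attention projections (query, key, value, output), each a 4096×4096 matrix.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter for a d_out x d_in weight adds B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)

# LLaMA-2-7B-like attention: q/k/v/o projections are all 4096x4096, 32 layers
hidden, layers, rank = 4096, 32, 16
per_layer = 4 * lora_params(hidden, hidden, rank)   # four projections per layer
total = layers * per_layer
print(f"{total / 1e6:.1f}M trainable parameters")   # ~16.8M, under 0.3% of 7B
assert total == 16_777_216
```

Doubling the rank doubles the adapter size but leaves it tiny relative to the base model, which is why rank is tuned for task fit rather than memory.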
Quantised LoRA (QLoRA; Dettmers et al., 2023) takes PEFT further: it quantises the frozen base model to 4-bit precision during training, then fine-tunes the LoRA adapters in 16-bit. This reduces the GPU memory required to fine-tune a 7B model from approximately 16GB (16-bit LoRA) to approximately 5GB (QLoRA), enabling fine-tuning on a single RTX 3090 or 4090. The quality trade-off versus full fine-tuning is minimal (typically <1%) for most tasks. The Hugging Face PEFT and trl (Transformer Reinforcement Learning) libraries make QLoRA fine-tuning accessible with under 30 lines of configuration code. For teams without dedicated ML infrastructure, QLoRA on a single GPU is the practical entry point for domain adaptation of open-weight models.
Choosing LoRA hyperparameters is more art than science but follows empirically established heuristics. Rank (r): higher rank captures more task-specific information but risks overfitting on small datasets; 8–32 covers most production scenarios, with 64–128 reserved for large datasets or tasks requiring substantial style shift. Alpha (the LoRA scaling factor): typically set to 2x the rank. Target modules: applying LoRA to both query and value attention projections is standard; applying to all linear layers (including MLP) provides marginal improvement at 2x the adapter size. Learning rate: QLoRA fine-tuning typically requires lower learning rates (1e-4 to 2e-4) than full fine-tuning; the warmup ratio should be 3–5% of total steps. Dataset size: for format alignment or style tasks, 1,000–5,000 high-quality examples are often sufficient; for knowledge injection, 10,000–50,000 examples are typically needed to achieve reliable generalisation.
Instruction Tuning & Chat Alignment
Base pre-trained LLMs are next-token predictors trained to continue text — they are not inherently helpful assistants. Instruction tuning transforms a base model into a chat model by fine-tuning it on a large dataset of (instruction, response) pairs spanning diverse task types, teaching the model to follow instructions, maintain conversation format, and provide helpful, well-formatted answers. The original InstructGPT paper (Ouyang et al., 2022) demonstrated that RLHF — using human preference rankings to train a reward model, then using the reward model to update the LLM via PPO reinforcement learning — substantially improved instruction following and reduced harmful outputs relative to supervised instruction tuning alone. This three-stage recipe (pre-training → supervised instruction tuning → RLHF) is the foundation of every chat-aligned model from GPT-4 to Claude to Gemini.
Direct Preference Optimization (DPO; Rafailov et al., 2023) has become a widely-used alternative to RLHF that eliminates the need for a separate reward model training stage, directly optimising on preference data pairs (chosen vs. rejected responses) in a single fine-tuning pass. DPO produces comparable alignment quality to RLHF on most benchmarks with simpler implementation and less training instability. For practitioners building domain-specific chat models, DPO is currently the recommended approach for preference alignment after initial instruction tuning. Constructing high-quality preference data — expert-rated pairs that capture the preference distinctions relevant to your specific application — is more important than the choice between RLHF and DPO for achieving good alignment results.
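The DPO objective itself is compact enough to write out. A per-pair sketch with scalar log-probabilities standing in for the sums over response tokens; `beta` controls how far the policy is allowed to drift from the reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * policy margin over the reference).

    The margin is how much more the policy prefers the chosen response
    over the rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy matches the reference, the margin is 0 and the loss is log 2
assert abs(dpo_loss(-10.0, -12.0, -10.0, -12.0) - math.log(2)) < 1e-12
# Preferring the chosen response more than the reference does lowers the loss
assert dpo_loss(-9.0, -13.0, -10.0, -12.0) < math.log(2)
```

Minimising this over a preference dataset pushes the policy to widen the chosen-over-rejected margin, which is exactly what the RLHF reward-model-plus-PPO pipeline achieves in two stages.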
Continued pre-training is a distinct technique from instruction tuning — it extends the base model's pre-training on a domain-specific corpus without using labelled instruction pairs, teaching the model new vocabulary, facts, and reasoning patterns from raw domain text. It is appropriate when the target domain is sufficiently different from the base model's training distribution that fine-tuning alone produces poor grounding: medical literature using clinical terminology, legal codes using jurisdiction-specific conventions, financial filings using regulatory language, or scientific papers using domain-specific notation. Continued pre-training typically requires more compute than instruction fine-tuning (longer training runs over large unlabelled corpora), but produces a fundamentally more capable base model for the domain that responds better to both prompting and subsequent instruction tuning than starting directly from a general-purpose base. The recommended approach for most teams is to evaluate whether domain-specific RAG over existing models achieves acceptable performance before committing to continued pre-training, as the engineering investment is significantly higher.
Case Study
Fine-Tuning Llama for Medical Triage at Scale: Lessons from a Hospital Network Deployment
A hospital network deploying an LLM to support nurse triage documentation — summarising patient intake notes and suggesting ICD-10 coding categories — faced a challenge that illustrates the fine-tuning decision calculus clearly. The base Llama 3.1 8B model, prompted with carefully engineered instructions, achieved approximately 71% agreement with expert coders on ICD-10 category selection and consistently produced summaries that mixed clinical and lay terminology inconsistently. Increasing to a 70B model raised coding agreement to 78% but made the per-query cost prohibitive at the required volume of 40,000 notes per month. The team chose QLoRA fine-tuning of the 8B model on a dataset of 15,000 annotated triage notes and coding pairs, producing a model that achieved 84% coding agreement — exceeding the 70B baseline — with inference cost 9x lower than the prompted 70B approach.
The key lessons from the deployment: (a) data quality dominated data quantity — the initial 15,000-example dataset contained labelling inconsistencies from 8 different coders, and cleaning it to 11,000 high-consistency examples improved final performance more than the additional 4,000 noisy examples; (b) domain-specific evaluation was non-negotiable — MMLU scores and general coding benchmarks were uncorrelated with the deployment metric (ICD-10 agreement rate with expert coders), confirming that task-specific held-out evaluation is the only reliable performance signal; (c) the model needed continuous retraining — coding category updates and clinical terminology evolution meant the model required quarterly fine-tuning updates to maintain performance, making the full MLOps pipeline (data versioning, automated evaluation, staged rollout) as important as the fine-tuning methodology itself.
LoRA · Domain Adaptation · PEFT
Context Window Management
The context window is the most fundamental constraint in LLM application design. Every input token costs money, contributes latency, and competes for the model's attention budget. Exceeding the context limit raises a hard error; approaching it degrades retrieval quality and coherence. Effective context management — deciding what information to include, how to represent it compactly, and how to handle inputs that exceed window limits — is one of the most practically important skills in LLM engineering. It determines whether a system can handle enterprise-scale documents, long conversation histories, and complex multi-document queries reliably.
Understanding token economics is the starting point. At typical API pricing, a 128K-token context window filled to capacity costs $0.16–$0.64 per call depending on the model (at $1.25–$5 per million input tokens). For a high-volume application making 100K calls per day, context management decisions directly drive costs of thousands of dollars per day. Token counting before each API call — using provider tokenizers (tiktoken for OpenAI models, the Anthropic tokenization library, etc.) — prevents hard errors and enables dynamic context truncation. A context budget allocation approach — reserving fixed token budgets for the system prompt, conversation history, retrieved context, and output — provides a structured framework for managing context across all query types in a multi-purpose application.
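The cost figures above follow directly from token counts and per-million-token prices; a sketch:

```python
def call_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Per-call API cost from token counts and per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6

# A 128K-token context filled to capacity, at the $1.25-$5/M input price range
assert abs(call_cost_usd(128_000, 0, 1.25, 0) - 0.16) < 1e-9
assert abs(call_cost_usd(128_000, 0, 5.00, 0) - 0.64) < 1e-9

# At 100K calls/day averaging 20K input tokens each at $1.25/M:
daily = 100_000 * call_cost_usd(20_000, 0, 1.25, 0)
print(f"${daily:,.0f}/day")  # → $2,500/day
```

Running this arithmetic per query type makes the payoff of context trimming concrete: every thousand tokens shaved from the average prompt saves a fixed, predictable dollar amount per day.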
Context Window Comparison (Major Models)
| Model | Context Window | Effective Retrieval Range | Practical Implication |
| --- | --- | --- | --- |
| GPT-4o / GPT-4o-mini | 128K tokens (~96K words) | ~100K reliable | Handles most enterprise documents; long reports fit natively |
| Claude 3.5 Sonnet | 200K tokens (~150K words) | ~180K reliable | Full book or large codebase in single context; strong long-document performance |
| Gemini 1.5 Pro | 1M tokens (~750K words) | ~500K reliable | Multi-document corpora; video transcripts; very large codebases |
| LLaMA 3.1 8B/70B | 128K tokens | ~80K reliable (self-hosted) | Competitive with GPT-4o range; KV cache memory at 128K is significant constraint |
| Mistral 7B (sliding window) | Theoretically unlimited (SWA) | ~32K reliable | Sliding window attention limits effective recall despite nominal unbounded window |
Chunking & Retrieval Strategies
Chunking strategy is the first critical design decision in any RAG pipeline. How documents are split into retrievable units determines both retrieval quality (can the relevant information be found?) and generation quality (is the context coherent enough for the LLM to reason over?). Fixed-size token chunking — splitting every N tokens regardless of content structure — is the simplest approach and performs reasonably well as a baseline, but frequently splits sentences, paragraphs, or logical units mid-thought. With overlap (e.g., 128-token overlap on 512-token chunks), it ensures that information near chunk boundaries appears in at least one chunk fully, at the cost of some redundancy. For structured documents with natural section boundaries (headers, article breaks, code blocks), section-aware chunking that respects semantic boundaries consistently outperforms fixed-size chunking by 8–15% on retrieval precision.
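Fixed-size chunking with overlap can be sketched independently of any particular tokenizer by operating on an already-tokenized sequence:

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 512,
                       overlap: int = 128) -> list[list]:
    """Fixed-size chunking with overlap.

    Each chunk starts (chunk_size - overlap) tokens after the previous
    one, so content near a boundary appears in full in two chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
    # Drop a trailing chunk that is fully contained in the previous one
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks

tokens = list(range(1000))  # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens, chunk_size=512, overlap=128)
assert [len(c) for c in chunks] == [512, 512, 232]
# The 128-token overlap: the last 128 tokens of chunk 0 open chunk 1
assert chunks[0][-128:] == chunks[1][:128]
```

In a real pipeline the token list would come from the embedding model's own tokenizer, so chunk sizes line up with the retriever's token limits rather than the LLM's.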
Semantic chunking — using embedding similarity to detect topic shifts and creating chunk boundaries at natural semantic transitions — produces the most coherent chunks but requires the most compute. Small-to-big chunking stores small chunks (128 tokens) for retrieval precision but returns their parent sections (512–1024 tokens) as context for generation, combining the retrieval granularity of small chunks with the contextual coherence of larger units. This "child-retrieval, parent-context" pattern is one of the most reliable improvements to RAG quality and adds minimal implementation complexity once a parent-child chunk relationship is stored in the index metadata. Sentence window retrieval is a variant: retrieve at the sentence level for precision, but expand to the surrounding 2–3 sentences before injection for coherence.
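The child-retrieval, parent-context pattern reduces to a metadata lookup at query time. A minimal sketch; the `Chunk` record and parent store here are illustrative stand-ins for whatever metadata your vector index holds.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: str
    parent_id: str   # stored in the index metadata alongside the embedding
    text: str

def expand_to_parents(retrieved: list[Chunk],
                      parents: dict[str, str]) -> list[str]:
    """Child-retrieval, parent-context: small chunks are retrieved for
    precision, then each hit's parent section is returned as generation
    context (deduplicated, retrieval order preserved)."""
    seen, context = set(), []
    for chunk in retrieved:
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            context.append(parents[chunk.parent_id])
    return context

parents = {"sec-1": "Section 1 full text...", "sec-2": "Section 2 full text..."}
hits = [Chunk("c3", "sec-1", "small chunk A"),
        Chunk("c9", "sec-2", "small chunk B"),
        Chunk("c4", "sec-1", "small chunk C")]  # same parent as c3
assert expand_to_parents(hits, parents) == ["Section 1 full text...",
                                            "Section 2 full text..."]
```

The deduplication step matters: multiple small hits inside one section should contribute that section's text once, not several overlapping copies.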
Hybrid search — combining dense vector retrieval with sparse BM25 keyword matching — consistently outperforms either approach alone, with typical improvements of 5–15% in top-5 recall. The combination matters because dense retrieval captures semantic similarity ("affordable flights" matching "cheap airfare") while sparse retrieval captures lexical precision (exact product codes, technical terms, proper nouns that embeddings compress poorly). Reciprocal Rank Fusion (RRF) is the standard algorithm for combining dense and sparse ranked lists: it assigns each document a score proportional to 1/(k + rank) in each list, then sums scores across lists. Cross-encoder reranking — applying a more expensive but more accurate model to the top-20 retrieved candidates to select the final top-5 — adds 100–300ms latency but improves precision by 10–20%, making it worth the latency cost for high-value applications where context quality directly affects answer accuracy.
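RRF takes only a few lines to implement; a sketch using the conventional smoothing constant k = 60:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]],
                           k: int = 60) -> list[str]:
    """RRF: each document scores the sum of 1/(k + rank) over every
    ranked list it appears in (rank is 1-based); k=60 is the
    conventional smoothing constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d5"]   # dense (vector) ranking
sparse = ["d1", "d3", "d2"]  # sparse (BM25) ranking
fused = reciprocal_rank_fusion([dense, sparse])
# Documents appearing in both lists rise to the top
assert fused[:2] == ["d1", "d2"]
```

Because RRF uses only ranks, it needs no score normalisation between the dense and sparse retrievers, which is why it is the default fusion method in most hybrid search stacks.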
Long-Document Processing Patterns
Processing documents that exceed the context window requires one of three architectural patterns, each with distinct trade-offs. The map-reduce pattern processes long documents in chunks: a "map" LLM call processes each chunk independently (extracting key information, summarising, or answering a partial question), and a "reduce" call synthesises the chunk-level results into a final answer. Map-reduce is highly parallelisable — all map calls can run concurrently — but suffers from the boundary problem: information that spans two chunks may not be captured correctly by either map call. With 128-token overlap between chunks, this is mitigated but not eliminated.
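The map-reduce pattern can be sketched with stub functions standing in for the LLM calls; in production, `map_call` and `reduce_call` would be API requests with summarisation prompts.

```python
from concurrent.futures import ThreadPoolExecutor

# Stubs standing in for LLM calls (hypothetical, for illustration only)
def map_call(chunk: str) -> str:
    return f"summary({chunk})"

def reduce_call(partials: list[str]) -> str:
    return " + ".join(partials)

def map_reduce(chunks: list[str], max_workers: int = 8) -> str:
    """Map-reduce over document chunks: the map calls run concurrently,
    then a single reduce call synthesises the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_call, chunks))  # preserves chunk order
    return reduce_call(partials)

assert map_reduce(["A", "B", "C"]) == "summary(A) + summary(B) + summary(C)"
```

The thread pool is the whole point of the pattern: with N chunks, wall-clock time is roughly one map call plus one reduce call, not N sequential calls.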
The iterative refinement pattern processes chunks sequentially: the first chunk produces an initial answer or summary, and each subsequent chunk updates and refines it. This captures cross-chunk relationships better than map-reduce but is sequential (no parallelism) and accumulates errors that early incorrect inferences introduce into later refinement steps. For summarisation tasks over very long documents, iterative refinement often produces higher-quality output than map-reduce because context from earlier sections informs the interpretation of later sections — a key insight that appears early in a report shapes how subsequent details are understood. The hierarchical summarisation pattern addresses the same problem differently: first summarise each section individually, then summarise the section summaries, creating a multi-level hierarchy that preserves both local detail and global structure. This is particularly effective for structured enterprise documents (financial reports, regulatory filings, technical specifications) where section boundaries carry semantic significance.
Implementation Pattern
The Context Budget Pattern for Multi-Purpose Applications
Applications handling diverse query types — some requiring long system prompts, others large knowledge base contexts, others long conversation histories — benefit from a context budget allocator that partitions the available token window dynamically per query type. A practical implementation defines per-query-type budgets as percentages: system prompt (10–15%), conversation history (20–30%), retrieved context (40–55%), input query (5–10%), response buffer (15–20%). Before each API call, the budget allocator counts current token usage in each slot, truncates or compresses each slot to stay within its budget, and verifies total token count against the model's context limit with a 5% safety margin. For conversation history truncation, a sliding window preserving the most recent N turns outperforms fixed-token truncation because recency matters more than token coverage for coherent conversation. For retrieved context truncation, reranker scores provide the optimal truncation criterion — cut the lowest-scoring chunks first until the context fits the budget.
Context Management
Token Budget
RAG
LLM Application Architecture Patterns
Production LLM applications are rarely a single model call — they are systems composed of multiple LLM calls, retrieval steps, tool invocations, validation layers, and orchestration logic. Understanding the canonical architectural patterns for composing these components helps practitioners make sound design decisions rather than reinventing solutions to well-understood problems. The patterns covered here — routing, chaining, caching, fallbacks, and function calling — appear in virtually every mature LLM application, from enterprise automation systems to consumer AI products.
The orchestration layer — the code that sequences LLM calls and other operations — is where most of the complexity in LLM applications lives. Frameworks like LangChain, LlamaIndex, and DSPy provide reusable abstractions for common patterns. LangChain's chain and agent abstractions handle sequential and conditional LLM call pipelines with built-in prompt templating and output parsing. LlamaIndex focuses on data connectors and query engines for RAG pipelines over diverse data sources. DSPy takes a different philosophy: rather than manually writing prompts, it compiles high-level program specifications into optimised prompts using a compilation pipeline that maximises a task-specific metric. The choice between frameworks and custom orchestration depends on team familiarity, the complexity of the pipeline, and the degree of control needed over individual components.
Routing & Chaining
Routing dispatches queries to different models, prompts, or processing pipelines based on the query's characteristics. The most common routing strategy is complexity routing: classify each incoming query as simple or complex, direct simple queries to a fast, cheap model (GPT-4o-mini, Claude Haiku) and complex queries to a frontier model (GPT-4o, Claude 3.5 Sonnet). A binary complexity classifier trained on historical query-outcome pairs achieves this efficiently. Task routing sends different task types to task-optimised models: a coding query routes to a code-specialised model, a document summarisation query routes to a context-length-optimised model, a creative writing query routes to a model known for fluent prose. Well-implemented routing achieves 40–70% cost reduction with less than 2% quality degradation measured on aggregate production metrics.
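The shape of a complexity router is simple even when the classifier is not. The sketch below uses a crude heuristic (query length plus reasoning keywords) in place of the trained binary classifier described above; the model names are illustrative:

```python
# Heuristic stand-in for a trained complexity classifier. In production this
# would be a small model trained on historical query-outcome pairs.
CHEAP, FRONTIER = "gpt-4o-mini", "gpt-4o"

REASONING_MARKERS = ("why", "explain", "compare", "step by step", "prove")

def route(query: str) -> str:
    """Return the model a query should be dispatched to."""
    q = query.lower()
    long_query = len(q.split()) > 30          # long queries tend to be complex
    reasoning = any(m in q for m in REASONING_MARKERS)
    return FRONTIER if (long_query or reasoning) else CHEAP
```

The application layer then simply calls the returned model. Measuring routed traffic against the frontier-only baseline is what validates the cost/quality claim for your own workload.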
Chaining sequences multiple LLM calls where the output of one step feeds into the next. The canonical examples are: a summarise-then-answer chain (first compress long documents, then answer questions over the summary), a draft-then-critique chain (one call produces a draft, a second call critically evaluates it and suggests improvements), and a decompose-then-solve chain (one call breaks a complex query into sub-questions, subsequent calls solve each sub-question, a final call synthesises the answers). Chains introduce compounding error risk: if an early step produces a subtly incorrect output, downstream steps have no access to the original information and cannot detect or recover from the error. Chain design should minimise the number of sequential steps, validate intermediate outputs against schemas or classifiers before they feed into subsequent steps, and maintain access to the original input at every stage for context retrieval.
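The decompose-then-solve chain, with the two mitigations above (validate intermediate output, keep the original input available at every stage), might look like this sketch. The `llm` callable and prompts are placeholders:

```python
import json
from typing import Callable

LLM = Callable[[str], str]  # stand-in for a real model call

def decompose_then_solve(question: str, llm: LLM) -> str:
    # Step 1: decompose. We expect a JSON list of sub-questions and VALIDATE it
    # before it feeds the next step, rather than trusting the model's output.
    raw = llm(f"Break this into sub-questions as a JSON list:\n{question}")
    try:
        subs = json.loads(raw)
        assert isinstance(subs, list) and subs
    except (json.JSONDecodeError, AssertionError):
        subs = [question]  # validation failed: degrade to a single-step answer

    # Step 2: solve each sub-question. The ORIGINAL question stays in context
    # so downstream steps can recover from a poor decomposition.
    answers = [
        llm(f"Original question: {question}\nSub-question: {s}\nAnswer:")
        for s in subs
    ]

    # Step 3: synthesise the partial answers into one response.
    return llm(f"Question: {question}\nPartial answers:\n" + "\n".join(answers))
```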
Function calling (structured tool use) is the architectural pattern for connecting LLMs to external systems: databases, APIs, search engines, calculators, code interpreters. The model is given a schema of available functions with their parameters, and rather than generating text, it generates a structured function call that the application layer executes, returning the result to the model for incorporation into the final response. This pattern eliminates hallucination about current facts (the model can call a real-time data API), enables precise computation (the model delegates arithmetic to a calculator), and provides a structured interface for system integrations. All major frontier model APIs (OpenAI function calling, Anthropic tool use, Gemini function declarations) support this natively. The key design practice is writing precise, concise function descriptions that clearly distinguish between functions — ambiguous descriptions cause the model to select the wrong function in multi-tool environments.
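The application-layer half of this pattern — the schema handed to the model and the dispatcher that executes the structured call it returns — can be sketched as below. The schema shape mirrors the OpenAI tools format; `get_weather` is a hypothetical stand-in for a real API call:

```python
import json

# Real implementations the application layer executes on the model's behalf.
def get_weather(city: str) -> str:
    return f"22C and sunny in {city}"  # stand-in for a live weather API

TOOLS = {"get_weather": get_weather}

# Schema given to the model (shape mirrors the OpenAI tools format). Note the
# precise description: it says when to use the tool, not just what it does.
TOOL_SCHEMAS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city. Use only for weather queries.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Execute the structured call the model generated instead of free text."""
    fn = TOOLS[tool_call["name"]]
    args = json.loads(tool_call["arguments"])  # providers return args as JSON text
    return fn(**args)
```

The dispatched result is then appended to the conversation as a tool message so the model can incorporate it into its final answer.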
Caching, Fallbacks & Resilience
LLM API calls are expensive, latency-variable, and occasionally fail. Production architectures must handle all three realities. Semantic caching stores responses keyed by embedding similarity: when an incoming query's embedding is within a configurable cosine distance threshold of a cached query, the cached response is returned without an API call. For customer support applications, this captures 15–25% of traffic (frequently-asked question patterns), reducing cost proportionally. Cache invalidation must be handled carefully: responses grounded in facts that change (product pricing, availability, policies) must be invalidated when the underlying data changes, not just on TTL expiry. Exact-match caching (keying on normalised query text) is simpler to implement and appropriate for templated queries where canonicalisation is reliable.
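A semantic cache reduces to an embedding lookup with a similarity threshold. The sketch below uses a toy bag-of-words embedding so it is self-contained; a production system would use a real embedding model and a vector index:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" for illustration only.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached response when a query is close enough to a cached one."""
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query: str):
        q = embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller makes the real API call, then put()s

    def put(self, query: str, response: str):
        self.entries.append((embed(query), response))
```

A real implementation would also attach invalidation metadata to each entry (e.g. which pricing or policy records a response depends on) so changes to the underlying data can evict entries rather than relying on TTL alone.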
Fallback chains handle API failures gracefully. The standard pattern routes requests to an alternative model when the primary API returns a 429 (rate limit) or 5xx error: primary=GPT-4o → fallback-1=Claude 3.5 Sonnet → fallback-2=Gemini 1.5 Pro → fallback-3=cached response or graceful degradation. Cross-provider fallback adds resilience to provider-level outages but requires prompt variants for each provider's API format. Retry with exponential backoff handles transient errors: 3 retries with delays of 1s, 2s, 4s catches the vast majority of transient API failures without hammering a struggling service. Circuit breakers — temporary fallback to an alternative path after N consecutive failures — prevent cascading failures when a provider has a prolonged outage. Building a load balancer that distributes traffic across multiple API provider accounts or regions provides both rate limit mitigation and multi-provider resilience in a single layer.
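Retry-with-backoff and a provider fallback chain compose naturally; a minimal sketch, with zero-argument callables standing in for provider API calls:

```python
import time

def call_with_retry(fn, retries=3, base_delay=1.0):
    """Retry a transient failure with exponential backoff (1s, 2s, 4s by default)."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(base_delay * 2 ** attempt)

def call_with_fallbacks(providers, retries=3, base_delay=1.0):
    """providers: ordered list of (name, callable). First success wins."""
    for name, fn in providers:
        try:
            return name, call_with_retry(fn, retries, base_delay)
        except Exception:
            continue  # exhausted retries on this provider; step down the chain
    raise RuntimeError("all providers failed")
```

A circuit breaker would wrap each provider entry with a failure counter that skips it entirely for a cooldown period after N consecutive failures, so a struggling provider is not retried on every request.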
Observability for LLM application architectures requires tracking at the component level, not just the endpoint level. For a chain with three LLM calls, tracking only the overall request latency and success rate provides too little signal to diagnose where failures occur. Component-level instrumentation — measuring input token count, output token count, latency, cost, and error rate per LLM call in the chain — enables root cause analysis. Distributed tracing (OpenTelemetry is the standard; LangSmith and Langfuse provide LLM-specific tracing) captures the full execution tree of a complex pipeline in a single trace, making it possible to see exactly which retrieval call returned low-quality chunks, which LLM call produced an unexpected response format, and which fallback was triggered. LLM-specific tracing should capture the full prompt (system + history + context + input) and completion for every call, with appropriate PII redaction applied before storage, enabling engineers to replay any production request exactly as it occurred for debugging.
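Component-level instrumentation can start as a simple decorator before graduating to a full tracing backend. In this sketch token counts are approximated with a whitespace split, and `METRICS` stands in for an export to a system like OpenTelemetry:

```python
import functools
import time

METRICS = []  # stand-in for an export to a tracing/metrics backend

def traced(component: str):
    """Record per-component latency, rough token counts, and success for each call."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(prompt, *args, **kwargs):
            start = time.perf_counter()
            error, out = None, ""
            try:
                out = fn(prompt, *args, **kwargs)
                return out
            except Exception as e:
                error = e
                raise
            finally:
                METRICS.append({
                    "component": component,
                    "latency_s": time.perf_counter() - start,
                    "input_tokens": len(prompt.split()),   # rough proxy
                    "output_tokens": len(out.split()),
                    "ok": error is None,
                })
        return wrapper
    return deco
```

Decorating each LLM call in a chain with a distinct component name gives exactly the per-step signal the paragraph above argues for: a three-call chain produces three metric records per request, not one.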
Architecture Principle: Design LLM applications to degrade gracefully, not fail catastrophically. A user who receives a cached answer from 6 hours ago when the live API is down is better served than one who receives a 503 error. Define explicit degradation levels — live LLM response, semantic cache hit, static FAQ answer, human escalation — and implement logic to step down through them rather than treating API unavailability as a fatal failure.
Practice Exercises
These exercises build progressively from simple API usage through to production-scale throughput benchmarking. Each exercise is designed to generate concrete, measurable results that deepen your intuition about LLM behaviour.
Beginner
Exercise 1: Temperature Effects on Summarisation
Use the OpenAI API to summarise 5 news articles (300–500 words each). Summarise each article once at temperature=0.0, temperature=0.5, and temperature=1.0 (15 API calls total). For each temperature setting: how consistent is the summary style across the 5 articles? If you re-run the same article at the same temperature, how much do the outputs diverge between runs? Which temperature produces the most factually accurate summaries? Which produces the most engaging prose? Document your observations in a structured table comparing the three temperature settings.
Intermediate
Exercise 2: Token Counting and Prompt Efficiency
Using tiktoken, count the tokens in 10 of your own prompts. For each prompt, calculate: tokens per instruction word (efficiency ratio), and the proportion of the total token count consumed by the system prompt vs. the user query. Design a refactored version of each prompt that achieves the same task specification using fewer tokens. What techniques reduce token count most: removing redundant phrases, using shorter synonyms, switching from examples to schema descriptions? Measure the accuracy of the refactored prompts vs. the originals on 5 test inputs each.
Intermediate
Exercise 3: GPT-4o-mini vs GPT-4o Capability Comparison
Design 10 test tasks spanning: simple factual lookup (2 tasks), multi-step reasoning (3 tasks), code generation (2 tasks), creative writing (1 task), and domain-specific knowledge (2 tasks). Run each task on both models at temperature=0.1. Rate each response on a 1–5 scale for accuracy and quality. For which task categories does the mini model match or exceed the full model? For which does it fail significantly? Calculate the cost difference and estimate the breakeven — at what quality gap is it worth paying for the full model?
Advanced
Exercise 4: vLLM Throughput Benchmarking
Set up vLLM locally (requires a GPU with at least 16GB VRAM) with an open-weight model (LLaMA 3.1 8B or Mistral 7B). Use the built-in benchmark_serving.py script to measure throughput (tokens/sec) at concurrent batch sizes of 1, 4, 16, and 64 requests. Plot the throughput curve. At what batch size does throughput plateau? What is the P95 time-to-first-token at each batch size? Compare: what is the GPU utilisation (measured via nvidia-smi) at each batch size? This experiment reveals the relationship between concurrency, memory pressure, and serving cost.
LLM Model Card Generator
Model cards are the standard documentation artefact for AI models — capturing intended use, training data, performance metrics, known limitations, and ethics considerations. Use the form below to generate a structured model card for any LLM-based system you are building or deploying.
Conclusion & Next Steps
Large language models represent a qualitative shift in what software systems can do with natural language. The decoder-only transformer architecture, trained with the simple objective of next-token prediction at massive scale, produces systems with emergent capabilities that were not designed in — in-context learning, chain-of-thought reasoning, and coherent long-form generation that generalise across domains. Chinchilla scaling laws provide actionable guidance for compute allocation, and the distinction between training-optimal and inference-optimal models is now a central design variable for any team training or deploying LLMs at scale.
Understanding LLM internals — how attention works, what context window constraints mean, why hallucination is structural rather than incidental, and what benchmark scores do and do not measure — is the prerequisite for making sound engineering decisions about every aspect of LLM application development: which model to use, how to design prompts, when to fine-tune, and how to build evaluation infrastructure that provides reliable signal. The next article in this series moves to the practitioner's primary interface with LLMs: prompt engineering, where the techniques of zero-shot, few-shot, chain-of-thought, and structured output design translate the capabilities described here into reliable production behaviour.
Next in the Series
In Part 9: Prompt Engineering & In-Context Learning, we cover the systematic techniques — zero-shot, few-shot, chain-of-thought, tree-of-thought, and structured outputs — that make LLMs reliable and controllable in production pipelines.
Continue This Series
Part 3: Natural Language Processing
Tokenisation, embeddings, the transformer architecture, and semantic search — the NLP foundations that LLMs are built upon.
Part 9: Prompt Engineering & In-Context Learning
Chain-of-thought prompting, few-shot learning, structured outputs, and the prompt patterns that extract maximum performance without any weight updates.
Part 10: Fine-tuning, RLHF & Model Alignment
LoRA, instruction tuning, DPO, and the alignment techniques that turn raw pre-trained LLMs into safe, useful assistants.