
AI Infrastructure, Hardware & Scaling

March 30, 2026 · Wasil Zafar · 32 min read

Training and serving large AI models requires a deep understanding of hardware capabilities, memory constraints, and distributed systems — this article covers the engineering that makes large-scale AI possible.

Table of Contents

  1. AI Hardware Landscape
  2. Memory Hierarchy & Bottlenecks
  3. Distributed Training
  4. Mixed Precision & GPU Profiling
  5. Inference Optimisation & Serving
  6. Practical Exercises
  7. Infrastructure Plan Generator
  8. Conclusion & Next Steps

AI in the Wild Part 22 of 24

About This Article

This article covers the hardware and systems engineering that underpins large-scale AI — from GPU and TPU architectures to memory hierarchies, distributed training strategies, mixed-precision computation, and production inference serving. You will come away with working knowledge of when to apply each parallelism strategy, how to profile and optimise GPU memory, and how to deploy ML workloads on Kubernetes at scale.

Tags: GPU/TPU · Distributed Training · FSDP / ZeRO · Mixed Precision · FlashAttention · Kubernetes ML

AI Hardware Landscape

Modern deep learning is, at its core, a story about matrix multiplication done at extraordinary scale. A transformer with 70 billion parameters performs hundreds of billions of multiply-accumulate operations in a single forward pass, and training requires doing this millions of times while tracking gradients through every layer. General-purpose CPUs are designed for low-latency sequential execution — they excel at running operating systems, web servers, and branchy business logic, but their small number of powerful cores and modest memory bandwidth make them hopelessly inefficient for the dense, regular, parallel arithmetic that deep learning demands. The GPU's SIMD (single instruction, multiple data) architecture, with thousands of smaller cores capable of executing the same operation simultaneously across thousands of data elements, is a fundamentally better fit.

The hardware landscape today organises into three tiers. At the top are NVIDIA GPUs — the H100 and A100 remain the dominant training accelerators, with the Blackwell B200 entering hyperscaler deployments. Google's TPUs (v4 and v5) power nearly all of Google's own model training and are available to external customers through Google Cloud TPU. The third tier is fast-moving custom silicon: AWS Trainium for training and Inferentia for inference, AMD's MI300X as the most credible NVIDIA alternative, and specialised processors from Graphcore (IPU), Cerebras (WSE-3), and SambaNova. Inference workloads add a fourth category — Apple Silicon's unified memory architecture makes M3 and M4 chips surprisingly competitive for running quantised models locally, and Qualcomm's Snapdragon X Elite brings similar capabilities to the PC ecosystem.

The trade-offs between these platforms are not primarily about peak FLOP ratings. Training large models requires high memory capacity (to hold parameters, gradients, and optimiser states), high memory bandwidth (to feed those thousands of cores quickly enough to keep them busy), and a fast interconnect between chips (because no single chip can hold a frontier model). Inference prioritises latency and cost per token over peak throughput. Choosing hardware requires mapping your workload's arithmetic intensity, memory footprint, and batch size to the platform's roofline — not comparing spec sheets in isolation.

Key Insight: The bottleneck in large-model training is almost never raw compute — it is memory bandwidth and inter-accelerator communication. An H100 delivers roughly 1 PFLOPS of dense BF16 compute (about 2 PFLOPS with structured sparsity) but only 3.35 TB/s of HBM bandwidth. Most transformer operations are memory-bandwidth bound at realistic batch sizes, meaning the chip spends more time waiting for data than computing. Hardware selection must be made with the full system stack in mind, not just peak FLOP ratings.

GPUs for ML

NVIDIA's GPU architecture is built around Streaming Multiprocessors (SMs) — each SM contains CUDA cores for general arithmetic, Tensor Cores for matrix multiply-accumulate, and shared SRAM for fast on-chip data exchange. The H100 SXM5 has 132 SMs, each with 4th-generation Tensor Cores that execute small matrix multiply-accumulate tiles in FP16 or BF16 as single hardware instructions. The architectural evolution from Volta (V100, 2017) through Ampere (A100, 2020) and Hopper (H100, 2022) has repeatedly doubled useful throughput for ML workloads. Hopper added the Transformer Engine, which automatically selects FP8 precision where safe during the forward and backward pass, unlocking a further doubling of throughput over BF16. The Blackwell B200 (2024) goes further with FP4 support, 192 GB HBM3e, and a 10 TB/s die-to-die interconnect in its dual-die package.

For practitioners, the specs that matter most are: HBM capacity (80 GB on the A100 and H100 SXM5, 141 GB on the H200, 192 GB on the B200), HBM bandwidth (2 TB/s on A100, 3.35 TB/s on H100, 4.8 TB/s on H200), and NVLink bandwidth for multi-GPU communication (900 GB/s per GPU on NVLink 4.0 in a DGX H100 system). A DGX H100 node packs 8 H100s connected via NVSwitch, delivering 900 GB/s all-to-all bandwidth within the node — fast enough that intra-node all-reduce is rarely a bottleneck. Inter-node communication over InfiniBand (200 Gb/s HDR or 400 Gb/s NDR per link) offers roughly an order of magnitude less per-GPU bandwidth, which shapes distributed training strategies profoundly.

The CUDA ecosystem is NVIDIA's deepest moat. Over a decade of library development (cuBLAS, cuDNN, NCCL, cuSPARSE), tooling (Nsight Systems, Nsight Compute), and framework integration (PyTorch, JAX, TensorFlow all target CUDA natively) means that switching costs are enormous. AMD's ROCm platform offers a HIP programming model that is syntactically close to CUDA and supports most PyTorch operations, but the ecosystem gaps — missing or slower libraries, less mature debugging tooling, limited operator coverage for custom kernels — remain genuine friction for teams considering migration. The MI300X, with 192 GB of HBM3 and 5.3 TB/s aggregate bandwidth, is technically compelling for inference workloads, but most production training clusters remain on NVIDIA hardware pending maturity of the software stack.

TPUs & Custom Accelerators

Google's Tensor Processing Units were designed from the ground up for the matrix arithmetic of neural networks. A TPU core is a systolic array — a grid of multiply-accumulate units arranged so that data flows through the array in a regular, pipelined pattern without any of the branching and scheduling overhead of a general-purpose processor. This makes TPUs highly efficient for exactly the workloads that appear in training: large matrix multiplications with regular shapes and no data-dependent branching. The v4 TPU pod provides 1.1 EFLOPS of BF16 compute across 4,096 chips connected with a high-speed 3D torus interconnect; the v5p pod extends this to near-exascale capacity. TPUs are almost exclusively programmed through JAX (via XLA compilation) or TensorFlow — PyTorch/XLA provides a compatibility layer, but the user experience for custom kernels and dynamic shapes is rougher than on CUDA.

AWS Trainium (Trn1/Trn2) is designed for large-scale LLM training and is priced at roughly 20–30% below comparable NVIDIA instances on AWS. The NeuronSDK compiles models from PyTorch and TensorFlow to run on Trainium hardware; the compiler performs operator fusion and layout transformations automatically. AWS Inferentia (Inf1/Inf2) targets inference specifically, delivering some of the lowest cost-per-token numbers available in the cloud for deployed LLMs. The trade-off is a narrower programming model — Trainium works best with standard transformer architectures and standard operators; custom CUDA kernels and non-standard operations require manual porting or cannot be expressed at all. For teams building proprietary model architectures or custom training loops with unusual operators, NVIDIA remains the only choice that guarantees full flexibility.

GPU Comparison Table

Choosing hardware for ML training or inference requires understanding the landscape of available accelerators. The table below compares the most commonly used GPUs across dimensions that matter in practice — memory capacity for fitting large models, compute throughput for training speed, interconnect for multi-GPU scaling, and rough cost positioning.

GPU | HBM Memory | TF32 TFLOPS | FP16 TFLOPS | NVLink | Price Range | Best For
NVIDIA H100 SXM5 | 80 GB HBM3 | 989 (w/ sparsity) | 1,979 (w/ sparsity) | NVLink 4.0 (900 GB/s) | $25k–$35k | LLM training, frontier model research
NVIDIA A100 SXM4 | 80 GB HBM2e | 312 (w/ sparsity) | 624 (w/ sparsity) | NVLink 3.0 (600 GB/s) | $10k–$15k | Production training, mid-size LLMs
NVIDIA V100 SXM2 | 32 GB HBM2 | N/A (no TF32 support) | 125 | NVLink 2.0 (300 GB/s) | $3k–$8k (used) | Legacy training, budget research
NVIDIA RTX A6000 | 48 GB GDDR6 | 154 (w/ sparsity) | 309 (w/ sparsity) | NVLink bridge (112 GB/s) | $4k–$6k | Workstation training, fine-tuning
NVIDIA L40S | 48 GB GDDR6 | 366 (w/ sparsity) | 733 (w/ sparsity) | None | $8k–$12k | Inference at scale, multi-modal
NVIDIA RTX 4090 (consumer) | 24 GB GDDR6X | 83 | 165 | None | $1,500–$2,000 | Consumer fine-tuning, local inference

Memory Hierarchy & Bottlenecks

Understanding GPU memory hierarchy is essential for diagnosing performance issues. At the top of the hierarchy sits HBM (High Bandwidth Memory) — a stacked DRAM technology integrated directly onto the GPU package via silicon interposer. HBM delivers up to 4.8 TB/s of bandwidth on the H200, compared to around 900 GB/s for GDDR6X DRAM used in consumer GPUs. Below HBM, the on-chip SRAM (L2 cache and SM-level shared memory) operates at roughly 10–20 TB/s but is orders of magnitude smaller: the H100 has only 50 MB of L2 and 228 KB of shared memory per SM, totalling around 30 MB of shared memory across all SMs. Main system memory (CPU RAM) and storage (NVMe, object storage) form the lower tiers and are accessed via PCIe — at PCIe 5.0 x16, this provides only about 63 GB/s, two orders of magnitude below HBM bandwidth.

This hierarchy creates a clear priority for optimisation: the goal is to keep data in SRAM as long as possible, minimise HBM round trips, and never access PCIe or system memory during critical compute paths. The roofline model is the canonical framework for analysing whether a specific operation is compute-bound or memory-bandwidth-bound: plot achievable FLOPS against arithmetic intensity (FLOPS per byte of memory traffic). Operations with high arithmetic intensity (large matrix multiplications, convolutions with large kernels) sit on the compute-bound "roof"; operations with low arithmetic intensity (elementwise ops, layer norm, many attention patterns) sit on the memory-bandwidth-bound "wall". Most practical transformer training operations at real batch sizes fall on or below the memory bandwidth wall, meaning the GPU is not compute-saturated — it is waiting for data.
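
The roofline test can be run as back-of-envelope arithmetic. A minimal sketch, using published H100 SXM5 figures (~989 TFLOPS dense BF16, 3.35 TB/s HBM) as illustrative constants:

```python
# Roofline sketch: is an op compute-bound or bandwidth-bound?
# Illustrative H100 SXM5 figures: ~989 TFLOPS dense BF16, 3.35 TB/s HBM.
PEAK_FLOPS = 989e12
PEAK_BW = 3.35e12  # bytes/s

def roofline(flops, bytes_moved):
    """Return attainable TFLOPS, the binding limit, and arithmetic intensity."""
    intensity = flops / bytes_moved            # FLOPs per byte of HBM traffic
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
    bound = "compute" if attainable >= PEAK_FLOPS else "bandwidth"
    return attainable / 1e12, bound, intensity

# Large GEMM: (M,K) @ (K,N) in BF16 -- 2*M*N*K FLOPs, (M*K + K*N + M*N)*2 bytes
M = N = K = 8192
tflops, bound, ai = roofline(2 * M * N * K, 2 * (M * K + K * N + M * N))
print(f"large GEMM: intensity={ai:.0f} FLOP/B -> {bound}-bound")

# Elementwise add on same-sized tensors: 1 FLOP/element, 3 tensors of traffic
n = M * N
tflops, bound, ai = roofline(n, 3 * n * 2)
print(f"elementwise add: intensity={ai:.2f} FLOP/B -> {bound}-bound")
```

The ridge point for these figures sits near 295 FLOPs per byte — any operation below that intensity leaves the Tensor Cores idle no matter how well the kernel is written.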

Memory Bandwidth & HBM

The practical consequence of HBM bandwidth as the primary bottleneck is that many optimisations that appear to add computation actually improve total training throughput by reducing HBM reads and writes. This is counterintuitive to developers accustomed to CPU optimisation, where avoiding computation is almost always the goal. On GPU, a sequence of four elementwise operations (each reading from and writing to HBM independently) is far slower than one fused kernel that performs all four in registers without HBM traffic. This is why operator fusion, enabled by torch.compile and Triton custom kernels, is a primary tool for GPU performance optimisation.
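
The traffic saving from fusion is easy to quantify. A minimal sketch (tensor size chosen for illustration) counting HBM bytes moved by four chained elementwise operations, unfused versus fused:

```python
# HBM traffic for a chain of 4 elementwise ops over one BF16 tensor.
# Unfused: each op reads its input from HBM and writes its output back.
# Fused: one kernel reads the input once and writes the final result once.
elements = 4096 * 4096      # e.g. one activation tensor (illustrative)
bytes_per = 2               # BF16
n_ops = 4

unfused = n_ops * 2 * elements * bytes_per   # one read + one write per op
fused = 2 * elements * bytes_per             # single read + single write

print(f"unfused traffic: {unfused / 2**20:.0f} MiB")
print(f"fused traffic:   {fused / 2**20:.0f} MiB ({unfused // fused}x less)")
```

The arithmetic is unchanged; only the number of HBM round trips drops — which is exactly why fusion helps on bandwidth-bound hardware.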

Memory capacity determines model size. For BF16 training, a model parameter consumes 2 bytes. An optimiser state (Adam) requires an additional 8 bytes per parameter (two 32-bit moments). Gradients consume another 2–4 bytes per parameter, and mixed-precision training typically keeps a 4-byte FP32 master copy of each weight as well. Activations are a function of batch size, sequence length, and model width — for a typical transformer, activation memory at training time can exceed parameter memory for long sequences. The practical consequence is that a 7B parameter model in BF16 requires roughly 7B x 16 bytes (parameters, gradients, Adam moments, and FP32 master weights) = approximately 112 GB just for the parameter-related state, before any activations — more than a single A100's 80 GB. This is why even "small" LLMs require multi-GPU setups for full fine-tuning.

Activation Checkpointing & Gradient Accumulation

Activation checkpointing (also called gradient checkpointing in PyTorch) trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them. In standard backpropagation, every intermediate activation tensor from the forward pass must be retained until its gradient is computed — this is where most of the "activation memory" is consumed. Checkpointing divides the network into segments, discards activations at segment boundaries after the forward pass, and recomputes them on the fly during the backward pass. The memory savings depend on segment granularity: checkpointing every sqrt(L)-th layer reduces activation memory from O(L) to roughly O(sqrt(L)), at the cost of one extra forward pass of computation. The result is typically a 4–8x reduction in activation memory with a 20–30% compute overhead — almost always worth it when memory is the binding constraint.
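
In PyTorch this is a one-line change via torch.utils.checkpoint. A minimal CPU sketch (layer sizes are illustrative) confirming that checkpointing changes what is stored, not what is computed:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Illustrative block: activations inside it will be recomputed in backward
# rather than stored during forward.
torch.manual_seed(0)
block = torch.nn.Sequential(
    torch.nn.Linear(256, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 256)
)
x = torch.randn(8, 256, requires_grad=True)

# Standard forward: all intermediate activations are kept for backward
g_plain, = torch.autograd.grad(block(x).sum(), x)

# Checkpointed forward: only the block's input is kept; intermediates are
# recomputed during backward (one extra forward, much less activation memory)
g_ckpt, = torch.autograd.grad(checkpoint(block, x, use_reentrant=False).sum(), x)

# Checkpointing changes memory behaviour, not the math
assert torch.allclose(g_plain, g_ckpt, atol=1e-6)
```

In a real model you would wrap each transformer layer (or every sqrt(L)-th layer) this way; the gradients are identical either way.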

Gradient accumulation solves a different problem: effective batch size constraints. When training with very small per-GPU batch sizes (due to memory limits), a single gradient update has high noise, which slows convergence. Gradient accumulation runs multiple forward and backward passes before calling optimizer.step(), accumulating gradients across micro-batches without applying them. This allows an effective batch size of N times micro_batch_size without requiring the full batch to fit in memory simultaneously. PyTorch's DDP provides a no_sync() context manager specifically for gradient accumulation, which suppresses the all-reduce communication during accumulation steps and only synchronises at the update step, preserving communication efficiency.
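
A minimal sketch (shapes and accumulation count are illustrative) showing that accumulated micro-batch gradients reproduce the full-batch gradient, provided a mean-reduction loss is divided by the number of accumulation steps:

```python
import torch

# Sketch: N accumulation steps reproduce the full-batch gradient.
torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
x, y = torch.randn(32, 16), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# Full-batch gradient as the reference
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulate over 4 micro-batches of 8; with a mean-reduction loss each
# micro-batch loss must be divided by the number of accumulation steps.
accum = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accum), y.chunk(accum)):
    (loss_fn(model(xb), yb) / accum).backward()   # grads accumulate in .grad
# optimizer.step() would run here, once per accumulated batch

assert torch.allclose(full_grad, model.weight.grad, atol=1e-6)
```

Under DDP, wrap all but the final micro-step in the model's no_sync() context so the all-reduce fires only once per accumulated batch.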

Practitioner Tip

Memory Estimation Before Training

Before launching a multi-day training run, estimate your memory requirements: Total GPU memory needed = (model parameters x bytes_per_param) x (1 + grad_factor + optimizer_factor) + activation_memory. For Adam + BF16: each parameter needs ~16 bytes (2 BF16 + 8 Adam states + 4 FP32 master weights + 2 BF16 gradients). A 13B model needs roughly 208 GB — 3x H100 minimum, 4x with headroom for activations. Use torch.cuda.memory_summary() after a single forward+backward step to validate your estimate before scaling to a full cluster.
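
The estimate in the tip above can be captured in a few lines (the 16 bytes/param breakdown assumes BF16 training with Adam and FP32 master weights, as described; activation memory must still be measured separately):

```python
def training_memory_gb(params_b, bytes_per_param=16, activation_gb=0.0):
    """Estimate GPU memory (GB) for Adam + BF16 mixed-precision training.

    16 bytes/param = 2 (BF16 weights) + 2 (BF16 grads)
                   + 8 (FP32 Adam moments) + 4 (FP32 master weights).
    activation_gb must be measured or estimated separately.
    """
    return params_b * bytes_per_param + activation_gb

for size in (7, 13, 70):
    gb = training_memory_gb(size)
    print(f"{size}B model: ~{gb:.0f} GB state -> {gb / 80:.1f}x 80 GB GPUs minimum")
```

This reproduces the 112 GB (7B) and 208 GB (13B) figures above; validate against torch.cuda.memory_summary() after one real step before committing a cluster.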

Distributed Training

No single GPU can hold or train a frontier model. GPT-4 is estimated at over 1 trillion parameters; Llama 3 405B has 405 billion. Even at INT8 quantisation, these models require hundreds of gigabytes of memory just for the weights — far beyond any single chip. Training them requires parallelism across dozens, hundreds, or thousands of GPUs. Three fundamental paradigms exist, each splitting a different dimension of the training problem: data parallelism splits the data across GPUs (each GPU holds a full model copy); model parallelism splits the model itself across GPUs (each GPU holds only part of the model); and pipeline parallelism stages the model's layers across GPUs like an assembly line. Production LLM training combines all three in 3D parallelism, as implemented by Megatron-LM and DeepSpeed.

Data Parallelism (DDP) — Code

Distributed Data Parallel (DDP) is the most straightforward parallelism strategy. Each GPU receives a different mini-batch of data and computes gradients independently. After each backward pass, gradients are synchronised across all GPUs using an all-reduce collective operation (typically ring-allreduce), ensuring every GPU's copy of the model receives the same gradient update. The all-reduce is communication-efficient: ring-allreduce transfers exactly 2(N-1)/N of the total gradient tensor across N GPUs, and NCCL's implementation overlaps it with computation for maximum efficiency. DDP achieves near-linear scaling up to 8 GPUs within a node (where NVLink bandwidth makes all-reduce fast), with decreasing efficiency at larger node counts due to slower inter-node InfiniBand bandwidth.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data.distributed import DistributedSampler

def setup(rank, world_size):
    """Initialize the distributed process group."""
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

def cleanup():
    dist.destroy_process_group()

def train(rank, world_size, model_class, dataset):
    setup(rank, world_size)

    # Each process handles a partition of the data
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    # Wrap model in DDP -- synchronizes gradients across all GPUs after backward pass
    model = model_class().cuda(rank)
    ddp_model = DDP(model, device_ids=[rank])

    criterion = torch.nn.CrossEntropyLoss()  # define the loss used below
    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3 * world_size)  # scale LR

    for epoch in range(10):
        sampler.set_epoch(epoch)  # ensures different shuffling per epoch
        for batch_x, batch_y in loader:
            batch_x, batch_y = batch_x.cuda(rank), batch_y.cuda(rank)
            loss = criterion(ddp_model(batch_x), batch_y)
            optimizer.zero_grad()
            loss.backward()      # auto-syncs gradients via all_reduce
            optimizer.step()

        if rank == 0:  # only log from process 0
            print(f"Epoch {epoch+1}: loss={loss.item():.4f}")

    cleanup()

# Launch on 4 GPUs (or 4 nodes x 1 GPU); mp.spawn passes the rank as the first argument
if __name__ == "__main__":
    world_size = 4
    # MyModel and train_dataset are placeholders for your model class and dataset
    mp.spawn(train, args=(world_size, MyModel, train_dataset), nprocs=world_size)
    # Near-linear scaling: 4x GPUs -> ~3.8x throughput with DDP
Linear Scaling Rule: When multiplying the number of GPUs by N in DDP, scale the learning rate by N (linear scaling rule, Goyal et al. 2017). This preserves the effective per-sample gradient magnitude across larger effective batch sizes. Apply a warmup period of 5–10 epochs at the start of training to avoid instability from the large initial learning rate.
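
A minimal sketch of the linear scaling rule with linear warmup (base LR and step counts are illustrative; the note above phrases warmup in epochs, shown here as steps):

```python
def scaled_lr(base_lr, world_size, step, warmup_steps):
    """Linear scaling rule with linear warmup (Goyal et al., 2017 style).

    Target LR is base_lr * world_size; ramp linearly from base_lr over
    warmup_steps to avoid instability at the large initial learning rate.
    """
    target = base_lr * world_size
    if step >= warmup_steps:
        return target
    return base_lr + (target - base_lr) * step / warmup_steps

# 8 GPUs, base LR 1e-3: warm up to 8e-3 over 500 steps
print(scaled_lr(1e-3, 8, 0, 500))    # 0.001 at the start
print(scaled_lr(1e-3, 8, 500, 500))  # 0.008 once warmup completes
```

In practice this is usually expressed as a LambdaLR or OneCycle schedule, but the arithmetic is exactly this.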

Model & Pipeline Parallelism (FSDP, Megatron)

When a model is too large to fit in a single GPU's memory — even with gradient checkpointing — the model itself must be sharded across GPUs. Two main strategies exist. Tensor parallelism (as implemented by Megatron-LM) splits individual weight matrices and their corresponding computations across GPUs. For a transformer's attention layer, the Q, K, V projection matrices can each be column-sharded across T GPUs, with each GPU computing its shard of the attention output. The shards are then all-gathered for the subsequent operations. Tensor parallelism has very fine communication granularity (all-reduces at every layer boundary), so it requires the fast intra-node NVLink bandwidth to be effective and is rarely applied across nodes.
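
Column sharding can be demonstrated without any GPUs. A single-process sketch (sizes illustrative) where torch.chunk stands in for per-device shards and torch.cat for the all-gather:

```python
import torch

# Single-process simulation of a column-parallel linear layer: the weight's
# output dimension is split across T "devices"; each computes its slice of
# the output, and an all-gather (here: torch.cat) reassembles the result.
torch.manual_seed(0)
T = 4                                   # tensor-parallel degree
x = torch.randn(8, 512)                 # activations, replicated on every shard
W = torch.randn(512, 2048)              # full QKV-style projection weight

full = x @ W                            # unsharded reference

shards = W.chunk(T, dim=1)              # one column shard per "device"
partials = [x @ w for w in shards]      # each device computes its slice
gathered = torch.cat(partials, dim=1)   # all-gather along the output dim

assert torch.allclose(full, gathered, atol=1e-5)
```

On real hardware the cat becomes an NCCL all-gather at every layer boundary — which is why tensor parallelism demands NVLink-class bandwidth.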

PyTorch's Fully Sharded Data Parallel (FSDP) implements ZeRO-3 (Zero Redundancy Optimizer Stage 3) — each GPU stores only a shard of the model parameters, gradients, and optimiser states. Before each forward or backward operation, the required parameters are all-gathered from all GPUs, the computation is performed, and then the parameters are immediately discarded (re-sharded). This gives the memory footprint of model parallelism with the programming model of data parallelism — you wrap any nn.Module with FSDP and it handles the sharding transparently. FSDP can train models of arbitrary size as long as a single layer's parameters fit in memory. In practice, it enables training of 70B+ parameter models on 8 A100s, where DDP would require hundreds of GPUs just to hold the model state.

Pipeline parallelism divides the model's layers into stages and assigns each stage to a different set of GPUs. One GPU runs the first N layers; the next GPU receives intermediate activations and runs the following N layers; and so on, like an assembly line. This allows models of arbitrary depth to be trained across GPUs that cannot hold the full model individually. The main challenge is the "pipeline bubble": while GPU 1 is processing the backward pass of batch k, downstream GPUs sit idle waiting for work. Efficient pipeline schedules (GPipe, PipeDream, 1F1B) shrink the bubble to roughly (stages − 1)/(micro-batches + stages − 1) of total time, so keeping many micro-batches in flight per pipeline flush keeps the idle fraction small.
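
The bubble cost can be estimated in one line. A sketch using the GPipe-style estimate (stages − 1)/(micro-batches + stages − 1):

```python
def bubble_fraction(stages, microbatches):
    """Idle fraction for a GPipe-style pipeline schedule:
    (stages - 1) / (microbatches + stages - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

# More micro-batches per flush shrink the bubble for a fixed stage count
for m in (4, 16, 64):
    print(f"4 stages, {m} micro-batches: {bubble_fraction(4, m):.1%} idle")
```

With 4 stages, going from 4 to 64 micro-batches drops the idle fraction from roughly 43% to under 5% — the reason pipeline training always batches many micro-batches per optimizer step.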

Distributed Training Strategies Comparison

Strategy How It Splits Work GPU Memory Reduction Communication Overhead When to Use Tool / Library
DDP (Data Parallel) Each GPU sees different data batch; full model replica on each GPU None (full model copy per GPU) Low — one all-reduce per backward pass Model fits on one GPU; scale throughput PyTorch DDP, Horovod
Pipeline Parallelism Model layers split across GPUs in stages (GPU1 = layers 1–8, GPU2 = layers 9–16...) Linear reduction by number of stages Medium — micro-batch bubble overhead (10–20%) Very deep models; sequential layer chains Megatron-LM, GPipe, PyTorch PipelineParallel
Tensor Parallelism Individual weight matrices column/row-sharded across GPUs Linear by tensor-parallel degree High — all-reduce at every layer; requires NVLink Largest models; intra-node only Megatron-LM, tensor-parallel in NeMo
ZeRO (DeepSpeed) Shards optimizer states (Stage 1), gradients (Stage 2), or params+grads+optimizer (Stage 3) across GPUs Up to 8x (ZeRO-3 vs DDP) Low–Medium — extra all-gathers vs DDP Large models; when FSDP API is limiting DeepSpeed, PyTorch FSDP (ZeRO-3 equivalent)

Mixed Precision & GPU Profiling

Floating-point numbers encode three components: sign (1 bit), exponent (dynamic range), and mantissa (precision). FP32 uses 8 exponent bits and 23 mantissa bits. FP16 uses 5 exponent bits and 10 mantissa bits — more mantissa precision than BF16, but a much smaller dynamic range (maximum value ~65,504 vs FP32's ~3.4x10^38). BF16 (bfloat16) retains FP32's 8 exponent bits but reduces the mantissa to 7 bits, preserving the full dynamic range while halving memory and doubling Tensor Core throughput. BF16 has become the default for LLM training precisely because gradient magnitudes span many orders of magnitude during training — FP16's limited dynamic range causes underflow or overflow in gradients without dynamic loss scaling, while BF16 rarely does.
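
These dynamic ranges follow directly from the bit layouts. A minimal sketch computing the largest finite value of each format from its exponent and mantissa widths:

```python
def fp_max(exp_bits, man_bits):
    """Largest finite value of an IEEE-style format:
    (2 - 2**-man_bits) * 2**bias, where bias = 2**(exp_bits-1) - 1
    (the top exponent code is reserved for inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2 ** -man_bits) * 2 ** bias

print(f"FP16 (e5m10) max: {fp_max(5, 10):,.0f}")    # 65,504
print(f"BF16 (e8m7)  max: {fp_max(8, 7):.3e}")      # ~3.39e38
print(f"FP32 (e8m23) max: {fp_max(8, 23):.3e}")     # ~3.40e38
```

The same function applied to FP8's E4M3 and E5M2 layouts gives maxima of 448 and 57,344 respectively, which is why E5M2 is preferred for wide-ranging gradients.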

FP8 (supported natively by the H100 Transformer Engine) goes further, with two variants: E4M3 (4 exponent bits, 3 mantissa, higher precision) for the forward pass and E5M2 (5 exponent bits, 2 mantissa, higher range) for gradients. FP8 roughly doubles throughput over BF16 again, but requires careful per-tensor scaling to avoid precision loss. Published FP8 training results — from NVIDIA's Transformer Engine recipes and open models such as DeepSeek-V3 — show FP8 loss curves matching BF16 with no significant quality degradation when scaling is implemented carefully.

FP16, BF16 & AMP

PyTorch's Automatic Mixed Precision (AMP) makes mixed-precision training nearly transparent. The torch.amp.autocast context manager wraps the forward pass and automatically casts supported operations to BF16 or FP16 — matmuls, convolutions, and attention ops are cast down for speed, while reduction operations (softmax, layer norm) remain in FP32 for stability. For FP16 training, the GradScaler multiplies the loss by a large scaling factor before the backward pass and divides gradients back afterward, preventing FP16 underflow in small gradient values. With BF16, GradScaler is typically unnecessary. The master weights pattern keeps FP32 parameter copies for the optimiser update step while using BF16 for the forward and backward passes — this ensures that small gradient updates applied by Adam are not lost to BF16's reduced mantissa precision.

GPU Profiling & Memory Optimisation — Code

Before optimising, profile. PyTorch's built-in profiler captures CPU and CUDA execution timelines, memory allocations, and kernel launch statistics. The example below demonstrates profiling a training loop and applying mixed-precision and gradient checkpointing to address the bottlenecks found. The profiler output integrates directly with TensorBoard's trace viewer for visual inspection.

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

# Profile GPU memory and compute utilization
model = LargeTransformerModel().cuda()   # placeholder for your model class
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters())
scaler = torch.amp.GradScaler("cuda")    # loss scaling for FP16 AMP

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
    on_trace_ready=tensorboard_trace_handler("./profiler_logs")
) as prof:
    for step, (x, y) in enumerate(train_loader):
        x, y = x.cuda(), y.cuda()

        # Mixed precision: up to ~2x speedup, ~50% activation memory reduction
        with torch.amp.autocast("cuda", dtype=torch.float16):
            output = model(x)
            loss = criterion(output, y)

        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        if step == 5: break  # profile the first few steps only

print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))
# Output: which operations consume the most GPU memory
# Common bottleneck: attention's O(n^2) memory for long sequences

# Memory optimization techniques:
torch.cuda.empty_cache()   # release cached-but-unused allocator memory
model = model.half()       # FP32 -> FP16 weights: 2x less VRAM (inference only)
# Gradient checkpointing: recompute activations in backward instead of storing.
# checkpoint_sequential is called inside the forward pass, splitting the
# sequential module into segments:
from torch.utils.checkpoint import checkpoint_sequential
out = checkpoint_sequential(model.encoder, 4, x)
# Roughly 4x less activation memory for the encoder at ~20-30% compute overhead

Flash Attention & Kernel Fusion

Standard scaled dot-product attention computes Q*K^T, divides by sqrt(d_k), applies softmax, and multiplies by V. The problem is that the full N x N attention matrix must be written to HBM and read back for the softmax, generating O(N^2) memory traffic in addition to O(N^2 * d) arithmetic. For N=4,096 in BF16, the score matrix is roughly 34 MB per attention head per sequence — written and read repeatedly, a bandwidth cost that dwarfs the actual compute. FlashAttention (Dao et al., 2022) rewrites attention to be IO-aware: it tiles Q, K, V matrices into blocks that fit within SRAM, computes the softmax incrementally using the online softmax algorithm, and produces the output without ever writing the full attention matrix to HBM. On an A100, FlashAttention achieves 2–4x speedup over standard attention and a roughly 5–20x reduction in HBM memory use, depending on sequence length.
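
The online softmax trick at the heart of FlashAttention fits in a few lines of NumPy. A minimal sketch (block size illustrative) that processes scores block by block while tracking a running max and normaliser — the real kernel additionally rescales partial outputs as it goes instead of taking a second pass:

```python
import numpy as np

def online_softmax(scores, block=4):
    """Streaming softmax over blocks: the full row is never needed at once."""
    m, d = -np.inf, 0.0                       # running max, running denominator
    for i in range(0, len(scores), block):
        blk = scores[i:i + block]
        m_new = max(m, blk.max())
        # rescale the old partial sum to the new max, then add the block
        d = d * np.exp(m - m_new) + np.exp(blk - m_new).sum()
        m = m_new
    return np.exp(scores - m) / d             # final pass for the outputs

x = np.random.default_rng(0).normal(size=17)
reference = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online_softmax(x), reference)
```

Because each block needs only the running (max, denominator) pair, the N x N score matrix never has to leave SRAM in tiled attention.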

FlashAttention-2 improved work partitioning across warps within each SM, achieving better GPU utilisation, and introduced support for variable-length sequences and GQA/MQA. FlashAttention-3 (targeting Hopper and later) exploits the H100's asynchronous copy engine (TMA) and WGMMA instructions to overlap data loading with compute, and adds FP8 support with per-tile quantisation. FlashAttention is now the default attention implementation in PyTorch's F.scaled_dot_product_attention (SDPA), which dispatches to it automatically when inputs are on CUDA. More broadly, kernel fusion — combining multiple elementwise operations (ReLU, LayerNorm, residual adds, bias terms) into a single GPU kernel — eliminates intermediate HBM reads and writes. The OpenAI Triton language makes writing custom fused kernels practical without raw CUDA expertise, and torch.compile performs automatic kernel fusion for PyTorch graphs.

Inference Optimisation & Serving

Inference has fundamentally different constraints from training. Training optimises for total compute efficiency — keeping GPUs utilised as long as possible. Inference must optimise for time-to-first-token (TTFT), inter-token latency, throughput (tokens per second per GPU), and cost per token — often simultaneously, with conflicting trade-offs. The central resource constraint is HBM: at inference time, weights must be loaded once (a large fixed cost), and then the KV cache for every active request must fit alongside them. For a Llama 3 70B model at BF16, the weights alone consume about 140 GB — more than one H100 can hold. Serving requires tensor parallelism across at least two H100s just for the weights, leaving limited HBM headroom for KV cache.
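
The KV-cache pressure is simple arithmetic. A sketch using Llama-3-70B-like geometry (80 layers, 8 KV heads via GQA, head dimension 128 — figures for illustration):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim
    x seq_len x batch x bytes per element (BF16/FP16 -> 2 bytes)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Llama-3-70B-like geometry: 80 layers, 8 KV heads (GQA), head_dim 128
print(f"{kv_cache_gb(80, 8, 128, 8192, 1):.1f} GB per 8k-token sequence")
print(f"{kv_cache_gb(80, 8, 128, 8192, 32):.0f} GB for a batch of 32")
```

At 32 concurrent 8k-token requests the cache alone exceeds an H100's 80 GB, which is why PagedAttention-style cache management and GQA both matter so much for serving.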

Continuous batching (as implemented in vLLM, TGI, and SGLang) is the most impactful single serving optimisation. Naive static batching waits for a fixed batch of requests to arrive before starting a forward pass, and holds every slot until the longest sequence finishes — wasting GPU cycles on empty slots. Continuous batching instead inserts new requests into the batch as soon as slots become available after other sequences finish, keeping GPU utilisation high. vLLM's PagedAttention manages KV cache using virtual memory paging, allocating non-contiguous HBM pages to each sequence rather than requiring contiguous buffers — typically achieving 2–4x higher throughput than naive implementations on the same hardware.

Speculative decoding uses a small, fast draft model to generate candidate next tokens, which the large target model then verifies in a single parallel forward pass. Because the verification step processes multiple tokens simultaneously (rather than one at a time), and because most draft proposals are accepted, effective throughput improves by 2–3x with no change to the output distribution. Weight quantisation for inference — GPTQ (post-training quantisation to INT4 or INT8 per weight), AWQ (activation-aware weight quantisation), and GGUF (the format used by llama.cpp for CPU and consumer GPU inference) — trades a small amount of accuracy for roughly 4x memory reduction and 2–3x throughput improvement on memory-bandwidth-bound generation.

Kubernetes ML Inference Deployment — Code

Production ML serving in enterprise environments almost always runs on Kubernetes. The following example shows a complete GPU-accelerated inference deployment with health probes, resource requests, horizontal pod autoscaling, and proper secrets management for model registry URIs. The NVIDIA device plugin must be installed as a DaemonSet on the cluster for GPU scheduling to work.

# GPU-accelerated inference deployment on Kubernetes
# Assumes: NVIDIA device plugin installed, kubeconfig configured

# 1. Create the deployment manifest
cat > churn-api-deployment.yaml << 'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-prediction-api
  labels:
    app: churn-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: churn-api
  template:
    metadata:
      labels:
        app: churn-api
    spec:
      containers:
      - name: churn-api
        image: acr.azurecr.io/churn-api:v2.1.0
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
            nvidia.com/gpu: "1"    # request 1 GPU per pod
          limits:
            memory: "4Gi"
            cpu: "2"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_URI
          valueFrom:
            secretKeyRef:
              name: mlflow-secrets
              key: model-uri
        livenessProbe:
          httpGet: {path: /health, port: 8080}
          initialDelaySeconds: 30
        readinessProbe:
          httpGet: {path: /health, port: 8080}
          initialDelaySeconds: 15
EOF

kubectl apply -f churn-api-deployment.yaml

# 2. Horizontal Pod Autoscaler: scale on CPU utilization (kubectl autoscale
#    cannot target GPU utilization directly; that requires custom metrics,
#    e.g. the DCGM exporter with a Prometheus adapter)
kubectl autoscale deployment churn-prediction-api \
  --min=2 --max=10 \
  --cpu-percent=70

# 3. Monitor deployment
kubectl rollout status deployment/churn-prediction-api
kubectl top pods -l app=churn-api

Key Insight: GPU memory — specifically HBM capacity and bandwidth — is the single most important resource constraint in AI systems. Everything else is downstream of it. The number of parameters you can train, the batch size you can run, the context length you can serve, and the number of concurrent requests you can handle all reduce, ultimately, to how much HBM you have and how fast you can read from it. When evaluating hardware, optimisations, or architectures, always ask the memory question first.

Production Warning: Distributed training bugs are among the hardest to reproduce and diagnose in all of software engineering. A gradient synchronisation error or a ZeRO sharding bug may not surface until thousands of GPU-hours into a training run — by which time you have wasted significant compute and cannot easily reconstruct the faulty state. Enable deterministic mode (torch.use_deterministic_algorithms(True)) in early debugging, instrument gradient norms and loss scaling coefficients as training metrics from the first step, and establish a "canary run" pattern — a short 100-step training run on a small cluster with known-good outputs used to validate any infrastructure change before committing to a full run.

Practical Exercises

These exercises ground the infrastructure concepts in hands-on experimentation. Work through them from beginner to advanced — each builds your intuition about the hardware constraints that govern large-scale AI. You do not need expensive hardware for the first two; a free Google Colab GPU (T4) or Kaggle P100 is sufficient.

Exercise 1 Beginner

Profile a ResNet-50 Forward Pass

Profile a forward pass of a ResNet-50 model using torch.profiler. Identify the top 3 most time-consuming operations. For each, record: (a) the kernel name, (b) the CUDA time vs CPU time, (c) whether it is memory-bandwidth-bound or compute-bound. Compare profiling results on CPU vs GPU. What is the most surprising finding? Inspect the memory timeline — at which layer does peak activation memory occur? Use prof.key_averages().table(sort_by="cuda_time_total") to generate your report. As an extension, repeat the profiling with and without torch.compile(model) and document the difference in kernel fusion and total CUDA time.

Exercise 2 Intermediate

Mixed-Precision Training Comparison

Implement mixed-precision training (FP16) for a CNN. Using a fixed architecture and dataset (e.g., ResNet-18 on CIFAR-100), measure three things: (a) training speed improvement in samples/second for FP32 vs FP16, (b) peak memory reduction using torch.cuda.max_memory_allocated(), and (c) final model accuracy vs full FP32 precision after the same number of epochs. Repeat for BF16 if your hardware supports it. Document: does the AMP overhead during the first epoch differ from subsequent epochs? At what batch size does AMP provide the largest relative speedup, and why? Note any numerical instability symptoms (loss spikes, NaN gradients) and how the GradScaler prevents them.

Exercise 3 Advanced

2-GPU DDP Scaling Efficiency

Set up 2-GPU DDP training on a multi-GPU machine (or simulate with 2 processes on CPU using the Gloo backend for testing the code path without expensive hardware). Measure the scaling efficiency vs single-GPU training: (a) compute throughput in samples/second per GPU, (b) gradient synchronisation overhead using NCCL activity tracking via the profiler, and (c) overall wall-clock time to reach a target validation accuracy. Calculate the actual scaling efficiency as: (aggregate 2-GPU throughput) / (2 x single-GPU throughput); a value of 1.0 indicates perfect linear scaling. Benchmark across three model sizes (small/medium/large) to see how scaling efficiency varies with the compute-to-communication ratio. Document the crossover point where communication overhead becomes significant, and explain why it scales with model size in the direction it does.
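The efficiency calculation for this exercise can be sketched as a small helper; the throughput figures in the usage line are illustrative, not measurements:

```python
def scaling_efficiency(single_gpu_tput: float,
                       multi_gpu_tput: float, n_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved:
    aggregate multi-GPU throughput / (n_gpus x single-GPU throughput).
    1.0 = perfect linear scaling; real DDP runs land below it."""
    return multi_gpu_tput / (n_gpus * single_gpu_tput)

# e.g. 1,000 samples/s on one GPU, 1,800 samples/s aggregate on two:
print(scaling_efficiency(1000.0, 1800.0, 2))  # 0.9
```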

AI Infrastructure Plan Generator

Use this tool to document your AI infrastructure architecture decisions and generate a professional planning document. Define your hardware choices, training strategy, and operational approach — then export as Word, Excel, PDF, or PowerPoint for sharing with engineering leadership and finance stakeholders.

Conclusion & Next Steps

The AI infrastructure stack is not a detail left to platform engineers — it is the physical substrate that determines what models can be built, at what cost, and with what reliability. Working upward from silicon: GPU and TPU architectures provide the raw compute through Tensor Cores and systolic arrays optimised for matrix multiply. HBM provides the memory bandwidth that feeds those compute units, and its capacity constrains everything from trainable model size to serving batch size. Activation checkpointing, gradient accumulation, and mixed precision (BF16, FP8) extend what can be done within fixed hardware budgets. Distributed training — data parallelism via FSDP and ZeRO, tensor parallelism via Megatron-LM, pipeline parallelism for depth — scales training to models far too large for any single device. FlashAttention and kernel fusion squeeze efficient use out of every byte of HBM bandwidth. And at the serving layer, continuous batching, PagedAttention, speculative decoding, and quantisation make inference economically viable at scale.

Infrastructure knowledge compounds with model knowledge. A practitioner who understands only model architecture cannot diagnose why training loss diverged after a checkpoint restart, or why serving latency doubles under moderate concurrency, or why gradient norms explode at a specific layer configuration. Conversely, an infrastructure engineer who does not understand the model has no principled basis for choosing parallelism strategies or precision formats. The practitioners who build the most capable and reliable AI systems are those who hold both levels of understanding simultaneously — and who know which knob to turn when the system misbehaves.

As you move into the final two articles of this series, carry the infrastructure lens with you: the governance and policy questions in Parts 23 and 24 are ultimately questions about what gets built with this machinery, and understanding the machinery makes the governance stakes concrete. A responsible AI practitioner who understands that a 70B parameter model requires a minimum of four H100s to serve at production scale has a fundamentally different intuition about the economics, access, and policy dimensions of AI than one who treats models as abstract services delivered by an invisible cloud.

Next in the Series

In Part 23: Responsible AI Governance, we examine how organisations govern the AI systems they build — covering risk frameworks (NIST AI RMF, EU AI Act), model cards, datasheets, AI auditing, and the organisational practices that separate responsible deployment from theatre.
