AI in the Wild
Part 6 of 24
About This Article
This article covers reinforcement learning from first principles through to production deployments — Markov Decision Processes, Q-learning and DQN, policy gradient and actor-critic algorithms, and the RLHF pipeline that underlies modern LLM alignment. Real-world applications across robotics, data centre control, and LLM fine-tuning are surveyed throughout.
Q-Learning
Policy Gradients
RLHF
In this series:
1. AI & ML Landscape Overview: Paradigms, ecosystem map, real-world applications at a glance
2. ML Foundations for Practitioners: Supervised learning, bias-variance, model evaluation
3. Natural Language Processing: Tokenization, embeddings, transformers, semantic search
4. Computer Vision in the Real World: CNNs, ViTs, detection, segmentation, deployment patterns
5. Recommender Systems: Collaborative filtering, content-based, two-tower models
6. Reinforcement Learning Applications: Q-learning, policy gradients, RLHF, real-world deployments (you are here)
7. Conversational AI & Chatbots: Dialogue systems, intent detection, RAG, production bots
8. Large Language Models: Architecture, scaling laws, capabilities, limitations
9. Prompt Engineering & In-Context Learning: Chain-of-thought, few-shot, structured outputs, prompt patterns
10. Fine-tuning, RLHF & Model Alignment: LoRA, instruction tuning, DPO, alignment techniques
11. Generative AI Applications: Diffusion models, GANs, image/audio/video generation
12. Multimodal AI: Vision-language models, audio-text, cross-modal retrieval
13. AI Agents & Agentic Workflows: Tool use, planning, memory, multi-agent orchestration
14. AI in Healthcare & Life Sciences: Diagnostics, drug discovery, clinical NLP, regulatory landscape
15. AI in Finance & Fraud Detection: Credit scoring, anomaly detection, algorithmic trading
16. AI in Autonomous Systems & Robotics: Perception, planning, control, sim-to-real transfer
17. AI Security & Adversarial Robustness: Adversarial attacks, poisoning, model extraction, defences
18. Explainable AI & Interpretability: SHAP, LIME, attention, mechanistic interpretability
19. AI Ethics & Bias Mitigation: Fairness metrics, dataset auditing, debiasing techniques
20. MLOps & Model Deployment: CI/CD for ML, feature stores, monitoring, drift detection
21. Edge AI & On-Device Intelligence: Quantization, pruning, TFLite, CoreML, embedded inference
22. AI Infrastructure, Hardware & Scaling: GPUs, TPUs, distributed training, memory hierarchy
23. Responsible AI Governance: Risk frameworks, model cards, auditing, organisational practice
24. AI Policy, Regulation & Future Directions: EU AI Act, global frameworks, emerging risks, what's next
RL Fundamentals
Reinforcement learning is the third major paradigm of machine learning,
distinct from supervised learning (learn from labelled examples)
and unsupervised learning (learn structure from unlabelled data).
In RL, an agent learns by interacting with an environment:
at each timestep t, the agent observes the current state s_t,
selects an action a_t according to its policy π(a|s),
receives a scalar reward r_t from the environment,
and transitions to a new state s_{t+1}.
The agent's objective is to learn a policy that maximises the cumulative discounted return
G_t = r_t + γr_{t+1} + γ²r_{t+2} + ...,
where the discount factor γ ∈ [0,1) controls the relative weight of immediate vs. future rewards.
When γ is close to zero, the agent optimises for immediate reward;
when γ is close to one, it takes a long-term view.
The critical distinction from supervised learning is that there are no labelled correct actions
— the agent must discover good behaviour through trial and error.
Rewards may be sparse (arriving only at episode termination),
delayed (the consequences of an action may not manifest for hundreds of steps),
and potentially deceptive (an action that produces immediate reward may lead to poor long-term outcomes).
Furthermore, the actions an agent takes affect the distribution of future states it will observe
— there is no fixed i.i.d. dataset, and the data collection process is coupled to the learning process.
These properties make RL significantly more challenging to apply reliably than supervised learning,
but also applicable to domains where labelled training data simply does not exist.
Key Insight: The core difficulty of reinforcement learning is the credit assignment problem: when a reward arrives hundreds of steps after the action that caused it — a chess game that lasted 80 moves, a robot manipulation task that required 200 actions — how does the agent identify which of its actions deserves credit for the outcome? Solving credit assignment efficiently and accurately is what separates practical RL algorithms from theoretical curiosities, and it is why modern RL methods invest heavily in value function estimation.
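The return definition above can be checked numerically. This minimal sketch (plain Python, with an illustrative sparse-reward episode) shows how the discount factor γ re-weights the same terminal reward:

```python
def discounted_return(rewards, gamma):
    """G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., computed backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0.0, 0.0, 0.0, 1.0]  # sparse reward arriving at episode termination

print(discounted_return(rewards, gamma=0.99))  # 0.99^3 ~= 0.970: far-sighted
print(discounted_return(rewards, gamma=0.5))   # 0.5^3 = 0.125: myopic
```

The backwards accumulation is the standard trick for computing all returns of an episode in a single pass, and the same loop appears inside most Monte Carlo and REINFORCE implementations.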
Markov Decision Processes
The RL problem is formally specified as a Markov Decision Process (MDP):
a tuple (S, A, P, R, γ) where S is the set of possible states,
A is the set of possible actions,
P(s' | s, a) is the transition probability distribution (the environment's dynamics),
R(s, a) is the expected immediate reward for taking action a in state s,
and γ is the discount factor.
The Markov property is the key assumption:
the probability of transitioning to s' depends only on the current state s and action a,
not on the full history of prior states and actions.
This assumption is often violated in practice
(think of a patient's medical history or a user's long browsing history),
but it is a good enough approximation that MDP-based algorithms work well across an enormous range of problems.
The value function V^π(s) represents the expected return starting from state s and following policy π:
V^π(s) = E_π[G_t | s_t = s].
The action-value function Q^π(s, a) extends this to include the choice of action:
Q^π(s, a) = E_π[G_t | s_t = s, a_t = a].
Both satisfy Bellman equations — recursive consistency conditions that express
the value of a state in terms of the values of successor states.
The optimal value function V*(s) = max_π V^π(s) satisfies the Bellman optimality equation,
and the optimal policy is the greedy policy with respect to Q*: π*(s) = argmax_a Q*(s, a).
Many RL algorithms can be understood as different approaches to approximating
or directly optimising these quantities.
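The Bellman optimality backup can be made concrete with a toy example. The following sketch runs value iteration on an assumed three-state deterministic chain (the dynamics and reward are illustrative, not from the text):

```python
import numpy as np

# Toy deterministic MDP: states {0, 1, 2} on a chain; reward 1.0 is earned
# on the transition into the terminal state 2.
n_states, n_actions, gamma = 3, 2, 0.9

def step(s, a):
    """Returns (next_state, reward). Action 0 moves left, action 1 moves right."""
    s_next = max(0, s - 1) if a == 0 else min(2, s + 1)
    reward = 1.0 if (s_next == 2 and s != 2) else 0.0
    return s_next, reward

V = np.zeros(n_states)
for _ in range(100):  # repeated Bellman optimality backups until convergence
    for s in range(2):  # state 2 is terminal; V(2) stays 0
        V[s] = max(step(s, a)[1] + gamma * V[step(s, a)[0]] for a in range(n_actions))

# Optimal policy: greedy with respect to the one-step lookahead values
policy = [max(range(n_actions), key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in range(2)]
print(V, policy)  # V = [0.9, 1.0, 0.0], policy = [1, 1] (always move right)
```

Note how V(0) = γ·V(1) = 0.9: the value of a state is the discounted value of its best successor, exactly the recursive consistency condition described above.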
Exploration vs. Exploitation
Every RL agent faces a fundamental dilemma at each decision step: exploit the action it currently believes to be best (maximise immediate reward based on existing knowledge), or explore less-certain actions that might be better (sacrifice immediate reward for information that improves future decisions). The multi-armed bandit — a simplified RL problem with no state transitions, analogous to choosing between K slot machines with unknown payout probabilities — isolates this tradeoff cleanly. In the bandit setting, epsilon-greedy selects a random action with probability ε and the current best action otherwise. UCB (Upper Confidence Bound) augments the estimated action value with a confidence bonus proportional to uncertainty, implementing the principle of optimism under uncertainty. Thompson sampling maintains a Bayesian posterior over each action's value, samples one estimate from each posterior, and selects the action with the highest sample, so exploration naturally tapers off as the posteriors sharpen.
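The bandit strategies above are each only a few lines of code. Here is a minimal UCB1 sketch on assumed Bernoulli arms (the payout probabilities and the bonus coefficient are illustrative):

```python
import math
import random

def ucb1(true_probs, horizon=5000, c=2.0):
    """UCB1: play the arm maximising mean + sqrt(c * ln t / n_pulls)."""
    k = len(true_probs)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialise the estimates
        else:
            # Estimated value plus optimism-under-uncertainty bonus
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(c * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
counts = ucb1([0.3, 0.5, 0.7])
print(counts)  # the large majority of pulls go to the 0.7 arm
```

The bonus term shrinks as an arm accumulates pulls, so under-explored arms are periodically revisited while the best arm dominates in the long run.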
In full RL settings with state transitions, exploration is harder:
actions affect which states are visited,
so poor exploration can trap the agent in a local region of state space.
Intrinsic motivation methods address this by adding a curiosity bonus reward
— typically based on the agent's surprise at the outcome,
measured as the prediction error of a learned environment model
— to the extrinsic task reward.
Random Network Distillation (RND) maintains a frozen random target network
and a predictor network; high prediction error signals novel states
and generates intrinsic reward proportional to that novelty.
Count-based exploration bonuses
— giving bonus reward proportional to N^{-0.5} where N is the visit count for a state
— provide theoretically grounded exploration with provable regret bounds in tabular settings.
In production RL systems — data centre control, robotics, advertising bid optimisation
— the exploration constraint is almost always safety, not algorithmic sophistication.
Agents must not take catastrophic actions while exploring,
which typically requires constrained exploration within certified-safe action boundaries,
human oversight of novel actions,
and conservative initialisation from a supervised policy trained on historical demonstrations.
Temporal Difference Learning & Bootstrapping
Temporal Difference (TD) learning is the algorithmic cornerstone of modern RL. Unlike Monte Carlo methods — which wait until the end of an episode to compute the return G_t and update value estimates — TD methods update value estimates at every step using a bootstrapped target. The one-step TD update for a value function: V(s_t) ← V(s_t) + α[r_{t+1} + γV(s_{t+1}) − V(s_t)]. The bracketed term is the TD error: the difference between the estimated value of the current state and a target that combines the immediate reward with the discounted estimated value of the next state. This bootstrapping — using the current value estimate in the target rather than the true return — introduces bias but dramatically reduces variance, enabling learning from individual transitions rather than complete episodes.
n-step TD generalises between one-step TD and Monte Carlo: the n-step return accumulates rewards for n timesteps before bootstrapping from the value estimate: G_t^{(n)} = r_{t+1} + γr_{t+2} + ... + γ^{n-1}r_{t+n} + γ^n V(s_{t+n}). n=1 is one-step TD; n=∞ is Monte Carlo. TD(λ) further generalises by computing a weighted average of n-step returns for all n, with weights decaying exponentially by λ. The eligibility trace — an auxiliary vector that accumulates the gradient of recently visited states — provides an efficient online implementation of TD(λ) without explicitly computing all n-step returns. Generalised Advantage Estimation (GAE), used ubiquitously in PPO implementations, applies the TD(λ) idea to advantage estimation, providing a flexible bias-variance tradeoff controlled by the λ parameter.
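The one-step TD update can be demonstrated on the classic five-state random walk (Sutton & Barto's standard prediction example); the hyperparameters below are illustrative:

```python
import random

# TD(0) prediction on a 5-state random walk under a uniform random policy:
# reward 1 on exiting right, 0 on exiting left, gamma = 1.
# The true values of the five states are 1/6, 2/6, 3/6, 4/6, 5/6.
random.seed(1)
V = [0.5] * 5    # value estimates for the non-terminal states
alpha = 0.05     # step size

for _ in range(5000):
    s = 2  # every episode starts in the centre state
    while True:
        s_next = s + random.choice([-1, 1])
        if s_next == 5:       # exit right: terminal transition, reward 1
            V[s] += alpha * (1.0 - V[s])
            break
        if s_next == -1:      # exit left: terminal transition, reward 0
            V[s] += alpha * (0.0 - V[s])
            break
        # One-step TD update: V(s) <- V(s) + alpha * (r + gamma*V(s') - V(s))
        V[s] += alpha * (0.0 + V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V])  # approaches [0.17, 0.33, 0.5, 0.67, 0.83]
```

Each update uses only a single transition and the current estimate of the successor state — the bootstrapping described above — rather than waiting for the episode's final outcome.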
Value-Based Methods
Value-based RL methods learn an approximation of the optimal Q-function Q*(s, a) — the maximum expected return achievable from state s by taking action a and then following the optimal policy. The greedy policy with respect to Q* is the optimal policy: π*(s) = argmax_a Q*(s, a). Q-functions are preferred over value functions for control because they encode action-level information directly: given Q*(s, ·), the agent can select the best action without needing a model of the environment's transition dynamics.
Q-Learning & DQN
The Q-learning update rule is derived from the Bellman optimality equation: Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]. The term in brackets is the temporal difference (TD) error: the difference between the current Q-value estimate and a bootstrapped target that combines the immediate reward with the discounted value of the next state. Q-learning is off-policy: it learns the optimal Q-function regardless of which policy generated the transition data, allowing the agent to learn from any collected experience including that of a less-skilled behaviour policy (such as epsilon-greedy). This makes Q-learning naturally compatible with experience replay.
Deep Q-Network (DQN, Mnih et al., DeepMind 2015) replaces the Q-table with a convolutional neural network that takes raw game screen pixels as input and outputs Q-values for each of the discrete actions. Training naively with neural function approximators is unstable: the Q-network is used both to compute targets and to update, creating a moving-target optimisation problem, and consecutive frames from a game are highly correlated. DQN introduced two stabilisation techniques: experience replay (store transitions in a replay buffer and sample random mini-batches for each gradient update, breaking temporal correlation) and a target network (maintain a separate, periodically updated copy of the Q-network to compute the TD targets, keeping the optimisation target stable). The resulting system was evaluated on 49 Atari 2600 games from raw pixels, using the same architecture and hyperparameters across all games, and reached or exceeded human-level performance on the majority of them.
Q-Learning: CartPole Code
The following example implements tabular Q-learning on CartPole-v1 with epsilon-greedy exploration and discretised state space. This is the foundational RL pattern from which DQN extends by replacing the Q-table with a neural network:
import gymnasium as gym
import numpy as np
import random

class QLearningAgent:
    """Tabular Q-learning agent for CartPole-v1 with a discretised state space."""

    def __init__(self, state_dim, action_dim):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.gamma = 0.99        # discount factor
        self.alpha = 0.1         # learning rate
        self.epsilon = 1.0       # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.q_table = {}        # maps discretised state -> array of action values

    def get_state_key(self, state):
        # Discretise the continuous state space for tabular Q-learning
        return tuple(np.round(state, 1))

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, exploit otherwise
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)  # explore
        key = self.get_state_key(state)
        if key not in self.q_table:
            return random.randint(0, self.action_dim - 1)
        return int(np.argmax(self.q_table[key]))  # exploit

    def learn(self, state, action, reward, next_state, done):
        key = self.get_state_key(state)
        next_key = self.get_state_key(next_state)
        if key not in self.q_table:
            self.q_table[key] = np.zeros(self.action_dim)
        if next_key not in self.q_table:
            self.q_table[next_key] = np.zeros(self.action_dim)
        # Q-learning update derived from the Bellman optimality equation
        target = reward + (0 if done else self.gamma * np.max(self.q_table[next_key]))
        self.q_table[key][action] += self.alpha * (target - self.q_table[key][action])
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# Training loop
env = gym.make('CartPole-v1')
agent = QLearningAgent(state_dim=4, action_dim=2)
for episode in range(500):
    state, _ = env.reset()
    total_reward = 0
    for step in range(500):
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.learn(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            break
    if episode % 100 == 0:
        print(f"Episode {episode}: reward={total_reward:.0f}, epsilon={agent.epsilon:.3f}")
Double DQN & Improvements
Vanilla DQN systematically overestimates Q-values because it uses the same noisy network both to select the action (argmax_a' Q(s', a')) and to evaluate that action. Maximising over a noisy estimate introduces positive bias. Double DQN (van Hasselt et al., 2015) decouples these two steps: use the online network to select the action, and the target network to evaluate it. This simple change reduces overestimation bias substantially and improves performance across the Atari suite.
Dueling DQN introduces an architectural decomposition: instead of outputting Q(s, a) directly, the network has two streams — a value stream V(s) estimating the state value regardless of action, and an advantage stream A(s, a) estimating how much better each action is relative to the average. These are combined as Q(s, a) = V(s) + A(s, a) − mean_{a'} A(s, a'). Prioritised Experience Replay (PER) improves data efficiency by sampling transitions with probability proportional to their absolute TD error — high-error transitions are more informative and should be trained on more frequently. Rainbow DQN (Hessel et al., 2017) combines six independent improvements into a single agent and demonstrates that they are largely complementary, with the combined agent substantially outperforming any individual component.
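The dueling aggregation formula is simple to state in code. A minimal numpy sketch with illustrative value and advantage numbers:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine a scalar value stream V(s) and per-action advantages A(s, a)
    into Q-values. Subtracting the mean advantage makes the decomposition
    identifiable: otherwise V and A are only determined up to a constant."""
    advantages = np.asarray(advantages, dtype=float)
    return value + advantages - advantages.mean()

q = dueling_q(value=3.0, advantages=[1.0, -1.0, 0.0])
print(q)  # [4. 2. 3.]
```

In a real dueling network the two streams are separate heads on a shared torso, and this combination is the final layer of the forward pass.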
Case Study
PPO for Real-Time Bidding in Programmatic Advertising: Lessons from Production
A large digital advertising platform faced a sequential decision-making problem that classical auction theory could not fully address: for each ad impression, a bidding agent must decide how much to bid in a real-time auction, balancing immediate win rate against budget pacing constraints across a campaign flight window of days or weeks. The reward signal was sparse — only generating revenue when a user clicked and converted, which occurred in fewer than 0.5% of won auctions — and the state space included hundreds of features: time remaining in campaign, current spend rate, audience segment signals, historical win rates by publisher, and real-time market price signals.
The team trained a PPO agent using simulated auction environments built from six months of historical bid logs, with domain randomisation over market price distributions to improve generalisation to unseen market conditions. Training was initialised from a supervised policy trained to imitate the manual bidding rules of experienced campaign managers. After three months of shadow deployment, the team observed a 23% improvement in cost per acquisition and an 18% improvement in budget utilisation efficiency in A/B testing. The primary failure mode identified was reward hacking: the agent learned to bid aggressively on impression types with historically high click rates even when those placements had poor conversion rates for the specific product being advertised. Switching from a click-based to a conversion-weighted reward signal resolved this at the cost of sparser reward and slower learning.
PPO
Programmatic Advertising
Reward Design
Policy-Based & Actor-Critic Methods
Value-based methods have structural limitations that motivate a fundamentally different approach. First, they require selecting the greedy action by computing argmax_a Q(s, a) — feasible for discrete action spaces with few actions, but intractable for continuous action spaces (like robot joint torques) or large discrete spaces. Second, Q-function approximation with neural networks is prone to instability: small changes in the Q-function estimate cause large changes in the greedy policy. Third, value-based methods represent only deterministic policies — but some problems have inherently stochastic optimal policies (rock-paper-scissors is the classic example: any deterministic policy is exploitable). Policy gradient methods address all three limitations by directly parameterising and optimising the policy.
Policy Gradient Methods
Policy gradient methods parameterise the policy as a differentiable function π_θ(a | s) and optimise the expected return J(θ) = E_π[G_t] by gradient ascent. The policy gradient theorem provides an analytically tractable expression for ∇_θ J(θ): it equals E_π[∇_θ log π_θ(a | s) · Q^π(s, a)], which can be estimated from sampled trajectories. The log probability gradient ∇_θ log π_θ(a | s) indicates the direction in parameter space that makes action a more probable in state s; multiplying by Q^π(s, a) reinforces actions proportionally to their value. REINFORCE is the simplest implementation: collect a full episode, compute Monte Carlo returns G_t for each timestep, and update θ in the direction of ∇_θ log π_θ(a_t | s_t) · G_t.
The critical practical problem with REINFORCE is high variance: returns G_t involve the sum of all future rewards from timestep t onward, and this sum varies enormously across different trajectories even when starting from the same state. The baseline trick reduces variance without introducing bias: subtract any function b(s) that depends only on state from the return, computing ∇_θ log π_θ(a_t | s_t) · (G_t − b(s_t)). The natural baseline choice is the value function V^π(s_t), which makes the effective gradient signal the advantage A(s_t, a_t) = G_t − V^π(s_t) — how much better than average the taken action was. REINFORCE with baseline is still on-policy: every trajectory must be collected under the current policy, making it extremely sample-inefficient for problems where environment interaction is expensive.
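The REINFORCE-with-baseline update can be sketched in a few lines of numpy. The single-state, two-action problem below is illustrative (rewards are Bernoulli(0.2) and Bernoulli(0.8)); for a softmax policy, the log-probability gradient with respect to the logits is `one_hot(a) - probs`:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)       # policy logits for two actions
baseline, lr = 0.0, 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for t in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                   # sample action from the policy
    r = float(rng.random() < [0.2, 0.8][a])      # Bernoulli reward for that action
    grad_log_pi = -probs                         # grad of log pi(a): one_hot(a) - probs
    grad_log_pi[a] += 1.0
    theta += lr * grad_log_pi * (r - baseline)   # REINFORCE step, advantage = r - b
    baseline += 0.05 * (r - baseline)            # running-average reward baseline

print(softmax(theta))  # probability mass concentrates on the 0.8-reward action
```

With the baseline removed, the same loop still converges but the updates are far noisier — exactly the variance problem the baseline trick addresses.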
A3C, PPO & SAC
Actor-critic methods eliminate the sample efficiency problem of REINFORCE by replacing the Monte Carlo return with a bootstrapped TD estimate from a learned value function (the critic). The actor (policy network) is updated to increase the probability of actions with positive advantage; the critic (value network) is updated to minimise the TD error. This bootstrapping trades some bias for dramatically reduced variance, enabling learning from individual transitions rather than complete episodes. A3C (Asynchronous Advantage Actor-Critic, Mnih et al., 2016) runs multiple independent worker agents in parallel on separate environment instances, each computing gradient updates that are asynchronously applied to a shared global network.
PPO (Proximal Policy Optimisation, Schulman et al., 2017) is the dominant algorithm for RL applications requiring stability and broad applicability. The core innovation is a clipped surrogate objective: define the probability ratio r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t), and clip this ratio to [1−ε, 1+ε] before multiplying by the advantage. This prevents the policy from changing too drastically in a single update. PPO requires no matrix inversions or constrained optimisation, runs efficiently with standard gradient descent, and is remarkably robust across diverse environments. It is the algorithm underlying RLHF for LLM alignment, OpenAI Five's Dota 2 agent, and numerous robotics policies. SAC (Soft Actor-Critic, Haarnoja et al., 2018) takes a different approach for continuous control: maximum entropy RL, where the objective is augmented with an entropy bonus — the agent is rewarded not just for high return but for maintaining a diverse, exploratory policy. SAC is off-policy (uses a replay buffer), has automatic entropy temperature tuning, and is the reference standard for continuous robotic control tasks, outperforming PPO on most robotics benchmarks while being significantly more sample-efficient.
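The clipped surrogate objective is easy to compute directly. A numpy sketch with illustrative advantages and action probabilities:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Per-sample PPO surrogate: mean of min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = np.exp(logp_new - logp_old)           # r_t(theta)
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.minimum(ratio * advantages, clipped * advantages).mean()

adv = np.array([1.0, -1.0])
logp_old = np.log(np.array([0.5, 0.5]))

# Once the ratio leaves [1-eps, 1+eps], further movement gains nothing:
obj_far  = ppo_clip_objective(np.log([0.9, 0.1]), logp_old, adv)  # big policy move
obj_edge = ppo_clip_objective(np.log([0.7, 0.3]), logp_old, adv)  # move to clip edge
print(obj_far, obj_edge)  # both 0.2: the objective is capped at the clip boundary
```

This cap is the whole mechanism: the gradient of the surrogate vanishes for samples whose ratio has already left the trust region, so a single update cannot drag the policy arbitrarily far from the old one.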
RL Algorithms Comparison
The choice of RL algorithm is highly context-dependent. The following table compares the major algorithm families across the dimensions that matter most for practical deployment:
| Algorithm | Type | On/Off-Policy | Continuous Action | Sample Efficiency | Key Innovation |
| --- | --- | --- | --- | --- | --- |
| Q-Learning | Value-based | Off-policy | No (discrete only) | Low-Medium | TD learning, off-policy convergence guarantees |
| SARSA | Value-based | On-policy | No (discrete only) | Low | On-policy TD, more conservative than Q-learning |
| DQN | Value-based (deep) | Off-policy | No (discrete only) | Medium | Experience replay + target network for stability |
| PPO | Policy gradient (actor-critic) | On-policy | Yes | Low-Medium | Clipped surrogate objective for stable on-policy updates |
| SAC | Actor-critic (max-entropy) | Off-policy | Yes | High | Entropy regularisation + automatic temperature tuning |
| TD3 | Actor-critic | Off-policy | Yes | High | Twin Q-networks + delayed policy updates to reduce overestimation |
RL Engineering Patterns for Production
Moving an RL algorithm from a research paper to a production system requires addressing a set of engineering challenges that receive little attention in academic treatments. Reward engineering — crafting a reward function that accurately reflects the intended behaviour without being gameable — is often the most time-consuming phase of an RL project. Reward shaping (adding auxiliary rewards that guide the agent towards the goal without altering the optimal solution) and potential-based reward shaping (adding rewards that are guaranteed not to change the optimal policy) are standard tools for improving sample efficiency in sparse-reward environments. Reward normalisation — scaling rewards to a standard range and normalising by a running estimate of the returns' standard deviation — stabilises training across environments with very different reward scales.
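Potential-based shaping can be sketched directly from its definition F(s, s') = γΦ(s') − Φ(s), which Ng, Harada & Russell (1999) proved leaves the optimal policy unchanged. The potential function below (negative distance to a goal on a 1-D track) is an illustrative assumption:

```python
GOAL, GAMMA = 10, 0.99

def phi(state):
    """Potential function: higher the closer the agent is to the goal."""
    return -abs(GOAL - state)

def shaped_reward(state, next_state, env_reward):
    # Potential-based shaping term: F(s, s') = gamma * phi(s') - phi(s)
    return env_reward + GAMMA * phi(next_state) - phi(state)

# Moving towards the goal earns a positive shaping bonus even while the
# sparse environment reward is still zero; moving away is penalised:
print(shaped_reward(3, 4, env_reward=0.0))  # ~= +1.06
print(shaped_reward(4, 3, env_reward=0.0))  # ~= -0.93
```

The telescoping structure of F means the shaping bonuses accumulated along any trajectory depend only on its endpoints, which is why the optimal policy is provably preserved.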
Curriculum learning — progressively increasing the difficulty of the training environment as the agent improves — is essential for learning complex behaviours that are nearly impossible to discover through random exploration from scratch. OpenAI's training of the Rubik's cube solving dexterous hand used an automated domain randomisation curriculum: starting with a cube that varied little from the target configuration and gradually increasing the randomisation range until the agent could handle arbitrary starting configurations. For language model post-training, PPO-based RLHF uses the SFT model as the initialisation — the agent already knows how to produce coherent language, so RL fine-tuning needs only to adjust the distribution towards preferred outputs rather than learning to generate text from scratch.
Key Insight: The most common failure mode in applied RL is not choosing the wrong algorithm — it is a poorly specified reward function. Invest at least as much time in reward design, stress-testing, and red-teaming as you invest in algorithm selection and hyperparameter tuning. Before training, manually simulate what behaviour an adversarially clever agent would adopt to maximise your reward function, and verify that this matches the behaviour you actually want.
Environment Design & Simulation Infrastructure
The simulation environment is as important as the RL algorithm. A high-fidelity simulator enables rapid iteration, safe exploration of dangerous states, and the generation of diverse training scenarios that may be rare or impossible to collect in the real world. The requirements for a production RL simulation environment are: step time under 1ms (to enable millions of environment interactions per hour on available compute), accurate physics and dynamics, support for domain randomisation over physical and visual parameters, multi-process parallelisation for vectorised environment stepping, and a deterministic replay capability for debugging. Gymnasium (the maintained successor to OpenAI Gym) provides a standardised interface that most RL libraries support; Isaac Gym and MuJoCo are the standard choices for physics-based robotics simulation; SUMO and CARLA are used for traffic and autonomous vehicle simulation; custom-built simulators are necessary for finance, healthcare, and industrial control applications.
The sim-to-real gap — the inevitable discrepancy between simulated and real-world physics — requires specific mitigation. Domain randomisation trains the agent over a distribution of simulated parameters (friction, mass, sensor noise, visual appearance) rather than a single setting, forcing it to develop robust strategies that work across the range. Adaptive domain randomisation (ADR) automatically adjusts the difficulty of the randomisation as the agent's capability improves. Real-to-sim adaptation uses real-world data to calibrate simulator parameters, reducing the gap at the cost of ongoing calibration effort. System identification — estimating the real system's parameters from observation — enables the simulator to match the specific physical system being controlled, rather than a generic model.
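A minimal domain randomisation loop looks like the following. The parameter names and ranges are illustrative, not taken from any particular simulator:

```python
import random

# Each training episode samples physics and sensor parameters from ranges.
# A policy trained across the whole distribution must work for any setting
# inside it, which is what improves sim-to-real transfer.
RANDOMISATION_RANGES = {
    "friction":     (0.5, 1.5),    # scale factor on nominal friction
    "mass":         (0.8, 1.2),    # scale factor on nominal link masses
    "sensor_noise": (0.0, 0.02),   # std of Gaussian observation noise
    "latency_ms":   (0.0, 40.0),   # actuation delay
}

def sample_episode_params(rng):
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in RANDOMISATION_RANGES.items()}

rng = random.Random(42)
for episode in range(3):
    params = sample_episode_params(rng)
    # env.reset(**params); run the episode under these dynamics ...
    print({k: round(v, 3) for k, v in params.items()})
```

Adaptive domain randomisation replaces the fixed ranges above with ranges that widen automatically as the agent's success rate under the current distribution improves.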
RLHF & Alignment
Reinforcement Learning from Human Feedback (RLHF) is the technique that transformed large language models from impressive text predictors into genuinely useful assistants — and it is arguably the most consequential real-world application of RL to date. The core problem is that we want LLMs to produce outputs that are helpful, harmless, and honest, but these properties cannot be fully specified as a mathematical objective function. We can recognise a good response when we see one, but encoding that judgement as a loss function is beyond current capability. RLHF's insight is to learn the objective from human preferences rather than specifying it analytically, then use RL to optimise the language model against the learned objective.
The RLHF pipeline has three stages. First, Supervised Fine-Tuning (SFT): fine-tune a pre-trained LLM on a curated dataset of demonstration responses to representative prompts. Second, reward model training: collect human preference data comparing pairs of model outputs, and train a reward model to predict which response a human would prefer. Third, RL fine-tuning: use PPO to optimise the SFT model's response policy to maximise the reward model's score, subject to a KL penalty that prevents the policy from drifting too far from the SFT reference model. The InstructGPT paper (Ouyang et al., OpenAI, 2022) demonstrated that a 1.3B parameter RLHF model was preferred by human evaluators over a 175B parameter GPT-3 model on instruction-following tasks — the quality of alignment substantially outweighed raw scale.
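The KL-penalised reward used in stage three can be sketched with a naive single-sample KL estimator (log π − log π_ref); the reward-model scores and log-probabilities below are illustrative numbers, not real model outputs:

```python
def rlhf_reward(rm_score, logp_policy, logp_ref, beta=0.02):
    """Stage-3 RLHF reward: r_phi(x, y) - beta * KL[pi_theta || pi_ref],
    with the KL estimated from a single sample as log pi - log pi_ref."""
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# A response the policy assigns far more probability than the SFT reference
# does: the reward-model score is largely eaten by the KL penalty.
print(rlhf_reward(rm_score=2.0, logp_policy=-50.0, logp_ref=-120.0))   # 0.6
# A response close to the reference distribution keeps almost all its score.
print(rlhf_reward(rm_score=2.0, logp_policy=-118.0, logp_ref=-120.0))  # 1.96
```

The coefficient β sets the exchange rate between reward-model score and divergence from the SFT model, which is why tuning it is one of the most consequential hyperparameter choices in the pipeline.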
Reward Modelling
The reward model is trained on human preference comparisons. Given a prompt x, two model-generated responses (y_w, y_l) are presented to a human annotator who indicates which they prefer. The reward model r_φ(x, y) is trained using the Bradley-Terry model of pairwise preference: P(y_w preferred over y_l) = sigmoid(r_φ(x, y_w) − r_φ(x, y_l)). Annotation quality is the binding constraint: annotators must be given clear, consistent guidelines covering what "helpful" means, what "harmless" means, and how to handle ambiguous cases. The distributional challenge is severe: the reward model is trained on outputs from the SFT model, but during RL fine-tuning the policy drifts away from the SFT distribution. Constitutional AI (Anthropic) and RLAIF (RL from AI Feedback) address the scaling bottleneck by using a capable LLM — rather than human annotators — to generate preference labels at scale, guided by a constitution of principles.
RLHF Reward Model: Code Sketch
The following sketch shows how the RLHF reward model is structured and trained on preference pairs. In production, the encoder would typically be a fine-tuned version of the same base LLM being aligned, not a separate smaller model:
from transformers import AutoModelForSequenceClassification
import torch
import torch.nn as nn

# Reward model: learns to score LLM outputs based on human preferences
class RewardModel(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased"):
        super().__init__()
        self.encoder = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=1  # scalar reward score
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return outputs.logits.squeeze(-1)  # scalar reward per sequence

# Training on preference pairs (chosen > rejected).
# Human annotators rate pairs: which LLM response is better?
# The reward model learns r(chosen) > r(rejected) via the
# Bradley-Terry loss for pairwise preferences:
def preference_loss(reward_chosen, reward_rejected):
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

# After training, the reward model guides PPO fine-tuning of the base LLM:
# maximize E[r(response)] while staying close to the reference policy (KL divergence penalty)
Reward Hacking Warning: Reward hacking is the central failure mode of RLHF: given sufficient optimisation pressure, the language model learns to produce outputs that score highly on the reward model without being genuinely helpful. Because the reward model is itself a learned approximation of human preferences — trained on a finite dataset of comparisons and susceptible to distributional shift — it can be gamed. Common manifestations include responses that are excessively verbose (longer responses are often rated higher by annotators regardless of content), responses that hedge excessively to avoid appearing incorrect, and responses that flatter the user. Monitoring the KL divergence from the SFT reference model and setting a hard constraint on maximum divergence is the standard mitigation.
PPO for LLM Alignment
In the RLHF framework, the language model is the policy: given a prompt x, it generates a response y token by token, with each token selection being an "action" in a high-dimensional discrete action space (the vocabulary, typically 50,000–100,000 tokens). The reward is computed by the reward model on the complete (x, y) pair — it is sparse by design, arriving only at the end of the response. A KL penalty is added to the reward signal: total_reward = r_φ(x, y) − β · KL[π_θ(y | x) || π_ref(y | x)], where π_ref is the frozen SFT model. The KL term penalises the policy for diverging from the SFT reference, preventing reward hacking by limiting how far the model can move in response space. The coefficient β is a hyperparameter controlling the strength of this constraint. Direct Preference Optimisation (DPO, Rafailov et al., 2023) offers a simpler alternative: it analytically marginalises out the reward model, deriving a loss function that directly optimises the policy on preference pairs without a separate reward model or RL training loop. DPO requires no reward model inference at training time and avoids the instabilities of RL, making it significantly cheaper to implement. Empirically, DPO performs comparably to PPO on many benchmarks, and it has become the method of choice for instruction-tuning at organisations without the infrastructure for full RLHF pipelines.
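The DPO loss can be written down directly from the closed form in Rafailov et al. (2023): −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The log-probabilities in this sketch are illustrative numbers, not real model outputs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair (w = chosen, l = rejected)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid

# Policy already prefers the chosen response relative to the reference: low loss.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
# Policy prefers the rejected response: higher loss pushes it back.
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))
```

Note that no reward model appears anywhere: the implicit reward is β times the policy's log-probability ratio against the reference, which is exactly what the analytical marginalisation described above buys.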
Multi-Armed Bandits: Thompson Sampling
Multi-armed bandits are the simplest non-trivial sequential decision-making problem, and they are deployed at massive scale in recommendation systems, A/B testing, content ranking, and ad serving. Thompson sampling is the Bayesian approach to the bandit problem: it maintains a probability distribution over the success rate of each arm, samples one estimate from each distribution, and selects the arm with the highest sampled value. Over time, the distributions concentrate on the true success rates, and the arm selection naturally converges to the optimal arm. Thompson sampling consistently outperforms epsilon-greedy across virtually all empirical evaluations, with similar computational cost and substantially better regret properties.
import numpy as np
from scipy import stats

class ThompsonSampling:
    """Beta-Bernoulli Thompson Sampling for content ranking / A/B testing."""

    def __init__(self, n_arms):
        self.alpha = np.ones(n_arms)  # successes + 1 (Beta prior)
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        # Sample a conversion probability from each arm's Beta posterior
        theta = [stats.beta.rvs(self.alpha[i], self.beta[i])
                 for i in range(len(self.alpha))]
        return int(np.argmax(theta))  # select arm with highest sampled rate

    def update(self, arm, reward):
        # reward = 1 (click/conversion) or 0
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Simulate 3 content variants with true CTRs of 5%, 8%, and 12%
true_ctrs = [0.05, 0.08, 0.12]
bandit = ThompsonSampling(n_arms=3)
regret = []
for t in range(10_000):
    arm = bandit.select_arm()
    reward = np.random.binomial(1, true_ctrs[arm])
    bandit.update(arm, reward)
    regret.append(max(true_ctrs) - true_ctrs[arm])
print(f"Cumulative regret after 10K steps: {sum(regret):.1f}")
# Thompson sampling typically converges to the optimal arm (index 2) within
# a few thousand steps, so cumulative regret stays small (well under 100 in
# most runs). A uniform A/B split over the same 10K steps accumulates ~367
# expected regret: (0.07 + 0.04 + 0) / 3 per step.
Key Insight: Thompson sampling accumulates roughly 7× less regret than a traditional A/B test that holds all variants equally until reaching statistical significance. This translates directly to business value: for a recommendation system deciding which content to surface, Thompson sampling converges to the best variant while losing far less revenue to inferior options during the exploration phase. Netflix, LinkedIn, and Spotify all use bandit algorithms — typically Thompson sampling or UCB variants — for content ranking and A/B test management at scale.
Offline RL, Safe Exploration & Practical Deployment
Standard on-policy and off-policy RL algorithms require the agent to interact with the environment during training. In many high-value real-world applications — healthcare treatment policies, industrial control, autonomous vehicles — this exploration is prohibitively expensive or dangerous. Offline RL (also called batch RL) addresses this by learning entirely from a fixed dataset of previously collected transitions, without any new environment interaction. The challenge is distribution shift: the learned policy may take actions that are rare or absent in the training data, leading to Q-value overestimation for out-of-distribution (OOD) actions because the value function has no corrective signal from actually executing those actions.
Conservative Q-Learning (CQL) addresses OOD overestimation by adding a regularisation term to the standard Q-learning objective that minimises Q-values on unseen actions and maximises Q-values on actions in the dataset. The result is a Q-function that is pessimistic about OOD actions — the policy cannot exploit actions the dataset does not support. TD3+BC (Behaviour Cloning) adds a behavioural cloning regulariser that penalises the policy for deviating from the dataset's action distribution. IQL (Implicit Q-Learning) avoids querying Q-values on OOD actions entirely by reframing the Bellman backup to use only actions from the dataset. These methods have demonstrated impressive results on D4RL benchmark datasets — a standardised collection of offline RL datasets for locomotion, manipulation, and other domains — and are seeing increasing adoption in healthcare (learning treatment policies from electronic health records without exposing patients to experimental treatments) and finance (learning trading policies from historical market data without live order execution risk).
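The conservative regulariser at the heart of CQL can be sketched for the discrete-action case. This is an illustrative fragment, not the full algorithm: it shows only the penalty term that is added to the standard TD loss, and all names are ours.

```python
import numpy as np

def cql_penalty(q_values, data_action, alpha=1.0):
    """CQL regulariser for a single state with discrete actions.

    q_values: array of Q(s, a) for every action a.
    data_action: index of the action actually present in the offline dataset.
    Minimising this term pushes down a soft maximum of Q over ALL actions
    while pushing up Q on the dataset action, so out-of-distribution actions
    end up with pessimistic value estimates."""
    soft_max_q = np.log(np.sum(np.exp(q_values)))  # logsumexp over actions
    return alpha * (soft_max_q - q_values[data_action])

# The penalty is largest when the Q-function assigns high value to actions
# the dataset never took -- exactly the OOD overestimation CQL targets.
penalty = cql_penalty(np.array([1.0, 3.0, 2.0]), data_action=1)
```

In training, this term is added to the ordinary Bellman error, and α trades off pessimism against fitting the dataset's returns.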
Key Insight: Offline RL enables learning from the vast quantities of historical decision data that organisations already possess — clinical records, trading logs, recommendation interaction logs, industrial sensor data — without requiring an interactive training phase. The fundamental limitation is that an offline policy cannot outperform the data it was trained on in terms of novel action discovery. Offline RL learns to be as good as the best historically demonstrated behaviour, not to discover radically novel strategies.
Safe RL & Constrained Optimisation
Safety in RL refers to the guarantee that the agent will not take actions that violate hard or soft constraints during deployment, even while exploring or improving its policy. The Constrained MDP (CMDP) framework formalises this: the objective is to maximise expected return subject to a constraint that the expected total cost C(s, a) of the policy over an episode does not exceed a threshold d. Constrained Policy Optimisation (CPO) extends trust-region policy optimisation to CMDPs, providing theoretical guarantees that each policy update is constraint-satisfying. Safe sets and Lyapunov-based safety critics define regions of state space from which the agent can always safely return to a known good state, providing runtime safety guarantees that are independent of policy quality.
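A common practical recipe for CMDPs, simpler than full CPO, is Lagrangian relaxation: the policy maximises return minus λ times the constraint violation, while λ itself is adapted by dual ascent. A minimal sketch of the multiplier update (names and learning rate are illustrative):

```python
def dual_ascent_step(lam, avg_cost, cost_limit, lr=0.01):
    """One dual-ascent update of the Lagrange multiplier in a CMDP.

    The policy maximises E[return] - lam * E[cost]; this update raises lam
    while the constraint E[cost] <= cost_limit is violated and lowers it
    (never below zero) once the policy operates within the limit."""
    return max(0.0, lam + lr * (avg_cost - cost_limit))

lam = 1.0
lam_up = dual_ascent_step(lam, avg_cost=2.0, cost_limit=1.0)    # violation -> lam grows
lam_down = dual_ascent_step(lam, avg_cost=0.5, cost_limit=1.0)  # satisfied -> lam shrinks
```

The effect is an automatically tuned penalty: constraint violations make safety progressively more expensive to the policy, while a comfortably safe policy is freed to optimise return.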
In practice, production RL systems rarely deploy pure learned policies without safety wrappers. The dominant pattern is layered safety: a rule-based safety layer (implemented with deterministic conditional logic and validated by domain engineers) has absolute veto power over any action proposed by the RL policy. For data centre cooling, this means the RL agent cannot push any actuator outside predefined thermal and mechanical safety bands. For autonomous vehicle control, a rule-based driver model overrides the RL policy in any safety-critical scenario detected by classical perception. The RL policy operates within a certified-safe envelope defined by the domain experts, learning to optimise within those constraints rather than discovering unsafe strategies through unrestricted exploration. Shadow deployment — running the RL policy in parallel with the production rule-based system without executing its actions — is the standard validation step before any live action execution begins.
Case Study
Offline RL for Sepsis Treatment Policies: From ICU Data to Clinical Decision Support
Sepsis is a life-threatening medical emergency where treatment decisions — particularly around vasopressor doses and intravenous fluid volumes — must be made rapidly, often under significant uncertainty. Komorowski et al. (2018) applied offline RL to learn vasopressor and IV fluid dosing policies from a dataset of 17,083 ICU patients, using a retrospective analysis of the MIMIC-III clinical database. The RL agent, trained with a modified fitted Q-iteration algorithm, was evaluated against the documented clinical decisions for a held-out test cohort. The analysis found that patients whose treatments matched the RL policy's recommendations had substantially lower 90-day mortality than those who received doses significantly different from the policy. Critically, the policy was never prospectively tested in a live clinical trial — all evaluation was retrospective. The team was explicit that their work provided a decision support signal for clinicians, not an autonomous treatment agent. The paper catalysed significant investment in offline clinical RL research and contributed to ongoing regulatory discussions about AI-assisted treatment protocols. It also demonstrates the methodological difficulty: because sicker patients receive more aggressive treatment, causal identification of treatment effects from observational data requires careful confounding adjustment — a challenge that pure RL methods do not inherently address.
Offline RL
Clinical Decision Support
MIMIC-III
Model-Based RL & World Models
Model-free RL methods — Q-learning, PPO, SAC — learn a policy or value function directly from experience without explicitly modelling the environment's dynamics. Model-based RL methods learn a world model P̂(s' | s, a) and use it for planning, data augmentation, or policy gradient computation through the model. The sample efficiency advantage of model-based methods can be dramatic: Dreamer (Hafner et al., Google Brain) and its successors learn world models in latent space using recurrent neural networks, plan entirely within the latent world model (collecting hundreds of imagined trajectories for every real environment step), and match strong model-free performance on the DeepMind Control Suite with 50–100× fewer real environment interactions. World models have also entered the autonomous vehicle space: GAIA-1 (Wayve) and similar systems learn video prediction models from real driving data that can be used for counterfactual scenario generation, training, and evaluation of driving policies.
The fundamental limitation of model-based RL is compounding model error: errors in the learned dynamics model accumulate over a long planning horizon, and the policy may overfit to the model's idiosyncrasies rather than the real environment. Dyna-style algorithms address this by interleaving model-generated and real transitions during training rather than planning purely in the model — gaining sample efficiency while retaining robustness to model inaccuracies. The practical guidance for practitioners is: use model-free methods (PPO, SAC) when environment interaction is cheap and you want a robust, well-tested baseline; invest in model-based methods when environment interaction is expensive or dangerous and you have sufficient data and engineering capacity to validate the world model's accuracy.
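The Dyna idea is easy to state in code. Below is an illustrative tabular Dyna-Q update for a deterministic environment: each real transition triggers one ordinary Q-learning backup plus several imagined backups replayed from the learned model. All names are ours, not from a specific library.

```python
import random
from collections import defaultdict

def dyna_q_update(Q, model, s, a, r, s_next, actions,
                  alpha=0.1, gamma=0.95, n_planning=10):
    """One Dyna-Q step for a deterministic environment.

    Q: dict mapping state -> {action: value}.
    model: dict mapping (state, action) -> (reward, next_state),
    learned from observed transitions."""
    # 1. Direct RL: standard Q-learning backup on the real transition
    best_next = max(Q[s_next][b] for b in actions)
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    # 2. Model learning: remember the observed transition
    model[(s, a)] = (r, s_next)
    # 3. Planning: replay random remembered transitions through the same backup
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        best = max(Q[ps_next][b] for b in actions)
        Q[ps][pa] += alpha * (pr + gamma * best - Q[ps][pa])

# Toy 3-state chain: 0 -> 1 -> 2, reward 1 only on reaching state 2.
Q = defaultdict(lambda: defaultdict(float))
model = {}
for _ in range(50):
    dyna_q_update(Q, model, s=0, a=1, r=0.0, s_next=1, actions=[0, 1])
    dyna_q_update(Q, model, s=1, a=1, r=1.0, s_next=2, actions=[0, 1])
```

Because the planning replays reuse stored transitions many times per real step, value information propagates back to state 0 far faster than real experience alone would allow — the sample-efficiency gain, bought at the cost of trusting the learned model.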
Real-World RL Applications
Beyond the benchmark environments of Atari and board games, RL has been deployed in production across a surprisingly broad range of high-value domains. The common thread is that these problems involve sequential decision-making under uncertainty, with objectives that are difficult to specify as differentiable supervised learning loss functions but straightforward to evaluate through interaction.
Finance & Trading: Algorithmic trading systems face a classic RL formulation — given a market state (prices, volumes, order book depth, macro indicators), take actions (buy, sell, hold, adjust position size) to maximise risk-adjusted return over a finite investment horizon. Deep RL policies for portfolio management have demonstrated statistically significant outperformance of rule-based benchmarks in historical backtesting, particularly in mean-reversion and momentum strategies where the state-action relationship is non-linear. The critical challenge is non-stationarity: financial markets are adversarial — other participants adapt their strategies in response to observed order flow, eroding any predictable alpha signal. Models trained on historical data from 2019–2022 may not generalise to 2024 market microstructure following regulatory changes or shifts in participant composition. Risk management is therefore paramount: live paper trading before real capital deployment, ensemble policies to manage model risk, and strict position limits enforced by the execution infrastructure are standard measures.
Healthcare Treatment Optimisation: Dynamic Treatment Regimes (DTRs) formalise medical treatment as a sequential decision process: at each clinical visit, the clinician observes patient state (lab values, symptoms, biomarkers) and selects a treatment action (dosage, medication, intervention), receiving delayed outcome feedback (survival, remission, hospitalisation). Offline RL from electronic health records is the primary approach, given the impossibility of random action exploration in clinical settings. Published applications include HIV treatment regimens, diabetes management (insulin dosing from continuous glucose monitor data), and sepsis management (the Komorowski et al. study cited elsewhere in this article). The methodological challenges are substantial: confounding (sicker patients receive more aggressive treatment, creating spurious action-outcome correlations), irregular observation timing, and the impossibility of counterfactual verification (what would have happened under a different treatment strategy cannot be observed in historical data). Causal inference methods — inverse probability weighting, doubly robust estimators, and structural causal models — are integrated with RL to address these challenges in clinical applications.
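The simplest of these corrections, inverse probability weighting, can be sketched for off-policy evaluation of a target policy from logged decisions. This is illustrative only — real clinical applications require careful propensity modelling and overlap diagnostics — and all names are ours:

```python
import numpy as np

def ipw_value(outcomes, logged_actions, target_actions, propensities):
    """Inverse-probability-weighted estimate of a target policy's value.

    outcomes: observed outcome for each logged decision.
    logged_actions / target_actions: what the clinician did vs. what the
    candidate policy would do in the same state.
    propensities: estimated probability that the logging policy took the
    logged action. Decisions where the actions disagree contribute zero;
    agreements are up-weighted by 1 / propensity."""
    outcomes = np.asarray(outcomes, dtype=float)
    agree = np.asarray(logged_actions) == np.asarray(target_actions)
    weights = agree / np.asarray(propensities, dtype=float)
    return float(np.mean(weights * outcomes))
```

The reweighting corrects for the fact that sicker patients were treated differently, but only to the extent that the propensity model captures the true treatment-assignment process — unmeasured confounding remains out of reach.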
Data centre cooling (DeepMind/Google): DeepMind's RL system, first deployed in Google's data centres in 2016, controls hundreds of cooling system actuators to minimise power usage effectiveness (PUE). The agent observes sensor readings (temperatures, pressures, pump speeds, fan speeds) and produces control recommendations. After years of operation with human oversight, Google transitioned to fully autonomous control in 2018. The reported savings are a 30–40% reduction in cooling energy, translating to hundreds of millions of dollars per year across Google's data centre fleet. The key safety mechanism is a conservative constraint set: the RL agent cannot push any sensor reading outside predefined safe ranges, and a rule-based safety system acts as a hardware-level override.
Robotics (locomotion and manipulation): RL has produced remarkable robotics results in simulation — OpenAI's dexterous hand solving a Rubik's cube with domain randomisation, Boston Dynamics' reinforcement-learned locomotion controllers, and more recently Agility Robotics and Figure AI training humanoid manipulation policies. The sim-to-real gap — the discrepancy between simulated physics and real-world dynamics — is the central challenge. Domain randomisation (training across a wide range of simulated physical parameters: friction coefficients, mass, damping, sensor noise) and curriculum learning (progressively increasing task difficulty during training) are the standard approaches.

Chip floorplanning (Google AlphaChip): The placement of logic blocks on a semiconductor chip is a combinatorial optimisation problem. AlphaChip demonstrated that the trained agent could produce floorplans competitive with or superior to experienced human engineers in a fraction of the time, accelerating chip design cycles.

Personalised education: Duolingo uses contextual bandits for lesson sequencing, balancing review of material the user is likely to forget against introduction of new content, with the optimisation target being long-term retention rather than session engagement.
RL Deployments: Industry Survey
The following table summarises where RL has been successfully deployed, the algorithms used, and the results achieved. These examples demonstrate both the breadth of applicability and the importance of reward design and safety mechanisms in production:
| Domain | Company / System | RL Algorithm | Reward Signal | Achieved Result |
|---|---|---|---|---|
| Game Playing | DeepMind AlphaGo / AlphaZero | MCTS + Policy/Value Network | Win/loss signal | Superhuman at Go, Chess, Shogi |
| Recommendation | YouTube / TikTok / LinkedIn | Bandits + Contextual RL | Long-term engagement, retention | Drives 70%+ of watch time; rapid A/B convergence |
| HVAC Optimisation | Google DeepMind data centres | Model-based RL + safety constraints | Power Usage Effectiveness (PUE) | 30–40% cooling energy reduction |
| Drug Discovery | Insilico Medicine / Recursion | Goal-conditioned RL + generative models | Predicted binding affinity, toxicity | Novel drug candidates at 10× speed vs. traditional |
| Robotics | OpenAI Dactyl / Boston Dynamics | PPO + domain randomisation | Task completion, manipulation accuracy | Dexterous in-hand manipulation; commercial locomotion |
| LLM Alignment | OpenAI InstructGPT / Anthropic Claude | PPO with KL penalty (RLHF) | Human preference judgements | 1.3B RLHF model preferred over 175B GPT-3 baseline |
Goodhart's Law & Reward Hacking: "When a measure becomes a target, it ceases to be a good measure." This is Goodhart's Law, and it is the central challenge of reward design in RL. An RL agent optimised sufficiently hard against any reward function will find exploits: a robot rewarded for moving fast may flip over and spin its wheels; a game-playing agent rewarded for score may find an unintended exploit that never ends the game; an LLM rewarded by human raters may learn to be confidently wrong in fluent prose. The reward function must capture the true objective, not a proxy — and any proxy, however carefully designed, will eventually be gamed by a sufficiently capable optimiser. Iterative reward design with red-teaming and diverse evaluation is not optional.
Hyperparameter Tuning & Debugging RL
RL is notoriously sensitive to hyperparameters. Small changes to learning rate, discount factor, entropy coefficient, or replay buffer size can cause a training run to succeed or fail completely, with the failure mode often being silent: the agent appears to train (losses decrease, metrics update) but converges to a suboptimal or degenerate policy. Established debugging protocols help: always benchmark against a simple baseline (random agent, rule-based heuristic) to establish a floor; verify the reward signal is being received correctly by inspecting per-episode reward histories; plot the Q-value or value function estimates and verify they are finite, stable, and increasing during learning; visualise policy rollouts qualitatively to check whether the agent's behaviour makes intuitive sense.
The discount factor γ is arguably the most influential hyperparameter. Typical defaults are γ=0.99 for long-horizon tasks (hundreds of timesteps), γ=0.95 for medium-horizon tasks, and γ=0.9 for short-horizon tasks. Too high a discount in a short-horizon task causes the agent to optimise for a longer horizon than the task structure supports, producing unnecessarily conservative behaviour. The entropy coefficient in maximum-entropy RL (SAC) and in PPO's entropy bonus controls exploration: high entropy encourages broad exploration but can prevent the policy from committing to deterministic optimal behaviour in late training. Learning rate warm-up (gradually increasing the learning rate from zero over the first few thousand steps) dramatically stabilises early training in large actor-critic models. Gradient clipping (clipping gradient norms to a maximum value, typically 0.5) prevents catastrophic parameter updates from single anomalous batches.
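A useful rule of thumb connecting those defaults: a discount factor γ implies an effective planning horizon of roughly 1/(1−γ) steps, beyond which future rewards are heavily down-weighted. A quick check:

```python
def effective_horizon(gamma):
    """Approximate number of future steps that meaningfully influence the
    discounted return under discount factor gamma: roughly 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)

# gamma = 0.99 -> ~100 steps; 0.95 -> ~20 steps; 0.9 -> ~10 steps,
# matching the long- / medium- / short-horizon guidance above.
```

Matching this implied horizon to the actual episode length is usually a better starting point than sweeping γ blindly.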
Seed sensitivity is a well-documented problem in RL research: the same algorithm with the same hyperparameters but different random seeds can produce dramatically different learning curves, and the best seed from a small sweep may overfit to properties of that particular random initialisation rather than reflecting the algorithm's general performance. Responsible RL evaluation requires reporting results across at least 5–10 independent seeds with confidence intervals. The RL community has adopted standard benchmark suites — Atari-100K (100,000 environment steps), DeepMind Control Suite, D4RL for offline RL — that provide consistent experimental conditions for reproducible comparison.
Multi-Agent RL & Emergent Behaviour
Multi-agent RL (MARL) studies environments with multiple learning agents that may cooperate, compete, or coexist indifferently. Unlike single-agent RL, where the environment is stationary from the agent's perspective, multi-agent environments are non-stationary: as all agents learn simultaneously, each agent's best response changes because the other agents are changing. This causes the "moving target" problem and can produce training instability or cycling rather than convergence. OpenAI's multi-agent competitive hide-and-seek environment demonstrated one of the most striking results in MARL: agents spontaneously invented tool use (using movable boxes as ramps and barricades) without being trained or rewarded for tool use specifically — it emerged from the competitive pressure of the hide-and-seek game through self-play. AlphaStar (DeepMind's StarCraft II agent) and OpenAI Five (Dota 2) trained through massive-scale self-play, each agent improving by playing millions of games against previous versions of itself, producing strategies that surprised professional human players. League training — maintaining a diverse population of agents with different strategies during self-play — prevents the entire training population from converging to a single strategy and being exploited by a single counter-strategy.
In industry, MARL is deployed in multi-robot warehouse coordination, traffic light control across networks of intersections, and multi-player game AI. The principal challenge beyond algorithmic instability is reward assignment in cooperative settings: when multiple agents jointly achieve a goal, how should the reward be distributed among them? Global reward sharing encourages cooperation but makes individual credit assignment difficult. Difference rewards (each agent receives the marginal contribution of its action to the team reward) provide cleaner individual gradients but are more complex to compute.
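Difference rewards are simple to express in code: agent i receives the team reward minus a counterfactual team reward with its own action replaced by a default (null) action. The sketch below uses an invented toy team-reward function for illustration:

```python
def difference_reward(team_reward, joint_actions, i, default_action=None):
    """Difference reward for agent i: its marginal contribution, computed as
    the team reward minus the counterfactual reward with agent i's action
    replaced by a default / null action."""
    counterfactual = list(joint_actions)
    counterfactual[i] = default_action
    return team_reward(joint_actions) - team_reward(counterfactual)

# Toy cooperative task: team reward = number of distinct targets covered
# (None means "do nothing", covering no target).
coverage = lambda actions: len({a for a in actions if a is not None})
d0 = difference_reward(coverage, [1, 2, 2], i=0)  # agent 0 covers a unique target
d2 = difference_reward(coverage, [1, 2, 2], i=2)  # agent 2 duplicates agent 1
```

Here agent 0 receives credit for its unique contribution while agent 2, whose action was redundant, receives zero — exactly the cleaner individual gradient the text describes, at the cost of an extra team-reward evaluation per agent.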
RL Deployment Challenges & MLOps
Deploying an RL policy to production presents challenges that differ fundamentally from deploying a supervised learning model. A trained neural network classifier is a static function: given the same input, it always returns the same output, and the performance on a fixed test set fully characterises its behaviour. A deployed RL policy is a dynamic controller: it may interact with a live environment that changes over time, and its performance depends on the distribution of states it encounters, which in turn depends on its own past actions. This coupling between the policy and the environment distribution makes deployment significantly harder to validate and monitor.
Shadow Deployment & Staged Rollout
Shadow deployment is the standard pre-live validation technique for RL policies. The RL policy runs in parallel with the incumbent control system, observing the same state inputs and producing action recommendations, but its recommendations are not executed — only the incumbent system's actions take effect. This allows direct comparison of the RL policy's proposed actions against the incumbent's actions on real-world state distributions, without any live risk. Automated metrics compare the distributions of recommended actions, predicted values, and any available counterfactual outcome estimates. Human review examines a sampled subset of cases where the RL policy and the incumbent system disagree significantly, providing qualitative validation that the RL policy's departures from the incumbent represent genuine improvements rather than exploits of measurement gaps.
Staged rollout progressively increases the fraction of live traffic handled by the RL policy: start at 1%, hold for a monitoring window, review metrics, advance to 5%, 25%, 50%, 100%. At each stage, automated monitoring checks for anomalies: performance metric deviations from expected range, unexpected action distributions (e.g., the RL policy consistently choosing an extreme action category that the incumbent never used), safety constraint violations, and downstream outcome metrics. Automated circuit breakers — pre-specified thresholds at which the RL policy is automatically disabled and the incumbent system restored — are essential operational safeguards. For high-stakes deployments (medical devices, industrial control), a parallel manual override capability where human operators can intervene on any individual decision is a regulatory requirement in many jurisdictions.
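The monitoring logic behind such a rollout can be as simple as a table of pre-agreed bounds checked on every evaluation window. A minimal, illustrative sketch — the metric names, thresholds, and stage fractions below are invented for the example:

```python
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]  # fraction of live traffic per stage

def check_circuit_breaker(window_metrics, limits):
    """Return the list of breached limits for one monitoring window.

    limits maps metric name -> (min_ok, max_ok). Any breach means the RL
    policy is automatically disabled and the incumbent controller restored."""
    breaches = []
    for name, (lo, hi) in limits.items():
        value = window_metrics[name]
        if not (lo <= value <= hi):
            breaches.append(name)
    return breaches

limits = {
    "reward_per_episode": (0.8, float("inf")),  # no worse than 80% of incumbent
    "constraint_violations": (0, 0),            # zero tolerance
    "extreme_action_rate": (0.0, 0.05),         # rare actions must stay rare
}
ok = check_circuit_breaker(
    {"reward_per_episode": 1.1, "constraint_violations": 0, "extreme_action_rate": 0.01},
    limits)
tripped = check_circuit_breaker(
    {"reward_per_episode": 1.0, "constraint_violations": 2, "extreme_action_rate": 0.01},
    limits)
```

The essential property is that the thresholds are fixed and reviewed before the rollout begins, so disabling the policy is a mechanical decision rather than a judgment call made under pressure.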
Continual Learning & Policy Maintenance
Unlike a supervised model trained once on a fixed dataset, a production RL policy may need to adapt continuously as the environment changes. In a data centre, new hardware and cooling systems change the physical dynamics. In a recommendation system, the user base and catalogue evolve. In an industrial process, tool wear, raw material variation, and production line reconfigurations alter the reward landscape. Continual RL methods enable the policy to adapt to distribution shift without catastrophic forgetting — the tendency of gradient-based optimisers to overwrite previously learned behaviour when trained on new data. Elastic Weight Consolidation (EWC) penalises changes to weights that are important for previously learned tasks. Experience replay with a small buffer of historical transitions prevents complete forgetting by regularly revisiting past experiences. In production, the most practical approach is often periodic retraining on a rolling window of recent interaction data, with the previous policy providing a warm start, and A/B testing deployed to validate the new policy before full production rollout.
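The EWC penalty itself is a one-liner: a quadratic anchor on the parameters, weighted by their estimated Fisher information from the previous task. An illustrative numpy sketch (names are ours):

```python
import numpy as np

def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Elastic Weight Consolidation regulariser added to the new-task loss.

    Weights with high Fisher information (important for previously learned
    behaviour) are expensive to move; unimportant weights stay free to adapt."""
    return 0.5 * lam * float(np.sum(fisher * (params - anchor_params) ** 2))

p = np.array([1.0, 2.0])
anchor = np.array([0.0, 0.0])
# Moving the important weight (fisher = 4.0) dominates the penalty even
# though the unimportant weight (fisher = 0.1) moved twice as far.
cost = ewc_penalty(p, anchor, fisher=np.array([4.0, 0.1]))
```

During continual updates, this term is simply added to the new-task loss, steering gradient descent away from overwriting the behaviours the old Fisher estimates mark as important.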
Exercises
These exercises progress from basic environment interaction through reward function design to multi-armed bandit experiments. Gymnasium (the maintained successor to OpenAI Gym) provides all the necessary environments and is easily installed with pip install gymnasium.
Exercise 1
Beginner
Random Agent Baseline on CartPole
Run a random agent (selecting actions uniformly at random) on CartPole-v1 for 100 episodes. Record the total reward for each episode and compute the mean and standard deviation. Then implement a simple rule-based policy: if the pole angle is positive, push right; if negative, push left. Compare the rule-based policy's mean reward against the random baseline over 100 episodes.
Success criterion: Produce a bar chart comparing mean rewards for both policies. Rule-based policy should achieve significantly higher mean reward than the random baseline. Report the improvement ratio.
Exercise 2
Intermediate
Tabular Q-Learning on FrozenLake
Implement tabular Q-learning on FrozenLake-v1 (8×8 map, non-slippery version: is_slippery=False). Use epsilon-greedy exploration with epsilon decaying from 1.0 to 0.01 over 5,000 episodes. Plot the learning curve: rolling mean reward (window=100 episodes) vs. episode number. Report the final win rate over the last 100 episodes of training.
Success criterion: Learning curve should show clear improvement from near-zero win rate to >80% win rate by episode 3,000. Plot must show the convergence trend clearly.
Exercise 3
Intermediate
Reward Function Design for Autonomous Vehicles
Design a reward function for an autonomous vehicle navigating a 4-way intersection. The vehicle must: reach its destination, avoid collisions, obey traffic rules (stop at red, yield to pedestrians), and minimise travel time. (1) Write out your reward function mathematically with numerical values for each term. (2) Identify at least 3 potential reward hacking scenarios where your function could be gamed. (3) Propose modifications to address each failure mode. (4) Discuss how you would handle the sparse reward problem when the episode only terminates on collision or arrival.
Success criterion: Reward function document includes: mathematical specification with 5+ terms, 3+ reward hacking scenarios with mitigations, and a discussion of shaping strategies (potential-based reward shaping, intrinsic motivation, curriculum learning).
Exercise 4
Advanced
Multi-Armed Bandit: Thompson Sampling vs Epsilon-Greedy
Set up a multi-armed bandit experiment with 3 web page variants (arms) with simulated click rates of 5%, 8%, and 12%. Implement both epsilon-greedy (ε=0.1) and Thompson Sampling (Beta-Bernoulli). Run each algorithm for 5,000 steps, 50 times each (50 independent simulation runs). Plot: (1) cumulative regret over time for both algorithms (mean ± std across 50 runs), (2) arm selection frequency over time showing how quickly each algorithm identifies the best arm. Report mean cumulative regret at step 5,000 for both algorithms.
Success criterion: Thompson Sampling's cumulative regret at step 5,000 should be substantially lower than epsilon-greedy (roughly 3–5× lower). Plots should show Thompson Sampling converging to arm 3 within ~2,000 steps while epsilon-greedy continues exploring suboptimal arms at 10% frequency indefinitely.
Conclusion & Next Steps
Reinforcement learning has two distinct production legacies. The classical RL lineage — DQN and its descendants for discrete control, PPO and SAC for continuous control — has delivered genuine breakthroughs in game playing, robotics, chip design, and data centre optimisation, but these deployments remain specialised and require substantial engineering investment to make safe. The RLHF lineage — reward modelling, PPO fine-tuning, and DPO — has had far broader near-term impact, because it is the enabling technology for every helpful large language model deployed today. ChatGPT, Claude, Gemini, and their successors are all, at their core, language models aligned by some variant of learning from human preference feedback. The behaviour you experience when interacting with these systems — their helpfulness, their tendency to decline harmful requests, their conversational register — is shaped as much by the RLHF training process as by the pre-training data.

The field's open problems are well-known: sample efficiency (classical RL still requires millions of environment interactions for tasks that humans learn in hours), safe exploration (exploration in safety-critical domains remains primarily managed by constraints and human oversight rather than by principled algorithms), reward hacking (the tension between optimising a proxy reward and achieving the true objective is fundamental and unsolved), and sim-to-real transfer (bridging the gap between simulated training environments and physical deployment remains an empirical art).

The next part of this series moves to Conversational AI, where RLHF-aligned LLMs are the core building block, and examines the engineering of dialogue systems, intent detection, and retrieval-augmented generation pipelines that power production chatbots at scale.
Next in the Series
In Part 7: Conversational AI & Chatbots, we examine how RLHF-aligned language models become production dialogue systems — covering intent detection, state management, retrieval-augmented generation, and the engineering tradeoffs that separate a demo chatbot from a reliable, on-brand assistant at scale.
Continue This Series
Part 5: Recommender Systems
Collaborative filtering, two-tower models, and production recommendation pipelines — including the exploration vs. exploitation trade-off that connects directly to RL theory.
Part 7: Conversational AI & Chatbots
Dialogue systems, intent detection, RAG, and the production engineering of chatbots that build on RLHF-aligned language models.
Part 10: Fine-tuning, RLHF & Model Alignment
LoRA, instruction tuning, DPO, and the full alignment toolkit — a deep dive into RLHF and its alternatives for shaping LLM behaviour.