Markov Decision Processes
Reinforcement learning operates in a Markov Decision Process (MDP) — a mathematical framework for sequential decision-making where outcomes are partly random and partly under the control of a decision-maker (agent).
Formal Definition
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:
| Symbol | Name | Description |
|---|---|---|
| $\mathcal{S}$ | State space | Set of all possible states the environment can be in |
| $\mathcal{A}$ | Action space | Set of all actions available to the agent |
| $P(s'|s,a)$ | Transition function | Probability of reaching state $s'$ given state $s$ and action $a$ |
| $R(s,a,s')$ | Reward function | Immediate reward received after transitioning |
| $\gamma \in [0,1)$ | Discount factor | Weight given to future rewards vs immediate rewards |
The Markov property is the key assumption: the future depends only on the current state, not on the history of how we got there:
$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)$$A policy $\pi(a|s)$ maps states to a probability distribution over actions. A deterministic policy selects one action per state: $a = \pi(s)$. A stochastic policy gives $\pi(a|s) = P(a_t = a | s_t = s)$.
Returns & Discounting
The return $G_t$ is the total discounted reward from time step $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$The discount factor $\gamma$ controls the trade-off:
- $\gamma \to 0$: agent is myopic, only cares about immediate reward
- $\gamma \to 1$: agent is far-sighted, values future rewards almost equally
Why discount? Mathematically, discounting ensures the infinite sum converges when rewards are bounded: $|G_t| \leq \frac{R_{\max}}{1-\gamma}$. Practically, it encodes the intuition that a reward now is worth more than the same reward later.
import numpy as np
# Compute discounted return from a sequence of rewards
def compute_return(rewards, gamma=0.99):
"""Calculate G_t = sum of gamma^k * r_{t+k+1} for each timestep."""
T = len(rewards)
returns = np.zeros(T)
G = 0.0
for t in reversed(range(T)):
G = rewards[t] + gamma * G
returns[t] = G
return returns
# Example: rewards from a 5-step episode
rewards = [1.0, 0.0, -1.0, 2.0, 1.0]
gamma = 0.99
returns = compute_return(rewards, gamma)
print("Rewards:", rewards)
print("Returns:", [f"{g:.4f}" for g in returns])
print(f"G_0 = {returns[0]:.4f}") # Total discounted return from start
Bellman Equations
The Bellman equations are the recursive relationships that define value functions. They are the backbone of nearly every RL algorithm — from dynamic programming to deep Q-networks.
Bellman Expectation Equations
The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following policy $\pi$:
$$V^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; s_t = s\right]$$By expanding one step of the return $G_t = R_{t+1} + \gamma G_{t+1}$ and taking the expectation:
$$\boxed{V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]}$$This is the Bellman expectation equation for $V^\pi$. It says: the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over all actions and transitions.
Similarly, the action-value function (Q-function) $Q^\pi(s,a)$ is the expected return when taking action $a$ in state $s$ and then following $\pi$:
$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\right]$$The relationship between $V$ and $Q$ is straightforward:
$$V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s,a) \qquad \text{(average Q over policy)}$$Bellman Optimality Equations
The optimal value function $V^*(s)$ achieves the maximum expected return over all policies:
$$V^*(s) = \max_\pi V^\pi(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$And the optimal action-value function:
$$\boxed{Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]}$$This is the equation that Q-learning and DQN approximate. Once we have $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s,a)$.
flowchart TD
S["State s"] --> A1["Action a₁"]
S --> A2["Action a₂"]
A1 --> S1["s' with P(s'|s,a₁)"]
A1 --> S2["s'' with P(s''|s,a₁)"]
A2 --> S3["s' with P(s'|s,a₂)"]
S1 --> V1["R + γV(s')"]
S2 --> V2["R + γV(s'')"]
S3 --> V3["R + γV(s')"]
Value Iteration Convergence
Value iteration applies the Bellman optimality operator $\mathcal{T}$ repeatedly:
$$V_{k+1}(s) = \mathcal{T}[V_k](s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right]$$Theorem (Contraction): The Bellman optimality operator is a $\gamma$-contraction in the supremum norm:
$$\|\mathcal{T}[V] - \mathcal{T}[V']\|_\infty \leq \gamma \|V - V'\|_\infty$$Proof sketch: For any state $s$:
$$|\mathcal{T}[V](s) - \mathcal{T}[V'](s)| \leq \max_a \sum_{s'} P(s'|s,a) \gamma |V(s') - V'(s')| \leq \gamma \|V - V'\|_\infty$$By the Banach fixed-point theorem, repeated application of a contraction mapping converges to a unique fixed point $V^*$, and convergence is geometric with rate $\gamma$:
$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$import numpy as np
# Value iteration on a simple 4-state gridworld
# States: 0=start, 1=mid-left, 2=mid-right, 3=goal (terminal)
# Actions: 0=left, 1=right
n_states = 4
n_actions = 2
gamma = 0.9
# Transition probabilities P[s, a, s'] — deterministic for simplicity
P = np.zeros((n_states, n_actions, n_states))
P[0, 1, 1] = 1.0 # state 0, go right → state 1
P[0, 0, 0] = 1.0 # state 0, go left → stay at 0
P[1, 1, 2] = 1.0 # state 1, go right → state 2
P[1, 0, 0] = 1.0 # state 1, go left → state 0
P[2, 1, 3] = 1.0 # state 2, go right → goal (state 3)
P[2, 0, 1] = 1.0 # state 2, go left → state 1
P[3, :, 3] = 1.0 # terminal state stays
# Rewards: +1 for reaching goal, 0 otherwise
R = np.zeros((n_states, n_actions, n_states))
R[2, 1, 3] = 1.0 # reward for entering goal
# Value iteration
V = np.zeros(n_states)
for iteration in range(100):
V_new = np.zeros(n_states)
for s in range(n_states):
q_values = []
for a in range(n_actions):
q = sum(P[s, a, sp] * (R[s, a, sp] + gamma * V[sp])
for sp in range(n_states))
q_values.append(q)
V_new[s] = max(q_values)
if np.max(np.abs(V_new - V)) < 1e-8:
print(f"Converged in {iteration + 1} iterations")
break
V = V_new
print("Optimal values:", [f"{v:.4f}" for v in V])
# Optimal policy: argmax over Q(s,a) for each state
for s in range(n_states - 1):
q = [sum(P[s, a, sp] * (R[s, a, sp] + gamma * V[sp])
for sp in range(n_states)) for a in range(n_actions)]
print(f" State {s}: best action = {'right' if np.argmax(q) == 1 else 'left'}")
Policy Gradient Theorem
Value-based methods (like Q-learning) learn $Q^*$ and derive a policy. Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a|s)$ by gradient ascent on the expected return.
Define the objective as the expected return under the policy:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R_{t+1}\right] = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]$$where $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ is a trajectory sampled under $\pi_\theta$.
The Log-Derivative Trick
The probability of a trajectory $\tau$ under policy $\pi_\theta$ is:
$$P(\tau | \theta) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) \, P(s_{t+1}|s_t,a_t)$$We want $\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau}[R(\tau)]$. The key identity is the log-derivative trick:
$$\nabla_\theta P(\tau|\theta) = P(\tau|\theta) \, \nabla_\theta \log P(\tau|\theta)$$Applying this:
$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau|\theta) R(\tau) = \sum_\tau P(\tau|\theta) \, \nabla_\theta \log P(\tau|\theta) \, R(\tau)$$ $$= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau|\theta) \cdot R(\tau)\right]$$Now the crucial simplification: expand $\log P(\tau|\theta)$:
$$\log P(\tau|\theta) = \log \rho(s_0) + \sum_{t=0}^{T-1}\left[\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)\right]$$The initial state distribution and transition dynamics don't depend on $\theta$, so their gradients vanish:
$$\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]}$$This is the Policy Gradient Theorem. It says: to increase expected return, adjust $\theta$ to make actions with high returns more likely.
REINFORCE Algorithm
REINFORCE is the simplest policy gradient algorithm. It samples complete trajectories and uses the empirical return as the gradient signal:
$$\hat{g} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot G_t^{(i)}$$import numpy as np
# REINFORCE gradient estimation (conceptual)
# Policy: softmax over 3 actions given a 4-dimensional state
def softmax(logits):
exp_logits = np.exp(logits - np.max(logits))
return exp_logits / exp_logits.sum()
def policy_forward(state, theta):
"""Compute action probabilities: pi(a|s) = softmax(theta @ s)."""
logits = theta @ state # shape (n_actions,)
probs = softmax(logits)
return probs
def compute_log_grad(state, action, theta):
"""Gradient of log pi(a|s) w.r.t. theta."""
probs = policy_forward(state, theta)
# For softmax policy: d/d_theta log pi(a|s) = state * (1[a] - pi(a|s))
one_hot = np.zeros(len(probs))
one_hot[action] = 1.0
grad = np.outer(one_hot - probs, state) # shape (n_actions, state_dim)
return grad
# Setup
np.random.seed(42)
n_actions, state_dim = 3, 4
theta = np.random.randn(n_actions, state_dim) * 0.01
# Simulate one episode
states = [np.random.randn(state_dim) for _ in range(5)]
actions = [np.random.choice(n_actions, p=policy_forward(s, theta)) for s in states]
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
gamma = 0.99
# Compute returns
returns = np.zeros(len(rewards))
G = 0.0
for t in reversed(range(len(rewards))):
G = rewards[t] + gamma * G
returns[t] = G
# REINFORCE gradient estimate
policy_grad = np.zeros_like(theta)
for t in range(len(states)):
grad_log_pi = compute_log_grad(states[t], actions[t], theta)
policy_grad += grad_log_pi * returns[t]
print("Policy gradient shape:", policy_grad.shape)
print("Gradient norm:", np.linalg.norm(policy_grad))
Variance Reduction with Baselines
REINFORCE has notoriously high variance. We can subtract a baseline $b(s_t)$ from the return without introducing bias:
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]$$Why is this unbiased? Because:
$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$$The optimal baseline (minimizing variance) is $b^*(s) = \frac{\mathbb{E}[(\nabla \log \pi)^2 G]}{\mathbb{E}[(\nabla \log \pi)^2]}$, but in practice we use $b(s) \approx V^\pi(s)$ — the value function itself. This leads naturally to the advantage function.
Advantage Estimation
The Advantage Function
The advantage function measures how much better an action is compared to the average action from that state:
$$\boxed{A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)}$$Properties:
- $\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0$ — the advantage averages to zero over the policy
- $A^\pi(s,a) > 0$ means action $a$ is better than average from state $s$
- $A^\pi(s,a) < 0$ means action $a$ is worse than average
Using the advantage in the policy gradient gives the cleanest form:
$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t)\right]$$Generalized Advantage Estimation (GAE)
Computing the true advantage requires knowing $Q^\pi$ and $V^\pi$ exactly — which we don't. GAE provides a practical estimator that trades off bias and variance using a parameter $\lambda \in [0,1]$.
Define the TD residual (one-step advantage estimate):
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$Then the $n$-step advantage estimate is:
$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$GAE is the exponentially-weighted average of all $n$-step estimates:
$$\boxed{\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}}$$This can be computed efficiently with a backward pass (just like returns):
$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$$The $\lambda$ parameter controls the bias-variance trade-off:
- $\lambda = 0$: $\hat{A}_t = \delta_t$ (one-step TD, low variance, high bias)
- $\lambda = 1$: $\hat{A}_t = G_t - V(s_t)$ (Monte Carlo, high variance, zero bias)
- $\lambda \in (0,1)$: practical sweet spot (typically $\lambda = 0.95$)
import numpy as np
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
"""
Compute Generalized Advantage Estimation.
Args:
rewards: array of rewards [r_1, r_2, ..., r_T]
values: array of value estimates [V(s_0), V(s_1), ..., V(s_T)]
(includes bootstrap value V(s_T) for non-terminal)
gamma: discount factor
lam: GAE lambda parameter
Returns:
advantages: GAE advantages for each timestep
"""
T = len(rewards)
advantages = np.zeros(T)
gae = 0.0
for t in reversed(range(T)):
# TD residual: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
next_value = values[t + 1] if t + 1 < len(values) else 0.0
delta = rewards[t] + gamma * next_value - values[t]
# GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
gae = delta + gamma * lam * gae
advantages[t] = gae
return advantages
# Example: 6-step trajectory
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.8, 0.3, 0.4, 0.9, 0.0]) # includes V(s_T)
advantages = compute_gae(rewards, values, gamma=0.99, lam=0.95)
print("Rewards: ", rewards)
print("Values: ", values[:-1])
print("Advantages: ", [f"{a:.4f}" for a in advantages])
print(f"\nMean advantage: {advantages.mean():.4f} (should be ~0 with good value estimates)")
Proximal Policy Optimization (PPO)
Vanilla policy gradients suffer from instability: a single large update can destroy a good policy. PPO constrains how much the policy can change in one update, achieving stability without the complexity of trust-region methods (TRPO).
The Surrogate Objective
Instead of directly using $\nabla_\theta \log \pi_\theta$, PPO defines a probability ratio:
$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$The surrogate objective is:
$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \cdot \hat{A}_t\right]$$When $\theta = \theta_{\text{old}}$, $r_t = 1$ and $L^{\text{CPI}} = 0$. The gradient of $L^{\text{CPI}}$ at $\theta_{\text{old}}$ equals the standard policy gradient. But without constraints, maximizing $L^{\text{CPI}}$ can lead to excessively large policy changes.
The Clipping Mechanism
PPO clips the ratio to prevent large deviations. The clipped surrogate objective is:
$$\boxed{L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \;\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]}$$where $\epsilon$ is typically $0.1$ or $0.2$.
How clipping works:
- If $\hat{A}_t > 0$ (good action): the objective increases with $r_t$, but is clipped at $1+\epsilon$ — preventing the policy from moving too far toward this action
- If $\hat{A}_t < 0$ (bad action): the objective increases as $r_t$ decreases, but is clipped at $1-\epsilon$ — preventing the policy from moving too far away from this action
The $\min$ ensures we take the more pessimistic (conservative) bound, creating a trust region around the old policy.
PPO Clipping Behavior
Consider $\epsilon = 0.2$:
- Case 1: $\hat{A}_t = +1$, $r_t = 1.5$ → clipped to $1.2$, gradient zeroed beyond clip
- Case 2: $\hat{A}_t = -1$, $r_t = 0.6$ → clipped to $0.8$, gradient zeroed beyond clip
- Case 3: $\hat{A}_t = +1$, $r_t = 1.1$ → within bounds, normal gradient
The clip removes incentive to move the ratio beyond $[1-\epsilon, 1+\epsilon]$, providing soft trust-region behavior.
Entropy Regularization
To encourage exploration and prevent premature convergence to a deterministic policy, PPO adds an entropy bonus:
$$L^{\text{total}}(\theta) = L^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2 H[\pi_\theta]$$where:
- $L^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{target}})^2$ is the value function loss
- $H[\pi_\theta] = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)$ is the policy entropy
- $c_1 \approx 0.5$ and $c_2 \approx 0.01$ are hyperparameters
The entropy term keeps the action distribution spread out. As training progresses and the policy improves, entropy naturally decreases — the bonus just prevents it from collapsing too early.
import numpy as np
def ppo_loss(old_probs, new_probs, actions, advantages, epsilon=0.2, entropy_coeff=0.01):
"""
Compute PPO clipped surrogate loss for a batch.
Args:
old_probs: pi_old(a|s) for each (s,a) pair, shape (batch,)
new_probs: pi_theta(a|s) for each (s,a) pair, shape (batch,)
actions: action indices, shape (batch,)
advantages: GAE advantages, shape (batch,)
epsilon: clipping parameter
entropy_coeff: entropy bonus weight
Returns:
loss: scalar PPO loss (negate for gradient ascent)
"""
# Probability ratio
ratio = new_probs / (old_probs + 1e-8)
# Clipped surrogate
surr1 = ratio * advantages
surr2 = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
clip_loss = np.minimum(surr1, surr2).mean()
# Entropy bonus (approximate from new_probs)
entropy = -(new_probs * np.log(new_probs + 1e-8)).mean()
# Total objective (maximize)
total = clip_loss + entropy_coeff * entropy
return -total # negate for minimization
# Example batch
np.random.seed(0)
batch_size = 8
old_probs = np.random.uniform(0.2, 0.8, batch_size)
new_probs = old_probs * np.random.uniform(0.8, 1.3, batch_size) # slightly changed
advantages = np.random.randn(batch_size)
loss = ppo_loss(old_probs, new_probs, None, advantages)
print(f"PPO loss: {loss:.4f}")
print(f"Ratio range: [{(new_probs/old_probs).min():.3f}, {(new_probs/old_probs).max():.3f}]")
print(f"Advantages: mean={advantages.mean():.3f}, std={advantages.std():.3f}")
Practice Exercises
- Bellman derivation: Starting from $V^\pi(s) = \mathbb{E}_\pi[G_t | s_t=s]$ and $G_t = R_{t+1} + \gamma G_{t+1}$, derive the Bellman expectation equation by expanding the expectation over the first action and first transition.
- Contraction proof: Prove that the Bellman optimality operator is a $\gamma$-contraction. Start from the definition $\mathcal{T}[V](s) = \max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$ and show $|\mathcal{T}[V](s) - \mathcal{T}[V'](s)| \leq \gamma\|V - V'\|_\infty$.
- Baseline unbiasedness: Prove that subtracting any state-dependent baseline $b(s)$ from the return does not bias the policy gradient. Show that $\mathbb{E}_{a\sim\pi}[\nabla\log\pi(a|s) \cdot b(s)] = 0$.
- GAE limits: Verify that GAE with $\lambda=0$ gives the one-step TD advantage $\delta_t$, and with $\lambda=1$ gives the Monte Carlo advantage $G_t - V(s_t)$.
- PPO clipping analysis: For $\epsilon=0.2$, sketch the effective objective as a function of $r_t$ for both $\hat{A}_t > 0$ and $\hat{A}_t < 0$. Identify the regions where the gradient is zero.
- Coding challenge: Implement value iteration for a 5×5 gridworld with walls. Verify that your values converge and extract the optimal policy.