Back to Math for AI Hub

Reinforcement Learning Mathematics

May 30, 2026Wasil Zafar45 min read

Reinforcement learning lets agents learn from interaction. This extension derives the mathematical framework from scratch — Markov Decision Processes, Bellman equations, policy gradients, advantage estimation, and PPO — giving you the formal tools to understand every RL algorithm.

Table of Contents

  1. Markov Decision Processes
  2. Bellman Equations
  3. Policy Gradient Theorem
  4. Advantage Estimation
  5. Proximal Policy Optimization
  6. Practice Exercises
Math Foundations: This extension builds on Part 4: Probability (expectations, conditional distributions) and Part 8: Calculus & Optimization (gradients, chain rule). It provides the canonical derivations for all RL content across PyTorch RL and AI in the Wild.

Markov Decision Processes

Reinforcement learning operates in a Markov Decision Process (MDP) — a mathematical framework for sequential decision-making where outcomes are partly random and partly under the control of a decision-maker (agent).

Formal Definition

An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:

SymbolNameDescription
$\mathcal{S}$State spaceSet of all possible states the environment can be in
$\mathcal{A}$Action spaceSet of all actions available to the agent
$P(s'|s,a)$Transition functionProbability of reaching state $s'$ given state $s$ and action $a$
$R(s,a,s')$Reward functionImmediate reward received after transitioning
$\gamma \in [0,1)$Discount factorWeight given to future rewards vs immediate rewards

The Markov property is the key assumption: the future depends only on the current state, not on the history of how we got there:

$$P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} | s_t, a_t)$$

A policy $\pi(a|s)$ maps states to a probability distribution over actions. A deterministic policy selects one action per state: $a = \pi(s)$. A stochastic policy gives $\pi(a|s) = P(a_t = a | s_t = s)$.

Why stochastic policies? In partially observable environments or multi-agent settings, deterministic policies can be exploitable. Stochastic policies also enable exploration and make the policy gradient well-defined.

Returns & Discounting

The return $G_t$ is the total discounted reward from time step $t$ onward:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The discount factor $\gamma$ controls the trade-off:

  • $\gamma \to 0$: agent is myopic, only cares about immediate reward
  • $\gamma \to 1$: agent is far-sighted, values future rewards almost equally

Why discount? Mathematically, discounting ensures the infinite sum converges when rewards are bounded: $|G_t| \leq \frac{R_{\max}}{1-\gamma}$. Practically, it encodes the intuition that a reward now is worth more than the same reward later.

import numpy as np

# Compute discounted return from a sequence of rewards
def compute_return(rewards, gamma=0.99):
    """Calculate G_t = sum of gamma^k * r_{t+k+1} for each timestep."""
    T = len(rewards)
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = rewards[t] + gamma * G
        returns[t] = G
    return returns

# Example: rewards from a 5-step episode
rewards = [1.0, 0.0, -1.0, 2.0, 1.0]
gamma = 0.99

returns = compute_return(rewards, gamma)
print("Rewards:", rewards)
print("Returns:", [f"{g:.4f}" for g in returns])
print(f"G_0 = {returns[0]:.4f}")  # Total discounted return from start

Bellman Equations

The Bellman equations are the recursive relationships that define value functions. They are the backbone of nearly every RL algorithm — from dynamic programming to deep Q-networks.

Bellman Expectation Equations

The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following policy $\pi$:

$$V^\pi(s) = \mathbb{E}_\pi[G_t | s_t = s] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;\middle|\; s_t = s\right]$$

By expanding one step of the return $G_t = R_{t+1} + \gamma G_{t+1}$ and taking the expectation:

$$\boxed{V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]}$$

This is the Bellman expectation equation for $V^\pi$. It says: the value of a state equals the expected immediate reward plus the discounted value of the next state, averaged over all actions and transitions.

Similarly, the action-value function (Q-function) $Q^\pi(s,a)$ is the expected return when taking action $a$ in state $s$ and then following $\pi$:

$$Q^\pi(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \sum_{a'} \pi(a'|s') Q^\pi(s',a')\right]$$

The relationship between $V$ and $Q$ is straightforward:

$$V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s,a) \qquad \text{(average Q over policy)}$$

Bellman Optimality Equations

The optimal value function $V^*(s)$ achieves the maximum expected return over all policies:

$$V^*(s) = \max_\pi V^\pi(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$

And the optimal action-value function:

$$\boxed{Q^*(s,a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\right]}$$

This is the equation that Q-learning and DQN approximate. Once we have $Q^*$, the optimal policy is simply $\pi^*(s) = \arg\max_a Q^*(s,a)$.

Bellman Backup Diagram
flowchart TD
    S["State s"] --> A1["Action a₁"]
    S --> A2["Action a₂"]
    A1 --> S1["s' with P(s'|s,a₁)"]
    A1 --> S2["s'' with P(s''|s,a₁)"]
    A2 --> S3["s' with P(s'|s,a₂)"]
    S1 --> V1["R + γV(s')"]
    S2 --> V2["R + γV(s'')"]
    S3 --> V3["R + γV(s')"]
        

Value Iteration Convergence

Value iteration applies the Bellman optimality operator $\mathcal{T}$ repeatedly:

$$V_{k+1}(s) = \mathcal{T}[V_k](s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V_k(s')\right]$$

Theorem (Contraction): The Bellman optimality operator is a $\gamma$-contraction in the supremum norm:

$$\|\mathcal{T}[V] - \mathcal{T}[V']\|_\infty \leq \gamma \|V - V'\|_\infty$$

Proof sketch: For any state $s$:

$$|\mathcal{T}[V](s) - \mathcal{T}[V'](s)| \leq \max_a \sum_{s'} P(s'|s,a) \gamma |V(s') - V'(s')| \leq \gamma \|V - V'\|_\infty$$

By the Banach fixed-point theorem, repeated application of a contraction mapping converges to a unique fixed point $V^*$, and convergence is geometric with rate $\gamma$:

$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$
import numpy as np

# Value iteration on a simple 4-state gridworld
# States: 0=start, 1=mid-left, 2=mid-right, 3=goal (terminal)
# Actions: 0=left, 1=right

n_states = 4
n_actions = 2
gamma = 0.9

# Transition probabilities P[s, a, s'] — deterministic for simplicity
P = np.zeros((n_states, n_actions, n_states))
P[0, 1, 1] = 1.0  # state 0, go right → state 1
P[0, 0, 0] = 1.0  # state 0, go left → stay at 0
P[1, 1, 2] = 1.0  # state 1, go right → state 2
P[1, 0, 0] = 1.0  # state 1, go left → state 0
P[2, 1, 3] = 1.0  # state 2, go right → goal (state 3)
P[2, 0, 1] = 1.0  # state 2, go left → state 1
P[3, :, 3] = 1.0  # terminal state stays

# Rewards: +1 for reaching goal, 0 otherwise
R = np.zeros((n_states, n_actions, n_states))
R[2, 1, 3] = 1.0  # reward for entering goal

# Value iteration
V = np.zeros(n_states)
for iteration in range(100):
    V_new = np.zeros(n_states)
    for s in range(n_states):
        q_values = []
        for a in range(n_actions):
            q = sum(P[s, a, sp] * (R[s, a, sp] + gamma * V[sp])
                    for sp in range(n_states))
            q_values.append(q)
        V_new[s] = max(q_values)
    if np.max(np.abs(V_new - V)) < 1e-8:
        print(f"Converged in {iteration + 1} iterations")
        break
    V = V_new

print("Optimal values:", [f"{v:.4f}" for v in V])
# Optimal policy: argmax over Q(s,a) for each state
for s in range(n_states - 1):
    q = [sum(P[s, a, sp] * (R[s, a, sp] + gamma * V[sp])
             for sp in range(n_states)) for a in range(n_actions)]
    print(f"  State {s}: best action = {'right' if np.argmax(q) == 1 else 'left'}")

Policy Gradient Theorem

Value-based methods (like Q-learning) learn $Q^*$ and derive a policy. Policy gradient methods directly optimize a parameterized policy $\pi_\theta(a|s)$ by gradient ascent on the expected return.

Define the objective as the expected return under the policy:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R_{t+1}\right] = \mathbb{E}_{\tau \sim \pi_\theta}[G_0]$$

where $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots)$ is a trajectory sampled under $\pi_\theta$.

The Log-Derivative Trick

The probability of a trajectory $\tau$ under policy $\pi_\theta$ is:

$$P(\tau | \theta) = \rho(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t) \, P(s_{t+1}|s_t,a_t)$$

We want $\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau}[R(\tau)]$. The key identity is the log-derivative trick:

$$\nabla_\theta P(\tau|\theta) = P(\tau|\theta) \, \nabla_\theta \log P(\tau|\theta)$$

Applying this:

$$\nabla_\theta J(\theta) = \nabla_\theta \sum_\tau P(\tau|\theta) R(\tau) = \sum_\tau P(\tau|\theta) \, \nabla_\theta \log P(\tau|\theta) \, R(\tau)$$ $$= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau|\theta) \cdot R(\tau)\right]$$

Now the crucial simplification: expand $\log P(\tau|\theta)$:

$$\log P(\tau|\theta) = \log \rho(s_0) + \sum_{t=0}^{T-1}\left[\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t,a_t)\right]$$

The initial state distribution and transition dynamics don't depend on $\theta$, so their gradients vanish:

$$\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]}$$

This is the Policy Gradient Theorem. It says: to increase expected return, adjust $\theta$ to make actions with high returns more likely.

Key insight: We never need to differentiate through the environment dynamics $P(s'|s,a)$ — only through our own policy $\pi_\theta$. This is why policy gradients work even when the environment is a black box.

REINFORCE Algorithm

REINFORCE is the simplest policy gradient algorithm. It samples complete trajectories and uses the empirical return as the gradient signal:

$$\hat{g} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot G_t^{(i)}$$
import numpy as np

# REINFORCE gradient estimation (conceptual)
# Policy: softmax over 3 actions given a 4-dimensional state

def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / exp_logits.sum()

def policy_forward(state, theta):
    """Compute action probabilities: pi(a|s) = softmax(theta @ s)."""
    logits = theta @ state  # shape (n_actions,)
    probs = softmax(logits)
    return probs

def compute_log_grad(state, action, theta):
    """Gradient of log pi(a|s) w.r.t. theta."""
    probs = policy_forward(state, theta)
    # For softmax policy: d/d_theta log pi(a|s) = state * (1[a] - pi(a|s))
    one_hot = np.zeros(len(probs))
    one_hot[action] = 1.0
    grad = np.outer(one_hot - probs, state)  # shape (n_actions, state_dim)
    return grad

# Setup
np.random.seed(42)
n_actions, state_dim = 3, 4
theta = np.random.randn(n_actions, state_dim) * 0.01

# Simulate one episode
states = [np.random.randn(state_dim) for _ in range(5)]
actions = [np.random.choice(n_actions, p=policy_forward(s, theta)) for s in states]
rewards = [0.0, 0.0, 1.0, 0.0, 1.0]
gamma = 0.99

# Compute returns
returns = np.zeros(len(rewards))
G = 0.0
for t in reversed(range(len(rewards))):
    G = rewards[t] + gamma * G
    returns[t] = G

# REINFORCE gradient estimate
policy_grad = np.zeros_like(theta)
for t in range(len(states)):
    grad_log_pi = compute_log_grad(states[t], actions[t], theta)
    policy_grad += grad_log_pi * returns[t]

print("Policy gradient shape:", policy_grad.shape)
print("Gradient norm:", np.linalg.norm(policy_grad))

Variance Reduction with Baselines

REINFORCE has notoriously high variance. We can subtract a baseline $b(s_t)$ from the return without introducing bias:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]$$

Why is this unbiased? Because:

$$\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = b(s) \cdot \nabla_\theta \sum_a \pi_\theta(a|s) = b(s) \cdot \nabla_\theta 1 = 0$$

The optimal baseline (minimizing variance) is $b^*(s) = \frac{\mathbb{E}[(\nabla \log \pi)^2 G]}{\mathbb{E}[(\nabla \log \pi)^2]}$, but in practice we use $b(s) \approx V^\pi(s)$ — the value function itself. This leads naturally to the advantage function.

Advantage Estimation

The Advantage Function

The advantage function measures how much better an action is compared to the average action from that state:

$$\boxed{A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)}$$

Properties:

  • $\mathbb{E}_{a \sim \pi}[A^\pi(s,a)] = 0$ — the advantage averages to zero over the policy
  • $A^\pi(s,a) > 0$ means action $a$ is better than average from state $s$
  • $A^\pi(s,a) < 0$ means action $a$ is worse than average

Using the advantage in the policy gradient gives the cleanest form:

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A^\pi(s_t, a_t)\right]$$

Generalized Advantage Estimation (GAE)

Computing the true advantage requires knowing $Q^\pi$ and $V^\pi$ exactly — which we don't. GAE provides a practical estimator that trades off bias and variance using a parameter $\lambda \in [0,1]$.

Define the TD residual (one-step advantage estimate):

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Then the $n$-step advantage estimate is:

$$\hat{A}_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k \delta_{t+k}$$

GAE is the exponentially-weighted average of all $n$-step estimates:

$$\boxed{\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{k=0}^{\infty} (\gamma\lambda)^k \delta_{t+k}}$$

This can be computed efficiently with a backward pass (just like returns):

$$\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$$

The $\lambda$ parameter controls the bias-variance trade-off:

  • $\lambda = 0$: $\hat{A}_t = \delta_t$ (one-step TD, low variance, high bias)
  • $\lambda = 1$: $\hat{A}_t = G_t - V(s_t)$ (Monte Carlo, high variance, zero bias)
  • $\lambda \in (0,1)$: practical sweet spot (typically $\lambda = 0.95$)
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """
    Compute Generalized Advantage Estimation.
    
    Args:
        rewards: array of rewards [r_1, r_2, ..., r_T]
        values: array of value estimates [V(s_0), V(s_1), ..., V(s_T)]
                (includes bootstrap value V(s_T) for non-terminal)
        gamma: discount factor
        lam: GAE lambda parameter
    
    Returns:
        advantages: GAE advantages for each timestep
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    
    for t in reversed(range(T)):
        # TD residual: delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    
    return advantages

# Example: 6-step trajectory
rewards = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
values = np.array([0.5, 0.6, 0.8, 0.3, 0.4, 0.9, 0.0])  # includes V(s_T)

advantages = compute_gae(rewards, values, gamma=0.99, lam=0.95)
print("Rewards:    ", rewards)
print("Values:     ", values[:-1])
print("Advantages: ", [f"{a:.4f}" for a in advantages])
print(f"\nMean advantage: {advantages.mean():.4f} (should be ~0 with good value estimates)")

Proximal Policy Optimization (PPO)

Vanilla policy gradients suffer from instability: a single large update can destroy a good policy. PPO constrains how much the policy can change in one update, achieving stability without the complexity of trust-region methods (TRPO).

The Surrogate Objective

Instead of directly using $\nabla_\theta \log \pi_\theta$, PPO defines a probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$

The surrogate objective is:

$$L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \cdot \hat{A}_t\right]$$

When $\theta = \theta_{\text{old}}$, $r_t = 1$ and $L^{\text{CPI}} = 0$. The gradient of $L^{\text{CPI}}$ at $\theta_{\text{old}}$ equals the standard policy gradient. But without constraints, maximizing $L^{\text{CPI}}$ can lead to excessively large policy changes.

The Clipping Mechanism

PPO clips the ratio to prevent large deviations. The clipped surrogate objective is:

$$\boxed{L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \;\text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]}$$

where $\epsilon$ is typically $0.1$ or $0.2$.

How clipping works:

  • If $\hat{A}_t > 0$ (good action): the objective increases with $r_t$, but is clipped at $1+\epsilon$ — preventing the policy from moving too far toward this action
  • If $\hat{A}_t < 0$ (bad action): the objective increases as $r_t$ decreases, but is clipped at $1-\epsilon$ — preventing the policy from moving too far away from this action

The $\min$ ensures we take the more pessimistic (conservative) bound, creating a trust region around the old policy.

Analysis
PPO Clipping Behavior

Consider $\epsilon = 0.2$:

  • Case 1: $\hat{A}_t = +1$, $r_t = 1.5$ → clipped to $1.2$, gradient zeroed beyond clip
  • Case 2: $\hat{A}_t = -1$, $r_t = 0.6$ → clipped to $0.8$, gradient zeroed beyond clip
  • Case 3: $\hat{A}_t = +1$, $r_t = 1.1$ → within bounds, normal gradient

The clip removes incentive to move the ratio beyond $[1-\epsilon, 1+\epsilon]$, providing soft trust-region behavior.

trust regionstabilityclipping

Entropy Regularization

To encourage exploration and prevent premature convergence to a deterministic policy, PPO adds an entropy bonus:

$$L^{\text{total}}(\theta) = L^{\text{CLIP}}(\theta) - c_1 L^{\text{VF}}(\theta) + c_2 H[\pi_\theta]$$

where:

  • $L^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{target}})^2$ is the value function loss
  • $H[\pi_\theta] = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)$ is the policy entropy
  • $c_1 \approx 0.5$ and $c_2 \approx 0.01$ are hyperparameters

The entropy term keeps the action distribution spread out. As training progresses and the policy improves, entropy naturally decreases — the bonus just prevents it from collapsing too early.

import numpy as np

def ppo_loss(old_probs, new_probs, actions, advantages, epsilon=0.2, entropy_coeff=0.01):
    """
    Compute PPO clipped surrogate loss for a batch.
    
    Args:
        old_probs: pi_old(a|s) for each (s,a) pair, shape (batch,)
        new_probs: pi_theta(a|s) for each (s,a) pair, shape (batch,)
        actions: action indices, shape (batch,)
        advantages: GAE advantages, shape (batch,)
        epsilon: clipping parameter
        entropy_coeff: entropy bonus weight
    
    Returns:
        loss: scalar PPO loss (negate for gradient ascent)
    """
    # Probability ratio
    ratio = new_probs / (old_probs + 1e-8)
    
    # Clipped surrogate
    surr1 = ratio * advantages
    surr2 = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    clip_loss = np.minimum(surr1, surr2).mean()
    
    # Entropy bonus (approximate from new_probs)
    entropy = -(new_probs * np.log(new_probs + 1e-8)).mean()
    
    # Total objective (maximize)
    total = clip_loss + entropy_coeff * entropy
    return -total  # negate for minimization

# Example batch
np.random.seed(0)
batch_size = 8
old_probs = np.random.uniform(0.2, 0.8, batch_size)
new_probs = old_probs * np.random.uniform(0.8, 1.3, batch_size)  # slightly changed
advantages = np.random.randn(batch_size)

loss = ppo_loss(old_probs, new_probs, None, advantages)
print(f"PPO loss: {loss:.4f}")
print(f"Ratio range: [{(new_probs/old_probs).min():.3f}, {(new_probs/old_probs).max():.3f}]")
print(f"Advantages: mean={advantages.mean():.3f}, std={advantages.std():.3f}")

Practice Exercises

Exercises: Work through these to solidify the derivations above.
  1. Bellman derivation: Starting from $V^\pi(s) = \mathbb{E}_\pi[G_t | s_t=s]$ and $G_t = R_{t+1} + \gamma G_{t+1}$, derive the Bellman expectation equation by expanding the expectation over the first action and first transition.
  2. Contraction proof: Prove that the Bellman optimality operator is a $\gamma$-contraction. Start from the definition $\mathcal{T}[V](s) = \max_a \sum_{s'} P(s'|s,a)[R + \gamma V(s')]$ and show $|\mathcal{T}[V](s) - \mathcal{T}[V'](s)| \leq \gamma\|V - V'\|_\infty$.
  3. Baseline unbiasedness: Prove that subtracting any state-dependent baseline $b(s)$ from the return does not bias the policy gradient. Show that $\mathbb{E}_{a\sim\pi}[\nabla\log\pi(a|s) \cdot b(s)] = 0$.
  4. GAE limits: Verify that GAE with $\lambda=0$ gives the one-step TD advantage $\delta_t$, and with $\lambda=1$ gives the Monte Carlo advantage $G_t - V(s_t)$.
  5. PPO clipping analysis: For $\epsilon=0.2$, sketch the effective objective as a function of $r_t$ for both $\hat{A}_t > 0$ and $\hat{A}_t < 0$. Identify the regions where the gradient is zero.
  6. Coding challenge: Implement value iteration for a 5×5 gridworld with walls. Verify that your values converge and extract the optimal policy.