Actor-Critic can learn good policies, but vanilla policy gradient methods share a fundamental instability: a single bad update can change the policy drastically, and the resulting collapse in performance is hard to recover from. Proximal Policy Optimization (PPO) addresses this with a clipped surrogate objective that removes the incentive to move the policy far from the one that collected the data. Because each batch can then be reused for several gradient steps while updates stay within an approximate trust region, PPO is typically more sample-efficient and more stable in training than REINFORCE or standard Actor-Critic.

Core Idea: Clipped Surrogate Objective

Define the probability ratio:
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
PPO maximises a clipped surrogate objective, where A_t is the advantage estimate for step t:
L_CLIP = E_t [ min( r_t · A_t,  clip(r_t, 1-ε, 1+ε) · A_t ) ]
Once r_t leaves [1-ε, 1+ε] in the direction the advantage favours, the objective stops improving, so the gradient no longer pushes the policy further that way (here ε=0.2, so the range is [0.8, 1.2]).
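To make the clipping concrete, the following minimal sketch evaluates the per-sample term inside L_CLIP for a few hand-picked ratios and advantages. The function name `clipped_surrogate` and the sample numbers are illustrative assumptions, not part of the notebook.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Illustrative (ratio, advantage) pairs, chosen by hand for this sketch.
samples = [(1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for ratio, advantage in samples:
    print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

Only the first and last pairs are affected by the clip: the ratio has already moved past the range in the direction the advantage favours. In the middle two pairs the ratio moved against the advantage, and the min keeps the lower, unclipped value.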

Environment Setup

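The notebook uses CartPole-v1 with the same MyWrapper seen in earlier algorithms. The wrapper caps episodes at 200 steps (CartPole-v1 itself allows up to 500) and simplifies the interface: reset returns only the state, and step merges terminated and truncated into a single done flag.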
import gym

class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()

Networks

The actor uses a Softmax output to produce a categorical distribution over the two discrete actions. The critic outputs a scalar V(s).
import torch

# Actor: state (4) → action probabilities (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)

# Critic: state (4) → scalar V(s)
model_td = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

model(torch.randn(2, 4)), model_td(torch.randn(2, 4))

Action Selection

import random

def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 4)
    # [1, 4] → [1, 2]
    prob = model(state)

    # Sample one action weighted by its probability
    action = random.choices(range(2), weights=prob[0].tolist(), k=1)[0]
    return action

Data Collection

This implementation collects one full episode (a rollout) before each round of updates, recording next states and termination flags so the critic can compute bootstrapped TD targets.
def get_data():
    states      = []
    rewards     = []
    actions     = []
    next_states = []
    overs       = []

    state = env.reset()
    over  = False

    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step(action)

        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        overs.append(over)

        state = next_state

    # Convert to tensors
    states      = torch.FloatTensor(states).reshape(-1, 4)       # [b, 4]
    rewards     = torch.FloatTensor(rewards).reshape(-1, 1)      # [b, 1]
    actions     = torch.LongTensor(actions).reshape(-1, 1)       # [b, 1]
    next_states = torch.FloatTensor(next_states).reshape(-1, 4)  # [b, 4]
    overs       = torch.LongTensor(overs).reshape(-1, 1)         # [b, 1]

    return states, rewards, actions, next_states, overs
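The training loop later on this page calls `test(play=False)` to report average returns, but the helper itself is not shown here. The sketch below is one plausible shape for it, under the assumption that it plays a single episode with the current policy and returns the total reward. The factory `make_test` and the stand-in `CountdownEnv` are hypothetical names introduced so the block runs on its own; rendering for `play=True` is left out.

```python
def make_test(env, get_action):
    """Build a test(play=False) helper bound to an environment and a policy.

    Assumption: env.reset() returns a state and env.step(action) returns
    (state, reward, done, info), as MyWrapper does on this page.
    """
    def test(play=False):
        # `play` is accepted for call compatibility; rendering is omitted here.
        state = env.reset()
        reward_sum = 0.0
        over = False
        while not over:
            action = get_action(state)
            state, reward, over, _ = env.step(action)
            reward_sum += reward
        return reward_sum

    return test


class CountdownEnv:
    """Hypothetical stand-in environment so the sketch runs without gym."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.step_n = 0

    def reset(self):
        self.step_n = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.step_n += 1
        done = self.step_n >= self.horizon
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


demo_test = make_test(CountdownEnv(horizon=5), get_action=lambda state: 0)
print(demo_test(play=False))  # one reward of 1.0 per step for 5 steps
```

In the notebook setting, `test = make_test(env, get_action)` would bind the helper to the real environment and actor before `train()` is called.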

Advantage Computation (GAE)

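PPO uses Generalised Advantage Estimation (GAE) to compute a smoothed advantage signal. With the TD error δ_t = r_t + γ V(s_{t+1}) − V(s_t), where the bootstrap term γ V(s_{t+1}) is zeroed on the step that ends the episode, the advantage is a discounted sum of TD errors, accumulated backwards through the episode:
A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + …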
def get_advantages(deltas):
    advantages = []
    s = 0.0

    # Traverse backwards through TD errors
    for delta in deltas[::-1]:
        s = 0.98 * 0.95 * s + delta   # γ=0.98, λ=0.95
        advantages.append(s)

    advantages.reverse()
    return advantages
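As a sanity check on the recursion, the sketch below computes TD errors for a made-up three-step episode and compares the backward pass against a direct evaluation of the series. The helper names (`td_errors`, `gae_backward`, `gae_direct`) and all numbers are assumptions for illustration; the critic values are invented rather than produced by a trained network.

```python
GAMMA, LAMBDA = 0.98, 0.95  # same constants as get_advantages above


def td_errors(rewards, values, next_values, overs, gamma=GAMMA):
    """delta_t = r_t + gamma * V(s_{t+1}) * (1 - over_t) - V(s_t)."""
    return [
        r + gamma * nv * (1 - over) - v
        for r, v, nv, over in zip(rewards, values, next_values, overs)
    ]


def gae_backward(deltas, gamma=GAMMA, lam=LAMBDA):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages, running = [], 0.0
    for delta in reversed(deltas):
        running = gamma * lam * running + delta
        advantages.append(running)
    advantages.reverse()
    return advantages


def gae_direct(deltas, gamma=GAMMA, lam=LAMBDA):
    """Direct evaluation of the series, for cross-checking only."""
    n = len(deltas)
    return [
        sum((gamma * lam) ** k * deltas[t + k] for k in range(n - t))
        for t in range(n)
    ]


# Made-up three-step episode; the critic values are invented for illustration.
rewards     = [1.0, 1.0, 1.0]
values      = [2.5, 1.8, 0.9]
next_values = [1.8, 0.9, 0.4]
overs       = [0, 0, 1]

deltas = td_errors(rewards, values, next_values, overs)
print(gae_backward(deltas))
print(gae_direct(deltas))
```

Both lists agree up to floating-point rounding. The backward form needs one pass over the episode, while the direct form recomputes the whole tail sum for every step.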

Training Loop

PPO's key practical feature is that the same batch is reused for several gradient steps (here 10 inner iterations) while the clipped objective keeps the policy ratio near 1. The TD targets, advantages, and old action probabilities are computed once per batch and held fixed; both the actor and the critic are then updated on every inner iteration.
def train():
    optimizer    = torch.optim.Adam(model.parameters(),    lr=1e-3)
    optimizer_td = torch.optim.Adam(model_td.parameters(), lr=1e-2)
    loss_fn      = torch.nn.MSELoss()

    for epoch in range(500):
        states, rewards, actions, next_states, overs = get_data()

        # ── Critic values and targets ────────────────────────────────
        # [b, 4] → [b, 1]
        values  = model_td(states)

        targets = model_td(next_states).detach()
        targets = targets * 0.98
        targets *= (1 - overs)
        targets += rewards

        # TD errors → GAE advantages
        deltas     = (targets - values).squeeze(dim=1).tolist()
        advantages = get_advantages(deltas)
        advantages = torch.FloatTensor(advantages).reshape(-1, 1)

        # Old policy probabilities (frozen snapshot before inner loop)
        old_probs = model(states)
        old_probs = old_probs.gather(dim=1, index=actions)
        old_probs = old_probs.detach()

        # ── Multiple PPO epochs on the same batch ────────────────────
        for _ in range(10):
            # New policy probabilities
            new_probs = model(states)
            new_probs = new_probs.gather(dim=1, index=actions)

            # Probability ratio π_new / π_old
            ratios = new_probs / old_probs

            # Clipped surrogate objective (ε = 0.2)
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 0.8, 1.2) * advantages
            loss  = -torch.min(surr1, surr2).mean()

            # Recompute critic values and update critic each inner step
            values  = model_td(states)
            loss_td = loss_fn(values, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            optimizer_td.zero_grad()
            loss_td.backward()
            optimizer_td.step()

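        if epoch % 50 == 0:
            # test(play=False) must be defined before train() is called; it is
            # expected to play one episode and return its total reward.
            print(epoch, sum([test(play=False) for _ in range(10)]) / 10)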

train()
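One consequence of taking the old-probability snapshot right before the inner loop is that the ratio is exactly 1 on the first inner step, so clipping can only become active from the second step onward. The sketch below checks this on a random stand-in batch. The network `actor`, the batch tensors, and the `clipped_fraction` bookkeeping are assumptions for illustration and use no real rollouts.

```python
import torch

torch.manual_seed(0)

# Small stand-in actor with the same layout as `model` on this page.
actor = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Random stand-in batch: 8 states, actions, and advantages (not real rollouts).
states = torch.randn(8, 4)
actions = torch.randint(0, 2, (8, 1))
advantages = torch.randn(8, 1)

# Frozen snapshot of the action probabilities before any update.
old_probs = actor(states).gather(dim=1, index=actions).detach()

first_ratios = None
clipped_fraction = []
for step in range(10):
    new_probs = actor(states).gather(dim=1, index=actions)
    ratios = new_probs / old_probs

    if step == 0:
        first_ratios = ratios.detach().clone()

    # Fraction of samples whose ratio currently lies outside [0.8, 1.2].
    outside = ((ratios < 0.8) | (ratios > 1.2)).float().mean().item()
    clipped_fraction.append(outside)

    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 0.8, 1.2) * advantages
    loss = -torch.min(surr1, surr2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(first_ratios.squeeze(1))
print(clipped_fraction)
```

The first printed tensor is all ones, and the first entry of `clipped_fraction` is zero. How quickly later entries grow depends on the learning rate and the batch, so no particular values should be read into them.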

Why Clipping Works

L_CLIP = E [ min( ratio · A,  clip(ratio, 1-ε, 1+ε) · A ) ]
  • When A > 0 (action was good): the term stops growing once ratio exceeds 1+ε, so the policy gains nothing from increasing that action’s probability further.
  • When A < 0 (action was bad): the term stops improving once ratio falls below 1-ε, so the policy gains nothing from suppressing that action’s probability further.
The min takes the more pessimistic of the clipped and unclipped terms. A ratio that moves the wrong way (for example, above 1+ε when A < 0) is therefore still penalised in full, and the objective only rewards changes that keep the policy close to the old one.
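The two cases above can be checked numerically by estimating the slope of the per-sample objective with respect to the ratio. This is a minimal sketch using finite differences; `clipped_surrogate` and `slope` are hypothetical helper names, and the test points are chosen by hand.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2 * h)


for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        print(advantage, ratio, round(slope(ratio, advantage), 3))
```

The slope is zero only at (A = 1, ratio = 1.5) and (A = -1, ratio = 0.5), the two points where the policy has already moved past the clip range in the favoured direction. Everywhere else the slope equals A, so the gradient still pulls a stray ratio back.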
The clip range [0.8, 1.2] corresponds to ε=0.2. Increasing ε allows larger policy changes per batch but makes training less stable. Typical values range from 0.1 to 0.3.
PPO reuses each data batch for multiple gradient steps (range(10)). This improves sample efficiency compared to REINFORCE, which discards each episode immediately after one gradient step.
