Actor-Critic produces good policies, but vanilla policy gradient methods suffer from a fundamental instability: a single bad update can drastically change the policy, and the resulting collapse in performance is hard to recover from. Proximal Policy Optimization (PPO) addresses this with a clipped surrogate objective that limits how much the policy can change in a single update. By reusing each batch of data for multiple gradient steps while keeping every update close to the policy that collected the data (an approximate trust region), PPO typically achieves better sample efficiency and training stability than REINFORCE or standard Actor-Critic.
Core Idea: Clipped Surrogate Objective
Define the probability ratio:
r_t(θ) = π(a_t|s_t; θ) / π(a_t|s_t; θ_old)
PPO maximises a clipped version of the policy gradient:
L_CLIP = E_t [ min( r_t · A_t, clip(r_t, 1-ε, 1+ε) · A_t ) ]
The clipping constrains the ratio to [1-ε, 1+ε] (here ε=0.2, so [0.8, 1.2]), preventing any single update from pushing the policy too far.
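To make the clipping concrete, the following sketch evaluates the per-sample objective for a few ratio/advantage pairs. The helper name `clipped_surrogate` is hypothetical (it does not appear in the notebook), and it works on plain floats rather than tensors.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (A > 0): gains are capped once the ratio exceeds 1 + eps.
print(clipped_surrogate(1.1, 2.0))   # inside the range: 1.1 * 2.0
print(clipped_surrogate(1.5, 2.0))   # capped at 1.2 * 2.0

# Bad action (A < 0): pushing the ratio below 1 - eps earns nothing extra.
print(clipped_surrogate(0.5, -2.0))  # capped at 0.8 * -2.0
print(clipped_surrogate(1.5, -2.0))  # not capped: 1.5 * -2.0 (the penalty is kept)
```

Note the asymmetry in the last case: when the ratio moves in the direction that hurts the objective, the unclipped term is the smaller one, so the full penalty survives.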
Environment Setup
The notebook uses CartPole-v1 with the same MyWrapper seen in earlier algorithms, enforcing a 200-step episode limit.
import gym


class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
Networks
The actor uses a Softmax output to produce a categorical distribution over the two discrete actions. The critic outputs a scalar V(s).
import torch

# Actor: state (4) → action probabilities (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)

# Critic: state (4) → scalar V(s)
model_td = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

model(torch.randn(2, 4)), model_td(torch.randn(2, 4))
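As a small illustration of what the final Softmax layer does, the sketch below converts two raw scores (logits) into action probabilities using only the standard library. The `softmax` helper and the example logits are illustrative assumptions, not values from the notebook.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that are positive and sum to 1."""
    # Subtracting the max is a standard trick for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([0.3, -0.7])  # hypothetical logits for actions 0 and 1
print(probs, sum(probs))
```

Because every output is strictly positive, each action keeps a non-zero chance of being sampled, which is what lets the policy explore.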
Action Selection
import random


def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 4)
    # [1, 4] → [1, 2]
    prob = model(state)
    # Sample one action weighted by its probability
    action = random.choices(range(2), weights=prob[0].tolist(), k=1)[0]
    return action
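The sampling step can be checked in isolation. The sketch below draws many actions from a fixed, made-up probability vector and compares the empirical frequencies with the weights; the probabilities and the seed are assumptions chosen only for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the illustration is repeatable

probs = [0.9, 0.1]  # hypothetical actor output for one state
samples = random.choices(range(2), weights=probs, k=10_000)
counts = Counter(samples)

# The empirical frequencies should be close to the sampling weights.
print(counts[0] / len(samples), counts[1] / len(samples))
```

Sampling, rather than always taking the most probable action, means the less likely action is still tried occasionally during data collection.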
Data Collection
PPO collects a full episode (rollout) before each round of updates, recording next states so the critic can compute bootstrap targets.
def get_data():
    states = []
    rewards = []
    actions = []
    next_states = []
    overs = []

    state = env.reset()
    over = False
    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step(action)

        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        overs.append(over)

        state = next_state

    # Convert to tensors
    states = torch.FloatTensor(states).reshape(-1, 4)  # [b, 4]
    rewards = torch.FloatTensor(rewards).reshape(-1, 1)  # [b, 1]
    actions = torch.LongTensor(actions).reshape(-1, 1)  # [b, 1]
    next_states = torch.FloatTensor(next_states).reshape(-1, 4)  # [b, 4]
    overs = torch.LongTensor(overs).reshape(-1, 1)  # [b, 1]

    return states, rewards, actions, next_states, overs
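The reason for storing `next_states` and `overs` is the one-step bootstrap target `r + γ · V(s') · (1 - over)`. Below is a minimal sketch with plain Python lists; the helper name `td_targets` and the sample numbers are illustrative assumptions, and γ = 0.98 matches the discount used in the training code.

```python
def td_targets(rewards, next_values, overs, gamma=0.98):
    """One-step TD targets: r + gamma * V(s') for non-terminal steps, r otherwise."""
    return [
        r + gamma * v_next * (1 - int(over))
        for r, v_next, over in zip(rewards, next_values, overs)
    ]


rewards = [1.0, 1.0, 1.0]
next_values = [10.0, 5.0, 7.0]   # hypothetical critic outputs V(s')
overs = [False, False, True]     # the episode ends on the last step

targets = td_targets(rewards, next_values, overs)
print(targets)  # the last target is just the reward, because over=True masks V(s')
```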
Advantage Computation (GAE)
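PPO uses Generalised Advantage Estimation (GAE) to compute a smoothed advantage signal. Each advantage is a discounted sum of the current and future TD errors δ_t = r_t + γ·V(s_{t+1}) − V(s_t), which the code accumulates by iterating backwards through the episode: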
A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + …
def get_advantages(deltas):
    advantages = []
    s = 0.0
    # Traverse backwards through TD errors
    for delta in deltas[::-1]:
        s = 0.98 * 0.95 * s + delta  # γ=0.98, λ=0.95
        advantages.append(s)
    advantages.reverse()
    return advantages
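The backward recursion and the forward sum in the formula should give the same numbers. The sketch below checks this on three made-up TD errors; `gae` is a hypothetical parameterised variant of `get_advantages` (γ and λ passed as arguments instead of hard-coded), written here only so the example is self-contained.

```python
def gae(deltas, gamma=0.98, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1}."""
    advantages = []
    running = 0.0
    for delta in reversed(deltas):
        running = gamma * lam * running + delta
        advantages.append(running)
    advantages.reverse()
    return advantages


deltas = [1.0, 0.5, -0.2]  # hypothetical TD errors
k = 0.98 * 0.95            # the combined decay factor (gamma * lambda)

recursive = gae(deltas)
# Direct evaluation of the forward sum, for comparison.
explicit = [
    sum(k ** (j - t) * deltas[j] for j in range(t, len(deltas)))
    for t in range(len(deltas))
]
print(recursive)
print(explicit)
```

The recursion does the same work in a single pass over the episode, whereas the direct sum revisits every later step for each t.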
Training Loop
The key PPO innovation is reusing the same batch of data for multiple epochs of gradient updates (here 10 inner iterations) while keeping the policy ratio within the clipped range. Both the actor and critic are updated on each inner iteration.
def train():
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    optimizer_td = torch.optim.Adam(model_td.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(500):
        states, rewards, actions, next_states, overs = get_data()

        # ── Critic values and targets ────────────────────────────────
        # [b, 4] → [b, 1]
        values = model_td(states)
        targets = model_td(next_states).detach()
        targets = targets * 0.98
        targets *= (1 - overs)
        targets += rewards

        # TD errors → GAE advantages
        deltas = (targets - values).squeeze(dim=1).tolist()
        advantages = get_advantages(deltas)
        advantages = torch.FloatTensor(advantages).reshape(-1, 1)

        # Old policy probabilities (frozen snapshot before inner loop)
        old_probs = model(states)
        old_probs = old_probs.gather(dim=1, index=actions)
        old_probs = old_probs.detach()

        # ── Multiple PPO epochs on the same batch ────────────────────
        for _ in range(10):
            # New policy probabilities
            new_probs = model(states)
            new_probs = new_probs.gather(dim=1, index=actions)

            # Probability ratio π_new / π_old
            ratios = new_probs / old_probs

            # Clipped surrogate objective (ε = 0.2)
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 0.8, 1.2) * advantages
            loss = -torch.min(surr1, surr2).mean()

            # Recompute critic values and update critic each inner step
            values = model_td(states)
            loss_td = loss_fn(values, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            optimizer_td.zero_grad()
            loss_td.backward()
            optimizer_td.step()

        if epoch % 50 == 0:
            # test() is not defined on this page; it is assumed to be the
            # notebook's evaluation helper that plays one episode and
            # returns its total reward.
            print(epoch, sum([test(play=False) for _ in range(10)]) / 10)


train()
Why Clipping Works
L_CLIP = E [ min( ratio · A, clip(ratio, 1-ε, 1+ε) · A ) ]
- When A > 0 (action was good): ratio is clipped at 1+ε, preventing the policy from increasing that action’s probability too aggressively.
- When A < 0 (action was bad): ratio is clipped at 1-ε, preventing the policy from suppressing that action’s probability too quickly.
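The min takes the lower of the unclipped and clipped terms, so L_CLIP is a pessimistic lower bound on the unclipped objective: moving the ratio outside [1-ε, 1+ε] in the direction favoured by the advantage earns no extra credit, while moves that make the objective worse are still fully penalised.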
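One way to see this is to look at how the objective responds to a small change in the ratio. The sketch below estimates that slope by finite differences for the four combinations of advantage sign and ratio position; the function names and the test points are illustrative assumptions.

```python
def objective(ratio, advantage, eps=0.2):
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def slope(ratio, advantage, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    return (objective(ratio + h, advantage) - objective(ratio - h, advantage)) / (2 * h)


cases = [
    (1.5, +1.0),  # good action, ratio already above 1 + eps -> slope ~ 0
    (0.5, +1.0),  # good action, ratio below 1 - eps         -> slope ~ +1
    (0.5, -1.0),  # bad action, ratio already below 1 - eps  -> slope ~ 0
    (1.5, -1.0),  # bad action, ratio above 1 + eps          -> slope ~ -1
]
for ratio, adv in cases:
    print(ratio, adv, round(slope(ratio, adv), 6))
```

A slope of zero means the sample contributes no gradient, so the optimiser has no incentive to push the ratio further in that direction. The non-zero cases are the ones where the policy has moved the wrong way and can still be corrected.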
The clip range [0.8, 1.2] corresponds to ε=0.2. Increasing ε gives bigger updates per step but increases instability. Typical values range from 0.1 to 0.3.
PPO reuses each data batch for multiple gradient steps (the inner range(10) loop in train). This improves sample efficiency compared to REINFORCE, which discards each episode immediately after one gradient step.