REINFORCE works, but its Monte Carlo return estimates carry extremely high variance. Every episode produces a different trajectory, and the gradient signal can fluctuate wildly from one update to the next. The Actor-Critic architecture addresses this by introducing a second network — the critic — that estimates the state-value function V(s). Instead of weighting log-probabilities by the full discounted return G_t, Actor-Critic uses the temporal difference (TD) error as an advantage signal:
δ_t = r_t + γ V(s_{t+1}) - V(s_t)
This single-step estimate replaces the noisy Monte Carlo sum, substantially reducing variance. The trade-off is bias: because the target bootstraps from the critic's own estimate V(s_{t+1}), the advantage is only as accurate as the critic. The policy network (actor) learns which actions to take; the value network (critic) learns how good each state is.
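To make the contrast concrete, the sketch below computes both signals for a three-step trajectory in plain Python. The reward sequence and the critic values are made-up numbers chosen for illustration, and γ = 0.98 matches the discount used in the training code on this page.

```python
GAMMA = 0.98  # discount factor, same value as in the training code

# Hypothetical trajectory: three rewards, then the episode ends.
rewards = [1.0, 1.0, 1.0]
# Hypothetical critic estimates V(s_0)..V(s_3); the terminal state has value 0.
values = [2.9, 2.0, 1.0, 0.0]


def discounted_returns(rewards, gamma):
    """Monte Carlo returns G_t, accumulated backwards over the whole episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def td_errors(rewards, values, gamma):
    """One-step TD errors: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


returns = discounted_returns(rewards, GAMMA)
deltas = td_errors(rewards, values, GAMMA)
print("G_t    :", returns)
print("delta_t:", deltas)
```

Each G_t depends on every later reward in the episode, while each δ_t depends on a single reward plus two critic outputs, which is where the variance reduction comes from.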
How Actor-Critic Works
Collect an episode
Run the actor policy π(a|s) to generate a trajectory: (s_0, a_0, r_0, s_1, …).
Evaluate the critic
For each transition compute the TD target r + γ V(s') and the current value V(s). The difference δ = target - V(s) is the advantage estimate.
Update the critic
Minimise the MSE loss between V(s) and the TD target r + γ V(s').
Update the actor
Maximise log π(a|s) · δ — equivalently minimise -log π(a|s) · δ. The advantage δ is detached from the graph so gradients do not flow into the critic through this term.
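The critic and actor steps above can be traced by hand on a tiny batch. This is only a sketch in plain Python: the two transitions, their critic values, and the action probabilities are hypothetical numbers, and γ = 0.98 is taken from the training code on this page.

```python
import math

GAMMA = 0.98

# Hypothetical batch of two transitions (all numbers are illustrative):
# v = V(s), v_next = V(s'), r = reward, p = pi(a|s) for the action taken.
batch = [
    {"v": 1.0, "v_next": 1.5, "r": 1.0, "p": 0.6},
    {"v": 2.0, "v_next": 0.5, "r": 1.0, "p": 0.3},
]

# Evaluate the critic: TD target and advantage for each transition.
targets = [t["r"] + GAMMA * t["v_next"] for t in batch]
deltas = [target - t["v"] for target, t in zip(targets, batch)]

# Critic loss: mean squared error between V(s) and the TD target.
critic_loss = sum(d ** 2 for d in deltas) / len(deltas)

# Actor loss: mean of -log pi(a|s) * delta, with delta treated as a constant.
actor_loss = sum(-math.log(t["p"]) * d for t, d in zip(batch, deltas)) / len(batch)

print(targets, deltas, critic_loss, actor_loss)
```

In the first transition δ is positive, so minimising the actor loss raises the probability of that action; in the second δ is negative, so the probability is pushed down.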
Environment
import gym


class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()
Networks
The actor and critic are separate networks. The actor outputs a probability distribution over actions (Softmax); the critic outputs a scalar value V(s).
import torch

# Actor: state (4) → action probabilities (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)

# Critic: state (4) → scalar V(s)
model_td = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

model(torch.randn(2, 4)), model_td(torch.randn(2, 4))
Action Selection
The actor samples actions from its output distribution, exactly as in REINFORCE.
import random


def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 4)
    # [1, 4] → [1, 2]
    prob = model(state)
    # Sample one action weighted by its probability
    action = random.choices(range(2), weights=prob[0].tolist(), k=1)[0]
    return action
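The sampling step can be checked in isolation. The sketch below uses a fixed, hypothetical probability vector in place of the actor's output and counts how often `random.choices` picks each action; the seed is set only so the illustration is repeatable.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed only so the illustration is repeatable

probs = [0.8, 0.2]  # hypothetical actor output for a single state

# Draw many actions the same way get_action draws one.
samples = random.choices(range(2), weights=probs, k=10_000)
counts = Counter(samples)

freq_action_0 = counts[0] / len(samples)
print(f"action 0 chosen {freq_action_0:.1%} of the time")
```

Because actions are sampled rather than taken greedily, the less likely action is still tried regularly, which is what gives the policy its exploration.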
Data Collection
This implementation collects a full episode before performing an update, as REINFORCE does, but it also records each next state and done flag so the critic can compute bootstrap targets.
def get_data():
    states = []
    rewards = []
    actions = []
    next_states = []
    overs = []

    state = env.reset()
    over = False
    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step(action)

        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        overs.append(over)

        state = next_state

    # Convert lists to tensors
    states = torch.FloatTensor(states).reshape(-1, 4)  # [b, 4]
    rewards = torch.FloatTensor(rewards).reshape(-1, 1)  # [b, 1]
    actions = torch.LongTensor(actions).reshape(-1, 1)  # [b, 1]
    next_states = torch.FloatTensor(next_states).reshape(-1, 4)  # [b, 4]
    overs = torch.LongTensor(overs).reshape(-1, 1)  # [b, 1]

    return states, rewards, actions, next_states, overs
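The training loop in the next section calls a `test(play=False)` helper that is not defined on this page. What follows is a minimal sketch of what such a helper might do, assuming it plays one episode and returns the total reward. The names `run_episode` and `CountdownEnv` are hypothetical: `CountdownEnv` is a stand-in with the same `reset`/`step` interface as `MyWrapper`, included only so the example runs without gym.

```python
class CountdownEnv:
    """Hypothetical stand-in with the same reset/step interface as MyWrapper."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.step_n = 0

    def reset(self):
        self.step_n = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.step_n += 1
        done = self.step_n >= self.horizon
        # Reward of 1 per step, as in CartPole.
        return [float(self.step_n)] * 4, 1.0, done, {}


def run_episode(env, policy):
    """Play one episode with `policy` and return the undiscounted reward sum."""
    state = env.reset()
    total, over = 0.0, False
    while not over:
        action = policy(state)
        state, reward, over, _ = env.step(action)
        total += reward
    return total


episode_reward = run_episode(CountdownEnv(horizon=5), policy=lambda state: 0)
print(episode_reward)
```

With the `env` and `get_action` defined on this page, `test` could then be written as `def test(play=False): return run_episode(env, get_action)`. The `play` flag is assumed to toggle rendering and is ignored in this sketch.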
Training Loop
Both optimizers run on every episode. The TD error delta is computed once and shared between the critic loss (MSE) and the actor gradient signal.
def train():
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    optimizer_td = torch.optim.Adam(model_td.parameters(), lr=1e-2)
    loss_fn = torch.nn.MSELoss()

    for i in range(1000):
        states, rewards, actions, next_states, overs = get_data()

        # ── Critic values and targets ─────────────────────────────────
        # [b, 4] → [b, 1]
        values = model_td(states)

        # Bootstrap targets: γ · V(s') · (1 - done) + r
        targets = model_td(next_states) * 0.98
        targets *= (1 - overs)
        targets += rewards

        # TD error (advantage) – detach so actor gradients don't flow into critic
        delta = (targets - values).detach()

        # ── Actor ─────────────────────────────────────────────────────
        # Probability of the chosen actions: [b, 4] → [b, 2] → [b, 1]
        probs = model(states)
        probs = probs.gather(dim=1, index=actions)

        # Actor loss: -log π(a|s) * δ
        loss = (-probs.log() * delta).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # ── Critic update ─────────────────────────────────────────────
        loss_td = loss_fn(values, targets.detach())

        optimizer_td.zero_grad()
        loss_td.backward()
        optimizer_td.step()

        if i % 100 == 0:
            # test(play=False) is an evaluation helper not defined on this page;
            # it is assumed to play one episode and return its total reward.
            print(i, sum([test(play=False) for _ in range(10)]) / 10)


train()
The targets tensor is detached before computing loss_td to prevent gradients from flowing through the bootstrap target — this is standard practice in TD learning.
Advantage Estimate
The core of Actor-Critic is the TD error used as an advantage signal:
δ_t = r_t + γ · V(s_{t+1}) · (1 - done_t) - V(s_t)
In code:
targets = model_td(next_states) * 0.98 # γ · V(s')
targets *= (1 - overs) # zero out terminal states
targets += rewards # + r
delta = (targets - values).detach() # δ = target - V(s)
A positive δ means the outcome was better than expected → the actor should make the chosen action more likely. A negative δ means it was worse → less likely.
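The effect of the `(1 - done)` mask is easiest to see by evaluating the same transition with and without it. This is a sketch in plain Python with hypothetical critic values; `td_error` is an illustrative helper, not a function from the code above.

```python
GAMMA = 0.98


def td_error(reward, value, next_value, done):
    """delta = r + gamma * V(s') * (1 - done) - V(s)."""
    target = reward + GAMMA * next_value * (1 - done)
    return target - value


# Hypothetical critic outputs: V(s) = V(s') = 5.0, reward = 1.0.
mid_episode = td_error(1.0, 5.0, 5.0, done=0)  # bootstrap term kept
final_step = td_error(1.0, 5.0, 5.0, done=1)   # bootstrap term zeroed

print(mid_episode, final_step)
```

On the final step only the immediate reward counts, so an optimistic V(s) produces a strongly negative δ. In the wrapper on this page `done` is also set when the 200-step cap is reached, so the bootstrap term is dropped on those transitions too.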
Actor vs Critic Roles
Actor (Policy)
- Parameterises π(a|s; θ)
- Outputs action probabilities via Softmax
- Gradient: ∇_θ log π(a|s) · δ
- Optimiser: Adam with lr=1e-3
Critic (Value)
- Parameterises V(s; φ)
- Outputs a scalar value
- Loss: MSE(V(s), r + γ V(s'))
- Optimiser: Adam with lr=1e-2 (higher rate for faster convergence)
The critic uses a higher learning rate than the actor. This is a common heuristic: the critic should converge faster so that it provides useful signals to the actor early in training.