Actor-Critic: Joint Policy and Value Network Training

REINFORCE works, but its Monte Carlo return estimates carry extremely high variance. Every episode produces a different trajectory, and the gradient signal can fluctuate wildly from one update to the next. The Actor-Critic architecture addresses this by introducing a second network — the critic — that estimates the state-value function V(s). Instead of weighting log-probabilities by the full discounted return G_t, Actor-Critic uses the temporal difference (TD) error as an advantage signal:

δ_t = r_t + γ V(s_{t+1}) - V(s_t)

This single-step estimate replaces the noisy Monte Carlo sum, substantially reducing variance. The price is some bias: the target bootstraps from the critic's own estimate V(s_{t+1}), so it is only as accurate as the critic. The policy network (actor) learns which actions to take; the value network (critic) learns how good each state is.
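To make the contrast concrete, the sketch below computes both signals for a made-up three-step trajectory. The rewards, the critic estimates in `values`, the discount 0.9, and the helper names are all illustrative assumptions, not numbers from a trained model.

```python
# Hypothetical 3-step episode; every number here is made up for illustration.
gamma = 0.9
rewards = [1.0, 1.0, 1.0]
# Assumed critic estimates V(s_0) .. V(s_3); V(s_3) = 0 because s_3 is terminal.
values = [2.5, 1.8, 1.0, 0.0]


def monte_carlo_returns(rewards, gamma):
    """Discounted return G_t for every step, accumulated backwards in time."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def td_errors(rewards, values, gamma):
    """One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


returns = monte_carlo_returns(rewards, gamma)  # approximately [2.71, 1.9, 1.0]
deltas = td_errors(rewards, values, gamma)     # approximately [0.12, 0.1, 0.0]
```

Each G_t depends on every later reward in the episode, while each δ_t depends on one reward and two critic estimates, which is where the variance reduction comes from.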

How Actor-Critic Works

Collect an episode

Run the actor policy π(a|s) to generate a trajectory: (s_0, a_0, r_0, s_1, …).

Evaluate the critic

For each transition compute the TD target r + γ V(s') and the current value V(s). The difference δ = target - V(s) is the advantage estimate.

Update the critic

Minimise the MSE loss between V(s) and the TD target r + γ V(s').

Update the actor

Maximise log π(a|s) · δ — equivalently minimise -log π(a|s) · δ. The advantage δ is detached from the graph so gradients do not flow into the critic through this term.
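The last two steps can be sketched on a single made-up transition. The tiny `actor` and `critic` networks, the random states, and the variable names are placeholders chosen for this example. The point is only to show that, with δ detached, the actor's backward pass leaves the critic's gradients untouched.

```python
import torch

torch.manual_seed(0)

# Tiny stand-in networks; the sizes are arbitrary for this sketch.
actor = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Softmax(dim=1))
critic = torch.nn.Linear(4, 1)

# One made-up transition (s, a, r, s').
state = torch.randn(1, 4)
next_state = torch.randn(1, 4)
action = torch.tensor([[1]])
reward = torch.tensor([[1.0]])
gamma = 0.98

value = critic(state)                                     # V(s), keeps its graph
target = (reward + gamma * critic(next_state)).detach()   # TD target, a constant
delta = (target - value).detach()                         # advantage, no graph

# Actor loss: only the actor's parameters receive gradients.
log_prob = actor(state).gather(dim=1, index=action).log()
actor_loss = (-log_prob * delta).mean()
actor_loss.backward()
critic_grad_after_actor = critic.weight.grad              # still None here

# Critic loss: MSE between V(s) and the detached TD target.
critic_loss = torch.nn.functional.mse_loss(value, target)
critic_loss.backward()                                    # now the critic has gradients
```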

Environment

import gym

class MyWrapper(gym.Wrapper):

    def __init__(self):
        env = gym.make('CartPole-v1', render_mode='rgb_array')
        super().__init__(env)
        self.env = env
        self.step_n = 0

    def reset(self):
        state, _ = self.env.reset()
        self.step_n = 0
        return state

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        done = terminated or truncated
        self.step_n += 1
        if self.step_n >= 200:
            done = True
        return state, reward, done, info


env = MyWrapper()
env.reset()

Networks

The actor and critic are separate networks. The actor outputs a probability distribution over actions (Softmax); the critic outputs a scalar value V(s).

import torch

# Actor: state (4) → action probabilities (2)
model = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)

# Critic: state (4) → scalar V(s)
model_td = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

model(torch.randn(2, 4)), model_td(torch.randn(2, 4))

Action Selection

The actor samples actions from its output distribution, exactly as in REINFORCE.

import random

def get_action(state):
    state = torch.FloatTensor(state).reshape(1, 4)

    # [1, 4] → [1, 2]
    prob = model(state)

    # Sample one action weighted by its probability
    action = random.choices(range(2), weights=prob[0].tolist(), k=1)[0]

    return action
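The same sampling can also be written with `torch.distributions.Categorical`. This is an alternative sketch rather than the code used in this section. The `policy` network and the `sample_action` name are placeholders that mirror the actor's shapes.

```python
import torch
from torch.distributions import Categorical

# Stand-in policy with the same input/output shapes as the actor above.
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)


def sample_action(state):
    """Sample an action index from the policy's output distribution."""
    state = torch.as_tensor(state, dtype=torch.float32).reshape(1, 4)
    with torch.no_grad():               # sampling itself needs no gradients
        prob = policy(state)            # [1, 4] -> [1, 2]
    dist = Categorical(probs=prob)      # a batch of one categorical distribution
    return dist.sample().item()         # plain Python int, 0 or 1


action = sample_action([0.0, 0.1, -0.1, 0.0])
```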

Data Collection

This implementation collects a full episode before performing an update, as REINFORCE does, and additionally records each next state and done flag so the critic can compute bootstrap targets.

def get_data():
    states      = []
    rewards     = []
    actions     = []
    next_states = []
    overs       = []

    state = env.reset()
    over  = False

    while not over:
        action = get_action(state)
        next_state, reward, over, _ = env.step(action)

        states.append(state)
        rewards.append(reward)
        actions.append(action)
        next_states.append(next_state)
        overs.append(over)

        state = next_state

    # Convert lists to tensors
    states      = torch.FloatTensor(states).reshape(-1, 4)       # [b, 4]
    rewards     = torch.FloatTensor(rewards).reshape(-1, 1)      # [b, 1]
    actions     = torch.LongTensor(actions).reshape(-1, 1)       # [b, 1]
    next_states = torch.FloatTensor(next_states).reshape(-1, 4)  # [b, 4]
    overs       = torch.LongTensor(overs).reshape(-1, 1)         # [b, 1]

    return states, rewards, actions, next_states, overs

Training Loop

Both optimizers step once per episode. The critic values and TD targets are computed once: their detached difference delta weights the actor loss, and the same values and targets feed the critic's MSE loss.

def train():
    optimizer    = torch.optim.Adam(model.parameters(),    lr=1e-3)
    optimizer_td = torch.optim.Adam(model_td.parameters(), lr=1e-2)
    loss_fn      = torch.nn.MSELoss()

    for i in range(1000):
        states, rewards, actions, next_states, overs = get_data()

        # ── Critic values and targets ─────────────────────────────────
        # [b, 4] → [b, 1]
        values = model_td(states)

        # Bootstrap targets: γ · V(s') · (1 - done) + r
        targets = model_td(next_states) * 0.98
        targets *= (1 - overs)
        targets += rewards

        # TD error (advantage) – detach so actor gradients don't flow into critic
        delta = (targets - values).detach()

        # ── Actor ─────────────────────────────────────────────────────
        # Probability of the chosen actions: [b, 4] → [b, 2] → [b, 1]
        probs = model(states)
        probs = probs.gather(dim=1, index=actions)

        # Actor loss: -log π(a|s) * δ
        loss = (-probs.log() * delta).mean()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # ── Critic update ─────────────────────────────────────────────
        loss_td = loss_fn(values, targets.detach())

        optimizer_td.zero_grad()
        loss_td.backward()
        optimizer_td.step()

        if i % 100 == 0:
            # Log the return of the episode just collected
            print(i, rewards.sum().item())

train()

The targets tensor is detached before computing loss_td to prevent gradients from flowing through the bootstrap target — this is standard practice in TD learning.
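Another way to keep gradients out of the bootstrap target is to build it inside `torch.no_grad()`, so there is nothing to detach afterwards. The sketch below assumes a hypothetical helper named `td_targets`, a stand-in linear critic, and a made-up batch of two transitions.

```python
import torch


def td_targets(critic, rewards, next_states, overs, gamma=0.98):
    """Return r + gamma * V(s') * (1 - done) with no autograd graph attached."""
    with torch.no_grad():
        next_values = critic(next_states)                  # [b, 1]
        return rewards + gamma * next_values * (1 - overs)


# Stand-in critic and a made-up batch of two transitions.
critic = torch.nn.Linear(4, 1)
rewards = torch.tensor([[1.0], [1.0]])
next_states = torch.randn(2, 4)
overs = torch.tensor([[0], [1]])                           # second transition is terminal

targets = td_targets(critic, rewards, next_states, overs)
```

For the terminal transition the bootstrap term is zeroed out, so its target is just the reward.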

Advantage Estimate

The core of Actor-Critic is the TD error used as an advantage signal:

δ_t = r_t + γ · V(s_{t+1}) · (1 - done_t) - V(s_t)

In code:

targets = model_td(next_states) * 0.98   # γ · V(s')
targets *= (1 - overs)                   # zero out terminal states
targets += rewards                        # + r

delta = (targets - values).detach()       # δ = target - V(s)

A positive δ means the outcome was better than expected → the actor should make the chosen action more likely. A negative δ means it was worse → less likely.
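The effect of the sign can be checked by hand on a two-action softmax policy. The sketch below is standalone and uses plain Python instead of the networks above. The learning rate, the starting logits, and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, delta, lr=0.5):
    """One gradient-ascent step on log pi(action) * delta with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    return [
        z + lr * delta * ((1.0 if k == action else 0.0) - p)
        for k, (z, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0]  # both actions start at probability 0.5

# Probability of action 1 after one step with a positive and a negative delta.
p_up = softmax(policy_gradient_step(logits, action=1, delta=+1.0))[1]
p_down = softmax(policy_gradient_step(logits, action=1, delta=-1.0))[1]
```

Starting from equal probabilities, `p_up` ends above 0.5 and `p_down` ends below it.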

Actor vs Critic Roles

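Actor (Policy)

Parameterises π(a|s; θ)
Outputs action probabilities via Softmax
Gradient: ∇_θ log π(a|s) · δ
Optimiser: Adam with lr=1e-3

Critic (Value)

Estimates the state value V(s)
Outputs a single scalar per state
Loss: MSE between V(s) and the TD target r + γ V(s')
Optimiser: Adam with lr=1e-2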

The critic uses a higher learning rate than the actor. This is a common heuristic: the critic should converge faster so that it provides useful signals to the actor early in training.
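If two optimizer objects are inconvenient, the same pair of learning rates can be expressed as parameter groups of a single Adam optimizer. This is an alternative sketch, not the setup used in the training loop above. The `actor` and `critic` names are placeholders for networks with the same shapes.

```python
import torch

# Stand-in networks with the same shapes as the actor and critic above.
actor = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
    torch.nn.Softmax(dim=1),
)
critic = torch.nn.Sequential(
    torch.nn.Linear(4, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
)

# One optimizer, two parameter groups, each with its own learning rate.
optimizer = torch.optim.Adam([
    {"params": actor.parameters(), "lr": 1e-3},   # slower actor
    {"params": critic.parameters(), "lr": 1e-2},  # faster critic
])

lrs = [group["lr"] for group in optimizer.param_groups]
```

With this layout, both losses are backpropagated before a single `optimizer.step()` call.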
