Policy Gradient Objectives: PPO, A2C, and REINFORCE in TorchRL

Policy gradient methods optimize a stochastic policy by estimating the gradient of the expected return with respect to policy parameters. TorchRL provides four production-ready policy gradient loss modules — ClipPPOLoss, KLPENPPOLoss, A2CLoss, and ReinforceLoss — all built on the shared LossModule base class. Every loss reads its inputs from a TensorDict, writes named loss_* scalars back into a new TensorDict, and delegates advantage computation to an interchangeable value estimator such as GAE.

LossModule Base Class

All TorchRL objectives inherit from LossModule, which itself inherits from TensorDictModuleBase. The base class handles:

Functionalization — wrapping actor/critic parameters into TensorDictParams so meta-RL and gradient checkpointing work out of the box.
Key configuration — a set_keys() method so every input/output tensordict key can be renamed without subclassing.
Value estimator injection — a make_value_estimator() method that swaps in any ValueEstimatorBase (GAE, TDLambda, VTrace, …) at runtime.
Exploration suppression — the forward() method is automatically decorated to run under ExplorationType.DETERMINISTIC so exploration noise is suppressed during loss computation.

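from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.common import LossModule
from torchrl.objectives.utils import ValueEstimators

# `actor` and `critic` are assumed to be built already
loss = ClipPPOLoss(actor, critic)
assert isinstance(loss, LossModule)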

# Rename any tensordict key the loss reads or writes
loss.set_keys(advantage="gae_advantage", value="my_state_value")

# Swap the built-in GAE for TD(λ)
loss.make_value_estimator(ValueEstimators.TDLambda, gamma=0.99, lmbda=0.95)

Call make_value_estimator() before the first forward() pass. Calling it later still replaces the estimator, but it does not recompute any cached advantages that may already be stored in your replay buffer.

ClipPPOLoss

ClipPPOLoss is the standard PPO variant. The objective clips the importance-weighted advantage to keep the policy update within a trust region:

loss = -min(r * A,  clip(r, 1 − ε, 1 + ε) * A)

where r = π_new(a|s) / π_old(a|s) is the probability ratio and ε is clip_epsilon.
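To make the clipping behaviour concrete, here is a minimal scalar sketch of the objective above in plain Python. It is not TorchRL's implementation: the function name clipped_surrogate_loss and its arguments are illustrative assumptions, and the real loss operates on batched tensors.

```python
import math


def clipped_surrogate_loss(log_prob_new, log_prob_old, advantage, clip_epsilon=0.2):
    """Per-sample loss: -min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(log_prob_new - log_prob_old)  # r = pi_new / pi_old
    clipped_ratio = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    return -min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
loss_capped = clipped_surrogate_loss(math.log(1.5), 0.0, advantage=2.0)

# Ratio 1.5 with a negative advantage: the unclipped, more pessimistic term is kept.
loss_uncapped = clipped_surrogate_loss(math.log(1.5), 0.0, advantage=-2.0)
```

Taking the minimum means clipping only ever removes the incentive to move the ratio further; it never hides a penalty.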

from torchrl.objectives import ClipPPOLoss

loss_module = ClipPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    loss_critic_type="smooth_l1",
)

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)

The stochastic policy operator. Must be a ProbabilisticTensorDictSequential (or compatible subclass) that writes the sampled action and its log probability into the output TensorDict. The log-probability key defaults to "sample_log_prob" (or "action_log_prob" when composite log-prob aggregation is disabled).

critic_network (TensorDictModule, required)

The value operator. Typically a ValueOperator that reads observations and writes a scalar "state_value" into the output TensorDict.

clip_epsilon (float | tuple[float, float], default: 0.2)

Clipping threshold for the importance weight ratio.

float x: symmetric clipping [1 − x, 1 + x].
tuple (eps_low, eps_high): asymmetric clipping as in DAPO Clip-Higher (e.g. (0.20, 0.28)). Exposes clip_epsilon_low / clip_epsilon_high as schedulable buffers instead of clip_epsilon.
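The difference between the scalar and tuple forms can be sketched with a small helper. This is an illustration only; clip_ratio is a hypothetical function, not part of TorchRL.

```python
def clip_ratio(ratio, clip_epsilon):
    """Clip a probability ratio with a scalar or an (eps_low, eps_high) tuple."""
    if isinstance(clip_epsilon, tuple):
        eps_low, eps_high = clip_epsilon
    else:
        eps_low = eps_high = clip_epsilon
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


symmetric = clip_ratio(1.5, 0.2)             # upper bound is 1 + 0.2
asymmetric = clip_ratio(1.5, (0.20, 0.28))   # upper bound is 1 + 0.28
lower = clip_ratio(0.5, (0.20, 0.28))        # lower bound is 1 - 0.20
```

With the asymmetric setting, ratios above 1 get more headroom than ratios below 1, which is the point of the Clip-Higher variant mentioned above.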

entropy_bonus (bool, default: True)

If True, adds an entropy bonus to the total loss to encourage exploration.

entropy_coeff (float | dict, default: 0.01)

Entropy multiplier.

Scalar: a single coefficient applied to the summed entropy of all action heads.
Mapping {head_name: coeff}: per-head coefficients for composite action spaces.
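The two forms of entropy_coeff can be sketched as follows. The head names "move" and "jump" and the helper entropy_loss are hypothetical, and the sign convention (the entropy term enters the loss negatively so that minimising it raises entropy) is an assumption about how the bonus is applied.

```python
def entropy_loss(entropies, entropy_coeff):
    """Return the entropy term of the loss for scalar or per-head coefficients."""
    if isinstance(entropy_coeff, dict):
        # One coefficient per action head.
        weighted = sum(entropy_coeff[head] * h for head, h in entropies.items())
    else:
        # A single coefficient applied to the summed entropy of all heads.
        weighted = entropy_coeff * sum(entropies.values())
    return -weighted  # minimising this term maximises entropy


head_entropies = {"move": 1.0, "jump": 0.5}  # hypothetical action heads
scalar_case = entropy_loss(head_entropies, 0.01)
per_head_case = entropy_loss(head_entropies, {"move": 0.01, "jump": 0.1})
```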

critic_coeff (float | None, default: 1.0)

Multiplier applied to the critic loss before summing. Pass None to exclude the critic loss from the returned output keys entirely.

loss_critic_type (str, default: "smooth_l1")

Loss function used for the value discrepancy. One of "l1", "l2", or "smooth_l1".
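The three options differ in how strongly they penalise large residuals. The scalar sketch below uses the standard definitions of each distance, with the smooth L1 transition point fixed at 1.0 as an assumption; value_distance is an illustrative name.

```python
def value_distance(prediction, target, loss_critic_type="smooth_l1", beta=1.0):
    """Scalar version of the three value-discrepancy functions."""
    diff = abs(prediction - target)
    if loss_critic_type == "l1":
        return diff
    if loss_critic_type == "l2":
        return diff ** 2
    if loss_critic_type == "smooth_l1":
        # Quadratic near zero, linear for large residuals.
        return 0.5 * diff ** 2 / beta if diff < beta else diff - 0.5 * beta
    raise ValueError(f"unknown loss_critic_type: {loss_critic_type!r}")


large_residual = {kind: value_distance(3.0, 0.0, kind) for kind in ("l1", "l2", "smooth_l1")}
small_residual = value_distance(0.5, 0.0, "smooth_l1")
```

For a residual of 3.0 the L2 penalty is 9.0 while smooth L1 gives 2.5, which is why smooth L1 is less sensitive to occasional large value errors.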

normalize_advantage (bool, default: False)

If True, normalises advantages to zero mean and unit variance before use. Set to True (MAPPO default) when using multi-agent variants.
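Advantage normalisation amounts to standardising the batch of advantages. The sketch below is a plain-Python illustration; the epsilon term and the use of the population standard deviation are assumptions, and the library may differ in both.

```python
import statistics


def normalize_advantages(advantages, eps=1e-6):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = statistics.mean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


normalized = normalize_advantages([1.0, 2.0, 3.0, 4.0])
```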

clip_value (float | bool, default: False)

If a float, clips value predictions with respect to the stored value estimate to limit extreme updates. If True, reuses clip_epsilon as the threshold (only valid with scalar clip_epsilon). False disables clipping.
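A common way to implement value clipping in PPO is sketched below. It follows the widely used formulation (clip the change relative to the stored value, then take the larger of the two squared errors) and is an assumption about the general technique rather than a copy of TorchRL's code; clipped_value_loss is a hypothetical name.

```python
def clipped_value_loss(value, old_value, target, clip_value):
    """Pessimistic squared error between clipped and unclipped predictions."""
    # Limit how far the new prediction may move from the stored estimate.
    delta = min(max(value - old_value, -clip_value), clip_value)
    clipped = old_value + delta
    return max((value - target) ** 2, (clipped - target) ** 2)


# The prediction jumped from 1.0 to 2.0; only a move of 0.2 is credited.
pessimistic = clipped_value_loss(value=2.0, old_value=1.0, target=2.0, clip_value=0.2)

# A small move stays inside the band, so clipping has no effect.
unaffected = clipped_value_loss(value=1.1, old_value=1.0, target=2.0, clip_value=0.2)
```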

separate_losses (bool, default: False)

If True, shared parameters between actor and critic are trained only on the policy loss. Gradients from the critic loss are not propagated to shared parameters.

reduction (str, default: "mean")

Reduction applied to scalar loss outputs. One of "none", "mean", or "sum".

Output Keys

Key	Description
`loss_objective`	Clipped surrogate policy loss (negated, to be minimised)
`loss_critic`	Value function loss weighted by `critic_coeff`
`loss_entropy`	Entropy bonus weighted by `entropy_coeff` (when `entropy_bonus=True`)
`entropy`	Raw policy entropy (for logging)
`kl_approx`	Approximate KL divergence between old and new policy (for monitoring)
`clip_fraction`	Fraction of samples where the ratio was clipped (for monitoring)
`explained_variance`	R² of critic predictions vs. value targets (when `log_explained_variance=True`)
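The monitoring quantities in the table can be approximated from quantities you already have. The sketch below uses one simple KL estimator (the mean of log π_old − log π_new) and the usual definitions of clip fraction and explained variance; these are assumptions about typical practice, and the exact estimators used by the library may differ.

```python
import math


def ppo_diagnostics(log_probs_new, log_probs_old, clip_epsilon=0.2):
    """Return (approximate KL, fraction of samples whose ratio was clipped)."""
    log_ratios = [n - o for n, o in zip(log_probs_new, log_probs_old)]
    kl_approx = -sum(log_ratios) / len(log_ratios)
    clipped = [abs(math.exp(lr) - 1.0) > clip_epsilon for lr in log_ratios]
    return kl_approx, sum(clipped) / len(clipped)


def explained_variance(predictions, targets):
    """1 - Var(target - prediction) / Var(target); assumes non-constant targets."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    residuals = [t - p for t, p in zip(targets, predictions)]
    return 1.0 - variance(residuals) / variance(targets)


new_log_probs = [math.log(r) for r in (0.5, 0.9, 1.1, 0.7)]
kl_approx, clip_fraction = ppo_diagnostics(new_log_probs, [0.0] * 4)
perfect_critic = explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
constant_critic = explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

A rising clip fraction or KL estimate suggests the policy is moving too far per update; an explained variance near zero means the critic is doing no better than predicting a constant.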

Input Keys (via `set_keys`)

Key	Default	Description
`advantage`	`"advantage"`	Pre-computed advantage estimates (written by GAE)
`value_target`	`"value_target"`	Value function training targets
`value`	`"state_value"`	Critic predictions
`sample_log_prob`	`"sample_log_prob"`	Log-probability of the collected action
`action`	`"action"`	Collected actions

Complete PPO Training Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data.tensor_specs import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Build actor
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
module = SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"])
actor = ProbabilisticActor(
    module=module,
    distribution_class=TanhNormal,
    in_keys=["loc", "scale"],
    spec=spec,
    return_log_prob=True,
)

# Build critic
critic = ValueOperator(
    module=nn.Linear(n_obs, 1),
    in_keys=["observation"],
)

# Create loss module
loss_module = ClipPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
)

# Attach a GAE advantage estimator
advantage_module = GAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    differentiable=False,
)

# Training loop sketch
optimizer_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
optimizer_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

# `data` is assumed to be a rollout TensorDict with shape [batch_size, time_steps],
# for example one batch produced by a collector.
# Compute advantages in place (writes "advantage" and "value_target")
advantage_module(data)

# Compute all loss components in one forward pass
loss_td = loss_module(data)

loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]
loss.backward()

optimizer_actor.step()
optimizer_critic.step()

# Clear gradients so they do not accumulate into the next step
optimizer_actor.zero_grad()
optimizer_critic.zero_grad()

Compute advantages after rolling out the environment and before the inner PPO update loop. Re-computing advantages inside the update loop using stale parameters defeats the purpose of the importance-weight correction.
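The ordering described here can be sketched as a loop skeleton. Everything in it is hypothetical scaffolding: compute_advantages and gradient_step stand in for the advantage module and for one loss/backward/optimizer step, and the batch is a plain list rather than a TensorDict.

```python
import random


def ppo_update(batch, compute_advantages, gradient_step,
               num_epochs=4, minibatch_size=2, seed=0):
    """Compute advantages once per rollout, then run several minibatch epochs."""
    rng = random.Random(seed)
    batch = compute_advantages(batch)  # once, outside the inner loop
    steps = 0
    for _ in range(num_epochs):
        indices = list(range(len(batch)))
        rng.shuffle(indices)
        for start in range(0, len(indices), minibatch_size):
            minibatch = [batch[i] for i in indices[start:start + minibatch_size]]
            gradient_step(minibatch)
            steps += 1
    return steps


advantage_calls = []
minibatch_sizes = []


def fake_compute_advantages(batch):
    advantage_calls.append(len(batch))
    return batch


def fake_gradient_step(minibatch):
    minibatch_sizes.append(len(minibatch))


total_steps = ppo_update(list(range(6)), fake_compute_advantages, fake_gradient_step)
```

In this sketch a rollout of 6 samples yields 12 gradient steps that all reuse the advantages computed in the single call before the loop.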

KLPENPPOLoss

KLPENPPOLoss is the KL-penalty variant of PPO. Instead of hard clipping, it adds a soft penalty proportional to the KL divergence between the old and new policy:

loss = -r * A + β * KL(π_old ‖ π_new)

The β multiplier is adapted automatically: it is increased when KL > dtarg and decreased when KL < dtarg, keeping policy updates close to the target divergence. The adjustment is applied in place during forward().
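The penalty and the adaptation rule can be sketched in a few lines of plain Python. The comparison against dtarg follows the description above literally; an actual implementation may apply a tolerance band around dtarg, so treat the thresholds as an assumption. The function names are illustrative.

```python
def kl_penalty_loss(ratio, advantage, beta, kl):
    """Per-sample loss: -r * A + beta * KL."""
    return -ratio * advantage + beta * kl


def adapt_beta(beta, kl, dtarg=0.01, increment=2.0, decrement=0.5):
    """Raise beta when the policy moved too far, lower it when it moved too little."""
    if kl > dtarg:
        return beta * increment
    if kl < dtarg:
        return beta * decrement
    return beta


loss_value = kl_penalty_loss(ratio=1.1, advantage=2.0, beta=1.0, kl=0.02)
beta_after_large_kl = adapt_beta(1.0, kl=0.02)
beta_after_small_kl = adapt_beta(1.0, kl=0.005)
```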

from torchrl.objectives import KLPENPPOLoss

loss_module = KLPENPPOLoss(
    actor_network=actor,
    critic_network=critic,
    dtarg=0.01,
    beta=1.0,
    increment=2.0,
    decrement=0.5,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
)

Additional Parameters vs. ClipPPOLoss

dtarg (float, default: 0.01)

Target KL divergence. The beta multiplier is adapted to keep KL(π_old ‖ π_new) near this value.

beta (float, default: 1.0)

Initial KL penalty coefficient. Registered as a schedulable buffer so it can be set directly: loss.beta = 0.5.

increment (float, default: 2.0)

Factor by which beta is multiplied when the observed KL exceeds dtarg. Must be >= 1.0.

decrement (float, default: 0.5)

Factor by which beta is multiplied when the observed KL is below dtarg. Must be <= 1.0.

samples_mc_kl (int, default: 1)

Number of Monte Carlo samples used to estimate the KL when no closed-form formula is available.
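A Monte Carlo KL estimate averages log p(x) − log q(x) over samples x drawn from p. The sketch below uses two univariate Gaussians, where a closed form exists, purely so the estimate can be checked; the helper names are hypothetical and this is not how TorchRL samples from its distributions.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def mc_kl(mean_p, std_p, mean_q, std_q, num_samples, seed=0):
    """Estimate KL(p || q) from samples of p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(mean_p, std_p)  # x ~ p
        total += log_normal_pdf(x, mean_p, std_p) - log_normal_pdf(x, mean_q, std_q)
    return total / num_samples


def closed_form_kl(mean_p, std_p, mean_q, std_q):
    """Exact KL(p || q) between two univariate normals."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2) - 0.5)


estimate = mc_kl(0.0, 1.0, 1.0, 1.0, num_samples=20000)
exact = closed_form_kl(0.0, 1.0, 1.0, 1.0)
```

With a single sample the estimate is unbiased but noisy, so a larger sample count trades compute for a steadier KL signal.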

Output Keys

KLPENPPOLoss returns the core keys it shares with ClipPPOLoss (loss_objective, loss_critic, loss_entropy, entropy) plus:

Key	Description
`kl`	Observed KL divergence between old and updated policy (for monitoring)

beta is updated in-place during forward(). If you run multiple gradient steps on the same batch, beta may change between steps. This is the intended adaptive behaviour — track loss_td["kl"] to observe the drift.

A2CLoss

A2CLoss implements Advantage Actor-Critic, a simpler on-policy objective that uses the REINFORCE gradient estimator weighted by the advantage. Unlike PPO it does not apply any importance-weight correction, so data must be fresh (collected by the current policy).
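In scalar form, the policy term described here is the batch mean of −log π(a|s) · A. The following sketch shows that arithmetic only; a2c_objective is an illustrative name, and the advantages are treated as constants, as they would be after detaching them from the critic.

```python
def a2c_objective(log_probs, advantages):
    """Batch mean of -log pi(a|s) * A, with advantages treated as constants."""
    terms = [-lp * adv for lp, adv in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


# Actions with positive advantage are made more likely, negative ones less likely.
objective = a2c_objective(log_probs=[-1.0, -2.0], advantages=[2.0, -0.5])
```

There is no probability ratio in this expression, which is why the data must come from the current policy.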

from torchrl.objectives import A2CLoss

loss_module = A2CLoss(
    actor_network=actor,
    critic_network=critic,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    loss_critic_type="smooth_l1",
)

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)

Stochastic policy network.

critic_network (ValueOperator, required)

Value network returning "state_value".

entropy_bonus (bool, default: True)

Add an entropy regularisation term to favour exploration.

entropy_coeff (float, default: 0.01)

Entropy bonus weight.

critic_coeff (float | None, default: 1.0)

Multiplier for the critic loss. Pass None to remove the critic loss from outputs and decouple the in-keys from the critic.

loss_critic_type (str, default: "smooth_l1")

Loss function for the value residual. One of "l1", "l2", "smooth_l1".

Output Keys

Key	Description
`loss_objective`	REINFORCE policy gradient loss
`loss_critic`	Value function loss
`loss_entropy`	Entropy bonus (when `entropy_bonus=True`)
`entropy`	Raw policy entropy

ReinforceLoss

ReinforceLoss is the vanilla REINFORCE (Williams, 1992) policy gradient. It computes −log π(a|s) * A where the advantage can be a simple Monte Carlo return, a baseline-subtracted return, or any other advantage estimate.
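The sketch below illustrates two of the advantage choices mentioned above: a plain Monte Carlo return and a baseline-subtracted return. It is written in plain Python with hypothetical helper names and a fixed toy trajectory.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for a single finished episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns, baselines=None):
    """Batch mean of -log pi(a|s) * (G - baseline)."""
    if baselines is None:
        baselines = [0.0] * len(returns)
    terms = [-lp * (g - b) for lp, g, b in zip(log_probs, returns, baselines)]
    return sum(terms) / len(terms)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
plain = reinforce_loss([-1.0, -1.0, -1.0], returns)
with_baseline = reinforce_loss([-1.0, -1.0, -1.0], returns, baselines=[1.0, 1.0, 1.0])
```

Subtracting a baseline shrinks the magnitude of the weights on each log-probability, which is the variance reduction the critic is there to provide.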

from torchrl.objectives import ReinforceLoss

loss_module = ReinforceLoss(
    actor_network=actor,
    critic_network=critic,   # value network used as a baseline
    loss_critic_type="smooth_l1",
)

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)

Stochastic policy that returns log-probabilities.

critic_network (ValueOperator, required)

Baseline value network. Predictions are used to reduce gradient variance via advantage estimation.

delay_value (bool, default: False)

If True, creates a separate target network for the critic. Incompatible with functional=False.

loss_critic_type (str, default: "smooth_l1")

Loss for the baseline value residual.

Output Keys

Key	Description
`loss_actor`	REINFORCE policy gradient loss
`loss_value`	Baseline value function loss

Comparing the Four Loss Classes

ClipPPOLoss

Best for most on-policy tasks. The clipped objective provides a stable trust region and is comparatively robust to hyperparameter choices. Dominant in the literature since 2017.

loss = ClipPPOLoss(actor, critic, clip_epsilon=0.2)

KLPENPPOLoss

Useful when you need a soft, differentiable trust region constraint. The adaptive beta can be tricky to tune, so monitor kl throughout training.

loss = KLPENPPOLoss(actor, critic, dtarg=0.01, beta=1.0)

A2CLoss

Simpler than PPO; well-suited for synchronous parallel environments where collecting data with the latest policy is cheap. No importance-weight correction means you must not re-use stale batches.

loss = A2CLoss(actor, critic, entropy_bonus=True)

ReinforceLoss

Classic baseline algorithm. High variance compared to A2C/PPO, but useful as a pedagogical reference or for simple environments.

loss = ReinforceLoss(actor, critic)

Switching the Value Estimator

Every policy gradient loss uses GAE by default (with hyperparameters from default_value_kwargs()). You can swap it at any time:

from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.utils import ValueEstimators
from torchrl.objectives.value import GAE

loss_module = ClipPPOLoss(actor, critic)

# Use TD(λ) instead of GAE
loss_module.make_value_estimator(
    ValueEstimators.TDLambda,
    gamma=0.99,
    lmbda=0.95,
)

# Or build a standalone GAE and use it separately
gae = GAE(gamma=0.99, lmbda=0.95, value_network=critic)
gae(data)                    # writes "advantage" and "value_target" into data
losses = loss_module(data)   # reads pre-computed advantage

If the "advantage" key is absent from the input TensorDict, the loss module will compute advantages on the fly using its internal value estimator. Pre-computing advantages externally (as shown above) is strongly preferred in practice because it lets you reuse the same advantage estimates across multiple PPO gradient steps.
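For readers who want to see what the estimator writes into "advantage" and "value_target", here is a minimal plain-Python sketch of the standard GAE recursion for a single trajectory. It is an illustration under simplifying assumptions (one done flag per step, no distinction between termination and truncation) and not the library's vectorised implementation.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Generalised advantage estimation over one trajectory, computed backwards."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    value_targets = [adv + v for adv, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[0.5, 0.0],
    dones=[False, True],
    gamma=1.0,
    lmbda=1.0,
)
```

With gamma and lmbda both set to 1.0 the value targets reduce to the Monte Carlo returns (2.0 and 1.0 in this toy trajectory), which makes the example easy to verify by hand.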
