Policy gradient methods optimize a stochastic policy by estimating the gradient of the expected return with respect to policy parameters. TorchRL provides four production-ready policy gradient loss modules — ClipPPOLoss, KLPENPPOLoss, A2CLoss, and ReinforceLoss — all built on the shared LossModule base class. Every loss reads its inputs from a TensorDict, writes named loss_* scalars back into a new TensorDict, and delegates advantage computation to an interchangeable value estimator such as GAE.

LossModule Base Class

All TorchRL objectives inherit from LossModule, which itself inherits from TensorDictModuleBase. The base class handles:
  • Functionalization — wrapping actor/critic parameters into TensorDictParams so meta-RL and gradient checkpointing work out of the box.
  • Key configuration — a set_keys() method so every input/output tensordict key can be renamed without subclassing.
  • Value estimator injection — a make_value_estimator() method that swaps in any ValueEstimatorBase (GAE, TDLambda, VTrace, …) at runtime.
  • Exploration suppression — the forward() method is automatically decorated to run under ExplorationType.DETERMINISTIC so exploration noise is suppressed during loss computation.

from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.common import LossModule
from torchrl.objectives.utils import ValueEstimators

loss = ClipPPOLoss(actor, critic)

# Rename any tensordict key the loss reads or writes
loss.set_keys(advantage="gae_advantage", value="my_state_value")

# Swap the built-in GAE for TD(λ)
loss.make_value_estimator(ValueEstimators.TDLambda, gamma=0.99, lmbda=0.95)

Call make_value_estimator() before the first forward() pass. Calling it later still replaces the estimator, but advantages that are already stored in your data (for example in a replay buffer) are not recomputed.

ClipPPOLoss

ClipPPOLoss is the standard PPO variant. The objective clips the importance-weighted advantage to keep the policy update within a trust region:
loss = -min(r * A,  clip(r, 1 − ε, 1 + ε) * A)
where r = π_new(a|s) / π_old(a|s) is the probability ratio and ε is clip_epsilon.
from torchrl.objectives import ClipPPOLoss

loss_module = ClipPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    loss_critic_type="smooth_l1",
)
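
To make the clipping rule concrete, here is a minimal pure-Python sketch of the per-sample objective given above. It is not TorchRL's implementation: the function name clipped_surrogate is hypothetical, and the tuple form of clip_epsilon is assumed to mean (eps_low, eps_high) as described for the constructor argument.

```python
import math


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_epsilon=0.2):
    """Per-sample loss: -min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    # Accept a scalar (symmetric) or an (eps_low, eps_high) pair (asymmetric).
    if isinstance(clip_epsilon, tuple):
        eps_low, eps_high = clip_epsilon
    else:
        eps_low = eps_high = clip_epsilon
    # r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return -min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage is capped at 1 + eps = 1.2.
loss_up = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage is floored at 1 - eps = 0.8.
loss_down = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
# Asymmetric clipping lets the ratio rise further than it may fall.
loss_asym = clipped_surrogate(math.log(1.5), 0.0, 1.0, clip_epsilon=(0.20, 0.28))
```

In each case the loss stops responding to the ratio once it leaves the clipping interval in the direction the advantage favours, which is what removes the incentive for large policy steps.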

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)
The stochastic policy operator. Must be a ProbabilisticTensorDictSequential (or compatible subclass) that writes the sampled action and its log probability into the output TensorDict. The log-probability key defaults to "sample_log_prob" (or "action_log_prob" when composite log-prob aggregation is disabled).

critic_network (TensorDictModule, required)
The value operator. Typically a ValueOperator that reads observations and writes a scalar "state_value" into the output TensorDict.

clip_epsilon (float | tuple[float, float], default: 0.2)
Clipping threshold for the importance weight ratio.
  • float x: symmetric clipping [1 − x, 1 + x].
  • tuple (eps_low, eps_high): asymmetric clipping as in DAPO Clip-Higher (e.g. (0.20, 0.28)). Exposes clip_epsilon_low / clip_epsilon_high as schedulable buffers instead of clip_epsilon.

entropy_bonus (bool, default: True)
If True, adds an entropy bonus to the total loss to encourage exploration.

entropy_coeff (float | dict, default: 0.01)
Entropy multiplier.
  • Scalar: a single coefficient applied to the summed entropy of all action heads.
  • Mapping {head_name: coeff}: per-head coefficients for composite action spaces.

critic_coeff (float | None, default: 1.0)
Multiplier applied to the critic loss before summing. Pass None to exclude the critic loss from the returned output keys entirely.

loss_critic_type (str, default: "smooth_l1")
Loss function used for the value discrepancy. One of "l1", "l2", or "smooth_l1".

normalize_advantage (bool, default: False)
If True, normalises advantages to zero mean and unit variance before use. Set to True (MAPPO default) when using multi-agent variants.

clip_value (float | bool, default: False)
If a float, clips value predictions with respect to the stored value estimate to limit extreme updates. If True, reuses clip_epsilon as the threshold (only valid with scalar clip_epsilon). False disables clipping.

separate_losses (bool, default: False)
If True, shared parameters between actor and critic are trained only on the policy loss. Gradients from the critic loss are not propagated to shared parameters.

reduction (str, default: "mean")
Reduction applied to scalar loss outputs. One of "none", "mean", or "sum".
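
The normalize_advantage and clip_value options described above can be pictured with the following pure-Python sketch. It illustrates the general idea only: the helper names are hypothetical, and the use of the sample standard deviation, the eps guard, and the exact clipping formula are assumptions rather than a statement of what TorchRL does internally.

```python
import statistics


def normalize_advantages(advantages, eps=1e-6):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(advantages)
    # Sample standard deviation (assumed); eps guards against division by ~0.
    std = max(statistics.stdev(advantages), eps)
    return [(a - mean) / std for a in advantages]


def clip_value_prediction(new_value, old_value, clip_value):
    """Keep the new prediction within +/- clip_value of the stored estimate."""
    delta = new_value - old_value
    delta = min(max(delta, -clip_value), clip_value)
    return old_value + delta


normalized = normalize_advantages([1.0, 2.0, 3.0])
clipped = clip_value_prediction(new_value=2.0, old_value=1.0, clip_value=0.2)
unchanged = clip_value_prediction(new_value=1.1, old_value=1.0, clip_value=0.2)
```

Normalisation keeps the scale of the policy gradient comparable across batches, while value clipping limits how far a single update can move the critic away from the estimate stored at collection time.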

Output Keys

| Key | Description |
| --- | --- |
| loss_objective | Clipped surrogate policy loss (negated, to be minimised) |
| loss_critic | Value function loss weighted by critic_coeff |
| loss_entropy | Entropy bonus weighted by entropy_coeff (when entropy_bonus=True) |
| entropy | Raw policy entropy (for logging) |
| kl_approx | Approximate KL divergence between old and new policy (for monitoring) |
| clip_fraction | Fraction of samples where the ratio was clipped (for monitoring) |
| explained_variance | R² of critic predictions vs. value targets (when log_explained_variance=True) |
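
The three monitoring quantities in the table can be approximated outside the loss module. The sketch below uses common textbook estimators in plain Python; the function names are hypothetical and the exact formulas TorchRL uses for kl_approx, clip_fraction and explained_variance may differ.

```python
import math
import statistics


def ppo_diagnostics(log_prob_new, log_prob_old, clip_epsilon=0.2):
    """Return (kl_approx, clip_fraction) for paired per-sample log-probabilities."""
    log_ratios = [new - old for new, old in zip(log_prob_new, log_prob_old)]
    # Simple KL estimate: mean of log(pi_old / pi_new) over the collected actions.
    kl_approx = statistics.fmean([-lr for lr in log_ratios])
    # Share of samples whose ratio falls outside [1 - eps, 1 + eps].
    clipped = [abs(math.exp(lr) - 1.0) > clip_epsilon for lr in log_ratios]
    clip_fraction = sum(clipped) / len(clipped)
    return kl_approx, clip_fraction


def explained_variance(predictions, targets):
    """1 - Var(target - prediction) / Var(target); 1.0 means a perfect critic."""
    residuals = [t - p for p, t in zip(predictions, targets)]
    return 1.0 - statistics.pvariance(residuals) / statistics.pvariance(targets)


kl, clip_frac = ppo_diagnostics([math.log(0.5), 0.0], [0.0, 0.0])
ev_perfect = explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
ev_constant = explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
```

A rising clip fraction or KL estimate across gradient steps on the same batch is a typical sign that the policy is drifting away from the one that collected the data.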

Input Keys (via set_keys)

| Key | Default | Description |
| --- | --- | --- |
| advantage | "advantage" | Pre-computed advantage estimates (written by GAE) |
| value_target | "value_target" | Value function training targets |
| value | "state_value" | Critic predictions |
| sample_log_prob | "sample_log_prob" | Log-probability of the collected action |
| action | "action" | Collected actions |

Complete PPO Training Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data.tensor_specs import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.value import GAE

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Build actor
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
module = SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"])
actor = ProbabilisticActor(
    module=module,
    distribution_class=TanhNormal,
    in_keys=["loc", "scale"],
    spec=spec,
    return_log_prob=True,
)

# Build critic
critic = ValueOperator(
    module=nn.Linear(n_obs, 1),
    in_keys=["observation"],
)

# Create loss module
loss_module = ClipPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
)

# Attach a GAE advantage estimator
advantage_module = GAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    differentiable=False,
)

# Training loop sketch
optimizer_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
optimizer_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

# data: TensorDict with shape [batch_size, time_steps], collected by rolling out the policy
# Compute advantages in place (writes "advantage" and "value_target" into data)
advantage_module(data)

# Clear gradients left over from any previous step
optimizer_actor.zero_grad()
optimizer_critic.zero_grad()

# Compute all loss components in one forward pass
loss_td = loss_module(data)

loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]
loss.backward()

optimizer_actor.step()
optimizer_critic.step()

Compute advantages before the inner PPO update loop and after rolling out the environment. Re-computing advantages inside the update loop using stale parameters defeats the purpose of the importance-weight correction.
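
The ordering described in this note can be sketched as a plain-Python control-flow skeleton. Everything here is a stand-in: ppo_update, fake_advantage and fake_gradient_step are hypothetical names, the batch is a list of dicts instead of a TensorDict, and the epoch and minibatch sizes are arbitrary.

```python
def ppo_update(batch, compute_advantage, gradient_step, num_epochs=4, minibatch_size=2):
    """Compute advantages once per rollout, then reuse them for every gradient step."""
    # One advantage pass, done right after the rollout and before any update.
    batch = compute_advantage(batch)
    steps = 0
    for _ in range(num_epochs):
        # Sequential minibatches keep the sketch short; real code usually shuffles.
        for start in range(0, len(batch), minibatch_size):
            gradient_step(batch[start:start + minibatch_size])
            steps += 1
    return steps


calls = {"advantage": 0, "gradient": 0}


def fake_advantage(batch):
    # Stand-in for an advantage estimator such as GAE.
    calls["advantage"] += 1
    return [dict(sample, advantage=0.0) for sample in batch]


def fake_gradient_step(minibatch):
    # Every sample must already carry its advantage when the loss is evaluated.
    assert all("advantage" in sample for sample in minibatch)
    calls["gradient"] += 1


rollout = [{"reward": float(i)} for i in range(6)]
total_steps = ppo_update(rollout, fake_advantage, fake_gradient_step)
```

With six samples, minibatches of two and four epochs, the advantage estimator runs once while the loss is evaluated twelve times on the same advantage values.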

KLPENPPOLoss

KLPENPPOLoss is the KL-penalty variant of PPO. Instead of hard clipping, it adds a soft penalty proportional to the KL divergence between the old and new policy:
loss = -r * A + β * KL(π_old ‖ π_new)
The β multiplier is adapted automatically on each forward() call: it is increased when KL > dtarg and decreased when KL < dtarg, keeping policy updates close to the target divergence.
from torchrl.objectives import KLPENPPOLoss

loss_module = KLPENPPOLoss(
    actor_network=actor,
    critic_network=critic,
    dtarg=0.01,
    beta=1.0,
    increment=2.0,
    decrement=0.5,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
)

Additional Parameters vs. ClipPPOLoss

dtarg (float, default: 0.01)
Target KL divergence. The beta multiplier is adapted to keep KL(π_old ‖ π_new) near this value.

beta (float, default: 1.0)
Initial KL penalty coefficient. Registered as a schedulable buffer so it can be set directly: loss.beta = 0.5.

increment (float, default: 2.0)
Factor by which beta is multiplied when the observed KL exceeds dtarg. Must be >= 1.0.

decrement (float, default: 0.5)
Factor by which beta is multiplied when the observed KL is below dtarg. Must be <= 1.0.

samples_mc_kl (int, default: 1)
Number of Monte Carlo samples used to estimate the KL when no closed-form formula is available.
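
As an illustration of what a Monte Carlo KL estimate looks like, the sketch below draws samples from one distribution and averages the log-density difference. It uses two unit-variance normals so the result can be checked against the closed form; the helper names are hypothetical and this is not the code path TorchRL uses.

```python
import math
import random


def normal_log_prob(x, mean, std):
    """Log-density of a univariate normal distribution."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def monte_carlo_kl(sample_p, log_prob_p, log_prob_q, num_samples):
    """Estimate KL(p || q) as the mean of log p(x) - log q(x) for x drawn from p."""
    total = 0.0
    for _ in range(num_samples):
        x = sample_p()
        total += log_prob_p(x) - log_prob_q(x)
    return total / num_samples


rng = random.Random(0)
estimate = monte_carlo_kl(
    sample_p=lambda: rng.gauss(0.0, 1.0),
    log_prob_p=lambda x: normal_log_prob(x, 0.0, 1.0),
    log_prob_q=lambda x: normal_log_prob(x, 1.0, 1.0),
    num_samples=20_000,
)
# Closed form for two unit-variance normals: (mean_p - mean_q)^2 / 2.
exact = 0.5
```

With very few samples the estimate is noisy, so raising samples_mc_kl trades extra computation for a steadier signal when beta is adapted.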

Output Keys

KLPENPPOLoss returns the same keys as ClipPPOLoss (loss_objective, loss_critic, loss_entropy, entropy) plus:
| Key | Description |
| --- | --- |
| kl | Observed KL divergence between old and updated policy (for monitoring) |

beta is updated in-place during forward(). If you run multiple gradient steps on the same batch, beta may change between steps. This is the intended adaptive behaviour — track loss_td["kl"] to observe the drift.
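
The drift of beta over repeated gradient steps can be sketched in plain Python. The rule below follows the description on this page literally (compare the observed KL with dtarg); the function names are hypothetical, and the shipped implementation may apply a tolerance band around dtarg before changing beta.

```python
def kl_penalty_loss(ratio, advantage, kl, beta):
    """Per-sample KL-penalised objective: -r * A + beta * KL."""
    return -ratio * advantage + beta * kl


def adapt_beta(beta, kl, dtarg=0.01, increment=2.0, decrement=0.5):
    """Grow beta when the observed KL is above target, shrink it when below."""
    if kl > dtarg:
        return beta * increment
    if kl < dtarg:
        return beta * decrement
    return beta


beta = 1.0
history = []
# Observed KL for three consecutive gradient steps on the same batch.
for observed_kl in [0.02, 0.03, 0.005]:
    beta = adapt_beta(beta, observed_kl)
    history.append(beta)

example_loss = kl_penalty_loss(ratio=1.0, advantage=1.0, kl=0.02, beta=1.0)
```

Here beta doubles twice while the policy overshoots the target and halves once it falls back under it, which is the feedback loop that keeps updates near dtarg.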

A2CLoss

A2CLoss implements Advantage Actor-Critic, a simpler on-policy objective that uses the REINFORCE gradient estimator weighted by the advantage. Unlike PPO it does not apply any importance-weight correction, so data must be fresh (collected by the current policy).
from torchrl.objectives import A2CLoss

loss_module = A2CLoss(
    actor_network=actor,
    critic_network=critic,
    entropy_bonus=True,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    loss_critic_type="smooth_l1",
)

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)
Stochastic policy network.

critic_network (ValueOperator, required)
Value network returning "state_value".

entropy_bonus (bool, default: True)
Add an entropy regularisation term to favour exploration.

entropy_coeff (float, default: 0.01)
Entropy bonus weight.

critic_coeff (float | None, default: 1.0)
Multiplier for the critic loss. Pass None to remove the critic loss from outputs and decouple the in-keys from the critic.

loss_critic_type (str, default: "smooth_l1")
Loss function for the value residual. One of "l1", "l2", "smooth_l1".
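
For intuition about the three loss_critic_type options, the sketch below evaluates the conventional definitions on a single prediction/target pair. The function name is hypothetical, and two details are assumptions: that "l2" is the plain squared error without a 1/2 factor, and that "smooth_l1" switches from quadratic to linear at a threshold of 1.

```python
def value_loss(prediction, target, loss_critic_type="smooth_l1"):
    """Per-sample discrepancy between a critic prediction and its target."""
    diff = abs(prediction - target)
    if loss_critic_type == "l1":
        return diff
    if loss_critic_type == "l2":
        return diff * diff
    if loss_critic_type == "smooth_l1":
        # Quadratic near zero, linear beyond a threshold of 1 (assumed default).
        return 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    raise ValueError(f"unknown loss_critic_type: {loss_critic_type!r}")


kinds = ("l1", "l2", "smooth_l1")
small = {kind: value_loss(0.5, 0.0, kind) for kind in kinds}
large = {kind: value_loss(3.0, 0.0, kind) for kind in kinds}
```

For a small residual the smooth variant behaves like a scaled squared error, while for a large residual it grows linearly, which makes the critic less sensitive to outlier targets than "l2".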

Output Keys

| Key | Description |
| --- | --- |
| loss_objective | REINFORCE policy gradient loss |
| loss_critic | Value function loss |
| loss_entropy | Entropy bonus (when entropy_bonus=True) |
| entropy | Raw policy entropy |

ReinforceLoss

ReinforceLoss is the vanilla REINFORCE (Williams, 1992) policy gradient. It computes −log π(a|s) * A where the advantage can be a simple Monte Carlo return, a baseline-subtracted return, or any other advantage estimate.
from torchrl.objectives import ReinforceLoss

loss_module = ReinforceLoss(
    actor_network=actor,
    critic_network=critic,   # optional; used as a baseline
    loss_critic_type="smooth_l1",
)
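
The quantity −log π(a|s) * A can be worked through by hand for a short episode. The sketch below is a plain-Python illustration with hypothetical helper names; it uses Monte Carlo returns-to-go as the advantage and optionally subtracts a baseline, which is one of the choices mentioned above.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return-to-go G_t = r_t + gamma * G_{t+1} for one episode."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def reinforce_loss(log_probs, returns, baselines=None):
    """Mean of -log pi(a|s) * A, with A = G - baseline (baseline 0 if omitted)."""
    if baselines is None:
        baselines = [0.0] * len(returns)
    advantages = [g - b for g, b in zip(returns, baselines)]
    terms = [-lp * adv for lp, adv in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss_plain = reinforce_loss([-1.0, -1.0, -1.0], returns)
# A baseline that predicts the return exactly leaves no learning signal.
loss_baseline = reinforce_loss([-1.0, -1.0, -1.0], returns, baselines=returns)
```

The baseline does not change the expected gradient but shrinks the magnitude of the advantage, which is why a critic is commonly supplied even for plain REINFORCE.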

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)
Stochastic policy that returns log-probabilities.

critic_network (ValueOperator, optional)
Baseline value network. Predictions are used to reduce gradient variance via advantage estimation.

delay_value (bool, default: False)
If True, creates a separate target network for the critic. Incompatible with functional=False.

loss_critic_type (str, default: "smooth_l1")
Loss for the baseline value residual. One of "l1", "l2", or "smooth_l1".

Output Keys

| Key | Description |
| --- | --- |
| loss_actor | REINFORCE policy gradient loss |
| loss_value | Baseline value function loss |

Comparing the Four Loss Classes

ClipPPOLoss: best for most on-policy tasks. The clipped objective approximates a trust region and is comparatively robust to hyperparameter choices. It has been the dominant choice in the literature since 2017.
loss = ClipPPOLoss(actor, critic, clip_epsilon=0.2)

Switching the Value Estimator

Every policy gradient loss uses GAE by default (with hyperparameters from default_value_kwargs()). You can swap it at any time:
from torchrl.objectives import ClipPPOLoss
from torchrl.objectives.utils import ValueEstimators

loss_module = ClipPPOLoss(actor, critic)

# Use TD(λ) instead of GAE
loss_module.make_value_estimator(
    ValueEstimators.TDLambda,
    gamma=0.99,
    lmbda=0.95,
)

# Or build a standalone GAE and use it separately
from torchrl.objectives.value import GAE

gae = GAE(gamma=0.99, lmbda=0.95, value_network=critic)
gae(data)                    # writes "advantage" and "value_target" into data
losses = loss_module(data)   # reads pre-computed advantage
If the "advantage" key is absent from the input TensorDict, the loss module will compute advantages on the fly using its internal value estimator. Pre-computing advantages externally (as shown above) is strongly preferred in practice because it lets you reuse the same advantage estimates across multiple PPO gradient steps.
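
To show what an estimator such as GAE produces, here is the textbook recursion for a single trajectory in plain Python. It is a sketch under stated assumptions: the function name and argument layout are hypothetical, a single done flag ends bootstrapping, and TorchRL's handling of truncation versus termination is not modelled.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Return (advantages, value_targets) for a single trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode ends.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    # The critic regresses towards advantage + current value estimate.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],
    dones=[False, False, True],
    gamma=1.0,
    lmbda=1.0,
)
```

With gamma and lmbda both set to 1, the advantages reduce to the undiscounted return minus the value estimate, and the value targets equal the returns themselves. Lower lmbda values lean more on the critic and less on the sampled rewards.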
