Policy gradient methods optimize a stochastic policy by estimating the gradient of the expected return with respect to policy parameters. TorchRL provides four production-ready policy gradient loss modules (ClipPPOLoss, KLPENPPOLoss, A2CLoss, and ReinforceLoss), all built on the shared LossModule base class. Every loss reads its inputs from a TensorDict, writes named loss_* scalars back into a new TensorDict, and delegates advantage computation to an interchangeable value estimator such as GAE.
LossModule Base Class
All TorchRL objectives inherit from LossModule, which itself inherits from TensorDictModuleBase. The base class handles:
- Functionalization: wrapping actor/critic parameters into TensorDictParams so meta-RL and gradient checkpointing work out of the box.
- Key configuration: a set_keys() method so every input/output tensordict key can be renamed without subclassing.
- Value estimator injection: a make_value_estimator() method that swaps in any ValueEstimatorBase (GAE, TDLambda, VTrace, …) at runtime.
- Exploration suppression: the forward() method is automatically decorated to run under ExplorationType.DETERMINISTIC so exploration noise is suppressed during loss computation.
make_value_estimator() must be called before the first forward() pass. Calling it afterwards replaces the estimator but does not recompute any cached advantages that may already be stored in your replay buffer.
ClipPPOLoss
ClipPPOLoss is the standard PPO variant. The objective clips the importance-weighted advantage to keep the policy update within a trust region. In that objective, r = π_new(a|s) / π_old(a|s) is the probability ratio and ε is clip_epsilon.
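For reference, the clipped surrogate is commonly written as follows, using the ratio r and threshold ε defined above. The symbol Â for the advantage estimate is introduced here for illustration, and this is the textbook form rather than a transcription of TorchRL's code:

```latex
L^{\mathrm{CLIP}} = \mathbb{E}\left[ \min\left( r\,\hat{A},\ \operatorname{clip}(r,\ 1-\varepsilon,\ 1+\varepsilon)\,\hat{A} \right) \right],
\qquad
r = \frac{\pi_{\mathrm{new}}(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)}
```

Because optimizers minimise, the loss is the negation of this quantity.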
Constructor Parameters
- actor_network: The stochastic policy operator. Must be a ProbabilisticTensorDictSequential (or compatible subclass) that writes the sampled action and its log probability into the output TensorDict. The log-probability key defaults to "sample_log_prob" (or "action_log_prob" when composite log-prob aggregation is disabled).
- critic_network: The value operator. Typically a ValueOperator that reads observations and writes a scalar "state_value" into the output TensorDict.
- clip_epsilon: Clipping threshold for the importance weight ratio.
  - float x: symmetric clipping [1 − x, 1 + x].
  - tuple (eps_low, eps_high): asymmetric clipping as in DAPO Clip-Higher (e.g. (0.20, 0.28)). Exposes clip_epsilon_low / clip_epsilon_high as schedulable buffers instead of clip_epsilon.
- entropy_bonus: If True, adds an entropy bonus to the total loss to encourage exploration.
- entropy_coeff: Entropy multiplier.
  - Scalar: a single coefficient applied to the summed entropy of all action heads.
  - Mapping {head_name: coeff}: per-head coefficients for composite action spaces.
- critic_coeff: Multiplier applied to the critic loss before summing. Pass None to exclude the critic loss from the returned output keys entirely.
- loss_critic_type: Loss function used for the value discrepancy. One of "l1", "l2", or "smooth_l1".
- normalize_advantage: If True, normalises advantages to zero mean and unit variance before use. Set to True (MAPPO default) when using multi-agent variants.
- clip_value: If a float, clips value predictions with respect to the stored value estimate to limit extreme updates. If True, reuses clip_epsilon as the threshold (only valid with scalar clip_epsilon). False disables clipping.
- separate_losses: If True, shared parameters between actor and critic are trained only on the policy loss. Gradients from the critic loss are not propagated to shared parameters.
- reduction: Reduction applied to scalar loss outputs. One of "none", "mean", or "sum".
Output Keys
| Key | Description |
|---|---|
| loss_objective | Clipped surrogate policy loss (negated, to be minimised) |
| loss_critic | Value function loss weighted by critic_coeff |
| loss_entropy | Entropy bonus weighted by entropy_coeff (when entropy_bonus=True) |
| entropy | Raw policy entropy (for logging) |
| kl_approx | Approximate KL divergence between old and new policy (for monitoring) |
| clip_fraction | Fraction of samples where the ratio was clipped (for monitoring) |
| explained_variance | R² of critic predictions vs. value targets (when log_explained_variance=True) |
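To make the first rows of the table concrete, here is a minimal plain-Python sketch of how a clipped surrogate loss, a clip fraction, and a simple KL estimate can be derived from per-sample log-probabilities and advantages. It is an illustration, not TorchRL's implementation: the function name clipped_surrogate and its arguments are hypothetical, and the KL estimator shown (mean of old minus new log-probability) is one common choice that may differ from what the library reports.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages,
                      eps_low=0.2, eps_high=0.2):
    """Return (loss, clip_fraction, kl_estimate) for one batch of samples."""
    n = len(advantages)
    objective = 0.0
    num_clipped = 0
    kl_sum = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # r = pi_new / pi_old
        clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # Pessimistic bound: take the smaller of the two candidate terms.
        objective += min(ratio * adv, clipped_ratio * adv)
        num_clipped += int(clipped_ratio != ratio)
        kl_sum += old_lp - new_lp  # sample estimate of KL(old || new)
    # The surrogate is maximised, so the loss is its negation.
    return -objective / n, num_clipped / n, kl_sum / n


old_lp = [math.log(0.5)] * 3
new_lp = [math.log(0.7), math.log(0.5), math.log(0.3)]
adv = [1.0, -1.0, 1.0]
loss, clip_fraction, kl = clipped_surrogate(new_lp, old_lp, adv)
```

Passing different values for eps_low and eps_high (for example 0.20 and 0.28) gives the asymmetric clipping range described for tuple-valued clip_epsilon.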
Input Keys (via set_keys)
| Key | Default | Description |
|---|---|---|
| advantage | "advantage" | Pre-computed advantage estimates (written by GAE) |
| value_target | "value_target" | Value function training targets |
| value | "state_value" | Critic predictions |
| sample_log_prob | "sample_log_prob" | Log-probability of the collected action |
| action | "action" | Collected actions |
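The advantage and value_target entries are usually filled in by a value estimator before the loss is called. As a rough, framework-free sketch of what Generalized Advantage Estimation computes over a single trajectory (plain Python, not TorchRL's implementation; the name gae_sketch and its arguments are hypothetical, and the distinction between truncation and termination is deliberately omitted):

```python
def gae_sketch(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Return (advantages, value_targets) for one trajectory."""
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    # Walk the trajectory backwards, accumulating discounted TD residuals.
    for t in reversed(range(horizon)):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    # The critic is regressed towards advantage + current value prediction.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae_sketch(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],
    dones=[False, False, True],
    gamma=1.0,
    lmbda=1.0,
)
```

With gamma and lmbda both set to 1, the value targets reduce to the undiscounted Monte Carlo returns, which is a convenient sanity check.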
Complete PPO Training Example
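The sketch below is a framework-free toy, not TorchRL code. It only illustrates the loop structure a PPO run follows: collect a batch with the current policy, compute advantages once, then take several clipped gradient steps on that same batch. The two-armed bandit, the batch-mean baseline, the hand-written softmax gradient, and every name in it are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ppo_bandit(iterations=50, batch_size=64, epochs=4, eps=0.2, lr=0.5, seed=0):
    rng = random.Random(seed)
    arm_reward_prob = [0.2, 0.8]  # hypothetical environment: arm 1 pays more often
    logits = [0.0, 0.0]
    for _ in range(iterations):
        # 1. Collect data with the current ("old") policy and store its log-probs.
        old_probs = softmax(logits)
        actions = [rng.choices([0, 1], weights=old_probs)[0] for _ in range(batch_size)]
        rewards = [1.0 if rng.random() < arm_reward_prob[a] else 0.0 for a in actions]
        old_log_probs = [math.log(old_probs[a]) for a in actions]
        # 2. Compute advantages once per batch (reward minus a batch-mean baseline).
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]
        # 3. Several gradient steps on the same batch with the clipped surrogate.
        for _ in range(epochs):
            probs = softmax(logits)
            grad = [0.0, 0.0]
            for a, old_lp, adv in zip(actions, old_log_probs, advantages):
                ratio = probs[a] / math.exp(old_lp)
                clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
                # Gradient flows only while the unclipped term is the active minimum.
                if ratio * adv <= clipped * adv:
                    for k in range(2):
                        dlogp = (1.0 if k == a else 0.0) - probs[k]
                        grad[k] += ratio * adv * dlogp / batch_size
            # Gradient ascent on the surrogate (equivalently, descent on the loss).
            logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


final_probs = ppo_bandit()
```

Reusing one batch for several epochs is what makes the stored old log-probabilities and the clipping necessary: after the first step the policy has already moved away from the one that collected the data.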
KLPENPPOLoss
KLPENPPOLoss is the KL-penalty variant of PPO. Instead of hard clipping, it adds a soft penalty proportional to the KL divergence between the old and new policy.
The β multiplier on that penalty is adapted automatically after each update epoch: it is increased when KL > dtarg and decreased when KL < dtarg, keeping policy updates close to the target divergence.
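The adaptation rule can be sketched in a few lines of plain Python. This follows the rule exactly as stated above; some implementations only adjust β once the KL leaves a tolerance band around the target, which this sketch does not model. The function names and the default values are illustrative assumptions, not TorchRL's API.

```python
def adapt_beta(beta, observed_kl, dtarg=0.01, increment=2.0, decrement=0.5):
    """Return the KL penalty coefficient to use for the next update epoch."""
    if increment < 1.0 or decrement > 1.0:
        raise ValueError("increment must be >= 1.0 and decrement must be <= 1.0")
    if observed_kl > dtarg:
        return beta * increment  # policy moved too far: penalise more
    if observed_kl < dtarg:
        return beta * decrement  # policy barely moved: penalise less
    return beta


def kl_penalised_loss(surrogate, kl, beta):
    """Negated objective: unclipped surrogate minus the weighted KL penalty."""
    return -(surrogate - beta * kl)


beta_up = adapt_beta(1.0, observed_kl=0.05)
beta_down = adapt_beta(1.0, observed_kl=0.001)
loss = kl_penalised_loss(surrogate=0.3, kl=0.05, beta=beta_up)
```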
Additional Parameters vs. ClipPPOLoss
- dtarg: Target KL divergence. The beta multiplier is adapted to keep KL(π_old ‖ π_new) near this value.
- beta: Initial KL penalty coefficient. Registered as a schedulable buffer so it can be set directly: loss.beta = 0.5.
- increment: Factor by which beta is multiplied when the observed KL exceeds dtarg. Must be >= 1.0.
- decrement: Factor by which beta is multiplied when the observed KL is below dtarg. Must be <= 1.0.
- samples_mc_kl: Number of Monte Carlo samples used to estimate the KL when no closed-form formula is available.
Output Keys
KLPENPPOLoss returns the same keys as ClipPPOLoss (loss_objective, loss_critic, loss_entropy, entropy) plus:
| Key | Description |
|---|---|
| kl | Observed KL divergence between old and updated policy (for monitoring) |
A2CLoss
A2CLoss implements Advantage Actor-Critic, a simpler on-policy objective that uses the REINFORCE gradient estimator weighted by the advantage. Unlike PPO it does not apply any importance-weight correction, so data must be fresh (collected by the current policy).
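As a minimal plain-Python sketch of the three terms such an objective typically combines (not TorchRL's implementation: the function name, the squared-error critic loss, and the default coefficients are assumptions, and advantages are treated as constants rather than differentiated through):

```python
def a2c_losses(log_probs, advantages, values, value_targets, entropies,
               entropy_coeff=0.01, critic_coeff=1.0):
    """Return the three A2C loss terms as a dict of floats."""
    n = len(log_probs)
    # Policy term: -log pi(a|s) * A, with no importance-weight correction.
    loss_objective = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Critic term: mean squared error between predictions and targets.
    loss_critic = critic_coeff * sum(
        (v - t) ** 2 for v, t in zip(values, value_targets)
    ) / n
    # Entropy term: negative sign so that minimising the loss raises entropy.
    loss_entropy = -entropy_coeff * sum(entropies) / n
    return {
        "loss_objective": loss_objective,
        "loss_critic": loss_critic,
        "loss_entropy": loss_entropy,
    }


losses = a2c_losses(
    log_probs=[-0.5, -1.0],
    advantages=[1.0, -1.0],
    values=[1.0, 0.0],
    value_targets=[2.0, 0.0],
    entropies=[0.6, 0.7],
)
total_loss = sum(losses.values())
```

Because no probability ratio appears, the log-probabilities must come from the same policy that collected the samples, which is why the data has to be fresh.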
Constructor Parameters
- actor_network: Stochastic policy network.
- critic_network: Value network returning "state_value".
- entropy_bonus: Add an entropy regularisation term to favour exploration.
- entropy_coeff: Entropy bonus weight.
- critic_coeff: Multiplier for the critic loss. Pass None to remove the critic loss from outputs and decouple the in-keys from the critic.
- loss_critic_type: Loss function for the value residual. One of "l1", "l2", or "smooth_l1".
Output Keys
| Key | Description |
|---|---|
| loss_objective | REINFORCE policy gradient loss |
| loss_critic | Value function loss |
| loss_entropy | Entropy bonus (when entropy_bonus=True) |
| entropy | Raw policy entropy |
ReinforceLoss
ReinforceLoss is the vanilla REINFORCE (Williams, 1992) policy gradient. It computes −log π(a|s) * A where the advantage can be a simple Monte Carlo return, a baseline-subtracted return, or any other advantage estimate.
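The formula above can be sketched in plain Python with Monte Carlo returns and an optional baseline. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, and the baseline values would normally come from a learned value network.

```python
def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns, baselines=None):
    """Mean of -log pi(a|s) * A, where A = return - baseline."""
    if baselines is None:
        baselines = [0.0] * len(returns)
    advantages = [g - b for g, b in zip(returns, baselines)]
    return -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / len(log_probs)


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss_plain = reinforce_loss([-1.0, -1.0, -1.0], returns)
loss_with_baseline = reinforce_loss([-1.0, -1.0, -1.0], returns, baselines=[1.5, 1.5, 1.5])
```

Subtracting a baseline leaves the expected gradient unchanged when the baseline does not depend on the action, but it can reduce the variance of the estimate considerably.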
Constructor Parameters
- actor_network: Stochastic policy that returns log-probabilities.
- critic_network: Baseline value network. Predictions are used to reduce gradient variance via advantage estimation.
- delay_value: If True, creates a separate target network for the critic. Incompatible with functional=False.
- loss_critic_type: Loss for the baseline value residual.
Output Keys
| Key | Description |
|---|---|
| loss_actor | REINFORCE policy gradient loss |
| loss_value | Baseline value function loss |
Comparing the Four Loss Classes
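- ClipPPOLoss: best for most on-policy tasks. The clipped objective provides a stable trust region with comparatively little hyperparameter tuning, and it has been the dominant variant in the literature since 2017.
- KLPENPPOLoss: replaces hard clipping with a soft KL penalty whose β coefficient adapts towards a target divergence.
- A2CLoss: a simpler on-policy objective with no importance-weight correction, so data must be collected by the current policy.
- ReinforceLoss: the vanilla REINFORCE gradient, optionally with a baseline value network to reduce variance.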
Switching the Value Estimator
Every policy gradient loss uses GAE by default (with hyperparameters from default_value_kwargs()). You can swap it at any time by calling make_value_estimator() with a different estimator.
If the "advantage" key is absent from the input TensorDict, the loss module will compute advantages on the fly using its internal value estimator. Pre-computing advantages externally is strongly preferred in practice because it lets you reuse the same advantage estimates across multiple PPO gradient steps.