Loss Modules, Objectives, and Value Estimators in TorchRL
TorchRL loss modules read TensorDict keys and expose named losses. Learn ClipPPOLoss, SACLoss, DQNLoss, TD3Loss, value estimators, and the LossModule base.
TorchRL loss modules are nn.Module subclasses that read a TensorDict, compute one or more differentiable losses, and return them under keys prefixed with "loss_". Unlike monolithic loss functions, each component loss is exposed separately, so optimizers, loggers, and schedulers can work on individual terms. Key mappings (which TensorDict field each part of the loss reads) are configurable through set_keys, making it straightforward to adapt a loss to a custom environment schema without modifying the algorithm itself.
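Because every term is exposed under its own key, a training loop typically sums the "loss_"-prefixed entries before calling backward. The sketch below shows that selection with a plain dict standing in for the returned TensorDict; the key names and values are illustrative assumptions, and floats replace tensors so the snippet runs without TorchRL.

```python
# Stand-in for the TensorDict a loss module returns (floats instead of tensors).
# Key names and numbers are illustrative, not the output of a real loss module.
losses = {
    "loss_objective": 0.40,
    "loss_critic": 0.25,
    "loss_entropy": -0.01,
    "entropy": 1.30,  # logged metric, not a loss term
}


def total_loss(td):
    """Sum only the entries whose key starts with 'loss_'."""
    return sum(value for key, value in td.items() if key.startswith("loss_"))


total = total_loss(losses)
```

With real tensors the same filter yields a scalar that can be passed to backward, while entries without the prefix remain available for logging.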
Value estimator registry — each loss declares a default_value_estimator; call loss.make_value_estimator(ValueEstimators.GAE, gamma=0.99) to swap it out.
Schedulable buffers — scalar coefficients like entropy_coeff can be updated with direct assignment and are tracked as proper nn.Module buffers.
    from torchrl.objectives.common import LossModule
    from torchrl.objectives.utils import SoftUpdate, HardUpdate, ValueEstimators

    # `loss` is assumed to be an existing LossModule instance.

    # Remap tensordict keys to match your environment schema.
    loss.set_keys(
        action="my_action",
        observation="obs",
        reward="extrinsic_reward",
    )

    # Swap the value estimator.
    loss.make_value_estimator(ValueEstimators.TDLambda, gamma=0.99, lmbda=0.95)

    # Attach a target network updater (for off-policy losses).
    # tau is the Polyak step size (tau = 1 - eps); eps=0.005 would
    # overwrite the target almost entirely at every step.
    updater = SoftUpdate(loss, tau=0.005)
    updater.step()  # call after each gradient step
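The arithmetic behind a soft target update can be sketched in plain Python. This is an illustration of Polyak averaging, not TorchRL's implementation; soft_update and the parameter lists are hypothetical names. In SoftUpdate's terms, tau is the fraction moved toward the online parameters and eps = 1 - tau is the fraction of the old target that is kept.

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction tau toward its source value."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


target_params = [0.0, 1.0]  # hypothetical target-network parameters
online_params = [1.0, 3.0]  # hypothetical online-network parameters
target_params = soft_update(target_params, online_params, tau=0.005)
```

A small tau keeps the target network slow-moving, which is the usual reason for using a target network at all.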
KLPENPPOLoss
An alternative PPO formulation that replaces clipping with a KL-divergence penalty whose coefficient adapts during training. Useful when the clipped surrogate is too conservative.
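The adaptive penalty can be sketched as follows. This follows the heuristic from the PPO paper (double the coefficient when the measured KL exceeds 1.5 times the target, halve it when it falls below the target divided by 1.5); the function names and default factors are assumptions for illustration, not TorchRL's API.

```python
def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective with a KL penalty in place of clipping."""
    return ratio * advantage - beta * kl


def adapt_beta(beta, kl, kl_target, increment=2.0, decrement=0.5):
    """Raise beta when the policy moved too far, lower it when it moved too little."""
    if kl > 1.5 * kl_target:
        return beta * increment
    if kl < kl_target / 1.5:
        return beta * decrement
    return beta


objective = kl_penalized_objective(ratio=1.2, advantage=0.5, kl=0.02, beta=1.0)
beta_after_large_step = adapt_beta(beta=1.0, kl=0.05, kl_target=0.01)
```

The penalty coefficient therefore tracks how far each update actually moves the policy, rather than imposing a fixed trust region.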
SACLoss implements Soft Actor-Critic, which combines a stochastic actor, twin Q-networks, and a learnable entropy temperature. It returns "loss_actor", "loss_qvalue", and "loss_alpha".
    from torchrl.objectives import SACLoss
    from torchrl.objectives.utils import SoftUpdate

    # `actor`, `qvalue`, and `batch` are assumed to be defined elsewhere.
    loss_fn = SACLoss(
        actor_network=actor,
        qvalue_network=qvalue,      # or a list of 2 Q-networks
        num_qvalue_nets=2,
        alpha_init=1.0,
        target_entropy="auto",      # -dim(action)
        loss_function="smooth_l1",
        delay_qvalue=True,          # use target Q-networks
    )
    # tau is the Polyak step size (tau = 1 - eps).
    target_updater = SoftUpdate(loss_fn, tau=0.005)

    losses = loss_fn(batch)
    # losses["loss_actor"]:  maximize entropy-regularized Q
    # losses["loss_qvalue"]: Bellman backup
    # losses["loss_alpha"]:  temperature tuning
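To make two of these terms concrete, here is a scalar sketch of the textbook SAC quantities: the soft Bellman target built from the minimum of the twin Q-values, and the temperature loss that pushes policy entropy toward the target. It is a simplified illustration with hypothetical function names, not the code SACLoss runs.

```python
def sac_q_target(reward, done, next_q1, next_q2, next_log_prob, alpha, gamma=0.99):
    """Soft Bellman target: twin-Q minimum minus the entropy term."""
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value


def alpha_loss(log_alpha, log_prob, target_entropy):
    """Temperature loss; its gradient raises alpha when entropy is below target."""
    return -log_alpha * (log_prob + target_entropy)


action_dim = 2
target_entropy = -float(action_dim)  # the "auto" heuristic: -dim(action)

y = sac_q_target(reward=1.0, done=0.0, next_q1=5.0, next_q2=4.0,
                 next_log_prob=-1.0, alpha=0.2)
temperature_term = alpha_loss(log_alpha=0.5, log_prob=-1.0,
                              target_entropy=target_entropy)
```

Taking the minimum of the two Q-values counteracts overestimation, which is the reason for keeping twin Q-networks.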
Value estimators compute advantages and return targets from raw trajectory data. They populate the TensorDict with "advantage" and "value_target" keys before the loss is called.
Generalized Advantage Estimation. Standard choice for PPO and A2C.
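The GAE recursion itself is short enough to sketch in plain Python. The function below is a minimal single-trajectory illustration under the assumption that rewards, values, next-state values, and done flags are given as equal-length lists; it is not TorchRL's vectorized implementation, and the name gae is hypothetical.

```python
def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Return (advantages, value_targets) for one trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * not_done * next_values[t] - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


advantages, value_targets = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    next_values=[0.0, 0.0, 0.0],
    dones=[0.0, 0.0, 1.0],
    gamma=1.0,
    lmbda=1.0,
)
```

With gamma and lmbda both set to 1 and a zero critic, the advantages reduce to the undiscounted reward-to-go, which makes a convenient sanity check.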
    from torchrl.objectives.value import TDLambdaEstimator

    advantage_fn = TDLambdaEstimator(
        value_network=critic,
        gamma=0.99,
        lmbda=0.95,
    )
TD(λ) return estimator. More general than GAE; useful for off-policy corrections.
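A plain-Python sketch of the λ-return recursion, assuming a single trajectory stored as lists and bootstrapping from the last next-state value. The function name is hypothetical and the code illustrates the textbook recursion rather than TorchRL's implementation.

```python
def td_lambda_returns(rewards, next_values, dones, gamma=0.99, lmbda=0.95):
    """Backward recursion G_t = r_t + gamma * ((1 - lmbda) * V(s_{t+1}) + lmbda * G_{t+1})."""
    returns = [0.0] * len(rewards)
    next_return = next_values[-1]  # bootstrap beyond the end of the trajectory
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        blended = (1.0 - lmbda) * next_values[t] + lmbda * next_return
        returns[t] = rewards[t] + gamma * not_done * blended
        next_return = returns[t]
    return returns


one_step = td_lambda_returns([1.0, 2.0], [10.0, 20.0], [0.0, 0.0], gamma=0.5, lmbda=0.0)
full = td_lambda_returns([1.0, 2.0], [10.0, 20.0], [0.0, 0.0], gamma=0.5, lmbda=1.0)
```

Setting lmbda to 0 gives one-step TD targets and lmbda to 1 gives bootstrapped Monte Carlo returns; subtracting the current state value from a λ-return yields the corresponding GAE advantage.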
    from torchrl.objectives.value import VTrace

    advantage_fn = VTrace(
        gamma=0.99,
        actor_network=actor,   # required: used to compute log-probs for IS weights
        value_network=critic,
        rho_thresh=1.0,
        c_thresh=1.0,
    )
V-trace off-policy correction (IMPALA). Clips importance ratios for stability in distributed settings.
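The clipping can be illustrated with the V-trace target recursion from the IMPALA paper. The sketch assumes a single trajectory given as lists, with log_rhos holding the log importance ratios between target and behavior policy; the function name is hypothetical and this is not TorchRL's implementation.

```python
import math


def vtrace_targets(log_rhos, rewards, values, next_values, dones,
                   gamma=0.99, rho_thresh=1.0, c_thresh=1.0):
    """V-trace value targets with clipped importance ratios."""
    targets = [0.0] * len(rewards)
    correction = 0.0  # running value of v_{t+1} - V(s_{t+1})
    for t in reversed(range(len(rewards))):
        ratio = math.exp(log_rhos[t])
        rho = min(rho_thresh, ratio)  # clips the TD-error weight
        c = min(c_thresh, ratio)      # clips the trace coefficient
        not_done = 1.0 - dones[t]
        delta = rho * (rewards[t] + gamma * not_done * next_values[t] - values[t])
        correction = delta + gamma * not_done * c * correction
        targets[t] = values[t] + correction
    return targets


# On-policy data (log ratio 0) recovers ordinary bootstrapped returns.
on_policy = vtrace_targets([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 0.0],
                           [0.0, 1.0], gamma=1.0)
# A ratio of 4 is clipped to rho_thresh=1, which limits the size of the update.
clipped = vtrace_targets([math.log(4.0)], [1.0], [0.5], [0.0], [1.0])
```

Lowering rho_thresh or c_thresh trades bias for lower variance when the behavior policy lags far behind the learner.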
    from torchrl.objectives.value import MultiAgentGAE

    advantage_fn = MultiAgentGAE(
        gamma=0.99,
        lmbda=0.95,
        value_network=critic,
        agent_dim=-2,   # dimension holding the agent index in the value tensor
    )
Multi-agent extension of GAE. Broadcasts team-level rewards/dones to per-agent shape and writes per-agent advantages.
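The broadcasting step can be pictured with nested lists. The sketch below assumes a [time, agent] layout, repeats team-level rewards and dones once per agent, and computes a one-step TD error for every agent; the helper names and layout are illustrative assumptions rather than the library's internals.

```python
def broadcast_to_agents(team_values, n_agents):
    """Repeat each team-level scalar once per agent: [T] -> [T, n_agents]."""
    return [[value] * n_agents for value in team_values]


def per_agent_td_error(rewards, dones, values, next_values, gamma=0.99):
    """One-step TD error for every (time, agent) pair."""
    return [
        [r + gamma * (1.0 - d) * nv - v
         for r, d, v, nv in zip(r_row, d_row, v_row, nv_row)]
        for r_row, d_row, v_row, nv_row in zip(rewards, dones, values, next_values)
    ]


n_agents = 2
rewards = broadcast_to_agents([1.0, 0.0], n_agents)  # shared team reward
dones = broadcast_to_agents([0.0, 1.0], n_agents)    # shared episode end
values = [[0.5, 0.2], [0.1, 0.3]]                    # per-agent critic outputs
next_values = [[0.1, 0.3], [0.0, 0.0]]
td_errors = per_agent_td_error(rewards, dones, values, next_values, gamma=1.0)
```

Each agent receives the same reward signal but its own advantage, because the critic output differs per agent.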
Multi-Agent PPO with a centralized critic. Reads per-agent observations
under ("agents", "observation") and expects a group_map to identify
agent groups.
IPPOLoss
Independent PPO: each agent has its own actor and critic; no shared
parameters across agents.
QMixerLoss
QMIX / VDN monotonic mixing network that combines per-agent Q-values
into a joint Q-value for cooperative tasks.
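The two mixing schemes can be contrasted in a few lines. VDN sums the per-agent Q-values, while QMIX passes them through a mixing network whose weights are constrained to be non-negative. The sketch below reduces that network to a single linear layer with fixed weights, which is a deliberate simplification: the real QMIX mixer generates its weights from the global state with hypernetworks. All names here are hypothetical.

```python
def vdn_mix(agent_qs):
    """VDN: the joint Q-value is the plain sum of per-agent Q-values."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, weights, bias):
    """QMIX-style mixing, simplified to one layer.

    Taking abs() of the weights keeps the joint Q-value non-decreasing
    in every agent's Q-value.
    """
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


joint_vdn = vdn_mix([1.0, 2.0, 3.0])
joint_qmix = monotonic_mix([1.0, 2.0], weights=[-0.5, 2.0], bias=0.1)
```

Monotonicity means each agent can act greedily on its own Q-value without lowering the joint Q-value, which is what permits decentralized execution in cooperative tasks.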