Offline RL and imitation learning objectives train policies from a static dataset of previously collected transitions, without any further interaction with the environment. TorchRL provides a full suite of these objectives (IQLLoss, CQLLoss, DiscreteCQLLoss, DiscreteIQLLoss, BCLoss, TD3BCLoss, DTLoss, OnlineDTLoss, DiffusionBCLoss, ACTLoss, and GAILLoss), all sharing the same LossModule interface and TensorDict-based I/O.
IQLLoss
Implicit Q-Learning (Kostrikov et al. 2021) is a state-of-the-art offline RL algorithm that never evaluates the Q-function on actions outside the dataset: instead of solving the constrained maximization problem explicitly, it fits the value function by expectile regression against the Q-function. This eliminates the out-of-distribution action queries that destabilize other offline methods. The IQL objective consists of three components:
- Value loss: expectile regression of V(s) against Q(s, a)
- Q-value loss: standard Bellman regression using V(s’) as the target
- Actor loss: advantage-weighted behavior cloning
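The three components above can be sketched for a single transition in plain Python. This is only an illustration of the formulas, not TorchRL's implementation: the function names, the scalar inputs, the default hyperparameters, and the weight clipping are all assumptions made for the example.

```python
import math


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared error: |tau - 1(diff < 0)| * diff**2."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff ** 2


def iql_losses(q_sa, v_s, v_next, reward, done, log_prob,
               gamma=0.99, tau=0.7, beta=3.0, max_weight=100.0):
    """Return (loss_value, loss_qvalue, loss_actor) for one transition.

    All arguments are plain floats; the names are hypothetical.
    """
    # Value loss: expectile regression of V(s) toward Q(s, a).
    loss_value = expectile_loss(q_sa - v_s, tau)
    # Q-value loss: Bellman regression with V(s') as the bootstrap target.
    target = reward + gamma * (1.0 - done) * v_next
    loss_qvalue = (q_sa - target) ** 2
    # Actor loss: log-likelihood of the dataset action, weighted by
    # exp(beta * advantage). Clipping the weight is an assumption for stability.
    weight = min(math.exp(beta * (q_sa - v_s)), max_weight)
    loss_actor = -weight * log_prob
    return loss_value, loss_qvalue, loss_actor
```

With tau above 0.5, positive residuals (Q above V) are penalized more than negative ones, so V(s) is pushed toward an upper expectile of Q over dataset actions rather than its mean.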
Constructor Parameters
Stochastic policy network. During training, the actor is updated via
advantage-weighted behavioral cloning (AWR), which weights log π(a|s) by
exp(β · A(s, a)).
Q(s, a) parametric model(s). If a single module is passed it is duplicated
num_qvalue_nets times; otherwise parameters are stacked.
State value function V(s). The IQL algorithm requires a separate V-network;
if omitted the module raises an error.
Number of Q-networks. The minimum across the ensemble is used for value
targets, reducing overestimation.
Inverse temperature β for advantage-weighted actor updates. Larger values
make the actor more aggressive in following high-advantage actions.
Expectile τ ∈ (0.5, 1.0) for value regression. Higher τ (e.g. 0.9) is
critical for long-horizon tasks such as AntMaze that require dynamic
programming (“stitching”).
Loss for the Q-value regression residual.
Output Keys
| Key | Description |
|---|---|
| loss_actor | Advantage-weighted behavioral cloning loss |
| loss_qvalue | Q-function Bellman regression loss |
| loss_value | Expectile regression loss for the V-network |
| entropy | Policy entropy (for logging) |
IQL Training Example
CQLLoss and DiscreteCQLLoss
Conservative Q-Learning (Kumar et al. 2020) regularizes the Q-function by adding a penalty that pushes down Q-values on out-of-distribution actions and pushes up Q-values on in-distribution (dataset) actions. This conservative constraint prevents the policy from exploiting erroneously high Q-values for unseen actions.
Constructor Parameters
Stochastic policy network (SAC-style).
Q(s, a) network(s). CQL uses two Q-networks by default.
Initial entropy temperature. Tuned automatically when fixed_alpha=False.
CQL temperature for the logsumexp penalty on random actions.
Weight of the conservative CQL penalty relative to the standard TD loss.
If True, a Lagrange multiplier is learned to adaptively balance the CQL
penalty against the TD error.
Threshold for the Lagrange multiplier. Active only when with_lagrange=True.
Number of random actions sampled per state for computing the CQL penalty.
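The conservative penalty described above can be sketched for a single state as a temperature-scaled logsumexp over Q-values of sampled actions minus the Q-value of the dataset action. This is a simplified illustration, not TorchRL's implementation: the function and argument names are hypothetical, and the importance-sampling corrections used by the full algorithm are omitted.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_sampled, q_data, temperature=1.0, weight=1.0):
    """Conservative penalty for one state.

    q_sampled: Q(s, a_i) for randomly sampled actions a_i (hypothetical input).
    q_data:    Q(s, a) for the action stored in the dataset.
    """
    # Soft maximum over sampled actions: pushes down large out-of-distribution Q-values.
    soft_max_q = temperature * logsumexp([q / temperature for q in q_sampled])
    # Subtracting the dataset Q-value pushes in-distribution values up.
    return weight * (soft_max_q - q_data)
```

The penalty shrinks when the dataset action already has a higher Q-value than the sampled actions, which is the behavior the regularizer rewards.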
Output Keys
| Key | Description |
|---|---|
| loss_actor | SAC actor loss |
| loss_qvalue | Q-function TD regression loss |
| loss_cql | Conservative Q-learning penalty |
| loss_actor_bc | Behavioral cloning regularization component |
| loss_alpha | Temperature loss |
| alpha | Current temperature |
| entropy | Policy entropy |
For discrete action spaces, use DiscreteCQLLoss, which takes a QValueActor instead of a stochastic actor.
BCLoss
Behavior Cloning (BC) trains a policy to imitate a demonstrator by minimizing the negative log-likelihood (or a surrogate loss) of expert actions. It is the simplest offline approach and serves as a strong baseline for dense-reward tasks.
Constructor Parameters
The actor to be trained. Works with both stochastic policies (minimizes NLL)
and deterministic policies (minimizes reconstruction loss). Any module that
implements get_dist() is handled as stochastic.
Loss function used when the actor is deterministic (non-distribution-based).
One of "l1", "l2", "mse", "smooth_l1", "cross_entropy", or a custom callable.
When None, the loss defaults to the negative log-likelihood of the policy
distribution (NLL for stochastic actors).
Reduction applied to the element-wise losses. One of "none", "mean", "sum".
Input Keys (via set_keys)
Expert action key in the dataset TensorDict. Also selects the key where the
actor writes its prediction.
Boolean mask marking padded action timesteps to exclude from the loss (e.g.
"action_is_pad" from chunked / VLA-style behavior cloning). True = padded
(excluded).
Output Keys
| Key | Description |
|---|---|
| loss_bc | Behavior cloning loss (NLL or MSE) |
BC Examples
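As a framework-free sketch of the two loss modes described above, the snippet below computes a Gaussian negative log-likelihood (the stochastic case) and a squared-error loss with a padding mask (the deterministic case). It does not call BCLoss itself; the helper names and the toy data are hypothetical and only illustrate the arithmetic.

```python
import math


def gaussian_nll(action, mean, std):
    """Negative log-likelihood of a scalar action under N(mean, std**2)."""
    return 0.5 * math.log(2.0 * math.pi * std ** 2) + (action - mean) ** 2 / (2.0 * std ** 2)


def masked_mean(losses, is_pad):
    """Mean over timesteps whose padding flag is False (True = excluded)."""
    kept = [loss for loss, pad in zip(losses, is_pad) if not pad]
    return sum(kept) / len(kept) if kept else 0.0


# Toy action chunk: the last timestep is padding and must not affect the loss.
expert = [0.5, -0.2, 0.0]
predicted = [0.4, 0.1, 9.9]
is_pad = [False, False, True]

# Deterministic actor: element-wise squared error, then masked mean reduction.
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, expert)]
loss_bc = masked_mean(squared_errors, is_pad)

# Stochastic actor: NLL of the expert action under the predicted distribution.
nll = masked_mean([gaussian_nll(a, m, 1.0) for a, m in zip(expert, predicted)], is_pad)
```

The padded prediction (9.9) is far from its target, yet it leaves both losses unchanged because the mask drops it before the reduction.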
TD3BCLoss
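TD3+BC (Fujimoto & Gu 2021) combines TD3 with a behavioral cloning regularization term. The actor loss adds a behavior cloning term to the TD3 actor objective, with the Q-value term scaled by a coefficient λ. λ normalizes the Q-value term so the BC and RL components remain in balance.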
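A minimal sketch of this actor loss on a small batch, following the formulation in the TD3+BC paper: the Q-value term is scaled by λ = α / mean|Q| and a squared-error term pulls the policy action toward the dataset action. The function name, the scalar actions, and the default α are assumptions for illustration; this is not the TD3BCLoss code.

```python
def td3_bc_actor_loss(q_pi, policy_actions, data_actions, alpha=2.5):
    """Actor loss: -lambda * mean(Q(s, pi(s))) + mean((pi(s) - a)**2).

    q_pi:           Q(s, pi(s)) for each state in the batch.
    policy_actions: pi(s), scalar actions for simplicity.
    data_actions:   actions stored in the dataset.
    """
    n = len(q_pi)
    # lambda rescales the Q term by the average Q magnitude; in a gradient-based
    # implementation it would be treated as a constant (no gradient through it).
    mean_abs_q = sum(abs(q) for q in q_pi) / n
    lam = alpha / max(mean_abs_q, 1e-8)
    mean_q = sum(q_pi) / n
    bc_term = sum((p - a) ** 2 for p, a in zip(policy_actions, data_actions)) / n
    return -lam * mean_q + bc_term
```

Because λ divides by the average Q magnitude, multiplying all Q-values by a positive constant leaves the loss unchanged, which is what keeps the two terms on a comparable scale.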
DTLoss and OnlineDTLoss
Decision Transformer (Chen et al. 2021) formulates RL as a sequence modelling problem: given the context of past states, actions, and return-to-go tokens, the transformer predicts the next action.
DTLoss is the offline variant; OnlineDTLoss extends it to online fine-tuning with entropy regularization.
OnlineDTLoss accepts the same alpha_init, min_alpha, max_alpha, fixed_alpha, and target_entropy arguments as SACLoss. Its output includes loss_actor, loss_alpha, and entropy.
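The return-to-go tokens that Decision Transformer conditions on are the sums of future rewards from each timestep onward. A minimal sketch of how they could be computed from a reward sequence is shown below; the helper names are hypothetical and this is not a TorchRL utility.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: the (optionally discounted) sum of future rewards."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end of the trajectory backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def context_window(sequence, length):
    """Keep only the most recent `length` tokens of a sequence."""
    return sequence[-length:]


rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)       # [3.0, 2.0, 2.0]
recent = context_window(rtg, 2)    # [2.0, 2.0]
```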
The Decision Transformer input format is different from standard RL: the TensorDict must include "return_to_go" tokens alongside observations and actions, arranged in a sequence context window.
DiffusionBCLoss
DiffusionBCLoss trains a diffusion-based behaviour cloning policy (e.g. Diffusion Policy, Chi et al. 2023). The denoising score-matching objective minimizes the prediction error of the reverse diffusion process.
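One common form of this objective, assumed here because the text does not pin down the parameterization, is the DDPM noise-prediction loss: noise an expert action, ask the network to predict the noise, and take the mean squared error. The sketch below uses hypothetical names and a caller-supplied `predict_noise` function standing in for the policy network.

```python
import math
import random


def noisy_action(action, noise, alpha_bar):
    """Forward diffusion: sqrt(alpha_bar) * action + sqrt(1 - alpha_bar) * noise."""
    return [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
            for a, e in zip(action, noise)]


def denoising_loss(predict_noise, action, obs, alpha_bar, rng):
    """Mean squared error between the sampled noise and the predicted noise.

    predict_noise(noisy, obs, alpha_bar) is a stand-in for the denoising network.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy = noisy_action(action, noise, alpha_bar)
    predicted = predict_noise(noisy, obs, alpha_bar)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(action)


# Usage with a trivial predictor that always outputs zeros.
rng = random.Random(0)
loss = denoising_loss(lambda noisy, obs, ab: [0.0] * len(noisy),
                      action=[0.5, -0.5], obs=None, alpha_bar=0.5, rng=rng)
```

In training, `alpha_bar` would be drawn from the noise schedule at a random diffusion step for each sample; here it is fixed to keep the example short.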
ACTLoss
ACTLoss trains an Action Chunking Transformer (Zhao et al. 2023) for robot manipulation. ACT predicts a chunk of future actions together with a variational latent that captures multi-modal behavior. The loss combines an MSE reconstruction term with a KL regularization on the latent.
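The combination described above can be sketched as a mean squared error over the action chunk plus a weighted KL divergence between the latent posterior N(μ, σ²) and a standard normal prior. The function names, the flat-list inputs, and the default KL weight are assumptions for illustration, not the ACTLoss signature.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv for m, lv in zip(mu, log_var))


def act_loss(pred_chunk, target_chunk, mu, log_var, kl_weight=10.0):
    """Return (total, reconstruction, kl) for one flattened action chunk."""
    # Reconstruction: mean squared error over every action value in the chunk.
    recon = sum((p - t) ** 2 for p, t in zip(pred_chunk, target_chunk)) / len(pred_chunk)
    # Regularization: keeps the latent posterior close to the prior.
    kl = kl_to_standard_normal(mu, log_var)
    return recon + kl_weight * kl, recon, kl
```

A larger `kl_weight` pulls the latent toward the prior, which makes sampling from the prior at inference time more reliable at the cost of less expressive latents.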
GAILLoss
Generative Adversarial Imitation Learning (Ho & Ermon 2016) trains a discriminator to distinguish expert trajectories from policy-generated ones, then uses the discriminator output as a surrogate reward. The policy is trained with any on-policy RL algorithm using these learned rewards.
The discriminator maximizes log D(s, a) + log(1 − D(s̃, ã)), where (s, a) are expert transitions and (s̃, ã) are policy-generated ones, while the policy gradient loss minimizes −log D(s̃, ã), encouraging the policy to produce transitions the discriminator cannot distinguish from expert data.
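A minimal sketch of these two quantities, assuming the discriminator outputs a probability D in (0, 1) that a transition comes from the expert. The function names and the small `eps` guard against log(0) are hypothetical and not part of GAILLoss.

```python
import math


def discriminator_loss(d_expert, d_policy, eps=1e-8):
    """Negative of log D(expert) + log(1 - D(policy)), averaged per batch.

    Minimizing this loss is the same as maximizing the discriminator objective.
    """
    expert_term = sum(math.log(d + eps) for d in d_expert) / len(d_expert)
    policy_term = sum(math.log(1.0 - d + eps) for d in d_policy) / len(d_policy)
    return -(expert_term + policy_term)


def surrogate_reward(d_policy_sample, eps=1e-8):
    """Reward log D(s, a): higher when the transition looks expert-like."""
    return math.log(d_policy_sample + eps)
```

An undecided discriminator (D = 0.5 everywhere) gives a loss of about 2·log 2, and the loss drops as it separates expert from policy transitions; the policy is then rewarded for transitions that receive a high D.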
Choosing an Offline Objective
- IQLLoss: Best overall offline algorithm. Avoids OOD action queries entirely and achieves strong performance on D4RL benchmarks. Start here.
- CQLLoss
- BCLoss
- TD3BCLoss