Offline RL and imitation learning objectives train policies from a static dataset of previously collected transitions, without any further interaction with the environment. TorchRL provides a full suite of these objectives — IQLLoss, CQLLoss, DiscreteCQLLoss, DiscreteIQLLoss, BCLoss, TD3BCLoss, DTLoss, OnlineDTLoss, DiffusionBCLoss, ACTLoss, and GAILLoss — all sharing the same LossModule interface and TensorDict-based I/O.

IQLLoss

Implicit Q-Learning (Kostrikov et al. 2021) is an offline RL algorithm that never evaluates the Q-function on actions outside the dataset. Instead of explicitly solving a constrained maximization over actions, it fits V(s) to an upper expectile of Q(s, a) over dataset actions. This eliminates the out-of-distribution action queries that destabilize other offline methods. The IQL objective consists of three components:
  • Value loss: expectile regression of V(s) against Q(s, a)
  • Q-value loss: standard Bellman regression using V(s’) as the target
  • Actor loss: advantage-weighted behavior cloning
from torchrl.objectives import IQLLoss

loss_module = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    num_qvalue_nets=2,
    temperature=3.0,
    expectile=0.7,
    loss_function="smooth_l1",
)
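
The value component listed above relies on expectile regression. The following is a minimal pure-Python sketch of that asymmetric squared error, assuming the residual is Q(s, a) − V(s); the function names are illustrative and are not part of TorchRL.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared error: weight tau for positive residuals, 1 - tau otherwise."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2


def value_loss(q_values, v_values, tau=0.7):
    """Mean expectile loss over a batch, with residual Q(s, a) - V(s)."""
    losses = [expectile_loss(q - v, tau) for q, v in zip(q_values, v_values)]
    return sum(losses) / len(losses)


q = [1.0, 2.0, 0.5]
v = [0.5, 2.5, 0.5]
print(value_loss(q, v, tau=0.7))
```

With tau above 0.5, underestimating Q is penalized more than overestimating it, so V(s) is pulled toward the better actions present in the dataset.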

Constructor Parameters

actor_network
ProbabilisticTensorDictSequential
required
Stochastic policy network. During training, the actor is updated via advantage-weighted behavioral cloning (as in AWR), maximizing exp(β · A(s, a)) · log π(a|s), where β is the temperature argument.
qvalue_network
TensorDictModule | list[TensorDictModule]
required
Q(s, a) parametric model(s). If a single module is passed it is duplicated num_qvalue_nets times; otherwise parameters are stacked.
value_network
TensorDictModule | None
default:"None"
State value function V(s). The IQL algorithm requires a separate V-network; if omitted the module raises an error.
num_qvalue_nets
int
default:"2"
Number of Q-networks. The minimum across the ensemble is used for value targets, reducing overestimation.
temperature
float
default:"1.0"
Inverse temperature β for advantage-weighted actor updates. Larger values make the actor more aggressive in following high-advantage actions.
expectile
float
default:"0.5"
Expectile τ ∈ [0.5, 1.0) for value regression. τ = 0.5 reduces to ordinary symmetric squared-error regression. Higher τ (e.g. 0.9) is critical for long-horizon tasks such as AntMaze that require dynamic programming (“stitching”).
loss_function
str
default:"\"smooth_l1\""
Loss for the Q-value regression residual.
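
To illustrate how temperature shapes the actor update described in the parameters above, here is a small sketch of advantage weighting. It assumes the weight is exp(temperature * advantage), consistent with the "inverse temperature" description. The cap of 100 mirrors the reference IQL implementation and is an assumption of this sketch, not a documented IQLLoss argument.

```python
import math


def awr_weight(advantage, temperature, max_weight=100.0):
    """exp(temperature * advantage), capped at max_weight (assumed cap, see lead-in)."""
    # Clip the exponent rather than the result so math.exp cannot overflow.
    exponent = min(temperature * advantage, math.log(max_weight))
    return math.exp(exponent)


def actor_loss(log_probs, advantages, temperature=3.0):
    """Negative advantage-weighted log-likelihood, averaged over the batch."""
    terms = [-awr_weight(a, temperature) * lp for lp, a in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


for beta in (1.0, 3.0, 10.0):
    print(beta, [round(awr_weight(a, beta), 3) for a in (-0.5, 0.0, 0.5)])
```

Larger temperatures concentrate the weight on high-advantage dataset actions, while a temperature near zero gives every action a weight near 1, which is plain behavior cloning.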

Output Keys

Key | Description
loss_actor | Advantage-weighted behavioral cloning loss
loss_qvalue | Q-function Bellman regression loss
loss_value | Expectile regression loss for the V-network
entropy | Policy entropy (for logging)

IQL Training Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import IQLLoss
from torchrl.objectives.utils import SoftUpdate

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Actor
actor = ProbabilisticActor(
    module=SafeModule(
        nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor()),
        in_keys=["observation"], out_keys=["loc", "scale"],
    ),
    in_keys=["loc", "scale"], spec=spec, distribution_class=TanhNormal,
)

# Q-value and state-value networks
qvalue = ValueOperator(
    module=nn.Linear(n_obs + n_act, 1),
    in_keys=["observation", "action"],
    out_keys=["state_action_value"],
)
value = ValueOperator(
    module=nn.Linear(n_obs, 1),
    in_keys=["observation"],
    out_keys=["state_value"],
)

loss_module = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    temperature=3.0,
    expectile=0.7,
)

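# tau is the Polyak step size (tau = 1 - eps); passing eps=0.005 would make the
# target network track the online network almost exactly.
updater = SoftUpdate(loss_module, tau=0.005)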

optim = torch.optim.Adam(loss_module.parameters(), lr=3e-4)

# Training loop (offline: no env interaction). `dataloader` is assumed to yield
# TensorDict batches of dataset transitions.
for batch in dataloader:
    loss_td = loss_module(batch)
    loss = (
        loss_td["loss_actor"] +
        loss_td["loss_qvalue"] +
        loss_td["loss_value"]
    )
    loss.backward()
    optim.step()
    optim.zero_grad()
    updater.step()
For AntMaze and other navigation tasks requiring stitching, use expectile=0.9 and temperature=10.0. For D4RL locomotion benchmarks (HalfCheetah, Hopper, Walker2D), expectile=0.7 and temperature=3.0 are typical defaults.
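
The two settings quoted above can be kept in a small lookup table. This sketch is illustrative only: IQL_PRESETS and iql_kwargs are hypothetical names, not part of TorchRL.

```python
# Hypothetical presets collecting the expectile/temperature pairs quoted above.
IQL_PRESETS = {
    "antmaze": {"expectile": 0.9, "temperature": 10.0},
    "locomotion": {"expectile": 0.7, "temperature": 3.0},
}


def iql_kwargs(task_family):
    """Return a copy of the preset so callers can modify it safely."""
    return dict(IQL_PRESETS[task_family])


print(iql_kwargs("antmaze"))
```

The returned dictionary could then be unpacked into the constructor, e.g. IQLLoss(actor, qvalue, value, **iql_kwargs("locomotion")).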

CQLLoss and DiscreteCQLLoss

Conservative Q-Learning (Kumar et al. 2020) regularizes the Q-function by adding a penalty that pushes down Q-values on out-of-distribution actions and pushes up Q-values on in-distribution (dataset) actions. This conservative constraint prevents the policy from exploiting erroneously high Q-values for unseen actions.
from torchrl.objectives import CQLLoss

loss_module = CQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
    temperature=1.0,
    min_q_weight=1.0,
    with_lagrange=False,
)
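
The conservative penalty can be sketched for a single state as follows. This is a simplified illustration, assuming the penalty is a temperature-scaled log-sum-exp over Q-values of sampled actions minus the Q-value of the dataset action. It omits the importance-sampling corrections and policy-sampled actions of the full CQL objective, and the function names are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_sampled, q_data, temperature=1.0, min_q_weight=1.0):
    """Soft maximum of Q over sampled actions minus Q at the dataset action."""
    soft_max = temperature * logsumexp([q / temperature for q in q_sampled])
    return min_q_weight * (soft_max - q_data)


# Raising Q on a sampled (possibly out-of-distribution) action raises the penalty.
print(cql_penalty([0.0, 0.0], q_data=0.0))
print(cql_penalty([5.0, 0.0], q_data=0.0))
```

Minimizing this term lowers Q-values on the sampled actions and raises them on dataset actions, which is the push-down/push-up behavior described above.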

Constructor Parameters

actor_network
ProbabilisticTensorDictSequential
required
Stochastic policy network (SAC-style).
qvalue_network
TensorDictModule | list[TensorDictModule]
required
Q(s, a) network(s). CQL uses two Q-networks by default.
alpha_init
float
default:"1.0"
Initial entropy temperature. Tuned automatically when fixed_alpha=False.
temperature
float
default:"1.0"
CQL temperature for the logsumexp penalty on random actions.
min_q_weight
float
default:"1.0"
Weight of the conservative CQL penalty relative to the standard TD loss.
with_lagrange
bool
default:"False"
If True, a Lagrange multiplier is learned to adaptively balance the CQL penalty against the TD error.
lagrange_thresh
float
default:"0.0"
Threshold for the Lagrange multiplier. Active only when with_lagrange=True.
num_random
int
default:"10"
Number of random actions sampled per state for computing the CQL penalty.

Output Keys

Key | Description
loss_actor | SAC actor loss
loss_qvalue | Q-function TD regression loss
loss_cql | Conservative Q-learning penalty
loss_actor_bc | Behavioral cloning regularization component
loss_alpha | Temperature loss
alpha | Current temperature
entropy | Policy entropy
For discrete action spaces use DiscreteCQLLoss, which takes a QValueActor instead of a stochastic actor:
from torchrl.objectives import DiscreteCQLLoss

loss_module = DiscreteCQLLoss(
    value_network=q_actor,
    action_space=spec,
    loss_function="smooth_l1",
)

BCLoss

Behavior Cloning (BC) trains a policy to imitate a demonstrator by minimizing the negative log-likelihood (or a surrogate loss) of expert actions. It is the simplest offline approach and serves as a strong baseline for dense-reward tasks.
from torchrl.objectives import BCLoss

loss_module = BCLoss(actor_network=actor)

Constructor Parameters

actor_network
TensorDictModule
required
The actor to be trained. Works with both stochastic policies (minimizes NLL) and deterministic policies (minimizes reconstruction loss). Any module that implements get_dist() is handled as stochastic.
loss_function
str | Callable | None
default:"None"
Loss function used when the actor is deterministic (non-distribution-based). One of "l1", "l2", "mse", "smooth_l1", "cross_entropy", or a custom callable. When None, a stochastic actor is trained with the negative log-likelihood of the expert action under the policy distribution.
reduction
str
default:"\"mean\""
Reduction applied to the element-wise losses. One of "none", "mean", "sum".

Input Keys (via set_keys)

action
NestedKey
default:"\"action\""
Expert action key in the dataset TensorDict. Also selects the key where the actor writes its prediction.
pad_mask
NestedKey | None
default:"None"
Boolean mask marking padded action timesteps to exclude from the loss (e.g. "action_is_pad" from chunked / VLA-style behavior cloning). True = padded (excluded).
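
The effect of pad_mask can be pictured as a masked mean over per-timestep losses. The sketch below assumes padded steps are dropped before averaging; how BCLoss normalizes internally is not stated here, and masked_mean is a hypothetical helper.

```python
def masked_mean(losses, pad_mask):
    """Average the losses whose mask entry is False (True marks padding)."""
    kept = [loss for loss, is_pad in zip(losses, pad_mask) if not is_pad]
    if not kept:
        return 0.0  # every step was padding; nothing contributes
    return sum(kept) / len(kept)


per_step_loss = [1.0, 3.0, 100.0]
action_is_pad = [False, False, True]  # last step of the chunk is padding
print(masked_mean(per_step_loss, action_is_pad))
```

Without the mask, the padded step would dominate the average even though it carries no expert action.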

Output Keys

Key | Description
loss_bc | Behavior cloning loss (NLL or MSE)

BC Examples

import torch
from torch import nn
from torchrl.data.tensor_specs import Bounded
from torchrl.modules.tensordict_module.actors import ProbabilisticActor
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.objectives import BCLoss

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
actor = ProbabilisticActor(
    module=SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"], spec=spec, distribution_class=TanhNormal,
)

loss = BCLoss(actor_network=actor)  # minimizes -log π(a_expert | s)

TD3BCLoss

TD3+BC (Fujimoto & Gu 2021) combines TD3 with a behavioral cloning regularization term. The actor loss is:
loss_actor = -λ * Q(s, π(s)) + MSE(π(s), a_dataset)
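where λ = α / mean|Q| (with α the alpha argument and the mean taken over the batch) rescales the Q-value term so that the RL and BC components remain in balance regardless of the scale of the Q-values.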
from torchrl.objectives import TD3BCLoss

loss_module = TD3BCLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    action_spec=spec,
    policy_noise=0.2,
    noise_clip=0.5,
    alpha=2.5,   # α in λ = α / mean|Q|; sets the balance between the Q term and BC
)
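
The actor loss formula above can be worked through numerically. This sketch assumes λ = alpha / mean|Q| as in Fujimoto & Gu (2021) and treats λ as a constant (in a real implementation it would be detached from the graph). The function name and list-based inputs are illustrative.

```python
def td3bc_actor_loss(q_pi, pi_actions, data_actions, alpha=2.5):
    """-lambda * mean(Q(s, pi(s))) + MSE(pi(s), a_dataset) for one batch."""
    mean_abs_q = sum(abs(q) for q in q_pi) / len(q_pi)  # assumed non-zero
    lam = alpha / mean_abs_q
    rl_term = -lam * sum(q_pi) / len(q_pi)
    squared = [
        (p - a) ** 2
        for ps, acts in zip(pi_actions, data_actions)
        for p, a in zip(ps, acts)
    ]
    bc_term = sum(squared) / len(squared)
    return rl_term + bc_term


q_pi = [1.0, 3.0]               # Q(s, pi(s)) for two states
pi_actions = [[0.0], [1.0]]     # actions proposed by the policy
data_actions = [[0.0], [0.0]]   # actions stored in the dataset
print(td3bc_actor_loss(q_pi, pi_actions, data_actions))
```

Because λ divides by the mean absolute Q-value, multiplying every Q-value by a constant leaves the RL term unchanged, which is what keeps the two terms comparable.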

DTLoss and OnlineDTLoss

Decision Transformer (Chen et al. 2021) formulates RL as a sequence modelling problem: given the context of past states, actions, and return-to-go tokens, the transformer predicts the next action. DTLoss is the offline variant; OnlineDTLoss extends it to online fine-tuning with entropy regularization.
from torchrl.objectives import DTLoss, OnlineDTLoss

# Offline Decision Transformer
offline_loss = DTLoss(actor_network=actor)

# Online Decision Transformer (with entropy-tuned alpha)
online_loss = OnlineDTLoss(
    actor_network=actor,
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
)
OnlineDTLoss accepts the same alpha_init, min_alpha, max_alpha, fixed_alpha, and target_entropy arguments as SACLoss. Its output includes loss_actor, loss_alpha, and entropy.
The Decision Transformer input format is different from standard RL: the TensorDict must include "return_to_go" tokens alongside observations and actions, arranged in a sequence context window.
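
Return-to-go at each step is the sum of rewards from that step to the end of the trajectory. A minimal sketch of computing it from a reward list follows; the helper name is hypothetical, and how the result is stored under "return_to_go" depends on the dataset pipeline.

```python
def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards; gamma=1.0 gives the undiscounted return-to-go."""
    rtg = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        rtg.append(running)
    rtg.reverse()
    return rtg


print(returns_to_go([1.0, 2.0, 3.0]))  # [6.0, 5.0, 3.0]
```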

DiffusionBCLoss

DiffusionBCLoss trains a diffusion-based behaviour cloning policy (e.g. Diffusion Policy, Chi et al. 2023). The denoising score-matching objective minimizes the prediction error of the reverse diffusion process:
from torchrl.objectives import DiffusionBCLoss

loss_module = DiffusionBCLoss(actor_network=diffusion_actor)
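
As a rough picture of a denoising objective, the sketch below assumes a DDPM-style noise-prediction loss: an expert action is corrupted with Gaussian noise and the network is scored on how well it predicts that noise. The parameterization used by DiffusionBCLoss is not stated here, so predict_noise, alpha_bar, and the function names are all assumptions.

```python
import math
import random


def add_noise(action, noise, alpha_bar):
    """Forward diffusion step: sqrt(alpha_bar) * a + sqrt(1 - alpha_bar) * eps."""
    return [
        math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
        for a, e in zip(action, noise)
    ]


def denoising_loss(predict_noise, action, alpha_bar, rng):
    """MSE between the sampled noise and the model's noise prediction."""
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    noisy_action = add_noise(action, noise, alpha_bar)
    predicted = predict_noise(noisy_action, alpha_bar)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)


def zero_model(noisy_action, alpha_bar):
    """Placeholder 'network' that always predicts zero noise."""
    return [0.0 for _ in noisy_action]


print(denoising_loss(zero_model, [0.5, -0.2], alpha_bar=0.5, rng=random.Random(0)))
```

In practice the prediction would come from a network conditioned on the observation and the diffusion timestep.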

ACTLoss

ACTLoss trains an Action Chunking Transformer (Zhao et al. 2023) for robot manipulation. ACT predicts a chunk of future actions together with a variational latent that captures multi-modal behavior. The loss combines an MSE reconstruction term with a KL regularization on the latent:
from torchrl.objectives import ACTLoss

loss_module = ACTLoss(
    actor_network=act_module,
    kl_coeff=10.0,   # weight of the KL term
)
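
The two terms described above can be sketched numerically. This assumes a diagonal Gaussian latent regularized toward a standard normal and an MSE reconstruction over the action chunk, as the text states. The function names and list-based inputs are illustrative, not the ACTLoss internals.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def act_loss(pred_chunk, target_chunk, mu, log_var, kl_coeff=10.0):
    """MSE over every action in the chunk plus kl_coeff times the latent KL."""
    squared = [
        (p - t) ** 2
        for pred_step, target_step in zip(pred_chunk, target_chunk)
        for p, t in zip(pred_step, target_step)
    ]
    mse = sum(squared) / len(squared)
    return mse + kl_coeff * kl_to_standard_normal(mu, log_var)


pred = [[1.0], [0.0]]     # predicted chunk of two 1-D actions
target = [[0.0], [0.0]]   # expert chunk
print(act_loss(pred, target, mu=[1.0], log_var=[0.0], kl_coeff=10.0))
```

A larger kl_coeff pulls the latent closer to the prior, at the cost of a looser fit to the expert chunk.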

GAILLoss

Generative Adversarial Imitation Learning (Ho & Ermon 2016) trains a discriminator to distinguish expert trajectories from policy-generated ones, then uses the discriminator output as a surrogate reward. The policy is trained with any on-policy RL algorithm using these learned rewards.
from torchrl.objectives import GAILLoss

gail_loss = GAILLoss(
    actor_network=actor,
    discriminator_network=discriminator,
)
The discriminator is trained to maximize log D(s, a) + log(1 − D(s̃, ã)) while the policy gradient loss minimizes −log D(s̃, ã), encouraging the policy to produce transitions the discriminator cannot distinguish from expert data.
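
The objective above can be written as a binary cross-entropy over discriminator outputs. The sketch below assumes D outputs a probability in (0, 1) that a transition is expert data, and that the surrogate reward is log D, matching the −log D policy loss in the text. The helper names and the eps constant are illustrative.

```python
import math


def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_expert, d_policy, eps=1e-8):
    """Negative of log D(expert) + log(1 - D(policy)), averaged per batch."""
    expert_term = mean([-math.log(d + eps) for d in d_expert])
    policy_term = mean([-math.log(1.0 - d + eps) for d in d_policy])
    return expert_term + policy_term


def surrogate_reward(d_policy_sample, eps=1e-8):
    """Reward for the policy: log D, so minimizing -log D maximizes reward."""
    return math.log(d_policy_sample + eps)


print(discriminator_loss([0.9, 0.8], [0.1, 0.2]))  # well-separated batches
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))  # indistinguishable batches
```

When the discriminator cannot tell the two sources apart, its output is 0.5 everywhere and the loss sits near 2 · log 2.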

Choosing an Offline Objective

IQLLoss: a strong default for offline RL. It avoids OOD action queries entirely and achieves strong performance on D4RL benchmarks. Start here.
loss = IQLLoss(actor, qvalue, value, expectile=0.7, temperature=3.0)
