Offline RL and Imitation Learning Objectives in TorchRL

Offline RL and imitation learning objectives train policies from a static dataset of previously collected transitions or demonstrations. Most of them need no further interaction with the environment; OnlineDTLoss (online fine-tuning) and GAILLoss (which compares expert data with policy-generated transitions) are the exceptions. TorchRL provides IQLLoss, CQLLoss, DiscreteCQLLoss, DiscreteIQLLoss, BCLoss, TD3BCLoss, DTLoss, OnlineDTLoss, DiffusionBCLoss, ACTLoss, and GAILLoss, all sharing the same LossModule interface and TensorDict-based I/O.

IQLLoss

Implicit Q-Learning (Kostrikov et al. 2021) is an offline RL algorithm that never evaluates the Q-function on actions outside the dataset. Instead of solving the constrained maximization over actions explicitly, it fits a state-value function V(s) by expectile regression against Q(s, a) on dataset actions. This eliminates the out-of-distribution action queries that destabilize other offline methods. The IQL objective consists of three components:

Value loss: expectile regression of V(s) against Q(s, a)
Q-value loss: standard Bellman regression using V(s') as the target
Actor loss: advantage-weighted behavior cloning
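As a rough illustration of the value loss listed above, expectile regression can be sketched in plain PyTorch. The helper name `expectile_loss` and the toy tensors are assumptions made for this example; they are not part of the TorchRL API, and IQLLoss computes its own version internally.

```python
import torch


def expectile_loss(diff: torch.Tensor, expectile: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, averaged over the batch.

    `diff` is Q(s, a) - V(s) evaluated on dataset actions only.
    """
    # Weight is (1 - tau) where V overshoots Q (diff < 0) and tau elsewhere.
    weight = torch.abs(expectile - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


# Toy values (hypothetical): three transitions with Q - V = [-1, 0, 1].
q = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 2.0, 2.0])
loss_value = expectile_loss(q - v, expectile=0.7)
```

With an expectile above 0.5, underestimating Q is penalized more than overestimating it, so V(s) is pulled toward the upper range of Q-values seen in the data for that state.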

from torchrl.objectives import IQLLoss

loss_module = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    num_qvalue_nets=2,
    temperature=3.0,
    expectile=0.7,
    loss_function="smooth_l1",
)

Constructor Parameters

actor_network

ProbabilisticTensorDictSequential

required

Stochastic policy network. During training, the actor is updated via advantage-weighted regression (AWR), which maximizes exp(β · A(s, a)) · log π(a|s) with A(s, a) = Q(s, a) − V(s) and β the temperature argument.

qvalue_network

TensorDictModule | list[TensorDictModule]

required

Q(s, a) parametric model(s). If a single module is passed it is duplicated num_qvalue_nets times; otherwise parameters are stacked.

value_network

TensorDictModule | None

default:"None"

State value function V(s). The IQL algorithm requires a separate V-network; if omitted the module raises an error.

num_qvalue_nets

int

default:"2"

Number of Q-networks. The minimum across the ensemble is used for value targets, reducing overestimation.

temperature

float

default:"1.0"

Inverse temperature β for advantage-weighted actor updates. Larger values make the actor more aggressive in following high-advantage actions.

expectile

float

default:"0.5"

Expectile τ ∈ [0.5, 1.0) for value regression. τ = 0.5 reduces to ordinary mean-squared-error regression; higher τ (e.g. 0.9) is critical for long-horizon tasks such as AntMaze that require dynamic programming (“stitching”).

loss_function

str

default:"smooth_l1"

Loss for the Q-value regression residual.
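The advantage-weighted actor update controlled by temperature can be sketched as follows. This is a minimal illustration in plain PyTorch, not the TorchRL implementation: the function name `awr_actor_loss` is hypothetical, and the weight clamp (`max_weight=100.0`) is an assumption borrowed from common IQL implementations.

```python
import torch


def awr_actor_loss(
    log_prob: torch.Tensor,
    q: torch.Tensor,
    v: torch.Tensor,
    temperature: float = 3.0,
    max_weight: float = 100.0,  # assumed clamp, keeps the exponential bounded
) -> torch.Tensor:
    """Negative advantage-weighted log-likelihood of dataset actions."""
    # The advantage only acts as a weight, so no gradient flows through it.
    advantage = (q - v).detach()
    weight = torch.exp(temperature * advantage).clamp(max=max_weight)
    return -(weight * log_prob).mean()


# Toy values (hypothetical): two dataset actions, the first with positive advantage.
log_prob = torch.tensor([-1.0, -2.0])
q = torch.tensor([1.0, 0.0])
v = torch.zeros(2)
loss_actor = awr_actor_loss(log_prob, q, v, temperature=1.0)
```

Raising the temperature sharpens the weights, so the actor concentrates on the highest-advantage dataset actions; lowering it moves the update toward plain behavior cloning.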

Output Keys

Key	Description
`loss_actor`	Advantage-weighted behavioral cloning loss
`loss_qvalue`	Q-function Bellman regression loss
`loss_value`	Expectile regression loss for the V-network
`entropy`	Policy entropy (for logging)

IQL Training Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import IQLLoss
from torchrl.objectives.utils import SoftUpdate

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Actor
actor = ProbabilisticActor(
    module=SafeModule(
        nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor()),
        in_keys=["observation"], out_keys=["loc", "scale"],
    ),
    in_keys=["loc", "scale"], spec=spec, distribution_class=TanhNormal,
)

# Q-value and state-value networks
qvalue = ValueOperator(
    module=nn.Linear(n_obs + n_act, 1),
    in_keys=["observation", "action"],
    out_keys=["state_action_value"],
)
value = ValueOperator(
    module=nn.Linear(n_obs, 1),
    in_keys=["observation"],
    out_keys=["state_value"],
)

loss_module = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    temperature=3.0,
    expectile=0.7,
)

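# Soft (Polyak) update of the target networks. In SoftUpdate, eps is the
# fraction of the old target that is kept, so eps=0.995 corresponds to tau=0.005.
updater = SoftUpdate(loss_module, eps=0.995)

optim = torch.optim.Adam(loss_module.parameters(), lr=3e-4)

# Training loop (offline, no env interaction).
# `dataloader` is assumed to be any iterable that yields TensorDict batches
# sampled from the offline dataset (e.g. a replay buffer).
for batch in dataloader: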
    loss_td = loss_module(batch)
    loss = (
        loss_td["loss_actor"] +
        loss_td["loss_qvalue"] +
        loss_td["loss_value"]
    )
    loss.backward()
    optim.step()
    optim.zero_grad()
    updater.step()

For AntMaze and other sparse-reward navigation tasks that require stitching, use expectile=0.9 and temperature=10.0. For D4RL locomotion benchmarks (HalfCheetah, Hopper, Walker2D), expectile=0.7 and temperature=3.0 are typical defaults.

CQLLoss and DiscreteCQLLoss

Conservative Q-Learning (Kumar et al. 2020) regularizes the Q-function by adding a penalty that pushes down Q-values on out-of-distribution actions and pushes up Q-values on in-distribution (dataset) actions. This conservative constraint prevents the policy from exploiting erroneously high Q-values for unseen actions.
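The push-down/push-up penalty described above can be sketched in a simplified form. The function name `cql_penalty` and the tensor shapes are assumptions for illustration; the actual CQLLoss also samples policy actions and applies importance corrections, which this sketch omits.

```python
import torch


def cql_penalty(
    q_sampled: torch.Tensor, q_data: torch.Tensor, temperature: float = 1.0
) -> torch.Tensor:
    """Simplified conservative penalty.

    q_sampled: [batch, num_sampled] Q-values at sampled (possibly OOD) actions.
    q_data:    [batch] Q-values at the dataset actions.
    """
    # A soft maximum over sampled actions is pushed down ...
    pushed_down = temperature * torch.logsumexp(q_sampled / temperature, dim=-1)
    # ... while Q-values at dataset actions are pushed up.
    return (pushed_down - q_data).mean()


# Toy values (hypothetical): 2 states, 4 sampled actions each, all Q-values zero.
q_sampled = torch.zeros(2, 4)
q_data = torch.zeros(2)
penalty = cql_penalty(q_sampled, q_data)
```

In training, this term would be scaled by min_q_weight and added to the TD loss of each Q-network.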

from torchrl.objectives import CQLLoss

loss_module = CQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
    temperature=1.0,
    min_q_weight=1.0,
    with_lagrange=False,
)

Constructor Parameters

actor_network

ProbabilisticTensorDictSequential

required

Stochastic policy network (SAC-style).

qvalue_network

TensorDictModule | list[TensorDictModule]

required

Q(s, a) network(s). CQL uses two Q-networks by default.

alpha_init

float

default:"1.0"

Initial entropy temperature. Tuned automatically when fixed_alpha=False.

temperature

float

default:"1.0"

CQL temperature for the logsumexp penalty on random actions.

min_q_weight

float

default:"1.0"

Weight of the conservative CQL penalty relative to the standard TD loss.

with_lagrange

bool

default:"False"

If True, a Lagrange multiplier is learned to adaptively balance the CQL penalty against the TD error.

lagrange_thresh

float

default:"0.0"

Threshold for the Lagrange multiplier. Active only when with_lagrange=True.

num_random

int

default:"10"

Number of random actions sampled per state for computing the CQL penalty.
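To make the role of with_lagrange and lagrange_thresh more concrete, here is a hedged sketch of how a learned multiplier can balance the penalty, following the Lagrangian formulation of the CQL paper. The names `lagrange_weighted_penalty` and `log_alpha_prime` are illustrative assumptions, not a description of the library's internals.

```python
import torch


def lagrange_weighted_penalty(
    cql_gap: torch.Tensor, log_alpha_prime: torch.Tensor, lagrange_thresh: float
) -> torch.Tensor:
    """Scale the gap between the CQL penalty and its threshold by a learned multiplier."""
    alpha_prime = log_alpha_prime.exp().clamp(min=0.0, max=1e6)
    return alpha_prime * (cql_gap - lagrange_thresh)


# Hypothetical values: the penalty (2.0) exceeds the threshold (0.5).
log_alpha_prime = torch.zeros((), requires_grad=True)
weighted = lagrange_weighted_penalty(torch.tensor(2.0), log_alpha_prime, 0.5)

# The Q-networks minimize `weighted`; the multiplier minimizes its negative,
# so it grows while the penalty stays above the threshold.
alpha_prime_loss = -weighted
alpha_prime_loss.backward()
```

The negative gradient on `log_alpha_prime` means a gradient-descent step increases the multiplier, which tightens the conservative constraint until the gap falls to the threshold.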

Output Keys

Key	Description
`loss_actor`	SAC actor loss
`loss_qvalue`	Q-function TD regression loss
`loss_cql`	Conservative Q-learning penalty
`loss_actor_bc`	Behavioral cloning regularization component
`loss_alpha`	Temperature loss
`alpha`	Current temperature
`entropy`	Policy entropy

For discrete action spaces use DiscreteCQLLoss, which takes a QValueActor instead of a stochastic actor:

from torchrl.objectives import DiscreteCQLLoss

loss_module = DiscreteCQLLoss(
    value_network=q_actor,
    action_space=spec,
    loss_function="smooth_l1",
)

BCLoss

Behavior Cloning (BC) trains a policy to imitate a demonstrator by minimizing the negative log-likelihood (or a surrogate loss) of expert actions. It is the simplest offline approach and serves as a strong baseline for dense-reward tasks. Because it never looks at rewards, its performance is bounded by the quality of the demonstrations.

from torchrl.objectives import BCLoss

loss_module = BCLoss(actor_network=actor)

Constructor Parameters

actor_network

TensorDictModule

required

The actor to be trained. Works with both stochastic policies (minimizes NLL) and deterministic policies (minimizes reconstruction loss). Any module that implements get_dist() is handled as stochastic.

loss_function

str | Callable | None

default:"None"

Loss function used when the actor is deterministic (non-distribution-based). One of "l1", "l2", "mse", "smooth_l1", "cross_entropy", or a custom callable. When None, stochastic actors are trained with the negative log-likelihood of the expert actions under the policy distribution.

reduction

str

default:"mean"

Reduction applied to the element-wise losses. One of "none", "mean", "sum".

Input Keys (via `set_keys`)

action

NestedKey

default:"action"

Expert action key in the dataset TensorDict. Also selects the key where the actor writes its prediction.

pad_mask

NestedKey | None

default:"None"

Boolean mask marking padded action timesteps to exclude from the loss (e.g. "action_is_pad" from chunked / VLA-style behavior cloning). True = padded (excluded).
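The effect of a padding mask on a chunked behavior cloning loss can be illustrated with a small sketch. The function `masked_bc_loss` and the shapes below are assumptions for this example; how BCLoss applies the mask internally may differ.

```python
import torch


def masked_bc_loss(
    pred: torch.Tensor, target: torch.Tensor, pad_mask: torch.Tensor
) -> torch.Tensor:
    """MSE over action chunks that ignores padded timesteps.

    pred, target: [batch, time, act_dim]
    pad_mask:     [batch, time], True = padded (excluded from the loss)
    """
    per_step = (pred - target).pow(2).mean(dim=-1)  # [batch, time]
    valid = (~pad_mask).float()
    # Average over valid timesteps only; guard against an all-padded batch.
    return (per_step * valid).sum() / valid.sum().clamp(min=1.0)


# Toy chunk (hypothetical): the third timestep is padding and holds garbage.
pred = torch.zeros(1, 3, 2)
target = torch.tensor([[[1.0, 1.0], [2.0, 2.0], [100.0, 100.0]]])
pad_mask = torch.tensor([[False, False, True]])
loss_bc = masked_bc_loss(pred, target, pad_mask)
```

Without the mask, the padded timestep would dominate the loss even though it carries no expert information.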

Output Keys

Key	Description
`loss_bc`	Behavior cloning loss (NLL or MSE)

BC Examples

import torch
from torch import nn
from torchrl.data.tensor_specs import Bounded
from torchrl.modules.tensordict_module.actors import ProbabilisticActor
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.objectives import BCLoss

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
actor = ProbabilisticActor(
    module=SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"], spec=spec, distribution_class=TanhNormal,
)

loss = BCLoss(actor_network=actor)  # minimizes -log π(a_expert | s)

TD3BCLoss

TD3+BC (Fujimoto & Gu 2021) combines TD3 with a behavioral cloning regularization term. The actor loss is:

loss_actor = -λ * Q(s, π(s)) + MSE(π(s), a_dataset)

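where λ = α / mean(|Q(s, a)|) rescales the Q-value term so the BC and RL components remain in balance. α is the alpha argument of the loss: larger values put more weight on the Q term, smaller values lean more on behavior cloning.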
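The actor loss above can be written out as a short sketch in plain PyTorch. The function name `td3bc_actor_loss` and the toy tensors are illustrative assumptions; the normalization λ = α / mean|Q| follows the TD3+BC paper.

```python
import torch
import torch.nn.functional as F


def td3bc_actor_loss(
    q_pi: torch.Tensor,
    pi_action: torch.Tensor,
    data_action: torch.Tensor,
    alpha: float = 2.5,
) -> torch.Tensor:
    """-lambda * Q(s, pi(s)) + MSE(pi(s), a_dataset), with lambda = alpha / mean|Q|."""
    # lambda is treated as a constant scale, so it is detached from the graph.
    lmbda = alpha / q_pi.abs().mean().detach()
    bc_term = F.mse_loss(pi_action, data_action)
    return -lmbda * q_pi.mean() + bc_term


# Toy values (hypothetical): Q = 5 everywhere, policy actions off by 1 per dimension.
q_pi = torch.tensor([5.0, 5.0])
pi_action = torch.zeros(2, 2)
data_action = torch.ones(2, 2)
loss_actor = td3bc_actor_loss(q_pi, pi_action, data_action, alpha=2.5)
```

Because λ divides by the average Q magnitude, the RL term has scale α regardless of the reward scale, which is what keeps it comparable to the MSE term.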

from torchrl.objectives import TD3BCLoss

loss_module = TD3BCLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    action_spec=spec,
    policy_noise=0.2,
    noise_clip=0.5,
    alpha=2.5,   # sets λ = alpha / mean|Q|; smaller values lean more on BC
)

DTLoss and OnlineDTLoss

Decision Transformer (Chen et al. 2021) formulates RL as a sequence modelling problem: given the context of past states, actions, and return-to-go tokens, the transformer predicts the next action. DTLoss is the offline variant; OnlineDTLoss extends it to online fine-tuning with entropy regularization.

from torchrl.objectives import DTLoss, OnlineDTLoss

# Offline Decision Transformer
offline_loss = DTLoss(actor_network=actor)

# Online Decision Transformer (with entropy-tuned alpha)
online_loss = OnlineDTLoss(
    actor_network=actor,
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
)

OnlineDTLoss accepts the same alpha_init, min_alpha, max_alpha, fixed_alpha, and target_entropy arguments as SACLoss. Its output includes loss_actor, loss_alpha, and entropy.

The Decision Transformer input format is different from standard RL: the TensorDict must include "return_to_go" tokens alongside observations and actions, arranged in a sequence context window.
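As a small illustration of the return-to-go tokens mentioned above, the sketch below computes undiscounted returns-to-go for one trajectory. The helper name `return_to_go` is an assumption for this example; in practice the values would be stored under the "return_to_go" key next to observations and actions.

```python
import torch


def return_to_go(rewards: torch.Tensor) -> torch.Tensor:
    """Undiscounted return-to-go: rtg[t] = sum of rewards[t:], for rewards of shape [time]."""
    # Reverse, take the running sum, then reverse back.
    reversed_cumsum = torch.cumsum(torch.flip(rewards, dims=[0]), dim=0)
    return torch.flip(reversed_cumsum, dims=[0])


# Toy trajectory (hypothetical) with three steps.
rewards = torch.tensor([1.0, 2.0, 3.0])
rtg = return_to_go(rewards)
```

At inference time, the first return-to-go token is set to a desired target return and decremented by each observed reward.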

DiffusionBCLoss

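DiffusionBCLoss trains a diffusion-based behaviour cloning policy (e.g. Diffusion Policy, Chi et al. 2023). The denoising score-matching objective trains the network to undo the forward diffusion process: expert actions are corrupted with noise at a random diffusion step, and the loss is the prediction error of the denoiser (in DDPM-style training, an MSE between the injected and the predicted noise).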

from torchrl.objectives import DiffusionBCLoss

loss_module = DiffusionBCLoss(actor_network=diffusion_actor)
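For intuition, a DDPM-style noise-prediction loss for behavior cloning can be sketched in plain PyTorch. Everything here is an assumption made for illustration (the `TinyEps` network, the linear beta schedule, the function name `ddpm_noise_loss`); it is not the DiffusionBCLoss implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyEps(nn.Module):
    """Hypothetical noise-prediction network conditioned on observation and step."""

    def __init__(self, act_dim: int, obs_dim: int):
        super().__init__()
        self.net = nn.Linear(act_dim + obs_dim + 1, act_dim)

    def forward(self, noisy_action, obs, t):
        t_feat = t.float().unsqueeze(-1)
        return self.net(torch.cat([noisy_action, obs, t_feat], dim=-1))


def ddpm_noise_loss(eps_model, action, obs, alphas_cumprod):
    """MSE between the injected noise and the model's prediction of it."""
    batch = action.shape[0]
    # Sample one diffusion step per expert action.
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,))
    noise = torch.randn_like(action)
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    # Forward process: blend the clean action with Gaussian noise.
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(eps_model(noisy_action, obs, t), noise)


torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 100)  # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
model = TinyEps(act_dim=4, obs_dim=3)
loss = ddpm_noise_loss(model, torch.randn(8, 4), torch.randn(8, 3), alphas_cumprod)
```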

ACTLoss

ACTLoss trains an Action Chunking Transformer (Zhao et al. 2023) for robot manipulation. ACT predicts a chunk of future actions together with a variational latent that captures multi-modal behavior. The loss combines an MSE reconstruction term with a KL regularization on the latent:

from torchrl.objectives import ACTLoss

loss_module = ACTLoss(
    actor_network=act_module,
    kl_coeff=10.0,   # weight of the KL term
)
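The combination of a reconstruction term and a KL term can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `act_loss` is hypothetical, the latent posterior is assumed to be a diagonal Gaussian parameterized by `mu` and `logvar`, and the prior is a standard normal.

```python
import torch
import torch.nn.functional as F


def act_loss(pred_actions, target_actions, mu, logvar, kl_coeff=10.0):
    """MSE over the action chunk plus a weighted KL(q(z|.) || N(0, I)).

    pred_actions, target_actions: [batch, chunk, act_dim]
    mu, logvar:                   [batch, latent_dim]
    """
    recon = F.mse_loss(pred_actions, target_actions)
    # Closed-form KL between a diagonal Gaussian and a standard normal.
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
    return recon + kl_coeff * kl


# Toy values (hypothetical): perfect reconstruction, latent mean shifted to 1.
actions = torch.zeros(1, 5, 2)
mu = torch.ones(1, 2)
logvar = torch.zeros(1, 2)
loss = act_loss(actions, actions, mu, logvar, kl_coeff=10.0)
```

A larger kl_coeff keeps the latent close to the prior, so sampling z from the prior (or setting it to zero) at test time stays in distribution.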

GAILLoss

Generative Adversarial Imitation Learning (Ho & Ermon 2016) trains a discriminator to distinguish expert trajectories from policy-generated ones, then uses the discriminator output as a surrogate reward. The policy is trained with any on-policy RL algorithm using these learned rewards.

from torchrl.objectives import GAILLoss

gail_loss = GAILLoss(
    actor_network=actor,
    discriminator_network=discriminator,
)

The discriminator is trained to maximize log D(s, a) + log(1 − D(s̃, ã)), where (s, a) is sampled from the expert data and (s̃, ã) from the current policy. The policy gradient loss minimizes −log D(s̃, ã), which encourages the policy to produce transitions the discriminator cannot distinguish from expert data.
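The two objectives above can be sketched with binary cross-entropy in plain PyTorch. The function names and toy probabilities are assumptions for illustration and do not describe GAILLoss internals.

```python
import torch
import torch.nn.functional as F


def gail_discriminator_loss(d_expert: torch.Tensor, d_policy: torch.Tensor) -> torch.Tensor:
    """Minimizing this BCE maximizes log D(expert) + log(1 - D(policy)).

    d_expert, d_policy: discriminator outputs in (0, 1).
    """
    loss_expert = F.binary_cross_entropy(d_expert, torch.ones_like(d_expert))
    loss_policy = F.binary_cross_entropy(d_policy, torch.zeros_like(d_policy))
    return loss_expert + loss_policy


def gail_policy_objective(d_policy: torch.Tensor) -> torch.Tensor:
    """-log D on policy transitions; equivalently, the surrogate reward is log D."""
    return -torch.log(d_policy).mean()


# Toy values (hypothetical): an undecided discriminator outputs 0.5 everywhere.
d_expert = torch.tensor([0.5, 0.5])
d_policy = torch.tensor([0.5, 0.5])
loss_d = gail_discriminator_loss(d_expert, d_policy)
loss_pi = gail_policy_objective(d_policy)
```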

Choosing an Offline Objective

IQLLoss

A strong default. It avoids OOD action queries entirely and achieves strong performance on D4RL benchmarks. Start here.

loss = IQLLoss(actor, qvalue, value, expectile=0.7, temperature=3.0)

CQLLoss

Good when you need conservative Q-values as a downstream planning component. The with_lagrange=True variant is more stable but adds a learned multiplier and its threshold (lagrange_thresh) to tune.

loss = CQLLoss(actor, qvalue, min_q_weight=5.0)

BCLoss

Strongest baseline for dense-reward tasks and robot manipulation when the dataset is high quality. It has no algorithm-specific hyperparameters beyond the usual optimizer settings.

loss = BCLoss(actor)

TD3BCLoss

Combines offline Q-learning with BC regularization. Simple and effective; with a well-tuned alpha it can match or beat IQL on continuous control.

loss = TD3BCLoss(actor, qvalue, action_spec=spec, alpha=2.5)
