Offline RL and Imitation Learning with TorchRL

Offline reinforcement learning trains a policy entirely from a fixed, pre-collected dataset — no further environment interaction is allowed during learning. This setting removes the need for a simulator at train time and makes RL practical for domains where environment resets are expensive or unsafe, such as healthcare, robotics, and industrial control. TorchRL provides a complete offline stack: dataset loaders for D4RL, Atari DQN, LeRobot, Minari, and Open X-Embodiment; TensorDictReplayBuffer for flexible offline data management; and dedicated loss modules for IQL, CQL, behavior cloning, and the Decision Transformer.

Why Offline RL Is Hard

Standard Q-learning often diverges on offline data because the Bellman backup queries Q-values for out-of-distribution (OOD) actions, that is, actions the behavior policy never took. Nothing in the dataset corrects estimation errors at those actions, so the max in the backup tends to pick overestimated values and propagate them unless an explicit constraint keeps the policy close to the data distribution. The three main families of solutions are:

Conservative value functions (CQL): penalise Q-values for OOD actions during training.
Implicit constraints (IQL): avoid querying OOD actions altogether by replacing max-Q with an expectile regression.
Supervised cloning (BC, DT): bypass Q-learning entirely and treat offline RL as sequence or regression modelling.
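
The overestimation problem can be reproduced without any RL library. The following is a toy sketch in plain Python, not TorchRL code: every action has the same true value (0.0), each estimate carries independent Gaussian noise, and the function names are hypothetical. Taking the max over more noisy estimates inflates the result, which is what happens when a Bellman backup maximises over actions the dataset cannot correct.

```python
import random

def max_of_noisy_estimates(true_q, noise_std, n_actions, rng):
    """Max over Q-estimates that all share the same true value."""
    return max(true_q + rng.gauss(0.0, noise_std) for _ in range(n_actions))

def average_overestimation(n_actions, n_trials=2000, noise_std=1.0, seed=0):
    """Average gap between the max noisy estimate and the true value (0.0)."""
    rng = random.Random(seed)
    total = sum(
        max_of_noisy_estimates(0.0, noise_std, n_actions, rng)
        for _ in range(n_trials)
    )
    return total / n_trials

bias_1 = average_overestimation(n_actions=1)    # no max: roughly unbiased
bias_10 = average_overestimation(n_actions=10)  # max over 10: biased upward
print(f"1 action: {bias_1:+.3f}, 10 actions: {bias_10:+.3f}")
```

With a single action the average error stays near zero, while the max over ten actions is clearly positive even though no action is actually better than another.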

IQLLoss: Implicit Q-Learning

IQLLoss implements the algorithm from “Offline Reinforcement Learning with Implicit Q-Learning” (Kostrikov et al. 2021). It introduces a separate value_network V(s) trained with expectile regression instead of max-Q bootstrapping, so the critic never queries the policy for OOD actions.
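
For reference, the three objectives from the paper can be written as follows. This is a transcription in the paper's notation, with the policy objective written as a loss to minimise; \hat\theta denotes target Q-network parameters, \tau the expectile, and \beta the inverse temperature. Implementations may swap the squared error in the Q-loss for another distance, such as the smooth L1 loss used in the example here.

```latex
\begin{aligned}
L_V(\psi) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ L_2^{\tau}\!\left(Q_{\hat\theta}(s,a) - V_\psi(s)\right) \right],
  \qquad L_2^{\tau}(u) = \left|\tau - \mathbb{1}(u < 0)\right| u^2 \\
L_Q(\theta) &= \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}
  \left[ \left(r + \gamma V_\psi(s') - Q_\theta(s,a)\right)^2 \right] \\
L_\pi(\phi) &= -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}
  \left[ \exp\!\left(\beta \left(Q_{\hat\theta}(s,a) - V_\psi(s)\right)\right)
  \log \pi_\phi(a \mid s) \right]
\end{aligned}
```

Every expectation is taken over dataset samples only, which is why no term evaluates Q at an action the policy proposes.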

import torch
from torch import nn
from tensordict.nn import NormalParamExtractor
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.modules.distributions import TanhNormal
from torchrl.data import Bounded
from torchrl.objectives.iql import IQLLoss

# Illustrative sizes (Hopper-like); replace with your dataset's dimensions
obs_dim, action_dim = 11, 3
action_spec = Bounded(low=-1.0, high=1.0, shape=(action_dim,))

# Actor: stochastic Gaussian policy.
# NormalParamExtractor splits the 2 * action_dim output into loc and a positive scale.
actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 2 * action_dim),
    NormalParamExtractor(),
)
actor = ProbabilisticActor(
    module=SafeModule(actor_net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"],
    out_keys=["action"],
    spec=action_spec,
    distribution_class=TanhNormal,
    return_log_prob=True,
)

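# Critic: Q(s, a). TorchRL's MLP concatenates its inputs along the last dimension;
# a plain nn.Sequential would reject the two positional arguments.
from torchrl.modules import MLP
qvalue_net = MLP(in_features=obs_dim + action_dim, out_features=1, num_cells=[256], activation_class=nn.ReLU)
qvalue = ValueOperator(module=qvalue_net, in_keys=["observation", "action"])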

# Value: V(s)
value_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
value = ValueOperator(module=value_net, in_keys=["observation"])

loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    num_qvalue_nets=2,         # double Q trick
    temperature=3.0,           # inverse temperature β — higher → closer to max-Q
    expectile=0.7,             # τ — larger values needed for stitching tasks (e.g. AntMaze)
    loss_function="smooth_l1",
)

# Training step
batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_qvalue", "loss_value"
total_loss = loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_value"]
total_loss.backward()

For AntMaze tasks that require “stitching” together sub-optimal trajectories, set expectile=0.9 or higher. For simpler locomotion datasets, expectile=0.7 with temperature=3.0 is a good starting point.
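
To see what the expectile parameter does, here is a minimal sketch in plain Python. It is not the TorchRL implementation: the helper names and the sample values are made up for illustration. It computes the minimiser of the asymmetric squared loss for several values of τ. At τ = 0.5 the minimiser is the mean; as τ approaches 1 it moves toward the largest value, which is how V(s) approximates a max over dataset actions.

```python
def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, with diff = Q - V."""
    weight = tau if diff >= 0 else 1.0 - tau
    return weight * diff * diff

def expectile(values, tau, n_iter=200):
    """Minimiser of the summed expectile loss, found by fixed-point iteration."""
    m = sum(values) / len(values)
    for _ in range(n_iter):
        # Values above the current estimate get weight tau, the rest 1 - tau
        weights = [tau if v >= m else 1.0 - tau for v in values]
        m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return m

q_samples = [0.0, 1.0, 2.0, 3.0, 10.0]  # made-up Q(s, a) values for dataset actions
e50 = expectile(q_samples, 0.5)  # equals the mean
e70 = expectile(q_samples, 0.7)
e90 = expectile(q_samples, 0.9)  # closer to the maximum
print(e50, e70, e90)
```

Raising τ from 0.7 to 0.9 pulls the value estimate further toward the best action seen in the data, which matches the advice above for stitching tasks.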

CQLLoss: Conservative Q-Learning

CQLLoss implements “Conservative Q-Learning for Offline Reinforcement Learning” (Kumar et al. 2020). It adds a CQL regularisation term that pushes down Q-values for actions sampled from the current policy (rather than from the dataset) while pushing up Q-values for the dataset actions, so the critic stays conservative on actions the offline data does not cover.

from torchrl.objectives.cql import CQLLoss

loss_fn = CQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    loss_function="smooth_l1",
    alpha_init=1.0,            # initial entropy multiplier
    fixed_alpha=False,         # auto-tune α to match target entropy
    target_entropy="auto",     # -prod(action_dims)
    delay_qvalue=True,         # separate target Q network
    temperature=1.0,           # CQL temperature for action sampling
)

batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_actor_bc", "loss_qvalue", "loss_alpha", "loss_cql"
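
The core of the regulariser can be sketched in a few lines of plain Python. This is a simplified form of the penalty from the paper (a temperature-scaled log-sum-exp over sampled actions minus the Q-value of the dataset action), not the code inside CQLLoss, and the function names and numbers are illustrative assumptions.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def cql_penalty(q_sampled, q_data, temperature=1.0):
    """temperature * logsumexp(Q(s, a_sampled) / temperature) - Q(s, a_data)."""
    scaled = [q / temperature for q in q_sampled]
    return temperature * logsumexp(scaled) - q_data

# Same dataset-action value, two different critics on the sampled actions
optimistic = cql_penalty(q_sampled=[1.0, 2.0, 3.0], q_data=2.0)
conservative = cql_penalty(q_sampled=[-1.0, 0.0, 1.0], q_data=2.0)
print(optimistic, conservative)
```

Minimising this term lowers the Q-values of sampled actions relative to the dataset action, so a critic that is optimistic about unseen actions pays a larger penalty.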

CQL vs. IQL at a Glance

Property	IQL	CQL
OOD action queries	Never	Yes (for regularisation)
V(s) network needed	Yes	No
Auto-tune α	No	Yes
Best for	Stitching tasks	Dense-reward locomotion

BCLoss: Behavior Cloning

BCLoss is the simplest approach: maximise the log-likelihood of expert actions, log π(a_expert | s). It is a sensible first baseline to run before more complex offline RL algorithms, and is often competitive on tasks with narrow behavior distributions.

from torchrl.objectives.bc import BCLoss
from tensordict import TensorDict

# Works with any actor that has a get_dist() method
bc_loss = BCLoss(actor_network=actor, reduction="mean")

# The input tensordict must contain "observation" and "action".
# Random placeholders stand in for a real batch of expert transitions.
batch = TensorDict({
    "observation": torch.randn(256, obs_dim),
    "action": torch.rand(256, action_dim) * 2 - 1,  # placeholder actions in [-1, 1)
}, batch_size=[256])

loss_td = bc_loss(batch)
# loss_td keys: "loss_bc"
loss_td["loss_bc"].backward()

BCLoss is also compatible with non-TensorDict usage. Pass the actor’s in_keys plus "action" as keyword arguments:

loss_val = bc_loss(observation=obs_tensor, action=action_tensor)
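
As a rough illustration of the objective itself, the sketch below computes the mean negative log-likelihood of expert actions under a diagonal Gaussian policy in plain Python. It is not the BCLoss code; the function names, the Gaussian policy, and the toy numbers are assumptions made for the example.

```python
import math

def gaussian_log_prob(action, loc, scale):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, loc, scale)
    )

def bc_nll(expert_actions, policy_locs, policy_scales):
    """Mean negative log-likelihood of the expert actions under the policy."""
    nll = [
        -gaussian_log_prob(a, m, s)
        for a, m, s in zip(expert_actions, policy_locs, policy_scales)
    ]
    return sum(nll) / len(nll)

expert = [[0.5, -0.2], [0.1, 0.3]]  # two expert actions, two dimensions each
scales = [[0.5, 0.5], [0.5, 0.5]]
loss_matched = bc_nll(expert, policy_locs=expert, policy_scales=scales)
loss_off = bc_nll(expert, policy_locs=[[0.0, 0.0], [1.0, 1.0]], policy_scales=scales)
print(loss_matched, loss_off)
```

The loss is lowest when the policy mean coincides with the expert action, which is the only signal behavior cloning uses; rewards never enter the objective.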

DTLoss: Decision Transformer

DTLoss implements “Decision Transformer: Reinforcement Learning via Sequence Modeling” (Chen et al. 2021). Instead of learning a Q-function, the Decision Transformer autoregressively models π(a | s, R, t) where R is a desired return-to-go, making it a supervised learning problem over offline trajectories.

from torchrl.objectives.decision_transformer import DTLoss

# actor_network must be a ProbabilisticTensorDictSequential that reads
# return-to-go, observation, and timestep, and outputs an action distribution
dt_loss = DTLoss(
    actor_network=dt_actor,
    loss_function="l2",    # L2 regression for continuous actions
    reduction="mean",
)

# Batch must contain: observation, action, return_to_go, timestep
loss_td = dt_loss(batch)
# loss_td keys: "loss"
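
The return-to-go that conditions the model is the sum of rewards from the current step to the end of the trajectory. A minimal sketch of that computation in plain Python follows; the helper name is hypothetical, and the optional discount is included only for generality, since the paper uses undiscounted returns.

```python
def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards: rtg[t] = r[t] + gamma * rtg[t + 1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rtg = returns_to_go([1.0, 2.0, 3.0])
print(rtg)  # [6.0, 5.0, 3.0]
```

At evaluation time the first return-to-go is set to a desired target return and reduced by each observed reward as the episode unfolds.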

For online fine-tuning with the Decision Transformer, use OnlineDTLoss, which adds an entropy bonus on top of the supervised loss:

from torchrl.objectives.decision_transformer import OnlineDTLoss

online_dt_loss = OnlineDTLoss(actor_network=dt_actor, alpha_init=0.1)

Loading Offline Datasets

TorchRL wraps all major offline RL dataset formats into TensorDictReplayBuffer subclasses via BaseDatasetExperienceReplay. Each class downloads and caches data automatically.

The four examples below cover D4RL, Atari DQN, LeRobot, and Minari, in that order.

from torchrl.data.datasets.d4rl import D4RLExperienceReplay

# Requires: pip install d4rl  (or set direct_download=True for no D4RL dependency)
dataset = D4RLExperienceReplay(
    dataset_id="halfcheetah-medium-v2",
    batch_size=256,
    split_trajs=False,      # True → pad trajectories to equal length
    direct_download=True,   # bypass d4rl, download raw HDF5 directly
)

batch = dataset.sample()
# batch has keys: "observation", "action", "reward",
#                 ("next", "observation"), "done", "terminated"

from torchrl.data.datasets.atari_dqn import AtariDQNExperienceReplay

dataset = AtariDQNExperienceReplay(
    dataset_id="Pong/1",     # game/run_id
    batch_size=32,
    num_slices=8,            # number of trajectory slices per sample
)
batch = dataset.sample()

from torchrl.data.datasets.lerobot import LeRobotExperienceReplay

dataset = LeRobotExperienceReplay(
    dataset_id="lerobot/pusht",
    batch_size=64,
    num_slices=8,            # slice samplers need batch_size divisible by num_slices
)
batch = dataset.sample()

from torchrl.data.datasets.minari_data import MinariExperienceReplay

# Requires: pip install minari
dataset = MinariExperienceReplay(
    dataset_id="door-human-v0",
    batch_size=256,
    split_trajs=True,
)
batch = dataset.sample()

All dataset classes expose the same sample() interface and support transform stacks, prioritised sampling, and prefetching:

from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.envs.transforms import RewardScaling

dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-v2",
    batch_size=256,
    prefetch=4,             # background prefetch threads
    transform=RewardScaling(loc=0.0, scale=1000.0),
)

TensorDictReplayBuffer for Offline Data

For custom offline datasets, use TensorDictReplayBuffer with LazyMemmapStorage to keep the full dataset on disk and load mini-batches on demand:

from torchrl.data import TensorDictReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler
import tempfile

# Memory-mapped storage — dataset stays on disk, zero-copy reads
storage = LazyMemmapStorage(
    max_size=1_000_000,
    scratch_dir=tempfile.mkdtemp(),
    device="cpu",
)

buffer = TensorDictReplayBuffer(
    storage=storage,
    sampler=RandomSampler(),
    batch_size=256,
    prefetch=4,
)

# Populate from a TensorDict whose leading batch dimension indexes transitions
buffer.extend(offline_dataset_td)

# Retrieve a training batch
batch = buffer.sample()

For offline RL with TD-error priorities, swap in PrioritizedSampler:

from torchrl.data.replay_buffers.samplers import PrioritizedSampler

buffer = TensorDictReplayBuffer(
    storage=LazyMemmapStorage(max_size=1_000_000),
    sampler=PrioritizedSampler(max_capacity=1_000_000, alpha=0.7, beta=0.5),
    batch_size=256,
)
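
The roles of alpha and beta can be illustrated with the standard prioritised experience replay formulas. The sketch below is plain Python with hypothetical helper names and made-up TD errors; it does not reproduce PrioritizedSampler's internals (implementations typically also add a small epsilon to each priority).

```python
def sampling_probabilities(priorities, alpha):
    """P(i) = p_i**alpha / sum_j p_j**alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

def importance_weights(probs, beta):
    """w_i = (N * P(i))**(-beta), normalised so the largest weight is 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    max_w = max(raw)
    return [w / max_w for w in raw]

td_errors = [0.1, 0.5, 2.0, 0.5]  # illustrative absolute TD errors used as priorities
probs = sampling_probabilities(td_errors, alpha=0.7)
weights = importance_weights(probs, beta=0.5)
print(probs, weights)
```

Transitions with large TD error are sampled more often, and the importance weights shrink their contribution to the loss to partly compensate for that bias.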

Full Offline RL Training Loop

Load the Dataset

from torchrl.data.datasets.d4rl import D4RLExperienceReplay

dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-expert-v2",
    batch_size=256,
    direct_download=True,
)

Build Networks and Loss

from torchrl.objectives.iql import IQLLoss

loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    temperature=3.0,
    expectile=0.7,
)
loss_fn.make_value_estimator(gamma=0.99)

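# The loss module owns the trainable parameters (the Q-network is expanded into
# num_qvalue_nets copies), so build the optimisers from its parameter containers
actor_opt = torch.optim.Adam(list(loss_fn.actor_network_params.flatten_keys().values()), lr=3e-4)
critic_opt = torch.optim.Adam(
    list(loss_fn.qvalue_network_params.flatten_keys().values())
    + list(loss_fn.value_network_params.flatten_keys().values()),
    lr=3e-4,
)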

Train

from torchrl.objectives import SoftUpdate

# Polyak averaging of the target Q-network: target <- (1 - tau) * target + tau * online
target_updater = SoftUpdate(loss_fn, tau=0.005)

for step in range(1_000_000):
    batch = dataset.sample()
    loss_td = loss_fn(batch)

    actor_opt.zero_grad()
    loss_td["loss_actor"].backward(retain_graph=True)
    actor_opt.step()

    critic_opt.zero_grad()
    (loss_td["loss_qvalue"] + loss_td["loss_value"]).backward()
    critic_opt.step()

    # Soft-update the target networks after each optimisation step
    target_updater.step()
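
The soft update at the end of each step is an exponential moving average of the online parameters. A minimal sketch of the arithmetic in plain Python, with parameters represented as lists of floats purely for illustration:

```python
def soft_update(target, online, tau):
    """Return (1 - tau) * target + tau * online for each parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, -2.0]  # held fixed here; in training these change every step
for _ in range(1000):
    target = soft_update(target, online, tau=0.005)
print(target)
```

With tau = 0.005 the target moves 0.5% of the way toward the online network per step, so it takes on the order of a thousand steps to close most of the gap. This lag is what keeps the bootstrapped targets stable.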

Offline datasets can contain millions of transitions. Always pre-compute observation normalisation statistics from the full dataset before training. A running normaliser fitted on mini-batches changes its statistics as training proceeds, so early updates see differently scaled inputs than later ones, even though the dataset itself never changes.
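
A minimal sketch of fixed, dataset-wide normalisation in plain Python is shown below. The helper names and the four-sample dataset are illustrative assumptions; in practice the same statistics would be computed on tensors and applied through a transform.

```python
import math

def fit_normaliser(observations):
    """Per-dimension mean and standard deviation over the whole dataset."""
    n = len(observations)
    dim = len(observations[0])
    mean = [sum(obs[d] for obs in observations) / n for d in range(dim)]
    var = [sum((obs[d] - mean[d]) ** 2 for obs in observations) / n for d in range(dim)]
    std = [max(math.sqrt(v), 1e-6) for v in var]  # floor avoids division by zero
    return mean, std

def normalise(obs, mean, std):
    """Apply the frozen statistics to a single observation."""
    return [(x - m) / s for x, m, s in zip(obs, mean, std)]

dataset_obs = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]]
mean, std = fit_normaliser(dataset_obs)
normalised = [normalise(obs, mean, std) for obs in dataset_obs]
print(mean, std)
```

Because the statistics are computed once and then frozen, every gradient step sees inputs on the same scale, regardless of which mini-batch is drawn.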

Key Imports Reference

# Loss modules
from torchrl.objectives.iql import IQLLoss
from torchrl.objectives.cql import CQLLoss
from torchrl.objectives.bc import BCLoss
from torchrl.objectives.decision_transformer import DTLoss, OnlineDTLoss

# Offline datasets
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.data.datasets.atari_dqn import AtariDQNExperienceReplay
from torchrl.data.datasets.lerobot import LeRobotExperienceReplay
from torchrl.data.datasets.minari_data import MinariExperienceReplay
from torchrl.data.datasets.openx import OpenXExperienceReplay
from torchrl.data.datasets.vd4rl import VD4RLExperienceReplay

# Replay buffers and storage
from torchrl.data import TensorDictReplayBuffer, TensorDictPrioritizedReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage, LazyTensorStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler
