Offline reinforcement learning trains a policy entirely from a fixed, pre-collected dataset — no further environment interaction is allowed during learning. This setting removes the need for a simulator at train time and makes RL practical for domains where environment resets are expensive or unsafe, such as healthcare, robotics, and industrial control. TorchRL provides a complete offline stack: dataset loaders for D4RL, Atari DQN, LeRobot, Minari, and Open X-Embodiment; TensorDictReplayBuffer for flexible offline data management; and dedicated loss modules for IQL, CQL, behavior cloning, and the Decision Transformer.

Why Offline RL Is Hard

Standard Q-learning is prone to diverging on offline data because the Bellman backup queries Q-values for out-of-distribution (OOD) actions, that is, actions the behavior policy never took. No data ever corrects those estimates, so the maximisation in the backup tends to select and propagate overestimated values unless an explicit constraint keeps the policy close to the data distribution. The three main families of solutions are:
  1. Conservative value functions (CQL): penalise Q-values for OOD actions during training.
  2. Implicit constraints (IQL): avoid querying OOD actions altogether by replacing max-Q with an expectile regression.
  3. Supervised cloning (BC, DT): bypass Q-learning entirely and treat offline RL as sequence or regression modelling.
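
The overestimation effect is easy to reproduce in isolation. The sketch below is an illustration under simplifying assumptions, not part of TorchRL: every action has a true value of zero, the estimates carry unit Gaussian noise (a stand-in for unconstrained OOD estimates), and the names `noisy_q`, `mean_estimate`, and `max_estimate` are made up for the example.

```python
import torch

torch.manual_seed(0)

num_states, num_actions = 10_000, 20

# Assumption: the true Q-value of every action is exactly zero.
true_q = torch.zeros(num_states, num_actions)

# Stand-in for estimation error on actions the dataset never covers.
noisy_q = true_q + torch.randn(num_states, num_actions)

# Averaged over all entries, the estimates are unbiased...
mean_estimate = noisy_q.mean()

# ...but the per-state max used in a Bellman backup is biased upward.
max_estimate = noisy_q.max(dim=-1).values.mean()

print(f"mean estimate: {mean_estimate.item():.3f}")
print(f"mean of max:   {max_estimate.item():.3f}")
```

Each of the three families above removes or tames that max: CQL pushes the inflated values down, IQL never evaluates them, and BC and DT do not bootstrap at all.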

IQLLoss: Implicit Q-Learning

IQLLoss implements the algorithm from “Offline Reinforcement Learning with Implicit Q-Learning” (Kostrikov et al. 2021). It introduces a separate value_network V(s) trained with expectile regression instead of max-Q bootstrapping, so the critic never queries the policy for OOD actions.
import torch
from torch import nn
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.data import Bounded
from torchrl.objectives.iql import IQLLoss

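# Placeholder dimensions; take these from your environment's specs
obs_dim, action_dim = 17, 6

# Actor: stochastic tanh-Gaussian policy. NormalParamExtractor splits the
# 2 * action_dim outputs into "loc" and a strictly positive "scale".
actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 2 * action_dim), NormalParamExtractor()
)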
actor = ProbabilisticActor(
    module=SafeModule(actor_net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"],
    out_keys=["action"],
    distribution_class=TanhNormal,
    return_log_prob=True,
)

# Critic: Q(s, a)
qvalue_net = nn.Sequential(nn.Linear(obs_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))
qvalue = ValueOperator(module=qvalue_net, in_keys=["observation", "action"])

# Value: V(s)
value_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
value = ValueOperator(module=value_net, in_keys=["observation"])

loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    num_qvalue_nets=2,         # double Q trick
    temperature=3.0,           # inverse temperature β: higher → greedier policy, lower → closer to behavior cloning
    expectile=0.7,             # τ: values near 1 approximate max-Q; larger values needed for stitching tasks (e.g. AntMaze)
    loss_function="smooth_l1",
)

# Training step
batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_qvalue", "loss_value"
total_loss = loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_value"]
total_loss.backward()
For AntMaze tasks, which require “stitching” together sub-optimal trajectories, set expectile=0.9 or higher. For simpler locomotion datasets, expectile=0.7 with temperature=3.0 is a good starting point.
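
Both hyperparameters can be read off the losses they control. The following is a minimal sketch of the expectile value loss and the advantage-weighted actor weights as described in the IQL paper. It is not TorchRL's internal code: the function names and the `max_weight` clamp are assumptions made for illustration.

```python
import torch


def expectile_loss(q: torch.Tensor, v: torch.Tensor, expectile: float = 0.7) -> torch.Tensor:
    """Asymmetric squared error |tau - 1(u < 0)| * u^2 with u = q - v."""
    diff = q - v
    # Positive errors (V below Q) get weight tau, negative errors get 1 - tau.
    weight = torch.abs(expectile - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def awr_weights(
    q: torch.Tensor, v: torch.Tensor, temperature: float = 3.0, max_weight: float = 100.0
) -> torch.Tensor:
    """Weights exp(beta * advantage) applied to log pi(a | s) in the actor loss."""
    advantage = q - v
    # The clamp is a common stabilisation trick (hypothetical value here).
    return torch.exp(temperature * advantage).clamp(max=max_weight)


q = torch.tensor([2.0, 0.0])
v = torch.tensor([0.0, 1.0])

for tau in (0.5, 0.7, 0.9):
    print(tau, expectile_loss(q, v, expectile=tau).item())

print(awr_weights(q, v))
```

With expectile=0.5 the loss reduces to half the usual mean squared error. Raising it penalises V(s) more for falling below Q(s, a) than for exceeding it, which pulls V towards the best in-dataset actions.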

CQLLoss: Conservative Q-Learning

CQLLoss implements “Conservative Q-Learning for Offline Reinforcement Learning” (Kumar et al. 2020). It adds a CQL regularisation term that penalises Q-values for actions sampled from the current policy (rather than from the dataset), keeping the critic conservative over the offline data distribution.
from torchrl.objectives.cql import CQLLoss

loss_fn = CQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    loss_function="smooth_l1",
    alpha_init=1.0,            # initial entropy multiplier
    fixed_alpha=False,         # auto-tune α to match target entropy
    target_entropy="auto",     # -prod(action_dims)
    delay_qvalue=True,         # separate target Q network
    temperature=1.0,           # CQL temperature for action sampling
)

batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_actor_bc", "loss_qvalue", "loss_alpha", "loss_cql"
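
The regulariser itself is compact. The sketch below shows the log-sum-exp form of the CQL penalty from the paper, assuming Q-values for the sampled actions have already been computed. `cql_penalty` and its argument names are illustrative and not TorchRL's API; CQLLoss additionally takes care of details such as sampling the actions.

```python
import math

import torch


def cql_penalty(
    q_sampled: torch.Tensor, q_data: torch.Tensor, temperature: float = 1.0
) -> torch.Tensor:
    """Soft maximum of Q over sampled actions minus Q on dataset actions.

    q_sampled: [batch, num_sampled_actions], Q-values for policy or random actions
    q_data:    [batch], Q-values for the actions stored in the dataset
    """
    soft_max_q = temperature * torch.logsumexp(q_sampled / temperature, dim=-1)
    return (soft_max_q - q_data).mean()


q_sampled = torch.zeros(4, 10)
q_data = torch.zeros(4)

# With all Q-values equal, the penalty is log(num_sampled_actions).
print(cql_penalty(q_sampled, q_data).item(), math.log(10))
```

Minimising this term lowers Q-values on sampled actions and raises them on dataset actions, which is what keeps the critic conservative.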

CQL vs. IQL at a Glance

| Property | IQL | CQL |
| --- | --- | --- |
| OOD action queries | Never | Yes (for regularisation) |
| V(s) network needed | Yes | No |
| Auto-tune α | No | Yes |
| Best for | Stitching tasks | Dense-reward locomotion |

BCLoss: Behavior Cloning

BCLoss is the simplest approach: maximise the log-likelihood of expert actions, log π(a_expert | s). It is a natural first baseline to run before more complex offline RL algorithms, and it is often competitive on tasks with narrow behavior distributions.
from torchrl.objectives.bc import BCLoss
from torchrl.modules.tensordict_module.actors import Actor
from tensordict import TensorDict

# Works with any actor that has a get_dist() method
bc_loss = BCLoss(actor_network=actor, reduction="mean")

# The input tensordict must contain "observation" and "action"
batch = TensorDict({
    "observation": torch.randn(256, obs_dim),
    "action": action_spec.rand((256,)),
}, batch_size=[256])

loss_td = bc_loss(batch)
# loss_td keys: "loss_bc"
loss_td["loss_bc"].backward()
BCLoss is also compatible with non-TensorDict usage. Pass the actor’s in_keys plus "action" as keyword arguments:
loss_val = bc_loss(observation=obs_tensor, action=action_tensor)
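
For intuition, the objective can be written out in plain PyTorch. This is a minimal sketch rather than the BCLoss implementation: it swaps in a diagonal Gaussian policy (not the TanhNormal used earlier), and `bc_nll` is a hypothetical helper name.

```python
import torch
from torch.distributions import Normal


def bc_nll(loc: torch.Tensor, scale: torch.Tensor, expert_action: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of expert actions under a diagonal Gaussian policy."""
    # Sum log-probabilities over action dimensions, then average over the batch.
    log_prob = Normal(loc, scale).log_prob(expert_action).sum(dim=-1)
    return -log_prob.mean()


loc = torch.zeros(3, 2)
scale = torch.ones(3, 2)
expert_action = torch.zeros(3, 2)

loss_on_mode = bc_nll(loc, scale, expert_action)
loss_off_mode = bc_nll(loc, scale, expert_action + 1.0)
print(loss_on_mode.item(), loss_off_mode.item())
```

The loss is smallest when the policy mean sits on the expert action and grows as the two move apart.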

DTLoss: Decision Transformer

DTLoss implements “Decision Transformer: Reinforcement Learning via Sequence Modeling” (Chen et al. 2021). Instead of learning a Q-function, the Decision Transformer autoregressively models π(a | s, R, t) where R is a desired return-to-go, making it a supervised learning problem over offline trajectories.
from torchrl.objectives.decision_transformer import DTLoss

# actor_network must be a ProbabilisticTensorDictSequential that reads
# return-to-go, observation, and timestep, and outputs an action distribution
dt_loss = DTLoss(
    actor_network=dt_actor,
    loss_function="l2",    # L2 regression for continuous actions
    reduction="mean",
)

# Batch must contain: observation, action, return_to_go, timestep
loss_td = dt_loss(batch)
# loss_td keys: "loss"
For online fine-tuning with the Decision Transformer, use OnlineDTLoss, which adds an entropy bonus on top of the supervised loss:
from torchrl.objectives.decision_transformer import OnlineDTLoss

online_dt_loss = OnlineDTLoss(actor_network=dt_actor, alpha_init=0.1)
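
The return-to-go signal R that conditions the Decision Transformer is typically computed from the reward sequence before training. A minimal sketch for a single trajectory follows. `returns_to_go` is a hypothetical helper, not a TorchRL function, and the undiscounted default (gamma=1.0) follows the Decision Transformer paper.

```python
import torch


def returns_to_go(rewards: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Return-to-go R_t = r_t + gamma * R_{t+1} for a 1-D reward sequence."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    # Accumulate from the last step backwards.
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


rewards = torch.tensor([1.0, 2.0, 3.0])
print(returns_to_go(rewards))             # tensor([6., 5., 3.])
print(returns_to_go(rewards, gamma=0.5))  # tensor([2.7500, 3.5000, 3.0000])
```

At evaluation time the same quantity is set to a desired target return and decremented by each observed reward.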

Loading Offline Datasets

TorchRL wraps several widely used offline RL dataset formats (D4RL, Minari, Open X-Embodiment, and others) into TensorDictReplayBuffer subclasses via BaseDatasetExperienceReplay. Each class downloads and caches its data automatically.
from torchrl.data.datasets.d4rl import D4RLExperienceReplay

# Requires: pip install d4rl  (or set direct_download=True for no D4RL dependency)
dataset = D4RLExperienceReplay(
    dataset_id="halfcheetah-medium-v2",
    batch_size=256,
    split_trajs=False,      # True → pad trajectories to equal length
    direct_download=True,   # bypass d4rl, download raw HDF5 directly
)

batch = dataset.sample()
# batch follows TorchRL's convention of storing step outcomes under "next":
# "observation", "action", ("next", "observation"), ("next", "reward"),
# ("next", "done"), ("next", "terminated")
All dataset classes expose the same sample() interface and support transform stacks, prioritised sampling, and prefetching:
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.envs.transforms import RewardScaling

dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-v2",
    batch_size=256,
    prefetch=4,             # background prefetch threads
    transform=RewardScaling(loc=0.0, scale=1000.0),
)

TensorDictReplayBuffer for Offline Data

For custom offline datasets, use TensorDictReplayBuffer with LazyMemmapStorage to keep the full dataset on disk and load mini-batches on demand:
from torchrl.data import TensorDictReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler
import tempfile

# Memory-mapped storage — dataset stays on disk, zero-copy reads
storage = LazyMemmapStorage(
    max_size=1_000_000,
    scratch_dir=tempfile.mkdtemp(),
    device="cpu",
)

buffer = TensorDictReplayBuffer(
    storage=storage,
    sampler=RandomSampler(),
    batch_size=256,
    prefetch=4,
)

# Populate from a TensorDict whose leading dimension indexes transitions
buffer.extend(offline_dataset_td)

# Retrieve a training batch
batch = buffer.sample()
For offline RL with TD-error priorities, swap in PrioritizedSampler:
from torchrl.data.replay_buffers.samplers import PrioritizedSampler

buffer = TensorDictReplayBuffer(
    storage=LazyMemmapStorage(max_size=1_000_000),
    sampler=PrioritizedSampler(max_capacity=1_000_000, alpha=0.7, beta=0.5),
    batch_size=256,
)
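
As a rough guide to what alpha and beta control, the sketch below follows the standard prioritised experience replay formulation. It is not TorchRL's implementation, which may differ in details, and the helper names and the `eps` offset are assumptions for illustration.

```python
import torch


def sampling_probs(td_errors: torch.Tensor, alpha: float = 0.7, eps: float = 1e-8) -> torch.Tensor:
    """P(i) proportional to (|td_error_i| + eps) ** alpha; alpha=0 gives uniform sampling."""
    priorities = (td_errors.abs() + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs: torch.Tensor, beta: float = 0.5) -> torch.Tensor:
    """Weights (N * P(i)) ** -beta, normalised so the largest weight is 1."""
    weights = (probs.numel() * probs) ** (-beta)
    return weights / weights.max()


td_errors = torch.tensor([0.1, 0.5, 2.0])
probs = sampling_probs(td_errors)
weights = importance_weights(probs)

print(probs)    # larger TD error -> sampled more often
print(weights)  # ...and down-weighted more in the loss
```

Higher alpha concentrates sampling on high-error transitions, while higher beta corrects more strongly for the resulting sampling bias.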

Full Offline RL Training Loop

1. Load the Dataset
from torchrl.data.datasets.d4rl import D4RLExperienceReplay

dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-expert-v2",
    batch_size=256,
    direct_download=True,
)
2. Build Networks and Loss
from torchrl.objectives.iql import IQLLoss

loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    temperature=3.0,
    expectile=0.7,
)
loss_fn.make_value_estimator(gamma=0.99)

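# Optimise the parameters held by the loss module: it keeps its own (stacked)
# copies of the Q-networks, so qvalue.parameters() may not cover all of them
actor_opt = torch.optim.Adam(
    list(loss_fn.actor_network_params.flatten_keys().values()), lr=3e-4
)
critic_opt = torch.optim.Adam(
    list(loss_fn.qvalue_network_params.flatten_keys().values())
    + list(loss_fn.value_network_params.flatten_keys().values()),
    lr=3e-4,
)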
3. Train
from torchrl.objectives import SoftUpdate

# Polyak-averaged target Q-networks: target <- (1 - tau) * target + tau * online
target_updater = SoftUpdate(loss_fn, tau=0.005)

for step in range(1_000_000):
    batch = dataset.sample()
    loss_td = loss_fn(batch)

    # Sum the three IQL losses and backpropagate once
    actor_opt.zero_grad()
    critic_opt.zero_grad()
    total_loss = loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_value"]
    total_loss.backward()
    actor_opt.step()
    critic_opt.step()

    # Soft-update target networks (runs without tracking gradients)
    target_updater.step()
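
The soft target update at the end of the loop is plain Polyak averaging. A minimal sketch on ordinary nn.Module objects, independent of TorchRL's parameter containers, could look like this; `soft_update` is a hypothetical helper name.

```python
import torch
from torch import nn


@torch.no_grad()
def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005) -> None:
    """In-place update: target <- (1 - tau) * target + tau * source."""
    for t_param, s_param in zip(target.parameters(), source.parameters()):
        t_param.lerp_(s_param, tau)


online = nn.Linear(4, 1)
target = nn.Linear(4, 1)

# Make the effect easy to inspect: online weights are all 1, target weights all 0.
for p in online.parameters():
    nn.init.constant_(p, 1.0)
for p in target.parameters():
    nn.init.constant_(p, 0.0)

soft_update(target, online, tau=0.005)
print(target.weight)  # every entry moved 0.5% of the way towards the online value
```

Running the update under torch.no_grad() matters: without it, the in-place interpolation would be recorded in the autograd graph.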
Offline datasets can contain millions of transitions. Always pre-compute observation normalisation statistics from the full dataset before training — fitting a running normaliser only on mini-batches can be highly biased for offline data.
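
A minimal sketch of that pre-computation is shown below, assuming the observations fit in a single tensor. `normalization_stats` and `normalize` are hypothetical helpers, and for datasets too large for memory the same statistics would have to be accumulated in chunks.

```python
import torch


def normalization_stats(observations: torch.Tensor, eps: float = 1e-6):
    """Per-feature mean and standard deviation over the whole dataset."""
    loc = observations.mean(dim=0)
    # Clamp to avoid dividing by zero for constant features.
    scale = observations.std(dim=0).clamp_min(eps)
    return loc, scale


def normalize(obs: torch.Tensor, loc: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (obs - loc) / scale


torch.manual_seed(0)
# Stand-in for the full offline dataset: 10,000 observations with 3 features.
all_obs = torch.randn(10_000, 3) * torch.tensor([1.0, 10.0, 100.0]) + 5.0

loc, scale = normalization_stats(all_obs)
normalized = normalize(all_obs, loc, scale)
print(normalized.mean(dim=0), normalized.std(dim=0))
```

The resulting `loc` and `scale` are then fixed for the whole run and applied to every sampled batch, and to observations at evaluation time.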

Key Imports Reference

# Loss modules
from torchrl.objectives.iql import IQLLoss
from torchrl.objectives.cql import CQLLoss
from torchrl.objectives.bc import BCLoss
from torchrl.objectives.decision_transformer import DTLoss, OnlineDTLoss

# Offline datasets
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.data.datasets.atari_dqn import AtariDQNExperienceReplay
from torchrl.data.datasets.lerobot import LeRobotExperienceReplay
from torchrl.data.datasets.minari_data import MinariExperienceReplay
from torchrl.data.datasets.openx import OpenXExperienceReplay
from torchrl.data.datasets.vd4rl import VD4RLExperienceReplay

# Replay buffers and storage
from torchrl.data import TensorDictReplayBuffer, TensorDictPrioritizedReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage, LazyTensorStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler
