Offline reinforcement learning trains a policy entirely from a fixed, pre-collected dataset — no further environment interaction is allowed during learning. This setting removes the need for a simulator at train time and makes RL practical for domains where environment resets are expensive or unsafe, such as healthcare, robotics, and industrial control. TorchRL provides a complete offline stack: dataset loaders for D4RL, Atari DQN, LeRobot, Minari, and Open X-Embodiment; TensorDictReplayBuffer for flexible offline data management; and dedicated loss modules for IQL, CQL, behavior cloning, and the Decision Transformer.
Why Offline RL Is Hard
Standard Q-learning can diverge on offline data because the Bellman backup queries Q-values for out-of-distribution (OOD) actions, that is, actions the behavior policy never took. No data ever corrects those estimates, so the learned Q-function tends to overestimate them, and the policy exploits the errors unless an explicit constraint keeps it close to the data distribution. The three main families of solutions are:
- Conservative value functions (CQL): penalise Q-values for OOD actions during training.
- Implicit constraints (IQL): avoid querying OOD actions altogether by replacing max-Q with an expectile regression.
- Supervised cloning (BC, DT): bypass Q-learning entirely and treat offline RL as sequence or regression modelling.
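The overestimation effect described above can be reproduced without any RL machinery. The sketch below is a toy illustration, not TorchRL code: it assumes every action has a true Q-value of 0 and that the estimates carry zero-mean Gaussian noise. The function name `max_q_bias` and all of its parameters are hypothetical.

```python
import random


def max_q_bias(num_actions: int, noise_std: float, trials: int = 2000, seed: int = 0) -> float:
    """Average of max_a Q_hat(s, a) when the true Q-value of every action is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Each estimate is the true value (0.0) plus zero-mean noise.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        # The Bellman backup takes the max, which picks up the largest positive error.
        total += max(estimates)
    return total / trials


# With a single action the max is unbiased; with many unconstrained actions it is not.
bias_one_action = max_q_bias(num_actions=1, noise_std=1.0)
bias_many_actions = max_q_bias(num_actions=20, noise_std=1.0)
print(f"1 action: {bias_one_action:+.3f}   20 actions: {bias_many_actions:+.3f}")
```

In online RL the agent would eventually try the over-valued actions and correct the estimate. With a fixed dataset that correction never happens, which is why each of the three families above constrains the backup in some way.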
IQLLoss: Implicit Q-Learning
IQLLoss implements the algorithm from “Offline Reinforcement Learning with Implicit Q-Learning” (Kostrikov et al. 2021). It introduces a separate value_network V(s) trained with expectile regression instead of max-Q bootstrapping, so the critic never queries the policy for OOD actions.
import torch
from torch import nn
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.data import Bounded
from torchrl.objectives.iql import IQLLoss
# Actor: stochastic Gaussian policy (obs_dim and action_dim are assumed to be defined)
actor_net = nn.Sequential(
    nn.Linear(obs_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 2 * action_dim),
    NormalParamExtractor(),  # splits the last layer into "loc" and a positive "scale"
)
actor = ProbabilisticActor(
    module=SafeModule(actor_net, in_keys=["observation"], out_keys=["loc", "scale"]),
    in_keys=["loc", "scale"],
    out_keys=["action"],
    distribution_class=TanhNormal,
    return_log_prob=True,
)
# Critic: Q(s, a). ValueOperator passes one tensor per in_key, so the module
# concatenates observation and action itself.
class QValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, observation, action):
        return self.net(torch.cat([observation, action], dim=-1))

qvalue = ValueOperator(module=QValueNet(), in_keys=["observation", "action"])
# Value: V(s)
value_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 1))
value = ValueOperator(module=value_net, in_keys=["observation"])
loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    num_qvalue_nets=2,  # double Q trick
    temperature=3.0,  # inverse temperature β: higher → closer to max-Q
    expectile=0.7,  # τ: larger values needed for stitching tasks (e.g. AntMaze)
    loss_function="smooth_l1",
)
# Training step
batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_qvalue", "loss_value"
total_loss = loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_value"]
total_loss.backward()
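The `temperature` argument above controls how sharply the actor favours high-advantage dataset actions. The following is a minimal plain-Python sketch of the advantage-weighted objective from the IQL paper, not the TorchRL implementation. The helper names and the clipping value of 100 are assumptions made for illustration.

```python
import math


def awr_weights(q_values, v_values, temperature, max_weight=100.0):
    """exp(temperature * (Q - V)) per transition, clipped for numerical stability."""
    return [
        min(math.exp(temperature * (q - v)), max_weight)
        for q, v in zip(q_values, v_values)
    ]


def weighted_bc_loss(log_probs, weights):
    """Negative weighted log-likelihood of the dataset actions."""
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


q = [1.0, 0.5, 0.0]  # Q(s, a) for three dataset transitions
v = [0.5, 0.5, 0.5]  # V(s) for the same states
weights_soft = awr_weights(q, v, temperature=1.0)
weights_sharp = awr_weights(q, v, temperature=3.0)
print(weights_soft, weights_sharp)
```

A larger temperature widens the gap between the weights of good and bad dataset actions, so the extracted policy moves from plain behavior cloning toward greedy action selection while still only evaluating actions that appear in the data.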
For AntMaze tasks that require “stitching” together sub-optimal trajectories, set expectile=0.9 or higher. For simpler locomotion datasets, expectile=0.7 with temperature=3.0 is a good starting point.
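To see what the expectile does, it helps to fit a single scalar V to a handful of Q targets. The sketch below is plain Python and independent of TorchRL. The function names, learning rate, and step count are illustrative assumptions.

```python
def expectile_loss(q_values, v_values, expectile):
    """Mean of |expectile - 1(u < 0)| * u**2 with u = Q - V."""
    total = 0.0
    for q, v in zip(q_values, v_values):
        u = q - v
        weight = expectile if u > 0 else 1.0 - expectile
        total += weight * u * u
    return total / len(q_values)


def fit_expectile(targets, expectile, steps=2000, lr=0.05):
    """Gradient descent on one scalar V that minimises the expectile loss."""
    v = sum(targets) / len(targets)
    for _ in range(steps):
        grad = 0.0
        for q in targets:
            u = q - v
            weight = expectile if u > 0 else 1.0 - expectile
            grad += -2.0 * weight * u
        v -= lr * grad / len(targets)
    return v


targets = [0.0, 1.0, 2.0, 10.0]  # Q-values of dataset actions in one state
v_mean = fit_expectile(targets, expectile=0.5)  # recovers the mean
v_low = fit_expectile(targets, expectile=0.7)
v_high = fit_expectile(targets, expectile=0.9)  # moves toward the best action
print(v_mean, v_low, v_high)
```

With expectile=0.5 the loss is symmetric and V converges to the mean of the targets. As the expectile approaches 1, V approaches the maximum over actions that actually appear in the dataset, which is how IQL approximates max-Q without evaluating OOD actions.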
CQLLoss: Conservative Q-Learning
CQLLoss implements “Conservative Q-Learning for Offline Reinforcement Learning” (Kumar et al. 2020). It adds a CQL regularisation term that penalises Q-values for actions sampled from the current policy (rather than from the dataset), keeping the critic conservative over the offline data distribution.
from torchrl.objectives.cql import CQLLoss
loss_fn = CQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    loss_function="smooth_l1",
    alpha_init=1.0,  # initial entropy multiplier
    fixed_alpha=False,  # auto-tune α to match target entropy
    target_entropy="auto",  # -prod(action_dims)
    delay_qvalue=True,  # separate target Q network
    temperature=1.0,  # CQL temperature for action sampling
)
batch = replay_buffer.sample(256)
loss_td = loss_fn(batch)
# loss_td keys: "loss_actor", "loss_actor_bc", "loss_qvalue", "loss_alpha", "loss_cql"
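The conservative term can be understood from a single state. The sketch below is a simplified plain-Python version of the log-sum-exp regulariser from the CQL paper. It omits the importance weights and the mix of random and policy actions used in practice, and the function names are hypothetical rather than part of TorchRL.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_penalty(q_sampled, q_dataset, temperature=1.0):
    """temperature * logsumexp(Q_sampled / temperature) - Q_dataset for one state."""
    scaled = [q / temperature for q in q_sampled]
    return temperature * logsumexp(scaled) - q_dataset


# Q-values of three sampled (possibly OOD) actions versus the dataset action.
penalty_overestimated = cql_penalty(q_sampled=[5.0, 4.0, 6.0], q_dataset=1.0)
penalty_conservative = cql_penalty(q_sampled=[0.0, 0.0, 0.0], q_dataset=1.0)
print(penalty_overestimated, penalty_conservative)
```

Minimising this term pushes down the Q-values of sampled actions (a soft maximum) and pushes up the Q-value of the action stored in the dataset, so the penalty is large exactly when the critic ranks unseen actions above the data.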
CQL vs. IQL at a Glance
| Property | IQL | CQL |
|---|---|---|
| OOD action queries | Never | Yes (for regularisation) |
| V(s) network needed | Yes | No |
| Auto-tune α | No | Yes |
| Best for | Stitching tasks | Dense-reward locomotion |
BCLoss: Behavior Cloning
BCLoss is the simplest approach: maximise the log-likelihood of expert actions log π(a_expert | s). It is the right baseline before applying more complex offline RL algorithms, and is often competitive on tasks with narrow behavior distributions.
from torchrl.objectives.bc import BCLoss
from torchrl.modules.tensordict_module.actors import Actor
from tensordict import TensorDict
# Works with any actor that has a get_dist() method
bc_loss = BCLoss(actor_network=actor, reduction="mean")
# The input tensordict must contain "observation" and "action".
# action_spec is assumed to be the environment's action spec (for example a Bounded spec).
batch = TensorDict({
    "observation": torch.randn(256, obs_dim),
    "action": action_spec.rand((256,)),
}, batch_size=[256])
loss_td = bc_loss(batch)
# loss_td keys: "loss_bc"
loss_td["loss_bc"].backward()
BCLoss is also compatible with non-TensorDict usage. Pass the actor’s in_keys plus "action" as keyword arguments:
loss_val = bc_loss(observation=obs_tensor, action=action_tensor)
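The quantity being minimised is a negative log-likelihood. The sketch below computes it by hand for a diagonal Gaussian policy in plain Python. It assumes an untransformed Gaussian rather than the TanhNormal used earlier, and `gaussian_log_prob` and `bc_nll` are illustrative names, not TorchRL functions.

```python
import math


def gaussian_log_prob(action, loc, scale):
    """Log-density of a diagonal Gaussian evaluated at `action`."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, loc, scale)
    )


def bc_nll(expert_actions, locs, scales):
    """Mean negative log-likelihood of the expert actions under the policy."""
    nll = [-gaussian_log_prob(a, m, s) for a, m, s in zip(expert_actions, locs, scales)]
    return sum(nll) / len(nll)


expert_actions = [[0.5, -0.2], [0.1, 0.3]]
scales = [[0.5, 0.5], [0.5, 0.5]]
# A policy whose mean matches the expert has a lower loss than one that outputs zeros.
loss_matching = bc_nll(expert_actions, locs=expert_actions, scales=scales)
loss_zero_mean = bc_nll(expert_actions, locs=[[0.0, 0.0], [0.0, 0.0]], scales=scales)
print(loss_matching, loss_zero_mean)
```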
DTLoss: Decision Transformer
DTLoss implements “Decision Transformer: Reinforcement Learning via Sequence Modeling” (Chen et al. 2021). Instead of learning a Q-function, the Decision Transformer autoregressively models π(a | s, R, t), where R is a desired return-to-go and t is the timestep, which turns offline RL into a supervised learning problem over offline trajectories.
from torchrl.objectives.decision_transformer import DTLoss
# actor_network must be a ProbabilisticTensorDictSequential that reads
# return-to-go, observation, and timestep, and outputs an action distribution
dt_loss = DTLoss(
    actor_network=dt_actor,
    loss_function="l2",  # L2 regression for continuous actions
    reduction="mean",
)
# Batch must contain: observation, action, return_to_go, timestep
loss_td = dt_loss(batch)
# loss_td keys: "loss"
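The return-to-go that conditions the model is the sum of rewards from the current step to the end of the trajectory. A minimal plain-Python sketch follows. The function name and the optional `gamma` argument are assumptions (the Decision Transformer paper uses undiscounted returns, which corresponds to gamma=1.0).

```python
def returns_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over one trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


rewards = [1.0, 0.0, 2.0]
print(returns_to_go(rewards))  # [3.0, 2.0, 2.0]
```

At evaluation time the first return-to-go is set to the return you want the policy to achieve, and it is decremented by each observed reward as the episode unfolds.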
For online fine-tuning with the Decision Transformer, use OnlineDTLoss, which adds an entropy bonus on top of the supervised loss:
from torchrl.objectives.decision_transformer import OnlineDTLoss
online_dt_loss = OnlineDTLoss(actor_network=dt_actor, alpha_init=0.1)
Loading Offline Datasets
TorchRL wraps all major offline RL dataset formats into TensorDictReplayBuffer subclasses via BaseDatasetExperienceReplay. Each class downloads and caches data automatically.
D4RL
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
# Requires: pip install d4rl (or set direct_download=True for no D4RL dependency)
dataset = D4RLExperienceReplay(
    dataset_id="halfcheetah-medium-v2",
    batch_size=256,
    split_trajs=False,  # True → pad trajectories to equal length
    direct_download=True,  # bypass d4rl, download raw HDF5 directly
)
batch = dataset.sample()
# batch keys include: "observation", "action", "done", "terminated",
# ("next", "observation"), ("next", "reward")
Atari DQN
from torchrl.data.datasets.atari_dqn import AtariDQNExperienceReplay
dataset = AtariDQNExperienceReplay(
    dataset_id="Pong/1",  # game/run_id
    batch_size=32,
    num_slices=8,  # number of trajectory slices per sample
)
batch = dataset.sample()
LeRobot
from torchrl.data.datasets.lerobot import LeRobotExperienceReplay
dataset = LeRobotExperienceReplay(
    dataset_id="lerobot/pusht",
    batch_size=64,
    num_slices=8,  # slice samplers need batch_size to be divisible by num_slices
)
batch = dataset.sample()
Minari
from torchrl.data.datasets.minari_data import MinariExperienceReplay
# Requires: pip install minari
dataset = MinariExperienceReplay(
    dataset_id="door-human-v0",
    batch_size=256,
    split_trajs=True,
)
batch = dataset.sample()
All dataset classes expose the same sample() interface and support transform stacks, prioritised sampling, and prefetching:
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.envs.transforms import RewardScaling
dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-v2",
    batch_size=256,
    prefetch=4,  # number of batches prefetched in background threads
    transform=RewardScaling(loc=0.0, scale=1000.0),
)
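The `num_slices` argument accepted by the Atari DQN and LeRobot loaders above returns contiguous sub-trajectories instead of independent transitions. The sketch below illustrates the idea in plain Python. It is not TorchRL's sampler, and `sample_slices` and its arguments are hypothetical names.

```python
import random


def sample_slices(episode_lengths, batch_size, num_slices, seed=0):
    """Return (episode_index, start, stop) triples that cover batch_size steps."""
    if batch_size % num_slices != 0:
        raise ValueError("batch_size must be divisible by num_slices")
    slice_len = batch_size // num_slices
    rng = random.Random(seed)
    # Only episodes at least as long as one slice can be sampled from.
    eligible = [i for i, n in enumerate(episode_lengths) if n >= slice_len]
    if not eligible:
        raise ValueError("no episode is long enough for one slice")
    slices = []
    for _ in range(num_slices):
        ep = rng.choice(eligible)
        start = rng.randint(0, episode_lengths[ep] - slice_len)
        slices.append((ep, start, start + slice_len))
    return slices


episode_lengths = [100, 3, 50]
slices = sample_slices(episode_lengths, batch_size=32, num_slices=8)
print(slices)  # 8 slices of 4 consecutive steps each
```

Each slice has length batch_size // num_slices, so the two arguments together fix how much temporal context a sequence model or an n-step return sees per sample.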
TensorDictReplayBuffer for Offline Data
For custom offline datasets, use TensorDictReplayBuffer with LazyMemmapStorage to keep the full dataset on disk and load mini-batches on demand:
from torchrl.data import TensorDictReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler
import tempfile
# Memory-mapped storage: dataset stays on disk, zero-copy reads
storage = LazyMemmapStorage(
    max_size=1_000_000,
    scratch_dir=tempfile.mkdtemp(),
    device="cpu",
)
buffer = TensorDictReplayBuffer(
    storage=storage,
    sampler=RandomSampler(),
    batch_size=256,
    prefetch=4,
)
# Populate from a TensorDict whose leading dimension indexes transitions
# (offline_dataset_td is assumed to be built from your own data)
buffer.extend(offline_dataset_td)
# Retrieve a training batch
batch = buffer.sample()
For offline RL with TD-error priorities, swap in PrioritizedSampler:
from torchrl.data.replay_buffers.samplers import PrioritizedSampler
buffer = TensorDictReplayBuffer(
    storage=LazyMemmapStorage(max_size=1_000_000),
    sampler=PrioritizedSampler(max_capacity=1_000_000, alpha=0.7, beta=0.5),
    batch_size=256,
)
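The roles of `alpha` and `beta` follow the standard prioritised experience replay formulation. The plain-Python sketch below shows the arithmetic only. It is not the TorchRL sampler, and the function names are illustrative.

```python
def sampling_probabilities(priorities, alpha):
    """P(i) = p_i**alpha / sum_k p_k**alpha; alpha=0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probabilities, beta):
    """w_i = (N * P(i))**(-beta), normalised so that the largest weight is 1."""
    n = len(probabilities)
    raw = [(n * p) ** (-beta) for p in probabilities]
    max_w = max(raw)
    return [w / max_w for w in raw]


priorities = [1.0, 1.0, 2.0]  # for example, absolute TD errors
probs = sampling_probabilities(priorities, alpha=1.0)
weights = importance_weights(probs, beta=1.0)
print(probs, weights)
```

Transitions with a large TD error are sampled more often, and the importance weights scale their loss down again so that the expected gradient stays close to the one obtained with uniform sampling.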
Full Offline RL Training Loop
import torch
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.objectives import SoftUpdate
from torchrl.objectives.iql import IQLLoss

# actor, qvalue, and value are the modules built in the IQLLoss section
dataset = D4RLExperienceReplay(
    dataset_id="hopper-medium-expert-v2",
    batch_size=256,
    direct_download=True,
)
loss_fn = IQLLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,
    temperature=3.0,
    expectile=0.7,
)
loss_fn.make_value_estimator(gamma=0.99)
# Soft (Polyak) update of the target Q-network with rate tau
target_updater = SoftUpdate(loss_fn, tau=0.005)
# Optimise the parameters held by the loss module, since those are the ones its forward pass uses
actor_opt = torch.optim.Adam(
    list(loss_fn.actor_network_params.flatten_keys().values()), lr=3e-4
)
critic_opt = torch.optim.Adam(
    list(loss_fn.qvalue_network_params.flatten_keys().values())
    + list(loss_fn.value_network_params.flatten_keys().values()),
    lr=3e-4,
)
for step in range(1_000_000):
    batch = dataset.sample()
    loss_td = loss_fn(batch)
    actor_opt.zero_grad()
    critic_opt.zero_grad()
    # One backward pass fills the gradients of all three networks
    (loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_value"]).backward()
    actor_opt.step()
    critic_opt.step()
    # Soft-update target networks
    target_updater.step()
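The soft update at the end of the loop above is Polyak averaging of the target parameters toward the online parameters. A minimal plain-Python sketch of that arithmetic, using dictionaries of floats as stand-ins for parameter tensors (the names are hypothetical):

```python
def soft_update(target_params, online_params, tau):
    """In-place Polyak averaging: target <- (1 - tau) * target + tau * online."""
    for name, online_value in online_params.items():
        target_params[name] += tau * (online_value - target_params[name])


online = {"w": 1.0}
target = {"w": 0.0}
soft_update(target, online, tau=0.005)
after_one_step = target["w"]

# Repeating the update moves the target geometrically toward the online value.
for _ in range(999):
    soft_update(target, online, tau=0.005)
after_thousand_steps = target["w"]
print(after_one_step, after_thousand_steps)
```

With tau=0.005 the target still lags the online network noticeably after a thousand updates, and that lag is what keeps the bootstrapped Q targets stable.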
Offline datasets can contain millions of transitions. Always pre-compute observation normalisation statistics from the full dataset before training — fitting a running normaliser only on mini-batches can be highly biased for offline data.
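A minimal sketch of that pre-computation, in plain Python over a list of observation vectors. The function names and the `eps` guard are assumptions, and a real dataset would use vectorised tensor operations instead of Python loops.

```python
import math


def normalisation_stats(observations, eps=1e-6):
    """Per-dimension mean and standard deviation over the whole dataset."""
    n = len(observations)
    dim = len(observations[0])
    mean = [sum(obs[d] for obs in observations) / n for d in range(dim)]
    var = [sum((obs[d] - mean[d]) ** 2 for obs in observations) / n for d in range(dim)]
    # eps keeps constant dimensions from producing a division by zero.
    std = [math.sqrt(v) + eps for v in var]
    return mean, std


def normalise(obs, mean, std):
    """Apply the fixed statistics to one observation."""
    return [(x - m) / s for x, m, s in zip(obs, mean, std)]


dataset_obs = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]  # second dimension is constant
mean, std = normalisation_stats(dataset_obs)
print(normalise([2.0, 10.0], mean, std))  # [0.0, 0.0]
```

Because the statistics are computed once and then frozen, every mini-batch and every evaluation rollout sees the same normalisation.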
Key Imports Reference
# Loss modules
from torchrl.objectives.iql import IQLLoss
from torchrl.objectives.cql import CQLLoss
from torchrl.objectives.bc import BCLoss
from torchrl.objectives.decision_transformer import DTLoss, OnlineDTLoss
# Offline datasets
from torchrl.data.datasets.d4rl import D4RLExperienceReplay
from torchrl.data.datasets.atari_dqn import AtariDQNExperienceReplay
from torchrl.data.datasets.lerobot import LeRobotExperienceReplay
from torchrl.data.datasets.minari_data import MinariExperienceReplay
from torchrl.data.datasets.openx import OpenXExperienceReplay
from torchrl.data.datasets.vd4rl import VD4RLExperienceReplay
# Replay buffers and storage
from torchrl.data import TensorDictReplayBuffer, TensorDictPrioritizedReplayBuffer
from torchrl.data.replay_buffers.storages import LazyMemmapStorage, LazyTensorStorage
from torchrl.data.replay_buffers.samplers import RandomSampler, PrioritizedSampler