TorchRL is a PyTorch-native toolkit for reinforcement learning, decision making, robotics, and simulation. Rather than a single algorithm implementation or a narrow benchmark suite, it is a collection of composable pieces for building RL systems while keeping code close to the PyTorch programming model. The library is built around three core design principles. First, data should carry names, structure, batch dimensions, and devices all the way through the training loop. Second, environments, policies, replay buffers, objectives, and collectors should be independent modules that can be swapped without rewriting the rest of the stack. Third, research code should scale from a local prototype to vectorized, multiprocess, distributed, compiled, recurrent, multi-agent, model-based, or offline workflows without changing the data model.

The TensorDict Data Model

The common thread running through every TorchRL component is TensorDict — a dictionary-like tensor container with full PyTorch operation support, device transfers, shared-memory backing, memmap storage, lazy views, and nn.Module wrappers. RL code traditionally accumulates special cases: tuples from one environment, dicts from another, separate arrays for recurrent states, and loss functions that silently assume a particular batch layout. TorchRL uses TensorDict to make those assumptions explicit. A single TensorDict flows through every component in the pipeline:
TensorDict
  -> policy module writes actions and log-probs
  -> environment reads actions and writes next observations, rewards, done flags
  -> collector batches trajectories from one or many workers
  -> replay buffer stores, samples, prioritizes, and transforms data
  -> loss module reads named keys and writes differentiable losses
  -> optimizer updates ordinary PyTorch parameters
Because every component speaks the same language, they compose without glue code. TensorDict supports standard tensor operations while preserving named fields and nested structure:
# These operations preserve structure and operate on every compatible value.
batch = torch.stack(list_of_tensordicts, dim=0)
batch = batch.reshape(-1)
batch = batch.to("cuda")
mini_batch = batch[:128]

# Nested keys make multi-agent, recurrent, and next-state data explicit.
reward = batch["next", "reward"]
agent_obs = batch["agents", "observation"]
hidden = batch["recurrent_state", "h"]
TensorDict is a separate package (pip install tensordict) and is a required dependency of TorchRL. It is the reason TorchRL components interoperate: a collector emits a TensorDict, a replay buffer stores it without losing structure, a transform adds or removes keys, and a loss reads exactly the keys it needs.

Library Structure

TorchRL is organized around six major subsystems, each independently importable and designed to be swapped without touching the rest of the stack.

Environments & Transforms

Native PyTorch environments, wrappers for Gymnasium, DM Control, Brax, MuJoCo, PettingZoo, VMAS, Isaac Lab, and more. Vectorized containers and a composable transform stack for observation normalization, reward shaping, action scaling, frame stacking, and auto-reset.

Collectors

The bridge between policies and environments. Single-process, async, multiprocess, and distributed collectors own the execution loop, batch trajectories, manage devices, and can synchronize policy weights across workers while environments keep running.

Replay Buffers & Data

Modular replay buffers with pluggable storage (in-memory, memmap), samplers (uniform, prioritized, without-replacement), writers, and transforms. Supports HER, offline datasets, CUDA-aware prioritized sampling, and large-scale distributed data movement.

Modules & Policies

Actors, critics, actor-critic operators, recurrent modules, distribution wrappers, and exploration modules — all as ordinary nn.Module instances with explicit TensorDict input/output key contracts. Includes ProbabilisticActor, ValueOperator, MLP, TanhNormal, and more.

Objectives & Returns

Loss modules for PPO, SAC, DQN, TD3, REDQ, IQL, CQL, DDPG, REINFORCE, Dreamer, Decision Transformer, GAIL, behavior cloning, MAPPO, IPPO, QMIX/VDN, and GRPO. Value estimators include GAE, TD(λ), V-trace, and MultiAgentGAE.
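
For readers unfamiliar with GAE, the recursion can be written out in plain PyTorch. This is a hand-written sketch for a single trajectory, not TorchRL's estimator: the function name `gae_sketch` and its signature are hypothetical, and it ignores truncation, which a full implementation treats separately from termination.

```python
import torch


def gae_sketch(reward, value, next_value, terminated, gamma=0.99, lmbda=0.95):
    """Generalized advantage estimation over one trajectory of shape [T]."""
    not_term = 1.0 - terminated.float()
    # One-step TD error; no bootstrapping from a terminal next state.
    delta = reward + gamma * next_value * not_term - value
    advantage = torch.zeros_like(reward)
    running = torch.zeros(())
    # Accumulate discounted TD errors backwards in time.
    for t in reversed(range(reward.shape[0])):
        running = delta[t] + gamma * lmbda * not_term[t] * running
        advantage[t] = running
    return advantage


# With zero values and gamma = lmbda = 1, the advantage is the reward-to-go.
reward = torch.ones(3)
value = torch.zeros(3)
next_value = torch.zeros(3)
terminated = torch.tensor([False, False, True])
adv = gae_sketch(reward, value, next_value, terminated, gamma=1.0, lmbda=1.0)
```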

Multi-Agent & LLM

Multi-agent data lives under nested TensorDict keys. Dedicated objectives (MAPPO, IPPO, QMIX, VDN) and value normalization utilities. LLM post-training support via GRPO objectives, Hugging Face/vLLM/SGLang integrations, and async collectors.

Key Features

  • TensorDict-first pipelines — every component reads and writes named tensor fields; no silent shape assumptions or hidden data conventions
  • Broad environment coverage — PyTorch-native environments plus wrappers for Gymnasium, DM Control, Brax, Jumanji, PettingZoo, VMAS, OpenSpiel, Safety-Gymnasium, Isaac Lab, MuJoCo, and more
  • Composable transforms — first-class transform modules that run on-device, participate in specs, and can be inserted or removed without wrapping the entire environment
  • Flexible collectors — Collector, MultiSyncCollector, MultiAsyncCollector, and distributed variants for everything from smoke tests to GPU-heavy simulation farms
  • Modular replay buffers — separate storage, sampler, writer, and transform pieces; supports prioritized replay, HER, memmap-backed storage, and offline datasets
  • Rich policy modules — ProbabilisticActor, TanhNormal, ValueOperator, recurrent LSTM/GRU wrappers, and exploration modules with explicit in/out key contracts
  • Complete objectives library — PPO, SAC, DQN, TD3, REDQ, IQL, CQL, Dreamer/DreamerV3, Decision Transformer, behavior cloning, and multi-agent objectives
  • Multi-agent support — MAPPO, IPPO, MultiAgentGAE, value normalization, VMAS and PettingZoo environments, and nested TensorDict conventions for agent groups
  • LLM post-training — GRPO and SFT objectives, conversation containers, Hugging Face/vLLM/SGLang integration, async collectors, and weight-update helpers
  • SOTA implementations — readable, research-ready implementations of PPO, SAC, DQN, TD3, REDQ, IMPALA, CrossQ, GAIL, Dreamer, Decision Transformer, GRPO, and multi-agent algorithms
  • PyTorch-native performance — torch.compile compatibility, CUDA-aware replay, Triton/scan recurrent backends, vectorized return computation, and distributed execution
A quick local rollout shows how few lines of code it takes to get started:
import torch
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.envs import PendulumEnv, StepCounter, TransformedEnv

# A PyTorch-native environment with a transform stack.
env = TransformedEnv(PendulumEnv(), StepCounter(max_steps=200))

# Policies are regular nn.Modules with explicit TensorDict key contracts.
policy = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(64),
        nn.Tanh(),
        nn.Linear(64, 1),
        nn.Tanh(),
    ),
    in_keys=["observation"],
    out_keys=["action"],
)

rollout = env.rollout(max_steps=32, policy=policy)
assert rollout.batch_size == torch.Size([32])
assert rollout["next", "reward"].shape[:1] == torch.Size([32])
The same keys-and-TensorDict interface is used by batched environments, multi-agent tasks, collectors, replay buffers, recurrent modules, transforms, and losses — nothing here is specific to PendulumEnv.

Where to Start

Choose a starting point based on what you want to accomplish:
Goal | Where to look
Build and run a first training loop | Quickstart: PPO on a Gym environment end to end
Understand environment wrappers and transforms | Environment concepts and the TransformedEnv / GymEnv API
Scale data collection across processes | Collector, MultiSyncCollector, and distributed collector guides
Store and sample large datasets | Replay buffer guide covering prioritized, HER, and offline storage
Write a custom stochastic policy | ProbabilisticActor, TanhNormal, and the modules reference
Train a multi-agent system | MAPPO/IPPO objectives, MultiAgentGAE, and the VMAS/PettingZoo wrappers
Fine-tune a language model with RL | GRPO objective and LLM post-training reference
Browse complete algorithm implementations | sota-implementations/ for PPO, SAC, DQN, TD3, Dreamer, and more

Quickstart

Train a continuous-control agent with PPO in under 10 minutes.

Installation

Install stable, nightly, or CUDA builds and optional environment dependencies.
