TorchRL is a PyTorch-native toolkit for reinforcement learning, decision making, robotics, and simulation. Rather than a single algorithm implementation or a narrow benchmark suite, it is a collection of composable pieces for building RL systems while keeping code close to the PyTorch programming model. The library is built around three core design principles: data should have names, structure, batch dimensions, and devices all the way through the training loop; environments, policies, replay buffers, objectives, and collectors should be independent modules that can be swapped without rewriting the rest of the stack; and research code should scale from a local prototype to vectorized, multiprocess, distributed, compiled, recurrent, multi-agent, model-based, or offline workflows without changing the data model.
The TensorDict Data Model
The common thread running through every TorchRL component is TensorDict: a dictionary-like tensor container with full PyTorch operation support, device transfers, shared-memory backing, memmap storage, lazy views, and nn.Module wrappers.
RL code traditionally accumulates special cases: tuples from one environment, dicts from another, separate arrays for recurrent states, and loss functions that silently assume a particular batch layout. TorchRL uses TensorDict to make those assumptions explicit. A single TensorDict flows through every component in the pipeline.
TensorDict is a separate package (pip install tensordict) and is a required dependency of TorchRL. It is the reason TorchRL components interoperate: a collector emits a TensorDict, a replay buffer stores it without losing structure, a transform adds or removes keys, and a loss reads exactly the keys it needs.
Library Structure
TorchRL is organized around six major subsystems, each independently importable and designed to be swapped without touching the rest of the stack.
Environments & Transforms
Native PyTorch environments, wrappers for Gymnasium, DM Control, Brax, MuJoCo, PettingZoo, VMAS, Isaac Lab, and more. Vectorized containers and a composable transform stack for observation normalization, reward shaping, action scaling, frame stacking, and auto-reset.
Collectors
The bridge between policies and environments. Single-process, async, multiprocess, and distributed collectors own the execution loop, batch trajectories, manage devices, and can synchronize policy weights across workers while environments keep running.
Replay Buffers & Data
Modular replay buffers with pluggable storage (in-memory, memmap), samplers (uniform, prioritized, without-replacement), writers, and transforms. Supports HER, offline datasets, CUDA-aware prioritized sampling, and large-scale distributed data movement.
Modules & Policies
Actors, critics, actor-critic operators, recurrent modules, distribution wrappers, and exploration modules, all as ordinary nn.Module instances with explicit TensorDict input/output key contracts. Includes ProbabilisticActor, ValueOperator, MLP, TanhNormal, and more.
Objectives & Returns
Loss modules for PPO, SAC, DQN, TD3, REDQ, IQL, CQL, DDPG, REINFORCE, Dreamer, Decision Transformer, GAIL, behavior cloning, MAPPO, IPPO, QMIX/VDN, and GRPO. Value estimators include GAE, TD(λ), V-trace, and MultiAgentGAE.
Multi-Agent & LLM
Multi-agent data lives under nested TensorDict keys. Dedicated objectives (MAPPO, IPPO, QMIX, VDN) and value normalization utilities. LLM post-training support via GRPO objectives, Hugging Face/vLLM/SGLang integrations, and async collectors.
Key Features
- TensorDict-first pipelines: every component reads and writes named tensor fields; no silent shape assumptions or hidden data conventions
- Broad environment coverage: PyTorch-native environments plus wrappers for Gymnasium, DM Control, Brax, Jumanji, PettingZoo, VMAS, OpenSpiel, Safety-Gymnasium, Isaac Lab, MuJoCo, and more
- Composable transforms: first-class transform modules that run on-device, participate in specs, and can be inserted or removed without wrapping the entire environment
- Flexible collectors: Collector, MultiSyncCollector, MultiAsyncCollector, and distributed variants for everything from smoke tests to GPU-heavy simulation farms
- Modular replay buffers: separate storage, sampler, writer, and transform pieces; supports prioritized replay, HER, memmap-backed storage, and offline datasets
- Rich policy modules: ProbabilisticActor, TanhNormal, ValueOperator, recurrent LSTM/GRU wrappers, and exploration modules with explicit in/out key contracts
- Complete objectives library: PPO, SAC, DQN, TD3, REDQ, IQL, CQL, Dreamer/DreamerV3, Decision Transformer, behavior cloning, and multi-agent objectives
- Multi-agent support: MAPPO, IPPO, MultiAgentGAE, value normalization, VMAS and PettingZoo environments, and nested TensorDict conventions for agent groups
- LLM post-training: GRPO and SFT objectives, conversation containers, Hugging Face/vLLM/SGLang integration, async collectors, and weight-update helpers
- SOTA implementations: readable, research-ready implementations of PPO, SAC, DQN, TD3, REDQ, IMPALA, CrossQ, GAIL, Dreamer, Decision Transformer, GRPO, and multi-agent algorithms
- PyTorch-native performance: torch.compile compatibility, CUDA-aware replay, Triton/scan recurrent backends, vectorized return computation, and distributed execution
Where to Start
Choose a starting point based on what you want to accomplish:

| Goal | Where to look |
|---|---|
| Build and run a first training loop | Quickstart — PPO on a Gym environment end to end |
| Understand environment wrappers and transforms | Environment concepts and the TransformedEnv / GymEnv API |
| Scale data collection across processes | Collector, MultiSyncCollector, and distributed collector guides |
| Store and sample large datasets | Replay buffer guide covering prioritized, HER, and offline storage |
| Write a custom stochastic policy | ProbabilisticActor, TanhNormal, and the modules reference |
| Train a multi-agent system | MAPPO/IPPO objectives, MultiAgentGAE, and the VMAS/PettingZoo wrappers |
| Fine-tune a language model with RL | GRPO objective and LLM post-training reference |
| Browse complete algorithm implementations | sota-implementations/ for PPO, SAC, DQN, TD3, Dreamer, and more |
Quickstart
Train a continuous-control agent with PPO in under 10 minutes.
Installation
Install stable, nightly, or CUDA builds and optional environment dependencies.