TorchRL: Modular Reinforcement Learning for PyTorch

TorchRL is a PyTorch-native toolkit for reinforcement learning, decision making, robotics, and simulation. Rather than a single algorithm implementation or a narrow benchmark suite, it is a collection of composable pieces for building RL systems while keeping code close to the PyTorch programming model. The library is built around three core design principles: data should have names, structure, batch dimensions, and devices all the way through the training loop; environments, policies, replay buffers, objectives, and collectors should be independent modules that can be swapped without rewriting the rest of the stack; and research code should scale from a local prototype to vectorized, multiprocess, distributed, compiled, recurrent, multi-agent, model-based, or offline workflows without changing the data model.

The TensorDict Data Model

The common thread running through every TorchRL component is TensorDict — a dictionary-like tensor container with full PyTorch operation support, device transfers, shared-memory backing, memmap storage, lazy views, and nn.Module wrappers. RL code traditionally accumulates special cases: tuples from one environment, dicts from another, separate arrays for recurrent states, and loss functions that silently assume a particular batch layout. TorchRL uses TensorDict to make those assumptions explicit. A single TensorDict flows through every component in the pipeline:

TensorDict
  -> policy module writes actions and log-probs
  -> environment reads actions and writes next observations, rewards, done flags
  -> collector batches trajectories from one or many workers
  -> replay buffer stores, samples, prioritizes, and transforms data
  -> loss module reads named keys and writes differentiable losses
  -> optimizer updates ordinary PyTorch parameters

Because every component speaks the same language, they compose without glue code. TensorDict supports standard tensor operations while preserving named fields and nested structure:

# These operations preserve structure and operate on every compatible value.
batch = torch.stack(list_of_tensordicts, dim=0)
batch = batch.reshape(-1)
batch = batch.to("cuda")
mini_batch = batch[:128]

# Nested keys make multi-agent, recurrent, and next-state data explicit.
reward = batch["next", "reward"]
agent_obs = batch["agents", "observation"]
hidden = batch["recurrent_state", "h"]

TensorDict is a separate package (pip install tensordict) and is a required dependency of TorchRL. It is the reason TorchRL components interoperate: a collector emits a TensorDict, a replay buffer stores it without losing structure, a transform adds or removes keys, and a loss reads exactly the keys it needs.

Library Structure

TorchRL is organized around six major subsystems, each independently importable and designed to be swapped without touching the rest of the stack.

Environments & Transforms

Native PyTorch environments, wrappers for Gymnasium, DM Control, Brax, MuJoCo, PettingZoo, VMAS, Isaac Lab, and more. Vectorized containers and a composable transform stack for observation normalization, reward shaping, action scaling, frame stacking, and auto-reset.

Collectors

The bridge between policies and environments. Single-process, async, multiprocess, and distributed collectors own the execution loop, batch trajectories, manage devices, and can synchronize policy weights across workers while environments keep running.
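
To make "owning the execution loop" concrete, here is a plain-Python sketch of what a single-process collector does conceptually: step the environment with the policy, reset on episode end, and hand back fixed-size batches. This is not TorchRL's implementation or API. `toy_reset`, `toy_step`, and `collect` are hypothetical helpers written for this illustration, and the toy environment is an assumption.

```python
def toy_reset():
    """Hypothetical environment reset: the state is a single integer."""
    return 0


def toy_step(obs, action):
    """Hypothetical environment step: the episode ends once the state reaches 5."""
    next_obs = obs + action
    reward = float(action)
    done = next_obs >= 5
    return next_obs, reward, done


def collect(policy, frames_per_batch, total_frames):
    """Yield lists of transitions, each holding `frames_per_batch` frames.

    Frames left over when `total_frames` is not a multiple of
    `frames_per_batch` are dropped in this sketch.
    """
    obs = toy_reset()
    batch, collected = [], 0
    while collected < total_frames:
        action = policy(obs)
        next_obs, reward, done = toy_step(obs, action)
        batch.append(
            {
                "observation": obs,
                "action": action,
                "next": {"observation": next_obs, "reward": reward, "done": done},
            }
        )
        collected += 1
        # The collector, not the training loop, handles episode resets.
        obs = toy_reset() if done else next_obs
        if len(batch) == frames_per_batch:
            yield batch
            batch = []


# A constant policy that always emits action 1.
batches = list(collect(policy=lambda obs: 1, frames_per_batch=4, total_frames=12))
```

The training loop only iterates over `batches`; it never touches reset logic or step bookkeeping, which is the separation the collector abstraction is meant to provide.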

Replay Buffers & Data

Modular replay buffers with pluggable storage (in-memory, memmap), samplers (uniform, prioritized, without-replacement), writers, and transforms. Supports HER, offline datasets, CUDA-aware prioritized sampling, and large-scale distributed data movement.
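
The separation between storage, writer, and sampler can be sketched in plain Python. The class and functions below are hypothetical stand-ins written for this illustration, not TorchRL classes; they only show why swapping the sampler leaves storage and writing logic untouched.

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is repeatable


class SketchReplayBuffer:
    """Toy buffer: storage, writer position, and sampler are separate pieces."""

    def __init__(self, capacity, sampler):
        self.capacity = capacity
        self.storage = []      # stand-in for an in-memory storage backend
        self.priorities = []   # one priority per stored item
        self.sampler = sampler
        self._cursor = 0       # round-robin writer position

    def add(self, item, priority=1.0):
        if len(self.storage) < self.capacity:
            self.storage.append(item)
            self.priorities.append(priority)
        else:
            # Buffer is full: overwrite the oldest slot.
            self.storage[self._cursor] = item
            self.priorities[self._cursor] = priority
        self._cursor = (self._cursor + 1) % self.capacity

    def sample(self, batch_size):
        indices = self.sampler(self.priorities, batch_size)
        return [self.storage[i] for i in indices]


def uniform_sampler(priorities, batch_size):
    """Every stored item is equally likely."""
    return [_rng.randrange(len(priorities)) for _ in range(batch_size)]


def prioritized_sampler(priorities, batch_size):
    """Items are drawn with probability proportional to their priority."""
    return _rng.choices(range(len(priorities)), weights=priorities, k=batch_size)


buffer = SketchReplayBuffer(capacity=3, sampler=prioritized_sampler)
buffer.add("a", priority=0.0)  # zero priority: never drawn by the prioritized sampler
buffer.add("b", priority=1.0)
buffer.add("c", priority=3.0)
prioritized_samples = buffer.sample(8)

# Swapping the sampler does not touch storage or the writer.
buffer.sampler = uniform_sampler
uniform_samples = buffer.sample(8)

# The buffer is full, so the next write replaces the oldest item ("a").
buffer.add("d")
```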

Modules & Policies

Actors, critics, actor-critic operators, recurrent modules, distribution wrappers, and exploration modules — all as ordinary nn.Module instances with explicit TensorDict input/output key contracts. Includes ProbabilisticActor, ValueOperator, MLP, TanhNormal, and more.

Objectives & Returns

Loss modules for PPO, SAC, DQN, TD3, REDQ, IQL, CQL, DDPG, REINFORCE, Dreamer, Decision Transformer, GAIL, behavior cloning, MAPPO, IPPO, QMIX/VDN, and GRPO. Value estimators include GAE, TD(λ), V-trace, and MultiAgentGAE.
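
As a reference point for what a value estimator computes, here is a plain-Python sketch of Generalized Advantage Estimation over one trajectory. It uses the standard recursion delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t) and A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}. The function name and list-based signature are assumptions for illustration; it treats every done flag as a true termination and is not TorchRL's vectorized implementation.

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Compute GAE advantages for a single trajectory, iterating backwards in time."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # No bootstrapping and no carry-over across an episode boundary.
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
next_values = [0.5, 0.5, 0.5]
dones = [False, False, True]

# With gamma = lambda = 1 the advantage is the undiscounted return minus the value.
monte_carlo_like = gae_advantages(rewards, values, next_values, dones, gamma=1.0, lmbda=1.0)

# Smaller gamma and lambda shrink the contribution of later steps.
discounted = gae_advantages(rewards, values, next_values, dones, gamma=0.5, lmbda=0.5)
```

With `gamma = lmbda = 1`, the result is `[2.5, 1.5, 0.5]`: the remaining reward sum (3, 2, 1) minus the value estimate of 0.5 at each step.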

Multi-Agent & LLM

Multi-agent data lives under nested TensorDict keys. Dedicated objectives (MAPPO, IPPO, QMIX, VDN) and value normalization utilities. LLM post-training support via GRPO objectives, Hugging Face/vLLM/SGLang integrations, and async collectors.

Key Features

- TensorDict-first pipelines: every component reads and writes named tensor fields; no silent shape assumptions or hidden data conventions
- Broad environment coverage: PyTorch-native environments plus wrappers for Gymnasium, DM Control, Brax, Jumanji, PettingZoo, VMAS, OpenSpiel, Safety-Gymnasium, Isaac Lab, MuJoCo, and more
- Composable transforms: first-class transform modules that run on-device, participate in specs, and can be inserted or removed without wrapping the entire environment
- Flexible collectors: Collector, MultiSyncCollector, MultiAsyncCollector, and distributed variants for everything from smoke tests to GPU-heavy simulation farms
- Modular replay buffers: separate storage, sampler, writer, and transform pieces; supports prioritized replay, HER, memmap-backed storage, and offline datasets
- Rich policy modules: ProbabilisticActor, TanhNormal, ValueOperator, recurrent LSTM/GRU wrappers, and exploration modules with explicit in/out key contracts
- Complete objectives library: PPO, SAC, DQN, TD3, REDQ, IQL, CQL, Dreamer/DreamerV3, Decision Transformer, behavior cloning, and multi-agent objectives
- Multi-agent support: MAPPO, IPPO, MultiAgentGAE, value normalization, VMAS and PettingZoo environments, and nested TensorDict conventions for agent groups
- LLM post-training: GRPO and SFT objectives, conversation containers, Hugging Face/vLLM/SGLang integration, async collectors, and weight-update helpers
- SOTA implementations: readable, research-ready implementations of PPO, SAC, DQN, TD3, REDQ, IMPALA, CrossQ, GAIL, Dreamer, Decision Transformer, GRPO, and multi-agent algorithms
- PyTorch-native performance: torch.compile compatibility, CUDA-aware replay, Triton/scan recurrent backends, vectorized return computation, and distributed execution

A quick local rollout shows how few lines of code it takes to get started:

import torch
from tensordict.nn import TensorDictModule
from torch import nn

from torchrl.envs import PendulumEnv, StepCounter, TransformedEnv

# A PyTorch-native environment with a transform stack.
env = TransformedEnv(PendulumEnv(), StepCounter(max_steps=200))

# Policies are regular nn.Modules with explicit TensorDict key contracts.
policy = TensorDictModule(
    nn.Sequential(
        nn.LazyLinear(64),
        nn.Tanh(),
        nn.Linear(64, 1),
        nn.Tanh(),
    ),
    in_keys=["observation"],
    out_keys=["action"],
)

rollout = env.rollout(max_steps=32, policy=policy)
assert rollout.batch_size == torch.Size([32])
assert rollout["next", "reward"].shape[:1] == torch.Size([32])

The same keys-and-TensorDict interface is used by batched environments, multi-agent tasks, collectors, replay buffers, recurrent modules, transforms, and losses — nothing here is specific to PendulumEnv.

Where to Start

Choose a starting point based on what you want to accomplish:

| Goal | Where to look |
| --- | --- |
| Build and run a first training loop | Quickstart: PPO on a Gym environment end to end |
| Understand environment wrappers and transforms | Environment concepts and the `TransformedEnv` / `GymEnv` API |
| Scale data collection across processes | `Collector`, `MultiSyncCollector`, and distributed collector guides |
| Store and sample large datasets | Replay buffer guide covering prioritized, HER, and offline storage |
| Write a custom stochastic policy | `ProbabilisticActor`, `TanhNormal`, and the modules reference |
| Train a multi-agent system | MAPPO/IPPO objectives, `MultiAgentGAE`, and the VMAS/PettingZoo wrappers |
| Fine-tune a language model with RL | GRPO objective and LLM post-training reference |
| Browse complete algorithm implementations | `sota-implementations/` for PPO, SAC, DQN, TD3, Dreamer, and more |

Quickstart

Train a continuous-control agent with PPO in under 10 minutes.

Installation

Install stable, nightly, or CUDA builds and optional environment dependencies.
