This guide walks you through training a continuous-control agent using Proximal Policy Optimization (PPO) on the InvertedDoublePendulum-v4 Gymnasium environment. By the end you will have a working training loop that uses TorchRL’s GymEnv wrapper, a stochastic ProbabilisticActor with a TanhNormal distribution, a Collector for batching rollouts, and GAE + ClipPPOLoss for computing advantages and gradients, all connected through the shared TensorDict data model.
Install TorchRL
Install TorchRL and the Gymnasium continuous-control extras. TorchRL requires Python 3.10+ and PyTorch 2.1+.
Verify the install by importing the library:
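A minimal sketch of such a check, assuming the packages are importable under the names torch, torchrl, and gymnasium. The helper check_import is a hypothetical name introduced here; it reports each package's version when one is exposed and reports the import error otherwise.

```python
import importlib


def check_import(name):
    """Try to import a package and report its version, if it exposes one."""
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        return f"{name}: NOT importable ({exc})"
    version = getattr(module, "__version__", "unknown version")
    return f"{name}: {version}"


if __name__ == "__main__":
    # Package names are assumptions; adjust them to match your environment.
    for package in ("torch", "torchrl", "gymnasium"):
        print(check_import(package))
```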
If you plan to run on CUDA and want hardware-accelerated prioritized replay buffers, see the Installation guide for the optional CUDA wheel.
Create the Environment
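As a rough picture of what observation normalization does in this section, the sketch below estimates a per-dimension mean and standard deviation from a few sample observations and then rescales new observations with them. It is a library-independent illustration, not TorchRL's transform: the helper names fit_normalizer and normalize and the sample values are assumptions made for the example.

```python
import statistics


def fit_normalizer(samples):
    """Estimate a per-dimension mean (loc) and standard deviation (scale)."""
    columns = list(zip(*samples))
    loc = [statistics.fmean(col) for col in columns]
    # Fall back to a scale of 1.0 for constant dimensions to avoid dividing by zero.
    scale = [statistics.pstdev(col) or 1.0 for col in columns]
    return loc, scale


def normalize(observation, loc, scale):
    """Shift and rescale one observation using the fitted statistics."""
    return [(x - m) / s for x, m, s in zip(observation, loc, scale)]


# Made-up 2-dimensional observations standing in for a few random environment steps.
samples = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
loc, scale = fit_normalizer(samples)
print(normalize([2.0, 10.0], loc, scale))
```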
TorchRL wraps external simulators with a uniform TensorDict-based API. GymEnv creates a Gymnasium environment and handles device placement. TransformedEnv adds a composable transform stack: here we normalize observations, convert float64 outputs to float32, track cumulative episode rewards, and count steps.
check_env_specs runs a short rollout and compares its output shapes and dtypes against the declared specs. If it returns without error, your environment is correctly configured.
Build the Actor and Critic
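Two ideas in this section are easier to follow with a concrete picture: squashing a Gaussian sample through tanh so that actions stay in (-1, 1), and switching between stochastic and deterministic action selection. The sketch below shows both for a single scalar action in plain Python. It is not TorchRL's TanhNormal or ExplorationType implementation; sample_tanh_normal and its deterministic flag are hypothetical, and using tanh of the Gaussian mean as the deterministic action is an assumption.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_tanh_normal(loc, scale, deterministic=False, rng=random):
    """Return (action, log_prob) for a scalar tanh-squashed Gaussian."""
    # Assumption: deterministic mode uses the Gaussian mean before squashing.
    u = loc if deterministic else rng.gauss(loc, scale)
    action = math.tanh(u)
    # Log-density of the pre-squash value u under Normal(loc, scale).
    log_prob_u = (
        -0.5 * ((u - loc) / scale) ** 2
        - math.log(scale)
        - 0.5 * math.log(2.0 * math.pi)
    )
    # Change of variables: log(1 - tanh(u)^2), written in a stable form.
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return action, log_prob_u - log_det


rng = random.Random(0)
print(sample_tanh_normal(0.3, 0.5, rng=rng))                 # stochastic action
print(sample_tanh_normal(0.3, 0.5, deterministic=True))      # deterministic action
```

The log-probability returned alongside the action is the quantity a PPO loss later compares against the log-probability under the updated policy.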
PPO uses a stochastic policy that outputs a distribution over actions. We build the actor in three stages: a neural network that maps observations to distribution parameters, a TensorDictModule wrapper that declares explicit input/output keys, and a ProbabilisticActor that constructs and samples from a TanhNormal distribution.
The critic is a simpler ValueOperator that maps observations to a scalar state-value estimate.
Both modules stay in eval() mode throughout training. TorchRL’s ExplorationType context managers control stochastic vs. deterministic action selection independently from PyTorch’s train/eval module state.
Set Up the Collector
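The iterator pattern used in this section can be sketched in plain Python as a generator that steps a policy in an environment and yields fixed-size batches. ToyEnv, collect, and the dictionary layout are hypothetical stand-ins rather than TorchRL classes; the real Collector yields TensorDict instances instead of lists.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment whose episodes end after max_steps steps."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return float(self.t), 1.0, done  # observation, reward, done


def collect(env, policy, frames_per_batch, total_frames):
    """Yield lists of transitions, each exactly frames_per_batch long."""
    obs = env.reset()
    batch = []
    for _ in range(total_frames):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        batch.append(
            {"observation": obs, "action": action, "reward": reward, "done": done}
        )
        # Reset at episode boundaries so collection continues across episodes.
        obs = env.reset() if done else next_obs
        if len(batch) == frames_per_batch:
            yield batch
            batch = []


batches = list(
    collect(ToyEnv(), lambda obs: random.uniform(-1, 1), frames_per_batch=4, total_frames=12)
)
print(len(batches), [len(b) for b in batches])
```

Because the generator owns the reset logic, the consuming loop never has to track episode boundaries itself.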
A Collector owns the environment execution loop. It steps the policy in the environment, accumulates data into batches, and yields TensorDict instances with shape [frames_per_batch]. You iterate over it like a standard Python iterator.
The Collector returns a new TensorDict on each iteration. The batch contains observations, actions, log-probabilities, rewards, done flags, and any other keys written by the environment or policy transforms. No manual loop management is required; call collector.shutdown() when training ends.
Configure GAE and ClipPPOLoss
TorchRL’s GAE module computes Generalized Advantage Estimates in-place on the collected TensorDict. ClipPPOLoss reads the resulting "advantage" and "value_target" keys alongside the log-probabilities the policy recorded during collection.
The loss returns a TensorDict of named scalar losses. We combine three of them:

| Key | Meaning |
|---|---|
| loss_objective | Clipped policy-gradient loss |
| loss_critic | Value network regression loss |
| loss_entropy | Entropy bonus (negated) |
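To make the quantities behind these keys concrete, here is a plain-Python sketch of the textbook GAE recursion and of the clipped surrogate objective for a single sample. It illustrates the standard formulas only; the function names, the argument layout, and the default hyperparameters are choices made for this example, not the library's API or its defaults.

```python
import math


def gae(rewards, values, next_values, dones, gamma=0.99, lmbda=0.95):
    """Return (advantages, value_targets) for one trajectory, computed backwards in time."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors, cut at episode ends.
        running = delta + gamma * lmbda * not_done * running
        advantages[t] = running
    # The value target is the advantage added back onto the value estimate.
    value_targets = [a + v for a, v in zip(advantages, values)]
    return advantages, value_targets


def clipped_objective(log_prob_new, log_prob_old, advantage, clip_epsilon=0.2):
    """Per-sample clipped surrogate loss (a quantity to minimize)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = min(max(ratio, 1.0 - clip_epsilon), 1.0 + clip_epsilon)
    # Take the more pessimistic of the clipped and unclipped terms, then negate.
    return -min(ratio * advantage, clipped * advantage)


advantages, targets = gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],
    dones=[False, False, True],
)
print(advantages, targets)
print(clipped_objective(log_prob_new=-0.9, log_prob_old=-1.0, advantage=advantages[0]))
```

The clipping removes any incentive to push the probability ratio outside the interval [1 - clip_epsilon, 1 + clip_epsilon] when doing so would improve the objective, which is what keeps each policy update close to the policy that collected the data.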
Run the Training Loop
With all pieces assembled, the training loop is a straightforward iteration over the collector. For each collected batch we compute advantages, fill the replay buffer, and run multiple epochs of mini-batch PPO updates.
After 1M environment steps the agent should reliably balance the inverted double pendulum and reach the maximum episode length of 1000 steps.
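The epoch and mini-batch structure described above can be sketched without any library code. The helper iterate_minibatches is hypothetical, and the batch size of 1000, mini-batch size of 64, and 10 epochs are illustrative values rather than settings taken from this guide; in a real loop the comment inside the for body would be replaced by the loss computation and optimizer step.

```python
import random


def iterate_minibatches(batch, sub_batch_size, num_epochs, rng):
    """Yield (epoch, minibatch) pairs, reshuffling the batch once per epoch."""
    for epoch in range(num_epochs):
        indices = list(range(len(batch)))
        rng.shuffle(indices)
        for start in range(0, len(indices), sub_batch_size):
            chunk = indices[start:start + sub_batch_size]
            yield epoch, [batch[i] for i in chunk]


rng = random.Random(0)
batch = list(range(1000))  # placeholder for one collected batch of 1000 frames
updates = 0
for epoch, minibatch in iterate_minibatches(batch, sub_batch_size=64, num_epochs=10, rng=rng):
    # A real loop would compute the loss on this minibatch, backpropagate,
    # and step the optimizer here.
    updates += 1
print(updates)
```

Reusing each collected batch for several epochs is what the clipped objective is designed to make safe: the data is slightly off-policy after the first update, and the clipping limits how far later updates can move.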
Next Steps
Environments & Transforms
Explore the full environment transform library: image pre-processing, reward scaling, action masking, frame stacking, and more.
Collectors
Learn how to parallelize data collection with MultiSyncCollector and MultiAsyncCollector for faster wall-clock training.
Replay Buffers
Deep-dive into prioritized replay, memmap-backed storage, HER, and offline dataset loading.
SOTA Implementations
Browse complete, research-ready implementations of SAC, DQN, TD3, Dreamer, Decision Transformer, MAPPO, and GRPO in sota-implementations/.