Model-based reinforcement learning trains an explicit dynamics model of the environment, a world model, that the agent can query to imagine future outcomes without spending real environment steps. TorchRL provides first-class support for model-based RL through a dedicated ModelBasedEnvBase hierarchy, RSSM (Recurrent State-Space Model) building blocks, and three separate loss modules for Dreamer and DreamerV3. All components follow the standard TensorDict data model so they plug directly into TorchRL collectors, replay buffers, and value estimators.
What is Model-Based RL?
In model-free RL the agent only observes real environment transitions. Every gradient step requires fresh rollouts, which makes sample efficiency a bottleneck. Model-based approaches instead maintain a learned transition model p(s_{t+1} | s_t, a_t) and a reward model r(s_t, a_t). Once a world model is trained, the agent can generate arbitrarily long imagined rollouts inside it, providing dense, cheap training signal with far fewer real environment interactions.
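To make the idea concrete, here is a framework-free sketch of an imagined rollout. The `transition`, `reward`, and `policy` functions are hypothetical hand-written stand-ins for learned models, not TorchRL APIs; the only point is that the loop never queries a real environment.

```python
# Toy stand-ins for learned models (hypothetical, for illustration only).
def transition(state, action):
    """Pretend-learned dynamics: next state is a damped state plus the action."""
    return 0.9 * state + action


def reward(state, action):
    """Pretend-learned reward: stay near the origin using small actions."""
    return -(state ** 2) - 0.1 * (action ** 2)


def policy(state):
    """A simple proportional controller used as the acting policy."""
    return -0.5 * state


def imagine(start_state, horizon):
    """Roll the policy forward inside the model; no real environment steps."""
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        trajectory.append((state, action, reward(state, action)))
        state = transition(state, action)
    return trajectory


rollout = imagine(start_state=1.0, horizon=5)
imagined_return = sum(r for _, _, r in rollout)
```

Because every quantity comes from the model, the horizon can be lengthened at no environment cost, although errors in a real learned model would compound with each step.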
Dreamer (Hafner et al. 2019) and DreamerV3 (Hafner et al. 2023) are the canonical deep model-based algorithms. Both learn a compact latent representation through a Recurrent State-Space Model and train the actor exclusively inside the imagination. TorchRL implements the full family of loss modules for both versions.
World Models and the RSSM
The RSSM splits the latent state into two complementary parts:
- Deterministic belief h_t: a GRU hidden state that accumulates history without noise.
- Stochastic state s_t: a small Gaussian random variable that captures the irreducible uncertainty about the current moment.

TorchRL provides RSSMPrior and RSSMPosterior as standalone nn.Module components, and RSSMRollout as a TensorDictModuleBase that chains them across time.
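The interplay between the belief, the prior, and the posterior can be sketched with scalars and the standard library. This is an illustrative toy under loud assumptions: a `tanh` recurrence stands in for the GRU, all weights are hand-picked constants, and none of the function names or signatures below are those of RSSMPrior, RSSMPosterior, or RSSMRollout.

```python
import math
import random


def deterministic_update(belief, state, action):
    """Stand-in for the GRU: fold the previous stochastic state and action into h_t."""
    return math.tanh(0.8 * belief + 0.5 * state + 0.3 * action)


def prior(belief):
    """p(s_t | h_t): predict the stochastic state from the belief alone."""
    return 0.7 * belief, 0.5  # (mean, std)


def posterior(belief, obs_embedding):
    """q(s_t | h_t, o_t): refine the prediction using the current observation."""
    return 0.4 * belief + 0.6 * obs_embedding, 0.2  # (mean, std)


def sample(mean, std, rng):
    """Reparameterised Gaussian sample: mean + std * eps."""
    return mean + std * rng.gauss(0.0, 1.0)


def rssm_rollout(actions, obs_embeddings, rng):
    """Chain prior and posterior over time, feeding posterior samples forward."""
    belief, state, outputs = 0.0, 0.0, []
    for action, obs in zip(actions, obs_embeddings):
        belief = deterministic_update(belief, state, action)
        prior_params = prior(belief)
        post_params = posterior(belief, obs)
        state = sample(*post_params, rng)
        outputs.append({"belief": belief, "prior": prior_params,
                        "posterior": post_params, "state": state})
    return outputs


steps = rssm_rollout(actions=[0.1, -0.2, 0.3],
                     obs_embeddings=[1.0, 0.5, -0.5],
                     rng=random.Random(0))
```

During imagination no observation is available, so a rollout would sample from the prior instead of the posterior at each step.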
RSSMRollout supports three execution modes via the use_scan flag and compile_step option:
| Mode | Description |
|---|---|
| use_scan=False (default) | Standard Python loop over time steps |
| use_scan=True | Uses torch._higher_order_ops.scan, which is more torch.compile-friendly |
| compile_step=True | Compiles the individual step function with Inductor |
Dreamer Loss Modules
Dreamer training alternates between two phases: world-model learning from real data and actor/value learning from imagined rollouts. TorchRL maps each phase to a dedicated loss class.
DreamerModelLoss trains the world model by combining reconstruction and reward-prediction terms with a KL divergence between the prior p(s | h, a) and posterior q(s | h, o).

from torchrl.objectives.dreamer import DreamerModelLoss

world_model_loss = DreamerModelLoss(
    world_model=world_model,  # TensorDictModule wrapping encoder + RSSM + decoder
    lambda_kl=1.0,            # weight for KL divergence term
    lambda_reco=1.0,          # weight for reconstruction term
    lambda_reward=1.0,        # weight for reward prediction term
    free_nats=3,              # KL clamping floor (nats)
    reco_loss="l2",
    reward_loss="l2",
    delayed_clamp=False,      # clamp before (False) or after (True) averaging
)
# Forward returns (loss_td, updated_tensordict)
loss_td, _ = world_model_loss(batch)
# loss_td has keys: "loss_model_kl", "loss_model_reco", "loss_model_reward"
total_model_loss = loss_td["loss_model_kl"] + loss_td["loss_model_reco"] + loss_td["loss_model_reward"]
total_model_loss.backward()
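The effect of free_nats and delayed_clamp is easiest to see on numbers. The sketch below uses univariate Gaussians and plain Python; `gaussian_kl` and `kl_with_free_nats` are hypothetical helpers that follow the comments in the snippet above, and the library's exact reduction over latent dimensions may differ.

```python
import math


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between two univariate Gaussians, in nats."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


def kl_with_free_nats(kls, free_nats=3.0, delayed_clamp=False):
    """Average per-sample KLs with a floor at `free_nats`.

    delayed_clamp=False clamps each sample and then averages;
    delayed_clamp=True averages first and clamps the mean.
    """
    if delayed_clamp:
        return max(sum(kls) / len(kls), free_nats)
    return sum(max(kl, free_nats) for kl in kls) / len(kls)


# One small and one large posterior/prior mismatch.
kls = [gaussian_kl(0.0, 1.0, 0.1, 1.0), gaussian_kl(4.0, 1.0, 0.0, 1.0)]
early = kl_with_free_nats(kls, delayed_clamp=False)  # floor applied per sample
late = kl_with_free_nats(kls, delayed_clamp=True)    # floor applied to the mean
```

With early clamping, samples whose KL is already below the floor contribute the constant floor and therefore no gradient; with delayed clamping, they still receive gradient as long as the batch mean stays above the floor.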
DreamerActorLoss imagines imagination_horizon steps, computes lambda-return targets, and maximises them:

from torchrl.objectives.dreamer import DreamerActorLoss, DreamerValueLoss
from torchrl.envs.model_based import DreamerEnv

model_env = DreamerEnv(
    world_model=world_model,
    prior_shape=(30,),
    belief_shape=(200,),
    device="cuda",
)
actor_loss = DreamerActorLoss(
    actor_model=actor,
    value_model=value_net,
    model_based_env=model_env,
    imagination_horizon=15,  # rollout length in latent space
    discount_loss=True,      # discount lambda targets by gamma^t
)
value_loss = DreamerValueLoss(
    value_model=value_net,
)
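The lambda-return targets mentioned above follow the usual TD(lambda) recursion, G_t = r_t + gamma * ((1 - lambda) * V(s_{t+1}) + lambda * G_{t+1}), bootstrapped from the last value estimate. Below is a minimal pure-Python sketch for a single imagined trajectory; `lambda_returns` is a hypothetical helper, and the indexing convention (values[t] is the estimate for the state reached after step t) is an assumption rather than TorchRL's layout.

```python
def lambda_returns(rewards, values, gamma=0.99, lmbda=0.95):
    """Compute TD(lambda) targets for one imagined trajectory.

    rewards[t] is the reward received after step t; values[t] is the critic's
    estimate for the state reached after step t, so both have length H.
    """
    horizon = len(rewards)
    targets = [0.0] * horizon
    next_target = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(horizon)):
        blended = (1.0 - lmbda) * values[t] + lmbda * next_target
        targets[t] = rewards[t] + gamma * blended
        next_target = targets[t]
    return targets


targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lmbda=1.0)
# The actor maximises the targets, i.e. minimises their negative mean.
actor_objective = -sum(targets) / len(targets)
```

Setting lmbda=0 reduces every target to the one-step TD target, while lmbda=1 yields the discounted sum of imagined rewards plus a bootstrap at the horizon.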
DreamerV3 Loss Modules
DreamerV3 introduces KL balancing, symlog-compressed reconstruction, and a two-hot categorical value distribution. TorchRL exposes these via three parallel classes under torchrl.objectives.
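Two of these ingredients are small enough to sketch directly. The snippet below shows the symlog transform with its inverse, and a two-hot encoding over a fixed set of bins; the helper names and the five-bin support are hypothetical choices for illustration and are not taken from the TorchRL classes.

```python
import math


def symlog(x):
    """Compress magnitude while keeping sign: sign(x) * ln(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def two_hot(value, bins):
    """Spread a scalar over its two neighbouring bins so the expectation equals it.

    `bins` must be sorted and contain at least two entries.
    """
    value = min(max(value, bins[0]), bins[-1])  # clip into the supported range
    weights = [0.0] * len(bins)
    for i in range(len(bins) - 1):
        lo, hi = bins[i], bins[i + 1]
        if lo <= value <= hi:
            upper_weight = (value - lo) / (hi - lo)
            weights[i] = 1.0 - upper_weight
            weights[i + 1] = upper_weight
            break
    return weights


bins = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical support in symlog space
target = two_hot(symlog(3.0), bins)  # symlog(3) = ln(4), between bins 1.0 and 2.0
decoded = symexp(sum(w * b for w, b in zip(target, bins)))
```

A critic trained with cross-entropy against `target` predicts a distribution over bins; taking the expectation over bins and applying `symexp` recovers a scalar value in the original reward scale.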
Comparison: Dreamer vs. DreamerV3
| Feature | Dreamer | DreamerV3 |
|---|---|---|
| KL regularisation | Plain KL | KL balancing (free bits) |
| Reconstruction | L2 pixel loss | symlog MSE |
| Value targets | Lambda return | Two-hot categorical CE |
| Actor gradient | Backpropagation through imagined dynamics | REINFORCE + entropy |
| Import path | torchrl.objectives.dreamer | torchrl.objectives.dreamer_v3 |
ModelBasedEnvBase and the Imagination API
ModelBasedEnvBase is a drop-in replacement for any EnvBase that executes transitions through the world model instead of a real simulator. All TorchRL collectors and value estimators work unchanged.
WorldModelEnv is a more generic variant that wraps any callable world model.
RSSM Latent Imagination in Practice
A full Dreamer training step looks like this:

from torchrl.collectors import Collector

collector = Collector(
    env=real_env,
    policy=actor,
    frames_per_batch=1000,
)
real_batch = next(iter(collector))
replay_buffer.extend(real_batch)
# Sample from replay buffer
batch = replay_buffer.sample(batch_size=32)
# Encode observations
encoded = obs_encoder(batch)
# Unroll RSSM over the trajectory time dimension
latent_batch = rssm_rollout(encoded)
# Compute world model losses
loss_td, _ = world_model_loss(latent_batch)
wm_optimizer.zero_grad()
(loss_td["loss_model_kl"] + loss_td["loss_model_reco"] + loss_td["loss_model_reward"]).backward()
wm_optimizer.step()
# Start from posterior states obtained during world-model training
posterior_states = latent_batch.select("state", "belief")
# DreamerActorLoss internally rolls out imagination_horizon steps
actor_loss_td, fake_data = actor_loss(posterior_states)
value_loss_td, _ = value_loss(fake_data)
actor_optimizer.zero_grad()
actor_loss_td["loss_actor"].backward()
actor_optimizer.step()
value_optimizer.zero_grad()
value_loss_td["loss_value"].backward()
value_optimizer.step()
PILCO: Gaussian Process World Models
For lower-dimensional problems TorchRL also provides a Gaussian Process world model based on PILCO (Deisenroth & Rasmussen, 2011). GPWorldModel fits one independent GP per state dimension and propagates Gaussian beliefs via analytic moment matching, with no neural network required.
ExponentialQuadraticCost computes E_{x~N(m,s)}[c(x)] analytically (Eq. 24-25 in the PILCO paper), enabling gradient-based policy search without stochastic sampling.
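For intuition, the scalar case of this expectation has a short closed form. Assuming a saturating cost c(x) = 1 - exp(-(x - target)^2 / (2 width^2)) and x ~ N(mean, var), the expected cost is 1 - width / sqrt(width^2 + var) * exp(-(mean - target)^2 / (2 (width^2 + var))). The sketch below checks that formula against numerical integration; the function and parameter names are hypothetical and do not reflect the signature of ExponentialQuadraticCost.

```python
import math


def expected_saturating_cost(mean, var, target=0.0, width=1.0):
    """E[1 - exp(-(x - target)^2 / (2 width^2))] for x ~ N(mean, var), in closed form."""
    total = width ** 2 + var
    scale = width / math.sqrt(total)
    return 1.0 - scale * math.exp(-((mean - target) ** 2) / (2.0 * total))


def numerical_expected_cost(mean, var, target=0.0, width=1.0, n=20001):
    """Check the closed form by trapezoidal integration over 10 standard deviations each side."""
    std = math.sqrt(var)
    lo, hi = mean - 10.0 * std, mean + 10.0 * std
    step = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * step
        density = math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
        cost = 1.0 - math.exp(-((x - target) ** 2) / (2.0 * width ** 2))
        weight = 0.5 if i in (0, n - 1) else 1.0  # trapezoid end-point weights
        total += weight * density * cost * step
    return total


analytic = expected_saturating_cost(mean=0.5, var=0.25)
numeric = numerical_expected_cost(mean=0.5, var=0.25)
```

Because the closed form is a smooth function of the belief's mean and variance, its gradient with respect to policy parameters can be taken directly, which is what makes sampling-free policy search possible.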
GPWorldModel requires gpytorch and botorch as optional dependencies. Install them with pip install gpytorch botorch. The PILCO path scales well to continuous control tasks with low state dimensionality (≤ 20) but becomes computationally expensive for high-dimensional observations.