Effective exploration is crucial in reinforcement learning: a policy that only exploits what it already knows will fail to discover better regions of the state-action space. TorchRL exploration modules are composable add-ons — you build a deterministic or stochastic policy first, then stack an exploration module on top using TensorDictSequential. Each exploration module reads the action from the TensorDict, applies its perturbation, and writes the modified action back in-place, leaving all other keys untouched. All annealed modules share the same step(frames) API so you can drive the annealing schedule from a single loop counter.

The Annealing Pattern

Every exploration module that decays its noise level exposes a step(frames=1) method. You must call it explicitly — typically once per environment step or once per training update — to advance the schedule. TorchRL cannot detect omissions automatically, so a missing step() call will silently keep exploration at its initial level.
# Typical training loop pattern
for i in range(num_steps):
    td = collector.next()
    loss = compute_loss(td)
    optimizer.zero_grad()  # clear gradients left over from the previous update
    loss.backward()
    optimizer.step()

    # Advance the annealing schedule by the number of frames just collected
    exploration_module.step(frames=td.numel())
Forgetting to call exploration_module.step() inside your training loop is a silent bug. No warning or exception will be raised — exploration will simply remain at eps_init / sigma_init for the entire run.
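
To make the schedule concrete, here is a minimal pure-Python sketch of linear annealing driven by a step(frames) call. The LinearSchedule class is a hypothetical stand-in written for illustration, not TorchRL code; it assumes the value drops by a fixed amount per frame and is clamped at the final value.

```python
class LinearSchedule:
    """Hypothetical stand-in for an annealed exploration module."""

    def __init__(self, init: float, end: float, annealing_num_steps: int) -> None:
        self.init = init
        self.end = end
        self.annealing_num_steps = annealing_num_steps
        self.value = init

    def step(self, frames: int = 1) -> None:
        # Decrease by a fixed amount per frame, never going below the final value.
        decrement = frames * (self.init - self.end) / self.annealing_num_steps
        self.value = max(self.end, self.value - decrement)


schedule = LinearSchedule(init=1.0, end=0.1, annealing_num_steps=1000)
history = [schedule.value]
for _ in range(12):  # 12 updates of 100 frames each
    schedule.step(frames=100)
    history.append(schedule.value)

print([round(v, 2) for v in history])
```

If step() is never called, history would contain only the initial value, which is exactly the silent failure described above.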

EGreedyModule

EGreedyModule implements ε-greedy exploration for both discrete and continuous action spaces. On each forward pass, each element of the batch independently draws a uniform random number; when it falls below the current ε, that element’s action is replaced by a uniform random sample from spec. Otherwise the original action from the policy is kept.
spec
TensorSpec
required
The action spec used to draw random replacement actions. If None is passed, the module will raise at call time — useful only for delayed initialization via set_exploration_modules_spec_from_env.
eps_init
float
default:"1.0"
Initial exploration probability. Must be greater than or equal to eps_end.
eps_end
float
default:"0.1"
Final exploration probability after annealing is complete.
annealing_num_steps
int
default:"1000"
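Number of frames over which ε is linearly annealed from eps_init to eps_end. Each step(frames=n) call advances the schedule by n frames, so with the default frames=1 this equals the number of step() calls. Additional calls after this point are no-ops.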
action_key
NestedKey
default:"\"action\""
The key in the TensorDict where the current action is stored and where the (possibly replaced) action will be written back.
action_mask_key
NestedKey
If set, reads a boolean action mask from this key and applies it to the action spec before sampling random replacement actions. Useful for environments with a dynamic set of valid actions.
device
torch.device
Device for the eps and threshold buffers.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, EGreedyModule

# Deterministic policy
spec = Bounded(-1, 1, torch.Size([4]))
actor = Actor(module=torch.nn.Linear(4, 4), spec=spec)

# Wrap with ε-greedy
exploration = EGreedyModule(
    spec=spec,
    eps_init=1.0,
    eps_end=0.05,
    annealing_num_steps=10_000,
)
policy = TensorDictSequential(actor, exploration)

td = TensorDict({"observation": torch.zeros(10, 4)}, [10])
td = policy(td)

# After each update step:
exploration.step(frames=10)
print(exploration.eps)  # decreasing from 1.0
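
The per-element replacement rule can also be sketched without TorchRL. The following is an illustrative pure-Python version, assuming a discrete action space with n_actions choices; epsilon_greedy and uniform_action are hypothetical helper names, not part of the library.

```python
import random


def epsilon_greedy(actions, eps, sample_random_action, rng):
    """Replace each action independently with probability eps."""
    out = []
    for action in actions:
        if rng.random() < eps:
            # Exploration branch: discard the policy's action.
            out.append(sample_random_action(rng))
        else:
            # Exploitation branch: keep the policy's action.
            out.append(action)
    return out


rng = random.Random(0)
n_actions = 4
greedy_actions = [2] * 8  # e.g. argmax actions produced by a policy


def uniform_action(r):
    # Uniform sample over the (assumed) discrete action set {0, ..., n_actions - 1}.
    return r.randrange(n_actions)


kept = epsilon_greedy(greedy_actions, 0.0, uniform_action, rng)      # never replaced
explored = epsilon_greedy(greedy_actions, 1.0, uniform_action, rng)  # always replaced
mixed = epsilon_greedy(greedy_actions, 0.5, uniform_action, rng)     # replaced about half the time
print(kept, explored, mixed)
```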

AdditiveGaussianModule

AdditiveGaussianModule adds zero-mean Gaussian noise to continuous actions, with the noise standard deviation annealed over training. After adding the noise, the result is always projected back onto the valid action range via spec.project(). Setting safe=True additionally registers a forward hook that validates all output keys against the spec.
spec
TensorSpec
The action spec used for projection after noise addition. Can be None for delayed initialization (set via set_exploration_modules_spec_from_env or the spec property setter).
sigma_init
float
default:"1.0"
Initial standard deviation of the additive Gaussian noise.
sigma_end
float
default:"0.1"
Final standard deviation after annealing is complete.
annealing_num_steps
int
default:"1000"
Number of frames over which σ is linearly annealed from sigma_init to sigma_end. Each step(frames=n) call advances the schedule by n frames.
mean
float
default:"0.0"
Mean of the Gaussian noise distribution.
std
float
default:"1.0"
Standard deviation of the base Gaussian distribution (before σ scaling).
action_key
NestedKey
default:"\"action\""
Key where the action is read from and written back to.
safe
bool
default:"False"
When True, registers an additional forward hook that validates all output keys against spec after noise is applied. Note that the primary noise addition already calls spec.project() internally regardless of this flag.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, AdditiveGaussianModule

spec = Bounded(-1, 1, torch.Size([2]))
actor = Actor(module=torch.nn.Linear(3, 2), spec=spec)

gauss = AdditiveGaussianModule(
    spec=spec,
    sigma_init=0.5,
    sigma_end=0.01,
    annealing_num_steps=50_000,
    safe=True,
)
policy = TensorDictSequential(actor, gauss)

td = TensorDict({"observation": torch.randn(8, 3)}, [8])
td = policy(td)
gauss.step(frames=8)
AdditiveGaussianModule is the recommended continuous-action exploration strategy for algorithms like DDPG and TD3. For correlated noise (which can help in environments requiring sustained actions), prefer OrnsteinUhlenbeckProcessModule.
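
As a rough illustration of the noise-then-project behaviour, here is a pure-Python sketch. It assumes a box-shaped action range [low, high] and that the perturbation is σ times a sample from a Gaussian with the given mean and std; add_gaussian_noise is a hypothetical helper, and clamping stands in for spec.project().

```python
import random


def add_gaussian_noise(action, sigma, low, high, rng, mean=0.0, std=1.0):
    """Perturb each action component, then clamp it back into [low, high]."""
    noisy = [a + sigma * rng.gauss(mean, std) for a in action]
    return [min(high, max(low, a)) for a in noisy]


rng = random.Random(0)
action = [0.9, -0.2]

unchanged = add_gaussian_noise(action, sigma=0.0, low=-1.0, high=1.0, rng=rng)
perturbed = add_gaussian_noise(action, sigma=0.5, low=-1.0, high=1.0, rng=rng)
very_noisy = add_gaussian_noise(action, sigma=100.0, low=-1.0, high=1.0, rng=rng)
print(unchanged, perturbed, very_noisy)
```

Even with a very large σ the output stays inside the valid range, which is the role projection plays in the module.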

OrnsteinUhlenbeckProcessModule

OrnsteinUhlenbeckProcessModule implements the Ornstein-Uhlenbeck (OU) process from “Continuous Control with Deep Reinforcement Learning”. Unlike plain Gaussian noise, the OU process is auto-correlated in time: each step’s noise depends on the previous step, producing smooth, structured exploration trajectories that are useful for physical control tasks requiring sustained directional actions. The noise update equation is:

$$\text{noise}_t = \text{noise}_{t-1} + \theta \cdot (\mu - \text{noise}_{t-1}) \cdot dt + \sigma_t \cdot \sqrt{dt} \cdot W$$

The module stores _ou_prev_noise and _ou_steps as keys in the TensorDict so that state persists across rollout steps. These are zeroed automatically at episode reset when using TorchRL collectors.
spec
TensorSpec
required
The action spec for projecting the noisy action back onto the valid space.
eps_init
float
default:"1.0"
Initial noise scaling factor ε.
eps_end
float
default:"0.1"
Final noise scaling factor after annealing.
annealing_num_steps
int
default:"1000"
Number of frames over which ε is annealed from eps_init to eps_end. Each step(frames=n) call advances the schedule by n frames.
theta
float
default:"0.15"
Mean-reversion speed of the OU process (θ in the equation above).
mu
float
default:"0.0"
Long-run mean of the OU process (μ).
sigma
float
default:"0.2"
Diffusion coefficient of the noise (σ).
dt
float
default:"0.01"
Time step size (dt in the equation above).
x0
Tensor | ndarray | None
Initial noise value. Defaults to zero if None.
sigma_min
float
Minimum sigma value in the sigma annealing equation. When provided, sigma is clamped to this floor after each annealing step. Defaults to None (no floor).
n_steps_annealing
int
default:"1000"
Number of steps over which sigma is annealed toward sigma_min. Distinct from annealing_num_steps, which controls ε annealing.
action_key
NestedKey
default:"\"action\""
TensorDict key for the action to perturb.
is_init_key
NestedKey
default:"\"is_init\""
Key indicating episode resets; the OU state is zeroed when this flag is True.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, OrnsteinUhlenbeckProcessModule

spec = Bounded(-1, 1, torch.Size([4]))
actor = Actor(module=torch.nn.Linear(4, 4), spec=spec)

ou = OrnsteinUhlenbeckProcessModule(
    spec=spec,
    theta=0.15,
    mu=0.0,
    sigma=0.2,
    dt=0.01,
    eps_init=1.0,
    eps_end=0.1,
    annealing_num_steps=10_000,
)
policy = TensorDictSequential(actor, ou)

td = TensorDict({"observation": torch.zeros(10, 4)}, [10])
td = policy(td)
ou.step(frames=10)
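
To show how the OU update produces temporally correlated noise, here is a small pure-Python simulation of the update equation from this section. It assumes W is a standard normal sample and keeps σ constant; ou_step is a hypothetical helper written for illustration, not the module's implementation.

```python
import math
import random


def ou_step(prev_noise, theta, mu, sigma, dt, rng):
    """One OU update: mean reversion toward mu plus scaled Gaussian diffusion."""
    w = rng.gauss(0.0, 1.0)
    return prev_noise + theta * (mu - prev_noise) * dt + sigma * math.sqrt(dt) * w


rng = random.Random(0)

# With sigma = 0 the update is deterministic: the noise decays toward mu.
decayed = ou_step(1.0, theta=0.15, mu=0.0, sigma=0.0, dt=0.01, rng=rng)

# With sigma > 0 each value starts from the previous one, so the path is smooth.
noise = 0.0
path = []
for _ in range(5):
    noise = ou_step(noise, theta=0.15, mu=0.0, sigma=0.2, dt=0.01, rng=rng)
    path.append(noise)

print(decayed, path)
```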

NoisyLinear and NoisyLazyLinear

NoisyLinear (from “Noisy Networks for Exploration”) replaces a standard linear layer with a parametric-noise version: the weight matrix is W = μ + σ ⊙ ε, where μ and σ are learned parameters and ε is a random Gaussian perturbation. The parameters σ are updated by gradient descent, automatically discovering the right noise scale for each layer. Use NoisyLinear as a drop-in replacement for nn.Linear in any architecture by passing layer_class=NoisyLinear to MLP.
in_features
int
required
Input feature dimension.
out_features
int
required
Output feature dimension.
std_init
float
default:"0.5"
Initial value for all entries in σ. Lower values start with less noise.
use_exploration_type
bool | None
default:"True"
When True (default), noise is applied only when the global exploration type is ExplorationType.RANDOM. When False, noise is applied during model.train() mode (legacy behavior).
NoisyLazyLinear is the lazy variant — in_features is inferred from the first forward pass. Use it when the input size is unknown at construction time.
from torchrl.modules import MLP, NoisyLinear, reset_noise

# Noisy MLP using NoisyLinear layers throughout
noisy_mlp = MLP(
    in_features=8,
    out_features=4,
    depth=2,
    num_cells=64,
    layer_class=NoisyLinear,
)

# Resample the noise of every NoisyLinear submodule.
# reset_noise acts on a single layer, so apply it recursively.
noisy_mlp.apply(reset_noise)
Call model.apply(reset_noise) before each forward pass during collection to draw fresh noise samples. The noise stays fixed across forward calls until the next reset and is resampled each time reset_noise runs.
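
The weight formula W = μ + σ ⊙ ε can be illustrated with a toy pure-Python layer. This is a simplified sketch under stated assumptions: TinyNoisyLinear is a hypothetical class, it has no bias, and it samples independent noise for every weight, so it should not be read as the NoisyLinear implementation.

```python
import random


class TinyNoisyLinear:
    """Hypothetical toy layer illustrating W = mu + sigma * eps (no bias)."""

    def __init__(self, in_features, out_features, std_init, rng):
        self.rng = rng
        # mu and sigma would be learned parameters in a real layer.
        self.mu = [[rng.uniform(-0.1, 0.1) for _ in range(in_features)]
                   for _ in range(out_features)]
        self.sigma = [[std_init] * in_features for _ in range(out_features)]
        self.eps = None
        self.reset_noise()

    def reset_noise(self):
        # Draw a fresh Gaussian perturbation for every weight.
        self.eps = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.mu]

    def forward(self, x):
        # Effective weight is mu + sigma * eps, applied as a matrix-vector product.
        return [
            sum((m + s * e) * xi for m, s, e, xi in zip(mu_row, sig_row, eps_row, x))
            for mu_row, sig_row, eps_row in zip(self.mu, self.sigma, self.eps)
        ]


layer = TinyNoisyLinear(3, 2, std_init=0.5, rng=random.Random(0))
x = [1.0, 2.0, 3.0]
first = layer.forward(x)
second = layer.forward(x)   # same noise, same output
layer.reset_noise()
third = layer.forward(x)    # fresh noise, different output
print(first, second, third)
```

The output only changes after reset_noise is called, which is why the noise should be reset before each forward pass during collection.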

RandomPolicy

RandomPolicy is the simplest possible policy: it ignores observations entirely and samples uniformly from the action spec. It is useful for initial random data collection and for baselines.
from torchrl.modules import RandomPolicy
from torchrl.data import Bounded
import torch
from tensordict import TensorDict

action_spec = Bounded(-torch.ones(3), torch.ones(3))
policy = RandomPolicy(action_spec=action_spec)

td = TensorDict({}, [])
td = policy(td)
print(td["action"])  # uniform sample in [-1, 1]^3
RandomPolicy supports lazy initialization: if action_spec=None, the spec can be set later by a data collector calling set_action_spec_from_env(env), or via set_exploration_modules_spec_from_env(policy, env).

Lazy Spec Initialization

When writing environment-agnostic training scripts, you may not know the action spec at construction time. Pass spec=None to EGreedyModule, AdditiveGaussianModule, or RandomPolicy and call set_exploration_modules_spec_from_env after the environment is available:
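from tensordict.nn import TensorDictSequential
from torchrl.envs import GymEnv
from torchrl.modules import AdditiveGaussianModule, set_exploration_modules_spec_from_env

env = GymEnv("HalfCheetah-v4")
gauss = AdditiveGaussianModule(spec=None)  # spec deferred until the env is known

# `actor` is assumed to be a policy module built elsewhere for this environment
policy = TensorDictSequential(actor, gauss)
set_exploration_modules_spec_from_env(policy, env)  # fills in the missing spec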

Choosing an Exploration Strategy

For discrete (finite) action spaces, use EGreedyModule. It replaces the greedy action with a uniform random sample, which suits value-based algorithms such as DQN.