Effective exploration is crucial in reinforcement learning: a policy that only exploits what it already knows will fail to discover better regions of the state-action space. TorchRL exploration modules are composable add-ons: you build a deterministic or stochastic policy first, then stack an exploration module on top using TensorDictSequential. Each exploration module reads the action from the TensorDict, applies its perturbation, and writes the modified action back in-place, leaving all other keys untouched. All annealed modules share the same step(frames) API so you can drive the annealing schedule from a single loop counter.
The Annealing Pattern
Every exploration module that decays its noise level exposes a step(frames=1) method. You must call it explicitly, typically once per environment step or once per training update, to advance the schedule. TorchRL cannot detect omissions automatically, so a missing step() call will silently keep exploration at its initial level.
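The pattern can be illustrated without TorchRL. The sketch below is a plain-Python stand-in, not the library's implementation: `LinearSchedule` and its attributes are hypothetical names, and it assumes a linear decay with `init >= end`.

```python
class LinearSchedule:
    """Linearly anneal a value from `init` to `end` over `num_steps` frames.

    Hypothetical helper for illustration; assumes init >= end.
    """

    def __init__(self, init: float, end: float, num_steps: int) -> None:
        self.init = init
        self.end = end
        self.num_steps = num_steps
        self.value = init

    def step(self, frames: int = 1) -> None:
        # Move the value toward `end` by one linear decrement per frame.
        # The max() clamp makes calls past the end of the schedule no-ops.
        decrement = (self.init - self.end) / self.num_steps
        for _ in range(frames):
            self.value = max(self.end, self.value - decrement)


schedule = LinearSchedule(init=1.0, end=0.1, num_steps=1000)
for _ in range(500):
    # ... collect one environment step using the current noise level ...
    schedule.step()  # omitting this call leaves `value` stuck at `init`
halfway = schedule.value

schedule.step(frames=10_000)  # far past the end of the schedule: clamps at `end`
```

Driving `step()` from the same loop that collects data keeps the noise level tied to the number of frames seen, which is the reason a single loop counter is enough.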
EGreedyModule
EGreedyModule implements ε-greedy exploration for both discrete and continuous action spaces. On each forward pass, each element of the batch independently draws a uniform random number; when it falls below the current ε, that element’s action is replaced by a uniform random sample from spec. Otherwise the original action from the policy is kept.
- The action spec used to draw random replacement actions. If None is passed, the module will raise at call time, which is useful only for delayed initialization via set_exploration_modules_spec_from_env.
- Initial exploration probability. Must be greater than or equal to eps_end.
- Final exploration probability after annealing is complete.
- Number of step() calls over which ε is linearly annealed from eps_init to eps_end. Additional calls after this point are no-ops.
- The key in the TensorDict where the current action is stored and where the (possibly replaced) action will be written back.
- If set, reads a boolean action mask from this key and applies it to the action spec before sampling random replacement actions. Useful for environments with a dynamic set of valid actions.
- Device for the eps and threshold buffers.
Example
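As a library-free illustration of the ε-greedy rule described above, the following sketch applies it to a batch of discrete actions. The function `epsilon_greedy` and its arguments are hypothetical; it assumes integer actions in `range(num_actions)` and does not reproduce EGreedyModule's TensorDict handling or action masking.

```python
import random


def epsilon_greedy(actions, eps, num_actions, rng):
    """Independently replace each action with a uniform random one with probability eps."""
    out = []
    for action in actions:
        if rng.random() < eps:
            out.append(rng.randrange(num_actions))  # uniform random replacement
        else:
            out.append(action)  # keep the policy's action
    return out


rng = random.Random(0)
greedy = [2] * 10_000  # a batch where the policy always picks action 2

kept_all = epsilon_greedy(greedy, eps=0.0, num_actions=4, rng=rng)
replaced_all = epsilon_greedy(greedy, eps=1.0, num_actions=4, rng=rng)
mixed = epsilon_greedy(greedy, eps=0.5, num_actions=4, rng=rng)

# A replacement can land on the original action again, so the fraction of
# visibly changed actions is about eps * (num_actions - 1) / num_actions.
changed_fraction = sum(a != 2 for a in mixed) / len(mixed)
```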
AdditiveGaussianModule
AdditiveGaussianModule adds zero-mean Gaussian noise to continuous actions, with the noise standard deviation annealed over training. After adding the noise, the result is always projected back onto the valid action range via spec.project(). Setting safe=True additionally registers a forward hook that validates all output keys against the spec.
- The action spec used for projection after noise addition. Can be None for delayed initialization (set via set_exploration_modules_spec_from_env or the spec property setter).
- Initial standard deviation of the additive Gaussian noise.
- Final standard deviation after annealing is complete.
- Number of step() calls over which σ is linearly annealed.
- Mean of the Gaussian noise distribution.
- Standard deviation of the base Gaussian distribution (before σ scaling).
- Key where the action is read from and written back to.
- When True, registers an additional forward hook that validates all output keys against spec after noise is applied. Note that the primary noise addition already calls spec.project() internally regardless of this flag.
Example
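The noise-then-project behaviour can be sketched in plain Python. This is an illustration under stated assumptions rather than the module's code: `add_gaussian_noise` is a hypothetical helper, the action space is assumed to be a box `[low, high]` so that projection reduces to clamping, and the noise is assumed to be a base Gaussian sample scaled by the current σ.

```python
import random


def add_gaussian_noise(action, sigma, low, high, rng, mean=0.0, std=1.0):
    """Perturb each action dimension, then clamp the result into [low, high]."""
    # Base Gaussian sample (mean, std) scaled by the annealed sigma.
    noisy = [a + sigma * rng.gauss(mean, std) for a in action]
    # Projection step: clamp every dimension back into the valid range.
    return [min(high, max(low, x)) for x in noisy]


rng = random.Random(0)
action = [0.9, -0.2, 0.0]

# With sigma = 0 the action passes through unchanged.
unchanged = add_gaussian_noise(action, sigma=0.0, low=-1.0, high=1.0, rng=rng)

# With a large sigma the raw noisy values would leave the box; clamping keeps them valid.
explored = add_gaussian_noise(action, sigma=5.0, low=-1.0, high=1.0, rng=rng)
```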
OrnsteinUhlenbeckProcessModule
OrnsteinUhlenbeckProcessModule implements the Ornstein-Uhlenbeck (OU) process from “Continuous Control with Deep Reinforcement Learning”. Unlike plain Gaussian noise, the OU process is auto-correlated in time: each step’s noise depends on the previous step, producing smooth, structured exploration trajectories that are useful for physical control tasks requiring sustained directional actions.
The noise update equation is:
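In its standard Euler-Maruyama form, and using the symbols from the parameter descriptions in this section (θ, μ, σ, dt), the update can be written as below. The noise state symbol x_t is introduced here for illustration, and the second line assumes the noise is scaled by ε before being added to the action; check both against the module's docstring.

```latex
x_{t} = x_{t-1} + \theta \, (\mu - x_{t-1}) \, dt + \sigma \sqrt{dt} \; w_{t},
\qquad w_{t} \sim \mathcal{N}(0, 1)

a_{t}^{\text{explore}} = a_{t} + \varepsilon \, x_{t}
```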
The module stores _ou_prev_noise and _ou_steps as keys in the TensorDict so that state persists across rollout steps. These are zeroed automatically at episode reset when using TorchRL collectors.
- The action spec for projecting the noisy action back onto the valid space.
- Initial noise scaling factor ε.
- Final noise scaling factor after annealing.
- Number of step() calls for ε annealing.
- Mean-reversion speed of the OU process (θ in the equation above).
- Long-run mean of the OU process (μ).
- Diffusion coefficient of the noise (σ).
- Time step size (dt in the equation above).
- Initial noise value. Defaults to zero if None.
- Minimum sigma value in the sigma annealing equation. When provided, sigma is clamped to this floor after each annealing step. Defaults to None (no floor).
- Number of steps over which sigma is annealed toward sigma_min. Distinct from annealing_num_steps, which controls ε annealing.
- TensorDict key for the action to perturb.
- Key indicating episode resets; the OU state is zeroed when this flag is True.
Example
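A minimal, library-free sketch of a scalar OU noise source, assuming the standard discretized update with mean reversion θ, long-run mean μ, diffusion σ, and time step dt. The class `OUNoise` is hypothetical, the numeric values are illustrative, and sigma annealing and TensorDict state keys are left out.

```python
import math
import random


class OUNoise:
    """Scalar Ornstein-Uhlenbeck noise; each sample depends on the previous one."""

    def __init__(self, theta=0.15, mu=0.0, sigma=0.2, dt=0.01, x0=0.0, seed=0):
        self.theta = theta
        self.mu = mu
        self.sigma = sigma
        self.dt = dt
        self.x0 = x0
        self.x = x0
        self.rng = random.Random(seed)

    def reset(self):
        # Episode boundary: return the noise state to its initial value.
        self.x = self.x0

    def sample(self):
        # Mean-reverting drift toward mu plus a scaled Gaussian increment.
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x = self.x + drift + diffusion
        return self.x


noise = OUNoise()
trajectory = [noise.sample() for _ in range(5)]  # consecutive values are correlated
noise.reset()  # state is back at x0, as at an episode reset
```

Because each sample starts from the previous value, the noise drifts smoothly instead of jumping independently at every step, which is the property that distinguishes it from plain Gaussian noise.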
NoisyLinear and NoisyLazyLinear
NoisyLinear (from “Noisy Networks for Exploration”) replaces a standard linear layer with a parametric-noise version: the weight matrix is W = μ + σ ⊙ ε, where μ and σ are learned parameters and ε is a random Gaussian perturbation. The parameters σ are updated by gradient descent, automatically discovering the right noise scale for each layer.
Use NoisyLinear as a drop-in replacement for nn.Linear in any architecture by passing layer_class=NoisyLinear to MLP.
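The weight construction W = μ + σ ⊙ ε can be sketched with nested lists in plain Python. The helpers `noisy_weight` and `linear` are hypothetical; the sketch ignores the bias term and does not reflect how NoisyLinear samples or factorizes its noise.

```python
def noisy_weight(mu, sigma, eps):
    """Element-wise W = mu + sigma * eps for matrices stored as lists of rows."""
    return [
        [m + s * e for m, s, e in zip(mu_row, sigma_row, eps_row)]
        for mu_row, sigma_row, eps_row in zip(mu, sigma, eps)
    ]


def linear(weight, x):
    """Matrix-vector product y = W x (bias omitted for brevity)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


mu = [[1.0, 2.0]]      # learned mean weights (1 output, 2 inputs)
sigma = [[0.5, 0.5]]   # learned noise scales
eps = [[1.0, -1.0]]    # one fixed noise draw, chosen by hand for the example

weight = noisy_weight(mu, sigma, eps)   # [[1.5, 1.5]]
output = linear(weight, [2.0, 4.0])     # [9.0]

# With sigma = 0 the layer reduces to an ordinary linear layer.
plain = noisy_weight(mu, [[0.0, 0.0]], eps)
```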
- Input feature dimension.
- Output feature dimension.
- Initial value for all entries in σ. Lower values start with less noise.
- When True (default), noise is applied only when the global exploration type is ExplorationType.RANDOM. When False, noise is applied during model.train() mode (legacy behavior).
NoisyLazyLinear is the lazy variant: in_features is inferred from the first forward pass. Use it when the input size is unknown at construction time.
Call reset_noise(model) before each forward pass during collection to draw fresh noise samples. The noise is deterministic within a single forward call but freshly sampled each time you call reset_noise.
RandomPolicy
RandomPolicy is the simplest possible policy: it ignores observations entirely and samples uniformly from the action spec. It is useful for initial random data collection and for baselines.
RandomPolicy supports lazy initialization: if action_spec=None, the spec can be set later by a data collector calling set_action_spec_from_env(env), or via set_exploration_modules_spec_from_env(policy, env).
Lazy Spec Initialization
When writing environment-agnostic training scripts, you may not know the action spec at construction time. Pass spec=None to EGreedyModule, AdditiveGaussianModule, or RandomPolicy and call set_exploration_modules_spec_from_env after the environment is available:
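The lazy pattern itself can be illustrated with a plain-Python mock. Everything in this sketch is hypothetical (`LazyRandomPolicy`, `ToyEnv`, `set_spec_from_env`) and stands in for the TorchRL objects; it only shows the order of operations: construct without a spec, fail if called too early, then fill the spec in from the environment.

```python
import random


class LazyRandomPolicy:
    """Uniform random policy whose action count may be supplied after construction."""

    def __init__(self, num_actions=None, seed=0):
        self.num_actions = num_actions  # None means "not known yet"
        self.rng = random.Random(seed)

    def __call__(self, observation):
        if self.num_actions is None:
            raise RuntimeError("action spec not set; initialize it from the environment first")
        # The observation is ignored: actions are sampled uniformly.
        return self.rng.randrange(self.num_actions)


class ToyEnv:
    """Stand-in environment exposing the size of its discrete action space."""

    num_actions = 3


def set_spec_from_env(policy, env):
    # Copy the action-space description from the environment into the policy.
    policy.num_actions = env.num_actions


policy = LazyRandomPolicy()  # spec unknown at construction time
try:
    policy(observation=None)
    raised_before_init = False
except RuntimeError:
    raised_before_init = True

set_spec_from_env(policy, ToyEnv())  # environment is now available
action = policy(observation=None)
```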
Choosing an Exploration Strategy
- Discrete Actions: Use EGreedyModule. It replaces the greedy action with a uniform random sample, which is appropriate for finite action spaces (DQN, etc.).
- Implicit / Parameter Noise