Exploration Modules and Noise Strategies in TorchRL

Effective exploration is crucial in reinforcement learning: a policy that only exploits what it already knows will fail to discover better regions of the state-action space. TorchRL exploration modules are composable add-ons — you build a deterministic or stochastic policy first, then stack an exploration module on top using TensorDictSequential. Each exploration module reads the action from the TensorDict, applies its perturbation, and writes the modified action back in-place, leaving all other keys untouched. All annealed modules share the same step(frames) API so you can drive the annealing schedule from a single loop counter.

The Annealing Pattern

Every exploration module that decays its noise level exposes a step(frames=1) method. You must call it explicitly — typically once per environment step or once per training update — to advance the schedule. TorchRL cannot detect omissions automatically, so a missing step() call will silently keep exploration at its initial level.

# Typical training loop pattern
for td in collector:
    loss = compute_loss(td)  # placeholder for your own loss computation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Advance the annealing schedule by the number of frames just collected
    exploration_module.step(frames=td.numel())

Forgetting to call exploration_module.step() inside your training loop is a silent bug. No warning or exception will be raised — exploration will simply remain at eps_init / sigma_init for the entire run.
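To make the effect of step(frames) concrete, the following is a minimal pure-Python sketch of a linear schedule. It assumes each frame passed to step(frames) advances the schedule by one unit; linear_anneal is a hypothetical helper written for illustration, not a TorchRL function.

```python
def linear_anneal(init: float, end: float, num_steps: int, frames_seen: int) -> float:
    """Value of a linearly annealed quantity after `frames_seen` frames."""
    # Fraction of the schedule completed, capped at 1 so extra frames change nothing.
    fraction = min(frames_seen / num_steps, 1.0)
    return init + fraction * (end - init)


# Assumed settings: eps_init=1.0, eps_end=0.1, annealing_num_steps=1000,
# with batches of 100 frames per training iteration.
frames_seen = 0
for _ in range(12):
    frames_seen += 100  # plays the role of exploration_module.step(frames=100)
    eps = linear_anneal(1.0, 0.1, 1000, frames_seen)
    print(f"frames={frames_seen:5d}  eps={eps:.3f}")
```

If the increment line is removed, frames_seen stays at zero and the printed value never leaves its initial level, which is the silent bug described above.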

EGreedyModule

EGreedyModule implements ε-greedy exploration for both discrete and continuous action spaces. On each forward pass, each element of the batch independently draws a uniform random number; when it falls below the current ε, that element’s action is replaced by a uniform random sample from spec. Otherwise the original action from the policy is kept.
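The per-element replacement rule can be sketched in plain Python. This is an illustration of the idea only, not TorchRL's implementation; egreedy and uniform_over_4 are hypothetical names, and uniform_over_4 stands in for drawing a random action from the spec.

```python
import random


def uniform_over_4(rng: random.Random) -> int:
    """Stand-in for sampling a random action from a 4-action spec."""
    return rng.randrange(4)


def egreedy(actions, eps, sample_random_action, rng):
    """Replace each action by a random one with probability `eps`."""
    out = []
    for action in actions:
        # One independent uniform draw per batch element.
        if rng.random() < eps:
            out.append(sample_random_action(rng))
        else:
            out.append(action)
    return out


rng = random.Random(0)
greedy_actions = [2] * 8  # what the policy proposed for a batch of 8
print(egreedy(greedy_actions, 0.0, uniform_over_4, rng))  # eps=0: policy actions are kept
print(egreedy(greedy_actions, 0.5, uniform_over_4, rng))  # eps=0.5: some entries are resampled
```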

spec

TensorSpec

required

The action spec used to draw random replacement actions. Passing None is useful only for delayed initialization via set_exploration_modules_spec_from_env; if the spec is still unset when the module is called, the module raises.

eps_init

float

default:"1.0"

Initial exploration probability. Must be greater than or equal to eps_end.

eps_end

float

default:"0.1"

Final exploration probability after annealing is complete.

annealing_num_steps

int

default:"1000"

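Number of frames over which ε is linearly annealed from eps_init to eps_end. Each call to step(frames) advances the schedule by frames, so step(frames=10) counts as ten steps. Once eps_end is reached, further calls leave ε unchanged.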

action_key

NestedKey

default:"\"action\""

The key in the TensorDict where the current action is stored and where the (possibly replaced) action will be written back.

action_mask_key

NestedKey

If set, reads a boolean action mask from this key and applies it to the action spec before sampling random replacement actions. Useful for environments with a dynamic set of valid actions.

device

torch.device

Device for the eps and threshold buffers.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, EGreedyModule

# Deterministic policy
spec = Bounded(-1, 1, torch.Size([4]))
actor = Actor(module=torch.nn.Linear(4, 4), spec=spec)

# Wrap with ε-greedy
exploration = EGreedyModule(
    spec=spec,
    eps_init=1.0,
    eps_end=0.05,
    annealing_num_steps=10_000,
)
policy = TensorDictSequential(actor, exploration)

td = TensorDict({"observation": torch.zeros(10, 4)}, [10])
td = policy(td)

# After each update step:
exploration.step(frames=10)
print(exploration.eps)  # decreasing from 1.0

AdditiveGaussianModule

AdditiveGaussianModule adds zero-mean Gaussian noise to continuous actions, with the noise standard deviation annealed over training. After adding the noise, the result is always projected back onto the valid action range via spec.project(). Setting safe=True additionally registers a forward hook that validates all output keys against the spec.
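As a rough illustration of "add noise, then project", here is a pure-Python sketch for a box-shaped action range. It is not the library's code: add_gaussian_noise is a hypothetical helper, and clipping to [low, high] stands in for spec.project() on a bounded spec.

```python
import random


def add_gaussian_noise(action, sigma, low, high, rng, mean=0.0, std=1.0):
    """Perturb each action dimension, then clip it back into [low, high]."""
    noisy = []
    for a in action:
        # Base sample from N(mean, std), scaled by the current (annealed) sigma.
        a = a + sigma * rng.gauss(mean, std)
        # Clipping plays the role of projecting onto the valid action range.
        noisy.append(min(max(a, low), high))
    return noisy


rng = random.Random(0)
action = [0.3, -0.7]
print(add_gaussian_noise(action, 0.5, -1.0, 1.0, rng))  # perturbed, still inside [-1, 1]
print(add_gaussian_noise(action, 0.0, -1.0, 1.0, rng))  # sigma=0 leaves the action unchanged
```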

spec

TensorSpec

The action spec used for projection after noise addition. Can be None for delayed initialization (set via set_exploration_modules_spec_from_env or the spec property setter).

sigma_init

float

default:"1.0"

Initial standard deviation of the additive Gaussian noise.

sigma_end

float

default:"0.1"

Final standard deviation after annealing is complete.

annealing_num_steps

int

default:"1000"

Number of frames, as counted by step(frames), over which σ is linearly annealed from sigma_init to sigma_end.

mean

float

default:"0.0"

Mean of the Gaussian noise distribution.

std

float

default:"1.0"

Standard deviation of the base Gaussian distribution (before σ scaling).

action_key

NestedKey

default:"\"action\""

Key where the action is read from and written back to.

safe

bool

default:"False"

When True, registers an additional forward hook that validates all output keys against spec after noise is applied. Note that the primary noise addition already calls spec.project() internally regardless of this flag.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, AdditiveGaussianModule

spec = Bounded(-1, 1, torch.Size([2]))
actor = Actor(module=torch.nn.Linear(3, 2), spec=spec)

gauss = AdditiveGaussianModule(
    spec=spec,
    sigma_init=0.5,
    sigma_end=0.01,
    annealing_num_steps=50_000,
    safe=True,
)
policy = TensorDictSequential(actor, gauss)

td = TensorDict({"observation": torch.randn(8, 3)}, [8])
td = policy(td)
gauss.step(frames=8)

AdditiveGaussianModule is the recommended continuous-action exploration strategy for algorithms like DDPG and TD3. For correlated noise (which can help in environments requiring sustained actions), prefer OrnsteinUhlenbeckProcessModule.

OrnsteinUhlenbeckProcessModule

OrnsteinUhlenbeckProcessModule implements the Ornstein-Uhlenbeck (OU) process from “Continuous Control with Deep Reinforcement Learning”. Unlike plain Gaussian noise, the OU process is auto-correlated in time: each step’s noise depends on the previous step, producing smooth, structured exploration trajectories that are useful for physical control tasks requiring sustained directional actions. The noise update equation is:

\text{noise}_t = \text{noise}_{t-1} + \theta \cdot (\mu - \text{noise}_{t-1}) \cdot dt + \sigma_t \cdot \sqrt{dt} \cdot W

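The module stores _ou_prev_noise and _ou_steps as keys in the TensorDict so that state persists across rollout steps. The state is zeroed wherever the is_init entry is True, so the input TensorDict should carry that flag; in TorchRL environments it is typically added by the InitTracker transform.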
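The update equation and the reset behaviour can be traced with a small pure-Python sketch for a single action dimension. ou_step is a hypothetical helper for illustration, not part of TorchRL, and it assumes the state resets to zero at the start of an episode.

```python
import math
import random


def ou_step(prev_noise, theta, mu, sigma, dt, rng, is_init=False):
    """One update of the OU noise for a single action dimension."""
    if is_init:
        prev_noise = 0.0  # start of an episode: discard the previous state
    w = rng.gauss(0.0, 1.0)  # standard normal draw W
    drift = theta * (mu - prev_noise) * dt  # pulls the noise back toward mu
    diffusion = sigma * math.sqrt(dt) * w  # fresh random kick
    return prev_noise + drift + diffusion


rng = random.Random(0)
noise = 0.0
for t in range(5):
    # Each value depends on the previous one, so consecutive samples are correlated.
    noise = ou_step(noise, theta=0.15, mu=0.0, sigma=0.2, dt=0.01, rng=rng)
    print(f"t={t}  noise={noise:+.4f}")
```

With sigma set to zero the random term vanishes and the noise simply decays toward mu at a rate set by theta and dt, which is a quick way to sanity-check the drift term.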

spec

TensorSpec

required

The action spec for projecting the noisy action back onto the valid space.

eps_init

float

default:"1.0"

Initial noise scaling factor ε.

eps_end

float

default:"0.1"

Final noise scaling factor after annealing.

annealing_num_steps

int

default:"1000"

Number of frames, as counted by step(frames), over which ε is annealed from eps_init to eps_end.

theta

float

default:"0.15"

Mean-reversion speed of the OU process (θ in the equation above).

mu

float

default:"0.0"

Long-run mean of the OU process (μ).

sigma

float

default:"0.2"

Diffusion coefficient of the noise (σ).

dt

float

default:"0.01"

Time step size (dt in the equation above).

x0

Tensor | ndarray | None

Initial noise value. Defaults to zero if None.

sigma_min

float

Minimum sigma value in the sigma annealing equation. When provided, sigma is clamped to this floor after each annealing step. Defaults to None (no floor).

n_steps_annealing

int

default:"1000"

Number of steps over which sigma is annealed toward sigma_min. Distinct from annealing_num_steps, which controls ε annealing.

action_key

NestedKey

default:"\"action\""

TensorDict key for the action to perturb.

is_init_key

NestedKey

default:"\"is_init\""

Key indicating episode resets; the OU state is zeroed when this flag is True.

Example

import torch
from tensordict import TensorDict
from tensordict.nn import TensorDictSequential
from torchrl.data import Bounded
from torchrl.modules import Actor, OrnsteinUhlenbeckProcessModule

spec = Bounded(-1, 1, torch.Size([4]))
actor = Actor(module=torch.nn.Linear(4, 4), spec=spec)

ou = OrnsteinUhlenbeckProcessModule(
    spec=spec,
    theta=0.15,
    mu=0.0,
    sigma=0.2,
    dt=0.01,
    eps_init=1.0,
    eps_end=0.1,
    annealing_num_steps=10_000,
)
policy = TensorDictSequential(actor, ou)

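# The OU module reads the "is_init" flag to know when to reset its noise state
td = TensorDict(
    {"observation": torch.zeros(10, 4), "is_init": torch.ones(10, 1, dtype=torch.bool)},
    [10],
)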
td = policy(td)
ou.step(frames=10)

NoisyLinear and NoisyLazyLinear

NoisyLinear (from “Noisy Networks for Exploration”) replaces a standard linear layer with a parametric-noise version: the weight matrix is W = μ + σ ⊙ ε, where μ and σ are learned parameters and ε is a random Gaussian perturbation. The parameters σ are updated by gradient descent, automatically discovering the right noise scale for each layer. Use NoisyLinear as a drop-in replacement for nn.Linear in any architecture by passing layer_class=NoisyLinear to MLP.
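The weight formula W = μ + σ ⊙ ε can be illustrated with a small pure-Python sketch. This is not the TorchRL layer: noisy_linear and sample_eps are hypothetical helpers, the bias is omitted, and ε is drawn as independent Gaussian entries for simplicity.

```python
import random


def noisy_linear(x, mu_w, sigma_w, eps_w):
    """Compute (mu + sigma * eps) @ x for one input vector, without bias."""
    out = []
    for mu_row, sigma_row, eps_row in zip(mu_w, sigma_w, eps_w):
        # Element-wise perturbed weights for this output unit.
        row = [m + s * e for m, s, e in zip(mu_row, sigma_row, eps_row)]
        out.append(sum(w * xi for w, xi in zip(row, x)))
    return out


def sample_eps(rows, cols, rng):
    """Fresh Gaussian perturbation, analogous to what a noise reset would draw."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
mu_w = [[1.0, 2.0], [3.0, 4.0]]
sigma_w = [[0.5, 0.5], [0.5, 0.5]]
x = [1.0, 1.0]

eps_w = sample_eps(2, 2, rng)
print(noisy_linear(x, mu_w, sigma_w, eps_w))  # same eps gives the same output
print(noisy_linear(x, mu_w, sigma_w, eps_w))
print(noisy_linear(x, mu_w, sigma_w, sample_eps(2, 2, rng)))  # new eps, new output
```

Setting every entry of sigma_w to zero recovers an ordinary linear layer, which is why a learned σ lets the network reduce its own exploration over time.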

in_features

int

required

Input feature dimension.

out_features

int

required

Output feature dimension.

std_init

float

default:"0.5"

Initial value for all entries in σ. Lower values start with less noise.

use_exploration_type

bool | None

default:"True"

When True (default), noise is applied only when the global exploration type is ExplorationType.RANDOM. When False, noise is applied during model.train() mode (legacy behavior).

NoisyLazyLinear is the lazy variant — in_features is inferred from the first forward pass. Use it when the input size is unknown at construction time.

from torchrl.modules import MLP, NoisyLinear

# Noisy MLP using NoisyLinear layers throughout
noisy_mlp = MLP(
    in_features=8,
    out_features=4,
    depth=2,
    num_cells=64,
    layer_class=NoisyLinear,
)

# Resample the noise (call before each forward pass during collection)
from torchrl.modules import reset_noise
noisy_mlp.apply(reset_noise)  # apply() visits every submodule, including each NoisyLinear

Resample the noise before each forward pass during collection, for example with noisy_mlp.apply(reset_noise). The noise stays fixed between resets, so repeated forward passes use the same perturbation until the next reset draws a fresh one.

RandomPolicy

RandomPolicy is the simplest possible policy: it ignores observations entirely and samples uniformly from the action spec. It is useful for initial random data collection and for baselines.

from torchrl.modules import RandomPolicy
from torchrl.data import Bounded
import torch
from tensordict import TensorDict

action_spec = Bounded(-torch.ones(3), torch.ones(3))
policy = RandomPolicy(action_spec=action_spec)

td = TensorDict({}, [])
td = policy(td)
print(td["action"])  # uniform sample in [-1, 1]^3

RandomPolicy supports lazy initialization: if action_spec=None, the spec can be set later by a data collector calling set_action_spec_from_env(env), or via set_exploration_modules_spec_from_env(policy, env).

Lazy Spec Initialization

When writing environment-agnostic training scripts, you may not know the action spec at construction time. Pass spec=None to EGreedyModule, AdditiveGaussianModule, or RandomPolicy and call set_exploration_modules_spec_from_env after the environment is available:

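from tensordict.nn import TensorDictSequential
from torchrl.envs import GymEnv
from torchrl.modules import AdditiveGaussianModule, set_exploration_modules_spec_from_env

env = GymEnv("HalfCheetah-v4")
gauss = AdditiveGaussianModule(spec=None)  # spec deferred
# `actor` stands for your existing policy module, built elsewhere
policy = TensorDictSequential(actor, gauss)
set_exploration_modules_spec_from_env(policy, env)  # fills in gauss.spec from env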

Choosing an Exploration Strategy

Discrete Actions

Use EGreedyModule. It replaces the greedy action with a uniform random sample, which is appropriate for finite action spaces (DQN, etc.).

Continuous (Uncorrelated)

Use AdditiveGaussianModule. It adds i.i.d. Gaussian noise to each action dimension independently, suitable for most continuous control tasks (TD3, SAC with deterministic eval).

Continuous (Correlated)

Use OrnsteinUhlenbeckProcessModule. The mean-reverting process produces temporally correlated noise that encourages sustained directional exploration, which has historically been useful for locomotion tasks (DDPG).

Implicit / Parameter Noise

Use NoisyLinear layers. The network itself learns when and how much to explore, often outperforming additive noise schemes in deep Q-networks (Rainbow, Ape-X).
