TorchRL’s actor and critic modules are thin wrappers around ordinary nn.Module objects that give them a TensorDict interface: each module declares which keys it reads from and writes to a TensorDict, making policy construction composable and data-flow explicit. Rather than wiring raw tensor arguments through function calls, you describe your policy as a sequence of named transformations on a shared dictionary. This page covers all the building blocks, from the base TensorDictModule through stochastic actors, value operators, and combined actor-critic architectures.
TensorDictModule and SafeModule
Every TorchRL module is rooted in TensorDictModule from the tensordict library. SafeModule (torchrl.modules.SafeModule) extends it with an optional output-validation step: when safe=True, the module calls spec.project() on any out-of-bounds sample before writing it to the TensorDict.
SafeModule is a subclass of tensordict.nn.TensorDictModule. You can use either class; SafeModule simply adds the spec and safe keyword arguments.
Actor
Actor is a convenience subclass of SafeModule for deterministic policies. It sets default keys in_keys=["observation"] and out_keys=["action"] and, if a non-Composite spec is given for the action, wraps it automatically as Composite(action=spec).
ProbabilisticActor
ProbabilisticActor is the stochastic-policy workhorse in TorchRL. It wraps a TensorDictModule backbone together with a distribution class, sampling an action from the distribution and (optionally) writing the log-probability back to the TensorDict.
The constructor accepts two distinct module patterns:
- Split backbone: A TensorDictModule that writes distribution parameters to dedicated keys (e.g. "loc" and "scale"), followed by sampling inside ProbabilisticActor.
- Composite distribution: A single TensorDictModule writing to a nested "params" key, paired with CompositeDistribution.
The backbone module that computes distribution parameters and writes them to the TensorDict.
Keys to read from the TensorDict as distribution inputs. Must match constructor keyword names of distribution_class (e.g. "loc", "scale" for Normal). When given as a dict, keys are distribution parameter names and values are the TensorDict keys that supply them.
Keys where sampled values are written. Skips sampling if these keys already exist in the input TensorDict.
Output spec for the first sampled tensor. Non-Composite specs are automatically wrapped as Composite(action=spec).
A torch.distributions.Distribution subclass used for sampling. Common choices: TanhNormal, MaskedCategorical, Normal, CompositeDistribution.
Extra keyword arguments forwarded to distribution_class at construction time. For example {"low": -1.0, "high": 1.0} when using TanhNormal.
When True, writes sample_log_prob (the log-probability of the sampled action) into the output TensorDict.
Fallback interaction mode when the global interaction_type() returns None. Options: DETERMINISTIC, RANDOM, MODE, MEAN, MEDIAN. Collectors override this to RANDOM automatically.
Experimental. When True, writes the distribution parameters to the TensorDict so the original distribution can be reconstructed later (useful for PPO’s KL / ratio computation).
Routes sampling through an explicit RNG instead of the global PyTorch RNG. Pass an int as a shorthand for Generator().manual_seed(int), or a NestedKey to read the generator from the TensorDict on each forward call.
Example: stochastic actor with TanhNormal
ValueOperator
ValueOperator wraps a value-function network with sensible default keys. When "action" is present in in_keys, it defaults to out_keys=["state_action_value"] (for Q-functions); otherwise it defaults to out_keys=["state_value"] (for state-value functions V(s)).
A neural network that computes value estimates.
Keys read from the TensorDict. Include "action" to build a Q-function.
Keys written to the TensorDict. Auto-selected based on whether "action" is in in_keys.
Combined Actor-Critic Architectures
ActorValueOperator
ActorValueOperator composes three sub-modules that share a common observation encoder: a common_operator that produces a hidden state, a policy_operator that turns the hidden state into an action, and a value_operator that turns it into a value estimate. Use get_policy_operator() and get_value_operator() to extract standalone operators for collection and loss computation.
ActorCriticOperator
ActorCriticOperator is like ActorValueOperator but wires the action into the critic, producing Q(s, a) instead of V(s). The critic receives both the hidden state and the action produced by the policy.
ActorCriticWrapper
ActorCriticWrapper bundles an actor and a critic that do not share parameters. It accepts any two TensorDictModule objects and exposes the same get_policy_operator() / get_value_operator() interface.
Discrete Action Policies
QValueActor and QValueModule
QValueActor converts raw action-value logits into a greedy action. It wraps a backbone module (which outputs action values) with a QValueModule that applies argmax and writes the selected action into the TensorDict.
DistributionalQValueActor
DistributionalQValueActor implements distributional RL (C51): the network outputs a distribution over returns for each action. The module applies a softmax over the return atoms and computes the expected Q-values to select the greedy action.
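The computation described above can be illustrated in plain PyTorch, independent of the TorchRL module itself. The tensor layout (batch, atoms, actions), the support range, and the random logits are assumptions made for this sketch and are not taken from the DistributionalQValueActor API.

```python
import torch

batch, n_atoms, n_actions = 5, 51, 3

# Fixed support of return values (the "atoms"), evenly spaced in [-10, 10].
support = torch.linspace(-10.0, 10.0, n_atoms)

# Hypothetical network output: one logit per (atom, action) pair.
logits = torch.randn(batch, n_atoms, n_actions)

# Softmax over the atom dimension gives a return distribution per action.
probs = logits.softmax(dim=-2)

# Expected Q-value per action: sum over atoms of probability * return value.
q_values = (probs * support.view(1, n_atoms, 1)).sum(dim=-2)

# Greedy action: the action with the highest expected return.
greedy_action = q_values.argmax(dim=-1)
```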
Model Builders
MLP
MLP is a flexible multi-layer perceptron builder that inherits from nn.Sequential. It supports lazy input inference (no in_features needed), per-layer normalization, and dropout.
Input feature dimension. If omitted, uses LazyLinear for the first layer.
Output feature dimension. If a torch.Size, the output is reshaped to that shape.
Number of hidden layers. depth=0 produces a single linear layer; depth=N produces N+1 linear layers. Defaults to None, which derives depth from the length of num_cells (or 0 if num_cells is also omitted).
Width of each hidden layer. Can be a list to set per-layer widths; the list length must equal depth.
Activation applied after every hidden layer.
Dropout probability applied after each activation. Omit for no dropout.
Linear layer class. Pass NoisyLinear for noisy networks.
ConvNet
ConvNet is a configurable 2-D convolutional network builder with a terminal SquashDims aggregator that flattens spatial dimensions.
Input channel count. Omit for lazy initialization.
Output channel counts per convolutional layer.
Kernel sizes per layer. Can be rectangular tuples, e.g. (2, 3).
Stride per layer.
Activation after each convolutional layer.
Value Normalization
Value normalization stabilizes critic training by keeping value targets on a fixed scale throughout training.
ValueNorm (abstract base)
ValueNorm defines a common interface with three abstract methods:
- update(value_target): fold a batch of targets into the running statistics.
- normalize(value_target): standardize using current statistics.
- denormalize(normalised_value): invert the normalization.
PopArtValueNorm
Exponentially-weighted moving-average normalizer (van Hasselt et al., AAAI 2019). Uses a debiasing term so early estimates are unbiased. Recommended for multi-task or curriculum settings where the reward scale can drift.
RunningValueNorm
Exact running mean and variance computed with Welford's algorithm, with no exponential decay. Cheaper and more stable than PopArtValueNorm when value targets are stationary. Good default for single-task, fixed-reward-scale runs.
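The two statistics-tracking strategies can be contrasted with a plain-Python sketch. WelfordNormalizer and DebiasedEmaMean are hypothetical helpers written for this illustration; they are not the TorchRL classes and do not reproduce their constructors or internals. They only show the update/normalize/denormalize idea and the role of the debiasing term.

```python
import math


class WelfordNormalizer:
    """Hypothetical helper: exact running mean/variance, no decay."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, value_targets):
        # Welford's one-pass update, one target at a time.
        for v in value_targets:
            self.count += 1
            delta = v - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (v - self.mean)

    @property
    def std(self):
        var = self.m2 / self.count if self.count > 0 else 1.0
        return math.sqrt(var + self.eps)

    def normalize(self, value_target):
        return (value_target - self.mean) / self.std

    def denormalize(self, normalised_value):
        return normalised_value * self.std + self.mean


class DebiasedEmaMean:
    """Hypothetical helper: EMA of the mean with a debiasing term."""

    def __init__(self, beta: float = 0.99):
        self.beta = beta
        self.ema = 0.0
        self.debias = 0.0  # equals 1 - beta**t after t updates

    def update(self, batch_mean):
        self.ema = self.beta * self.ema + (1 - self.beta) * batch_mean
        self.debias = self.beta * self.debias + (1 - self.beta)

    @property
    def mean(self):
        # Dividing by the debiasing term removes the bias toward zero
        # that a freshly initialised EMA would otherwise have.
        return self.ema / self.debias


norm = WelfordNormalizer()
norm.update([1.0, 2.0, 3.0, 4.0])
roundtrip = norm.denormalize(norm.normalize(10.0))

ema = DebiasedEmaMean(beta=0.99)
ema.update(5.0)
```

After a single update the debiased EMA already equals the observed batch mean, whereas the raw EMA would still sit close to its zero initialisation.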