Value-based and actor-critic objectives optimize a policy by estimating the value of state-action pairs. TorchRL provides SACLoss, TD3Loss, DQNLoss, DistributionalDQNLoss, DDPGLoss, REDQLoss, and CrossQLoss, all built on the LossModule base. These objectives are off-policy — they can train from data collected by earlier versions of the policy — and typically maintain one or more frozen target networks that are updated slowly via SoftUpdate or HardUpdate.

SACLoss

Soft Actor-Critic (Haarnoja et al. 2018) maximizes a maximum-entropy objective, encouraging the policy to be both high-reward and high-entropy. It maintains an ensemble of Q-networks, a stochastic actor, and an adaptive temperature parameter alpha that balances exploitation against entropy.
from torchrl.objectives import SACLoss

loss_module = SACLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,       # optional (SAC v1 only)
    num_qvalue_nets=2,
    loss_function="smooth_l1",
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
)

Constructor Parameters

actor_network
ProbabilisticTensorDictSequential
required
Stochastic actor. Must output a sampled action and its log-probability. The key defaults to "sample_log_prob" or "action_log_prob" depending on the composite_lp_aggregate setting.
qvalue_network
TensorDictModule | list[TensorDictModule]
required
Q(s, a) parametric model(s). Typically outputs "state_action_value". A single module is duplicated num_qvalue_nets times; a list stacks parameters.
value_network
TensorDictModule | None
default:"None"
V(s) parametric model (SAC version 1). If omitted, the module uses version 2 where only Q-networks are required and there is no separate value network.
num_qvalue_nets
int
default:"2"
Number of Q-networks in the ensemble. The minimum Q-value across the ensemble is used to compute targets, reducing overestimation bias.
loss_function
str
default:"\"smooth_l1\""
Loss for the Q-value regression. One of "l1", "l2", "smooth_l1".
alpha_init
float
default:"1.0"
Initial entropy temperature. When fixed_alpha=False this is the starting value before automatic tuning begins.
min_alpha
float | None
default:"None"
Lower bound on the tuned alpha. None means no lower bound.
max_alpha
float | None
default:"None"
Upper bound on the tuned alpha. None means no upper bound.
fixed_alpha
bool
default:"False"
If True, alpha is frozen at alpha_init and not optimised.
target_entropy
float | str
default:"\"auto\""
Entropy target for automatic temperature tuning. "auto" computes −prod(n_actions) from the action spec.
delay_actor
bool
default:"False"
Whether to create a separate target actor network.
delay_qvalue
bool
default:"True"
Whether to create separate target Q-value networks.
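Two of the parameters above reduce to short formulas. The sketch below uses plain PyTorch rather than TorchRL internals to show the "auto" target-entropy heuristic and the minimum over a Q-ensemble. The function names and tensor shapes are illustrative assumptions, not part of the TorchRL API.

```python
import math
import torch

def auto_target_entropy(action_shape):
    """Heuristic target entropy: minus the number of action dimensions."""
    return -float(math.prod(action_shape))

def min_over_ensemble(q_values):
    """q_values has shape [num_qvalue_nets, batch]; returns the per-sample minimum."""
    return q_values.min(dim=0).values

target_entropy = auto_target_entropy((4,))      # 4-dimensional action
q_values = torch.tensor([[1.0, 5.0, 3.0],       # predictions of Q-network 0
                         [2.0, 4.0, 0.5]])      # predictions of Q-network 1
pessimistic_q = min_over_ensemble(q_values)
```

Taking the minimum gives a deliberately pessimistic bootstrap value, which is how the ensemble counteracts overestimation.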

Output Keys

| Key | Description |
| --- | --- |
| loss_actor | SAC actor loss (maximizes Q + entropy) |
| loss_qvalue | Q-function regression loss |
| loss_alpha | Temperature loss (minimises alpha * (entropy − target_entropy)) |
| loss_value | Value-network loss (SAC v1 only, when value_network is provided) |
| alpha | Current temperature value (for logging) |
| entropy | Current policy entropy (for logging) |
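To make the temperature loss concrete, here is a minimal sketch of one common formulation, optimised in log space. It is an assumption for illustration and may differ in detail from what SACLoss computes internally; the tensors are made-up values.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)   # alpha = exp(0) = 1.0
target_entropy = -4.0                            # e.g. minus the action dimension
log_prob = torch.tensor([-1.0, -2.0, -3.0])      # log pi(a|s) for a batch of sampled actions

# The entropy estimate is -log_prob. The policy is treated as fixed here,
# so only log_alpha receives a gradient.
loss_alpha = -(log_alpha * (log_prob.detach() + target_entropy)).mean()
loss_alpha.backward()
```

In this batch the entropy estimate (2.0) is above the target (-4.0), so the gradient on log_alpha is positive and a gradient-descent step lowers the temperature.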

Complete SAC Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import SACLoss
from torchrl.objectives.utils import SoftUpdate

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Actor
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
module = SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"])
actor = ProbabilisticActor(
    module=module, in_keys=["loc", "scale"], spec=spec,
    distribution_class=TanhNormal
)

# Q-value network
class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(n_obs + n_act, 1)
    def forward(self, obs, act):
        return self.linear(torch.cat([obs, act], -1))

qvalue = ValueOperator(module=QNet(), in_keys=["observation", "action"])

# SAC loss
loss_module = SACLoss(actor_network=actor, qvalue_network=qvalue)

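# Slowly-updated target networks (tau is the Polyak coefficient, equivalent to eps=0.995)
updater = SoftUpdate(loss_module, tau=0.005)

# Separate optimisers per component. Parameters are read from the loss module,
# which holds every copy in the Q-network ensemble.
optim_actor  = torch.optim.Adam(list(loss_module.actor_network_params.flatten_keys().values()), lr=3e-4)
optim_qvalue = torch.optim.Adam(list(loss_module.qvalue_network_params.flatten_keys().values()), lr=3e-4)
optim_alpha  = torch.optim.Adam([loss_module.log_alpha], lr=3e-4)

# Training step; `batch` is a TensorDict sampled from a replay buffer
loss_td = loss_module(batch)
(loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_alpha"]).backward()

for optim in (optim_actor, optim_qvalue, optim_alpha):
    optim.step()
    optim.zero_grad()
updater.step()   # soft-update target networks
SoftUpdate(loss_module, tau=0.005) keeps the target networks slowly tracking the online networks. In SoftUpdate, eps is the weight kept on the old target parameters and tau = 1 - eps, so a small tau (or an eps close to 1) gives a slow update. Call updater.step() once per gradient update, not once per environment step.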
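For intuition, the soft update amounts to Polyak averaging of parameters. The sketch below is a hand-written plain-PyTorch version under that assumption, not TorchRL's implementation; polyak_update is a hypothetical helper name.

```python
import copy
import torch
from torch import nn

@torch.no_grad()
def polyak_update(online, target, tau):
    """target <- (1 - tau) * target + tau * online, parameter by parameter."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(1.0 - tau).add_(p_online, alpha=tau)

online = nn.Linear(3, 1)
target = copy.deepcopy(online)

# Pretend a gradient step moved every online weight by +1.
with torch.no_grad():
    for p in online.parameters():
        p.add_(1.0)

before = [p.clone() for p in target.parameters()]
polyak_update(online, target, tau=0.005)
```

With tau=0.005 each target weight moves only 0.5% of the way toward its online counterpart per call, which is why the update is tied to gradient steps rather than environment steps.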

TD3Loss

Twin Delayed Deep Deterministic Policy Gradient (Fujimoto et al. 2018) addresses overestimation in deterministic actor-critic methods with two Q-networks and delayed actor updates. Target actions are perturbed with clipped Gaussian noise to smooth the Q-function.
from torchrl.objectives import TD3Loss

loss_module = TD3Loss(
    actor_network=actor,
    qvalue_network=qvalue,
    action_spec=spec,
    num_qvalue_nets=2,
    policy_noise=0.2,
    noise_clip=0.5,
    loss_function="smooth_l1",
)

Constructor Parameters

actor_network
TensorDictModule
required
Deterministic policy network mapping observations to actions.
qvalue_network
TensorDictModule | list[TensorDictModule]
required
Q(s, a) network(s). Outputs "state_action_value". A single network is replicated num_qvalue_nets times.
action_spec
TensorSpec
required
Action space spec (exclusive with bounds). Required to clip noisy target actions.
num_qvalue_nets
int
default:"2"
Size of the Q-network ensemble.
policy_noise
float
default:"0.2"
Standard deviation of the Gaussian noise added to target policy actions.
noise_clip
float
default:"0.5"
Maximum absolute value of the target policy action noise (clips the sampled noise before adding it).
loss_function
str
default:"\"smooth_l1\""
Loss for Q-function regression.

Output Keys

| Key | Description |
| --- | --- |
| loss_actor | Deterministic policy loss (maximizes min_i Q_i(s, π(s))) |
| loss_qvalue | Q-function regression loss |
| pred_value | Mean predicted Q-value (for logging) |
| target_value | Mean target Q-value (for logging) |
| state_action_value_actor | Q-value evaluated at the current actor output |
| next_state_value | Bootstrap value from the target network |
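The roles of policy_noise and noise_clip can be illustrated with a short plain-PyTorch sketch of target policy smoothing. The helper name and the bounds are assumptions made for this example and are not TorchRL code.

```python
import torch

def smoothed_target_action(target_action, low, high, policy_noise=0.2, noise_clip=0.5):
    """Add clipped Gaussian noise to the target action, then clip to the action bounds."""
    noise = (torch.randn_like(target_action) * policy_noise).clamp(-noise_clip, noise_clip)
    return torch.maximum(torch.minimum(target_action + noise, high), low)

torch.manual_seed(0)
low, high = -torch.ones(4), torch.ones(4)      # bounds that would come from action_spec
target_action = torch.full((8, 4), 0.9)        # output of the target actor for a batch of 8
noisy = smoothed_target_action(target_action, low, high)
```

The first clip bounds the perturbation itself; the second keeps the perturbed action inside the valid action space, which is why the loss needs the action spec.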

DDPGLoss

Deep Deterministic Policy Gradient (Lillicrap et al. 2015) is the deterministic actor-critic predecessor to TD3. It uses a single Q-network without noise smoothing.
from torchrl.objectives import DDPGLoss

loss_module = DDPGLoss(
    actor_network=actor,
    value_network=value,
    loss_function="smooth_l1",
    delay_actor=False,
    delay_value=True,
)

Constructor Parameters

actor_network
TensorDictModule
required
Deterministic policy operator.
value_network
TensorDictModule
required
Q(s, a) critic. Reads observations and actions, writes "state_action_value".
loss_function
str
default:"\"smooth_l1\""
Loss for the Q-function residual.
delay_actor
bool
default:"False"
Whether to create a target actor network.
delay_value
bool
default:"True"
Whether to create a target value (Q) network.

Output Keys

| Key | Description |
| --- | --- |
| loss_actor | DDPG actor loss |
| loss_value | Q-function regression loss |
| pred_value | Predicted Q-value (for logging) |
| target_value | TD target (for logging) |
| pred_value_max | Maximum predicted Q-value over batch |
| target_value_max | Maximum TD target over batch |
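As a rough picture of what the two losses contain, the sketch below computes a DDPG-style TD target and actor loss in plain PyTorch, with a frozen copy of the critic standing in for the delay_value=True target network. Network sizes, the discount factor, and the random data are assumptions for illustration only.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
n_obs, n_act, batch = 3, 2, 5
actor = nn.Sequential(nn.Linear(n_obs, n_act), nn.Tanh())
critic = nn.Linear(n_obs + n_act, 1)
target_critic = copy.deepcopy(critic)            # frozen copy used for bootstrapping

def q(net, obs, act):
    return net(torch.cat([obs, act], dim=-1)).squeeze(-1)

obs, next_obs = torch.randn(batch, n_obs), torch.randn(batch, n_obs)
action = torch.rand(batch, n_act) * 2 - 1
reward, done, gamma = torch.ones(batch), torch.zeros(batch), 0.99

# The TD target bootstraps from the target critic and carries no gradient.
with torch.no_grad():
    td_target = reward + gamma * (1.0 - done) * q(target_critic, next_obs, actor(next_obs))

loss_value = nn.functional.smooth_l1_loss(q(critic, obs, action), td_target)
loss_actor = -q(critic, obs, actor(obs)).mean()  # ascend Q(s, pi(s))
```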

DQNLoss

Deep Q-Network (Mnih et al. 2015) is the standard Q-learning algorithm for discrete action spaces. It regresses Q-values against a bootstrapped target computed from a frozen target network.
from torchrl.objectives import DQNLoss

loss_module = DQNLoss(
    value_network=actor,
    loss_function="l2",
    delay_value=True,
    double_dqn=False,
    action_space=spec,
)

Constructor Parameters

value_network
QValueActor | nn.Module
required
Q-value network. Outputs "action_value" — a vector of Q-values, one per discrete action. TorchRL wraps plain nn.Modules in a QValueActor automatically.
loss_function
str
default:"\"l2\""
Loss function for the Bellman residual. One of "l1", "l2", "smooth_l1".
delay_value
bool
default:"True"
If True, creates a target Q-network for computing stable bootstrap targets.
double_dqn
bool
default:"False"
Enable Double DQN (Van Hasselt et al. 2015): use the online network to select the next action but the target network to evaluate it. Requires delay_value=True.
action_space
str | TensorSpec
default:"None"
Discrete action space specification. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec instance.

Output Keys

| Key | Description |
| --- | --- |
| loss | Bellman residual loss |
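The difference between the standard and Double DQN targets is easiest to see on a tiny example. The following plain-PyTorch sketch uses a hypothetical dqn_target helper and hand-picked Q-values; it illustrates the idea rather than reproducing DQNLoss.

```python
import torch

def dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99, double_dqn=False):
    """Bootstrapped target for a batch; the next_q_* tensors have shape [batch, n_actions]."""
    if double_dqn:
        # Select the action with the online network, evaluate it with the target network.
        next_action = next_q_online.argmax(dim=-1, keepdim=True)
        next_value = next_q_target.gather(-1, next_action).squeeze(-1)
    else:
        next_value = next_q_target.max(dim=-1).values
    return reward + gamma * (1.0 - done) * next_value

reward = torch.tensor([1.0, 0.0])
done = torch.tensor([0.0, 1.0])
next_q_online = torch.tensor([[1.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
next_q_target = torch.tensor([[5.0, 1.0, 0.0], [1.0, 2.0, 3.0]])

vanilla = dqn_target(reward, done, next_q_online, next_q_target)
double = dqn_target(reward, done, next_q_online, next_q_target, double_dqn=True)
```

For the first transition the standard target bootstraps from the target network's largest value (5.0), while Double DQN uses the target network's estimate of the online network's preferred action (1.0), which is the mechanism that curbs overestimation.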

DQN Example

from torchrl.modules import MLP
from torchrl.data import OneHot
from torchrl.modules.tensordict_module.actors import QValueActor
from torchrl.objectives import DQNLoss

n_obs, n_act = 4, 3
spec = OneHot(n_act)

value_net = MLP(in_features=n_obs, out_features=n_act)
actor = QValueActor(value_net, in_keys=["observation"], action_space=spec)
loss = DQNLoss(actor, action_space=spec, delay_value=True, double_dqn=True)

DistributionalDQNLoss

Distributional DQN (Bellemare et al. 2017) models the full distribution of returns rather than the expected value. The network outputs a probability distribution over n_atoms support values between Vmin and Vmax, and the loss minimises the cross-entropy between the projected target distribution and the predicted distribution.
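import torch
from torchrl.objectives import DistributionalDQNLoss
from torchrl.modules import DistributionalQValueActor, MLP

# n_obs, n_act and spec are defined as in the DQN example; the support values are illustrative
n_atoms, Vmin, Vmax = 51, -10.0, 10.0

actor = DistributionalQValueActor(
    # the last two output dimensions are (atoms, actions)
    module=MLP(in_features=n_obs, out_features=(n_atoms, n_act)),
    action_space=spec,
    support=torch.linspace(Vmin, Vmax, n_atoms),
)
loss = DistributionalDQNLoss(actor, gamma=0.99)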
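DistributionalDQNLoss does not expose a loss_function parameter: the loss is always the cross-entropy between the projected target distribution and the predicted distribution. Because the target is held fixed, this differs from the KL divergence only by a constant and yields the same gradients.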
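Cross-entropy and KL divergence are interchangeable as training objectives when the target distribution is fixed. The sketch below checks this numerically in plain PyTorch with random distributions over 51 atoms (an assumed, illustrative atom count); the categorical projection step is not shown.

```python
import torch

torch.manual_seed(0)
n_atoms = 51
target_probs = torch.softmax(torch.randn(n_atoms), dim=-1)   # projected target, held fixed
logits = torch.randn(n_atoms, requires_grad=True)            # network output for one action
log_pred = torch.log_softmax(logits, dim=-1)

cross_entropy = -(target_probs * log_pred).sum()
entropy_of_target = -(target_probs * target_probs.log()).sum()
kl = (target_probs * (target_probs.log() - log_pred)).sum()

# The two objectives differ by a term that does not depend on the network.
grad_ce, = torch.autograd.grad(cross_entropy, logits, retain_graph=True)
grad_kl, = torch.autograd.grad(kl, logits)
```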

REDQLoss

Randomized Ensemble Double Q-Learning (Chen et al. 2021) uses a large ensemble of Q-networks and randomizes the choice of two networks per update step. This provides strong overestimation control and improved sample efficiency, at the cost of higher memory usage.
from torchrl.objectives import REDQLoss

loss_module = REDQLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    num_qvalue_nets=10,
    sub_sample_len=2,       # number of networks sampled per update
    target_entropy="auto",
)
REDQLoss accepts the same alpha_init, fixed_alpha, and target_entropy keyword arguments as SACLoss. The key difference is the large ensemble (num_qvalue_nets=10 is typical) combined with random sub-sampling at each update.
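The random sub-sampling step can be sketched in a few lines of plain PyTorch. The helper name, the subset_size argument, and the toy Q-values below are assumptions for illustration, not REDQLoss internals.

```python
import torch

def subsampled_min_q(q_values, subset_size=2, generator=None):
    """q_values has shape [num_qvalue_nets, batch]; take the min over a random subset of networks."""
    num_nets = q_values.shape[0]
    idx = torch.randperm(num_nets, generator=generator)[:subset_size]
    return q_values[idx].min(dim=0).values, idx

generator = torch.Generator().manual_seed(0)
q_values = torch.arange(30.0).reshape(10, 3)     # 10 networks, batch of 3; rows increase with index
next_value, idx = subsampled_min_q(q_values, subset_size=2, generator=generator)
```

Because a fresh subset is drawn on every call, each update bootstraps from a different pair of networks while all ensemble members are still trained.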

CrossQLoss

Cross Q-Learning (Bhatt et al. 2019) avoids the double-sampling issue in SAC by estimating Q-values for both the current and next state in a single forward pass, sharing a batch normalization layer between them. This eliminates the need for target networks.
from torchrl.objectives import CrossQLoss

loss_module = CrossQLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    num_qvalue_nets=2,
    target_entropy="auto",
)
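The joint forward pass can be pictured as follows. This is a simplified plain-PyTorch sketch with an assumed toy critic; it shows only how current and next state-action pairs share batch-normalization statistics, not the full CrossQLoss computation.

```python
import torch
from torch import nn

torch.manual_seed(0)
n_obs, n_act, batch = 3, 2, 16
critic = nn.Sequential(
    nn.BatchNorm1d(n_obs + n_act),
    nn.Linear(n_obs + n_act, 32), nn.ReLU(),
    nn.Linear(32, 1),
)

obs, action = torch.randn(batch, n_obs), torch.randn(batch, n_act)
next_obs, next_action = torch.randn(batch, n_obs), torch.randn(batch, n_act)

critic.train()   # batch statistics are computed over the concatenated batch
joint_input = torch.cat(
    [torch.cat([obs, action], dim=-1), torch.cat([next_obs, next_action], dim=-1)], dim=0
)
q_all = critic(joint_input).squeeze(-1)
q_current, q_next = q_all.chunk(2, dim=0)
q_next = q_next.detach()   # the bootstrap value is treated as a constant
```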

Target Network Updaters

All off-policy objectives that use target networks expose delay_* parameters. After creating the loss, attach a SoftUpdate or HardUpdate updater:
from torchrl.objectives.utils import SoftUpdate, HardUpdate

# Polyak averaging (tau controls how fast the target tracks the online network; tau = 1 - eps)
soft_updater = SoftUpdate(loss_module, tau=0.005)
soft_updater.step()   # call once per gradient update

# Hard update (copies weights every N steps)
hard_updater = HardUpdate(loss_module, value_network_update_interval=1000)
hard_updater.step()   # increments internal counter; copies when interval reached
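The counter-based behaviour of a hard update can be sketched with a small hand-written class. HardCopy below is a hypothetical stand-in written in plain PyTorch to illustrate the idea; it is not the TorchRL HardUpdate class.

```python
import copy
import torch
from torch import nn

class HardCopy:
    """Copy the online weights into the target every `interval` calls to step()."""

    def __init__(self, online, target, interval):
        self.online, self.target, self.interval = online, target, interval
        self.counter = 0

    def step(self):
        self.counter += 1
        if self.counter % self.interval == 0:
            self.target.load_state_dict(self.online.state_dict())
            return True
        return False

online = nn.Linear(2, 1)
target = copy.deepcopy(online)
updater = HardCopy(online, target, interval=3)

with torch.no_grad():
    online.weight.add_(1.0)                      # online network drifts away from the target

copied = [updater.step() for _ in range(3)]      # only the third call copies
```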

Comparing Value-Based Objectives

SACLoss: best for continuous action spaces that require strong exploration. Automatic entropy tuning (fixed_alpha=False) is generally stable across environments.
loss = SACLoss(actor, qvalue, num_qvalue_nets=2)
