Value-based and actor-critic objectives optimize a policy by estimating the value of state-action pairs. TorchRL provides SACLoss, TD3Loss, DQNLoss, DistributionalDQNLoss, DDPGLoss, REDQLoss, and CrossQLoss, all built on the LossModule base. These objectives are off-policy (they can train from data collected by earlier versions of the policy) and typically maintain one or more frozen target networks that are updated via SoftUpdate or HardUpdate.
SACLoss
Soft Actor-Critic (Haarnoja et al. 2018) optimizes a maximum-entropy objective, encouraging the policy to be both high-reward and high-entropy. It maintains an ensemble of Q-networks, a stochastic actor, and an adaptive temperature parameter alpha that balances exploitation against entropy.
Constructor Parameters
- actor_network: Stochastic actor. Must output a sampled action and its log-probability. The log-probability key defaults to "sample_log_prob" or "action_log_prob", depending on the composite_lp_aggregate setting.
- qvalue_network: Q(s, a) parametric model(s). Typically outputs "state_action_value". A single module is duplicated num_qvalue_nets times; a list of modules has its parameters stacked.
- value_network: V(s) parametric model (SAC version 1). If omitted, the module uses version 2, where only Q-networks are required and there is no separate value network.
- num_qvalue_nets: Number of Q-networks in the ensemble. The minimum Q-value across the ensemble is used to compute targets, reducing overestimation bias.
- loss_function: Loss for the Q-value regression. One of "l1", "l2", "smooth_l1".
- alpha_init: Initial entropy temperature. When fixed_alpha=False, this is the starting value before automatic tuning begins.
- min_alpha: Lower bound on the tuned alpha. None means no lower bound.
- max_alpha: Upper bound on the tuned alpha. None means no upper bound.
- fixed_alpha: If True, alpha is frozen at alpha_init and not optimized.
- target_entropy: Entropy target for automatic temperature tuning. "auto" computes −prod(n_actions) from the action spec.
- delay_actor: Whether to create a separate target actor network.
- delay_qvalue: Whether to create separate target Q-value networks.
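The temperature-related parameters can be illustrated with a few lines of plain Python. This is a minimal sketch of the arithmetic only, not TorchRL's implementation: the function names are hypothetical, scalars stand in for tensors, and the loss uses the simple alpha * (entropy − target_entropy) form rather than whatever parameterization the library uses internally.

```python
import math


def auto_target_entropy(action_shape):
    """Target entropy for the "auto" setting: -prod(action_shape)."""
    return -float(math.prod(action_shape))


def clamp_alpha(alpha, min_alpha=None, max_alpha=None):
    """Keep alpha inside the optional bounds; None disables a bound."""
    if min_alpha is not None:
        alpha = max(alpha, min_alpha)
    if max_alpha is not None:
        alpha = min(alpha, max_alpha)
    return alpha


def temperature_loss(alpha, entropy, target_entropy):
    """Scalar temperature loss: alpha * (entropy - target_entropy)."""
    return alpha * (entropy - target_entropy)


# A 6-dimensional continuous action space gives a target entropy of -6.
target = auto_target_entropy((6,))
# Policy entropy above the target yields a positive loss, so gradient
# descent on alpha lowers the temperature (less entropy bonus).
loss = temperature_loss(alpha=0.2, entropy=-4.0, target_entropy=target)
```

When the policy's entropy falls below the target, the same expression turns negative and descent raises alpha instead, which is what makes the tuning self-correcting.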
Output Keys
| Key | Description |
|---|---|
| loss_actor | SAC actor loss (maximizes Q + entropy) |
| loss_qvalue | Q-function regression loss |
| loss_alpha | Temperature loss (minimizes alpha * (entropy − target_entropy)) |
| loss_value | Value-network loss (SAC v1 only, when value_network is provided) |
| alpha | Current temperature value (for logging) |
| entropy | Current policy entropy (for logging) |
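The regression target behind loss_qvalue (SAC version 2, no value network) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: soft_q_target is a hypothetical helper, a single transition with scalar values replaces batched tensors, and gamma is passed explicitly.

```python
def soft_q_target(reward, done, next_q_values, next_log_prob, alpha, gamma=0.99):
    """Soft Bellman target for one transition.

    y = r + gamma * (1 - done) * (min_i Q_i(s', a') - alpha * log pi(a'|s'))

    next_q_values holds the target-network ensemble's estimates for the
    next state and an action a' sampled from the current policy.
    """
    min_next_q = min(next_q_values)                 # pessimistic ensemble estimate
    soft_value = min_next_q - alpha * next_log_prob  # add the entropy bonus
    return reward + gamma * (1.0 - float(done)) * soft_value


# Two Q-networks disagree (10 vs 8); the minimum is used for the target.
y = soft_q_target(reward=1.0, done=False, next_q_values=[10.0, 8.0],
                  next_log_prob=-2.0, alpha=0.5, gamma=0.9)
# Terminal transitions do not bootstrap: the target is just the reward.
y_terminal = soft_q_target(1.0, True, [10.0, 8.0], -2.0, 0.5, gamma=0.9)
```

Taking the minimum over the ensemble is the overestimation control described for num_qvalue_nets; the −alpha * log π term is where the temperature enters the critic update.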
Complete SAC Example
TD3Loss
Twin Delayed Deep Deterministic Policy Gradient (Fujimoto et al. 2018) addresses overestimation in deterministic actor-critic methods with two Q-networks and delayed actor updates. Target actions are perturbed with clipped Gaussian noise to smooth the Q-function.
Constructor Parameters
- actor_network: Deterministic policy network mapping observations to actions.
- qvalue_network: Q(s, a) network(s). Outputs "state_action_value". A single network is replicated num_qvalue_nets times.
- action_spec: Action space spec (mutually exclusive with bounds). Required to clip noisy target actions.
- num_qvalue_nets: Size of the Q-network ensemble.
- policy_noise: Standard deviation of the Gaussian noise added to target policy actions.
- noise_clip: Maximum absolute value of the target policy action noise (the sampled noise is clipped before it is added).
- loss_function: Loss for Q-function regression.
Output Keys
| Key | Description |
|---|---|
| loss_actor | Deterministic policy loss (maximizes min_i Q_i(s, π(s))) |
| loss_qvalue | Q-function regression loss |
| pred_value | Mean predicted Q-value (for logging) |
| target_value | Mean target Q-value (for logging) |
| state_action_value_actor | Q-value evaluated at the current actor output |
| next_state_value | Bootstrap value from the target network |
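Target policy smoothing and the clipped double-Q target can be sketched in plain Python. This is a hedged illustration, not TorchRL code: the helper names are hypothetical, lists of floats stand in for action tensors, and the values 0.2 and 0.5 for policy_noise and noise_clip are illustrative choices.

```python
import random


def clip(x, low, high):
    """Clamp a scalar to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, low, high,
                           policy_noise=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to each action dimension, then clip to bounds."""
    smoothed = []
    for a, lo, hi in zip(target_action, low, high):
        noise = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
        smoothed.append(clip(a + noise, lo, hi))
    return smoothed


def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrap from the smaller of the two target Q-values."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)


rng = random.Random(0)  # seeded only to make the sketch reproducible
action = [0.9, -0.3]
noisy = smoothed_target_action(action, low=[-1.0, -1.0], high=[1.0, 1.0], rng=rng)
y = td3_target(reward=1.0, done=False, q1_next=5.0, q2_next=3.0, gamma=0.5)
```

The second clip is the reason the loss needs either an action spec or explicit bounds: without them the noisy action could leave the valid action range.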
DDPGLoss
Deep Deterministic Policy Gradient (Lillicrap et al. 2015) is the deterministic actor-critic predecessor to TD3. It uses a single Q-network without noise smoothing.
Constructor Parameters
- actor_network: Deterministic policy operator.
- value_network: Q(s, a) critic. Reads observations and actions, writes "state_action_value".
- loss_function: Loss for the Q-function residual.
- delay_actor: Whether to create a target actor network.
- delay_value: Whether to create a target value (Q) network.
Output Keys
| Key | Description |
|---|---|
| loss_actor | DDPG actor loss |
| loss_value | Q-function regression loss |
| pred_value | Predicted Q-value (for logging) |
| target_value | TD target (for logging) |
| pred_value_max | Maximum predicted Q-value over batch |
| target_value_max | Maximum TD target over batch |
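How the quantities in this table relate to one another can be shown with a small plain-Python sketch. It assumes an "l2" loss, a batch of two transitions, and scalar Q-values supplied directly rather than computed by networks; the function names are hypothetical and nothing here calls TorchRL.

```python
def td_target(reward, done, next_q, gamma=0.99):
    """One-step TD target: r + gamma * (1 - done) * Q'(s', pi'(s'))."""
    return reward + gamma * (1.0 - float(done)) * next_q


def l2_value_loss(pred_values, target_values):
    """Mean squared error between predicted Q-values and TD targets."""
    errors = [(p - t) ** 2 for p, t in zip(pred_values, target_values)]
    return sum(errors) / len(errors)


def actor_loss(q_at_actor_actions):
    """Negative mean of Q(s, pi(s)): minimizing it raises the critic's score."""
    return -sum(q_at_actor_actions) / len(q_at_actor_actions)


rewards, dones = [1.0, 0.0], [False, True]
next_q = [2.0, 5.0]   # target critic evaluated at the target actor's action
pred = [2.5, 0.5]     # online critic evaluated at the stored action

targets = [td_target(r, d, q, gamma=0.5) for r, d, q in zip(rewards, dones, next_q)]
loss_value = l2_value_loss(pred, targets)
loss_actor = actor_loss([1.0, 3.0])
pred_value_max, target_value_max = max(pred), max(targets)
```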
DQNLoss
Deep Q-Network (Mnih et al. 2015) is the standard Q-learning algorithm for discrete action spaces. It regresses Q-values against a bootstrapped target computed from a frozen target network.
Constructor Parameters
- value_network: Q-value network. Outputs "action_value", a vector of Q-values with one entry per discrete action. TorchRL wraps plain nn.Modules in a QValueActor automatically.
- loss_function: Loss function for the Bellman residual. One of "l1", "l2", "smooth_l1".
- delay_value: If True, creates a target Q-network for computing stable bootstrap targets.
- double_dqn: Enable Double DQN (Van Hasselt et al. 2015): use the online network to select the next action but the target network to evaluate it. Requires delay_value=True.
- action_space: Discrete action space specification. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec instance.
Output Keys
| Key | Description |
|---|---|
| loss | Bellman residual loss |
DQN Example
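The difference between the standard DQN target and the Double DQN target can be sketched in plain Python. This is a minimal illustration of the arithmetic for a single transition, with hypothetical function names and lists of floats in place of the "action_value" tensors; it does not call DQNLoss.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    return reward + gamma * (1.0 - float(done)) * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = argmax(next_q_online)
    return reward + gamma * (1.0 - float(done)) * next_q_target[best_action]


next_q_online = [1.0, 3.0, 2.0]   # online network prefers action 1
next_q_target = [5.0, 0.5, 4.0]   # target network is optimistic about action 0

y_dqn = dqn_target(1.0, False, next_q_target, gamma=0.9)
y_double = double_dqn_target(1.0, False, next_q_online, next_q_target, gamma=0.9)
```

Because the two networks rarely overestimate the same action, decoupling selection from evaluation tends to give a lower, less biased target, which is why double_dqn needs the target network created by delay_value=True.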
DistributionalDQNLoss
Distributional DQN (Bellemare et al. 2017) models the full distribution of returns rather than the expected value. The network outputs a probability distribution over n_atoms support values between Vmin and Vmax, and the loss minimizes the cross-entropy between the projected target distribution and the predicted distribution.
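The projection step and the resulting loss can be sketched in plain Python. This is a hedged illustration of the categorical projection from Bellemare et al. for a single transition, not the library's implementation: the function names are hypothetical, and lists of floats stand in for the per-atom probability tensors.

```python
import math


def project_distribution(next_probs, reward, done, v_min, v_max, gamma=0.99):
    """Project the shifted return distribution back onto the fixed support."""
    n_atoms = len(next_probs)
    delta_z = (v_max - v_min) / (n_atoms - 1)
    projected = [0.0] * n_atoms
    for j, p in enumerate(next_probs):
        z_j = v_min + j * delta_z
        # Bellman update of the atom, clipped to the support range.
        tz = min(v_max, max(v_min, reward + gamma * (1.0 - float(done)) * z_j))
        # Fractional index on the support, guarded against rounding error.
        b = min(max((tz - v_min) / delta_z, 0.0), n_atoms - 1.0)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            projected[lower] += p
        else:
            # Split the mass between the two neighbouring atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """-sum_i target_i * log(predicted_i), with a floor to avoid log(0)."""
    return -sum(t * math.log(max(q, eps))
                for t, q in zip(target_probs, predicted_probs))


# Support {0, 1, 2, 3, 4}; all next-state mass sits on the atom at 2.
target = project_distribution([0.0, 0.0, 1.0, 0.0, 0.0], reward=0.5, done=False,
                              v_min=0.0, v_max=4.0, gamma=1.0)
loss = cross_entropy(target, [0.2] * 5)
```

In this example the shifted atom lands at 2.5, halfway between two support points, so its mass is split evenly between the atoms at 2 and 3.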
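DistributionalDQNLoss does not expose a loss_function parameter: the loss is always the cross-entropy between the projected target distribution and the predicted distribution, which equals their KL divergence up to a term that does not depend on the network parameters.
REDQLoss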
Randomized Ensemble Double Q-Learning (Chen et al. 2021) uses a large ensemble of Q-networks and randomly selects two of them at each update step. This provides strong overestimation control and improved sample efficiency, at the cost of higher memory usage.
CrossQLoss
Cross Q-Learning (Bhatt et al. 2019) avoids the double-sampling issue in SAC by estimating Q-values for both the current and next state in a single forward pass, sharing a batch normalization layer between them. This eliminates the need for target networks.
Target Network Updaters
All off-policy objectives that use target networks expose delay_* parameters. After creating the loss, attach a SoftUpdate or HardUpdate updater to it so that the target parameters track the online ones.
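The two update rules differ only in how the target parameters follow the online ones. The plain-Python sketch below shows that arithmetic under stated assumptions: soft_update, hard_update, tau, and interval are hypothetical names chosen for illustration, lists of floats stand in for parameter tensors, and the TorchRL SoftUpdate and HardUpdate classes themselves are not called.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


def hard_update(target_params, online_params, step, interval=1000):
    """Copy the online parameters every `interval` steps, else keep the target."""
    if step % interval == 0:
        return list(online_params)
    return list(target_params)


online = [1.0, 1.0]
target = [0.0, 1.0]

# Soft update: the target drifts a small fraction of the way each step.
target_soft = soft_update(target, online, tau=0.1)

# Hard update: nothing changes until the interval is reached.
target_before = hard_update(target, online, step=999, interval=1000)
target_after = hard_update(target, online, step=1000, interval=1000)
```

A small tau gives a smooth, slowly moving target; a hard update keeps the target fully frozen between copies, which is the classic DQN arrangement.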
Comparing Value-Based Objectives
- SACLoss: Best for continuous action spaces requiring strong exploration. Automatic entropy tuning (fixed_alpha=False) is stable across most environments.
- TD3Loss
- DDPGLoss
- DQNLoss