Value-Based Objectives: SAC, TD3, DQN, and DDPG in TorchRL

Value-based and actor-critic objectives optimize a policy by estimating the value of state-action pairs. TorchRL provides SACLoss, TD3Loss, DQNLoss, DistributionalDQNLoss, DDPGLoss, REDQLoss, and CrossQLoss, all built on the LossModule base. These objectives are off-policy: they can train from data collected by earlier versions of the policy. Most of them maintain one or more target networks, which receive no gradient updates and instead track the online networks through SoftUpdate or HardUpdate.

SACLoss

Soft Actor-Critic (Haarnoja et al. 2018) optimizes a maximum-entropy objective, which favors policies that achieve high reward while staying high-entropy. It maintains an ensemble of Q-networks, a stochastic actor, and an adaptive temperature parameter alpha that trades off reward against entropy.

from torchrl.objectives import SACLoss

loss_module = SACLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    value_network=value,       # optional (SAC v1 only)
    num_qvalue_nets=2,
    loss_function="smooth_l1",
    alpha_init=1.0,
    fixed_alpha=False,
    target_entropy="auto",
)

Constructor Parameters

actor_network

ProbabilisticTensorDictSequential

required

Stochastic actor. Must output a sampled action and its log-probability. The key defaults to "sample_log_prob" or "action_log_prob" depending on the composite_lp_aggregate setting.

qvalue_network

TensorDictModule | list[TensorDictModule]

required

Q(s, a) parametric model(s). Typically outputs "state_action_value". A single module is duplicated num_qvalue_nets times; a list stacks parameters.

value_network

TensorDictModule | None

default:"None"

V(s) parametric model (SAC version 1). If omitted, the module uses version 2 where only Q-networks are required and there is no separate value network.

num_qvalue_nets

int

default:"2"

Number of Q-networks in the ensemble. The minimum Q-value across the ensemble is used to compute targets, reducing overestimation bias.

loss_function

str

default:"smooth_l1"

Loss for the Q-value regression. One of "l1", "l2", "smooth_l1".

alpha_init

float

default:"1.0"

Initial entropy temperature. When fixed_alpha=False this is the starting value before automatic tuning begins.

min_alpha

float | None

default:"None"

Lower bound on the tuned alpha. None means no lower bound.

max_alpha

float | None

default:"None"

Upper bound on the tuned alpha. None means no upper bound.

fixed_alpha

bool

default:"False"

If True, alpha is frozen at alpha_init and not optimised.

target_entropy

float | str

default:"auto"

Entropy target for automatic temperature tuning. "auto" computes −prod(n_actions) from the action spec.

delay_actor

bool

default:"False"

Whether to create a separate target actor network.

delay_qvalue

bool

default:"True"

Whether to create separate target Q-value networks.
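The role of the Q-network ensemble and of alpha in the bootstrap target can be sketched in plain PyTorch. This is a minimal illustration of the entropy-regularised target with a minimum over the ensemble, not TorchRL's internal code; the function name soft_q_target and the tensor layout are assumptions made for the example.

```python
import torch


def soft_q_target(reward, done, next_q_ensemble, next_log_prob, alpha, gamma=0.99):
    """Entropy-regularised TD target (hypothetical helper, for illustration).

    reward, done:     [batch, 1]
    next_q_ensemble:  [num_qvalue_nets, batch, 1], target Q-values at (s', a')
    next_log_prob:    [batch, 1], log pi(a'|s') for a' sampled from the actor
    """
    # Taking the minimum over the ensemble counteracts overestimation bias
    min_next_q = next_q_ensemble.min(dim=0).values
    # Soft state value: Q minus the temperature-weighted log-probability
    next_value = min_next_q - alpha * next_log_prob
    # No bootstrapping past terminal transitions
    return reward + gamma * (1.0 - done.float()) * next_value


reward = torch.tensor([[1.0], [0.5]])
done = torch.tensor([[False], [True]])
next_q = torch.tensor([[[2.0], [3.0]], [[1.0], [4.0]]])  # two critics, batch of 2
next_log_prob = torch.tensor([[-1.0], [-2.0]])
target = soft_q_target(reward, done, next_q, next_log_prob, alpha=0.5, gamma=0.9)
```

In the first transition the smaller of the two critic estimates (1.0) is used; in the second, the terminal flag removes the bootstrap term entirely.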

Output Keys

Key	Description
`loss_actor`	SAC actor loss (maximizes Q + entropy)
`loss_qvalue`	Q-function regression loss
`loss_alpha`	Temperature loss (minimises `alpha * (entropy − target_entropy)`)
`loss_value`	Value-network loss (SAC v1 only, when `value_network` is provided)
`alpha`	Current temperature value (for logging)
`entropy`	Current policy entropy (for logging)
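To make the temperature loss concrete, here is a small sketch of automatic entropy tuning in plain PyTorch. It assumes a log-alpha parameterisation and a Monte Carlo entropy estimate of -log_prob; both are assumptions of this example rather than a statement about how SACLoss is implemented.

```python
import torch

n_act = 4
target_entropy = -float(n_act)  # the "auto" rule: minus the number of action dimensions

# alpha_init = 1.0 corresponds to log_alpha = 0 (assumed parameterisation)
log_alpha = torch.zeros(1, requires_grad=True)

# log pi(a|s) for actions sampled from the current policy (made-up values)
log_prob = torch.tensor([[-2.0], [-6.0]])

# Entropy estimate is -log_prob, so this equals log_alpha * (entropy - target_entropy)
loss_alpha = -(log_alpha * (log_prob.detach() + target_entropy)).mean()
loss_alpha.backward()
```

Here the estimated entropy (4.0) is above the target (-4.0), so the gradient on log_alpha is positive and a gradient-descent step lowers the temperature.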

Complete SAC Example

import torch
from torch import nn
from tensordict import TensorDict
from torchrl.data import Bounded
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.modules.tensordict_module.actors import ProbabilisticActor, ValueOperator
from torchrl.modules.tensordict_module.common import SafeModule
from torchrl.objectives import SACLoss
from torchrl.objectives.utils import SoftUpdate

n_act, n_obs = 4, 3
spec = Bounded(-torch.ones(n_act), torch.ones(n_act), (n_act,))

# Actor
net = nn.Sequential(nn.Linear(n_obs, 2 * n_act), NormalParamExtractor())
module = SafeModule(net, in_keys=["observation"], out_keys=["loc", "scale"])
actor = ProbabilisticActor(
    module=module, in_keys=["loc", "scale"], spec=spec,
    distribution_class=TanhNormal
)

# Q-value network
class QNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(n_obs + n_act, 1)
    def forward(self, obs, act):
        return self.linear(torch.cat([obs, act], -1))

qvalue = ValueOperator(module=QNet(), in_keys=["observation", "action"])

# SAC loss
loss_module = SACLoss(actor_network=actor, qvalue_network=qvalue)

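# Slowly-updated target networks (tau is the Polyak coefficient; same as eps=0.995)
updater = SoftUpdate(loss_module, tau=0.005)

# Separate optimisers per component. Parameters are taken from the loss module
# so that every copy in the Q-network ensemble is optimised.
actor_params = list(loss_module.actor_network_params.flatten_keys().values())
qvalue_params = list(loss_module.qvalue_network_params.flatten_keys().values())
optim_actor  = torch.optim.Adam(actor_params, lr=3e-4)
optim_qvalue = torch.optim.Adam(qvalue_params, lr=3e-4)
optim_alpha  = torch.optim.Adam([loss_module.log_alpha], lr=3e-4)
optimizers = (optim_actor, optim_qvalue, optim_alpha)

# Placeholder batch of random transitions; in practice, sample it from a replay buffer
batch_size = 256
batch = TensorDict(
    {
        "observation": torch.randn(batch_size, n_obs),
        "action": torch.rand(batch_size, n_act) * 2 - 1,
        ("next", "observation"): torch.randn(batch_size, n_obs),
        ("next", "reward"): torch.randn(batch_size, 1),
        ("next", "done"): torch.zeros(batch_size, 1, dtype=torch.bool),
        ("next", "terminated"): torch.zeros(batch_size, 1, dtype=torch.bool),
    },
    batch_size=[batch_size],
)

# Training step
loss_td = loss_module(batch)
total_loss = loss_td["loss_actor"] + loss_td["loss_qvalue"] + loss_td["loss_alpha"]
for optim in optimizers:
    optim.zero_grad()
total_loss.backward()
for optim in optimizers:
    optim.step()
updater.step()   # soft-update target networks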

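Use SoftUpdate(loss_module, tau=0.005) so that the target networks slowly track the online networks. In TorchRL, eps is the weight kept on the old target parameters and tau equals 1 - eps, so tau=0.005 is the same as eps=0.995. Call updater.step() once per gradient update, not per environment step.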

TD3Loss

Twin Delayed Deep Deterministic Policy Gradient (Fujimoto et al. 2018) addresses overestimation in deterministic actor-critic methods with two Q-networks and delayed actor updates. Target actions are perturbed with clipped Gaussian noise to smooth the Q-function.

from torchrl.objectives import TD3Loss

loss_module = TD3Loss(
    actor_network=actor,
    qvalue_network=qvalue,
    action_spec=spec,
    num_qvalue_nets=2,
    policy_noise=0.2,
    noise_clip=0.5,
    loss_function="smooth_l1",
)

Constructor Parameters

actor_network

TensorDictModule

required

Deterministic policy network mapping observations to actions.

qvalue_network

TensorDictModule | list[TensorDictModule]

required

Q(s, a) network(s). Outputs "state_action_value". A single network is replicated num_qvalue_nets times.

action_spec

TensorSpec

required

Action space spec, mutually exclusive with the bounds argument. Required to clip noisy target actions to the valid action range.

num_qvalue_nets

int

default:"2"

Size of the Q-network ensemble.

policy_noise

float

default:"0.2"

Standard deviation of the Gaussian noise added to target policy actions.

noise_clip

float

default:"0.5"

Maximum absolute value of the target policy action noise (clips the sampled noise before adding it).

loss_function

str

default:"smooth_l1"

Loss for Q-function regression.
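The interaction of policy_noise, noise_clip, and the action bounds can be sketched as follows. This is a plain PyTorch illustration of target policy smoothing under the parameter meanings given above; the helper name smoothed_target_action is hypothetical and not part of TorchRL.

```python
import torch


def smoothed_target_action(next_action, low, high, policy_noise=0.2, noise_clip=0.5):
    """Perturb target actions with clipped Gaussian noise, then clip to the bounds."""
    # Sample noise with standard deviation policy_noise and clip it to +/- noise_clip
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
    # Keep the perturbed action inside the action space
    return torch.minimum(torch.maximum(next_action + noise, low), high)


torch.manual_seed(0)
low, high = -torch.ones(2), torch.ones(2)
target_actions = torch.full((1000, 2), 0.9)  # made-up target-actor outputs
noisy_actions = smoothed_target_action(target_actions, low, high)
```

The smoothed actions stay within noise_clip of the original ones and never leave the interval defined by low and high.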

Output Keys

Key	Description
`loss_actor`	Deterministic policy loss (maximizes `min_i Q_i(s, π(s))`)
`loss_qvalue`	Q-function regression loss
`pred_value`	Mean predicted Q-value (for logging)
`target_value`	Mean target Q-value (for logging)
`state_action_value_actor`	Q-value evaluated at the current actor output
`next_state_value`	Bootstrap value from the target network

DDPGLoss

Deep Deterministic Policy Gradient (Lillicrap et al. 2015) is the deterministic actor-critic predecessor to TD3. It uses a single Q-network without noise smoothing.

from torchrl.objectives import DDPGLoss

loss_module = DDPGLoss(
    actor_network=actor,
    value_network=value,
    loss_function="smooth_l1",
    delay_actor=False,
    delay_value=True,
)

Constructor Parameters

actor_network

TensorDictModule

required

Deterministic policy operator.

value_network

TensorDictModule

required

Q(s, a) critic. Reads observations and actions, writes "state_action_value".

loss_function

str

default:"smooth_l1"

Loss for the Q-function residual.

delay_actor

bool

default:"False"

Whether to create a target actor network.

delay_value

bool

default:"True"

Whether to create a target value (Q) network.

Output Keys

Key	Description
`loss_actor`	DDPG actor loss
`loss_value`	Q-function regression loss
`pred_value`	Predicted Q-value (for logging)
`target_value`	TD target (for logging)
`pred_value_max`	Maximum predicted Q-value over batch
`target_value_max`	Maximum TD target over batch
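As a rough illustration of what the DDPG actor loss expresses, the sketch below uses plain PyTorch modules rather than TorchRL operators. The network sizes are arbitrary and the code is an assumption-laden example, not the DDPGLoss implementation.

```python
import torch
from torch import nn

n_obs, n_act = 3, 2
actor = nn.Sequential(nn.Linear(n_obs, n_act), nn.Tanh())  # deterministic policy
critic = nn.Linear(n_obs + n_act, 1)                       # Q(s, a)

obs = torch.randn(8, n_obs)

# Actor loss: maximise Q(s, pi(s)) by minimising its negative
q_pi = critic(torch.cat([obs, actor(obs)], dim=-1))
loss_actor = -q_pi.mean()
loss_actor.backward()

# Only the actor should be stepped with this loss; discard the critic gradients
critic.zero_grad()
```

The critic is trained separately by regressing Q(s, a) onto the TD target computed with the target networks.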

DQNLoss

Deep Q-Network (Mnih et al. 2015) is the standard Q-learning algorithm for discrete action spaces. It regresses Q-values against a bootstrapped target computed from a frozen target network.

from torchrl.objectives import DQNLoss

loss_module = DQNLoss(
    value_network=actor,
    loss_function="l2",
    delay_value=True,
    double_dqn=False,
    action_space=spec,
)

Constructor Parameters

value_network

QValueActor | nn.Module

required

Q-value network. Outputs "action_value" — a vector of Q-values, one per discrete action. TorchRL wraps plain nn.Modules in a QValueActor automatically.

loss_function

str

default:"l2"

Loss function for the Bellman residual. One of "l1", "l2", "smooth_l1".

delay_value

bool

default:"True"

If True, creates a target Q-network for computing stable bootstrap targets.

double_dqn

bool

default:"False"

Enable Double DQN (Van Hasselt et al. 2015): use the online network to select the next action but the target network to evaluate it. Requires delay_value=True.

action_space

str | TensorSpec

default:"None"

Discrete action space specification. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec instance.
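The difference between the standard and the Double DQN target can be sketched in plain PyTorch. The helper dqn_target and its arguments are hypothetical names chosen for this example; it only illustrates the action-selection rule described for double_dqn.

```python
import torch


def dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99, double_dqn=False):
    """TD target for Q-learning with one Q-value per discrete action.

    next_q_online / next_q_target: [batch, n_actions], Q-values at the next state.
    """
    if double_dqn:
        # Double DQN: the online network selects the next action ...
        next_action = next_q_online.argmax(dim=-1, keepdim=True)
    else:
        # Standard DQN: the target network both selects and evaluates
        next_action = next_q_target.argmax(dim=-1, keepdim=True)
    # ... and the target network evaluates it
    next_value = next_q_target.gather(-1, next_action)
    return reward + gamma * (1.0 - done.float()) * next_value


reward = torch.tensor([[1.0]])
done = torch.tensor([[False]])
next_q_online = torch.tensor([[1.0, 2.0, 0.0]])
next_q_target = torch.tensor([[5.0, 1.0, 3.0]])

vanilla = dqn_target(reward, done, next_q_online, next_q_target, gamma=0.9)
double = dqn_target(reward, done, next_q_online, next_q_target, gamma=0.9, double_dqn=True)
```

With these made-up values the standard target bootstraps from the largest target-network estimate (5.0), while the Double DQN target uses the target network's estimate for the action preferred by the online network (1.0).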

Output Keys

Key	Description
`loss`	Bellman residual loss

DQN Example

from torchrl.modules import MLP
from torchrl.data import OneHot
from torchrl.modules.tensordict_module.actors import QValueActor
from torchrl.objectives import DQNLoss

n_obs, n_act = 4, 3
spec = OneHot(n_act)

value_net = MLP(in_features=n_obs, out_features=n_act)
actor = QValueActor(value_net, in_keys=["observation"], action_space=spec)
loss = DQNLoss(actor, action_space=spec, delay_value=True, double_dqn=True)

DistributionalDQNLoss

Distributional DQN (Bellemare et al. 2017) models the full distribution of returns rather than the expected value. The network outputs a probability distribution over n_atoms support values between Vmin and Vmax, and the loss minimises the cross-entropy between the projected target distribution and the predicted distribution.

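import torch
from torchrl.data import OneHot
from torchrl.modules import DistributionalQValueActor, MLP
from torchrl.objectives import DistributionalDQNLoss

n_obs, n_act, n_atoms = 4, 3, 51   # illustrative sizes
Vmin, Vmax = -10.0, 10.0           # illustrative support bounds
spec = OneHot(n_act)

actor = DistributionalQValueActor(
    # One set of n_atoms values per action: output shape [..., n_atoms, n_act]
    module=MLP(in_features=n_obs, out_features=(n_atoms, n_act)),
    action_space=spec,
    support=torch.linspace(Vmin, Vmax, n_atoms),
)
loss = DistributionalDQNLoss(actor, gamma=0.99)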

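DistributionalDQNLoss does not expose a loss_function parameter: the loss is always the cross-entropy between the projected target distribution and the predicted distribution. This equals their KL divergence up to a term that does not depend on the parameters being trained.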
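The projection step and the resulting cross-entropy can be sketched in plain PyTorch. This follows the categorical projection described by Bellemare et al. under the assumption of an evenly spaced support; the helper project_distribution is a hypothetical name and the code is not taken from TorchRL.

```python
import torch


def project_distribution(next_probs, reward, done, support, gamma=0.99):
    """Project the shifted return distribution back onto the fixed support.

    next_probs: [batch, n_atoms], target distribution for the chosen next action
    reward, done: [batch, 1]; support: [n_atoms], evenly spaced atoms
    """
    n_atoms = support.numel()
    vmin, vmax = support[0].item(), support[-1].item()
    delta = (vmax - vmin) / (n_atoms - 1)
    # Apply the Bellman operator to every atom and clip to the support range
    tz = (reward + gamma * (1.0 - done.float()) * support).clamp(vmin, vmax)
    # Fractional atom index of each shifted atom
    b = ((tz - vmin) / delta).clamp(0, n_atoms - 1)
    lower, upper = b.floor().long(), b.ceil().long()
    # Split each atom's mass between its two neighbours (all to lower if b is an integer)
    w_upper = b - lower.float()
    w_lower = 1.0 - w_upper
    projected = torch.zeros_like(next_probs)
    projected.scatter_add_(1, lower, next_probs * w_lower)
    projected.scatter_add_(1, upper, next_probs * w_upper)
    return projected


support = torch.linspace(0.0, 4.0, 5)
next_probs = torch.tensor([[0.0, 0.0, 1.0, 0.0, 0.0]])  # all mass on the atom at 2.0
reward = torch.tensor([[0.5]])
done = torch.tensor([[False]])
projected = project_distribution(next_probs, reward, done, support, gamma=1.0)

# Cross-entropy against a (made-up) uniform prediction
pred_log_probs = torch.full((1, 5), 0.2).log()
loss = -(projected * pred_log_probs).sum(-1)
```

The shifted atom lands at 2.5, halfway between the atoms at 2.0 and 3.0, so its probability mass is split equally between them.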

REDQLoss

Randomized Ensemble Double Q-Learning (Chen et al. 2021) uses a large ensemble of Q-networks and randomizes the choice of two networks per update step. This provides strong overestimation control and improved sample efficiency, at the cost of higher memory usage.

from torchrl.objectives import REDQLoss

loss_module = REDQLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    num_qvalue_nets=10,
    sub_sample_len=2,       # number of networks sampled per update
    target_entropy="auto",
)

REDQLoss accepts the same alpha_init, fixed_alpha, and target_entropy keyword arguments as SACLoss. The key difference is the large ensemble (num_qvalue_nets=10 is typical) combined with random sub-sampling at each update.
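The random sub-sampling can be illustrated with a few lines of plain PyTorch. The helper redq_next_value and its num_sampled argument are hypothetical names for this sketch, which shows only the subset-minimum idea and not the full REDQ loss.

```python
import torch


def redq_next_value(next_q_ensemble, num_sampled=2, generator=None):
    """Minimum over a random subset of the ensemble.

    next_q_ensemble: [num_qvalue_nets, batch, 1], target Q-values at (s', a')
    """
    num_nets = next_q_ensemble.shape[0]
    # Draw num_sampled distinct critics uniformly at random
    idx = torch.randperm(num_nets, generator=generator)[:num_sampled]
    return next_q_ensemble[idx].min(dim=0).values, idx


gen = torch.Generator().manual_seed(0)
ensemble_q = torch.randn(10, 3, 1, generator=gen)  # 10 critics, batch of 3
next_value, sampled_idx = redq_next_value(ensemble_q, num_sampled=2, generator=gen)
```

Because only a subset is used, the result is never lower than the minimum over the whole ensemble, which makes the estimate less pessimistic than a full-ensemble minimum.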

CrossQLoss

CrossQ (Bhatt et al. 2019) is a SAC variant that removes the target networks. It evaluates the Q-values of the current and the next state-action pairs in a single joint forward pass through critics that contain batch normalization layers, so both halves of the batch share the same normalization statistics.

from torchrl.objectives import CrossQLoss

loss_module = CrossQLoss(
    actor_network=actor,
    qvalue_network=qvalue,
    num_qvalue_nets=2,
    target_entropy="auto",
)
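The joint forward pass can be sketched with a small batch-normalized critic in plain PyTorch. The architecture and tensor sizes are made up for illustration, and the code shows only the idea of evaluating both batches together, not how CrossQLoss is implemented.

```python
import torch
from torch import nn

n_obs, n_act, batch_size = 3, 2, 8
critic = nn.Sequential(
    nn.Linear(n_obs + n_act, 32),
    nn.BatchNorm1d(32),  # statistics are computed over the joint batch
    nn.ReLU(),
    nn.Linear(32, 1),
)

obs, act = torch.randn(batch_size, n_obs), torch.randn(batch_size, n_act)
next_obs, next_act = torch.randn(batch_size, n_obs), torch.randn(batch_size, n_act)
reward = torch.randn(batch_size, 1)

# Concatenate current and next state-action pairs along the batch dimension
joint = torch.cat(
    [torch.cat([obs, act], dim=-1), torch.cat([next_obs, next_act], dim=-1)], dim=0
)
q, next_q = critic(joint).chunk(2, dim=0)

# The next-state half only serves as a target, so no gradient flows through it
target = reward + 0.99 * next_q.detach()
loss = nn.functional.mse_loss(q, target)
```

No separate target critic appears anywhere in this sketch: the same network produces both the prediction and the bootstrap value.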

Target Network Updaters

All off-policy objectives that use target networks expose delay_* parameters. After creating the loss, attach a SoftUpdate or HardUpdate updater:

from torchrl.objectives.utils import SoftUpdate, HardUpdate

# Polyak averaging (τ controls how fast the target tracks the online network; tau = 1 - eps)
soft_updater = SoftUpdate(loss_module, tau=0.005)
soft_updater.step()   # call once per gradient update

# Hard update (copies weights every N steps)
hard_updater = HardUpdate(loss_module, value_network_update_interval=1000)
hard_updater.step()   # increments internal counter; copies when interval reached
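What the two updaters do to the target parameters can be sketched with plain PyTorch modules. The helpers polyak_update and hard_update are hypothetical stand-ins written for this example; they mirror the update rules only and do not use the TorchRL updater classes.

```python
import torch
from torch import nn


@torch.no_grad()
def polyak_update(online, target, tau=0.005):
    """Soft update: target <- (1 - tau) * target + tau * online."""
    for p, p_targ in zip(online.parameters(), target.parameters()):
        p_targ.mul_(1.0 - tau).add_(p, alpha=tau)


def hard_update(online, target):
    """Hard update: copy the online weights into the target network."""
    target.load_state_dict(online.state_dict())


online, target_net = nn.Linear(2, 2), nn.Linear(2, 2)
for p in online.parameters():
    nn.init.ones_(p)
for p in target_net.parameters():
    nn.init.zeros_(p)

polyak_update(online, target_net, tau=0.1)
soft_weight = target_net.weight.clone()  # every entry moved 10% of the way to 1.0

hard_update(online, target_net)
hard_weight = target_net.weight.clone()  # now identical to the online weights
```

A small tau makes the target change slowly and smoothly, whereas the hard update leaves the target fixed between copies and then replaces it all at once.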

Comparing Value-Based Objectives

SACLoss

Best for continuous action spaces requiring strong exploration. Automatic entropy tuning (fixed_alpha=False) is stable across most environments.

loss = SACLoss(actor, qvalue, num_qvalue_nets=2)

TD3Loss

Better than DDPG in practice. Target action noise reduces overestimation smoothly. Use with delay_actor=True and a delayed actor update schedule.

loss = TD3Loss(actor, qvalue, action_spec=spec)

DDPGLoss

Classic deterministic actor-critic. Simple but prone to Q-value overestimation. Prefer TD3 for new projects.

loss = DDPGLoss(actor, value)

DQNLoss

Standard choice for discrete action spaces. Enable double_dqn=True for reduced bias at minimal extra cost.

loss = DQNLoss(actor, action_space=spec, double_dqn=True)
