Multi-agent reinforcement learning (MARL) requires objectives that handle multiple actors simultaneously, potentially sharing or mixing their value estimates. TorchRL provides MAPPOLoss, IPPOLoss, and QMixerLoss, all of which follow the standard multi-agent data convention of nesting per-agent tensors under a group key (typically "agents"). These objectives compose cleanly with the same LossModule interface used by single-agent algorithms.

Multi-Agent Data Convention

All TorchRL multi-agent objectives expect per-agent data to be nested under a group key inside the TensorDict. For an environment with n_agents agents:
import torch
from tensordict import TensorDict

batch, T, n_agents, obs_dim, act_dim = 32, 10, 3, 6, 2  # illustrative sizes

# Shape notation: [*B, T, n_agents, feature_dim]
tensordict = TensorDict({
    ("agents", "observation"): torch.randn(batch, T, n_agents, obs_dim),
    ("agents", "action"):      torch.randn(batch, T, n_agents, act_dim),
    ("agents", "state_value"): torch.randn(batch, T, n_agents, 1),
    ("next", "reward"):        torch.randn(batch, T, 1),    # team-shared
    ("next", "done"):          torch.zeros(batch, T, 1, dtype=torch.bool),
    ("next", "terminated"):    torch.zeros(batch, T, 1, dtype=torch.bool),
}, batch_size=[batch, T])
Team-shared signals (reward, done, terminated) have shape [*B, T, 1] — they are not duplicated along the agent dimension. MultiAgentGAE automatically broadcasts them to [*B, T, n_agents, 1] before computing advantages.
For competitive settings where agents receive individual rewards, use per-agent reward keys: ("next", "agents", "reward") of shape [*B, T, n_agents, 1].
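The broadcasting of team-shared signals described above can be pictured with plain PyTorch. This is only a shape-level sketch with arbitrary sizes, not the MultiAgentGAE implementation:

```python
import torch

batch, T, n_agents = 4, 5, 3  # arbitrary illustrative sizes

# Team-shared reward: one scalar per (batch, time) step.
team_reward = torch.randn(batch, T, 1)

# Insert an agent dimension at position -2 and expand it (a view, no copy).
per_agent_reward = team_reward.unsqueeze(-2).expand(batch, T, n_agents, 1)

# Every agent sees the same reward value at each step.
same_for_all_agents = bool((per_agent_reward == team_reward.unsqueeze(-2)).all())
print(per_agent_reward.shape, same_for_all_agents)
```

With per-agent rewards the tensor already carries the agent dimension, so no expansion would be needed.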

MAPPOLoss

Multi-Agent PPO (Yu et al. 2022, NeurIPS) pairs a decentralised actor (each agent’s policy sees only its own observation) with a centralised critic (a single value function that conditions on the full team state or concatenated observations). Training with a centralised critic while executing decentralised policies at test time is what defines the CTDE (Centralized Training, Decentralized Execution) paradigm. MAPPOLoss is a thin specialisation of ClipPPOLoss with three differences:
  1. The default value estimator is MultiAgentGAE instead of GAE.
  2. normalize_advantage_exclude_dims defaults to (-2,) so the agent dimension is excluded from advantage standardization.
  3. An optional ValueNorm (PopArt or running-mean normalization) can be attached to stabilize critic loss when reward scales drift.
from torchrl.objectives.multiagent import MAPPOLoss
from torchrl.modules import PopArtValueNorm

loss_module = MAPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    normalize_advantage=True,
    value_norm=PopArtValueNorm(shape=1),
)
loss_module.set_keys(
    value=("agents", "state_value"),
    action=("agents", "action"),
)
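To give an intuition for what a value normalizer does, here is a minimal sketch of running mean/variance normalization in plain PyTorch. The class `RunningNormSketch` and its `beta`/`eps` parameters are hypothetical and written for illustration only; this is not TorchRL's ValueNorm API, and the parameter rescaling that distinguishes PopArt is not shown.

```python
import torch


class RunningNormSketch:
    """Hypothetical running mean/variance normaliser (not TorchRL's ValueNorm)."""

    def __init__(self, beta=0.99, eps=1e-5):
        self.beta, self.eps = beta, eps
        self.mean = torch.zeros(())    # running estimate of E[x]
        self.mean_sq = torch.ones(())  # running estimate of E[x^2]

    def update(self, target):
        # Exponential moving averages of the first and second moments.
        self.mean = self.beta * self.mean + (1 - self.beta) * target.mean()
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * target.pow(2).mean()

    def _std(self):
        return (self.mean_sq - self.mean.pow(2)).clamp(min=self.eps).sqrt()

    def normalize(self, x):
        return (x - self.mean) / self._std()

    def denormalize(self, x):
        return x * self._std() + self.mean


torch.manual_seed(0)
norm = RunningNormSketch()
for _ in range(2000):
    # Value targets with a large offset and scale, as with unnormalised returns.
    norm.update(50.0 + 5.0 * torch.randn(1024))

targets = 50.0 + 5.0 * torch.randn(1024)
normalized = norm.normalize(targets)     # roughly zero mean, unit std
restored = norm.denormalize(normalized)  # back on the original scale
```

In a critic loss, both the prediction and the target would be passed through `normalize` before the distance is computed, so the loss magnitude stays comparable as reward scales drift.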

Constructor Parameters

actor_network
ProbabilisticTensorDictSequential
required
Per-agent decentralised policy. Build with MultiAgentMLP(centralized=False, share_params=True) for cooperative homogeneous teams. Reads ("agents", "observation") and writes ("agents", "action").
critic_network
TensorDictModule
required
Centralised value operator. Build with MultiAgentMLP(centralized=True, share_params=True) so it conditions on all agents’ observations and returns ("agents", "state_value") of shape [*B, n_agents, 1].
value_norm
ValueNorm | None
default:"None"
Optional running normalizer for the critic target and prediction. When provided, the target and prediction are normalised before the MSE / smooth-L1 distance, stabilising training on tasks with drifting reward scales. The MAPPO paper (Yu et al., Table 13) reports this is load-bearing on SMAC. Supported types:
  • PopArtValueNorm: exponential moving-average normalization with parameter rescaling (recommended for SMAC and other sparse-reward tasks).
  • RunningValueNorm: simple mean-variance normalization without parameter rescaling (for stationary reward scales).
clip_epsilon
float
default:"0.2"
PPO importance-weight clip threshold. Inherited from ClipPPOLoss.
entropy_coeff
float
default:"0.01"
Entropy bonus weight. Defaults to 0.01 (MAPPO default), the same value as ClipPPOLoss.
normalize_advantage
bool
default:"True"
Whether to standardise advantages before use. Defaults to True (MAPPO default), unlike the parent ClipPPOLoss which defaults to False.
normalize_advantage_exclude_dims
tuple[int, ...]
default:"(-2,)"
Dimensions excluded from advantage standardization. Defaults to (-2,) to exclude the agent dimension so each agent’s advantages are normalized independently.
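The effect of excluding the agent dimension from advantage standardization can be sketched as follows. The helper `standardize_advantage` is a hypothetical illustration of the idea (reduce over every dimension except the excluded ones); it is not the code TorchRL runs internally.

```python
import torch


def standardize_advantage(adv, exclude_dims=(-2,), eps=1e-6):
    """Standardise `adv` over every dimension not listed in `exclude_dims`."""
    excluded = {d % adv.ndim for d in exclude_dims}
    reduce_dims = [d for d in range(adv.ndim) if d not in excluded]
    mean = adv.mean(dim=reduce_dims, keepdim=True)
    std = adv.std(dim=reduce_dims, keepdim=True)
    return (adv - mean) / (std + eps)


torch.manual_seed(0)
batch, T, n_agents = 8, 16, 3
# Give each agent a different advantage scale and offset.
scale = torch.tensor([1.0, 10.0, 100.0]).view(1, 1, n_agents, 1)
adv = torch.randn(batch, T, n_agents, 1) * scale + scale

norm_adv = standardize_advantage(adv)
# Statistics are computed per agent, so each agent ends up standardised on its own.
per_agent_mean = norm_adv.mean(dim=(0, 1, 3))
per_agent_std = norm_adv.std(dim=(0, 1, 3))
```

Without the exclusion, the agent with the largest advantage scale would dominate the pooled statistics and the other agents' advantages would be squashed towards zero.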

Output Keys

MAPPOLoss returns the same keys as ClipPPOLoss:
| Key | Description |
| --- | --- |
| loss_objective | Clipped surrogate PPO objective |
| loss_critic | Critic MSE / smooth-L1 loss |
| loss_entropy | Entropy bonus |
| entropy | Mean policy entropy across agents |
| kl_approx | Approximate KL divergence |
| clip_fraction | Fraction of clipped importance weights |
| explained_variance | R² of critic predictions vs. value targets |
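For readers unfamiliar with how the first few of these quantities relate, the following is a minimal sketch of the standard PPO clipped surrogate together with a simple KL estimate and the clip fraction. The function `clipped_surrogate` is hypothetical and the exact estimators used inside ClipPPOLoss may differ.

```python
import torch


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_epsilon=0.2):
    log_ratio = log_prob_new - log_prob_old
    ratio = log_ratio.exp()  # importance weight pi_new / pi_old
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - clip_epsilon, 1 + clip_epsilon) * advantage
    # Pessimistic (element-wise minimum) objective, negated to obtain a loss.
    loss_objective = -torch.min(unclipped, clipped).mean()
    # Simple sample-based KL estimate; individual batches can yield negative values.
    kl_approx = (-log_ratio).mean()
    # Share of samples whose importance weight left the clipping interval.
    clip_fraction = ((ratio - 1).abs() > clip_epsilon).float().mean()
    return loss_objective, kl_approx, clip_fraction


log_prob_old = torch.zeros(4)
log_prob_new = torch.log(torch.tensor([1.5, 1.5, 1.0, 0.5]))
advantage = torch.tensor([1.0, -1.0, 1.0, 1.0])
loss_objective, kl_approx, clip_fraction = clipped_surrogate(
    log_prob_new, log_prob_old, advantage
)
```

A persistently high clip fraction or KL estimate usually suggests that the policy is moving too far per update.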

Complete MAPPO Example

import torch
from tensordict.nn import TensorDictModule
from torchrl.modules import MultiAgentMLP, PopArtValueNorm, ProbabilisticActor
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.objectives.multiagent import MAPPOLoss
from torchrl.objectives.value import MultiAgentGAE

n_agents, obs_dim, action_dim = 3, 6, 2

# Decentralised actor (each agent sees its own obs_dim-dimensional observation)
actor_net = torch.nn.Sequential(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=2 * action_dim,
        n_agents=n_agents,
        centralized=False,
        share_params=True,
    ),
    NormalParamExtractor(),
)
actor_module = TensorDictModule(
    actor_net,
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "loc"), ("agents", "scale")],
)
actor = ProbabilisticActor(
    module=actor_module,
    in_keys=[("agents", "loc"), ("agents", "scale")],
    out_keys=[("agents", "action")],
    distribution_class=TanhNormal,
    return_log_prob=True,
)

# Centralised critic (conditions on all agents' observations)
critic = TensorDictModule(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=1,
        n_agents=n_agents,
        centralized=True,
        share_params=True,
    ),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)

# MAPPO loss
loss_module = MAPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_coeff=0.01,
    value_norm=PopArtValueNorm(shape=1),
)
loss_module.set_keys(
    value=("agents", "state_value"),
    action=("agents", "action"),
)

# Multi-agent GAE for advantage estimation
gae = MultiAgentGAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    average_gae=False,
)

# Training step (`data` is a collected batch that follows the data convention above)
gae(data)                         # writes ("agents", "advantage") etc.
loss_td = loss_module(data)
loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]
loss.backward()

IPPOLoss

Independent PPO (de Witt et al. 2020) is the decentralised counterpart of MAPPO. Each agent’s value function conditions only on its local observation: there is no centralised critic and no global state (network parameters may still be shared across agents, as in the example below). The paper “Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?” demonstrates that IPPO is surprisingly competitive with MAPPO on many SMAC scenarios. IPPOLoss is structurally identical to MAPPOLoss; the only difference is the critic construction:
from torchrl.objectives.multiagent import IPPOLoss

# Per-agent (decentralised) critic
per_agent_critic = TensorDictModule(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=1,
        n_agents=n_agents,
        centralized=False,   # <-- key difference vs. MAPPO
        share_params=True,
    ),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)

loss_module = IPPOLoss(actor_network=actor, critic_network=per_agent_critic)

When to Use MAPPO vs. IPPO

| | MAPPO | IPPO |
| --- | --- | --- |
| Critic | Centralised (full team state) | Per-agent (local obs only) |
| Requires global state | Yes (or concatenated obs) | No |
| Typical advantage | Higher (more information) | Slightly lower |
| Training complexity | Higher | Lower |
| SMAC performance | Competitive | Competitive |
| Competitive MARL | Not directly applicable | More natural |
Start with IPPOLoss when you don’t have access to a global state key. The per-agent MultiAgentGAE handles the shared reward broadcasting automatically without requiring any additional inputs.
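The difference between the two critics comes down to what each value head is allowed to see. The sketch below uses plain `torch.nn` layers rather than MultiAgentMLP, with arbitrary layer sizes, so it illustrates the information flow only and does not reproduce how TorchRL builds these networks.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, n_agents, obs_dim = 32, 3, 6
obs = torch.randn(batch, n_agents, obs_dim)

# IPPO-style critic: one MLP applied to each agent's own observation.
local_critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
local_values = local_critic(obs)  # [batch, n_agents, 1]

# MAPPO-style critic: the MLP sees all observations concatenated
# and emits one value per agent.
central_critic = nn.Sequential(
    nn.Linear(n_agents * obs_dim, 64), nn.Tanh(), nn.Linear(64, n_agents)
)
central_values = central_critic(obs.flatten(-2)).unsqueeze(-1)  # [batch, n_agents, 1]

# Perturb only agent 0's observation and check agent 1's value estimates.
obs_perturbed = obs.clone()
obs_perturbed[:, 0] += 1.0
local_changed = not torch.allclose(
    local_critic(obs_perturbed)[:, 1], local_values[:, 1]
)
central_changed = not torch.allclose(
    central_critic(obs_perturbed.flatten(-2)).unsqueeze(-1)[:, 1], central_values[:, 1]
)
print(local_changed, central_changed)
```

Agent 1's local value ignores the change to agent 0's observation, while its centralised value reacts to it; that extra information is what the centralised critic trades against a larger input and the need for joint observations.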

QMixerLoss

QMIX (Rashid et al. 2018) and VDN (Sunehag et al. 2017) are value-decomposition methods for cooperative MARL. Each agent maintains a local Q-function, and a mixer network combines them into a global Q-value used for DQN-style updates. The mixer enforces the Individual-Global-Max (IGM) consistency constraint so that the joint greedy policy decomposes into independent per-agent greedy policies. QMixerLoss takes a QValueActor (local Q-networks) and a TensorDictModule mixer, then applies the standard DQN objective on the global mixed Q-value.
from torchrl.objectives.multiagent import QMixerLoss
from torchrl.modules.models.multiagent import QMixer

loss_module = QMixerLoss(
    local_value_network=qnet,
    mixer_network=qmixer,
    loss_function="l2",
    delay_value=True,
    action_space="categorical",
)

Constructor Parameters

local_value_network
QValueActor | nn.Module
required
Local Q-value actor. Outputs ("agents", "action_value") of shape [*B, n_agents, n_actions] and ("agents", "chosen_action_value") of shape [*B, n_agents, 1].
mixer_network
TensorDictModule | nn.Module
required
Mixing network. Reads ("agents", "chosen_action_value") (and optionally a global "state" key) and writes the global "chosen_action_value" of shape [*B, 1]. Use QMixer from torchrl.modules.models.multiagent for the standard monotonic QMIX architecture, or wrap a simple sum to get VDN.
loss_function
str
default:"l2"
Loss function for the global Q-value Bellman regression.
delay_value
bool
default:"True"
If True, creates separate target value networks for computing Bellman targets with a frozen network.
action_space
str | TensorSpec
default:"None"
Discrete action space type. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec.

Input Keys (via set_keys)

| Key | Default | Description |
| --- | --- | --- |
| local_value | ("agents", "chosen_action_value") | Per-agent chosen Q-values |
| global_value | "chosen_action_value" | Mixed global Q-value |
| action | ("agents", "action") | Per-agent actions |
| priority | "td_error" | Priority key for replay buffer |

Output Keys

| Key | Description |
| --- | --- |
| loss | Bellman regression loss on the global mixed Q-value |
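The Bellman regression behind this output can be sketched as a one-step TD target on the mixed value. The function `mixed_td_loss` and its `gamma` argument are hypothetical names used for illustration; QMixerLoss handles target networks, key lookup, and the choice of distance internally.

```python
import torch
import torch.nn.functional as F


def mixed_td_loss(q_tot, next_q_tot_target, reward, terminated, gamma=0.99):
    """One-step TD regression on the global mixed Q-value (all inputs [B, 1])."""
    not_terminated = (~terminated).float()
    # Bootstrap from the target network's mixed value unless the episode terminated.
    target = reward + gamma * not_terminated * next_q_tot_target
    return F.mse_loss(q_tot, target.detach())


q_tot = torch.tensor([[1.0], [2.0]])
next_q_tot_target = torch.tensor([[3.0], [4.0]])
reward = torch.tensor([[1.0], [1.0]])
terminated = torch.tensor([[False], [True]])
loss = mixed_td_loss(q_tot, next_q_tot_target, reward, terminated, gamma=0.5)
```

Because the regression is on the mixed value, gradients reach every local Q-network through the mixer rather than through separate per-agent losses.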

Complete QMIX Example

import torch
from torch import nn
from tensordict import TensorDict
from tensordict.nn import TensorDictModule
from torchrl.modules import QValueModule, SafeSequential
from torchrl.modules.models.multiagent import QMixer
from torchrl.objectives.multiagent import QMixerLoss

n_agents, obs_dim, n_actions = 4, 10, 3
state_shape = (64, 64, 3)

# Per-agent Q-network
q_module = TensorDictModule(
    nn.Linear(obs_dim, n_actions),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "action_value")],
)
value_module = QValueModule(
    action_value_key=("agents", "action_value"),
    out_keys=[
        ("agents", "action"),
        ("agents", "action_value"),
        ("agents", "chosen_action_value"),
    ],
    action_space="categorical",
)
qnet = SafeSequential(q_module, value_module)

# QMIX mixer
qmixer = TensorDictModule(
    module=QMixer(
        state_shape=state_shape,
        mixing_embed_dim=32,
        n_agents=n_agents,
        device="cpu",
    ),
    in_keys=[("agents", "chosen_action_value"), "state"],
    out_keys=["chosen_action_value"],
)

loss = QMixerLoss(qnet, qmixer, action_space="categorical")

# Data with agent dimension and global state
td = TensorDict({
    "agents": TensorDict(
        {"observation": torch.zeros(32, n_agents, obs_dim)}, [32, n_agents]
    ),
    "state": torch.zeros(32, *state_shape),
    "next": TensorDict({
        "agents": TensorDict(
            {"observation": torch.zeros(32, n_agents, obs_dim)}, [32, n_agents]
        ),
        "state": torch.zeros(32, *state_shape),
        "reward": torch.zeros(32, 1),
        "done": torch.zeros(32, 1, dtype=torch.bool),
        "terminated": torch.zeros(32, 1, dtype=torch.bool),
    }, [32]),
}, [32])

loss_td = loss(qnet(td))
print(loss_td["loss"])

QMIX vs VDN

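In QMIX, the mixer is a hypernetwork conditioned on the global state that produces non-negative weights for a monotonic combination of local Q-values. It is suitable when the global state is available and agents have complex dependencies. VDN instead combines the local Q-values with a plain sum and needs no global state. The QMIX mixer is constructed as follows: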
qmixer = TensorDictModule(
    QMixer(state_shape=state_shape, mixing_embed_dim=32, n_agents=n_agents),
    in_keys=[("agents", "chosen_action_value"), "state"],
    out_keys=["chosen_action_value"],
)
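To make the contrast concrete, here is a minimal sketch of both mixing rules in plain PyTorch. `MonotonicMixerSketch` is a hypothetical single-layer simplification written for this example (the QMixer module is more elaborate), and `vdn_mix` is simply a sum over the agent dimension.

```python
import torch
from torch import nn


class MonotonicMixerSketch(nn.Module):
    """Hypothetical single-layer monotonic mixer; not TorchRL's QMixer."""

    def __init__(self, n_agents, state_dim):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)  # state -> mixing weights
        self.hyper_b = nn.Linear(state_dim, 1)         # state -> bias

    def forward(self, agent_qs, state):
        # agent_qs: [B, n_agents, 1], state: [B, state_dim]
        w = self.hyper_w(state).abs().unsqueeze(-1)  # non-negative, [B, n_agents, 1]
        b = self.hyper_b(state)                      # [B, 1]
        return (w * agent_qs).sum(dim=-2) + b        # [B, 1]


def vdn_mix(agent_qs):
    """VDN-style mixing: sum local Q-values over the agent dimension."""
    return agent_qs.sum(dim=-2)  # [B, 1]


torch.manual_seed(0)
B, n_agents, state_dim = 16, 4, 8
agent_qs = torch.randn(B, n_agents, 1)
state = torch.randn(B, state_dim)
mixer = MonotonicMixerSketch(n_agents, state_dim)

q_tot = mixer(agent_qs, state)
raised = agent_qs.clone()
raised[:, 0] += 1.0  # raise agent 0's local Q-value
q_tot_raised = mixer(raised, state)
```

Taking the absolute value of the hypernetwork output keeps every mixing weight non-negative, so raising any local Q-value can never lower the mixed value. That monotonicity is what allows each agent to act greedily on its own Q-function.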

MultiAgentGAE

MultiAgentGAE is the multi-agent extension of GAE for settings where the value network produces per-agent estimates [*B, T, n_agents, 1] but the reward/done signals are team-shared [*B, T, 1]. It automatically broadcasts shared signals to the agent dimension before running the standard GAE recursion.
from torchrl.objectives.value import MultiAgentGAE

gae = MultiAgentGAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    average_gae=False,
    agent_dim=-2,   # default: penultimate dimension holds agents
)
gae(data)
# writes ("agents", "advantage") of shape [*B, T, n_agents, 1]
# writes ("agents", "value_target") of shape [*B, T, n_agents, 1]
agent_dim
int
default:"-2"
The dimension holding the agent index in the value tensor. Negative dimensions are interpreted modulo value.ndim. Defaults to -2 (penultimate), consistent with MultiAgentMLP’s output layout.
See the Value Estimators reference for the full MultiAgentGAE API.
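As a rough picture of the computation, the sketch below broadcasts team-shared signals along the agent dimension and then runs the usual backward GAE recursion. It assumes a single leading batch dimension and a single `done` flag that both stops bootstrapping and cuts the trace; `multi_agent_gae_sketch` is a hypothetical helper, not the library implementation.

```python
import torch


def multi_agent_gae_sketch(reward, done, value, next_value, gamma=0.99, lmbda=0.95):
    """reward/done: [B, T, 1]; value/next_value: [B, T, n_agents, 1]."""
    # Broadcast team-shared signals along a new agent dimension at position -2.
    reward = reward.unsqueeze(-2).expand_as(value)
    not_done = (~done).float().unsqueeze(-2).expand_as(value)

    # One-step TD residuals, computed per agent.
    delta = reward + gamma * not_done * next_value - value

    advantage = torch.zeros_like(value)
    running = torch.zeros_like(value[:, 0])
    # Backward recursion over time: A_t = delta_t + gamma * lambda * A_{t+1}.
    for t in reversed(range(value.shape[1])):
        running = delta[:, t] + gamma * lmbda * not_done[:, t] * running
        advantage[:, t] = running
    value_target = advantage + value
    return advantage, value_target


reward = torch.ones(1, 2, 1)
done = torch.zeros(1, 2, 1, dtype=torch.bool)
value = torch.zeros(1, 2, 2, 1)
next_value = torch.zeros(1, 2, 2, 1)
adv, target = multi_agent_gae_sketch(
    reward, done, value, next_value, gamma=0.5, lmbda=1.0
)
```

Since the reward is shared, any difference between agents' advantages in this sketch comes only from their value estimates.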

Connecting Multi-Agent Objectives to Environments

TorchRL’s multi-agent environments (VmasEnv, PettingZooEnv, etc.) automatically use the nested ("agents", ...) key convention. The collector’s output can be passed directly to the loss module after advantage estimation:
from torchrl.envs.libs.vmas import VmasEnv
from torchrl.collectors import SyncDataCollector

env = VmasEnv("simple_spread", num_envs=32, n_agents=3)
collector = SyncDataCollector(env, actor, frames_per_batch=2000)

for data in collector:
    # data["agents"]["observation"]: [32, T, 3, obs_dim]
    gae(data)
    loss_td = loss_module(data)
    ...
If your environment returns per-agent rewards under ("next", "agents", "reward") (competitive / mixed settings), you must configure MultiAgentGAE accordingly and pass the correct key via loss_module.set_keys(reward=("agents", "reward")).
