Multi-agent reinforcement learning (MARL) requires objectives that handle multiple actors simultaneously, potentially sharing or mixing their value estimates. TorchRL provides MAPPOLoss, IPPOLoss, and QMixerLoss, all of which follow the standard multi-agent data convention of nesting per-agent tensors under a group key (typically "agents"). These objectives compose cleanly with the same LossModule interface used by single-agent algorithms.
Multi-Agent Data Convention
All TorchRL multi-agent objectives expect per-agent data to be nested under a group key inside the TensorDict, for any environment with n_agents agents.
Team-shared signals (reward, done, terminated) have shape [*B, T, 1]; they are not duplicated along the agent dimension. MultiAgentGAE automatically broadcasts them to [*B, T, n_agents, 1] before computing advantages. Per-agent rewards, by contrast, are stored under ("next", "agents", "reward") with shape [*B, T, n_agents, 1].
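The broadcasting step can be pictured with plain nested lists standing in for tensors. This is only an illustrative sketch: `broadcast_to_agents` is a hypothetical helper, not a TorchRL function, and it handles a single trajectory of shape [T, 1] with no batch dimensions.

```python
def broadcast_to_agents(shared, n_agents):
    """Expand a team-shared signal of shape [T, 1] to [T, n_agents, 1].

    Hypothetical helper for illustration; nested lists stand in for tensors.
    """
    return [[list(step) for _ in range(n_agents)] for step in shared]


# Team reward for T = 3 steps, shape [T, 1].
team_reward = [[1.0], [0.0], [2.5]]
per_agent = broadcast_to_agents(team_reward, n_agents=2)

# The result has shape [T, n_agents, 1]; every agent sees the same reward.
print(len(per_agent), len(per_agent[0]), len(per_agent[0][0]))  # 3 2 1
print(per_agent[2])  # [[2.5], [2.5]]
```

The copy is made per agent, so later per-agent arithmetic (such as subtracting per-agent values) never aliases the shared entry.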
MAPPOLoss
Multi-Agent PPO (Yu et al. 2022, NeurIPS) pairs a decentralised actor (each agent’s policy sees only its own observation) with a centralised critic (a single value function that conditions on the full team state or concatenated observations). Centralised training combined with decentralised execution at test time defines the CTDE (Centralized Training, Decentralized Execution) paradigm.
MAPPOLoss is a thin specialisation of ClipPPOLoss with three differences:
- The default value estimator is MultiAgentGAE instead of GAE.
- normalize_advantage_exclude_dims defaults to (-2,) so the agent dimension is excluded from advantage standardization.
- An optional ValueNorm (PopArt or running-mean normalization) can be attached to stabilize critic loss when reward scales drift.
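To make the second difference concrete, the sketch below standardizes advantages separately for each agent, which is what excluding the agent dimension from the reduction amounts to. It is a minimal illustration under stated assumptions: `standardize_per_agent` is a hypothetical helper, advantages are a nested list of shape [T, n_agents], and the population standard deviation is used (the library's exact estimator may differ).

```python
from statistics import fmean, pstdev


def standardize_per_agent(adv, eps=1e-8):
    """Standardize a [T, n_agents] advantage table one agent column at a time.

    Hypothetical helper: statistics are reduced over time only, so the
    agent dimension is excluded from the mean and std computation.
    """
    n_agents = len(adv[0])
    out = [[0.0] * n_agents for _ in adv]
    for a in range(n_agents):
        column = [step[a] for step in adv]
        mean, std = fmean(column), pstdev(column)
        for t, value in enumerate(column):
            out[t][a] = (value - mean) / (std + eps)
    return out


# Agent 1's advantages are 10x larger than agent 0's.
advantages = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = standardize_per_agent(advantages)
print([round(step[0], 3) for step in normalized])  # [-1.225, 0.0, 1.225]
print([round(step[1], 3) for step in normalized])  # [-1.225, 0.0, 1.225]
```

Both agents end up on the same scale, so an agent with large raw advantages does not dominate the policy gradient.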
Constructor Parameters
Per-agent decentralised policy. Build with MultiAgentMLP(centralized=False, share_params=True) for cooperative homogeneous teams. Reads ("agents", "observation") and writes ("agents", "action").
Centralised value operator. Build with MultiAgentMLP(centralized=True, share_params=True) so it conditions on all agents’ observations and returns ("agents", "state_value") of shape [*B, n_agents, 1].
Optional running normalizer for the critic target and prediction. When provided, the target and prediction are normalised before the MSE / smooth-L1 distance, stabilising training on tasks with drifting reward scales. The MAPPO paper (Yu et al. Table 13) reports that this normalization is load-bearing on SMAC. Supported types:
- PopArtValueNorm: exponential moving-average normalization with parameter rescaling (recommended for SMAC and other sparse-reward tasks).
- RunningValueNorm: simple mean-variance normalization without parameter rescaling (for stationary reward scales).
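The phrase "parameter rescaling" can be illustrated with a scalar PopArt-style update. The sketch below is not TorchRL's PopArtValueNorm: `PopArtSketch`, its `beta` step size, and the one-dimensional value head `w * h + b` are all assumptions made for illustration. It shows the key property that, after the running statistics move, the head is rescaled so denormalized predictions stay unchanged.

```python
import math


class PopArtSketch:
    """Scalar PopArt-style normalizer for a value head y = w * h + b.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, beta=0.1):
        self.beta = beta            # EMA step size
        self.mean = 0.0             # running first moment of the targets
        self.sq_mean = 1.0          # running second moment of the targets
        self.w, self.b = 1.0, 0.0   # head parameters in normalized space

    @property
    def std(self):
        return math.sqrt(max(self.sq_mean - self.mean ** 2, 1e-8))

    def update(self, target):
        old_mean, old_std = self.mean, self.std
        self.mean = (1 - self.beta) * self.mean + self.beta * target
        self.sq_mean = (1 - self.beta) * self.sq_mean + self.beta * target ** 2
        # Rescale the head so denormalized outputs are preserved.
        self.w *= old_std / self.std
        self.b = (old_std * self.b + old_mean - self.mean) / self.std

    def normalize(self, value):
        return (value - self.mean) / self.std

    def predict(self, h):
        # Denormalized prediction: std * (w * h + b) + mean.
        return self.std * (self.w * h + self.b) + self.mean


norm = PopArtSketch(beta=0.1)
before = norm.predict(2.0)
norm.update(100.0)          # a large target shifts the running statistics
after = norm.predict(2.0)   # prediction in reward units is unchanged
```

A running mean-variance normalizer without rescaling would keep `update` but leave `w` and `b` untouched, so predictions would jump whenever the statistics move; that is why it suits stationary reward scales.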
PPO importance-weight clip threshold. Inherited from ClipPPOLoss.
Entropy bonus weight. Defaults to 0.01 (the MAPPO default), the same value as ClipPPOLoss.
Whether to standardise advantages before use. Defaults to True (the MAPPO default), unlike the parent ClipPPOLoss, which defaults to False.
Dimensions excluded from advantage standardization. Defaults to (-2,) to exclude the agent dimension so each agent’s advantages are normalized independently.
Output Keys
MAPPOLoss returns the same keys as ClipPPOLoss:
| Key | Description |
|---|---|
| loss_objective | Clipped surrogate PPO objective |
| loss_critic | Critic MSE / smooth-L1 loss |
| loss_entropy | Entropy bonus |
| entropy | Mean policy entropy across agents |
| kl_approx | Approximate KL divergence |
| clip_fraction | Fraction of clipped importance weights |
| explained_variance | R² of critic predictions vs. value targets |
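As a rough guide to what several of these quantities measure, the sketch below computes them from flat lists of log-probabilities, advantages, and value targets. It is an assumption-laden illustration rather than the library's code: `ppo_diagnostics` and `explained_variance` are hypothetical helpers, the KL estimate is the simple mean of old minus new log-probabilities, and the exact estimators used by the loss module may differ.

```python
import math
from statistics import fmean, pvariance


def ppo_diagnostics(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate loss plus two diagnostics (hypothetical helper)."""
    ratios = [math.exp(n - o) for n, o in zip(log_prob, old_log_prob)]
    surrogate, clipped = [], []
    for r, adv in zip(ratios, advantage):
        r_clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two objectives.
        surrogate.append(min(r * adv, r_clipped * adv))
        clipped.append(1.0 if abs(r - 1.0) > clip_eps else 0.0)
    return {
        "loss_objective": -fmean(surrogate),
        "kl_approx": fmean([o - n for n, o in zip(log_prob, old_log_prob)]),
        "clip_fraction": fmean(clipped),
    }


def explained_variance(prediction, target):
    """1 - Var(target - prediction) / Var(target)."""
    residual = [t - p for p, t in zip(prediction, target)]
    return 1.0 - pvariance(residual) / pvariance(target)


old_lp = [0.0, 0.0, 0.0, 0.0]
new_lp = [math.log(1.5), math.log(0.5), 0.0, math.log(1.1)]  # ratios 1.5, 0.5, 1.0, 1.1
stats = ppo_diagnostics(new_lp, old_lp, advantage=[1.0, 1.0, -1.0, 2.0])
ev_perfect = explained_variance([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
ev_constant = explained_variance([2.5, 2.5, 2.5, 2.5], [1.0, 2.0, 3.0, 4.0])
```

Two of the four ratios fall outside [0.8, 1.2], so the clip fraction is 0.5; a critic that predicts a constant explains none of the target variance.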
Complete MAPPO Example
IPPOLoss
Independent PPO (de Witt et al. 2020) is the decentralised counterpart of MAPPO. Each agent has its own value function that conditions only on its local observation; there is no shared critic and no global state. The paper “Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?” demonstrates that IPPO is surprisingly competitive with MAPPO on many SMAC scenarios.
IPPOLoss is structurally identical to MAPPOLoss; the only difference is how the critic is constructed.
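The difference in critic construction comes down to what each agent's value function receives as input. The sketch below shows only that input layout, using plain lists; `decentralised_inputs` and `centralised_inputs` are hypothetical helpers, and concatenating observations is one of the two centralised options mentioned earlier (the other being a dedicated global state).

```python
def decentralised_inputs(observations):
    """IPPO-style: each agent's critic sees only its own observation."""
    return [list(obs) for obs in observations]


def centralised_inputs(observations):
    """MAPPO-style: every agent's critic sees all observations concatenated."""
    joint = [x for obs in observations for x in obs]
    return [list(joint) for _ in observations]


# n_agents = 3, obs_dim = 2
obs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

ippo_in = decentralised_inputs(obs)   # per-agent input size: obs_dim
mappo_in = centralised_inputs(obs)    # per-agent input size: n_agents * obs_dim

print(len(ippo_in[0]), len(mappo_in[0]))  # 2 6
```

Both layouts still produce one value per agent, which is why the rest of the loss can stay identical.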
When to Use MAPPO vs. IPPO
| | MAPPO | IPPO |
|---|---|---|
| Critic | Centralised (full team state) | Per-agent (local obs only) |
| Requires global state | Yes (or concatenated obs) | No |
| Typical advantage | Higher (more information) | Slightly lower |
| Training complexity | Higher | Lower |
| SMAC performance | Competitive | Competitive |
| Competitive MARL | Not directly applicable | More natural |
QMixerLoss
QMIX (Rashid et al. 2018) and VDN (Sunehag et al. 2017) are value-decomposition methods for cooperative MARL. Each agent maintains a local Q-function, and a mixer network combines them into a global Q-value used for DQN-style updates. The mixer enforces the Individual-Global-Max (IGM) consistency constraint so that the joint greedy policy decomposes into independent per-agent greedy policies.
QMixerLoss takes a QValueActor (local Q-networks) and a TensorDictModule mixer, then applies the standard DQN objective on the global mixed Q-value.
Constructor Parameters
Local Q-value actor. Outputs ("agents", "action_value") of shape [*B, n_agents, n_actions] and ("agents", "chosen_action_value") of shape [*B, n_agents, 1].
Mixing network. Reads ("agents", "chosen_action_value") (and optionally a global "state" key) and writes the global "chosen_action_value" of shape [*B, 1]. Use QMixer from torchrl.modules.models.multiagent for the standard monotonic QMIX architecture, or wrap a simple sum to get VDN.
Loss function for the global Q-value Bellman regression.
If True, creates separate target value networks for computing Bellman targets with a frozen network.
Discrete action space type. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec.
Input Keys (via set_keys)
| Key | Default | Description |
|---|---|---|
| local_value | ("agents", "chosen_action_value") | Per-agent chosen Q-values |
| global_value | "chosen_action_value" | Mixed global Q-value |
| action | ("agents", "action") | Per-agent actions |
| priority | "td_error" | Priority key for replay buffer |
Output Keys
| Key | Description |
|---|---|
| loss | Bellman regression loss on the global mixed Q-value |
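The Bellman regression on the mixed value can be sketched as a one-step TD loss. This is a minimal illustration, not the loss module's implementation: `qmix_td_loss` is a hypothetical helper, the inputs are flat lists of already-mixed global Q-values, and mean squared error is assumed even though the loss function is configurable.

```python
def qmix_td_loss(q_tot, next_q_tot_target, reward, done, gamma=0.99):
    """Mean squared one-step TD error on the global mixed Q-value.

    q_tot:             mixed Q-value of the actions actually taken
    next_q_tot_target: mixed greedy Q-value at the next step (target network)
    """
    errors = []
    for q, q_next, r, d in zip(q_tot, next_q_tot_target, reward, done):
        # No bootstrapping past a terminal transition.
        target = r + gamma * (0.0 if d else q_next)
        errors.append((q - target) ** 2)
    return sum(errors) / len(errors)


loss = qmix_td_loss(
    q_tot=[1.0, 2.0],
    next_q_tot_target=[3.0, 5.0],
    reward=[0.5, 1.0],
    done=[False, True],
    gamma=0.9,
)
```

Because the error is computed on the mixed value only, gradients reach the local Q-networks through the mixer rather than through per-agent targets.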
Complete QMIX Example
QMIX vs VDN
- QMIX: The mixer is a hypernetwork conditioned on the global state that produces non-negative weights for a monotonic combination of local Q-values. Suitable when the global state is available and agents have complex dependencies.
- VDN: The mixer is a simple sum of the local Q-values.
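The effect of non-negative mixing weights can be checked on a toy problem. The sketch below is a deliberately simplified stand-in: `monotonic_mix` is a hypothetical single linear layer with fixed raw weights, whereas the actual QMIX mixer generates its weights from the global state with hypernetworks. It verifies that per-agent greedy actions coincide with the joint argmax of the mixed value.

```python
from itertools import product


def vdn_mix(chosen_q):
    """VDN: the global value is the plain sum of local Q-values."""
    return sum(chosen_q)


def monotonic_mix(chosen_q, raw_weights, bias):
    """Simplified QMIX-style mixer (hypothetical, single linear layer).

    abs() keeps every weight non-negative, so the output never decreases
    when any local Q-value increases.
    """
    return sum(abs(w) * q for w, q in zip(raw_weights, chosen_q)) + bias


# Local Q-tables: agent 0 has 2 actions, agent 1 has 3 actions.
local_q = [[1.0, 3.0], [2.0, 0.5, 1.5]]
raw_weights, bias = [-0.7, 2.0], 0.3

# Each agent independently picks its own greedy action.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in local_q)


def joint_value(actions, mixer):
    return mixer([local_q[i][a] for i, a in enumerate(actions)])


all_joint = list(product(*(range(len(q)) for q in local_q)))
best_qmix = max(all_joint, key=lambda acts: joint_value(
    acts, lambda qs: monotonic_mix(qs, raw_weights, bias)))
best_vdn = max(all_joint, key=lambda acts: joint_value(acts, vdn_mix))

print(greedy, best_qmix, best_vdn)  # (1, 0) (1, 0) (1, 0)
```

Dropping the `abs()` would let the negative raw weight flip agent 0's preference in the joint argmax, which is exactly the inconsistency the monotonicity constraint rules out.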
MultiAgentGAE
MultiAgentGAE is the multi-agent extension of GAE for settings where the value network produces per-agent estimates [*B, T, n_agents, 1] but the reward/done signals are team-shared [*B, T, 1]. It automatically broadcasts shared signals to the agent dimension before running the standard GAE recursion.
The dimension holding the agent index in the value tensor. Negative dimensions are interpreted modulo value.ndim. Defaults to -2 (penultimate), consistent with MultiAgentMLP’s output layout.
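The combination of broadcasting and the GAE recursion can be sketched for a single trajectory. This is an illustrative approximation under stated assumptions: `multi_agent_gae` is a hypothetical helper, values are nested lists of shape [T, n_agents], reward and done are team-shared lists of length T, and a single done flag stands in for both termination and truncation.

```python
def multi_agent_gae(value, next_value, reward, done, gamma=0.99, lmbda=0.95):
    """Per-agent GAE with team-shared reward and done signals.

    value, next_value: [T][n_agents] per-agent value estimates
    reward, done:      [T] team-shared signals, reused for every agent
    """
    T, n_agents = len(value), len(value[0])
    advantage = [[0.0] * n_agents for _ in range(T)]
    for a in range(n_agents):
        running = 0.0
        for t in reversed(range(T)):
            not_done = 0.0 if done[t] else 1.0
            # The shared reward[t] and done[t] are applied to agent a.
            delta = reward[t] + gamma * not_done * next_value[t][a] - value[t][a]
            running = delta + gamma * lmbda * not_done * running
            advantage[t][a] = running
    return advantage


adv = multi_agent_gae(
    value=[[0.0, 1.0], [0.0, 1.0]],
    next_value=[[0.0, 1.0], [0.0, 0.0]],
    reward=[1.0, 1.0],
    done=[False, True],
    gamma=1.0,
    lmbda=1.0,
)
print(adv)  # [[2.0, 1.0], [1.0, 0.0]]
```

Both agents receive the same rewards, yet their advantages differ because each is measured against that agent's own value estimate.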
Connecting Multi-Agent Objectives to Environments
TorchRL’s multi-agent environments (VmasEnv, PettingZooEnv, etc.) automatically use the nested ("agents", ...) key convention, so the collector’s output can be passed directly to the loss module after advantage estimation.