Multi-Agent RL Objectives: MAPPO, IPPO, and QMIX in TorchRL

Multi-agent reinforcement learning (MARL) requires objectives that handle multiple actors simultaneously, potentially sharing or mixing their value estimates. TorchRL provides MAPPOLoss, IPPOLoss, and QMixerLoss, all of which follow the standard multi-agent data convention of nesting per-agent tensors under a group key (typically "agents"). These objectives compose cleanly with the same LossModule interface used by single-agent algorithms.

Multi-Agent Data Convention

All TorchRL multi-agent objectives expect per-agent data to be nested under a group key inside the TensorDict. For an environment with n_agents agents:

import torch
from tensordict import TensorDict

# Illustrative sizes; any positive integers work
batch, T, n_agents, obs_dim, act_dim = 8, 16, 3, 6, 2

# Shape notation: [*B, T, n_agents, feature_dim]
tensordict = TensorDict({
    ("agents", "observation"): torch.randn(batch, T, n_agents, obs_dim),
    ("agents", "action"):      torch.randn(batch, T, n_agents, act_dim),
    ("agents", "state_value"): torch.randn(batch, T, n_agents, 1),
    ("next", "reward"):        torch.randn(batch, T, 1),    # team-shared
    ("next", "done"):          torch.zeros(batch, T, 1, dtype=torch.bool),
    ("next", "terminated"):    torch.zeros(batch, T, 1, dtype=torch.bool),
}, batch_size=[batch, T])

Team-shared signals (reward, done, terminated) have shape [*B, T, 1] — they are not duplicated along the agent dimension. MultiAgentGAE automatically broadcasts them to [*B, T, n_agents, 1] before computing advantages.
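
As a rough illustration of that broadcasting step, the sketch below expands a team-shared reward to the per-agent layout with plain PyTorch. It does not call MultiAgentGAE; the helper name broadcast_to_agents is hypothetical and only mirrors the shape behaviour described above.

```python
import torch


def broadcast_to_agents(shared: torch.Tensor, n_agents: int) -> torch.Tensor:
    """Expand a team-shared tensor [*B, T, 1] to [*B, T, n_agents, 1].

    Hypothetical helper for illustration; not part of TorchRL.
    """
    # Insert the agent dimension at position -2, then expand it (no copy is made).
    return shared.unsqueeze(-2).expand(*shared.shape[:-1], n_agents, shared.shape[-1])


batch, T, n_agents = 4, 5, 3
reward = torch.randn(batch, T, 1)                 # team-shared reward
per_agent_reward = broadcast_to_agents(reward, n_agents)
print(per_agent_reward.shape)                     # torch.Size([4, 5, 3, 1])
```

Every agent slice holds the same values as the shared tensor, which is why the shared signal does not need to be stored per agent.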

For competitive settings where agents receive individual rewards, use per-agent reward keys: ("next", "agents", "reward") of shape [*B, T, n_agents, 1].

MAPPOLoss

Multi-Agent PPO (Yu et al. 2022, NeurIPS) pairs a decentralised actor (each agent’s policy sees only its own observation) with a centralised critic (a single value function that conditions on the full team state or concatenated observations). The decentralised execution at test time, combined with centralised training, is the defining CTDE (Centralized Training, Decentralized Execution) paradigm. MAPPOLoss is a thin specialisation of ClipPPOLoss with three differences:

The default value estimator is MultiAgentGAE instead of GAE.
normalize_advantage_exclude_dims defaults to (-2,) so the agent dimension is excluded from advantage standardization.
An optional ValueNorm (PopArt or running-mean normalization) can be attached to stabilize critic loss when reward scales drift.
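
To make the second difference concrete, here is a minimal sketch of standardizing advantages while excluding the agent dimension from the reduction, so each agent gets its own mean and standard deviation. The function standardize_excluding is a hypothetical stand-in written in plain PyTorch; the exact reduction and epsilon handling inside the loss module may differ.

```python
import torch


def standardize_excluding(adv: torch.Tensor, exclude_dims=(-2,), eps: float = 1e-6) -> torch.Tensor:
    """Standardize `adv`, reducing over every dimension except `exclude_dims`.

    Hypothetical helper for illustration; not TorchRL's implementation.
    """
    excluded = {d % adv.ndim for d in exclude_dims}
    reduce_dims = tuple(d for d in range(adv.ndim) if d not in excluded)
    mean = adv.mean(dim=reduce_dims, keepdim=True)
    std = adv.std(dim=reduce_dims, keepdim=True)
    return (adv - mean) / (std + eps)


torch.manual_seed(0)
# Three agents whose advantages live on very different scales.
scales = torch.tensor([1.0, 10.0, 100.0]).view(1, 1, 3, 1)
adv = torch.randn(8, 16, 3, 1) * scales
out = standardize_excluding(adv)       # statistics have shape [1, 1, 3, 1]
print(out.std(dim=(0, 1, 3)))          # each agent ends up with unit scale
```

Without the exclusion, the agent with the largest advantage scale would dominate the shared statistics and the other agents' advantages would be squashed towards zero.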

from torchrl.objectives.multiagent import MAPPOLoss
from torchrl.modules import PopArtValueNorm

loss_module = MAPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_coeff=0.01,
    critic_coeff=1.0,
    normalize_advantage=True,
    value_norm=PopArtValueNorm(shape=1),
)
loss_module.set_keys(
    value=("agents", "state_value"),
    action=("agents", "action"),
)

Constructor Parameters

actor_network (ProbabilisticTensorDictSequential, required)

Per-agent decentralised policy. Build with MultiAgentMLP(centralized=False, share_params=True) for cooperative homogeneous teams. Reads ("agents", "observation") and writes ("agents", "action").

critic_network (TensorDictModule, required)

Centralised value operator. Build with MultiAgentMLP(centralized=True, share_params=True) so it conditions on all agents’ observations and returns ("agents", "state_value") of shape [*B, n_agents, 1].

value_norm (ValueNorm | None, default: None)

Optional running normalizer for the critic target and prediction. When provided, the target and prediction are normalised before the MSE / smooth-L1 distance, stabilising training on tasks with drifting reward scales. The MAPPO paper (Yu et al., Table 13) reports that this is load-bearing on SMAC.

Supported types:

PopArtValueNorm: exponential moving-average normalization with parameter rescaling (recommended for SMAC and other sparse-reward tasks).
RunningValueNorm: simple mean-variance normalization without parameter rescaling (for stationary reward scales).

clip_epsilon (float, default: 0.2)

PPO importance-weight clip threshold. Inherited from ClipPPOLoss.

entropy_coeff (float, default: 0.01)

Entropy bonus weight. Defaults to 0.01 (the MAPPO default), the same value ClipPPOLoss uses.

normalize_advantage (bool, default: True)

Whether to standardise advantages before use. Defaults to True (MAPPO default), unlike the parent ClipPPOLoss which defaults to False.

normalize_advantage_exclude_dims (tuple[int, ...], default: (-2,))

Dimensions excluded from advantage standardization. Defaults to (-2,) to exclude the agent dimension so each agent’s advantages are normalized independently.
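
The value_norm idea can be sketched without any library support. The class below is a hypothetical running mean/variance normaliser built on exponential moving averages; it is not TorchRL's ValueNorm, and it omits the parameter rescaling that the PopArt variant is described as performing. The names RunningNorm, beta, and eps are assumptions made for this example.

```python
import torch


class RunningNorm:
    """Hypothetical EMA-based value normaliser (illustration only)."""

    def __init__(self, beta: float = 0.99, eps: float = 1e-5):
        self.beta = beta
        self.eps = eps
        self.mean = torch.zeros(())      # running first moment
        self.mean_sq = torch.ones(())    # running second moment

    @property
    def std(self) -> torch.Tensor:
        # Var[x] = E[x^2] - E[x]^2, floored to avoid division by zero.
        return (self.mean_sq - self.mean ** 2).clamp_min(self.eps).sqrt()

    def update(self, target: torch.Tensor) -> None:
        target = target.detach()
        self.mean = self.beta * self.mean + (1 - self.beta) * target.mean()
        self.mean_sq = self.beta * self.mean_sq + (1 - self.beta) * (target ** 2).mean()

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        return (x - self.mean) / self.std

    def denormalize(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.std + self.mean


torch.manual_seed(0)
norm = RunningNorm()
for _ in range(1000):
    # Value targets with a large offset and scale, as on tasks with big returns.
    norm.update(50.0 + 10.0 * torch.randn(256, 1))

targets = 50.0 + 10.0 * torch.randn(256, 1)
normalized_targets = norm.normalize(targets)   # roughly zero-mean, unit-scale
```

The critic is then regressed against normalized_targets, and denormalize maps its predictions back to the original return scale when advantages are computed.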

Output Keys

MAPPOLoss returns the same keys as ClipPPOLoss:

Key	Description
`loss_objective`	Clipped surrogate PPO objective
`loss_critic`	Critic MSE / smooth-L1 loss
`loss_entropy`	Entropy bonus
`entropy`	Mean policy entropy across agents
`kl_approx`	Approximate KL divergence
`clip_fraction`	Fraction of clipped importance weights
`explained_variance`	R² of critic predictions vs. value targets
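
For readers who want to see what these quantities mean numerically, the following sketch computes a clipped surrogate, a clip fraction, a simple KL estimate, and an explained-variance score in plain PyTorch. These are common textbook formulations, written under the assumption that log-probabilities and advantages share the same shape; the exact estimators used inside the loss module may differ, and both function names are hypothetical.

```python
import torch


def clipped_surrogate(log_prob, old_log_prob, advantage, clip_epsilon=0.2):
    """Return (loss_objective, clip_fraction, kl_approx) for one batch."""
    log_ratio = log_prob - old_log_prob
    ratio = log_ratio.exp()                                   # importance weight
    unclipped = ratio * advantage
    clipped = ratio.clamp(1 - clip_epsilon, 1 + clip_epsilon) * advantage
    loss_objective = -torch.min(unclipped, clipped).mean()    # pessimistic bound
    clip_fraction = ((ratio - 1).abs() > clip_epsilon).float().mean()
    kl_approx = (-log_ratio).mean()                           # simple sample-based estimate
    return loss_objective, clip_fraction, kl_approx


def explained_variance(pred, target):
    """1 - Var[target - pred] / Var[target]; equals 1 for a perfect critic."""
    return 1 - (target - pred).var() / target.var()


torch.manual_seed(0)
shape = (4, 5, 3, 1)                      # [batch, T, n_agents, 1]
old_lp = torch.randn(shape)
new_lp = old_lp + 0.1 * torch.randn(shape)
adv = torch.randn(shape)
loss_objective, clip_fraction, kl_approx = clipped_surrogate(new_lp, old_lp, adv)
```

When the new and old policies coincide, the ratio is 1 everywhere, so the objective reduces to the negative mean advantage and both diagnostics are zero.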

Complete MAPPO Example

import torch
from tensordict.nn import TensorDictModule
from torchrl.modules import MultiAgentMLP, PopArtValueNorm, ProbabilisticActor
from torchrl.modules.distributions import NormalParamExtractor, TanhNormal
from torchrl.objectives.multiagent import MAPPOLoss
from torchrl.objectives.value import MultiAgentGAE

n_agents, obs_dim, action_dim = 3, 6, 2

# Decentralised actor (each agent sees its own obs_dim-dimensional observation)
actor_net = torch.nn.Sequential(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=2 * action_dim,
        n_agents=n_agents,
        centralized=False,
        share_params=True,
    ),
    NormalParamExtractor(),
)
actor_module = TensorDictModule(
    actor_net,
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "loc"), ("agents", "scale")],
)
actor = ProbabilisticActor(
    module=actor_module,
    in_keys=[("agents", "loc"), ("agents", "scale")],
    out_keys=[("agents", "action")],
    distribution_class=TanhNormal,
    return_log_prob=True,
)

# Centralised critic (conditions on all agents' observations)
critic = TensorDictModule(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=1,
        n_agents=n_agents,
        centralized=True,
        share_params=True,
    ),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)

# MAPPO loss
loss_module = MAPPOLoss(
    actor_network=actor,
    critic_network=critic,
    clip_epsilon=0.2,
    entropy_coeff=0.01,
    value_norm=PopArtValueNorm(shape=1),
)
loss_module.set_keys(
    value=("agents", "state_value"),
    action=("agents", "action"),
)

# Multi-agent GAE for advantage estimation
gae = MultiAgentGAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    average_gae=False,
)

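# Training step. `data` is a TensorDict batch (for example from a collector)
# laid out as described in "Multi-Agent Data Convention"; it is not defined here.
gae(data)                         # writes ("agents", "advantage") etc.
loss_td = loss_module(data)
loss = loss_td["loss_objective"] + loss_td["loss_critic"] + loss_td["loss_entropy"]
loss.backward()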

IPPOLoss

Independent PPO (de Witt et al. 2020) is the decentralised counterpart of MAPPO. Each agent has its own value function that conditions only on its local observation — there is no shared critic and no global state. The paper “Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?” demonstrates that IPPO is surprisingly competitive with MAPPO on many SMAC scenarios. IPPOLoss is structurally identical to MAPPOLoss; the only difference is the critic construction:

from torchrl.objectives.multiagent import IPPOLoss

# Per-agent (decentralised) critic
per_agent_critic = TensorDictModule(
    MultiAgentMLP(
        n_agent_inputs=obs_dim,
        n_agent_outputs=1,
        n_agents=n_agents,
        centralized=False,   # <-- key difference vs. MAPPO
        share_params=True,
    ),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "state_value")],
)

loss_module = IPPOLoss(actor_network=actor, critic_network=per_agent_critic)
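
The difference between the two critics comes down to what each value head is allowed to see. The sketch below uses plain torch.nn.Linear layers as hypothetical stand-ins (it does not use MultiAgentMLP) to show that a decentralised critic's estimate for one agent ignores the other agents' observations, while a centralised critic's estimate depends on all of them.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, n_agents, obs_dim = 5, 3, 6
obs = torch.randn(batch, n_agents, obs_dim)

# Decentralised (IPPO-style): one shared network applied to each agent's own observation.
local_critic = nn.Linear(obs_dim, 1)
local_values = local_critic(obs)                                  # [batch, n_agents, 1]

# Centralised (MAPPO-style): the input is the concatenation of all agents' observations,
# and the head emits one value per agent.
central_critic = nn.Linear(n_agents * obs_dim, n_agents)
central_values = central_critic(obs.flatten(-2)).unsqueeze(-1)    # [batch, n_agents, 1]

# Perturb agent 1's observation and look at agent 0's value estimate.
perturbed = obs.clone()
perturbed[:, 1] += 1.0
local_delta = (local_critic(perturbed)[:, 0] - local_values[:, 0]).abs().max()
central_delta = (
    central_critic(perturbed.flatten(-2)).unsqueeze(-1)[:, 0] - central_values[:, 0]
).abs().max()
print(local_delta.item(), central_delta.item())   # first is ~0, second is not
```

Both critics produce the same output layout, which is why the loss modules can share the rest of the pipeline.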

When to Use MAPPO vs. IPPO

	MAPPO	IPPO
Critic	Centralised (full team state)	Per-agent (local obs only)
Requires global state	Yes (or concatenated obs)	No
Typical value-estimate quality	Higher (more information)	Slightly lower
Training complexity	Higher	Lower
SMAC performance	Competitive	Competitive
Competitive MARL	Not directly applicable	More natural

Start with IPPOLoss when you don’t have access to a global state key. The per-agent MultiAgentGAE handles the shared reward broadcasting automatically without requiring any additional inputs.

QMixerLoss

QMIX (Rashid et al. 2018) and VDN (Sunehag et al. 2017) are value-decomposition methods for cooperative MARL. Each agent maintains a local Q-function, and a mixer network combines them into a global Q-value used for DQN-style updates. The mixer enforces the Individual-Global-Max (IGM) consistency constraint so that the joint greedy policy decomposes into independent per-agent greedy policies. QMixerLoss takes a QValueActor (local Q-networks) and a TensorDictModule mixer, then applies the standard DQN objective on the global mixed Q-value.

from torchrl.objectives.multiagent import QMixerLoss
from torchrl.modules.models.multiagent import QMixer

loss_module = QMixerLoss(
    local_value_network=qnet,
    mixer_network=qmixer,
    loss_function="l2",
    delay_value=True,
    action_space="categorical",
)
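
The "standard DQN objective on the global mixed Q-value" can be illustrated in a few lines. This is a minimal sketch under the assumption of an L2 loss and a one-step target; mixed_td_loss and its argument names are hypothetical, and the loss module's handling of target networks and done flags is more involved than shown here.

```python
import torch
import torch.nn.functional as F


def mixed_td_loss(q_tot, next_q_tot_target, reward, terminated, gamma=0.99):
    """One-step TD regression on the mixed (global) Q-value.

    q_tot:             [B, 1] mixed Q-value of the actions taken
    next_q_tot_target: [B, 1] mixed Q-value of the greedy next actions (target nets)
    reward:            [B, 1] team reward
    terminated:        [B, 1] bool, True where no bootstrapping should occur
    """
    not_terminated = (~terminated).float()
    target = reward + gamma * not_terminated * next_q_tot_target
    return F.mse_loss(q_tot, target.detach())


q_tot = torch.tensor([[2.0], [2.0]])
next_q = torch.tensor([[2.0], [2.0]])
reward = torch.tensor([[1.0], [1.0]])
terminated = torch.tensor([[False], [True]])
loss_value = mixed_td_loss(q_tot, next_q, reward, terminated, gamma=0.5)
print(loss_value)   # tensor(0.5000)
```

In the first row the target is 1 + 0.5 * 2 = 2, which matches q_tot; in the terminated row the target is just the reward, giving a squared error of 1 and a batch mean of 0.5.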

Constructor Parameters

local_value_network (QValueActor | nn.Module, required)

Local Q-value actor. Outputs ("agents", "action_value") of shape [*B, n_agents, n_actions] and ("agents", "chosen_action_value") of shape [*B, n_agents, 1].

mixer_network (TensorDictModule | nn.Module, required)

Mixing network. Reads ("agents", "chosen_action_value") (and optionally a global "state" key) and writes the global "chosen_action_value" of shape [*B, 1]. Use QMixer from torchrl.modules.models.multiagent for the standard monotonic QMIX architecture, or wrap a simple sum to get VDN.

loss_function (str, default: "l2")

Loss function for the global Q-value Bellman regression.

delay_value (bool, default: True)

If True, creates separate target value networks for computing Bellman targets with a frozen network.

action_space (str | TensorSpec, default: None)

Discrete action space type. Must be one of "one-hot", "mult_one_hot", "binary", "categorical", or an equivalent TorchRL spec.
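
To show what a "frozen" target network means for delay_value in practice, here is a generic Polyak (soft) update written in plain PyTorch. It is a sketch of the general technique only: TorchRL provides its own target-update utilities, and the function soft_update and the value of tau below are assumptions made for this example.

```python
import copy

import torch
from torch import nn


def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005) -> None:
    """Move each target parameter a fraction `tau` towards the online parameter."""
    with torch.no_grad():
        for p_target, p_online in zip(target.parameters(), online.parameters()):
            p_target.mul_(1 - tau).add_(p_online, alpha=tau)


torch.manual_seed(0)
online = nn.Linear(10, 3)
target = copy.deepcopy(online)          # starts as an exact copy
for p in target.parameters():
    p.requires_grad_(False)             # the target is never trained directly

# Pretend an optimizer step changed the online network.
with torch.no_grad():
    online.weight.add_(1.0)

before = target.weight.clone()
soft_update(target, online, tau=0.5)    # target moves halfway towards online
```

Bellman targets are computed with the slowly moving copy, which keeps the regression target from shifting on every gradient step.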

Input Keys (via `set_keys`)

Key	Default	Description
`local_value`	`("agents", "chosen_action_value")`	Per-agent chosen Q-values
`global_value`	`"chosen_action_value"`	Mixed global Q-value
`action`	`("agents", "action")`	Per-agent actions
`priority`	`"td_error"`	Priority key for replay buffer

Output Keys

Key	Description
`loss`	Bellman regression loss on the global mixed Q-value

Complete QMIX Example

import torch
from torch import nn
from tensordict import TensorDict
from tensordict.nn import TensorDictModule
from torchrl.modules import QValueModule, SafeSequential
from torchrl.modules.models.multiagent import QMixer
from torchrl.objectives.multiagent import QMixerLoss

n_agents, obs_dim, n_actions = 4, 10, 3
state_shape = (64, 64, 3)

# Per-agent Q-network
q_module = TensorDictModule(
    nn.Linear(obs_dim, n_actions),
    in_keys=[("agents", "observation")],
    out_keys=[("agents", "action_value")],
)
value_module = QValueModule(
    action_value_key=("agents", "action_value"),
    out_keys=[
        ("agents", "action"),
        ("agents", "action_value"),
        ("agents", "chosen_action_value"),
    ],
    action_space="categorical",
)
qnet = SafeSequential(q_module, value_module)

# QMIX mixer
qmixer = TensorDictModule(
    module=QMixer(
        state_shape=state_shape,
        mixing_embed_dim=32,
        n_agents=n_agents,
        device="cpu",
    ),
    in_keys=[("agents", "chosen_action_value"), "state"],
    out_keys=["chosen_action_value"],
)

loss = QMixerLoss(qnet, qmixer, action_space="categorical")

# Data with agent dimension and global state
td = TensorDict({
    "agents": TensorDict(
        {"observation": torch.zeros(32, n_agents, obs_dim)}, [32, n_agents]
    ),
    "state": torch.zeros(32, *state_shape),
    "next": TensorDict({
        "agents": TensorDict(
            {"observation": torch.zeros(32, n_agents, obs_dim)}, [32, n_agents]
        ),
        "state": torch.zeros(32, *state_shape),
        "reward": torch.zeros(32, 1),
        "done": torch.zeros(32, 1, dtype=torch.bool),
        "terminated": torch.zeros(32, 1, dtype=torch.bool),
    }, [32]),
}, [32])

loss_td = loss(qnet(td))
print(loss_td["loss"])

QMIX vs VDN

QMIX

The mixer is a hypernetwork conditioned on the global state that produces non-negative weights for a monotonic combination of local Q-values. Suitable when the global state is available and agents have complex dependencies.

qmixer = TensorDictModule(
    QMixer(state_shape=state_shape, mixing_embed_dim=32, n_agents=n_agents),
    in_keys=[("agents", "chosen_action_value"), "state"],
    out_keys=["chosen_action_value"],
)
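
The monotonicity property can be seen in a stripped-down mixer. The class below is a hypothetical single-layer simplification, not torchrl's QMixer (the standard QMIX architecture uses two mixing layers with a nonlinearity in between); it assumes a flat state vector of size state_dim and only demonstrates how taking the absolute value of hypernetwork outputs yields non-negative mixing weights.

```python
import torch
from torch import nn


class TinyMonotonicMixer(nn.Module):
    """Hypothetical one-layer QMIX-style mixer (illustration only)."""

    def __init__(self, n_agents: int, state_dim: int):
        super().__init__()
        self.hyper_w = nn.Linear(state_dim, n_agents)   # state -> mixing weights
        self.hyper_b = nn.Linear(state_dim, 1)          # state -> bias

    def forward(self, chosen_q: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # chosen_q: [B, n_agents, 1], state: [B, state_dim]
        w = self.hyper_w(state).abs()                   # [B, n_agents], non-negative
        b = self.hyper_b(state)                         # [B, 1], unconstrained
        return (w * chosen_q.squeeze(-1)).sum(dim=-1, keepdim=True) + b   # [B, 1]


torch.manual_seed(0)
mixer = TinyMonotonicMixer(n_agents=4, state_dim=8)
chosen_q = torch.randn(16, 4, 1)
state = torch.randn(16, 8)
q_tot = mixer(chosen_q, state)

# Raising one agent's Q-value can never lower the mixed value.
raised = chosen_q.clone()
raised[:, 2] += 1.0
q_tot_raised = mixer(raised, state)
print(bool((q_tot_raised >= q_tot - 1e-6).all()))   # True
```

Because every partial derivative of the mixed value with respect to a local Q-value is non-negative, each agent maximising its own Q-value also maximises the mixed value.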

VDN

Value Decomposition Networks (VDN) sum local Q-values directly, without a learned mixer or global state. Much simpler but assumes additive reward decomposition.

# VDN: sum the per-agent Q-values
vdn_mixer = TensorDictModule(
    lambda chosen: chosen.sum(dim=-2),  # sum over agent dim
    in_keys=[("agents", "chosen_action_value")],
    out_keys=["chosen_action_value"],
)
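
A small brute-force check makes the IGM property of the sum mixer tangible: when the global value is the sum of local Q-values, the joint action that maximises the sum is exactly the tuple of per-agent greedy actions. The sketch uses random Q-tables and plain PyTorch; nothing here depends on TorchRL.

```python
import itertools

import torch

torch.manual_seed(0)
n_agents, n_actions = 3, 4
q = torch.randn(n_agents, n_actions)        # one row of Q-values per agent

# Decentralised greedy choice: each agent maximises its own Q-values.
greedy = tuple(q.argmax(dim=-1).tolist())


def joint_value(joint):
    # VDN global value of a joint action: sum of the chosen local Q-values.
    return sum(q[i, a].item() for i, a in enumerate(joint))


# Centralised choice: enumerate all n_actions ** n_agents joint actions.
best_joint = max(itertools.product(range(n_actions), repeat=n_agents), key=joint_value)

print(greedy == best_joint)                 # True
```

The exhaustive search grows exponentially with the number of agents, whereas the per-agent argmax is linear; that gap is what value decomposition buys at execution time.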

MultiAgentGAE

MultiAgentGAE is the multi-agent extension of GAE for settings where the value network produces per-agent estimates [*B, T, n_agents, 1] but the reward/done signals are team-shared [*B, T, 1]. It automatically broadcasts shared signals to the agent dimension before running the standard GAE recursion.

from torchrl.objectives.value import MultiAgentGAE

gae = MultiAgentGAE(
    gamma=0.99,
    lmbda=0.95,
    value_network=critic,
    average_gae=False,
    agent_dim=-2,   # default: penultimate dimension holds agents
)
gae(data)
# writes ("agents", "advantage") of shape [*B, T, n_agents, 1]
# writes ("agents", "value_target") of shape [*B, T, n_agents, 1]

agent_dim (int, default: -2)

The dimension holding the agent index in the value tensor. Negative dimensions are interpreted modulo value.ndim. Defaults to -2 (penultimate), consistent with MultiAgentMLP’s output layout.
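
The broadcast-then-recurse behaviour can be written out by hand for a single trajectory. The function below is a simplified sketch, not MultiAgentGAE itself: it has no batch dimensions, treats done as both the end of bootstrapping and the end of the trace (TorchRL distinguishes done from terminated), and takes precomputed value tensors instead of a value network. The name gae_shared_reward is hypothetical.

```python
import torch


def gae_shared_reward(reward, done, value, next_value, gamma=0.99, lmbda=0.95):
    """GAE with a team-shared reward.

    reward, done:      [T, 1]            (shared by all agents)
    value, next_value: [T, n_agents, 1]  (per-agent critic outputs)
    Returns (advantage, value_target), both [T, n_agents, 1].
    """
    n_agents = value.shape[-2]
    # Broadcast the shared signals along the agent dimension (-2).
    r = reward.unsqueeze(-2).expand(-1, n_agents, -1)
    not_done = (~done).float().unsqueeze(-2).expand(-1, n_agents, -1)

    delta = r + gamma * not_done * next_value - value      # one-step TD errors
    advantage = torch.zeros_like(value)
    running = torch.zeros_like(value[0])
    for t in reversed(range(value.shape[0])):              # backward recursion
        running = delta[t] + gamma * lmbda * not_done[t] * running
        advantage[t] = running
    return advantage, advantage + value


T, n_agents = 3, 2
reward = torch.ones(T, 1)
done = torch.zeros(T, 1, dtype=torch.bool)
value = torch.zeros(T, n_agents, 1)
next_value = torch.zeros(T, n_agents, 1)
advantage, value_target = gae_shared_reward(reward, done, value, next_value, gamma=1.0, lmbda=1.0)
print(advantage[:, 0, 0])   # tensor([3., 2., 1.])
```

With zero values and gamma = lmbda = 1, the advantage at each step is simply the remaining reward sum, and it is identical for both agents because the reward is shared.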

See the Value Estimators reference for the full MultiAgentGAE API.

Connecting Multi-Agent Objectives to Environments

TorchRL’s multi-agent environments (VmasEnv, PettingZooEnv, etc.) automatically use the nested ("agents", ...) key convention. The collector’s output can be passed directly to the loss module after advantage estimation:

from torchrl.envs.libs.vmas import VmasEnv
from torchrl.collectors import SyncDataCollector

env = VmasEnv("simple_spread", num_envs=32, n_agents=3)
collector = SyncDataCollector(env, actor, frames_per_batch=2000)

for data in collector:
    # data["agents"]["observation"]: [32, T, 3, obs_dim]
    gae(data)
    loss_td = loss_module(data)
    ...

If your environment returns per-agent rewards under ("next", "agents", "reward") (competitive / mixed settings), you must configure MultiAgentGAE accordingly and pass the correct key via loss_module.set_keys(reward=("agents", "reward")).
