Training Overview

Foundation environments support two training frameworks: RLlib for distributed CPU-based training and WarpDrive for massively parallel GPU-accelerated training. Both frameworks work with the same Foundation environment APIs and the same hierarchical agent setup.

Hierarchical agent setup

Foundation uses a two-level multi-agent structure:

Workers (agents "0" through "n-1") — mobile economic actors that gather resources, trade, and build. They optimize post-tax utility.
Social planner (agent "p") — a government-like agent that sets tax rates or policy interventions. It optimizes a social welfare objective.

Both roles are trained simultaneously using PPO. Because the workers and the planner have different observation and action spaces, they are assigned separate policies in the multi-agent configuration: policy "a" for all workers and policy "p" for the planner.

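# Policy mapping used in both RLlib and WarpDrive training.
# env_wrapper wraps the Foundation environment; env_wrapper.env.n_agents is the number of workers.
policy_tag_to_agent_id_map = {
    # All workers ("0" through "n-1") share the single agent policy "a".
    "a": [str(agent_id) for agent_id in range(env_wrapper.env.n_agents)],
    # The social planner has its own policy "p".
    "p": ["p"],
}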
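
Multi-agent trainers often need the inverse lookup, from an agent ID to its policy tag. The following is a minimal sketch of that inversion. It assumes a hypothetical worker count `n_agents = 4` in place of `env_wrapper.env.n_agents`, and `policy_for` is an illustrative helper, not part of Foundation, RLlib, or WarpDrive.

```python
# Hypothetical stand-in for env_wrapper.env.n_agents.
n_agents = 4

policy_tag_to_agent_id_map = {
    "a": [str(agent_id) for agent_id in range(n_agents)],
    "p": ["p"],
}

# Invert the map so each agent ID resolves to exactly one policy tag.
agent_id_to_policy_tag = {
    agent_id: tag
    for tag, agent_ids in policy_tag_to_agent_id_map.items()
    for agent_id in agent_ids
}


def policy_for(agent_id):
    """Return the policy tag ("a" or "p") for an agent ID (illustrative helper)."""
    return agent_id_to_policy_tag[str(agent_id)]
```

With this mapping, every worker ID resolves to the shared policy "a", and only "p" resolves to the planner policy.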

Action modes

Each agent type can operate in one of two action modes, controlled by the environment configuration:

Parameter	Type	Description
`multi_action_mode_agents`	bool	Whether mobile agents use multi-action mode. When `True`, each action subspace is sampled independently (`MultiDiscrete`). When `False`, a single flattened action is used (`Discrete`).
`multi_action_mode_planner`	bool	Whether the planner uses multi-action mode. `True` and `False` have the same meaning as for `multi_action_mode_agents`.
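
The difference between the two modes can be sketched in plain Python. This is an illustration under stated assumptions, not Foundation's implementation: the subspace names and sizes are hypothetical, index 0 is assumed to mean NO-OP, and the flattened space is assumed to lay the subspaces end to end. The actual layout is defined by the environment.

```python
import random

# Hypothetical subspace sizes: number of non-NO-OP actions per component.
SUBSPACE_SIZES = {"Build": 1, "Gather": 4, "Trade": 44}


def sample_multi_action(sizes, rng):
    """Multi-action mode: one independent choice per subspace (index 0 = NO-OP)."""
    return {name: rng.randrange(n + 1) for name, n in sizes.items()}


def sample_single_action(sizes, rng):
    """Single-action mode: one choice over all subspaces laid end to end (index 0 = NO-OP)."""
    return rng.randrange(1 + sum(sizes.values()))


def decode_single_action(sizes, index):
    """Map a flat index back to (subspace name, local index), or None for NO-OP."""
    if index == 0:
        return None
    offset = 1
    for name, n in sizes.items():
        if index < offset + n:
            return name, index - offset + 1
        offset += n
    raise ValueError(f"index {index} is out of range")
```

In multi-action mode the agent can act in several subspaces in the same step. In single-action mode it picks at most one action overall, which gives a smaller, simpler policy output.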

Curriculum learning

Training is stabilized using a two-phase curriculum approach, as described in The AI Economist paper:

Phase one — agents only, no taxes

Train only the worker agents in a free market (taxes disabled via `disable_taxes: true` on the `PeriodicBracketTax` component). Labor costs are annealed from zero using the `energy_warmup_constant` and `energy_warmup_method` parameters so that agents learn to explore before facing full costs.
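
One plausible shape for such a warmup is an exponential ramp from 0 toward 1 that scales the labor cost. The sketch below assumes that form. The function name `energy_weight`, the meaning of `progress` (for example, completed episodes or accumulated steps), and the handling of a non-positive constant are all assumptions made for illustration, so check the environment source for the exact rule.

```python
import math


def energy_weight(progress, warmup_constant):
    """Scale factor in [0, 1] applied to labor cost (assumed exponential warmup)."""
    if warmup_constant <= 0:
        return 1.0  # Treat a non-positive constant as "no warmup".
    return 1.0 - math.exp(-progress / warmup_constant)
```

Under this assumption, a larger `energy_warmup_constant` keeps labor cheap for longer, and the weight reaches about 63% of the full cost once `progress` equals the constant.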

Phase two — agents and planner, with taxes

Resume from the phase-one agent checkpoint and begin training the planner. Tax rates are annealed via `tax_annealing_schedule`. High planner entropy regularization at the start (via `entropy_coeff_schedule`) exposes agents to a wide range of tax levels before the planner begins to optimize.
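
An entropy schedule of this kind is usually given as a list of `[timestep, value]` breakpoints. The sketch below assumes linear interpolation between breakpoints and a constant value after the last one. `piecewise_linear` is an illustrative helper rather than a framework function, and the breakpoints are the planner values from the example configuration on this page.

```python
def piecewise_linear(schedule, t):
    """Linearly interpolate a sorted list of [timestep, value] pairs at step t."""
    if t <= schedule[0][0]:
        return schedule[0][1]
    for (t0, v0), (t1, v1) in zip(schedule, schedule[1:]):
        if t <= t1:
            frac = (t - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    return schedule[-1][1]  # Hold the final value after the last breakpoint.


# Planner entropy coefficient: starts high, decays over 50M steps.
entropy_schedule = [[0, 0.5], [50_000_000, 0.05]]
```

Early in phase two the coefficient is close to 0.5, so the planner acts nearly at random and workers see many different tax levels. By the final breakpoint it has dropped to 0.05 and the planner's own objective dominates.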

Training configurations

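Configuration files drive all aspects of training: environment setup, trainer hyperparameters, and policy network architecture. Both backends use YAML configs. The example below configures the COVID-19 and economy environment, with one policy block for the agents (`a`) and one for the planner (`p`):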

name: "covid_and_economy_environment"
env:
    n_agents: 51
    episode_length: 540
    multi_action_mode_agents: False
    multi_action_mode_planner: False
    flatten_masks: True
    flatten_observations: False
trainer:
    num_envs: 60
    num_episodes: 1000
    train_batch_size: 5400
policy:
    a:
        to_train: True
        algorithm: "PPO"
        gamma: 0.98
        lr: 0.0001
        model:
            type: "fully_connected"
            fc_dims: [256, 256]
    p:
        to_train: True
        algorithm: "PPO"
        entropy_coeff:
        - [0, 0.5]
        - [50000000, 0.05]
        gamma: 0.98
        lr: 0.0001
        model:
            type: "fully_connected"
            fc_dims: [256, 256]
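
Before launching a long run, it can help to sanity-check a loaded config. The sketch below works on a Python dict that mirrors a subset of the example above. `check_config` and its rules are illustrative assumptions, not validation performed by RLlib or WarpDrive.

```python
# Subset of the example config, written as a Python dict for illustration.
config = {
    "env": {"n_agents": 51, "episode_length": 540},
    "trainer": {"num_envs": 60, "num_episodes": 1000, "train_batch_size": 5400},
    "policy": {
        "a": {"to_train": True, "algorithm": "PPO", "gamma": 0.98, "lr": 0.0001},
        "p": {
            "to_train": True,
            "algorithm": "PPO",
            "gamma": 0.98,
            "lr": 0.0001,
            "entropy_coeff": [[0, 0.5], [50000000, 0.05]],
        },
    },
}


def check_config(cfg):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for section in ("env", "trainer", "policy"):
        if section not in cfg:
            problems.append(f"missing section: {section}")
    for tag, policy in cfg.get("policy", {}).items():
        if not 0.0 < policy.get("gamma", 0.0) <= 1.0:
            problems.append(f"policy {tag}: gamma must be in (0, 1]")
        if policy.get("lr", 0.0) <= 0.0:
            problems.append(f"policy {tag}: lr must be positive")
        schedule = policy.get("entropy_coeff")
        if isinstance(schedule, list):
            steps = [step for step, _ in schedule]
            if steps != sorted(steps):
                problems.append(f"policy {tag}: entropy_coeff steps must be ascending")
    return problems
```

Calling `check_config(config)` on the dict above returns an empty list. A config with a missing section or an out-of-order schedule returns one message per problem.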

Choose a training framework

RLlib

Distributed multi-agent RL on CPU clusters using Ray. Supports the Gather-Trade-Build scenario and two-phase curriculum learning. Recommended when GPU hardware is unavailable or when running large distributed rollouts.

WarpDrive (GPU)

Massively parallel GPU-accelerated training using CUDA. Runs many environment copies simultaneously on a single GPU. Used for the COVID-19 and economic simulation at scale.
