Foundation environments support two training frameworks: RLlib for distributed CPU-based training and WarpDrive for massively parallel GPU-accelerated training. Both frameworks work with the same Foundation environment APIs and the same hierarchical agent setup.

Hierarchical agent setup

Foundation uses a two-level multi-agent structure:
  • Workers (agents "0" through "n-1") — mobile economic actors that gather resources, trade, and build. They optimize post-tax utility.
  • Social planner (agent "p") — a government-like agent that sets tax rates or policy interventions. It optimizes a social welfare objective.
Both roles are trained with PPO. Once the planner is introduced (see Curriculum learning), workers and planner train simultaneously. Because the agents and planner have different observation and action spaces, they are assigned separate policies in the multi-agent configuration.
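# Policy mapping used in both RLlib and WarpDrive training.
# env_wrapper wraps a Foundation environment (created elsewhere in the training
# script); env_wrapper.env.n_agents is the number of mobile worker agents.
policy_tag_to_agent_id_map = {
    "a": [str(agent_id) for agent_id in range(env_wrapper.env.n_agents)],  # all workers share policy "a"
    "p": ["p"],  # the planner has its own policy "p"
}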
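RLlib-style multi-agent trainers usually need the inverse of this map: a function that takes an agent ID and returns its policy tag. The following is a minimal sketch that assumes the tags "a" and "p" shown above. build_policy_mapping_fn is a hypothetical helper name, and n_agents stands in for env_wrapper.env.n_agents. Depending on the RLlib version, the real mapping function may receive extra arguments, so the inner function here accepts and ignores them.

```python
def build_policy_mapping_fn(n_agents: int):
    """Return a function mapping an agent ID string to its policy tag (hypothetical helper)."""
    # Same grouping as policy_tag_to_agent_id_map: workers share "a", the planner uses "p".
    policy_tag_to_agent_id_map = {
        "a": [str(agent_id) for agent_id in range(n_agents)],
        "p": ["p"],
    }
    # Invert the map once so each lookup is a single dict access.
    agent_id_to_policy_tag = {
        agent_id: tag
        for tag, agent_ids in policy_tag_to_agent_id_map.items()
        for agent_id in agent_ids
    }

    def policy_mapping_fn(agent_id, *args, **kwargs):
        # Extra positional/keyword arguments (e.g. episode info) are ignored.
        return agent_id_to_policy_tag[agent_id]

    return policy_mapping_fn


mapping_fn = build_policy_mapping_fn(n_agents=4)
print([mapping_fn(agent_id) for agent_id in ["0", "3", "p"]])  # ['a', 'a', 'p']
```

An unknown ID such as "4" raises KeyError. This is deliberate, because a silent fallback policy would hide a misconfigured agent count.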

Action modes

Each agent type can operate in one of two action modes, controlled by the environment configuration:
| Parameter | Type | Description |
|---|---|---|
| multi_action_mode_agents | bool | Whether mobile agents use multi-action mode. When True, each action subspace is sampled independently (MultiDiscrete). When False, a single flattened action is used (Discrete). |
| multi_action_mode_planner | bool | Same as above, for the planner agent. |
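The following hypothetical sketch illustrates the difference between the two modes. It decodes a single flattened Discrete index into one action per subspace. The sketch assumes that action 0 in every subspace is a no-op, that flat index 0 is a global no-op, and that the remaining flat indices enumerate each subspace's real actions in order. The subspace names and sizes are illustrative, and Foundation's actual encoding may differ.

```python
from typing import Dict


def flat_to_multi(flat_index: int, subspace_sizes: Dict[str, int]) -> Dict[str, int]:
    """Decode a flat (Discrete) action into one action per subspace (MultiDiscrete-style).

    Assumption: action 0 in every subspace is a no-op, and the flat space is
    [global no-op] + [non-no-op actions of subspace 1] + [subspace 2] + ...
    """
    actions = {name: 0 for name in subspace_sizes}  # every subspace starts idle
    if flat_index == 0:
        return actions  # global no-op
    offset = 1
    for name, size in subspace_sizes.items():
        n_real = size - 1  # actions other than this subspace's no-op
        if flat_index < offset + n_real:
            actions[name] = flat_index - offset + 1
            return actions
        offset += n_real
    raise ValueError(f"flat_index {flat_index} is out of range")


# Illustrative subspaces: 5 movement choices and 3 build choices, each including a no-op.
sizes = {"move": 5, "build": 3}
flat_size = 1 + sum(size - 1 for size in sizes.values())
print(flat_size)                # 7
print(flat_to_multi(3, sizes))  # {'move': 3, 'build': 0}
print(flat_to_multi(6, sizes))  # {'move': 0, 'build': 2}
```

In flat mode, the agent chooses exactly one non-idle subspace per step. In multi-action mode, it samples every subspace independently, so it could move and build in the same step.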

Curriculum learning

Training is stabilized using a two-phase curriculum approach, as described in The AI Economist paper:
1. Phase one: agents only, no taxes

Train only the worker agents in a free market (taxes disabled via disable_taxes: true on the PeriodicBracketTax component). Labor costs are annealed from zero using the energy_warmup_constant and energy_warmup_method parameters so that agents learn to explore before facing full costs.
2. Phase two: agents and planner, with taxes

Resume from the phase-one agent checkpoint and begin training the planner. Tax rates are annealed via tax_annealing_schedule. High planner entropy regularization at the start (via entropy_coeff_schedule) exposes agents to a wide range of tax levels before the planner begins to optimize.
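Both phases rely on quantities that change over the course of training: the labor-cost multiplier in phase one and the planner's entropy coefficient in phase two. The sketch below shows one plausible form of each, under the following assumptions:

  • Labor-cost warmup is exponential in the number of completed episodes. This is our reading of a "decay"-style energy_warmup_method; check the scenario source for the exact formula.
  • Entropy follows a piecewise-linear [timestep, value] schedule that is clamped at its end points. This matches the shape of the planner's entropy_coeff entry under Training configurations.

The function names are hypothetical, and tax annealing is not modeled here.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def energy_weight(completed_episodes: int, energy_warmup_constant: float) -> float:
    """Hypothetical labor-cost multiplier: 0 at the start of training, approaching 1.

    Assumes a 'decay'-style exponential warmup; a non-positive constant disables it.
    """
    if energy_warmup_constant <= 0:
        return 1.0  # no warmup: full labor cost from the first episode
    return 1.0 - math.exp(-completed_episodes / energy_warmup_constant)


def piecewise_linear(schedule: Sequence[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate a [(timestep, value), ...] schedule at timestep t.

    Outside the schedule's range, the first or last value is held constant.
    """
    steps = [point[0] for point in schedule]
    if t <= steps[0]:
        return schedule[0][1]
    if t >= steps[-1]:
        return schedule[-1][1]
    i = bisect_right(steps, t)  # steps[i - 1] <= t < steps[i]
    (t0, v0), (t1, v1) = schedule[i - 1], schedule[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Phase one: labor costs ramp up as episodes complete (warmup constant is illustrative).
for episodes in (0, 1_000, 10_000):
    print("labor-cost weight", episodes, round(energy_weight(episodes, 1_000.0), 3))

# Phase two: planner entropy from 0.5 at step 0 down to 0.05 at step 50M, then held.
planner_entropy = [(0, 0.5), (50_000_000, 0.05)]
for t in (0, 25_000_000, 80_000_000):
    print("planner entropy", t, round(piecewise_linear(planner_entropy, t), 3))
```

Clamping the schedule after its last point keeps the planner's exploration at a small, fixed level for the remainder of training rather than letting it go negative.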

Training configurations

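Configuration files drive all aspects of training: environment setup, trainer hyperparameters, and policy network architecture. Both backends use YAML configs. The following example is a WarpDrive configuration for the COVID-19 and economy environment, with separate policies for the workers (a) and the planner (p):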
name: "covid_and_economy_environment"
env:
    n_agents: 51
    episode_length: 540
    multi_action_mode_agents: False
    multi_action_mode_planner: False
    flatten_masks: True
    flatten_observations: False
trainer:
    num_envs: 60
    num_episodes: 1000
    train_batch_size: 5400
policy:
    a:
        to_train: True
        algorithm: "PPO"
        gamma: 0.98
        lr: 0.0001
        model:
            type: "fully_connected"
            fc_dims: [256, 256]
    p:
        to_train: True
        algorithm: "PPO"
        entropy_coeff:
        - [0, 0.5]
        - [50000000, 0.05]
        gamma: 0.98
        lr: 0.0001
        model:
            type: "fully_connected"
            fc_dims: [256, 256]
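A quick sanity check on the trainer block follows. It assumes WarpDrive's convention that train_batch_size counts environment steps summed across all num_envs parallel copies and that the episode budget is num_episodes × episode_length steps; verify both against your WarpDrive version. Under those assumptions, the batch must divide evenly across environments, and the budget fixes the number of training iterations. training_plan is a hypothetical helper, and the dict mirrors the YAML values above.

```python
def training_plan(trainer_cfg: dict, episode_length: int) -> dict:
    """Derive per-environment rollout length and iteration count (hypothetical helper).

    Assumes train_batch_size is the total number of environment steps per
    iteration, split evenly across num_envs environment copies.
    """
    num_envs = trainer_cfg["num_envs"]
    batch = trainer_cfg["train_batch_size"]
    if batch % num_envs != 0:
        raise ValueError("train_batch_size must be divisible by num_envs")
    total_steps = trainer_cfg["num_episodes"] * episode_length
    return {
        "steps_per_env_per_iter": batch // num_envs,
        "total_env_steps": total_steps,
        "num_iterations": total_steps // batch,
    }


# Values copied from the YAML example above.
trainer = {"num_envs": 60, "num_episodes": 1000, "train_batch_size": 5400}
print(training_plan(trainer, episode_length=540))
# {'steps_per_env_per_iter': 90, 'total_env_steps': 540000, 'num_iterations': 100}
```

With these settings, each of the 60 environments contributes 90 steps per iteration. That is shorter than the 540-step episode, so one episode spans six training iterations.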

Choose a training framework

RLlib

Distributed multi-agent RL on CPU clusters using Ray. Supports the Gather-Trade-Build scenario and two-phase curriculum learning. Recommended when GPU hardware is unavailable or when running large distributed rollouts.

WarpDrive (GPU)

Massively parallel GPU-accelerated training using CUDA. Runs many environment copies simultaneously on a single GPU. Used for the COVID-19 and economic simulation at scale.
