Training with RLlib

RLlib is a scalable reinforcement learning library built on Ray. Foundation environments integrate with RLlib through the RLlibEnvWrapper class, which subclasses MultiAgentEnv and exposes separate observation and action spaces for worker agents and the social planner.

Installation

conda create --name rllib-training python=3.7 --yes
conda activate rllib-training

pip install "ai-economist>=1.5"
pip install gym==0.21
pip install tensorflow==1.14
pip install "ray[rllib]==0.8.4"

The two-level curriculum experiments in the AI Economist paper were run on a 16-CPU, 60 GB machine on Google Cloud Platform (n1-standard-16) with 15 rollout workers and 1 trainer worker.

Environment wrapper

The RLlibEnvWrapper in tutorials/rllib/env_wrapper.py wraps any Foundation environment to be compatible with RLlib’s MultiAgentEnv interface. It handles observation and action space construction for both agents and the planner.

from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ai_economist import foundation

class RLlibEnvWrapper(MultiAgentEnv):
    """
    Environment wrapper for RLlib. It sub-classes MultiAgentEnv.
    This wrapper adds the action and observation space to the environment,
    and adapts the reset and step functions to run with RLlib.
    """

    def __init__(self, env_config, verbose=False):
        self.env_config_dict = env_config["env_config_dict"]
        self.env = foundation.make_env_instance(**self.env_config_dict)

        obs = self.env.reset()

        self.observation_space = self._dict_to_spaces_dict(obs["0"])
        self.observation_space_pl = self._dict_to_spaces_dict(obs["p"])
        ...

Action space and `multi_action_mode`

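Each agent's multi_action_mode flag determines which action space the wrapper builds: a single Discrete space (one action per step) or a MultiDiscrete space (one sub-action per action component per step):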

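from gym import spaces

# Inside RLlibEnvWrapper.__init__, after self.env has been created
if self.env.world.agents[0].multi_action_mode:
    # action_spaces holds one size per sub-action
    self.action_space = spaces.MultiDiscrete(
        self.env.get_agent(self.sample_agent_idx).action_spaces
    )
else:
    # action_spaces is a single integer: the total number of actions
    self.action_space = spaces.Discrete(
        self.env.get_agent(self.sample_agent_idx).action_spaces
    )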

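The planner's action space (action_space_pl) is built the same way from the planner's own multi_action_mode flag. Set both flags in the env section of your configuration YAML with multi_action_mode_agents and multi_action_mode_planner: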

env:
  multi_action_mode_agents: false   # Discrete action space for workers
  multi_action_mode_planner: true   # MultiDiscrete action space for planner
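As a rough illustration of how the two modes differ in size, the sketch below counts actions for an agent with three action components. The component names and sizes are hypothetical, and the NO-OP accounting (one shared NO-OP in single-action mode, one NO-OP per sub-action in multi-action mode) is an assumption about Foundation's behavior, not something taken from the wrapper code shown here.

```python
# Hypothetical action components and the number of real choices in each.
component_sizes = {"Build": 1, "Gather": 4, "Trade": 44}


def single_action_dim(sizes):
    """Size of one flat Discrete space: all choices plus one shared NO-OP (assumed)."""
    return 1 + sum(sizes.values())


def multi_action_dims(sizes):
    """Per-component sizes for a MultiDiscrete space: each gets its own NO-OP (assumed)."""
    return [n + 1 for n in sizes.values()]


print(single_action_dim(component_sizes))  # 50
print(multi_action_dims(component_sizes))  # [2, 5, 45]
```

In single-action mode the policy picks one of 50 actions per step; in multi-action mode it picks one entry from each of the three sub-action heads.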

Registering the environment and launching training

The training script in tutorials/rllib/training_script.py sets up a PPO trainer using the RLlibEnvWrapper as the environment.

import ray
from ray.rllib.agents.ppo import PPOTrainer
from env_wrapper import RLlibEnvWrapper

ray.init(log_to_driver=False)

def build_trainer(run_configuration):
    trainer_config = run_configuration.get("trainer")

    env_config = {
        "env_config_dict": run_configuration.get("env"),
        "num_envs_per_worker": trainer_config.get("num_envs_per_worker"),
    }

    # Build separate policy tuples for agents and planner
    dummy_env = RLlibEnvWrapper(env_config)

    agent_policy_tuple = (
        None,
        dummy_env.observation_space,
        dummy_env.action_space,
        run_configuration.get("agent_policy"),
    )
    planner_policy_tuple = (
        None,
        dummy_env.observation_space_pl,
        dummy_env.action_space_pl,
        run_configuration.get("planner_policy"),
    )

    policies = {"a": agent_policy_tuple, "p": planner_policy_tuple}

    def policy_mapping_fun(i):
        if str(i).isdigit() or i == "a":
            return "a"
        return "p"

    # Select which policies to train
    if run_configuration["general"]["train_planner"]:
        policies_to_train = ["a", "p"]
    else:
        policies_to_train = ["a"]

    trainer_config.update({
        "env_config": env_config,
        "multiagent": {
            "policies": policies,
            "policies_to_train": policies_to_train,
            "policy_mapping_fn": policy_mapping_fun,
        },
    })

    return PPOTrainer(
        env=RLlibEnvWrapper,
        config=trainer_config,
    )
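To see how the mapping function routes agents to policies, the standalone sketch below reuses the same logic outside of RLlib. The list of agent ids is a hypothetical four-worker environment plus the planner; it is only meant to show which ids end up on policy "a" and which on policy "p".

```python
def policy_mapping_fun(agent_id):
    """Route numeric worker ids (and the literal "a") to policy "a", all others to "p"."""
    if str(agent_id).isdigit() or agent_id == "a":
        return "a"
    return "p"


# Hypothetical agent ids: four workers and the planner.
agent_ids = ["0", "1", "2", "3", "p"]
assignment = {agent_id: policy_mapping_fun(agent_id) for agent_id in agent_ids}
print(assignment)
# {'0': 'a', '1': 'a', '2': 'a', '3': 'a', 'p': 'p'}
```

Because every worker maps to the same key, all workers share one set of policy weights, while the planner gets its own.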

To run training, point the script at a directory containing a config.yaml:

python training_script.py --run-dir phase1

Results are written to ~/ray_results. Checkpoints and dense logs go into phase1/ckpts/ and phase1/dense_logs/.

Two-level curriculum training

The two-level curriculum approach from The AI Economist paper staggers agent and planner learning to stabilize training in the non-stationary multi-agent environment.

Phase one — agents only

Disable taxes

Set disable_taxes: true on the PeriodicBracketTax component. This trains workers in a free market so they develop robust labor and trading policies before tax dynamics are introduced.

Anneal labor costs

Use energy_warmup_constant and energy_warmup_method to gradually ramp up labor costs. Early in training, labor is costly but rarely rewarded, so without annealing agents can converge prematurely to a do-nothing policy.

env:
  energy_cost: 0.21
  energy_warmup_constant: 10000
  energy_warmup_method: auto  # ramp based on positive-reward timesteps
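The sketch below gives a feel for how such a warm-up could scale the labor cost over time. It assumes an exponential ramp of the form 1 - exp(-progress / energy_warmup_constant), where progress stands for the count of qualifying timesteps; both the functional form and the function name labor_cost_scale are assumptions made for illustration, not a copy of Foundation's implementation.

```python
import math


def labor_cost_scale(progress, warmup_constant):
    """Fraction of the full labor cost applied after `progress` warm-up steps.

    Assumes an exponential ramp, 1 - exp(-progress / warmup_constant).
    A non-positive constant is treated as "no warm-up" (full cost at once).
    """
    if warmup_constant <= 0:
        return 1.0
    return 1.0 - math.exp(-progress / warmup_constant)


energy_cost = 0.21  # full labor cost from the config
for progress in (0, 10_000, 50_000):
    scale = labor_cost_scale(progress, warmup_constant=10_000)
    print(progress, round(energy_cost * scale, 4))
```

Under this assumption the effective cost starts at zero, reaches roughly 63% of energy_cost after energy_warmup_constant steps, and approaches the full value after a few multiples of the constant.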

Run phase one

python training_script.py --run-dir phase1

Only agent weights are saved. The planner uses the random model during this phase (custom_model: random).

Phase two — agents and planner

Restore agent weights from phase one

Set restore_tf_weights_agents in the general section to the checkpoint path produced at the end of phase one:

general:
  restore_tf_weights_agents: "phase1/ckpts/agent.tf.weights.global-step-25000000"
  train_planner: true

Enable tax annealing

Add tax_annealing_schedule to PeriodicBracketTax to prevent the planner from setting destructive tax rates during early exploration:

- PeriodicBracketTax:
    bracket_spacing: us-federal
    tax_annealing_schedule:
    - -100
    - 0.001
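One way to read the two schedule values is as a warm-up offset and a slope for a linearly growing cap on the tax rates the planner can set. The sketch below assumes exactly that: the cap is (completions - warmup) * slope, clipped to the range [0, 1], with completions counting finished episodes. The function name annealed_tax_cap and this interpretation are assumptions for illustration.

```python
def annealed_tax_cap(completions, warmup, slope, final_cap=1.0):
    """Upper bound on the tax rate after `completions` finished episodes.

    Assumes a linear ramp, (completions - warmup) * slope, clipped to [0, final_cap].
    """
    ramp = max(0.0, (completions - warmup) * slope)
    return min(final_cap, ramp)


warmup, slope = -100, 0.001  # the two tax_annealing_schedule entries
for completions in (0, 400, 900, 2000):
    print(completions, annealed_tax_cap(completions, warmup, slope))
```

With these values the cap would start near 0.1 and reach 1.0 after about 900 episodes, so early random planner actions cannot impose very high taxes.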

Schedule planner entropy

Use entropy_coeff_schedule for the planner policy to keep entropy high initially, giving agents time to adapt to diverse tax settings before the planner begins to optimize:

planner_policy:
  entropy_coeff_schedule:
  - [0, 0.5]
  - [50000000, 0.05]
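The schedule is a list of [timestep, value] points. The sketch below shows how such a schedule can be evaluated, assuming linear interpolation between points and a constant value after the last one. The helper piecewise_linear is written for illustration and is not RLlib's own schedule class; in particular, clamping to the first value before the first point is a simplification.

```python
def piecewise_linear(schedule, t):
    """Linearly interpolate a sorted list of (timestep, value) pairs at timestep t.

    Values before the first point or after the last point are held constant.
    """
    if t <= schedule[0][0]:
        return schedule[0][1]
    for (t0, v0), (t1, v1) in zip(schedule, schedule[1:]):
        if t0 <= t < t1:
            frac = (t - t0) / (t1 - t0)
            return v0 + frac * (v1 - v0)
    return schedule[-1][1]


planner_entropy_schedule = [(0, 0.5), (50_000_000, 0.05)]
for t in (0, 25_000_000, 50_000_000, 80_000_000):
    print(t, piecewise_linear(planner_entropy_schedule, t))
```

Halfway through the schedule the coefficient sits midway between 0.5 and 0.05, and it stays at 0.05 once 50 million timesteps have passed.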

Run phase two

python training_script.py --run-dir phase2

Custom models

The tutorials/rllib/tf_models.py file provides two registered TensorFlow models for use with RLlib:

| Model name | Description |
| --- | --- |
| `random` | Samples actions uniformly at random. Used for the planner during phase one. |
| `keras_conv_lstm` | Combines convolutional layers (spatial), fully-connected layers (non-spatial), and an LSTM (historical) for structured observations. Used in the paper. |

Reference these by name in your policy configuration:

agent_policy:
  model:
    custom_model: keras_conv_lstm
    custom_options:
      fc_dim: 128
      idx_emb_dim: 4
      input_emb_vocab: 100
      lstm_cell_size: 128
      num_conv: 2
      num_fc: 2
    max_seq_len: 25

planner_policy:
  model:
    custom_model: random

Visualizing results

tensorboard --logdir ~/ray_results

The full interactive tutorial is available on Colab: multi_agent_training_with_rllib.ipynb
