The BALROG domain group evaluates an agent’s ability to play text-based and grid-world games. It wraps four distinct environments (babyai, babaisai, minihack, and nle), each selectable as a separate domain variant.
What It Evaluates
BALROG tests sequential decision-making in partially observable environments. The agent receives a textual description of the game state each step and must produce a valid action. The primary metric is average_progress: the mean episode completion fraction across all tasks in the environment, expressed as a percentage.
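The observe-then-act loop described above can be sketched with a toy environment. This is only an illustration: `ToyTextEnv`, `VALID_ACTIONS`, `run_episode`, and `forward_policy` are hypothetical names invented for this example and are not part of the BALROG API.

```python
# Hypothetical action set; real environments define their own valid actions.
VALID_ACTIONS = ["turn left", "turn right", "go forward"]


class ToyTextEnv:
    """Toy stand-in for a text-observation environment (not the BALROG API)."""

    def __init__(self, goal_distance=3, max_steps=10):
        self.goal_distance = goal_distance
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first textual observation."""
        self.position = 0
        self.steps = 0
        return self._describe()

    def _describe(self):
        remaining = self.goal_distance - self.position
        return f"The goal is {remaining} steps ahead."

    def step(self, action):
        """Apply one action; return the next observation and a done flag."""
        if action not in VALID_ACTIONS:
            raise ValueError(f"invalid action: {action!r}")
        self.steps += 1
        if action == "go forward":
            self.position += 1
        reached_goal = self.position >= self.goal_distance
        out_of_steps = self.steps >= self.max_steps
        return self._describe(), reached_goal or out_of_steps

    def progress(self):
        """Completion fraction in [0, 1] for the current episode."""
        return min(self.position / self.goal_distance, 1.0)


def run_episode(env, policy):
    """Feed text observations to the policy until the episode ends."""
    observation = env.reset()
    done = False
    while not done:
        action = policy(observation)
        observation, done = env.step(action)
    return env.progress()


def forward_policy(observation):
    """Trivial policy that ignores the observation and always moves forward."""
    return "go forward"


progress = run_episode(ToyTextEnv(), forward_policy)
```

In a real run the policy would be the agent under evaluation, mapping each textual observation to one of the environment's valid actions.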
Each environment runs multiple episodes per task. Results are aggregated into a report.json with per-task and per-environment breakdowns.
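As a rough sketch of how such an aggregation could work, the snippet below averages per-episode completion fractions into a percentage, both per task and for the whole environment. The record fields `task` and `progress` are assumptions made for this example, as is the choice of a flat mean over all episodes; the actual report.json schema and weighting may differ.

```python
from collections import defaultdict
from statistics import mean


def average_progress(fractions):
    """Mean completion fraction, expressed as a percentage."""
    if not fractions:
        return 0.0
    return 100.0 * mean(fractions)


def summarize(episodes):
    """Aggregate episode records into per-task and overall scores.

    Each record is assumed to look like {"task": str, "progress": float},
    with progress in [0, 1]. These field names are hypothetical.
    """
    by_task = defaultdict(list)
    for episode in episodes:
        by_task[episode["task"]].append(episode["progress"])

    per_task = {task: average_progress(vals) for task, vals in by_task.items()}
    overall = average_progress([ep["progress"] for ep in episodes])
    return {"per_task": per_task, "average_progress": overall}


episodes = [
    {"task": "goto", "progress": 1.0},
    {"task": "goto", "progress": 0.0},
    {"task": "pickup", "progress": 0.5},
]
summary = summarize(episodes)
```

With these three sample episodes, the overall score is the mean of 1.0, 0.0, and 0.5, scaled to a percentage.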
The Four Environments
- babyai
- babaisai
- minihack
- nle
BabyAI is a grid-world environment with language-conditioned navigation and manipulation tasks. The default task set includes 5 tasks from BabyAI-MixedTrainLocal-v0:
- goto: navigate to a target object
- pickup: pick up a specified object
- open: open a door
- putnext: place an object next to another
- pick_up_seq_go_to: pick up an object then navigate
Hydra Configuration
BALROG uses Hydra for configuration. The config file is at domains/balrog/config/config.yaml.
Key configuration sections:
Setup
Run the post-install script
BALROG requires additional game data (Boxoban levels and TextWorld games) that must be downloaded after installing the Python packages. The post-install script downloads:
- Boxoban levels from the DeepMind repository into the MiniHack data directory
- TextWorld game files (tw-games.zip) into ./domains/balrog/
Output Structure
Each episode produces a JSON file and optionally a text trajectory. Results are organized under <output_dir>/<env_name>/<task_name>/.
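A minimal sketch of reading results back from this layout is shown below. The helper names `episode_dir` and `load_episodes` are hypothetical, and the sketch assumes that every .json file in a task directory is an episode record; the actual file names and contents are not specified here.

```python
import json
from pathlib import Path


def episode_dir(output_dir, env_name, task_name):
    """Build <output_dir>/<env_name>/<task_name>/ as a Path."""
    return Path(output_dir) / env_name / task_name


def load_episodes(output_dir, env_name, task_name):
    """Load every JSON file in a task directory, in sorted filename order.

    Text trajectories (non-JSON files) are skipped by the glob pattern.
    """
    folder = episode_dir(output_dir, env_name, task_name)
    return [json.loads(path.read_text()) for path in sorted(folder.glob("*.json"))]
```

For example, `load_episodes("outputs", "babyai", "goto")` would read the episode files under outputs/babyai/goto/, where "outputs" is a placeholder for the configured output directory.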
The report.json summary contains:
Domain Properties
| Property | Value |
|---|---|
| Score key | average_progress |
| Splits | train only |
| Eval subset | full dataset |
| Ensemble supported | No |
| Staged eval samples | 1 (all variants) |