The `polyglot` domain is a software engineering benchmark modeled after SWE-bench. Agents are given a problem statement and must produce a code patch that makes failing tests pass. Unlike other HyperAgents domains, each instance runs inside its own Docker container: a fresh, isolated environment containing the target repository at a specific commit.
What It Evaluates
Polyglot tests the ability to fix bugs and implement features in real code repositories across six languages: Python, Rust, Go, JavaScript, C++, and Java. The primary metric is `accuracy_score`: the fraction of instances where the agent's patch causes all tests to pass (`resolved`).
Each instance is independent. The eval result for a single instance is one of:
| Result | Meaning |
|---|---|
| `resolved` | Patch applied; all tests pass |
| `unresolved` | Patch applied; tests still fail |
| `empty_patch` | Agent produced no patch |
| `incomplete` | Container setup failed before agent ran |
| `error` | Unexpected exception during processing |
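The table above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the harness's actual code: the `InstanceResult` enum, the `classify_instance` helper, and the order in which the checks run are all assumptions made for illustration.

```python
from enum import Enum
from typing import Optional


class InstanceResult(str, Enum):
    """Outcome labels from the results table (the enum itself is hypothetical)."""

    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    EMPTY_PATCH = "empty_patch"
    INCOMPLETE = "incomplete"
    ERROR = "error"


def classify_instance(
    container_ready: bool,
    patch: Optional[str],
    tests_passed: bool,
    raised: bool = False,
) -> InstanceResult:
    """Map the observable facts about one instance to a result label."""
    if raised:
        # An unexpected exception takes precedence over everything else.
        return InstanceResult.ERROR
    if not container_ready:
        # Setup failed, so the agent never ran.
        return InstanceResult.INCOMPLETE
    if not patch or not patch.strip():
        # The agent ran but produced nothing to apply.
        return InstanceResult.EMPTY_PATCH
    return InstanceResult.RESOLVED if tests_passed else InstanceResult.UNRESOLVED
```

Only `resolved` instances count toward the numerator of `accuracy_score`; every other label counts as a failure.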
How It Differs from Other Domains
- Separate harness: Polyglot uses `domains/polyglot/harness.py` instead of `domains/harness.py`. The main harness does not handle the `polyglot` domain.
- Docker containers: Each instance builds and starts a dedicated container with the repository pre-installed at the correct commit.
- No CSV dataset: The dataset is a JSON file (`polyglot_benchmark_metadata.json`) prepared via `prepare_polyglot_dataset.py`.
- Per-language test commands: Test execution is language-specific (e.g., `pytest` for Python, `cargo test` for Rust, `go test` for Go, `npm run test` for JavaScript, `cmake` + `make` for C++, `./gradlew test` for Java).
- 10-minute agent timeout: Each agent invocation is wrapped in `timeout 600` inside the container.
- No ensemble support: `can_domain_ensembled("polyglot")` returns `False`.
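The per-language test commands and the `timeout` wrapper can be pictured as a lookup table plus a small helper. This is a sketch under stated assumptions: the dictionary keys, the `wrap_with_timeout` helper, and the exact expansion of the C++ entry are hypothetical, and the harness may pass different arguments to each tool.

```python
import shlex

# Test commands as listed in the documentation. The keys are hypothetical
# labels, and the C++ entry is a guess at what "cmake + make" expands to.
TEST_COMMANDS = {
    "python": "pytest",
    "rust": "cargo test",
    "go": "go test",
    "javascript": "npm run test",
    "cpp": "cmake . && make",
    "java": "./gradlew test",
}

AGENT_TIMEOUT_SECONDS = 600  # the documented 10-minute agent limit


def wrap_with_timeout(command: str, seconds: int) -> str:
    """Prefix a shell command with coreutils `timeout`.

    Running the command through `sh -c` keeps compound commands such as
    "cmake . && make" under a single time limit.
    """
    return f"timeout {seconds} sh -c {shlex.quote(command)}"
```

For example, `wrap_with_timeout(TEST_COMMANDS["rust"], 120)` produces `timeout 120 sh -c 'cargo test'`.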
Dataset Subsets
Two predefined subsets are available in `domains/polyglot/subsets/`:
| Subset | File | Description |
|---|---|---|
| `small` | `subsets/small.json` | Small list of instance IDs for quick testing |
| `medium` | `subsets/medium.json` | Larger representative set |
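A subset file can be loaded with a few lines of standard-library code. The sketch below assumes each file is a flat JSON array of instance ID strings, which the description suggests but does not confirm; `load_subset` is a hypothetical helper, not a function from the repository.

```python
import json
from pathlib import Path
from typing import List

SUBSETS_DIR = Path("domains/polyglot/subsets")


def load_subset(name: str, subsets_dir: Path = SUBSETS_DIR) -> List[str]:
    """Return the instance IDs listed in <subsets_dir>/<name>.json.

    Assumes the file holds a JSON array of IDs; raises ValueError otherwise.
    """
    path = subsets_dir / f"{name}.json"
    with path.open(encoding="utf-8") as handle:
        ids = json.load(handle)
    if not isinstance(ids, list):
        raise ValueError(f"{path} does not contain a JSON list")
    return [str(instance_id) for instance_id in ids]
```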
Setup
Run evaluation
| Argument | Default | Description |
|---|---|---|
| `--subset` | `small` | Dataset subset: `small`, `medium`, or `full` |
| `--num_samples` | `-1` (all) | Limit number of instances |
| `--max_workers` | `5` | Parallel Docker containers |
| `--model_name_or_path` | timestamp | Label for this run |
| `--model_patch_paths` | `None` | Comma-separated patch files to pre-apply |
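The argument table maps naturally onto an `argparse` parser. The following is an illustrative reconstruction from the table only, not the parser in `domains/polyglot/harness.py`: the timestamp format, and the choice to split `--model_patch_paths` into a list, are assumptions.

```python
import argparse
from datetime import datetime


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose flags and defaults mirror the argument table."""
    parser = argparse.ArgumentParser(description="Polyglot evaluation (illustrative)")
    parser.add_argument("--subset", default="small", choices=["small", "medium", "full"])
    parser.add_argument("--num_samples", type=int, default=-1, help="-1 means all instances")
    parser.add_argument("--max_workers", type=int, default=5, help="parallel Docker containers")
    parser.add_argument("--model_name_or_path", default=None, help="label for this run")
    parser.add_argument("--model_patch_paths", default=None, help="comma-separated patch files")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    if args.model_name_or_path is None:
        # The table says the default label is a timestamp; this format is a guess.
        args.model_name_or_path = datetime.now().strftime("%Y%m%d_%H%M%S")
    # Assumption: turn the comma-separated string into a list (empty if unset).
    args.model_patch_paths = (
        args.model_patch_paths.split(",") if args.model_patch_paths else []
    )
    return args
```

Calling `parse_args(["--subset", "medium", "--max_workers", "2"])` would select the medium subset with two parallel containers and leave every other option at its documented default.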
Container Lifecycle
For each instance, the harness:

- Builds a Docker image for the repository/language combination (cached after first build)
- Starts a container from that image
- Copies `task_agent.py`, `agent/`, `utils/`, and other required files into the container
- Applies any pre-existing model patches (from `--model_patch_paths`)
- Installs `requirements.txt` inside the container
- Runs `run_task_agent.py` with a 10-minute timeout
- Reads the resulting `model_patch.diff` from the container
- Resets the repository to `test_commit` and applies the patch
- Runs the language-specific test command with a 2-minute timeout
- Cleans up the container
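The lifecycle steps above can be sketched as an ordered plan of Docker CLI commands. This is a rough illustration, not the harness's implementation: the `plan_instance` function, the container name, the `/workspace` and `/repo` paths, the location of `model_patch.diff`, and the directory where pre-existing patches are applied are all hypothetical. The function only builds the command list and does not execute anything.

```python
import shlex
from typing import List, Sequence

AGENT_TIMEOUT = 600  # seconds, the documented 10-minute agent limit
TEST_TIMEOUT = 120   # seconds, the documented 2-minute test limit


def plan_instance(
    instance_id: str,
    image: str,
    test_commit: str,
    test_command: str,
    model_patch_paths: Sequence[str] = (),
    workdir: str = "/workspace",  # hypothetical agent directory in the container
    repo_dir: str = "/repo",      # hypothetical repository directory in the container
) -> List[List[str]]:
    """Return the commands for one instance, in order, without running them."""
    name = f"polyglot-{instance_id}"  # hypothetical container name

    def in_container(shell_command: str) -> List[str]:
        return ["docker", "exec", name, "sh", "-c", shell_command]

    steps: List[List[str]] = [
        # Build the image; Docker's cache makes repeat builds cheap.
        ["docker", "build", "-t", image, "."],
        # Start a long-lived container that later steps exec into.
        ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"],
        in_container(f"mkdir -p {workdir}"),
    ]
    # Copy the agent code and its supporting files into the container.
    for src in ("task_agent.py", "run_task_agent.py", "requirements.txt", "agent", "utils"):
        steps.append(["docker", "cp", src, f"{name}:{workdir}/"])
    # Apply pre-existing model patches (target directory is an assumption).
    for patch in model_patch_paths:
        steps.append(["docker", "cp", patch, f"{name}:{workdir}/pre.diff"])
        steps.append(in_container(f"cd {workdir} && git apply pre.diff"))
    # Install dependencies, then run the agent under the 10-minute limit.
    steps.append(in_container(f"cd {workdir} && pip install -r requirements.txt"))
    steps.append(in_container(f"cd {workdir} && timeout {AGENT_TIMEOUT} python run_task_agent.py"))
    # Read the patch the agent produced.
    steps.append(["docker", "cp", f"{name}:{workdir}/model_patch.diff", "model_patch.diff"])
    # Reset to the test commit, apply the patch, and run the tests.
    steps.append(in_container(
        f"cd {repo_dir} && git reset --hard {test_commit} && git apply {workdir}/model_patch.diff"
    ))
    steps.append(in_container(
        f"cd {repo_dir} && timeout {TEST_TIMEOUT} sh -c {shlex.quote(test_command)}"
    ))
    # Clean up, even though a real harness would do this in a finally block.
    steps.append(["docker", "rm", "-f", name])
    return steps
```

Returning a plan instead of executing it keeps the sketch safe to run anywhere; a real harness would execute each step, check exit codes, and make sure the cleanup step runs even when an earlier step fails.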
Report Format
The `report.json` produced by `domains/polyglot/report.py` contains:
Domain Properties
| Property | Value |
|---|---|
| Score key | accuracy_score |
| Splits | train only |
| Eval subset | full dataset |
| Ensemble supported | No |
| Staged eval samples | 10 / 60 (~17%) |
| Parallelism | Multiple Docker containers via --max_workers |
The `accuracy_score` is computed as `resolved_instances / submitted_instances`. If `expected_num_tasks` is explicitly provided to `get_all_performance()`, it is used as the denominator instead, ensuring that incomplete runs are not artificially inflated.
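The effect of the denominator choice is easiest to see with numbers. The function below is a minimal sketch of the formula as described, not the body of `get_all_performance()`; its name, its signature, and the decision to return `0.0` for an empty denominator are assumptions.

```python
from typing import Optional


def accuracy_score(
    resolved_instances: int,
    submitted_instances: int,
    expected_num_tasks: Optional[int] = None,
) -> float:
    """Fraction of resolved instances, using expected_num_tasks if given."""
    denominator = (
        expected_num_tasks if expected_num_tasks is not None else submitted_instances
    )
    if denominator <= 0:
        # Assumption: report 0.0 rather than raise when nothing was evaluated.
        return 0.0
    return resolved_instances / denominator
```

With 6 resolved out of 8 submitted instances, the score is 0.75. If the run was expected to cover 12 tasks and `expected_num_tasks=12` is passed, the same run scores 0.5, so the 4 instances that never completed count against it.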