HyperAgents includes two IMO-related domains sourced from the Google DeepMind superhuman/imobench dataset. They test different aspects of mathematical reasoning:

- imo_grading: evaluate a student’s written answer against official grading rubrics
- imo_proof: write a complete mathematical proof for an IMO problem
Dataset Setup
Both domains share the same dataset, which must be downloaded before first use. The setup step:

- Clones the google-deepmind/superhuman repository at a pinned commit (c1ee02e)
- Copies the imobench/*.csv files into domains/imo/
- Removes the cloned repository
- Runs python -m domains.imo.curate_subsets to generate balanced filtered subsets
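
The four steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual setup script: the GitHub URL is inferred from the repository name, and `copy_benchmark_csvs` and `setup_imo_dataset` are hypothetical helper names. Only the commit hash, the `imobench/*.csv` pattern, the `domains/imo/` destination, and the `curate_subsets` module come from the description above.

```python
import glob
import os
import shutil
import subprocess
import sys
import tempfile

# Assumption: the repository lives at this GitHub URL; only the
# "google-deepmind/superhuman" name and the commit come from the docs.
REPO_URL = "https://github.com/google-deepmind/superhuman"
PINNED_COMMIT = "c1ee02e"


def copy_benchmark_csvs(clone_dir, dest_dir):
    """Copy imobench/*.csv from the clone into dest_dir; return the file names."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for path in sorted(glob.glob(os.path.join(clone_dir, "imobench", "*.csv"))):
        shutil.copy(path, dest_dir)
        copied.append(os.path.basename(path))
    return copied


def setup_imo_dataset(dest_dir=os.path.join("domains", "imo")):
    """Hypothetical helper mirroring the four steps; run it from the repo root."""
    clone_dir = tempfile.mkdtemp(prefix="superhuman_")
    try:
        # 1. Clone the repository and check out the pinned commit.
        subprocess.run(["git", "clone", REPO_URL, clone_dir], check=True)
        subprocess.run(["git", "-C", clone_dir, "checkout", PINNED_COMMIT], check=True)
        # 2. Copy the benchmark CSV files into the domain directory.
        copy_benchmark_csvs(clone_dir, dest_dir)
    finally:
        # 3. Remove the cloned repository, even if an earlier step failed.
        shutil.rmtree(clone_dir, ignore_errors=True)
    # 4. Generate the balanced filtered subsets.
    subprocess.run([sys.executable, "-m", "domains.imo.curate_subsets"], check=True)
```

Nothing runs on import; calling `setup_imo_dataset()` performs the network clone and therefore requires `git` on the PATH.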
imo_grading
What It Evaluates
Given an IMO problem, its official solution, grading guidelines, and a student’s answer, the agent must predict the grade that a human marker would assign. Grades are drawn from four discrete labels: incorrect, partial, almost, and correct.
This is a classification task. Scoring uses two metrics:

- overall_accuracy: exact-match label accuracy
- Normalized MAE: mean absolute error over the point mapping {incorrect: 0, partial: 1, almost: 6, correct: 7}, normalized by the maximum score of 7
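
As a minimal sketch of how these two metrics could be computed from predicted and ground-truth labels (the function name `grading_metrics` and the `normalized_mae` result key are hypothetical; only `overall_accuracy` and the point mapping are taken from the description above):

```python
# Point mapping and maximum score as given in the metric description.
POINTS = {"incorrect": 0, "partial": 1, "almost": 6, "correct": 7}
MAX_POINTS = 7


def grading_metrics(predicted, actual):
    """Return exact-match accuracy and normalized MAE for two label lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(actual)
    exact = sum(p == a for p, a in zip(predicted, actual))
    abs_err = sum(abs(POINTS[p] - POINTS[a]) for p, a in zip(predicted, actual))
    return {
        "overall_accuracy": exact / n,
        # Mean absolute point error, divided by the maximum score of 7.
        "normalized_mae": abs_err / n / MAX_POINTS,
    }


metrics = grading_metrics(
    ["correct", "partial", "almost", "incorrect"],
    ["correct", "incorrect", "correct", "incorrect"],
)
```

In this example two of four labels match, so the accuracy is 0.5. The two mismatches are each off by one point, giving a normalized MAE of (2 / 4) / 7. Because the mapping jumps from 1 to 6 between partial and almost, confusing those two labels costs far more than confusing almost with correct.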
Dataset Format
Each row in the grading CSV represents one graded answer:

| Column | Description |
|---|---|
| Grading ID | Unique identifier |
| Problem | The IMO problem statement |
| Solution | The official reference solution |
| Grading guidelines | Official rubric |
| Response | Student’s answer |
| Reward | Ground truth grade: incorrect, partial, almost, or correct |
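
A row with these columns can be read with the standard `csv` module. The sketch below is illustrative only: the sample row uses placeholder values rather than real dataset content, `load_grading_rows` is a hypothetical helper, and lower-casing the grade is an assumption about how labels might be normalized.

```python
import csv
import io

REQUIRED_COLUMNS = [
    "Grading ID", "Problem", "Solution",
    "Grading guidelines", "Response", "Reward",
]
VALID_GRADES = {"incorrect", "partial", "almost", "correct"}

# Placeholder row for illustration; not taken from the real dataset.
SAMPLE_CSV = (
    "Grading ID,Problem,Solution,Grading guidelines,Response,Reward\n"
    "g-001,<problem statement>,<reference solution>,<rubric>,<student answer>,partial\n"
)


def load_grading_rows(handle):
    """Parse grading rows, checking the header and the grade labels."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: labels are compared case-insensitively.
        grade = row["Reward"].strip().lower()
        if grade not in VALID_GRADES:
            raise ValueError(f"unexpected grade: {row['Reward']!r}")
        row["Reward"] = grade
        rows.append(row)
    return rows


rows = load_grading_rows(io.StringIO(SAMPLE_CSV))
```

To read an actual file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the same function; the file names under domains/imo/ are not specified here.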
Dataset Subsets
Filtered subsets follow the same convention as paper_review.
Setup and Run
imo_proof
What It Evaluates
Given an IMO problem statement, the agent must generate a complete mathematical proof. Proofs are not evaluated directly; they are passed to a separate proof-grading agent (imo_proof_grading) which assigns a grade using the same four-label rubric as imo_grading.
The primary score key is points_percentage: the fraction of total possible points (7 per problem) earned across all problems.
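
The calculation can be sketched as follows. This assumes the grader's labels convert to points using the same mapping given for imo_grading, which the text does not state explicitly for imo_proof, and that the value is a 0 to 1 fraction rather than scaled to 100. The function below is a hypothetical stand-in, not the project's own implementation.

```python
# Assumed label-to-point mapping, borrowed from the imo_grading metrics.
POINTS = {"incorrect": 0, "partial": 1, "almost": 6, "correct": 7}
POINTS_PER_PROBLEM = 7


def points_percentage(grades):
    """Fraction of total possible points earned across all graded proofs."""
    if not grades:
        raise ValueError("at least one grade is required")
    earned = sum(POINTS[g] for g in grades)
    return earned / (POINTS_PER_PROBLEM * len(grades))


score = points_percentage(["correct", "partial", "incorrect"])
```

Here the three proofs earn 7 + 1 + 0 = 8 of 21 possible points, so the score is 8/21.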
Dataset Format
The proof dataset CSV has one row per IMO problem:

| Column | Description |
|---|---|
| Problem ID | Unique identifier |
| Problem | The problem statement |
| Solution | Reference solution (used by the grader, not the agent) |
Proof Grader Setup
The imo_proof reporting pipeline depends on a proofgrader Python package that is generated from the current codebase. This must be set up before running domains.report on proof outputs.
Build the proof grader package
./proofgrader_repo/, re-packages it as the proofgrader Python package, and rewrites internal imports.
Reporting Pipeline
report.py runs a two-stage pipeline for imo_proof:

- Grading: runs harness(domain="imo_proof_grading", agent_path="proofgrader.task_agent", ...) against the generated proofs
- Scoring: calls report_proof_grading(), which computes points_percentage and correct_percentage
- Report file: moved to <dname>/report.json
Domain Properties
| Property | imo_grading | imo_proof |
|---|---|---|
| Score key | overall_accuracy | points_percentage |
| Splits | train / val / test | train only |
| Eval subset | _filtered_100_train | — |
| Ensemble supported | Yes | No |
| Staged eval samples | 10 / 100 (10%) | 10 / 60 (~17%) |