ProgramBench is a reverse-engineering benchmark: the agent is dropped into a Docker container with a compiled binary and must produce a fresh source codebase that reproduces the binary's behavior. Solutions are evaluated by running tests against the rebuilt executable; no reference source code is provided. This makes ProgramBench significantly harder than traditional code-repair benchmarks like SWE-bench, since the agent must reason about program behavior from artifacts alone.

mini-swe-agent supports ProgramBench through the `mini-extra programbench` command, which runs in the same parallel batch mode as the SWE-bench runner. Output is directly compatible with `programbench eval`.
Prerequisites
Install the programbench package
The runner imports `programbench` to discover task instances. Install it before running the benchmark.
Running the benchmark
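Before launching a batch, a quick preflight check can confirm that the prerequisite is in place. The sketch below is illustrative and not part of mini-swe-agent: the `preflight` helper is hypothetical, and it assumes that the `mini-extra` entry point is on `PATH` once mini-swe-agent is installed.

```python
import importlib.util
import shutil


def preflight() -> dict:
    """Report whether the pieces the runner relies on appear to be available."""
    return {
        # The runner imports `programbench` to discover task instances.
        "programbench_importable": importlib.util.find_spec("programbench") is not None,
        # Assumption: installing mini-swe-agent puts `mini-extra` on PATH.
        "mini_extra_on_path": shutil.which("mini-extra") is not None,
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

The check only looks for the module and the executable; it does not import or run either one.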
| Flag | Description |
|---|---|
| `-o`, `--output` | Output directory (default: timestamped `programbench_results_<ts>/`) |
| `-m`, `--model` | Model to use |
| `-c`, `--config` | Config file(s), filenames, or `key=value` pairs. Defaults to the built-in `programbench.yaml` |
| `-w`, `--workers` | Number of parallel worker threads (default: 1) |


| Flag | Description |
|---|---|
| `--slice` | Slice of instances to run, e.g. `0:5` for the first five |
| `--filter` | Filter instance IDs by regex |
| `--shuffle` | Shuffle instance order (default: false) |
| `--redo-existing` | Re-run instances that already have a `submission.tar.gz` (default: false) |

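To make the selection flags concrete, here is a minimal sketch of how a slice spec such as `0:5` and a regex filter could narrow a list of instance IDs. It is not the runner's actual implementation. The `select_instances` helper, the instance IDs, the use of `re.match`, and the filter-then-shuffle-then-slice order are all assumptions made for illustration.

```python
import random
import re


def select_instances(ids, slice_spec="", filter_regex="", shuffle=False, seed=0):
    """Illustrative selection: filter by regex, optionally shuffle, then slice."""
    # Assumption: the regex is anchored at the start of the ID (re.match).
    selected = [i for i in ids if re.match(filter_regex, i)] if filter_regex else list(ids)
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(selected)
    if slice_spec:
        # "0:5" -> slice(0, 5); an empty part such as ":5" becomes None.
        parts = [int(p) if p else None for p in slice_spec.split(":")]
        selected = selected[slice(*parts)]
    return selected


if __name__ == "__main__":
    # Hypothetical instance IDs, for demonstration only.
    ids = ["alpha_1", "alpha_2", "beta_1", "beta_2", "gamma_1"]
    print(select_instances(ids, slice_spec="0:2"))
    print(select_instances(ids, filter_regex="beta"))
```

Slicing after filtering means `0:5` counts positions among the instances that survived the filter, which is worth verifying against the real runner before relying on it.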

| Flag | Description |
|---|---|
| `--environment-class` | Environment backend: `docker` or `singularity` |
| `--model-class` | Fully-qualified model class to use |

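The flags above can be combined into a single invocation. The sketch below assembles the argument list in Python without running it. The `build_command` helper is hypothetical, and the output directory and model name in the usage lines are placeholders rather than recommended values.

```python
import shlex


def build_command(output_dir, model, workers=1, slice_spec=None, environment_class=None):
    """Assemble a `mini-extra programbench` argument list from the documented flags."""
    cmd = ["mini-extra", "programbench", "-o", output_dir, "-m", model, "-w", str(workers)]
    if slice_spec:
        cmd += ["--slice", slice_spec]
    if environment_class:
        cmd += ["--environment-class", environment_class]
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute your own output directory and model.
    cmd = build_command("results", "my-model", workers=4, slice_spec="0:5")
    print(shlex.join(cmd))
    # To launch for real: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when it is later passed to `subprocess.run`.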
Output layout
Each instance writes two files under `<output>/<instance_id>/`:
- `submission.tar.gz`: the agent's `/workspace` directory as a gzip-compressed archive
- `<instance_id>.traj.json`: the full agent trajectory

This layout is directly compatible with `programbench eval` for scoring.
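Before scoring, it can be useful to check which instances actually produced both files. The following sketch walks an output directory laid out as described above. The `summarize_output` helper is hypothetical and relies only on the documented file names.

```python
import tarfile
from pathlib import Path


def summarize_output(output_dir):
    """Map each instance ID to what was found in its output directory."""
    summary = {}
    for instance_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        instance_id = instance_dir.name
        archive = instance_dir / "submission.tar.gz"
        trajectory = instance_dir / f"{instance_id}.traj.json"
        members = []
        if archive.exists():
            # List the archived paths without extracting anything.
            with tarfile.open(archive, "r:gz") as tar:
                members = tar.getnames()
        summary[instance_id] = {
            "has_submission": archive.exists(),
            "has_trajectory": trajectory.exists(),
            "archived_paths": members,
        }
    return summary
```

Instances reported without a submission are the ones a rerun would pick up by default, since `--redo-existing` is off unless requested.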
Network isolation
The default config launches each container with `--network none`. The agent cannot install packages from the internet, clone repositories, or download source tarballs. The entire reverse-engineering task must be completed offline using only the provided binary and any documentation bundled in the container.

To change this behavior, override `environment.run_args` in your config file.
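As a rough illustration of what such an override involves, the sketch below removes the network-isolation setting from a list of container arguments and wraps the result in the nested structure implied by the `environment.run_args` key. It assumes that `run_args` is a list of strings, and the default list in the usage lines is hypothetical; check the built-in `programbench.yaml` for the real values.

```python
def without_network_isolation(run_args):
    """Return a copy of run_args with any `--network none` setting removed."""
    cleaned = []
    skip_next = False
    for i, arg in enumerate(run_args):
        if skip_next:
            skip_next = False
            continue
        if arg == "--network" and i + 1 < len(run_args) and run_args[i + 1] == "none":
            skip_next = True  # drop the flag and its value
            continue
        if arg == "--network=none":
            continue
        cleaned.append(arg)
    return cleaned


if __name__ == "__main__":
    # Hypothetical defaults, for demonstration only.
    default_args = ["--rm", "--network", "none"]
    override = {"environment": {"run_args": without_network_isolation(default_args)}}
    print(override)
```

Results produced with network access enabled are not comparable to runs under the default offline setting, because the agent could then obtain the original sources.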
How ProgramBench differs from SWE-bench
| | SWE-bench | ProgramBench |
|---|---|---|
| Task type | Patch a bug in existing Python source | Reconstruct source code from a compiled binary |
| Reference code | Provided (the repository) | Not provided |
| Evaluation | Patch passes test suite | Rebuilt executable passes test suite |
| Dataset source | GitHub issues | Compiled programs |
| Output file | preds.json (patches) | submission.tar.gz (workspace archive) |