ProgramBench is a reverse-engineering benchmark: the agent is dropped into a Docker container with a compiled binary and must produce a fresh source codebase that reproduces the binary’s behavior. Solutions are evaluated by running tests against the rebuilt executable; no reference source code is provided. This makes ProgramBench significantly harder than traditional code-repair benchmarks like SWE-bench, since the agent must reason about program behavior from artifacts alone.

mini-swe-agent supports ProgramBench through the mini-extra programbench command, which runs in the same parallel batch mode as the SWE-bench runner. Output is directly compatible with programbench eval.

Note: The ProgramBench Docker containers (programbench/<instance>:task_cleanroom) are built for x86 Linux architecture. You may not be able to run them on ARM or other architectures.

Prerequisites

1. Install the programbench package

The runner imports programbench to discover task instances. Install it before running:
pip install programbench

2. Ensure Docker is available

Each task runs inside a container tagged :task_cleanroom — a build-artifact-free image. Docker must be running and accessible.
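
The prerequisites above can be checked before starting a long run. The following is a minimal sketch, not part of mini-swe-agent: the preflight function and its result keys are hypothetical names chosen for illustration. It only confirms that the programbench package is importable, that a docker executable is on PATH, and that the host reports an x86-64 machine type. It does not verify that the Docker daemon is actually running.

```python
import importlib.util
import platform
import shutil


def preflight() -> dict:
    """Report whether each prerequisite appears to be met on this machine.

    Hypothetical helper for illustration; not part of mini-swe-agent.
    """
    return {
        # The runner imports the programbench package to discover task instances.
        "programbench_installed": importlib.util.find_spec("programbench") is not None,
        # Only checks that a docker executable is on PATH, not that the daemon is up.
        "docker_on_path": shutil.which("docker") is not None,
        # The task images are built for x86 Linux.
        "x86_64_host": platform.machine().lower() in {"x86_64", "amd64"},
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

A failed x86_64_host check does not necessarily mean the containers cannot run (emulation may work), so it is best treated as a warning rather than a hard stop.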

Running the benchmark

mini-extra programbench \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --workers 4
You can also call the script directly:
python src/minisweagent/run/benchmarks/programbench.py --help
Basic flags
| Flag | Description |
| --- | --- |
| -o, --output | Output directory (default: timestamped programbench_results_<ts>/) |
| -m, --model | Model to use |
| -c, --config | Config file(s), filenames, or key=value pairs. Defaults to the built-in programbench.yaml |
| -w, --workers | Number of parallel worker threads (default: 1) |
Data selection flags
| Flag | Description |
| --- | --- |
| --slice | Slice of instances to run, e.g. 0:5 for the first five |
| --filter | Filter instance IDs by regex |
| --shuffle | Shuffle instance order (default: false) |
| --redo-existing | Re-run instances that already have a submission.tar.gz (default: false) |
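
To make the interaction of --filter, --shuffle, and --slice concrete, here is a small sketch of how such a selection could work. It is an illustration under assumptions, not the runner's actual code: the select_instances function, the order in which the steps are applied (filter, then shuffle, then slice), the use of re.search, the fixed seed, and the instance IDs in the usage example are all hypothetical.

```python
import random
import re


def select_instances(ids, filter_regex="", slice_spec="", shuffle=False, seed=0):
    """Illustrative selection: filter, then optional shuffle, then slice.

    The ordering and matching rules here are assumptions, not the runner's
    documented behavior.
    """
    if filter_regex:
        selected = [i for i in ids if re.search(filter_regex, i)]
    else:
        selected = list(ids)
    if shuffle:
        # A seeded generator keeps the shuffled order reproducible.
        random.Random(seed).shuffle(selected)
    if slice_spec:
        # "0:5" becomes slice(0, 5); an empty part means "no bound".
        parts = [int(p) if p else None for p in slice_spec.split(":")]
        selected = selected[slice(*parts)]
    return selected


if __name__ == "__main__":
    ids = ["alpha_1", "alpha_2", "beta_1", "beta_2"]  # made-up instance IDs
    print(select_instances(ids, filter_regex="^alpha", slice_spec="0:1"))
    # prints ['alpha_1']
```

Because the real ordering is not stated here, it is worth confirming on a small slice that the instances you expect are the ones that actually run.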
Advanced flags
| Flag | Description |
| --- | --- |
| --environment-class | Environment backend: docker or singularity |
| --model-class | Fully-qualified model class to use |

Output layout

Each instance writes two files under <output>/<instance_id>/:
  • submission.tar.gz — the agent’s /workspace directory, gzip-compressed
  • <instance_id>.traj.json — the full agent trajectory
Pass the output directory directly to programbench eval for scoring:
programbench eval <output>/
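
Before scoring, it can be useful to see which instances actually produced both files. The sketch below walks an output directory that follows the layout described above and reports, per instance, whether submission.tar.gz and the trajectory file exist and how many entries the archive holds. The summarize_run function and its result keys are hypothetical names for illustration; only the file names come from the layout above.

```python
import tarfile
from pathlib import Path


def summarize_run(output_dir):
    """Map each instance directory name to a summary of what it contains.

    Hypothetical helper; assumes the <output>/<instance_id>/ layout.
    """
    summary = {}
    instance_dirs = sorted(p for p in Path(output_dir).iterdir() if p.is_dir())
    for inst_dir in instance_dirs:
        archive = inst_dir / "submission.tar.gz"
        traj = inst_dir / f"{inst_dir.name}.traj.json"
        n_members = None
        if archive.is_file():
            # Count archive entries (files and directories) without extracting.
            with tarfile.open(archive, "r:gz") as tar:
                n_members = len(tar.getnames())
        summary[inst_dir.name] = {
            "has_submission": archive.is_file(),
            "has_trajectory": traj.is_file(),
            "archive_members": n_members,
        }
    return summary
```

Instances reported without a submission are the ones a later run would pick up again unless --redo-existing is set to re-run everything.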

Network isolation

The default config launches each container with --network none. The agent cannot install packages from the internet, clone repositories, or download source tarballs. The entire reverse-engineering task must be completed offline using only the provided binary and any documentation bundled in the container.
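If the agent needs network access, override environment.run_args in your config file. The example below attaches the container to Docker's default bridge network, which restores general outbound access rather than allowing only specific hosts: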
environment:
  run_args:
    - "--network"
    - "bridge"
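
When sharing configs, it is easy to lose track of whether a run was actually isolated. The following sketch inspects a run_args list like the one above and extracts every value passed to --network, in either the two-token or the --network=value form. The network_modes and is_isolated functions are hypothetical helpers, not part of mini-swe-agent, and the sketch assumes run_args is passed to docker run as a flat list of strings.

```python
def network_modes(run_args):
    """Return every value given to --network in a docker run argument list."""
    modes = []
    args = list(run_args)
    for i, arg in enumerate(args):
        if arg == "--network" and i + 1 < len(args):
            # Two-token form: "--network", "none"
            modes.append(args[i + 1])
        elif arg.startswith("--network="):
            # Single-token form: "--network=none"
            modes.append(arg.split("=", 1)[1])
    return modes


def is_isolated(run_args):
    """True only when the sole network setting is 'none'."""
    return network_modes(run_args) == ["none"]


if __name__ == "__main__":
    print(is_isolated(["--rm", "--network", "none"]))  # prints True
    print(is_isolated(["--network", "bridge"]))        # prints False
```

Note that an empty run_args list is reported as not isolated, since a container started without any --network flag joins Docker's default bridge network.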

How ProgramBench differs from SWE-bench

| | SWE-bench | ProgramBench |
| --- | --- | --- |
| Task type | Patch a bug in existing Python source | Reconstruct source code from a compiled binary |
| Reference code | Provided (the repository) | Not provided |
| Evaluation | Patch passes test suite | Rebuilt executable passes test suite |
| Dataset source | GitHub issues | Compiled programs |
| Output file | preds.json (patches) | submission.tar.gz (workspace archive) |

For general batch evaluation concepts — cost limits, KeyboardInterrupt behavior, Docker troubleshooting, and the Singularity backend — see the SWE-bench FAQ. All of those entries apply equally to ProgramBench runs.
