Run mini-swe-agent on ProgramBench

ProgramBench is a reverse-engineering benchmark: the agent is dropped into a Docker container with a compiled binary and must produce a fresh source codebase that reproduces the binary’s behavior. No reference source code is provided; solutions are evaluated by running tests against the rebuilt executable. This makes ProgramBench significantly harder than traditional code-repair benchmarks like SWE-bench, since the agent must reason about program behavior from artifacts alone.

mini-swe-agent supports ProgramBench through the mini-extra programbench command, which runs in the same parallel batch mode as the SWE-bench runner. Its output is directly compatible with programbench eval.

The ProgramBench Docker containers (programbench/<instance>:task_cleanroom) are built for the x86 Linux architecture. They may not run on ARM or other architectures.

Prerequisites

Install the programbench package

The runner imports programbench to discover task instances. Install it before running:

pip install programbench

Ensure Docker is available

Each task runs inside a container tagged :task_cleanroom, an image that contains no build artifacts. The Docker daemon must be running and accessible before you start the benchmark.
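
A quick preflight check can catch a missing prerequisite before a long batch starts. The sketch below is not part of mini-swe-agent: the helper names `is_x86_machine` and `preflight` are hypothetical. It only checks that the `programbench` package is importable, that a `docker` executable is on `PATH`, and that the host reports an x86 architecture. It does not confirm that the Docker daemon is actually running.

```python
import importlib.util
import platform
import shutil


def is_x86_machine(machine: str) -> bool:
    """Return True if a platform.machine() string looks like an x86 host."""
    return machine.lower() in {"x86_64", "amd64", "i386", "i686", "x86"}


def preflight() -> dict[str, bool]:
    """Collect basic prerequisite checks (hypothetical helper, not a mini-swe-agent API)."""
    return {
        # find_spec locates the package without importing it
        "programbench_importable": importlib.util.find_spec("programbench") is not None,
        # only checks that the CLI exists, not that the daemon is reachable
        "docker_on_path": shutil.which("docker") is not None,
        "x86_host": is_x86_machine(platform.machine()),
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

Running `docker info` by hand is still the most direct way to confirm that the daemon itself is reachable.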

Running the benchmark

mini-extra programbench \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --workers 4

You can also call the script directly:

python src/minisweagent/run/benchmarks/programbench.py --help

Basic flags

Flag	Description
`-o`, `--output`	Output directory (default: timestamped `programbench_results_<ts>/`)
`-m`, `--model`	Model to use
`-c`, `--config`	Config file(s), filenames, or key=value pairs. Defaults to the built-in `programbench.yaml`
`-w`, `--workers`	Number of parallel worker threads (default: `1`)

Data selection flags

Flag	Description
`--slice`	Slice of instances to run, e.g. `0:5` for the first five
`--filter`	Filter instance IDs by regex
`--shuffle`	Shuffle instance order (default: `false`)
`--redo-existing`	Re-run instances that already have a `submission.tar.gz` (default: `false`)
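
To make the selection flags concrete, here is a minimal sketch of how a shuffle, a regex filter, and a slice such as `0:5` could combine. It illustrates the idea only: the function `select_instances`, the order of operations (shuffle, then filter, then slice), the use of `re.match`, and the instance IDs in the usage note are all assumptions, so check the runner itself for the exact behavior.

```python
import random
import re


def select_instances(ids, filter_regex="", slice_spec="", shuffle=False, seed=0):
    """Apply shuffle, regex filter, then slice to a list of instance IDs.

    The order and matching semantics here are assumed; the real runner may differ.
    """
    selected = list(ids)
    if shuffle:
        # a fixed seed keeps the shuffled order reproducible between runs
        random.Random(seed).shuffle(selected)
    if filter_regex:
        pattern = re.compile(filter_regex)
        # re.match anchors at the start of the ID (an assumption of this sketch)
        selected = [i for i in selected if pattern.match(i)]
    if slice_spec:
        # "0:5" -> slice(0, 5); an empty part is open-ended, as in Python slicing
        parts = [int(p) if p else None for p in slice_spec.split(":")]
        selected = selected[slice(*parts)]
    return selected
```

With the made-up IDs `["grep_a", "grep_b", "sed_a", "awk_a", "grep_c"]`, a filter of `grep` combined with a slice of `1:` keeps `grep_b` and `grep_c` under these assumptions.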

Advanced flags

Flag	Description
`--environment-class`	Environment backend: `docker` or `singularity`
`--model-class`	Fully-qualified model class to use
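
When launching runs from a script, the flags documented above can be assembled into an argument list. The sketch below is a hypothetical wrapper (`build_command` is not a mini-swe-agent function) and uses only flags listed in the tables. It assumes `--shuffle` and `--redo-existing` are plain boolean switches that take no value, which the tables suggest but do not state.

```python
import shlex


def build_command(model, workers=1, output=None, slice_spec=None,
                  filter_regex=None, shuffle=False, redo_existing=False):
    """Build an argument list for `mini-extra programbench` (hypothetical helper)."""
    cmd = ["mini-extra", "programbench", "--model", model, "--workers", str(workers)]
    if output:
        cmd += ["--output", str(output)]
    if slice_spec:
        cmd += ["--slice", slice_spec]
    if filter_regex:
        cmd += ["--filter", filter_regex]
    if shuffle:
        cmd.append("--shuffle")  # assumed to be a value-less switch
    if redo_existing:
        cmd.append("--redo-existing")  # assumed to be a value-less switch
    return cmd


if __name__ == "__main__":
    cmd = build_command("anthropic/claude-sonnet-4-5-20250929", workers=4, slice_spec="0:5")
    # shlex.join quotes arguments so the printed line can be pasted into a shell
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would start the run. Building a list instead of a single string avoids shell-quoting problems with regex filters.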

Output layout

Each instance writes two files under <output>/<instance_id>/:

submission.tar.gz — the agent’s /workspace directory, gzip-compressed
<instance_id>.traj.json — the full agent trajectory

Pass the output directory directly to programbench eval for scoring:

programbench eval <output>/
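
Before scoring, it can help to confirm that every instance directory contains both files. The following sketch assumes that each subdirectory of the output directory is an instance directory named after its instance ID, as the layout above describes. The helper `find_incomplete` is hypothetical and not part of either tool.

```python
from pathlib import Path


def find_incomplete(output_dir):
    """Return {instance_id: [missing file names]} for directories lacking expected files."""
    incomplete = {}
    for instance_dir in sorted(Path(output_dir).iterdir()):
        if not instance_dir.is_dir():
            continue  # skip loose files in the output directory
        expected = ["submission.tar.gz", f"{instance_dir.name}.traj.json"]
        missing = [name for name in expected if not (instance_dir / name).is_file()]
        if missing:
            incomplete[instance_dir.name] = missing
    return incomplete
```

An empty result means every instance directory has both its archive and its trajectory. Any instance that is reported can be rerun, for example by narrowing the run with `--filter`.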

Network isolation

The default config launches each container with --network none. The agent cannot install packages from the internet, clone repositories, or download source tarballs. The entire reverse-engineering task must be completed offline using only the provided binary and any documentation bundled in the container.

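If the agent needs network access, override environment.run_args in your config file. Note that the bridge setting shown here attaches the container to the default Docker bridge network, which restores general outbound access; it does not restrict traffic to specific hosts: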

environment:
  run_args:
    - "--network"
    - "bridge"
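
When several config files or key=value overrides are combined, it is easy to lose track of which network setting a run ends up with. As a minimal sketch, the hypothetical helper below (`network_modes` is not a mini-swe-agent function) extracts the `--network` values from a `run_args` list. It assumes the list has already been loaded from the config into Python.

```python
def network_modes(run_args):
    """Return every value passed to --network in a docker run argument list."""
    modes = []
    args = list(run_args)
    for i, arg in enumerate(args):
        if arg == "--network" and i + 1 < len(args):
            # two-token form: "--network", "none"
            modes.append(args[i + 1])
        elif arg.startswith("--network="):
            # single-token form: "--network=none"
            modes.append(arg.split("=", 1)[1])
    return modes


def is_isolated(run_args):
    """True only if the sole network setting in the list is 'none'."""
    return network_modes(run_args) == ["none"]
```

A run is offline in the sense described above only when `is_isolated` returns True for the effective `run_args`.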

How ProgramBench differs from SWE-bench

	SWE-bench	ProgramBench
Task type	Patch a bug in existing Python source	Reconstruct source code from a compiled binary
Reference code	Provided (the repository)	Not provided
Evaluation	Patch passes test suite	Rebuilt executable passes test suite
Dataset source	GitHub issues	Compiled programs
Output file	`preds.json` (patches)	`submission.tar.gz` (workspace archive)

For general batch evaluation concepts — cost limits, KeyboardInterrupt behavior, Docker troubleshooting, and the Singularity backend — see the SWE-bench FAQ. All of those entries apply equally to ProgramBench runs.
