Evaluate on SWE-bench with mini-swe-agent

SWE-bench is a benchmark of real-world GitHub issues drawn from popular Python repositories. Each task gives the agent a repository snapshot and a problem statement; the agent must produce a patch that fixes the issue. mini-swe-agent ships two scripts for running on SWE-bench: mini-extra swebench for large-scale parallel evaluation and mini-extra swebench-single for interactive debugging on a single instance.

The SWE-bench Docker containers are built for x86 Linux architecture. You may not be able to run them on ARM or other architectures.

Running the benchmark

Batch mode

Batch mode loads all task instances from the dataset and processes them in parallel using a thread pool. Results are written to a preds.json file as each instance completes.

mini-extra swebench \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --subset verified \
    --split test \
    --workers 4
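
The thread-pool pattern described above can be sketched roughly as follows. This is a minimal illustration, not mini-swe-agent's actual implementation: `solve_instance`, `save_prediction`, `process_instance`, and `run_batch` are hypothetical names, and the `preds.json` layout (a JSON object keyed by instance ID) is an assumption.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

# One lock guards every read-modify-write of preds.json.
_preds_lock = threading.Lock()


def solve_instance(instance: dict) -> str:
    """Hypothetical stand-in for one agent run; returns a patch string."""
    return f"patch for {instance['instance_id']}"


def save_prediction(preds_path: Path, instance_id: str, patch: str) -> None:
    """Merge one result into preds.json while holding the lock."""
    with _preds_lock:
        preds = json.loads(preds_path.read_text()) if preds_path.exists() else {}
        preds[instance_id] = {"instance_id": instance_id, "model_patch": patch}
        preds_path.write_text(json.dumps(preds, indent=2))


def process_instance(instance: dict, preds_path: Path) -> None:
    """Worker body: solve one instance and persist its result immediately."""
    patch = solve_instance(instance)
    save_prediction(preds_path, instance["instance_id"], patch)


def run_batch(instances: list[dict], preds_path: Path, workers: int = 4) -> None:
    """Run instances not yet in preds.json using a pool of worker threads."""
    done = set(json.loads(preds_path.read_text())) if preds_path.exists() else set()
    pending = [i for i in instances if i["instance_id"] not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_instance, i, preds_path) for i in pending]
        for future in as_completed(futures):
            future.result()  # re-raise any exception from the worker
```

Writing each result as soon as its worker finishes means an interrupted run keeps everything completed so far, which is what lets a later run skip those instances.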

Basic flags

Flag	Description
`-o`, `--output`	Output directory for trajectories and `preds.json`
`-m`, `--model`	Model to use (e.g., `anthropic/claude-sonnet-4-5-20250929`)
`-c`, `--config`	Path to a config file, filename, or key=value pair. Defaults to the built-in `swebench.yaml`. If you set this flag, the default is no longer loaded automatically, so include it explicitly: `-c swebench.yaml -c model.model_kwargs.temperature=0.5`
`-w`, `--workers`	Number of parallel worker threads (default: `1`)

Data selection flags

Flag	Description
`--subset`	SWE-bench subset: `lite`, `verified`, `full`, `multimodal`, `multilingual`, `smith`, `rebench`, or a path to a local dataset (default: `lite`)
`--split`	Dataset split, e.g. `dev` or `test` (default: `dev`)
`--slice`	Slice of instances to run, e.g. `0:5` for the first five
`--filter`	Filter instance IDs by regex
`--shuffle`	Shuffle the instance order before running (default: `false`)
`--redo-existing`	Re-run instances that already have entries in `preds.json` (default: `false`)
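
To make the `--filter` and `--slice` semantics concrete, here is a small sketch of how a regex filter and a `start:stop` slice could be applied to a list of instance IDs. The helper `select_instances` is hypothetical, and two details are assumptions rather than confirmed behavior: that the regex is matched from the start of the ID, and that filtering happens before slicing.

```python
import re


def select_instances(instance_ids, filter_regex="", slice_spec=""):
    """Keep IDs matching a regex, then apply a 'start:stop[:step]' slice."""
    # Assumption: the pattern is anchored at the start of the ID (re.match).
    selected = [i for i in instance_ids if re.match(filter_regex, i)]
    if slice_spec:
        # "0:5" -> slice(0, 5); an empty part such as "1:" becomes None.
        parts = [int(p) if p else None for p in slice_spec.split(":")]
        selected = selected[slice(*parts)]
    return selected


ids = ["django__django-1", "sympy__sympy-2", "sympy__sympy-3", "astropy__astropy-4"]
first_two = select_instances(ids, slice_spec="0:2")
sympy_only = select_instances(ids, filter_regex="sympy")
```

Under these assumptions, combining both options slices the already filtered list, so `--filter sympy --slice 0:1` would yield the first matching sympy instance rather than the first instance of the dataset.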

Advanced flags

Flag	Description
`--environment-class`	Environment backend to use. Recommended values: `docker` or `singularity`

You can also invoke the script directly: python src/minisweagent/run/benchmarks/swebench.py --help

Single instance (debugging)

Single-instance mode runs on one task with live output — useful for debugging prompts or inspecting agent behavior. Unlike batch mode, it does not produce a preds.json file.

# by instance ID
mini-extra swebench-single \
    --subset verified \
    --split test \
    --model anthropic/claude-sonnet-4-5-20250929 \
    -i sympy__sympy-15599

# by numeric index
mini-extra swebench-single \
    --subset verified \
    --split test \
    -m anthropic/claude-sonnet-4-5-20250929 \
    -i 0

Basic flags

Flag	Description
`-m`, `--model`	Model to use
`-c`, `--config`	Path to a config file (default: `swebench.yaml`)
`-o`, `--output`	Output trajectory file (default: saves to the global config directory)

Data selection flags

Flag	Description
`--subset`	SWE-bench subset or path to a local dataset (default: `lite`)
`--split`	Dataset split (default: `dev`)
`-i`, `--instance`	Instance ID string or numeric index (default: `0`)

Advanced flags

Flag	Description
`--environment-class`	Environment backend: `docker` or `singularity`
`--exit-immediately`	Skip the confirmation prompt at exit (default: `false`)

Pass --exit-immediately to avoid the interactive confirmation when the agent finishes — useful in scripts or CI.

Evaluating results

After a batch run completes, the output directory contains a preds.json file. You can evaluate it using SWE-bench’s free cloud service or a local installation.

Cloud-based (sb-cli)

The sb-cli is the fastest way to evaluate — it’s free and typically returns results within 20 minutes, regardless of how many instances you ran.

Install sb-cli and get a token

Follow the setup instructions at swebench.com/sb-cli.

Submit predictions

sb-cli submit swe-bench_verified test \
    --predictions_path preds.json \
    --run_id some-id-for-your-run

The evaluation time is determined by the slowest instance in SWE-bench, not the number of instances you submitted.

Local evaluation

You can run evaluation locally using a clone of the SWE-bench repository.

python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Verified \
    --predictions_path preds.json \
    --max_workers <num_workers> \
    --run_id <run_id>

FAQ

Can I set global cost limits?

Yes. Use the MSWEA_GLOBAL_CALL_LIMIT and MSWEA_GLOBAL_COST_LIMIT environment variables, or set them in the global config file. See configuration for details.

What happens to uncompleted tasks when I abort with KeyboardInterrupt?

Trajectories are only written upon completion, so aborting mid-run is generally safe. On the next run, the script will skip any instances already present in preds.json. However, check preds.json for entries with KeyboardInterrupt as the model patch — these were saved in an aborted state and should be removed or rerun with --redo-existing.
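
A quick way to spot such entries is to scan `preds.json` for them. The sketch below assumes the file is a JSON object keyed by instance ID, that each entry has a `model_patch` field, and that an aborted entry stores exactly the string `KeyboardInterrupt` there; `find_aborted` is a hypothetical helper, not part of mini-swe-agent.

```python
import json
from pathlib import Path


def find_aborted(preds_path, marker="KeyboardInterrupt"):
    """Return sorted instance IDs whose model_patch equals the marker."""
    preds = json.loads(Path(preds_path).read_text())
    return sorted(
        instance_id
        for instance_id, entry in preds.items()
        # Assumption: aborted runs store the bare marker as the patch.
        if (entry.get("model_patch") or "").strip() == marker
    )
```

Calling `find_aborted("preds.json")` returns the IDs to remove or rerun. An exact match is used so that a genuine patch that merely mentions `KeyboardInterrupt` is not flagged.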

Certain tasks are stuck even though I deleted their trajectories

Completed instances are tracked in preds.json, not by the presence of trajectory files. Remove the relevant entries from preds.json directly, then rerun.
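
Editing the file by hand works, but for more than a few instances a short script is less error-prone. This is a sketch under the assumption that `preds.json` is a JSON object keyed by instance ID; `remove_entries` is a hypothetical helper, and it is worth keeping a backup copy of the file before running anything like it.

```python
import json
from pathlib import Path


def remove_entries(preds_path, instance_ids):
    """Delete the given instance IDs from preds.json; return those removed."""
    path = Path(preds_path)
    preds = json.loads(path.read_text())
    # pop() with a default ignores IDs that are not present in the file.
    removed = [i for i in instance_ids if preds.pop(i, None) is not None]
    path.write_text(json.dumps(preds, indent=2))
    return removed
```

After `remove_entries("preds.json", ["sympy__sympy-15599"])`, the next batch run no longer finds that instance in `preds.json` and processes it again.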

How can I run on a custom or different dataset?

Any dataset that follows the SWE-bench format and is loadable via datasets.load_dataset(path, split=split) works. Pass the path with --subset /path/to/your/dataset.
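
Before starting a long run on a custom dataset, it can help to check that every row carries the fields an agent run needs. The sketch below is an assumption-laden example: `instance_id` and `problem_statement` are standard SWE-bench columns, but treating exactly these two as required is a guess, and `missing_fields` is a hypothetical helper.

```python
# Assumption: these two SWE-bench columns are the minimum a run needs.
REQUIRED_FIELDS = ("instance_id", "problem_statement")


def missing_fields(rows, required=REQUIRED_FIELDS):
    """Map row index -> list of required fields that are absent or empty."""
    problems = {}
    for index, row in enumerate(rows):
        missing = [field for field in required if not row.get(field)]
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"instance_id": "demo-1", "problem_statement": "Crash on empty input"},
    {"instance_id": "demo-2"},
]
report = missing_fields(rows)
```

With the Hugging Face `datasets` package installed, the same check can be run on real data by passing the result of `datasets.load_dataset(path, split=split)` as `rows`, since iterating over a loaded split yields one dictionary per row.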

Some instances are stuck at 'initializing task' for a long time

The runner is likely pulling Docker images for the first time. Once the images are cached locally, subsequent runs should start immediately. If you see docker pull timeouts, increase the timeout via environment.pull_timeout in your config (default is 120 seconds).

I'm having Docker issues

Run the Docker command printed in the console manually to inspect errors. Verify the container is running with docker ps, then test access with:

docker exec -it <container-id> ls

Docker isn't available on my HPC cluster (Singularity/Apptainer)

Use the Singularity/Apptainer backend. Either pass --environment-class singularity on the command line, or set it in your config file:

environment:
  environment_class: singularity

See the configuration guide for more options.

Can I run a startup command in the environment?

Yes. Use run.env_startup_command in your config. The command is rendered with Jinja2 using the instance variables as template context:

run:
  env_startup_command: "apt-get update && apt-get install -y python3-pip"

You can also reference instance fields:

run:
  env_startup_command: "git clone {{ repo_url }} ."

This is particularly useful with environments like bubblewrap that don’t pre-install dependencies.
