SWE-bench is a benchmark of real-world GitHub issues drawn from popular Python repositories. Each task gives the agent a repository snapshot and a problem statement; the agent must produce a patch that fixes the issue. mini-swe-agent ships two scripts for running on SWE-bench: mini-extra swebench for large-scale parallel evaluation and mini-extra swebench-single for interactive debugging on a single instance.
The SWE-bench Docker containers are built for x86 Linux architecture. You may not be able to run them on ARM or other architectures.

Running the benchmark

Batch mode loads all task instances from the dataset and processes them in parallel using a thread pool. Results are written to a preds.json file as each instance completes.
mini-extra swebench \
    --model anthropic/claude-sonnet-4-5-20250929 \
    --subset verified \
    --split test \
    --workers 4
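The batch behaviour described above (a thread pool, with preds.json updated as each instance finishes) can be pictured with the following minimal sketch. It is not the actual runner: `solve_instance`, `save_prediction`, and `run_batch` are hypothetical names, and the preds.json layout (a mapping from instance ID to an entry with a `model_patch` field) is an assumption.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

# One lock shared by all workers so concurrent writes do not clobber each other.
_preds_lock = threading.Lock()


def solve_instance(instance: dict) -> str:
    """Hypothetical stand-in for running the agent; returns a patch string."""
    return f"patch for {instance['instance_id']}"


def save_prediction(preds_path: Path, instance_id: str, patch: str) -> None:
    """Merge one finished result into preds.json (assumed layout)."""
    with _preds_lock:
        preds = json.loads(preds_path.read_text()) if preds_path.exists() else {}
        preds[instance_id] = {"instance_id": instance_id, "model_patch": patch}
        preds_path.write_text(json.dumps(preds, indent=2))


def process_instance(instance: dict, preds_path: Path) -> str:
    """Work done by one worker thread: solve, then persist immediately."""
    patch = solve_instance(instance)
    save_prediction(preds_path, instance["instance_id"], patch)
    return instance["instance_id"]


def run_batch(instances: list, preds_path: Path, workers: int = 4) -> list:
    """Process all instances in parallel and return IDs in completion order."""
    finished = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_instance, inst, preds_path) for inst in instances]
        for future in as_completed(futures):
            finished.append(future.result())
    return finished
```

Because each worker persists its own result as soon as it is done, an interrupted run in this sketch keeps everything that finished before the interruption.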
Basic flags

| Flag | Description |
| --- | --- |
| -o, --output | Output directory for trajectories and preds.json |
| -m, --model | Model to use (e.g., anthropic/claude-sonnet-4-5-20250929) |
| -c, --config | Path to a config file, filename, or key=value pair. Defaults to the built-in swebench.yaml. If you set this flag, the default is not loaded automatically; include it explicitly: -c swebench.yaml -c model.model_kwargs.temperature=0.5 |
| -w, --workers | Number of parallel worker threads (default: 1) |
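To illustrate how a dotted key=value pair such as model.model_kwargs.temperature=0.5 could layer on top of a base config, here is a minimal sketch. It is not mini-swe-agent's own parsing code: `parse_override` and `deep_merge` are hypothetical helpers, the base config contents are made up, and interpreting values as JSON is an assumption.

```python
import json
from typing import Any


def parse_override(spec: str) -> dict:
    """Turn 'a.b.c=value' into {'a': {'b': {'c': value}}}."""
    dotted_key, _, raw_value = spec.partition("=")
    try:
        # Assumption: numbers, booleans and quoted strings are read as JSON.
        value: Any = json.loads(raw_value)
    except json.JSONDecodeError:
        value = raw_value  # anything else stays a plain string
    nested: Any = value
    for part in reversed(dotted_key.split(".")):
        nested = {part: nested}
    return nested


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict in which override wins; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Made-up base config standing in for the contents of a config file.
base = {"model": {"model_name": "example", "model_kwargs": {"temperature": 0.0}}}
config = deep_merge(base, parse_override("model.model_kwargs.temperature=0.5"))
print(config["model"]["model_kwargs"])  # {'temperature': 0.5}
```

The sketch also shows why the default config has to be passed explicitly alongside an override: a lone key=value pair carries only the keys it names.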
Data selection flags

| Flag | Description |
| --- | --- |
| --subset | SWE-bench subset: lite, verified, full, multimodal, multilingual, smith, rebench, or a path to a local dataset (default: lite) |
| --split | Dataset split, e.g. dev or test (default: dev) |
| --slice | Slice of instances to run, e.g. 0:5 for the first five |
| --filter | Filter instance IDs by regex |
| --shuffle | Shuffle the instance order before running (default: false) |
| --redo-existing | Re-run instances that already have entries in preds.json (default: false) |
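The selection flags can be thought of as a small pipeline over the list of instances. The sketch below is illustrative only: `select_instances` is a hypothetical function, and the order of operations (shuffle, then filter, then slice), the fixed seed, and matching the regex at the start of the ID are all assumptions rather than documented behaviour.

```python
import random
import re


def select_instances(instances, filter_regex="", slice_spec="", shuffle=False, seed=0):
    """Apply shuffle, regex filter, and slice to a list of instance dicts."""
    selected = list(instances)
    if shuffle:
        # Seeded so that repeated runs see the same order (assumption).
        random.Random(seed).shuffle(selected)
    if filter_regex:
        # re.match anchors at the start of the instance ID (assumption).
        selected = [i for i in selected if re.match(filter_regex, i["instance_id"])]
    if slice_spec:
        # "0:5" -> slice(0, 5); empty parts such as ":5" become None.
        bounds = [int(part) if part else None for part in slice_spec.split(":")]
        selected = selected[slice(*bounds)]
    return selected


# Made-up instance IDs in the usual repo__name-number style.
demo = [{"instance_id": f"django__django-{n}"} for n in range(10)]
demo.append({"instance_id": "sympy__sympy-1"})
first_five = select_instances(demo, filter_regex=r"django", slice_spec="0:5")
print([i["instance_id"] for i in first_five])
```

Under these assumptions, a slice counts positions in the already filtered list, so 0:5 yields the first five matching instances.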
Advanced flags

| Flag | Description |
| --- | --- |
| --environment-class | Environment backend to use. Recommended values: docker or singularity |
You can also invoke the script directly: python src/minisweagent/run/benchmarks/swebench.py --help

Evaluating results

After a batch run completes, the output directory contains a preds.json file. You can evaluate it using SWE-bench’s free cloud service or a local installation.
The sb-cli is the fastest way to evaluate: it is free and typically returns results within 20 minutes, regardless of how many instances you ran.
1. Install sb-cli and get a token

Follow the setup instructions at swebench.com/sb-cli.
2. Submit predictions

sb-cli submit swe-bench_verified test \
    --predictions_path preds.json \
    --run_id some-id-for-your-run
The evaluation time is determined by the slowest instance in SWE-bench, not the number of instances you submitted.

FAQ

To cap the total number of model calls or the total cost across a run, set the MSWEA_GLOBAL_CALL_LIMIT and MSWEA_GLOBAL_COST_LIMIT environment variables, or set them in the global config file. See configuration for details.
Trajectories are only written upon completion, so aborting mid-run is generally safe. On the next run, the script skips any instances already present in preds.json. However, check preds.json for entries with KeyboardInterrupt as the model patch. These were saved in an aborted state and should be removed or rerun with --redo-existing.
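A quick way to spot such aborted entries is to scan preds.json, as in this minimal sketch. It assumes preds.json is a JSON object mapping each instance ID to an entry with a `model_patch` field, and it flags any patch containing the text KeyboardInterrupt; `find_aborted` is a hypothetical helper, not part of mini-swe-agent.

```python
import json
from pathlib import Path


def find_aborted(preds_path) -> list:
    """Return sorted instance IDs whose model patch mentions KeyboardInterrupt."""
    preds = json.loads(Path(preds_path).read_text())
    return sorted(
        instance_id
        for instance_id, entry in preds.items()
        # Missing or null patches are treated as empty strings.
        if "KeyboardInterrupt" in str(entry.get("model_patch") or "")
    )


if __name__ == "__main__":
    import sys

    # Usage: python find_aborted.py path/to/preds.json
    if len(sys.argv) > 1:
        for instance_id in find_aborted(sys.argv[1]):
            print(instance_id)
```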
To rerun specific instances, remove their entries from preds.json directly, then rerun. Completed instances are tracked in preds.json, not by the presence of trajectory files.
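Removing entries by hand is error-prone for large files, so a small script can help. This is a sketch under the assumption that preds.json is a JSON object keyed by instance ID; `drop_entries` is a hypothetical helper and the instance ID in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def drop_entries(preds_path, instance_ids) -> list:
    """Delete the given instance IDs from preds.json; return the IDs removed."""
    path = Path(preds_path)
    preds = json.loads(path.read_text())
    removed = [instance_id for instance_id in instance_ids if instance_id in preds]
    for instance_id in removed:
        del preds[instance_id]
    # Rewrite the file only if something actually changed.
    if removed:
        path.write_text(json.dumps(preds, indent=2))
    return removed


# Example usage (placeholder ID):
# drop_entries("output/preds.json", ["some__repo-123"])
```

IDs that are not present in the file are ignored, so the same list can be passed again safely.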
Any dataset that follows the SWE-bench format and is loadable via datasets.load_dataset(path, split=split) works. Pass the path with --subset /path/to/your/dataset.
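Before pointing the runner at a custom dataset, it can help to check that every record carries the fields you expect. The sketch below assumes `instance_id` and `problem_statement` as a minimum; both are standard SWE-bench fields, but which fields the runner actually requires is an assumption here, and `missing_fields` is a hypothetical helper.

```python
# Assumed minimum set of fields; adjust to whatever your setup needs.
REQUIRED_FIELDS = ("instance_id", "problem_statement")


def missing_fields(records, required=REQUIRED_FIELDS) -> dict:
    """Map each incomplete record (by ID or row number) to its missing fields."""
    problems = {}
    for index, record in enumerate(records):
        missing = [field for field in required if field not in record]
        if missing:
            label = record.get("instance_id", f"row {index}")
            problems[label] = missing
    return problems


# With the Hugging Face datasets library installed, the check could be run as:
#   from datasets import load_dataset
#   ds = load_dataset("/path/to/your/dataset", split="test")
#   print(missing_fields(ds))
```

An empty result means every record has the assumed fields; it does not by itself guarantee that the dataset is compatible with the runner.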
If a run appears to stall before any instance starts, the runner is likely pulling Docker images for the first time. Subsequent runs should start immediately once the images are cached locally. If you see docker pull timeouts, increase the timeout via environment.pull_timeout in your config (default: 120 seconds).
If a Docker container fails to start or respond, run the Docker command printed in the console manually to inspect errors. Verify the container is running with docker ps, then test access with:
docker exec -it <container-id> ls
To run without Docker, use the Singularity/Apptainer backend. Either pass --environment-class singularity on the command line, or set it in your config file:
environment:
  environment_class: singularity
See the configuration guide for more options.
To run a command inside each environment at startup, set run.env_startup_command in your config. The command is rendered with Jinja2 using the instance variables as template context:
run:
  env_startup_command: "apt-get update && apt-get install -y python3-pip"
You can also reference instance fields:
run:
  env_startup_command: "git clone {{ repo_url }} ."
This is particularly useful with environments like bubblewrap that don’t pre-install dependencies.
