Running experiments

Run benchmark experiments to evaluate AI agents on cooperative software engineering tasks.

Quick start

Run a simple experiment, leaving every option except the setting and subset at its default:

cooperbench run --setting solo -s lite

This will:

Run in solo mode (single agent per task)
Use the “lite” subset of tasks
Auto-generate an experiment name like solo-msa-gemini-3-flash-lite
Save results to logs/

Basic usage

Choose a setting

CooperBench supports two evaluation settings:

Cooperative (coop): Two agents collaborate on implementing two features

cooperbench run --setting coop -s lite

Solo (solo): A single agent implements both features on its own

cooperbench run --setting solo -s lite

Select tasks to run

Filter tasks using subsets, repositories, or task IDs:

Use predefined task collections:

cooperbench run --setting solo -s lite

Common subsets:

lite - Small subset for quick testing
dev - Development subset
Custom subsets in dataset/subsets/

Run all tasks from a specific repository:

cooperbench run --setting solo -r llama_index_task

Run a specific task:

cooperbench run --setting solo -r llama_index_task -t 8394

Run a specific feature pair within a task:

cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2
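
When scripting many runs, it can help to assemble the filters programmatically. The following is a minimal sketch, not part of CooperBench: build_run_command is a hypothetical helper that uses only the flags documented on this page (--setting, -s, -r, -t, -f) and prints the command rather than executing it.

```python
import shlex
from typing import List, Optional


def build_run_command(
    setting: str,
    subset: Optional[str] = None,
    repo: Optional[str] = None,
    task: Optional[int] = None,
    features: Optional[str] = None,
) -> List[str]:
    """Assemble a `cooperbench run` argument list from the documented filters."""
    cmd = ["cooperbench", "run", "--setting", setting]
    if subset is not None:
        cmd += ["-s", subset]
    if repo is not None:
        cmd += ["-r", repo]
    if task is not None:
        cmd += ["-t", str(task)]
    if features is not None:
        cmd += ["-f", features]
    return cmd


# Print the command instead of running it. To launch the experiment,
# pass the list to subprocess.run(cmd, check=True).
cmd = build_run_command("solo", repo="llama_index_task", task=8394, features="1,2")
print(shlex.join(cmd))
# → cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2
```

Building the command as a list avoids shell-quoting mistakes when a value contains spaces or other special characters.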

Name your experiment

Provide a custom name or let CooperBench auto-generate one:

# Custom name
cooperbench run -n my-experiment --setting solo -s lite

# Auto-generated (recommended)
cooperbench run --setting solo -s lite
# → solo-msa-gemini-3-flash-lite

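Auto-generated names combine the setting, agent, model, and filters, which makes experiments easy to identify. In solo-msa-gemini-3-flash-lite, for example, solo is the setting, msa abbreviates mini_swe_agent, gemini-3-flash comes from the model, and lite is the subset.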

Run the experiment

Execute and monitor progress:

cooperbench run --setting coop -s lite

Output shows:

Task progress with status indicators
Cost tracking per task
Automatic evaluation results (if enabled)
Summary statistics

Command reference

Basic options

-n, --name (string)

Experiment name. Auto-generated if not provided.

cooperbench run -n my-experiment --setting solo -s lite

--setting (enum, default: "coop")

Evaluation setting: coop (cooperative, two agents) or solo (single agent)

cooperbench run --setting solo -s lite

-s, --subset (string)

Use a predefined task subset from dataset/subsets/

cooperbench run --setting solo -s lite

-r, --repo (string)

Filter by repository name

cooperbench run --setting solo -r llama_index_task

-t, --task (integer)

Filter by specific task ID

cooperbench run --setting solo -r llama_index_task -t 8394

-f, --features (string)

Specific feature pair to run (comma-separated)

cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2

Model and agent

-m, --model (string, default: "vertex_ai/gemini-3-flash-preview")

Model to use. Any LiteLLM-compatible model identifier is accepted.

# OpenAI
cooperbench run --setting solo -s lite -m gpt-4o

# Anthropic
cooperbench run --setting solo -s lite -m claude-3-5-sonnet-20241022

# Google Vertex AI
cooperbench run --setting solo -s lite -m vertex_ai/gemini-3-flash-preview
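
To compare several models on the same subset, the command can be repeated once per model. The sketch below is illustrative rather than a CooperBench feature: run_for_models is a hypothetical helper, the model list reuses the identifiers from the examples above, and dry_run defaults to True so nothing is launched unless you opt in.

```python
import shlex
import subprocess

MODELS = [
    "gpt-4o",
    "claude-3-5-sonnet-20241022",
    "vertex_ai/gemini-3-flash-preview",
]


def run_for_models(models, setting="solo", subset="lite", dry_run=True):
    """Run the same subset once per model; with dry_run, only return the commands."""
    commands = []
    for model in models:
        cmd = ["cooperbench", "run", "--setting", setting, "-s", subset, "-m", model]
        commands.append(cmd)
        if not dry_run:
            # check=True stops the sweep at the first failing run.
            subprocess.run(cmd, check=True)
    return commands


for cmd in run_for_models(MODELS):
    print(shlex.join(cmd))
```

Because the model is part of the auto-generated experiment name, each run in such a sweep should land in its own directory under logs/.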

-a, --agent (string, default: "mini_swe_agent")

Agent framework to use

cooperbench run --setting solo -s lite -a mini_swe_agent

See the custom agents guide to implement your own agent.

--agent-config (string)

Path to agent-specific configuration file

cooperbench run --setting solo -s lite --agent-config config/custom.yaml

Concurrency

-c, --concurrency (integer, default: 30)

Number of tasks to run in parallel

# High parallelism for faster runs
cooperbench run --setting solo -s lite --concurrency 50

# Low parallelism to reduce costs
cooperbench run --setting solo -s lite --concurrency 5

Higher concurrency shortens wall-clock time, but it also raises resource usage and the rate at which API costs accrue.
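
To make the idea of a concurrency limit concrete, the sketch below caps the number of jobs in flight with a thread pool, which is one common way to implement such a limit. It is not CooperBench's scheduler: run_task is a hypothetical stand-in that only sleeps, and the counters exist purely to show the observed parallelism.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Counters used only to observe how many tasks run at the same time.
active = 0
peak = 0
lock = threading.Lock()


def run_task(task_id: int) -> int:
    """Hypothetical stand-in for one benchmark task; it only sleeps."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend to do work
    with lock:
        active -= 1
    return task_id


def run_all(task_ids, concurrency: int):
    """Run every task with at most `concurrency` of them in flight at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_task, task_ids))


results = run_all(range(20), concurrency=5)
print(f"finished {len(results)} tasks, peak parallelism {peak}")
```

The peak never exceeds the configured limit, so raising the limit trades more simultaneous API calls and containers for a shorter total run time.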

Collaboration features

--git (boolean)

Enable git-based collaboration (agents can push/pull/merge)

cooperbench run --setting coop -s lite --git

Only available in cooperative mode. Requires git server setup.

--no-messaging (boolean)

Disable inter-agent messaging

cooperbench run --setting coop -s lite --no-messaging

--redis (string, default: "redis://localhost:6379")

Redis URL for inter-agent communication

cooperbench run --setting coop -s lite --redis redis://myhost:6379

Backend selection

--backend (enum, default: "modal")

Execution backend: modal, docker, or gcp

# Modal (cloud, default)
cooperbench run --setting solo -s lite --backend modal

# Docker (local)
cooperbench run --setting solo -s lite --backend docker

# GCP (Google Cloud)
cooperbench run --setting solo -s lite --backend gcp

See the backends guide for details on each option.

Evaluation

--no-auto-eval (boolean)

Disable automatic evaluation after task completion

cooperbench run --setting solo -s lite --no-auto-eval

By default, tasks are evaluated automatically as they complete.

--eval-concurrency (integer, default: 10)

Number of parallel evaluations for auto-eval

cooperbench run --setting solo -s lite --eval-concurrency 20

Other options

--force (boolean)

Force rerun even if results already exist

cooperbench run --setting solo -s lite --force
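
The skip-unless-forced behaviour can be pictured as a check for an existing result file. This sketch assumes that "results already exist" means a result.json is present in the feature-pair directory described under Output structure; the actual check in CooperBench may differ, and should_run is a hypothetical helper.

```python
from pathlib import Path


def should_run(pair_dir: Path, force: bool = False) -> bool:
    """Return True when the task should be (re)run.

    Assumption: a finished run is marked by result.json in its directory.
    """
    if force:
        return True
    return not (pair_dir / "result.json").exists()


pair_dir = Path("logs/solo-msa-gemini-3-flash-lite/solo/llama_index_task/8394/f1_f2")
print(should_run(pair_dir))              # True only if no result.json is there yet
print(should_run(pair_dir, force=True))  # always True
```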

Examples

Single task with detailed output

Run one task to see detailed agent output:

cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2

Output:

cooperbench solo-msa-gemini-3-flash-llama-index-8394 (solo)
task: llama_index_task/8394 features: [1, 2]
agent: mini_swe_agent
model: vertex_ai/gemini-3-flash-preview

┌───────┬──────────┬───────────┬────────┬────────┬───────┐
│ agent │ feature  │ status    │   cost │  steps │ lines │
├───────┼──────────┼───────────┼────────┼────────┼───────┤
│ solo  │ 1,2      │ Submitted │  $0.42 │     18 │    45 │
└───────┴──────────┴───────────┴────────┴────────┴───────┘

total: $0.42 time: 187s

Cooperative experiment

Run multiple tasks with two agents collaborating:

cooperbench run \
  --setting coop \
  -s lite \
  -m gpt-4o \
  --concurrency 10

Output shows progress:

cooperbench coop-msa-gpt-4o-lite (coop)
tasks: 25 concurrency: 10
agent: mini_swe_agent
model: gpt-4o
tools: messaging

✓ done llama_index_task/8394 [1,2]
  ✓ pass llama_index_task/8394 [1,2]
✓ done dspy_task/142 [1,2]
  ✓ pass dspy_task/142 [1,2]
...

runs:  25 completed
evals: 25 evaluated, 23 passed, 2 failed (92.0%)
cost:  $15.30
time:  8m 42s (agent: 6m 15s)

logs: logs/coop-msa-gpt-4o-lite/coop
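
The percentage on the evals line is consistent with the share of evaluated runs that passed. As a quick check of the numbers above (pass_rate is an illustrative helper, not a CooperBench function):

```python
def pass_rate(passed: int, evaluated: int) -> float:
    """Percentage of evaluated runs that passed; 0.0 when nothing was evaluated."""
    if evaluated == 0:
        return 0.0
    return 100.0 * passed / evaluated


print(f"{pass_rate(23, 25):.1f}%")  # → 92.0%
```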

Solo with git collaboration

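Pass --git together with solo mode. The --git reference above describes git collaboration as available only in cooperative mode, so confirm that the flag has an effect in your version before relying on it: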

cooperbench run \
  --setting solo \
  -s lite \
  --git \
  --backend gcp

Specific model and high concurrency

Run with Claude and high parallelism:

cooperbench run \
  --setting coop \
  -s lite \
  -m claude-3-5-sonnet-20241022 \
  --concurrency 50 \
  --backend gcp

Filter by repository

Run all tasks from a single repository:

cooperbench run \
  --setting solo \
  -r llama_index_task \
  -m gpt-4o

Output structure

Results are saved to logs/{experiment-name}/:

logs/
└── solo-msa-gemini-3-flash-lite/
    ├── config.json              # Experiment configuration
    ├── summary.json             # Aggregate statistics
    └── solo/                    # Setting-specific results
        └── llama_index_task/
            └── 8394/
                └── f1_f2/       # Feature pair results
                    ├── solo.patch        # Generated code changes
                    ├── result.json       # Task execution details
                    ├── eval.json         # Evaluation results
                    └── trajectory.json   # Agent conversation history
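
Given this layout, a short script can list every feature-pair directory and report which files are still missing, for example to spot runs that have not been evaluated yet. This is a sketch under the assumption that the tree above is accurate for your version; list_feature_pairs is a hypothetical helper, and the patch file is left out of the check because its name depends on the setting.

```python
from pathlib import Path

EXPECTED_FILES = ("result.json", "eval.json", "trajectory.json")


def list_feature_pairs(experiment_dir: Path, setting: str):
    """Yield (repo, task, pair, missing_files) for each feature-pair directory.

    Assumes the layout <experiment>/<setting>/<repo>/<task>/<pair>/ shown above.
    """
    root = experiment_dir / setting
    if not root.is_dir():
        return
    for pair_dir in sorted(root.glob("*/*/*")):
        if not pair_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (pair_dir / name).exists()]
        repo, task, pair = pair_dir.parts[-3:]
        yield repo, task, pair, missing


experiment = Path("logs/solo-msa-gemini-3-flash-lite")
for repo, task, pair, missing in list_feature_pairs(experiment, "solo"):
    status = "complete" if not missing else "missing " + ", ".join(missing)
    print(f"{repo}/{task} {pair}: {status}")
```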

Result files

The example below shows experiment-level metadata of the kind recorded in config.json:

{
  "run_name": "solo-msa-gemini-3-flash-lite",
  "agent_framework": "mini_swe_agent",
  "model": "vertex_ai/gemini-3-flash-preview",
  "setting": "solo",
  "concurrency": 30,
  "total_tasks": 25,
  "started_at": "2024-03-15T10:30:00"
}
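
Files like this are plain JSON, so they can be read with any JSON library. The sketch below inlines the example above so that it runs on its own; it assumes the fields shown are the ones present in your files, and in practice you would read logs/{experiment-name}/config.json instead.

```python
import json
from datetime import datetime

# The example from above, inlined so the snippet is self-contained.
CONFIG_TEXT = """
{
  "run_name": "solo-msa-gemini-3-flash-lite",
  "agent_framework": "mini_swe_agent",
  "model": "vertex_ai/gemini-3-flash-preview",
  "setting": "solo",
  "concurrency": 30,
  "total_tasks": 25,
  "started_at": "2024-03-15T10:30:00"
}
"""

config = json.loads(CONFIG_TEXT)
started = datetime.fromisoformat(config["started_at"])

print(f"{config['run_name']}: {config['total_tasks']} tasks in {config['setting']} setting")
print(f"model {config['model']}, started {started:%Y-%m-%d %H:%M}")
```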

Next steps

Evaluation

Learn how to evaluate your experiment results

Backends

Choose the right execution backend for your needs

Custom agents

Implement your own agent framework

GCP setup

Set up Google Cloud Platform backend
