CooperBench supports two evaluation settings:
Cooperative (coop): Two agents collaborate on implementing two features
cooperbench run --setting coop -s lite
Solo (solo): A single agent implements both features independently
cooperbench run --setting solo -s lite
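To compare the two settings on the same tasks, both commands can be generated from one loop. The following is a minimal dry-run sketch: the helper name `print_setting_runs` is hypothetical, and it only prints the commands shown above rather than executing them.

```shell
# Dry-run sketch: print one "cooperbench run" command per setting for a
# given subset. Remove the "echo" to execute the commands for real.
print_setting_runs() {
  subset="$1"
  for setting in coop solo; do
    echo cooperbench run --setting "$setting" -s "$subset"
  done
}

print_setting_runs lite
```

Printing first makes it easy to check the exact invocations before spending API budget on them.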
2. Select tasks to run
Filter tasks using subsets, repositories, task IDs, or feature pairs:
Use predefined task collections:
cooperbench run --setting solo -s lite
Common subsets:
lite - Small subset for quick testing
dev - Development subset
Custom subsets in dataset/subsets/
Run all tasks from a specific repository:
cooperbench run --setting solo -r llama_index_task
Run a specific task:
cooperbench run --setting solo -r llama_index_task -t 8394
Run a specific feature pair within a task:
cooperbench run --setting solo -r llama_index_task -t 8394 -f 1,2
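The repository, task ID, and feature pair filters narrow one another, so the three commands above can be produced by a single helper. This is a sketch only: `narrow_run_cmd` is a hypothetical function name, it uses just the `-r`, `-t`, and `-f` flags shown above, and it prints the command instead of running it.

```shell
# Hypothetical helper: print a solo-setting command, narrowing the filter
# as more arguments are supplied (repository, then task ID, then features).
narrow_run_cmd() {
  repo="$1"
  task="${2:-}"
  features="${3:-}"
  cmd="cooperbench run --setting solo -r $repo"
  if [ -n "$task" ]; then
    cmd="$cmd -t $task"
    # A feature pair is only added when a task ID is also given.
    if [ -n "$features" ]; then
      cmd="$cmd -f $features"
    fi
  fi
  echo "$cmd"
}

narrow_run_cmd llama_index_task
narrow_run_cmd llama_index_task 8394
narrow_run_cmd llama_index_task 8394 1,2
```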
3. Name your experiment
Provide a custom name or let CooperBench auto-generate one:
# Custom name
cooperbench run -n my-experiment --setting solo -s lite
# Auto-generated (recommended)
cooperbench run --setting solo -s lite
# → solo-msa-gemini-3-flash-lite
Auto-generated names include the setting, agent, model, and filters, making experiments easy to identify.
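A custom name may still be useful when the same configuration is run more than once and the runs need to be told apart. The sketch below assumes a naming scheme of our own (`<setting>-<subset>-<YYYYMMDD>`), which is not something CooperBench prescribes, and prints the resulting `-n` invocation rather than executing it.

```shell
# Hypothetical naming scheme: <setting>-<subset>-<YYYYMMDD>.
setting="solo"
subset="lite"
run_name="${setting}-${subset}-$(date +%Y%m%d)"

# Dry run: print the command; remove "echo" to execute it.
echo cooperbench run -n "$run_name" --setting "$setting" -s "$subset"
```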
Choose the LLM with -m. Any LiteLLM-compatible model is supported.
# OpenAI
cooperbench run --setting solo -s lite -m gpt-4o
# Anthropic
cooperbench run --setting solo -s lite -m claude-3-5-sonnet-20241022
# Google Vertex AI
cooperbench run --setting solo -s lite -m vertex_ai/gemini-3-flash-preview
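To evaluate several models on the same subset, the `-m` value can be looped over. The following dry-run sketch uses a hypothetical helper, `print_model_runs`, and only prints the commands; the model names are the ones from the examples above.

```shell
# Dry-run sketch: print one command per model name passed as an argument.
# Remove the "echo" to execute the commands for real.
print_model_runs() {
  for model in "$@"; do
    echo cooperbench run --setting solo -s lite -m "$model"
  done
}

print_model_runs gpt-4o claude-3-5-sonnet-20241022
```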
# High parallelism for faster runs
cooperbench run --setting solo -s lite --concurrency 50
# Low parallelism to reduce costs
cooperbench run --setting solo -s lite --concurrency 5
Higher concurrency increases speed but also API costs and resource usage.
# Modal (cloud, default)
cooperbench run --setting solo -s lite --backend modal
# Docker (local)
cooperbench run --setting solo -s lite --backend docker
# GCP (Google Cloud)
cooperbench run --setting solo -s lite --backend gcp
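A wrapper script might pick the backend based on what is installed locally. The sketch below is one possible policy, not CooperBench behavior: the hypothetical `choose_backend` function selects `docker` when a `docker` executable is on the `PATH` and otherwise falls back to `modal`. It does not check that the Docker daemon is actually running.

```shell
# Hypothetical policy: prefer the local Docker backend when the docker CLI
# is installed, otherwise fall back to the default Modal backend.
choose_backend() {
  if command -v docker >/dev/null 2>&1; then
    echo docker
  else
    echo modal
  fi
}

backend="$(choose_backend)"

# Dry run: print the command; remove "echo" to execute it.
echo cooperbench run --setting solo -s lite --backend "$backend"
```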