Documentation Index Fetch the complete documentation index at: https://mintlify.com/cooperbench/CooperBench/llms.txt
Use this file to discover all available pages before exploring further.
CooperBench is designed as a modular system that can execute agent tasks across different backends while maintaining consistent evaluation standards.
High-level architecture
Core components
Task runner Orchestrates task execution, manages concurrency, and tracks results
Execution backends Provide isolated sandboxes for agent execution (Modal, GCP, Docker)
Communication layer Redis-based messaging for inter-agent communication
Evaluation pipeline Tests merged patches and computes success metrics
Execution backends
CooperBench supports three execution backends, each with different tradeoffs:
Modal (default)
Google Cloud Platform
Docker (local)
Cloud-based serverless execution Modal provides managed containerized sandboxes that scale automatically. Architecture: Features:
Automatic scaling to concurrency limits
Fast cold starts (~5-10 seconds)
Managed infrastructure
GPU support available
Pay-per-use pricing
Setup: # Install Modal
pip install modal
# Authenticate
modal setup
# Run tasks
cooperbench run -n exp --backend modal
Pros:
Zero infrastructure management
Excellent for parallel execution
Fast iteration cycles
Good for small to medium experiments
Cons:
Requires internet connection
Pay-per-use costs
Less control over environment
Enterprise cloud execution GCP provides VM-based execution with fine-grained control. Architecture: Features:
VM-based isolation
Custom machine types
Regional execution
Batch job management
Enterprise security
Setup: # Install GCP dependencies
pip install 'cooperbench[gcp]'
# Configure (interactive wizard)
cooperbench config gcp
# Run tasks
cooperbench run -n exp --backend gcp
Configuration wizard handles:
GCP authentication
Project creation/selection
Service account setup
API enablement
Permissions configuration
Validation testing
Pros:
Production-grade reliability
Fine-grained resource control
Better for large-scale experiments
Integration with GCP services
Cons:
More complex setup
Requires GCP account
Higher minimum costs
Slower cold starts
Local containerized execution Docker provides local execution for development and debugging. Architecture: Features:
Local execution
Full control over environment
No external dependencies
Easy debugging
Reproducible environments
Setup: # Ensure Docker is running
docker ps
# Run tasks
cooperbench run -n exp --backend docker
Pros:
No cloud costs
Works offline
Full debugging access
Fast iteration for development
Cons:
Limited by local resources
Poor parallelization
Manual cleanup needed
Not suitable for large experiments
Backend comparison
Feature Modal GCP Docker Setup complexity Low Medium Low Concurrency High (100+) High (100+) Low (CPU-bound) Cost Usage-based VM-based Free (local) Cold start ~5-10s ~30-60s ~2-5s Internet required Yes Yes No Best for Development, medium scale Production, large scale Local dev, debugging
Agent execution pipeline
When a task runs, CooperBench follows this execution flow:
Task discovery
# Discover tasks based on filters
tasks = discover_tasks(
subset = "lite" ,
repo_filter = "llama_index_task" ,
task_filter = None ,
features_filter = None
)
# Returns: [{"repo": "...", "task_id": 123, "features": [1, 2]}]
Infrastructure setup
Redis : Start or connect to messaging server
Git server (if enabled): Create shared repository
Namespacing : Create unique run ID for isolation
Sandbox initialization
For each agent:
Pull task-specific Docker image
Mount dataset files
Configure environment variables
Set up git remote (if enabled)
Initialize Redis connection
Agent execution
# Load agent framework
runner = get_runner( "mini_swe_agent" )
# Execute task
result = runner.run(
task = feature_description,
image = "cooperbench-llama-index-123" ,
agent_id = "agent1" ,
model_name = "gpt-4o" ,
comm_url = "redis://localhost:6379#run:abc123" ,
git_server_url = "git://git-server:9418" ,
)
Patch extraction
# Extract changes from agent's workspace
git diff HEAD > agent1.patch
Result aggregation
Collect patches from all agents
Extract conversation messages
Compute cost and token metrics
Save trajectories and logs
Redis messaging system
CooperBench uses Redis for real-time agent communication:
Architecture
Message flow
Namespacing : Each run gets unique namespace run:{run_id}
Channels : Per-agent channels run:{run_id}:{agent_id}
Publishing : Agent sends message via send_message command
Subscription : Agents poll for new messages
Delivery : Messages appear in agent’s context as user messages
Example: # Agent 1 publishes
redis.publish(
"run:abc123:agent2" ,
json.dumps({ "from" : "agent1" , "message" : "Starting feature 1" })
)
# Agent 2 receives (polled every N steps)
messages = redis.lrange( "run:abc123:agent2:inbox" , 0 , - 1 )
# Appears in context as:
# "[Message from agent1]: Starting feature 1"
Configuration
# Use local Redis
cooperbench run -n exp --redis redis://localhost:6379
# Use remote Redis
cooperbench run -n exp --redis redis://cloud.redis.com:6379
# Auto-start Redis via Docker
cooperbench run -n exp # detects and starts if needed
# Disable messaging
cooperbench run -n exp --no-messaging
Git collaboration mode
Optional git-based code sharing for agents:
Architecture
How it works
Server creation
# Create git server (per task)
git_server = create_git_server(
backend = "modal" , # or "gcp", "docker"
run_id = "abc123"
)
# Returns: url="git://server:9418"
Agent setup
# Configure remote in agent sandbox
git remote add team git://server:9418
git checkout -b agent1
git push -u team agent1
Collaboration
Agents can use standard git commands: # Push changes
git add .
git commit -m "Implement feature"
git push team agent1
# Fetch teammate's work
git fetch team
git branch -r # see team/agent2
# Merge changes
git merge team/agent2
Cleanup
# Automatically cleaned up after task
git_server.cleanup()
Backend-specific implementation
Serverless git daemon in Modal sandbox
Network-accessible within Modal
Automatic cleanup on completion
VM instance with git daemon
Internal VPC networking
Cleaned up via GCP Batch
Local container with git daemon
Docker network bridge
Manual cleanup required
Evaluation pipeline
After agents complete tasks, patches are evaluated:
Evaluation flow
Evaluation steps
Patch loading
# Load agent patches
patch1 = Path( "logs/.../agent1.patch" ).read_text()
patch2 = Path( "logs/.../agent2.patch" ).read_text()
# Load test patches
tests1 = Path( "dataset/.../feature1/tests.patch" ).read_text()
tests2 = Path( "dataset/.../feature2/tests.patch" ).read_text()
Sandbox creation
Create isolated test environment:
Pull task Docker image
Clone repository at correct commit
Run setup script
Patch application
# Apply agent patches
git apply agent1.patch
git apply agent2.patch # may conflict
# Apply test patches
git apply tests1.patch
git apply tests2.patch
Test execution
# Run complete test suite
bash run_tests.sh
Result analysis
result = {
"both_passed" : all_tests_passed,
"feature1" : {
"passed" : feature1_tests_passed,
"test_output" : "..."
},
"feature2" : {
"passed" : feature2_tests_passed,
"test_output" : "..."
},
"merge_conflict" : had_conflict,
}
Evaluation backends
Evaluation can run on different backends:
# Modal (default)
cooperbench eval -n exp --backend modal
# GCP Batch (efficient for large scale)
cooperbench eval -n exp --backend gcp
# Docker (local)
cooperbench eval -n exp --backend docker
Output structure
CooperBench generates comprehensive logs and metrics:
logs/{run_name}/
├── config.json # Run configuration
├── summary.json # Aggregate results
└── {setting}/ # coop or solo
└── {repo}/
└── task{id}/
└── f{i}_f{j}/ # Feature pair
├── result.json # Task result
├── conversation.json # Messages (coop only)
├── agent{i}.patch # Agent patches
├── agent{i}_traj.json # Trajectories
└── eval.json # Test results
Key output files
result.json - Task execution results
{
"repo" : "llama_index_task" ,
"task_id" : 123 ,
"features" : [ 1 , 2 ],
"setting" : "coop" ,
"total_cost" : 0.45 ,
"total_steps" : 23 ,
"duration_seconds" : 125.3 ,
"agents" : {
"agent1" : {
"feature_id" : 1 ,
"status" : "Submitted" ,
"cost" : 0.23 ,
"steps" : 12 ,
"patch_lines" : 45
},
"agent2" : { ... }
}
}
{
"both_passed" : true ,
"feature1" : {
"passed" : true ,
"test_output" : "test_cache.py::test_basic PASSED \n ..."
},
"feature2" : {
"passed" : true ,
"test_output" : "test_logging.py::test_info PASSED \n ..."
},
"merge_conflict" : false ,
"evaluated_at" : "2026-03-04T10:30:00"
}
conversation.json - Inter-agent messages
[
{
"from" : "agent1" ,
"to" : "agent2" ,
"message" : "I'm working on caching in src/cache.py" ,
"timestamp" : 1234567890 ,
"feature_id" : 1
},
{
"from" : "agent2" ,
"to" : "agent1" ,
"message" : "Got it, I'll handle logging separately" ,
"timestamp" : 1234567895 ,
"feature_id" : 2
}
]
Concurrency and parallelization
CooperBench executes multiple tasks in parallel:
# Run with 30 parallel tasks
cooperbench run - n exp -- concurrency 30
# Each task may spawn 2 agents (coop mode)
# Total: up to 60 concurrent sandboxes
Concurrency architecture
Backend handles spawning and managing agent sandboxes based on concurrency limits.
What’s next?
Quick start Run your first benchmark with the architecture you learned
Backend setup Configure Modal, GCP, or Docker backends
Dataset structure Understand how tasks are organized
CLI reference Complete command-line options and parameters