System architecture

CooperBench is designed as a modular system that can execute agent tasks across different backends while maintaining consistent evaluation standards.

High-level architecture

Core components

Task runner

Orchestrates task execution, manages concurrency, and tracks results

Execution backends

Provide isolated sandboxes for agent execution (Modal, GCP, Docker)

Communication layer

Redis-based messaging for inter-agent communication

Evaluation pipeline

Tests merged patches and computes success metrics

Execution backends

CooperBench supports three execution backends, each with different tradeoffs:

Modal (default)
Google Cloud Platform
Docker (local)

Modal (default): cloud-based serverless execution

Modal provides managed containerized sandboxes that scale automatically.

Features:

Automatic scaling to concurrency limits
Fast cold starts (~5-10 seconds)
Managed infrastructure
GPU support available
Pay-per-use pricing

Setup:

# Install Modal
pip install modal

# Authenticate
modal setup

# Run tasks
cooperbench run -n exp --backend modal

Pros:

Zero infrastructure management
Excellent for parallel execution
Fast iteration cycles
Good for small to medium experiments

Cons:

Requires internet connection
Pay-per-use costs
Less control over environment

Google Cloud Platform: enterprise cloud execution

GCP provides VM-based execution with fine-grained control.

Features:

VM-based isolation
Custom machine types
Regional execution
Batch job management
Enterprise security

Setup:

# Install GCP dependencies
pip install 'cooperbench[gcp]'

# Configure (interactive wizard)
cooperbench config gcp

# Run tasks
cooperbench run -n exp --backend gcp

The configuration wizard handles:

GCP authentication
Project creation/selection
Service account setup
API enablement
Permissions configuration
Validation testing

Pros:

Production-grade reliability
Fine-grained resource control
Better for large-scale experiments
Integration with GCP services

Cons:

More complex setup
Requires GCP account
Higher minimum costs
Slower cold starts

Docker (local): local containerized execution

Docker provides local execution for development and debugging.

Features:

Local execution
Full control over environment
No external dependencies
Easy debugging
Reproducible environments

Setup:

# Ensure Docker is running
docker ps

# Run tasks
cooperbench run -n exp --backend docker

Pros:

No cloud costs
Works offline
Full debugging access
Fast iteration for development

Cons:

Limited by local resources
Poor parallelization
Manual cleanup needed
Not suitable for large experiments

Backend comparison

Feature	Modal	GCP	Docker
Setup complexity	Low	Medium	Low
Concurrency	High (100+)	High (100+)	Low (CPU-bound)
Cost	Usage-based	VM-based	Free (local)
Cold start	~5-10s	~30-60s	~2-5s
Internet required	Yes	Yes	No
Best for	Development, medium scale	Production, large scale	Local dev, debugging

Agent execution pipeline

When a task runs, CooperBench follows this execution flow:

Task discovery

# Discover tasks based on filters
tasks = discover_tasks(
    subset="lite",
    repo_filter="llama_index_task",
    task_filter=None,
    features_filter=None
)
# Returns: [{"repo": "...", "task_id": 123, "features": [1, 2]}]
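As a rough illustration of how filters like these could narrow a task list, the sketch below applies them to in-memory records. The `filter_tasks` helper, its matching rules, and the `other_repo_task` record are assumptions made for illustration, not CooperBench's actual `discover_tasks` implementation. Only the record shape is taken from the return value shown above.

```python
from typing import Optional


def filter_tasks(
    tasks: list[dict],
    repo_filter: Optional[str] = None,
    task_filter: Optional[int] = None,
    features_filter: Optional[list[int]] = None,
) -> list[dict]:
    """Keep tasks matching every filter that is not None (hypothetical helper)."""
    selected = []
    for task in tasks:
        if repo_filter is not None and task["repo"] != repo_filter:
            continue
        if task_filter is not None and task["task_id"] != task_filter:
            continue
        # Assumed rule: the task must contain all requested features.
        if features_filter is not None and not set(features_filter) <= set(task["features"]):
            continue
        selected.append(task)
    return selected


all_tasks = [
    {"repo": "llama_index_task", "task_id": 123, "features": [1, 2]},
    {"repo": "other_repo_task", "task_id": 7, "features": [1, 3]},  # made-up record
]
matches = filter_tasks(all_tasks, repo_filter="llama_index_task")
print(matches)
```

A filter left as `None` is skipped, which mirrors the `task_filter=None` and `features_filter=None` arguments in the call above.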

Infrastructure setup

Redis: Start or connect to messaging server
Git server (if enabled): Create shared repository
Namespacing: Create unique run ID for isolation
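The namespacing step can be sketched as follows. This is a minimal illustration, assuming a random 8-character run ID; the helper names are hypothetical, and the only detail taken from this page is the `redis://host:port#run:{run_id}` shape of the `comm_url` passed to the agent runner.

```python
import uuid


def new_run_id() -> str:
    """Return a short random identifier for one benchmark run (assumed format)."""
    return uuid.uuid4().hex[:8]


def namespace(run_id: str) -> str:
    """Prefix shared by every key that belongs to this run."""
    return f"run:{run_id}"


def comm_url(redis_url: str, run_id: str) -> str:
    """Attach the run namespace to the Redis URL as a fragment."""
    return f"{redis_url}#{namespace(run_id)}"


run_id = new_run_id()
print(comm_url("redis://localhost:6379", run_id))
```

Because every key carries the run prefix, two runs that share one Redis server do not read each other's messages.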

Sandbox initialization

For each agent:

Pull task-specific Docker image
Mount dataset files
Configure environment variables
Set up git remote (if enabled)
Initialize Redis connection

Agent execution

# Load agent framework
runner = get_runner("mini_swe_agent")

# Execute task
result = runner.run(
    task=feature_description,
    image="cooperbench-llama-index-123",
    agent_id="agent1",
    model_name="gpt-4o",
    comm_url="redis://localhost:6379#run:abc123",
    git_server_url="git://git-server:9418",
)

Patch extraction

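# Extract changes from agent's workspace
# Note: git diff HEAD only covers tracked files; newly created files
# appear in the patch only if they were staged first (git add -A)
git diff HEAD > agent1.patch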

Result aggregation

Collect patches from all agents
Extract conversation messages
Compute cost and token metrics
Save trajectories and logs
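The cost and step aggregation can be sketched as a simple sum over per-agent records. The `aggregate` helper and the rounding to two decimals are assumptions for illustration. The `agent2` values are made up so that the totals line up with the sample `result.json` shown later on this page.

```python
def aggregate(agents: dict[str, dict]) -> dict:
    """Sum per-agent cost and steps into run-level totals (hypothetical helper)."""
    return {
        # Rounded to cents to avoid floating-point noise in the report.
        "total_cost": round(sum(a["cost"] for a in agents.values()), 2),
        "total_steps": sum(a["steps"] for a in agents.values()),
    }


agents = {
    "agent1": {"cost": 0.23, "steps": 12},
    "agent2": {"cost": 0.22, "steps": 11},  # illustrative values
}
totals = aggregate(agents)
print(totals)
```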

Redis messaging system

CooperBench uses Redis for real-time agent communication.

Architecture

Message flow

How messaging works

Namespacing: Each run gets a unique namespace, run:{run_id}
Channels: Each agent gets its own channel, run:{run_id}:{agent_id}
Publishing: An agent sends a message with the send_message command
Subscription: Agents poll for new messages
Delivery: Messages appear in the receiving agent’s context as user messages

Example:

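import json
import redis

r = redis.Redis.from_url("redis://localhost:6379")

# Agent 1 sends: push onto agent 2's inbox list so a later poll can read it
# (a bare PUBLISH is fire-and-forget and would not populate this list)
r.rpush(
    "run:abc123:agent2:inbox",
    json.dumps({"from": "agent1", "message": "Starting feature 1"})
)

# Agent 2 receives (polled every N steps)
messages = r.lrange("run:abc123:agent2:inbox", 0, -1)
# Appears in context as:
# "[Message from agent1]: Starting feature 1"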
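The delivery step, turning stored payloads into the context line shown above, might look like the sketch below. The `render_messages` helper is hypothetical. It assumes each inbox entry is a JSON object with `from` and `message` keys, and that entries arrive as bytes, which is what redis-py returns unless `decode_responses=True` is set.

```python
import json


def render_messages(raw_messages: list[bytes]) -> list[str]:
    """Turn JSON payloads from an inbox into context lines (hypothetical helper)."""
    lines = []
    for raw in raw_messages:
        payload = json.loads(raw)  # json.loads accepts bytes as well as str
        lines.append(f"[Message from {payload['from']}]: {payload['message']}")
    return lines


# Stand-in for the list an inbox read would return.
inbox = [json.dumps({"from": "agent1", "message": "Starting feature 1"}).encode()]
print(render_messages(inbox))
```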

Configuration

# Use local Redis
cooperbench run -n exp --redis redis://localhost:6379

# Use remote Redis
cooperbench run -n exp --redis redis://cloud.redis.com:6379

# Auto-start Redis via Docker
cooperbench run -n exp  # detects and starts if needed

# Disable messaging
cooperbench run -n exp --no-messaging

Git collaboration mode

Agents can optionally share code with each other through a git server.

Architecture

How it works

Server creation

# Create git server (per task)
git_server = create_git_server(
    backend="modal",  # or "gcp", "docker"
    run_id="abc123"
)
# Returns: url="git://server:9418"

Agent setup

# Configure remote in agent sandbox
git remote add team git://server:9418
git checkout -b agent1
git push -u team agent1

Collaboration

Agents can use standard git commands:

# Push changes
git add .
git commit -m "Implement feature"
git push team agent1

# Fetch teammate's work
git fetch team
git branch -r  # see team/agent2

# Merge changes
git merge team/agent2

Cleanup

# Automatically cleaned up after task
git_server.cleanup()

Backend-specific implementation

Evaluation pipeline

After agents complete their tasks, their patches are merged and tested.

Evaluation flow

Evaluation steps

Patch loading

from pathlib import Path

# Load agent patches
patch1 = Path("logs/.../agent1.patch").read_text()
patch2 = Path("logs/.../agent2.patch").read_text()

# Load test patches
tests1 = Path("dataset/.../feature1/tests.patch").read_text()
tests2 = Path("dataset/.../feature2/tests.patch").read_text()

Sandbox creation

Create isolated test environment:

Pull task Docker image
Clone repository at correct commit
Run setup script

Patch application

# Apply agent patches
git apply agent1.patch
git apply agent2.patch  # may conflict

# Apply test patches
git apply tests1.patch
git apply tests2.patch
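Before applying both agent patches, it can help to check whether they touch the same files, since that is where the second `git apply` is most likely to fail. The sketch below is a heuristic written for illustration, not part of CooperBench: shared files do not guarantee a conflict, and the regular expression assumes paths without spaces. The sample paths are made up.

```python
import re


def files_touched(patch_text: str) -> set[str]:
    """Collect file paths from 'diff --git a/<path> b/<path>' headers."""
    return set(re.findall(r"^diff --git a/(\S+) b/\S+$", patch_text, flags=re.MULTILINE))


def overlapping_files(patch_a: str, patch_b: str) -> set[str]:
    """Files modified by both patches; an empty set means no shared files."""
    return files_touched(patch_a) & files_touched(patch_b)


patch1 = "diff --git a/src/cache.py b/src/cache.py\n--- a/src/cache.py\n+++ b/src/cache.py\n"
patch2 = "diff --git a/src/log.py b/src/log.py\n--- a/src/log.py\n+++ b/src/log.py\n"
print(overlapping_files(patch1, patch2))
```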

Test execution

# Run complete test suite
bash run_tests.sh

Result analysis

result = {
    "both_passed": all_tests_passed,
    "feature1": {
        "passed": feature1_tests_passed,
        "test_output": "..."
    },
    "feature2": {
        "passed": feature2_tests_passed,
        "test_output": "..."
    },
    "merge_conflict": had_conflict,
}

Evaluation backends

Evaluation can run on different backends:

# Modal (default)
cooperbench eval -n exp --backend modal

# GCP Batch (efficient for large scale)
cooperbench eval -n exp --backend gcp

# Docker (local)
cooperbench eval -n exp --backend docker

Output structure

CooperBench generates comprehensive logs and metrics:

logs/{run_name}/
├── config.json                    # Run configuration
├── summary.json                   # Aggregate results
└── {setting}/                     # coop or solo
    └── {repo}/
        └── task{id}/
            └── f{i}_f{j}/         # Feature pair
                ├── result.json         # Task result
                ├── conversation.json   # Messages (coop only)
                ├── agent{i}.patch      # Agent patches
                ├── agent{i}_traj.json  # Trajectories
                └── eval.json           # Test results
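Given this layout, results can be collected with a glob over the run directory. The sketch below is an ad hoc check and not a replacement for `summary.json`. The `success_rate` helper is hypothetical, it assumes every `eval.json` has a boolean `both_passed` field, and the demo builds a throwaway directory tree with made-up results.

```python
import json
import tempfile
from pathlib import Path


def success_rate(run_dir: Path) -> float:
    """Fraction of eval.json files under run_dir whose both_passed is true."""
    # Layout assumed: {setting}/{repo}/task{id}/f{i}_f{j}/eval.json
    results = [
        json.loads(path.read_text())["both_passed"]
        for path in sorted(run_dir.glob("*/*/task*/f*_f*/eval.json"))
    ]
    return sum(results) / len(results) if results else 0.0


# Demo on a temporary tree that mimics logs/{run_name}/.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "exp"
    for pair, passed in [("f1_f2", True), ("f1_f3", False)]:
        pair_dir = run_dir / "coop" / "llama_index_task" / "task123" / pair
        pair_dir.mkdir(parents=True)
        (pair_dir / "eval.json").write_text(json.dumps({"both_passed": passed}))
    rate = success_rate(run_dir)

print(rate)
```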

Key output files

result.json - Task execution results

{
  "repo": "llama_index_task",
  "task_id": 123,
  "features": [1, 2],
  "setting": "coop",
  "total_cost": 0.45,
  "total_steps": 23,
  "duration_seconds": 125.3,
  "agents": {
    "agent1": {
      "feature_id": 1,
      "status": "Submitted",
      "cost": 0.23,
      "steps": 12,
      "patch_lines": 45
    },
    "agent2": {...}
  }
}

eval.json - Test results

{
  "both_passed": true,
  "feature1": {
    "passed": true,
    "test_output": "test_cache.py::test_basic PASSED\n..."
  },
  "feature2": {
    "passed": true,
    "test_output": "test_logging.py::test_info PASSED\n..."
  },
  "merge_conflict": false,
  "evaluated_at": "2026-03-04T10:30:00"
}

conversation.json - Inter-agent messages

[
  {
    "from": "agent1",
    "to": "agent2",
    "message": "I'm working on caching in src/cache.py",
    "timestamp": 1234567890,
    "feature_id": 1
  },
  {
    "from": "agent2",
    "to": "agent1",
    "message": "Got it, I'll handle logging separately",
    "timestamp": 1234567895,
    "feature_id": 2
  }
]
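A conversation log in this shape is easy to analyze after a run. The sketch below, written for illustration, orders the sample messages by timestamp, counts messages per sender, and measures the time between the first and last message. It assumes timestamps are in seconds.

```python
import json
from collections import Counter

conversation_json = """[
  {"from": "agent1", "to": "agent2", "message": "I'm working on caching in src/cache.py",
   "timestamp": 1234567890, "feature_id": 1},
  {"from": "agent2", "to": "agent1", "message": "Got it, I'll handle logging separately",
   "timestamp": 1234567895, "feature_id": 2}
]"""

# Sort defensively in case messages were not written in order.
messages = sorted(json.loads(conversation_json), key=lambda m: m["timestamp"])
sent_by = Counter(m["from"] for m in messages)
elapsed = messages[-1]["timestamp"] - messages[0]["timestamp"]
print(dict(sent_by), elapsed)
```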

Concurrency and parallelization

CooperBench executes multiple tasks in parallel:

# Run with 30 parallel tasks
cooperbench run -n exp --concurrency 30

# Each task may spawn 2 agents (coop mode)
# Total: up to 60 concurrent sandboxes

Concurrency architecture

The backend spawns and manages agent sandboxes within the configured concurrency limit.
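One common way to bound parallel work is a fixed-size worker pool, sketched below. This is an assumption about the general technique, not a description of CooperBench's task runner. `run_task` is a placeholder for launching sandboxes and waiting for the agents, and `max_sandboxes` reproduces the arithmetic from the comment above (30 tasks with 2 agents each).

```python
from concurrent.futures import ThreadPoolExecutor


def max_sandboxes(concurrency: int, agents_per_task: int = 2) -> int:
    """Upper bound on sandboxes alive at once."""
    return concurrency * agents_per_task


def run_task(task_id: int) -> str:
    # Placeholder: a real runner would start sandboxes and wait for agents here.
    return f"task{task_id}: done"


def run_all(task_ids: list[int], concurrency: int) -> list[str]:
    """Run tasks with at most `concurrency` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_task, task_ids))


results = run_all([1, 2, 3], concurrency=2)
print(max_sandboxes(30), results)
```

Threads suit this kind of workload because each task mostly waits on remote sandboxes rather than using local CPU.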

What’s next?

Quick start

Run your first benchmark using the architecture described above

Backend setup

Configure Modal, GCP, or Docker backends

Dataset structure

Understand how tasks are organized

CLI reference

Complete command-line options and parameters
