CooperBench supports multiple execution backends for running agent tasks and evaluations. Each backend provides isolated sandbox environments.

Available backends

  • modal (default) - Serverless containers via Modal.com
  • docker - Local Docker containers
  • gcp - Google Cloud Platform Batch jobs

Using backends

Specify backend in run()

from cooperbench import run

# Use Modal (default)
run(
    run_name="modal_run",
    subset="lite",
    backend="modal",
)

# Use Docker
run(
    run_name="docker_run",
    subset="lite",
    backend="docker",
)

# Use GCP Batch
run(
    run_name="gcp_run",
    subset="lite",
    backend="gcp",
)

Specify backend in evaluate()

from cooperbench import evaluate

# Evaluate using Modal
evaluate(
    run_name="my_experiment",
    backend="modal",
)

# Evaluate using Docker (local)
evaluate(
    run_name="my_experiment",
    backend="docker",
    concurrency=5,  # Lower concurrency for local resources
)

# Evaluate using GCP Batch (high-scale)
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Backend comparison

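| Feature     | Modal                     | Docker                   | GCP Batch              |
|-------------|---------------------------|--------------------------|------------------------|
| Setup       | Requires Modal account    | Local Docker only        | GCP project required   |
| Speed       | Fast startup              | Instant                  | Slower startup         |
| Concurrency | High (100+)               | Limited by local CPU     | Very high (1000+)      |
| Cost        | Pay per second            | Free (local)             | Pay per hour           |
| Best for    | Development, medium scale | Local testing, debugging | Large-scale evaluation |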
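
The "Best for" row can be expressed as a small selection helper. This is only a sketch: the function name suggest_backend and the task-count thresholds (20 and 500) are assumptions made for illustration and are not part of CooperBench.

```python
def suggest_backend(num_tasks: int, local_only: bool = False) -> str:
    """Suggest a backend name based on the comparison table.

    The thresholds below are illustrative assumptions, not CooperBench defaults.
    """
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    if local_only or num_tasks <= 20:
        return "docker"  # local testing and debugging
    if num_tasks <= 500:
        return "modal"   # development and medium scale
    return "gcp"         # large-scale evaluation


print(suggest_backend(5))     # docker
print(suggest_backend(100))   # modal
print(suggest_backend(5000))  # gcp
```

The returned string can be passed as the backend argument to run() or evaluate(). Tune the thresholds to your own budget and hardware.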

Backend API

All backends implement the EvalBackend protocol:
class EvalBackend(Protocol):
    """Backend for creating evaluation sandboxes."""

    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        """Create a new sandbox for evaluation.

        Args:
            image: Docker image name
            timeout: Maximum runtime in seconds
            workdir: Working directory inside container

        Returns:
            Sandbox instance
        """
        ...
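
Because EvalBackend is a Protocol, conformance is structural: any object with a matching create_sandbox method qualifies, without inheriting from anything. The sketch below illustrates this with a locally defined mirror of the protocol. EvalBackendLike and NullBackend are hypothetical names, and whether CooperBench's own EvalBackend is decorated with runtime_checkable is not stated here, which is why the example defines its own copy.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class EvalBackendLike(Protocol):
    """Local mirror of the EvalBackend protocol (assumption, for illustration)."""

    def create_sandbox(
        self, image: str, timeout: int = 600, workdir: str = "/workspace"
    ): ...


class NullBackend:
    """Hypothetical backend that satisfies the protocol without subclassing it."""

    def create_sandbox(self, image, timeout=600, workdir="/workspace"):
        return object()  # a real backend would return a Sandbox


print(isinstance(NullBackend(), EvalBackendLike))  # True
print(isinstance(object(), EvalBackendLike))       # False
```

Note that isinstance on a runtime-checkable protocol only verifies that the method exists. It does not check the signature or return type, so a static type checker remains the stronger guarantee.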

Sandbox interface

class Sandbox(Protocol):
    """Abstract sandbox for running commands."""

    def exec(self, *args: str) -> ExecResult:
        """Execute a command.

        Args:
            *args: Command and arguments (e.g., "bash", "-c", "echo hello")

        Returns:
            ExecResult with returncode and output
        """
        ...

    def terminate(self) -> None:
        """Clean up and terminate the sandbox."""
        ...

ExecResult structure

class ExecResult(Protocol):
    """Result of executing a command."""

    @property
    def returncode(self) -> int:
        """Exit code of the command."""
        ...

    def stdout_read(self) -> str:
        """Read stdout output."""
        ...

    def stderr_read(self) -> str:
        """Read stderr output."""
        ...
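
Since exec() reports failure through returncode rather than by raising, callers usually want a thin wrapper that turns a non-zero exit into an exception. The following sketch shows one way to do that. run_checked and CommandFailed are hypothetical helpers, and FakeResult and FakeSandbox are stand-ins so the example runs without any backend.

```python
from dataclasses import dataclass


class CommandFailed(RuntimeError):
    """Raised when a sandbox command exits with a non-zero code."""


def run_checked(sandbox, *args: str) -> str:
    """Run a command through sandbox.exec and return stdout, raising on failure."""
    result = sandbox.exec(*args)
    if result.returncode != 0:
        raise CommandFailed(
            f"{' '.join(args)} exited with {result.returncode}: {result.stderr_read()}"
        )
    return result.stdout_read()


# --- Stand-ins for demonstration only (not CooperBench classes) ---
@dataclass
class FakeResult:
    returncode: int
    out: str = ""
    err: str = ""

    def stdout_read(self) -> str:
        return self.out

    def stderr_read(self) -> str:
        return self.err


class FakeSandbox:
    """Pretends every command succeeds except `false`."""

    def exec(self, *args: str) -> FakeResult:
        if args and args[0] == "false":
            return FakeResult(returncode=1, err="simulated failure")
        return FakeResult(returncode=0, out="ok")

    def terminate(self) -> None:
        pass


print(run_checked(FakeSandbox(), "echo", "hi"))  # ok
```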

Using backends programmatically

Get backend instance

from cooperbench.eval.backends import get_backend

# Get Modal backend
modal_backend = get_backend("modal")

# Get Docker backend
docker_backend = get_backend("docker")

# Get GCP backend
gcp_backend = get_backend("gcp")

Create and use sandbox

from cooperbench.eval.backends import get_backend

# Create a sandbox
backend = get_backend("modal")
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=600,
    workdir="/workspace",
)

try:
    # Run commands
    result = sandbox.exec("bash", "-c", "python --version")
    print(f"Exit code: {result.returncode}")
    print(f"Output: {result.stdout_read()}")

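    # Apply a patch (agent.patch must already exist in the working directory)
    result = sandbox.exec("git", "apply", "agent.patch")
    if result.returncode != 0:
        print(f"Patch failed: {result.stderr_read()}")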

    # Run tests
    result = sandbox.exec("pytest", "tests/")
    print(f"Tests {'passed' if result.returncode == 0 else 'failed'}")
finally:
    sandbox.terminate()
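
The try/finally pattern above can be wrapped in a context manager so that terminate() is never forgotten. This is a sketch under assumptions: open_sandbox is a hypothetical helper, not a CooperBench function, and RecordingBackend is a stand-in used only to show that cleanup happens.

```python
from contextlib import contextmanager


@contextmanager
def open_sandbox(backend, image: str, timeout: int = 600, workdir: str = "/workspace"):
    """Create a sandbox and always terminate it when the block exits."""
    sandbox = backend.create_sandbox(image=image, timeout=timeout, workdir=workdir)
    try:
        yield sandbox
    finally:
        sandbox.terminate()


# --- Stand-ins for demonstration only (not CooperBench classes) ---
class RecordingSandbox:
    def __init__(self):
        self.terminated = False

    def exec(self, *args: str):
        return None

    def terminate(self) -> None:
        self.terminated = True


class RecordingBackend:
    def __init__(self):
        self.last = None

    def create_sandbox(self, image, timeout=600, workdir="/workspace"):
        self.last = RecordingSandbox()
        return self.last


backend = RecordingBackend()
with open_sandbox(backend, "cooperbench/llama_index_task:task1") as sandbox:
    sandbox.exec("bash", "-c", "python --version")
print(backend.last.terminated)  # True
```

With a real backend, pass the object returned by get_backend() in place of RecordingBackend().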

Modal backend

Setup

  1. Install Modal:
pip install modal
  2. Authenticate:
modal token new
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="modal_test",
    subset="lite",
    backend="modal",
)

Features

  • Serverless execution (no infrastructure to manage)
  • Fast cold starts (typically under 10 seconds)
  • Auto-scaling based on concurrency
  • Pay-per-second billing

Configuration

Modal is configured via environment variables:
export MODAL_TOKEN_ID="your-token-id"
export MODAL_TOKEN_SECRET="your-token-secret"
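
A quick preflight check can report which of these variables are unset before a long run starts. The helper below is a sketch: missing_modal_vars is a hypothetical name. Keep in mind that credentials created with modal token new may be stored in a local Modal profile instead, so an unset variable does not necessarily mean you are unauthenticated.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_MODAL_VARS = ("MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET")


def missing_modal_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required Modal variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_MODAL_VARS if not env.get(name)]


missing = missing_modal_vars()
if missing:
    print("Not set in the environment:", ", ".join(missing))
```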

Docker backend

Setup

  1. Install Docker:
# See https://docs.docker.com/get-docker/
  2. Pull required images:
docker pull cooperbench/llama_index_task:task1
docker pull cooperbench/django_task:task5
# etc.
  3. Use in CooperBench:
from cooperbench import run

run(
    run_name="docker_test",
    subset="lite",
    backend="docker",
    concurrency=5,  # Adjust based on your CPU
)
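
Before starting a Docker run, it can help to confirm that the images from step 2 are present locally. The sketch below shells out to docker image inspect, which exits non-zero when an image is absent. missing_images is a hypothetical helper, and the has_image parameter exists only so the check can be swapped out in tests.

```python
import subprocess
from typing import Callable, Iterable, List


def _docker_has_image(image: str) -> bool:
    """Return True if `docker image inspect` succeeds for the image."""
    try:
        proc = subprocess.run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # The docker CLI is not installed or not on PATH.
        return False
    return proc.returncode == 0


def missing_images(
    images: Iterable[str],
    has_image: Callable[[str], bool] = _docker_has_image,
) -> List[str]:
    """Return the images that are not available locally, in input order."""
    return [image for image in images if not has_image(image)]


needed = ["cooperbench/llama_index_task:task1", "cooperbench/django_task:task5"]
for image in missing_images(needed):
    print(f"Missing image, run: docker pull {image}")
```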

Features

  • Runs locally (no internet required once the images are pulled)
  • No additional costs
  • Full control over environment
  • Good for debugging

Configuration

from cooperbench.eval.backends.docker import DockerBackend

# Instantiate the Docker backend directly and pass custom sandbox settings
backend = DockerBackend()
sandbox = backend.create_sandbox(
    image="cooperbench/llama_index_task:task1",
    timeout=300,
    workdir="/workspace",
)

GCP Batch backend

Setup

  1. Install the GCP client libraries:
pip install google-cloud-batch google-cloud-storage
  2. Authenticate (requires the gcloud CLI):
gcloud auth application-default login
  3. Set project and region:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"
  4. Use in CooperBench:
from cooperbench import evaluate

# GCP is best for large-scale evaluation
evaluate(
    run_name="large_experiment",
    backend="gcp",
    concurrency=100,
)

Features

  • Massive parallelism (1000+ concurrent tasks)
  • Batch job optimization (single VM startup for many tasks)
  • Cost-effective for large-scale runs
  • Auto-cleanup of resources

Batch evaluation

For GCP, evaluation uses batch mode by default:
from cooperbench import evaluate

# Submits all tasks as a single batch job
evaluate(
    run_name="my_experiment",
    backend="gcp",
    concurrency=200,  # Tasks run in parallel within the batch
)
Batch mode is more efficient because:
  • Single VM startup amortized across all tasks
  • Tasks run in parallel on the VM
  • Automatic cleanup after completion
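
The amortization argument can be made concrete with a rough wall-clock estimate. This is a deliberately simplified model, not a measurement: it assumes tasks run in equal-length waves of size concurrency, and the startup and task durations used below are hypothetical numbers.

```python
import math


def per_task_startup_time(num_tasks: int, concurrency: int, startup_s: float, task_s: float) -> float:
    """Estimate wall-clock seconds when every task pays its own startup cost."""
    waves = math.ceil(num_tasks / concurrency)
    return waves * (startup_s + task_s)


def batch_time(num_tasks: int, concurrency: int, startup_s: float, task_s: float) -> float:
    """Estimate wall-clock seconds when startup is paid once for the whole batch."""
    waves = math.ceil(num_tasks / concurrency)
    return startup_s + waves * task_s


# Hypothetical workload: 1000 tasks, 200 in parallel, 120 s startup, 300 s per task.
print(per_task_startup_time(1000, 200, 120, 300))  # 2100
print(batch_time(1000, 200, 120, 300))             # 1620
```

Under these assumptions the batch saves (waves - 1) × startup_s seconds, so the benefit grows with the number of waves.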

Configuration

from cooperbench.eval.backends.gcp import GCPBatchBackend

backend = GCPBatchBackend(
    project_id="your-project",
    region="us-central1",
    machine_type="n1-standard-4",
)
Environment variables:
export GOOGLE_CLOUD_PROJECT="your-project-id"
export GCP_REGION="us-central1"  # Optional, defaults to us-central1
export GCP_MACHINE_TYPE="n1-standard-4"  # Optional
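
The environment variables above can be collected into a single settings object that fails early when the project is missing. This is a sketch, not CooperBench code: GCPSettings is a hypothetical class. The us-central1 default follows the comment above, while leaving machine_type as None when unset is an assumption, since the default machine type is not documented here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GCPSettings:
    project_id: str
    region: str = "us-central1"
    machine_type: Optional[str] = None  # None means "let the backend decide"

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "GCPSettings":
        """Build settings from environment variables, failing early if the project is unset."""
        env = os.environ if env is None else env
        project_id = env.get("GOOGLE_CLOUD_PROJECT")
        if not project_id:
            raise RuntimeError("GOOGLE_CLOUD_PROJECT is not set")
        return cls(
            project_id=project_id,
            region=env.get("GCP_REGION", "us-central1"),
            machine_type=env.get("GCP_MACHINE_TYPE"),
        )


settings = GCPSettings.from_env({"GOOGLE_CLOUD_PROJECT": "your-project-id"})
print(settings.region)  # us-central1
```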

Advanced usage

Custom backend implementation

You can implement custom backends:
from cooperbench.eval.backends.base import EvalBackend, Sandbox, ExecResult
from dataclasses import dataclass

@dataclass
class MyExecResult:
    returncode: int
    _stdout: str
    _stderr: str

    def stdout_read(self) -> str:
        return self._stdout

    def stderr_read(self) -> str:
        return self._stderr

class MySandbox:
    def __init__(self, image: str, timeout: int, workdir: str):
        self.image = image
        self.timeout = timeout
        self.workdir = workdir
        # Initialize your sandbox

    def exec(self, *args: str) -> ExecResult:
        # Execute command in your sandbox
        return MyExecResult(
            returncode=0,
            _stdout="Command output",
            _stderr="",
        )

    def terminate(self) -> None:
        # Clean up
        pass

class MyBackend:
    def create_sandbox(
        self,
        image: str,
        timeout: int = 600,
        workdir: str = "/workspace",
    ) -> Sandbox:
        return MySandbox(image, timeout, workdir)
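
The skeleton above returns canned output. For a version that actually executes commands, the sketch below runs them on the host with subprocess inside a temporary directory. It is illustrative only: LocalBackend, LocalSandbox, and LocalExecResult are hypothetical names, the image and workdir arguments are ignored, and there is no isolation, so do not use it for untrusted patches.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class LocalExecResult:
    returncode: int
    _stdout: str
    _stderr: str

    def stdout_read(self) -> str:
        return self._stdout

    def stderr_read(self) -> str:
        return self._stderr


class LocalSandbox:
    """Runs commands on the host in a temporary directory (no isolation)."""

    def __init__(self, timeout: int):
        self.timeout = timeout
        self._tmp = tempfile.TemporaryDirectory()

    def exec(self, *args: str) -> LocalExecResult:
        proc = subprocess.run(
            list(args),
            cwd=self._tmp.name,
            capture_output=True,
            text=True,
            timeout=self.timeout,  # raises subprocess.TimeoutExpired if exceeded
        )
        return LocalExecResult(proc.returncode, proc.stdout, proc.stderr)

    def terminate(self) -> None:
        self._tmp.cleanup()


class LocalBackend:
    def create_sandbox(
        self, image: str, timeout: int = 600, workdir: str = "/workspace"
    ) -> LocalSandbox:
        # image and workdir are accepted for interface compatibility but unused here.
        return LocalSandbox(timeout)


sandbox = LocalBackend().create_sandbox(image="unused")
try:
    result = sandbox.exec(sys.executable, "-c", "print('hello')")
    print(result.returncode, result.stdout_read().strip())  # 0 hello
finally:
    sandbox.terminate()
```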

Use custom backend

# Monkey-patch the backend getter
from cooperbench.eval import backends

original_get_backend = backends.get_backend

def custom_get_backend(name: str):
    if name == "my_backend":
        return MyBackend()
    return original_get_backend(name)

backends.get_backend = custom_get_backend

# Now use it
from cooperbench import run

run(
    run_name="custom_backend_test",
    subset="lite",
    backend="my_backend",
)
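
A permanent monkey-patch affects every later caller in the process. One way to limit its scope is a context manager that restores the original getter on exit. The sketch below demonstrates the idea against a stand-in namespace rather than the real cooperbench.eval.backends module. extra_backends is a hypothetical helper.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def extra_backends(module, extras):
    """Temporarily extend module.get_backend with extra name -> factory entries."""
    original = module.get_backend

    def patched(name):
        if name in extras:
            return extras[name]()
        return original(name)

    module.get_backend = patched
    try:
        yield
    finally:
        module.get_backend = original  # always restore the original getter


# Stand-in for the backends module, for demonstration only.
fake_module = SimpleNamespace(get_backend=lambda name: f"builtin:{name}")

with extra_backends(fake_module, {"my_backend": lambda: "custom"}):
    print(fake_module.get_backend("my_backend"))  # custom
    print(fake_module.get_backend("modal"))       # builtin:modal
print(fake_module.get_backend("modal"))           # builtin:modal
```

Either form of patching only affects code that looks up get_backend through the module attribute at call time. Code that ran from ... import get_backend before the patch keeps its reference to the original function, so verify that your custom backend is actually picked up.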

Direct sandbox usage

from cooperbench.eval.backends import get_backend

backend = get_backend("modal")

# Create sandbox
sandbox = backend.create_sandbox(
    image="python:3.11-slim",
    timeout=300,
)

try:
    # Install dependencies
    sandbox.exec("pip", "install", "requests")

    # Run your code
    result = sandbox.exec(
        "python",
        "-c",
        "import requests; print(requests.__version__)",
    )

    print(result.stdout_read())
finally:
    sandbox.terminate()

Best practices

Choose the right backend

  • Development: Use Modal for fast iteration
  • Debugging: Use Docker for local control
  • Large-scale: Use GCP for cost-effective parallelism

Optimize concurrency

from cooperbench import evaluate, run

# Modal: high concurrency works well
run(
    run_name="modal_run",
    subset="full",
    backend="modal",
    concurrency=100,
)

# Docker: adjust based on CPU cores
import multiprocessing
cpu_count = multiprocessing.cpu_count()

run(
    run_name="docker_run",
    subset="full",
    backend="docker",
    concurrency=max(1, cpu_count - 1),  # Leave one core free, but never drop to 0
)

# GCP: very high concurrency for evaluation
evaluate(
    run_name="gcp_eval",
    backend="gcp",
    concurrency=200,
)
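
The per-backend choices above can be folded into one helper. This is a sketch under assumptions: default_concurrency is a hypothetical function, and the values 100 and 200 simply mirror the examples above rather than any documented CooperBench default.

```python
import os
from typing import Optional


def default_concurrency(backend: str, cpu_count: Optional[int] = None) -> int:
    """Pick a starting concurrency for a backend name."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if backend == "docker":
        return max(1, cpu_count - 1)  # leave one core free, never go below 1
    if backend == "modal":
        return 100  # mirrors the Modal example above
    if backend == "gcp":
        return 200  # mirrors the GCP example above
    raise ValueError(f"unknown backend: {backend!r}")


print(default_concurrency("docker", cpu_count=8))  # 7
print(default_concurrency("docker", cpu_count=1))  # 1
```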

Handle timeouts

from cooperbench import run

# Increase timeout for complex tasks
run(
    run_name="complex_tasks",
    subset="lite",
    backend="modal",
    # Agent timeout configured in agent settings
)

# For evaluation, timeout is per-test
from cooperbench import run_patch_test

result = run_patch_test(
    repo_name="llama_index_task",
    task_id=1,
    feature_id=1,
    agent_patch="path/to/patch",
    timeout=1200,  # 20 minutes
    backend="modal",
)
