BIRD-Interact is an interactive text-to-SQL benchmark where an agent must complete multi-turn CRUD tasks over a live Postgres database. Unlike static SQL benchmarks, each task involves a conversation between a user simulator, a database environment, and your system agent — all running as separate services. auto-harness wraps this three-service setup entirely: python prepare.py auto-provisions the BIRD-Interact-ADK repository, its isolated virtual environment, the HuggingFace dataset, and the Postgres Docker container, then runs the baseline. This page covers the full setup including the one-time ground truth access step.

Requirements

  • Docker — runs the Postgres database container required by BIRD-Interact
  • Python 3.12+ — required by the BIRD-Interact-ADK’s dependencies
  • git-lfs — required to clone the BIRD-Interact dataset from HuggingFace
  • OPENAI_API_KEY or ANTHROPIC_API_KEY — for your configured agent and user simulator models
  • A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands
prepare.py auto-provisions everything: it clones BIRD-Interact-ADK, creates an isolated .venv-adk, downloads the dataset from HuggingFace, and starts the Postgres container. You do not need to set these up manually unless you want to point at an existing install.
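The requirements above can be checked up front. The following is a minimal sketch, not part of auto-harness: the function name `missing_requirements` is hypothetical, and it only checks that the tools are on PATH and that at least one of the two API keys is set.

```python
import shutil
import sys

REQUIRED_TOOLS = ("docker", "git-lfs")
API_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_requirements(env, which=shutil.which, version=sys.version_info):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; prepare.py performs its own checks.
    """
    problems = []
    if tuple(version[:2]) < (3, 12):
        problems.append("Python 3.12+ is required")
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not any(env.get(key) for key in API_KEYS):
        problems.append("set OPENAI_API_KEY or ANTHROPIC_API_KEY")
    return problems
```

Calling `missing_requirements(os.environ)` after `import os` reports problems for the current shell. The `which` and `version` parameters exist only so the check can be exercised without touching the real system.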

Setup

1. Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness

2. Set up environment variables

cp .env.example .env
Open .env and set your LLM API keys:
# LLM API keys — set whichever your configured models need
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...   # set if using claude models
GEMINI_API_KEY=                # set if using gemini models
The default config uses anthropic/claude-sonnet-4-20250514 for the system agent and anthropic/claude-haiku-4-5-20251001 for the user simulator, so ANTHROPIC_API_KEY is required for that setup.
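Which key you need follows from the provider prefix of each configured model. A minimal sketch of that mapping, assuming model names take the `provider/model` form shown above; only the `anthropic/` prefix appears in this page, so the `openai/` and `gemini/` prefixes and the helper name `required_keys` are assumptions.

```python
# Mapping from provider prefix to the environment variable it needs.
# Only "anthropic" is confirmed by the default config; the others are assumed.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def required_keys(*models):
    """Return the sorted set of API key names needed by the given models."""
    keys = set()
    for model in models:
        provider = model.split("/", 1)[0]
        if provider not in PROVIDER_KEYS:
            raise ValueError(f"unknown provider prefix in {model!r}")
        keys.add(PROVIDER_KEYS[provider])
    return sorted(keys)
```

For the default config, `required_keys(agent_model, user_model)` yields only `ANTHROPIC_API_KEY`, since both models share the same provider.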

3. Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml
Open experiment_config.yaml and uncomment the BIRD-Interact section:
benchmark: "bird-interact"
mode: "a-interact"                              # or "c-interact"
dataset: "lite"                                 # "lite" (300 tasks) or "full" (600 tasks)
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
patience: 3
per_task_timeout: 1800
system_agent_port: 6100
user_sim_port: 6101
db_env_port: 6102
agent_model: "anthropic/claude-sonnet-4-20250514"
user_model: "anthropic/claude-haiku-4-5-20251001"
Start with dataset: "lite" (300 tasks). The full dataset (600 tasks) roughly doubles baseline runtime. max_concurrency: 3 is the recommended default — BIRD tasks are multi-turn conversations that each spawn three services, so running many concurrently is resource-intensive.
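A quick sanity check of the values above can catch typos before a long run. This is a sketch under stated assumptions: it takes the config as an already-parsed dict, the helper name `config_problems` is hypothetical, and the only rules it enforces are those stated on this page (the two modes, the two dataset sizes) plus the fact that three services on one host need distinct ports.

```python
VALID_MODES = {"a-interact", "c-interact"}
DATASET_SIZES = {"lite": 300, "full": 600}
PORT_KEYS = ("system_agent_port", "user_sim_port", "db_env_port")


def config_problems(cfg):
    """Return a list of problems found in a parsed experiment config dict."""
    problems = []
    if cfg.get("mode") not in VALID_MODES:
        problems.append(f"mode must be one of {sorted(VALID_MODES)}")
    if cfg.get("dataset") not in DATASET_SIZES:
        problems.append(f"dataset must be one of {sorted(DATASET_SIZES)}")
    ports = [cfg.get(key) for key in PORT_KEYS]
    # Each service listens on its own port, so duplicates cannot work.
    if None in ports or len(set(ports)) != len(ports):
        problems.append("the three service ports must be set and distinct")
    concurrency = cfg.get("max_concurrency")
    if not isinstance(concurrency, int) or concurrency < 1:
        problems.append("max_concurrency must be a positive integer")
    return problems
```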

4. Obtain ground truth data (one-time step)

The public BIRD-Interact dataset ships without gold SQL answers to prevent data leakage. You need to request access once before running the baseline.
  1. Email bird.bench25@gmail.com with the subject line:
    [bird-interact-lite GT&Test Cases]
    
  2. You will receive a .jsonl file with the ground truth answers.
  3. Run the merge script from the cloned BIRD-Interact-ADK repository (prepare.py prints the exact command when the ground truth is missing):
    python bird_interact_adk/scripts/combine_public_with_gt.py \
      --public bird_interact_data/bird_interact_data.jsonl \
      --gt <path-to-received-jsonl> \
      --output bird_interact_data/bird_interact_data.jsonl
    
  4. Proceed to the next step.
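If you run python prepare.py before merging the ground truth, it detects the missing data, prints the exact email subject and merge command, and exits. The merge command reads from bird_interact_adk/ and bird_interact_data/, both of which prepare.py provisions, so those directories must exist before you merge. Once the merged file is in place, re-run prepare.py.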
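Conceptually, the merge joins each public task record with its ground truth record. The sketch below illustrates that idea only; it is not the code of combine_public_with_gt.py, and the join field `instance_id` and the example field `sol_sql` are assumptions about the record layout.

```python
import json


def read_jsonl(path):
    """Load a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def merge_ground_truth(public_rows, gt_rows, key="instance_id"):
    """Join public rows with ground truth rows on `key` (assumed field name).

    Returns (merged_rows, ids_without_ground_truth).
    """
    gt_by_id = {row[key]: row for row in gt_rows}
    merged, missing = [], []
    for row in public_rows:
        extra = gt_by_id.get(row[key])
        if extra is None:
            missing.append(row[key])
            merged.append(dict(row))
        else:
            # Ground truth fields are layered on top of the public record.
            merged.append({**row, **extra})
    return merged, missing
```

Note that the documented command writes its output to the same path as the `--public` input, so keeping a copy of the original public file before merging is a reasonable precaution.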

5. Initialize the workspace and run the baseline

python prepare.py
prepare.py auto-provisions everything in this order:
  1. Validates required environment variables and tooling (docker, git-lfs)
  2. Clones BIRD-Interact-ADK into ./bird_interact_adk/ (gitignored)
  3. Creates an isolated .venv-adk inside bird_interact_adk/ with ADK dependencies installed
  4. Clones the bird-interact-lite dataset from HuggingFace into bird_interact_data/
  5. Starts the Postgres Docker container
  6. Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
  7. Copies agent/templates/bird_interact.py into agent/agent.py as the starting point
  8. Composes PROGRAM.md from program_templates/base.md + program_templates/bird_interact.md
  9. Runs all 300 lite tasks, generates a stratified 70/30 train/test split at bird_data/task_split.json
  10. Records the baseline score as iteration 0
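A stratified 70/30 split, as mentioned in step 9, can be sketched as follows. This page does not say which attribute prepare.py stratifies on, so the `strata` mapping (one label per task ID), the seed, and the function name are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(task_ids, strata, test_fraction=0.3, seed=0):
    """Split task IDs into (train, test), preserving per-stratum proportions.

    `strata` maps each task ID to a label; which label the harness actually
    uses is not documented here, so treat it as a placeholder.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for task_id in task_ids:
        groups[strata[task_id]].append(task_id)

    train, test = [], []
    for label in sorted(groups):
        members = sorted(groups[label])
        rng.shuffle(members)
        # Take the test share from each stratum separately.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

With 300 tasks spread evenly over a few strata, this yields the 210/90 division shown in the sample output, and a fixed seed keeps the split reproducible across runs.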
The baseline run executes 300 tasks at max_concurrency: 3 with an 1800-second per-task timeout. Expect the baseline to take several hours. Tasks that time out are excluded from the split and logged as warnings.
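The figures above also give a rough ceiling on runtime: if every task ran for its full timeout, the run would take one timeout per group of concurrent tasks. A small sketch of that arithmetic, ignoring service start-up overhead (the function name is hypothetical):

```python
import math


def worst_case_hours(num_tasks, max_concurrency, per_task_timeout_s):
    """Upper bound on wall-clock hours if every task hits its timeout."""
    waves = math.ceil(num_tasks / max_concurrency)
    return waves * per_task_timeout_s / 3600


# 300 tasks, 3 at a time, 1800 s each: 100 waves of 30 minutes.
print(worst_case_hours(300, 3, 1800))  # 50.0
```

Real runs should finish well below this bound because most tasks complete before the timeout, which is consistent with the "several hours" estimate.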
Once complete, you will see output like:
[prepare] BIRD task split created: 210 train, 90 test
[prepare] baseline val_score=0.3222 (29/90 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.
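If you want to pick the baseline out of captured logs, the score line can be parsed with a regular expression. This is a sketch that assumes the log format matches the sample output above exactly; the helper name `parse_baseline` is hypothetical.

```python
import re

# Matches e.g. "val_score=0.3222 (29/90 passed)" anywhere in a line.
BASELINE_RE = re.compile(
    r"val_score=(?P<score>\d+\.\d+) \((?P<passed>\d+)/(?P<total>\d+) passed\)"
)


def parse_baseline(line):
    """Return score, passed, and total from a baseline log line, or None."""
    match = BASELINE_RE.search(line)
    if match is None:
        return None
    return {
        "score": float(match["score"]),
        "passed": int(match["passed"]),
        "total": int(match["total"]),
    }
```

In the sample, the score is the pass rate on the 90-task test split: 29 / 90 rounds to 0.3222.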

6. Start the optimization loop

Point your coding agent at the repository and use the following prompt:
Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).
The agent will:
  1. Run python benchmark.py to get train-split results
  2. Read train-split traces from workspace/traces/latest/ to diagnose root causes
  3. Edit agent/agent.py with one focused improvement (your system agent code)
  4. Run python gating.py to gate the change — three steps: regression suite, full test score, suite promotion
  5. If the gate passes: commit, run python record.py, update workspace/learnings.md
  6. If the gate fails: revert with git checkout agent/agent.py and try a different approach
  7. Repeat
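The control flow of steps 3 to 6 can be sketched as follows. This is an illustration only, not code from auto-harness: the five callables are hypothetical stand-ins for the agent's edit, python gating.py, the git commit, python record.py, and git checkout agent/agent.py.

```python
def run_iteration(apply_edit, gate, commit, record, revert):
    """Run one edit-gate-commit cycle; return True if the change was kept.

    Every argument is a zero-argument callable supplied by the caller.
    `gate` must return True when the change passes the gate.
    """
    apply_edit()
    if gate():
        commit()
        record()
        return True
    # A failed gate discards the edit so the next attempt starts clean.
    revert()
    return False
```

Keeping each iteration to a single focused edit means a failed gate can be undone by reverting one file, which is why the loop reverts rather than trying to patch a rejected change.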

What auto-harness adds to BIRD-Interact

The integration adds several components that connect the BIRD-Interact-ADK infrastructure to the auto-harness optimization loop:
| Component | Purpose |
| --- | --- |
| BirdInteractRunner in benchmark.py | Spawns the three ADK services (user simulator, DB environment, system agent) per run, drives orchestrator.runner, parses results into the harness reward format |
| agent/helpers/bird_interact/bird_service.py | FastAPI service wrapper that serves your agent/agent.py as the BIRD system agent |
| agent/helpers/bird_interact/bird_adk_runtime.py | Google ADK runtime adapter that connects the FastAPI service to the ADK evaluation framework |
| agent/templates/bird_interact.py | Faithful copy of the stock BIRD-Interact-ADK system agent; your starting point for optimization |
| program_templates/bird_interact.md | Benchmark-specific guidance appended to PROGRAM.md: trace paths, task ID format, known techniques |

Advanced: pointing at an existing BIRD-Interact install

If you already have BIRD-Interact-ADK installed, you can skip auto-provisioning by setting these keys in experiment_config.yaml:
bird_repo: "/absolute/path/to/BIRD-Interact"       # repo root or BIRD-Interact-ADK dir
bird_python_bin: "/absolute/path/to/python"         # python binary with ADK deps installed
bird_data_path: "/absolute/path/to/bird_interact_data.jsonl"
pg_host: "127.0.0.1"
pg_port: 5432
pg_user: "root"
pg_password: "123123"
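Before relying on an existing install, it can help to confirm that the path keys are absolute and that Postgres accepts connections. A minimal sketch using only the standard library; both helper names are hypothetical, and the TCP probe only shows that something is listening on the port, not that the credentials are valid.

```python
import os
import socket

PATH_KEYS = ("bird_repo", "bird_python_bin", "bird_data_path")


def relative_path_keys(cfg):
    """Return the path keys in cfg whose values are not absolute paths."""
    return [key for key in PATH_KEYS if key in cfg and not os.path.isabs(cfg[key])]


def postgres_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the values above, `postgres_reachable("127.0.0.1", 5432)` should return True once the database is running.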

Known caveats

GPT-5-family models and temperature=0: GPT-5-family models reject an explicit temperature=0 parameter. The bird_interact.py template omits the temperature kwarg for those models to preserve stock behavior. If you are testing a GPT-5-family model, do not add temperature=0 in agent/agent.py.
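One way to express this rule in your own agent code is to build the request kwargs conditionally. The sketch below is not the template's actual implementation: the helper name is hypothetical, and detecting the family by a `gpt-5` name prefix is an assumption.

```python
def completion_kwargs(model, temperature=0.0):
    """Build model-call kwargs, leaving out temperature for GPT-5-family models."""
    kwargs = {"model": model}
    # Strip an optional "provider/" prefix before checking the model family.
    name = model.split("/", 1)[-1].lower()
    if not name.startswith("gpt-5"):
        kwargs["temperature"] = temperature
    return kwargs
```

Omitting the key entirely, rather than passing `None`, avoids sending any explicit temperature parameter to models that reject it.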
Separate .venv-adk: prepare.py creates a separate .venv-adk inside bird_interact_adk/ because the ADK’s dependencies (google-adk, psycopg2, etc.) may conflict with other benchmarks’ dependencies. The harness invokes this venv’s Python binary directly — you do not need to activate it manually.
git-lfs required: The BIRD-Interact dataset is stored on HuggingFace using Git LFS. If git-lfs is not installed, the dataset clone will succeed but the .jsonl file will contain LFS pointer text instead of actual data. Install git-lfs before running prepare.py.
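A pointer file is easy to detect because Git LFS pointer files begin with a fixed version line. A minimal sketch of such a check; the helper name `is_lfs_pointer` is hypothetical and not part of auto-harness.

```python
# Every Git LFS pointer file starts with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` is a Git LFS pointer, not real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If `is_lfs_pointer("bird_interact_data/bird_interact_data.jsonl")` returns True, install git-lfs and fetch the dataset again before re-running prepare.py.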
