Run BIRD-Interact with auto-harness: quickstart guide

BIRD-Interact is an interactive text-to-SQL benchmark in which an agent must complete multi-turn CRUD tasks over a live Postgres database. Unlike static SQL benchmarks, each task is a conversation among a user simulator, a database environment, and your system agent, all running as separate services. auto-harness wraps this three-service setup: running python prepare.py provisions the BIRD-Interact-ADK repository, its isolated virtual environment, the HuggingFace dataset, and the Postgres Docker container, then runs the baseline. This page covers the full setup, including the one-time ground truth access step.

Requirements

Docker — runs the Postgres database container required by BIRD-Interact
Python 3.12+ — required by the BIRD-Interact-ADK’s dependencies
git-lfs — required to clone the BIRD-Interact dataset from HuggingFace
OPENAI_API_KEY or ANTHROPIC_API_KEY — for your configured agent and user simulator models
A coding agent — Claude Code, Codex CLI, or any agent that can read files and run shell commands

prepare.py auto-provisions everything: it clones BIRD-Interact-ADK, creates an isolated .venv-adk, downloads the dataset from HuggingFace, and starts the Postgres container. You do not need to set these up manually unless you want to point at an existing install.
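
If you want to confirm the requirements before the first run, a small preflight check along these lines may help. It is a sketch, not the validation that prepare.py performs: the function name preflight is hypothetical, and it only inspects the current process environment, so keys that exist only in .env will not be seen unless they are exported.

```python
import os
import shutil
import sys

REQUIRED_TOOLS = ("docker", "git-lfs")  # executables named in the requirements list
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def preflight(env=None, which=shutil.which, version=None):
    """Return a list of problems; an empty list means every check passed."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    if version < (3, 12):
        problems.append(f"Python 3.12+ required, found {version[0]}.{version[1]}")
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # At least one provider key must be set and non-empty.
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("set OPENAI_API_KEY or ANTHROPIC_API_KEY")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

The env, which, and version parameters exist only so the check can be exercised without touching the real machine.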

Setup

Clone the repository

git clone https://github.com/neosigmaai/auto-harness
cd auto-harness

Set up environment variables

cp .env.example .env

Open .env and set your LLM API keys:

# LLM API keys — set whichever your configured models need
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...   # set if using claude models
GEMINI_API_KEY=                # set if using gemini models

The default config uses anthropic/claude-sonnet-4-20250514 for the system agent and anthropic/claude-haiku-4-5-20251001 for the user simulator, so ANTHROPIC_API_KEY is required for that setup.
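
To work out which keys a given pair of models needs, you can read the provider from the part of the model string before the slash. The sketch below assumes that convention holds for every model you configure and that the prefixes openai and gemini map to the variables shown in .env; only the anthropic prefix appears on this page, and required_keys is a hypothetical helper.

```python
# Provider prefix -> environment variable. Only "anthropic" is confirmed by the
# default config; the other two prefixes are assumptions.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def required_keys(*models):
    """Return the sorted set of env var names needed for the given model strings."""
    keys = set()
    for model in models:
        provider = model.split("/", 1)[0]
        if provider not in PROVIDER_KEYS:
            raise ValueError(f"unknown provider prefix in {model!r}")
        keys.add(PROVIDER_KEYS[provider])
    return sorted(keys)


print(required_keys(
    "anthropic/claude-sonnet-4-20250514",
    "anthropic/claude-haiku-4-5-20251001",
))  # ['ANTHROPIC_API_KEY']
```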

Configure the experiment

cp experiment_config.yaml.template experiment_config.yaml

Open experiment_config.yaml and uncomment the BIRD-Interact section:

benchmark: "bird-interact"
mode: "a-interact"                              # or "c-interact"
dataset: "lite"                                 # "lite" (300 tasks) or "full" (600 tasks)
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
patience: 3
per_task_timeout: 1800
system_agent_port: 6100
user_sim_port: 6101
db_env_port: 6102
agent_model: "anthropic/claude-sonnet-4-20250514"
user_model: "anthropic/claude-haiku-4-5-20251001"

Start with dataset: "lite" (300 tasks). The full dataset (600 tasks) roughly doubles baseline runtime. max_concurrency: 3 is the recommended default — BIRD tasks are multi-turn conversations that each spawn three services, so running many concurrently is resource-intensive.
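
A quick sanity check of the values above can catch typos before a long run. This is a minimal sketch working on an already-loaded dict rather than the YAML file. The allowed values for mode and dataset come from the comments in the config; the requirement that the three ports differ is an inference from the services running separately, not something the harness is documented to enforce. check_config is a hypothetical name.

```python
ALLOWED_VALUES = {
    "mode": {"a-interact", "c-interact"},
    "dataset": {"lite", "full"},
}
PORT_KEYS = ("system_agent_port", "user_sim_port", "db_env_port")


def check_config(cfg):
    """Return a list of error strings for a config dict; empty means it looks sane."""
    errors = []
    for key, allowed in ALLOWED_VALUES.items():
        if cfg.get(key) not in allowed:
            errors.append(f"{key} must be one of {sorted(allowed)}, got {cfg.get(key)!r}")
    ports = [cfg.get(key) for key in PORT_KEYS]
    if not all(isinstance(port, int) for port in ports):
        errors.append("every service port must be an integer")
    elif len(set(ports)) != len(ports):
        errors.append("service ports must be distinct")
    concurrency = cfg.get("max_concurrency")
    if not isinstance(concurrency, int) or concurrency < 1:
        errors.append("max_concurrency must be a positive integer")
    return errors
```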

Obtain ground truth data (one-time step)

The public BIRD-Interact dataset ships without gold SQL answers to prevent data leakage. You need to request access once before running the baseline.

Email bird.bench25@gmail.com with the subject line:
```
[bird-interact-lite GT&Test Cases]
```
You will receive a .jsonl file with the ground truth answers.

Run the merge script from the BIRD-Interact-ADK checkout (prepare.py prints the exact command when the ground truth is missing):

python bird_interact_adk/scripts/combine_public_with_gt.py \
  --public bird_interact_data/bird_interact_data.jsonl \
  --gt <path-to-received-jsonl> \
  --output bird_interact_data/bird_interact_data.jsonl

Proceed to the next step.

If you run python prepare.py before completing this step, it will detect the missing ground truth, print the exact email subject and merge command, and exit. Complete this step first, then re-run prepare.py.
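
Before merging, it can be worth confirming that the file you received is well-formed JSON Lines. The sketch below checks only that each non-empty line parses as JSON and counts the records; it makes no assumption about field names, which this page does not document, and count_jsonl_records is a hypothetical helper. Note also that the merge command above writes to the same path it reads the public data from, so you may want to keep a copy of the original file first.

```python
import json


def count_jsonl_records(lines):
    """Count JSON records in an iterable of lines, raising on the first bad line."""
    count = 0
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: not valid JSON ({exc.msg})") from exc
        count += 1
    return count
```

To use it on a file, pass an open handle, for example count_jsonl_records(open(path, encoding="utf-8")).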

Initialize the workspace and run the baseline

python prepare.py

prepare.py auto-provisions everything in this order:

Validates required environment variables and tooling (docker, git-lfs)
Clones BIRD-Interact-ADK into ./bird_interact_adk/ (gitignored)
Creates an isolated .venv-adk inside bird_interact_adk/ with ADK dependencies installed
Clones the bird-interact-lite dataset from HuggingFace into bird_interact_data/
Starts the Postgres Docker container
Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
Copies agent/templates/bird_interact.py into agent/agent.py as the starting point
Composes PROGRAM.md from program_templates/base.md + program_templates/bird_interact.md
Runs all 300 lite tasks and generates a stratified 70/30 train/test split at bird_data/task_split.json
Records the baseline score as iteration 0
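
The stratified split in the steps above can be pictured with the following sketch. It is not the code prepare.py runs: the page does not say which attribute the split is stratified on, so stratum_of is a hypothetical callback, and the seed and rounding rule are likewise assumptions.

```python
import random
from collections import defaultdict


def stratified_split(task_ids, stratum_of, test_fraction=0.3, seed=0):
    """Split task IDs into (train, test), holding out test_fraction of each stratum."""
    groups = defaultdict(list)
    for task_id in task_ids:
        groups[stratum_of(task_id)].append(task_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for stratum in sorted(groups):
        members = sorted(groups[stratum])
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Illustrative IDs only: three made-up strata of 100 tasks each.
ids = [f"db{i % 3}_{i}" for i in range(300)]
train, test = stratified_split(ids, stratum_of=lambda t: t.split("_")[0])
print(len(train), len(test))  # 210 90
```

Because rounding happens per stratum, uneven strata can shift the totals by a task or two from an exact 70/30 ratio.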

The baseline run executes 300 tasks at max_concurrency: 3 with an 1800-second (30-minute) per-task timeout. Expect the baseline to take several hours. Tasks that time out are excluded from the split and logged as warnings.
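
For planning purposes, a rough upper bound on baseline wall-clock time follows from the three numbers above. The sketch assumes every task runs all the way to its timeout, which is the pathological case, and ignores provisioning and service startup time; typical runs should finish well under this bound.

```python
import math


def worst_case_hours(num_tasks, max_concurrency, per_task_timeout_s):
    """Upper bound in hours if every task consumed its full timeout."""
    waves = math.ceil(num_tasks / max_concurrency)
    return waves * per_task_timeout_s / 3600


print(worst_case_hours(300, 3, 1800))  # 50.0
```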

Once complete, you will see output like:

[prepare] BIRD task split created: 210 train, 90 test
[prepare] baseline val_score=0.3222 (29/90 passed) — recorded as iteration 0

[prepare] done. Ready to start the optimization loop.

Start the optimization loop

Point your coding agent at the repository and use the following prompt:

Read PROGRAM.md and start the optimization loop.
The baseline is already recorded. Start from step 2 (analyze failures).

The agent will:

Run python benchmark.py to get train-split results
Read train-split traces from workspace/traces/latest/ to diagnose root causes
Edit agent/agent.py with one focused improvement (your system agent code)
Run python gating.py to gate the change in three steps: regression suite, full test score, suite promotion
If the gate passes: commit, run python record.py, update workspace/learnings.md
If the gate fails: revert with git checkout agent/agent.py and try a different approach
Repeat
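
The accept-or-revert decision in the loop above can be pictured as a small pure function. This is only an illustration: the real checks live in gating.py and are not shown on this page. The sketch assumes that threshold (0.8 in the config) is a minimum pass rate on the regression suite and that a change must not lower the test score; both are guesses, and gate_decision and GateResult are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def gate_decision(regression_pass_rate, new_test_score, best_test_score, threshold=0.8):
    """Decide whether a candidate agent change should be kept (assumed rules)."""
    # Step 1 (assumed): the regression suite must stay at or above the threshold.
    if regression_pass_rate < threshold:
        return GateResult(False, "regression suite below threshold: revert")
    # Step 2 (assumed): the full test score must not get worse.
    if new_test_score < best_test_score:
        return GateResult(False, "test score regressed: revert")
    # Step 3: the change is kept and newly passing tasks can be promoted.
    return GateResult(True, "gate passed: commit, record, promote")
```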

What auto-harness adds to BIRD-Interact

The integration adds several components that connect the BIRD-Interact-ADK infrastructure to the auto-harness optimization loop:

| Component | Purpose |
| --- | --- |
| `BirdInteractRunner` in `benchmark.py` | Spawns the three ADK services (user simulator, DB environment, system agent) per run, drives `orchestrator.runner`, parses results into the harness reward format |
| `agent/helpers/bird_interact/bird_service.py` | FastAPI service wrapper that serves your `agent/agent.py` as the BIRD system agent |
| `agent/helpers/bird_interact/bird_adk_runtime.py` | Google ADK runtime adapter that connects the FastAPI service to the ADK evaluation framework |
| `agent/templates/bird_interact.py` | Faithful copy of the stock BIRD-Interact-ADK system agent, your starting point for optimization |
| `program_templates/bird_interact.md` | Benchmark-specific guidance appended to `PROGRAM.md`: trace paths, task ID format, known techniques |

Advanced: pointing at an existing BIRD-Interact install

If you already have BIRD-Interact-ADK installed, you can skip auto-provisioning by setting these keys in experiment_config.yaml:

bird_repo: "/absolute/path/to/BIRD-Interact"       # repo root or BIRD-Interact-ADK dir
bird_python_bin: "/absolute/path/to/python"         # python binary with ADK deps installed
bird_data_path: "/absolute/path/to/bird_interact_data.jsonl"
pg_host: "127.0.0.1"
pg_port: 5432
pg_user: "root"
pg_password: "123123"
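
When pointing at an existing install, it may help to confirm that something is listening on pg_host and pg_port before starting a run. The sketch below uses only a TCP connection attempt from the standard library, so it does not verify credentials or that the listener is actually Postgres; port_is_open is a hypothetical helper.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For the values above you would call port_is_open("127.0.0.1", 5432).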

Known caveats

GPT-5-family models and temperature=0: GPT-5-family models reject an explicit temperature=0 parameter. The bird_interact.py template omits the temperature kwarg for those models to preserve stock behavior. If you are testing a GPT-5-family model, do not add temperature=0 in agent/agent.py.
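
The pattern described here, leaving the temperature argument out entirely for some models, might look like the sketch below. How the template actually detects a GPT-5-family model is not shown on this page; matching on a "gpt-5" name prefix is an assumption, and completion_kwargs is a hypothetical helper.

```python
def completion_kwargs(model, temperature=0):
    """Build request kwargs, omitting temperature for GPT-5-family models."""
    kwargs = {"model": model}
    # Strip an optional "provider/" prefix before checking the model name.
    name = model.split("/", 1)[-1].lower()
    if not name.startswith("gpt-5"):  # assumed detection rule
        kwargs["temperature"] = temperature
    return kwargs


print(completion_kwargs("openai/gpt-5"))
# {'model': 'openai/gpt-5'}
print(completion_kwargs("anthropic/claude-sonnet-4-20250514"))
# {'model': 'anthropic/claude-sonnet-4-20250514', 'temperature': 0}
```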

Separate .venv-adk: prepare.py creates a separate .venv-adk inside bird_interact_adk/ because the ADK’s dependencies (google-adk, psycopg2, etc.) may conflict with other benchmarks’ dependencies. The harness invokes this venv’s Python binary directly — you do not need to activate it manually.
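
Calling a virtual environment's interpreter directly, without activating it, relies on the standard venv layout. The sketch below shows how that path is typically derived; it illustrates the general mechanism rather than the harness's own code, and venv_python is a hypothetical helper.

```python
import os
from pathlib import Path


def venv_python(venv_dir, windows=None):
    """Return the interpreter path inside a venv created by the standard venv module."""
    windows = (os.name == "nt") if windows is None else windows
    relative = "Scripts/python.exe" if windows else "bin/python"
    return Path(venv_dir) / relative


print(venv_python("bird_interact_adk/.venv-adk", windows=False))
# bird_interact_adk/.venv-adk/bin/python
```

Running that interpreter picks up the environment's installed packages, which is why no activation step is needed.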

git-lfs required: The BIRD-Interact dataset is stored on HuggingFace using Git LFS. If git-lfs is not installed, the dataset clone still succeeds, but the .jsonl file contains LFS pointer text instead of the actual data. Install git-lfs before running prepare.py.
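
If you suspect the dataset was cloned without git-lfs, you can check whether the .jsonl file is a pointer rather than real data. Git LFS pointer files are small text files whose first line begins with a fixed version string, which is what this sketch tests for; the helper names are hypothetical.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(head):
    """Return True if the leading bytes of a file look like a Git LFS pointer."""
    return head.startswith(LFS_POINTER_PREFIX)


def file_is_lfs_pointer(path):
    """Read just enough of the file at path to apply is_lfs_pointer."""
    with open(path, "rb") as handle:
        return is_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))
```

A True result means the real data was never downloaded; installing git-lfs and fetching the dataset again should replace the pointer with the actual file.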
