BIRD-Interact is an interactive text-to-SQL benchmark where an agent must complete multi-turn CRUD tasks over a live Postgres database. Unlike static SQL benchmarks, each task involves a conversation between a user simulator, a database environment, and your system agent, all running as separate services. auto-harness wraps this three-service setup entirely:
python prepare.py auto-provisions the BIRD-Interact-ADK repository, its isolated virtual environment, the HuggingFace dataset, and the Postgres Docker container, then runs the baseline. This page covers the full setup including the one-time ground truth access step.
Requirements
- Docker: runs the Postgres database container required by BIRD-Interact
- Python 3.12+: required by BIRD-Interact-ADK's dependencies
- git-lfs: required to clone the BIRD-Interact dataset from HuggingFace
- OPENAI_API_KEY or ANTHROPIC_API_KEY: for your configured agent and user simulator models
- A coding agent: Claude Code, Codex CLI, or any agent that can read files and run shell commands
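The requirements above can be checked before running prepare.py. The following is a minimal sketch, not part of auto-harness: the preflight function and its parameters are hypothetical names, and it only assumes that docker and git-lfs are discoverable on PATH and that the API keys are exported in the environment.

```python
import os
import shutil
import sys


def preflight(env=None, which=shutil.which, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper: `which` and `version` are injectable for testing.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    problems = []
    # Tools that must be on PATH according to the requirements list.
    for tool in ("docker", "git-lfs"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if version < (3, 12):
        problems.append("Python 3.12+ required")
    # At least one provider key must be set and non-empty.
    if not (env.get("OPENAI_API_KEY") or env.get("ANTHROPIC_API_KEY")):
        problems.append("set OPENAI_API_KEY or ANTHROPIC_API_KEY")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("missing:", problem)
```

prepare.py performs its own validation, so a check like this is only useful for failing early, for example in CI.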
prepare.py auto-provisions everything: it clones BIRD-Interact-ADK, creates an isolated .venv-adk, downloads the dataset from HuggingFace, and starts the Postgres container. You do not need to set these up manually unless you want to point at an existing install.
Setup
Set up environment variables
Create or edit .env and set your LLM API keys. With anthropic/claude-sonnet-4-20250514 as the system agent and anthropic/claude-haiku-4-5-20251001 as the user simulator, ANTHROPIC_API_KEY is required.
Obtain ground truth data (one-time step)
The public BIRD-Interact dataset ships without gold SQL answers to prevent data leakage. You need to request access once before running the baseline.
- Email bird.bench25@gmail.com with the subject line:
- You will receive a .jsonl file with the ground truth answers.
- Run the merge script provided by prepare.py (it prints the exact command when the ground truth is missing).
- Proceed to the next step.
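Conceptually, merging ground truth is a join of two JSONL files on a shared task identifier. The sketch below illustrates that idea only. It is not the merge script that prepare.py points to, and the key name instance_id and the function names are assumptions made for illustration. Use the command printed by prepare.py for the real merge.

```python
import json


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def merge_ground_truth(tasks, ground_truth, key="instance_id"):
    """Copy ground truth fields onto matching task records.

    `key` is an assumed identifier field. Returns the merged records
    and the ids of tasks that had no ground truth entry.
    """
    gt_by_id = {rec[key]: rec for rec in ground_truth}
    merged, missing = [], []
    for task in tasks:
        gt = gt_by_id.get(task[key])
        if gt is None:
            missing.append(task[key])
            merged.append(dict(task))
        else:
            # Ground truth fields are added to (or override) the task fields.
            merged.append({**task, **gt})
    return merged, missing
```

Reporting the unmatched ids separately makes a partial or mismatched ground truth file visible instead of silently producing incomplete tasks.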
Initialize the workspace and run the baseline
prepare.py auto-provisions everything in this order:
- Validates required environment variables and tooling (docker, git-lfs)
- Clones BIRD-Interact-ADK into ./bird_interact_adk/ (gitignored)
- Creates an isolated .venv-adk inside bird_interact_adk/ with ADK dependencies installed
- Clones the bird-interact-lite dataset from HuggingFace into bird_interact_data/
- Starts the Postgres Docker container
- Creates workspace/ and initializes suite.json, learnings.md, results.tsv, and train_results.json
- Copies agent/templates/bird_interact.py into agent/agent.py as the starting point
- Composes PROGRAM.md from program_templates/base.md + program_templates/bird_interact.md
- Runs all 300 lite tasks, generates a stratified 70/30 train/test split at bird_data/task_split.json
- Records the baseline score as iteration 0
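The stratified 70/30 split in the list above can be pictured with a short sketch. This is not the code prepare.py runs: the function name, the seed, and the choice of stratum are assumptions, and the caller supplies stratum_of because this page does not say which attribute the harness stratifies on.

```python
import random
from collections import defaultdict


def stratified_split(tasks, stratum_of, train_frac=0.7, seed=0):
    """Split tasks into train/test, keeping each stratum near train_frac.

    `stratum_of` maps a task to its (sortable) group label; which label
    the real harness uses is not specified here.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_stratum = defaultdict(list)
    for task in tasks:
        by_stratum[stratum_of(task)].append(task)

    train, test = [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

Splitting within each group, rather than over the whole task list, keeps rare groups represented on both sides of the split.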
Start the optimization loop
Point your coding agent at the repository and use the following prompt:
The agent will:
- Run python benchmark.py to get train-split results
- Read train-split traces from workspace/traces/latest/ to diagnose root causes
- Edit agent/agent.py with one focused improvement (your system agent code)
- Run python gating.py to gate the change in three steps: regression suite, full test score, suite promotion
- If the gate passes: commit, run python record.py, update workspace/learnings.md
- If the gate fails: revert with git checkout agent/agent.py and try a different approach
- Repeat
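The gate-then-commit-or-revert part of the loop above can be sketched as a small driver. This is an illustration only: the coding agent normally runs these commands itself, the iterate and run names are hypothetical, and the sketch assumes that gating.py signals a failed gate with a non-zero exit code, which this page does not confirm.

```python
import subprocess


def run(cmd):
    """Run a shell command and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode


def iterate(run=run, message="agent: one focused improvement"):
    """One pass after agent/agent.py has been edited. Returns True if kept."""
    # Assumption: exit code 0 from gating.py means the gate passed.
    if run("python gating.py") == 0:
        run(f'git commit -am "{message}"')
        run("python record.py")
        return True
    # Gate failed: discard the edit so the next attempt starts clean.
    run("git checkout agent/agent.py")
    return False
```

Updating workspace/learnings.md is left to the agent, since it is a writing step rather than a command.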
What auto-harness adds to BIRD-Interact
The integration adds several components that connect the BIRD-Interact-ADK infrastructure to the auto-harness optimization loop:

| Component | Purpose |
|---|---|
| BirdInteractRunner in benchmark.py | Spawns the three ADK services (user simulator, DB environment, system agent) per run, drives orchestrator.runner, parses results into the harness reward format |
| agent/helpers/bird_interact/bird_service.py | FastAPI service wrapper that serves your agent/agent.py as the BIRD system agent |
| agent/helpers/bird_interact/bird_adk_runtime.py | Google ADK runtime adapter that connects the FastAPI service to the ADK evaluation framework |
| agent/templates/bird_interact.py | Faithful copy of the stock BIRD-Interact-ADK system agent, your starting point for optimization |
| program_templates/bird_interact.md | Benchmark-specific guidance appended to PROGRAM.md: trace paths, task ID format, known techniques |
Advanced: pointing at an existing BIRD-Interact install
If you already have BIRD-Interact-ADK installed, you can skip auto-provisioning by setting these keys in experiment_config.yaml:
Known caveats
Separate .venv-adk: prepare.py creates a separate .venv-adk inside bird_interact_adk/ because the ADK’s dependencies (google-adk, psycopg2, etc.) may conflict with other benchmarks’ dependencies. The harness invokes this venv’s Python binary directly, so you do not need to activate it manually.
git-lfs required: The BIRD-Interact dataset is stored on HuggingFace using Git LFS. If git-lfs is not installed, the dataset clone will succeed but the .jsonl file will contain LFS pointer text instead of actual data. Install git-lfs before running prepare.py.
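A clone made without git-lfs can be detected by inspecting the first bytes of the dataset file, because Git LFS pointer files begin with a fixed version line. The helper below is a minimal sketch with a hypothetical name. It is not part of auto-harness, and the path you pass in is whichever .jsonl file the clone produced.

```python
# Git LFS pointer files start with this version line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file is an LFS pointer rather than real data."""
    with open(path, "rb") as fh:
        return fh.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If this returns True, install git-lfs and fetch the real file contents (for example with git lfs pull inside the dataset directory) before running the baseline.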