BIRD-Interact is an interactive text-to-SQL benchmark where an agent must complete multi-turn database tasks — querying, inserting, updating, and deleting records — against a live Postgres database. Unlike static SQL benchmarks that grade a single generated query, BIRD-Interact simulates a real conversation: a user simulator sends requests, the agent issues SQL through the BIRD-Interact-ADK services, and correctness is judged on the final database state. auto-harness integrates this via BirdInteractRunner, which spawns the three required ADK services and drives orchestrator.runner to completion.

Agent interface

The BIRD-Interact agent is a Google ADK agent served as a FastAPI service. auto-harness wraps agent/agent.py in agent/helpers/bird_interact/bird_service.py and exposes it at system_agent_port (default 6100). The BIRD orchestrator routes user simulator messages to this service, and the service calls agent/agent.py’s build_agent() to produce responses. The two interaction modes control the agent’s conversational strategy:
| Mode | Description |
| --- | --- |
| a-interact | Autonomous tool-using SQL agent that acts on requests directly |
| c-interact | Clarification-first conversational SQL agent that asks before acting |
The default integration and the starting template assume a-interact.

BirdInteractRunner

BirdInteractRunner in benchmark.py manages the full lifecycle: resolving the ADK directory and Python interpreter, starting the three services, invoking orchestrator.runner, parsing results, and copying traces.

Constructor

BirdInteractRunner(
    bird_repo: str | None = None,           # path to BIRD-Interact-ADK; auto-resolved if None
    bird_python_bin: str | None = None,     # python with ADK deps; auto-resolved if None
    split: str | None = "train",            # "train", "test", or None (all tasks)
    mode: str = "a-interact",              # "a-interact" or "c-interact"
    dataset: str = "lite",                 # "lite" (300 tasks) or "full" (600 tasks)
    data_path: str | None = None,          # override path to bird_interact_data.jsonl
    agent_model: str | None = None,        # model for the system agent
    user_model: str | None = None,         # model for the user simulator
    patience: int = 3,                     # retries before a task is marked failed
    n_concurrent: int = 3,                 # simultaneous tasks
    per_task_timeout: int = 1800,          # seconds per task
    jobs_dir: str = "workspace/bird_runs",
    system_agent_port: int = 6100,
    user_sim_port: int = 6101,
    db_env_port: int = 6102,
    pg_host: str | None = None,            # Postgres host override
    pg_port: int | None = None,
    pg_user: str | None = None,
    pg_password: str | None = None,
)

Split file

The train/test split is stored at bird_data/task_split.json (the SPLIT_FILE class constant), generated by prepare.py during the baseline run using a 70/30 stratified shuffle with seed=42.
BirdInteractRunner.SPLIT_FILE = "bird_data/task_split.json"
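
As a rough illustration of what a 70/30 stratified shuffle with a fixed seed can look like, here is a minimal sketch. It is not the code in prepare.py: the stratification key (`db`), the task dictionaries, and the `{"train": [...], "test": [...]}` output layout are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(tasks, key, train_frac=0.7, seed=42):
    """Split task IDs into train/test while keeping each `key` group's ratio."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    groups = defaultdict(list)
    for task in tasks:
        groups[task[key]].append(task["instance_id"])

    split = {"train": [], "test": []}
    for group_key in sorted(groups):  # sorted so iteration order is stable
        ids = sorted(groups[group_key])
        rng.shuffle(ids)
        cut = round(len(ids) * train_frac)
        split["train"].extend(ids[:cut])
        split["test"].extend(ids[cut:])
    return split


# Hypothetical tasks: ten per database, two databases.
tasks = [
    {"instance_id": f"{db}_{i}", "db": db}
    for db in ("alpha", "beta")
    for i in range(10)
]
split = stratified_split(tasks, key="db")
```

Because the shuffle is seeded and the groups are processed in sorted order, running the function twice on the same tasks yields the same split, which is the property that matters for comparing runs against a fixed test gate.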

Datasets

| Dataset | Tasks | Path |
| --- | --- | --- |
| lite | 300 | bird_interact_adk/bird-interact-lite/bird_interact_data.jsonl |
| full | 600 | bird_interact_adk/bird-interact-full/bird_interact_data.jsonl |

The 3-service architecture

Each BirdInteractRunner.run() call starts three FastAPI services via uvicorn, waits for each to pass a /health check, runs the orchestrator, then terminates all services:
| Service | Module | Default port | Description |
| --- | --- | --- | --- |
| System agent | agent.helpers.bird_interact.bird_service | 6100 | Serves agent/agent.py as a FastAPI endpoint |
| User simulator | user_simulator.server | 6101 | Drives the conversation from the BIRD-Interact-ADK |
| DB environment | db_environment.server | 6102 | Manages the Postgres session |
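
The "wait for each service to pass a /health check" step can be sketched as a polling loop. This is an illustrative sketch only, not the runner's actual code: the function names, the timeout and interval defaults, and the rule that only an HTTP 200 counts as healthy are assumptions.

```python
import time
import urllib.error
import urllib.request


def probe_health(port, host="127.0.0.1"):
    """Return True if GET http://host:port/health answers with HTTP 200."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or answered with an error status


def wait_for_services(ports, timeout=30.0, interval=0.5, probe=probe_health):
    """Poll until every port passes `probe`, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    pending = set(ports)
    while pending:
        pending = {p for p in pending if not probe(p)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

With the default ports this would be called as `wait_for_services([6100, 6101, 6102])`. The `probe` parameter is injectable so the loop can be exercised without starting any service.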
The orchestrator is invoked as a subprocess:
cmd = [
    self.python_bin, "-m", "orchestrator.runner",
    "--mode", self.mode,
    "--data", input_path,
    "--output", output_path,
    "--concurrency", str(concurrency),
]
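
For readers who want to reproduce the invocation outside the runner, the sketch below wraps the same argument list in a helper and runs it with a timeout. The helper names, the mode validation, and the timeout handling are assumptions for illustration; only the flags themselves come from the command shown above.

```python
import subprocess
import sys


def build_orchestrator_cmd(python_bin, mode, input_path, output_path, concurrency):
    """Assemble the orchestrator.runner argument list shown above."""
    if mode not in ("a-interact", "c-interact"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        python_bin, "-m", "orchestrator.runner",
        "--mode", mode,
        "--data", input_path,
        "--output", output_path,
        "--concurrency", str(concurrency),
    ]


def run_cmd(cmd, cwd=None, timeout=None):
    """Run a command; return (returncode, output), with returncode None on timeout."""
    try:
        proc = subprocess.run(
            cmd, cwd=cwd, timeout=timeout, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout + proc.stderr


# Hypothetical paths; the real ones are chosen by the runner.
cmd = build_orchestrator_cmd("python", "a-interact", "in.jsonl", "out.jsonl", 3)

# Demonstrate run_cmd with a harmless command instead of the real orchestrator.
code, out = run_cmd([sys.executable, "-c", "print('ok')"], timeout=30)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when paths contain spaces.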

Trace management

After each train-split run, the runner copies per-task traces into the workspace:
workspace/traces/latest/<instance_id>/
├── trace.json    # dialogue_history, tool_trajectory, adk_events, final_response
└── result.json   # raw per-task reward and metadata
workspace/traces/baseline/ holds immutable first-run traces and is never overwritten. Only train-split traces are saved.
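
The "latest is refreshed, baseline is written once" behavior can be sketched as follows. This is a simplified illustration rather than the runner's implementation: the `save_traces` and `read_reward` helpers, the choice to wipe `latest` before copying, and the `reward` key inside result.json are assumptions.

```python
import json
import shutil
import tempfile
from pathlib import Path


def save_traces(run_dir, workspace):
    """Copy per-task trace dirs to traces/latest; seed traces/baseline only once."""
    traces = Path(workspace) / "traces"
    latest, baseline = traces / "latest", traces / "baseline"
    if latest.exists():
        shutil.rmtree(latest)  # latest always mirrors the most recent run
    shutil.copytree(run_dir, latest)
    if not baseline.exists():  # baseline is never overwritten
        shutil.copytree(run_dir, baseline)


def read_reward(traces_dir, instance_id):
    """Read the (assumed) `reward` field from a task's result.json."""
    result = Path(traces_dir) / instance_id / "result.json"
    return json.loads(result.read_text())["reward"]


# Two fake runs for one hypothetical task: the first fails, the second passes.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    for run_name, reward in (("run1", 0.0), ("run2", 1.0)):
        task_dir = tmp / run_name / "task_a"
        task_dir.mkdir(parents=True)
        (task_dir / "result.json").write_text(json.dumps({"reward": reward}))
        save_traces(tmp / run_name, tmp / "workspace")
    latest_reward = read_reward(tmp / "workspace" / "traces" / "latest", "task_a")
    baseline_reward = read_reward(tmp / "workspace" / "traces" / "baseline", "task_a")
```

After both runs, `latest` reflects the second run while `baseline` still holds the first, so the two directories can be diffed to see which tasks changed outcome.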

Auto-provisioning with prepare.py

prepare.py handles everything automatically on first run:
1. Clones BIRD-Interact-ADK into ./bird_interact_adk/ (gitignored).
2. Creates an isolated venv: .venv-adk inside bird_interact_adk/, with the ADK’s dependencies (google-adk, psycopg2, etc.) isolated from the main project.
3. Downloads the dataset: clones the bird-interact-lite dataset from HuggingFace via git-lfs.
4. Starts Postgres: starts the BIRD Postgres Docker container.
5. Runs the baseline: runs all 300 lite tasks and generates the train/test split at bird_data/task_split.json.

Advanced users can skip auto-provisioning by setting bird_repo and bird_python_bin in experiment_config.yaml to point at an existing BIRD-Interact-ADK install.

Ground truth access

The public BIRD-Interact dataset ships without gold SQL to prevent data leakage. You must request it before the baseline run will produce meaningful scores.
1. Email for ground truth: email bird.bench25@gmail.com with subject [bird-interact-lite GT&Test Cases].
2. Merge the ground truth: run the combine_public_with_gt.py script that prepare.py prints when it detects missing ground truth, passing the .jsonl file you receive.
3. Re-run prepare.py:
   python prepare.py

If prepare.py detects missing ground truth, it prints the exact merge command to run — you do not need to locate the script manually.

Configuration

Uncomment and edit the BIRD-INTERACT block in experiment_config.yaml:
benchmark: "bird-interact"
mode: "a-interact"                              # or "c-interact"
dataset: "lite"                                 # "lite" or "full"
split: "train"
gate_split: "test"
max_concurrency: 3
threshold: 0.8
patience: 3
per_task_timeout: 1800
system_agent_port: 6100
user_sim_port: 6101
db_env_port: 6102
agent_model: "anthropic/claude-sonnet-4-20250514"
user_model: "anthropic/claude-haiku-4-5-20251001"
Advanced overrides (only needed if you have an existing BIRD-Interact install):
# bird_repo: "/abs/path/to/BIRD-Interact"
# bird_python_bin: "/abs/path/to/python"
# bird_data_path: "/abs/path/to/bird_interact_data.jsonl"
# pg_host: "127.0.0.1"
# pg_port: 5432
# pg_user: "root"
# pg_password: "123123"
Required environment variables:
  • ANTHROPIC_API_KEY (or OPENAI_API_KEY / GEMINI_API_KEY depending on the configured model)
Required tooling: Docker (for the Postgres container), git-lfs (for the HuggingFace dataset).
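
A quick preflight check for the requirements above might look like the sketch below. It is not part of auto-harness: the `missing_requirements` helper is hypothetical, and the mapping from a model-name prefix such as `anthropic/` to an API key variable is an assumption based on the model strings in the example config.

```python
import os
import shutil

# Assumed mapping from the provider prefix of a model name to its API key variable.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_requirements(models, env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for tool in ("docker", "git-lfs"):
        if which(tool) is None:
            problems.append(f"missing tool: {tool}")
    for model in models:
        provider = model.split("/", 1)[0]
        key = PROVIDER_KEYS.get(provider)
        if key is None:
            message = f"unknown provider for model: {model}"
        elif not env.get(key):
            message = f"missing env var: {key}"
        else:
            continue
        if message not in problems:  # avoid duplicates when models share a provider
            problems.append(message)
    return problems
```

Calling `missing_requirements([agent_model, user_model])` before running prepare.py would surface a missing key or tool up front, before any cloning or container startup begins.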

Known caveats

GPT-5-family models reject an explicit temperature=0 argument. The starting template at agent/templates/bird_interact.py omits the temperature kwarg for those models to preserve stock behavior. Other models are unaffected.
The .venv-adk created inside bird_interact_adk/ is intentionally isolated. The ADK’s dependencies (google-adk, psycopg2, and others) may conflict with the main project’s dependencies. Do not install ADK packages into the main Python environment.
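
The temperature caveat above can be handled with a small guard when building model arguments. The sketch below is an assumption about how such a guard could look, not the code in agent/templates/bird_interact.py: the `generation_kwargs` helper and the rule of matching model names that start with `gpt-5` are illustrative.

```python
def generation_kwargs(model, temperature=0.0):
    """Return sampling kwargs, omitting temperature for GPT-5-family models."""
    # Strip an optional provider prefix such as "openai/" before matching.
    name = model.rsplit("/", 1)[-1].lower()
    if name.startswith("gpt-5"):
        return {}  # these models reject an explicit temperature argument
    return {"temperature": temperature}
```

The returned dictionary can be splatted into whatever model constructor is in use, so the affected models receive no temperature argument at all rather than a modified one.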

Editing agent/agent.py

The starting template at agent/templates/bird_interact.py is a faithful copy of the stock BIRD-Interact-ADK system agent. The coding agent can improve:
  • AINTERACT_INSTRUCTION — the system prompt for autonomous (a-interact) mode
  • CINTERACT_INSTRUCTION — the system prompt for conversational (c-interact) mode
  • build_agent() — how the model and ADK session are configured per mode
The external BIRD-Interact-ADK repo is treated as read-only benchmark infrastructure. The coding agent must not modify it.
