Evaluation class: run SQL evaluation pipelines

The Evaluation class is the main entry point for computing metrics between a predicted SQL query and a ground-truth SQL query. You instantiate it once with a configuration dictionary and then call run_evaluation() for each query pair you want to score. The class routes to one of seven backend pipelines based on the evaluation_technique key, handles database execution, and optionally writes structured JSON logs to disk.

Constructor

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

evaluator = Evaluation(config)

config

dict

required

Configuration dictionary controlling which pipeline runs and how results are logged.

Show config keys

evaluation_technique

EvaluationTechnique

required

Which metric pipeline to run. Must be one of the seven EvaluationTechnique enum values. See EvaluationTechnique reference.

db_params

dict

required

Database connection parameters passed directly to DatabaseHandler.

Show db_params keys

dbms

DBMS

required

Database management system to connect to. Use DBMS.SQLITE for BIRD benchmark databases.

db_path

str

required

Absolute or relative path to the database file (for SQLite) or connection string (for other backends).

penalize_extra_pred_cols

bool

default:"True"

When True, predicted columns that have no match in the ground truth are counted in the denominator when computing EXP. When False, only matched columns are counted. Has no effect on EXECUTION_ACCURACY and UNIFIED_COLUMN_AND_SEMANTIC_ROW.

embedding_model

OpenAIModel

required

Embedding model used for semantic column and row similarity. Required when evaluation_technique is SEMANTIC_COLUMN_AND_EXACT_CELL, SEMANTIC_COLUMN_AND_PARTIAL_CELL, or UNIFIED_COLUMN_AND_SEMANTIC_ROW. Ignored by the other techniques.

logs_dir_path

str

required

Root directory where evaluation logs are written. A subdirectory named after the technique is created automatically, e.g. logs_dir_path/semantic_column_and_partial_cell/.

enable_log

bool

default:"False"

When True, write a timestamped JSON log file for every run_evaluation() call. Primarily used in CLI mode; in Python you control logging by passing the log argument directly to run_evaluation().

`run_evaluation()`

Executes the configured evaluation pipeline against a single query pair and returns a context dictionary containing metrics and intermediate pipeline state.

result = evaluator.run_evaluation(
    predicted_sql=predicted_sql,
    ground_truth_sql=ground_truth_sql,
    log=True,
)

predicted_sql

str

required

The SQL query generated by the Text-to-SQL system under evaluation.

ground_truth_sql

str

required

The reference SQL query from the benchmark dataset.

log

bool

default:"True"

When True, writes a JSON log file to the configured logs_dir_path. Set to False for batch runs where you don’t need per-query artefacts.

Return value

run_evaluation() returns a dictionary (the pipeline context) that always contains metrics and latency. The other keys depend on which technique ran.

metrics

object

required

The computed evaluation scores for this query pair.

Show metrics fields

integer

required

Execution accuracy. 1 if set(predicted_rows) == set(ground_truth_rows), otherwise 0. Always present regardless of technique.

EXP

float

Execution precision. Ratio of matched cells to total predicted cells. Present for all techniques except EXECUTION_ACCURACY.

EXR

float

Execution recall. Ratio of matched cells to total ground-truth cells. Present for all techniques except EXECUTION_ACCURACY.

float

Harmonic mean of EXP and EXR. Present for all techniques except EXECUTION_ACCURACY.

latency

float

required

Wall-clock time in seconds from when the pipeline started to when assign_metrics completed. Includes query execution and, for semantic techniques, embedding API calls.

has_error

boolean

required

True if either SQL query failed to execute. When True, all metric values are 0.

error_message

string

Human-readable error message. Only present when has_error is True.

predicted_sql

string

Echo of the predicted_sql argument, stored in the context for log traceability.

ground_truth_sql

string

Echo of the ground_truth_sql argument.

evaluation_timestamp

string

ISO-style timestamp string added when log=True. Format: YYYY-MM-DD_HH-MM-SS.

The context dictionary also carries intermediate pipeline state such as pred_cols, gt_cols, pred_rows, gt_rows, matched_cells, and (for semantic techniques) matched_cols and similarity_matrix. These are useful for debugging but are not part of the stable public API.

Usage examples

Python: basic usage

The example below is taken directly from the __main__ block in src/metrics/evaluation.py.

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

predicted_sql = """
SELECT T3.Phone, T3.City, T3.State, T3.MailStreet
FROM satscores T1
JOIN schools T3 ON T1.cds = T3.CDSCode
WHERE T1.NumTstTakr IS NOT NULL AND T1.NumGE1500 IS NOT NULL
ORDER BY (T1.NumGE1500 * 1.0 / T1.NumTstTakr) DESC
LIMIT 10;
"""

ground_truth_sql = """
SELECT T1.Phone
FROM schools AS T1
INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds
ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC
LIMIT 10;
"""

db_name = "california_schools"

config = {
    "evaluation_technique": EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL,
    "db_params": {
        "dbms": DBMS.SQLITE,
        "db_path": f"data/benchmarks/Bird/dev_databases/{db_name}/{db_name}.sqlite",
    },
    "penalize_extra_pred_cols": True,
    "embedding_model": OpenAIModel.TEXT_EMBEDDING_3_SMALL,
    "logs_dir_path": "data/evaluation_outputs/",
}

evaluator = Evaluation(config)
res = evaluator.run_evaluation(predicted_sql, ground_truth_sql, log=True)

print("Semantic Evaluation Results:")
print(f"Metrics: {res['metrics']}, Latency: {res['latency']}")

Python: minimal binary check

Use EXECUTION_ACCURACY when you only need a pass/fail score and don’t require column-level detail. This technique does not call an embedding API and runs with the fewest dependencies.

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS

config = {
    "evaluation_technique": EvaluationTechnique.EXECUTION_ACCURACY,
    "db_params": {
        "dbms": DBMS.SQLITE,
        "db_path": "data/benchmarks/Bird/dev_databases/california_schools/california_schools.sqlite",
    },
    "penalize_extra_pred_cols": True,
    "embedding_model": None,
    "logs_dir_path": "data/evaluation_outputs/",
}

evaluator = Evaluation(config)
res = evaluator.run_evaluation(
    predicted_sql="SELECT Phone FROM schools LIMIT 10;",
    ground_truth_sql="SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;",
    log=False,
)

print(res["metrics"])  # {'EX': 0}

CLI usage via `load_config_from_env()`

For scripted evaluation runs, you can configure the evaluator entirely through environment variables — typically sourced from scripts/metrics_config.sh — and then invoke evaluation.py as a script.

from src.metrics.evaluation import load_config_from_env, Evaluation

config = load_config_from_env()
evaluator = Evaluation(config)
res = evaluator.run_evaluation(
    predicted_sql=args.predicted_sql,
    ground_truth_sql=args.ground_truth_sql,
    log=config["enable_log"],
)

load_config_from_env() reads the following environment variables and raises ValueError if a required variable is missing or invalid.

Variable	Required	Description
`EVAL_TECHNIQUE`	Yes	String value of an `EvaluationTechnique` enum member
`DBMS`	Yes	DBMS name, e.g. `SQLITE`
`DB_PATH`	Yes	Path to the database file
`EMBEDDING_MODEL`	Yes	`OpenAIModel` attribute name, e.g. `TEXT_EMBEDDING_3_SMALL`
`LOGS_DIR_PATH`	Yes	Root directory for log files
`PENALIZE_EXTRA_PRED_COLS`	No	`"true"` or `"false"`. Defaults to `"true"`
`ENABLE_LOG`	No	`"true"` or `"false"`. Defaults to `"false"`

Command-line invocation:

# 1. Set environment variables
source scripts/metrics_config.sh

# 2. Run evaluation
python src/metrics/evaluation.py \
  --predicted-sql "SELECT Phone FROM schools LIMIT 10;" \
  --ground-truth-sql "SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;"

Set ENABLE_LOG=true in metrics_config.sh to persist a timestamped JSON artefact for every CLI run. Log files are written to LOGS_DIR_PATH/<technique_name>/evaluation-<timestamp>.json.

Evaluation

Query Mutation

Core Utilities

Evaluation class: run SQL evaluation pipelines

Constructor

`run_evaluation()`

Return value

Usage examples

Python: basic usage

Python: minimal binary check

CLI usage via `load_config_from_env()`

Build docs developers (and LLMs) love

Evaluation

Query Mutation

Core Utilities

Documentation Index

​Constructor

​run_evaluation()

​Return value

​Usage examples

​Python: basic usage

​Python: minimal binary check

​CLI usage via load_config_from_env()

Build docs developers (and LLMs) love

Constructor

`run_evaluation()`

Return value

Usage examples

Python: basic usage

Python: minimal binary check

CLI usage via `load_config_from_env()`