Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt

Use this file to discover all available pages before exploring further.

The Evaluation class is the main entry point for computing metrics between a predicted SQL query and a ground-truth SQL query. You instantiate it once with a configuration dictionary and then call run_evaluation() for each query pair you want to score. The class routes to one of seven backend pipelines based on the evaluation_technique key, handles database execution, and optionally writes structured JSON logs to disk.

Constructor

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

evaluator = Evaluation(config)
config
dict
required
Configuration dictionary controlling which pipeline runs and how results are logged.

run_evaluation()

Executes the configured evaluation pipeline against a single query pair and returns a context dictionary containing metrics and intermediate pipeline state.
result = evaluator.run_evaluation(
    predicted_sql=predicted_sql,
    ground_truth_sql=ground_truth_sql,
    log=True,
)
predicted_sql
str
required
The SQL query generated by the Text-to-SQL system under evaluation.
ground_truth_sql
str
required
The reference SQL query from the benchmark dataset.
log
bool
default:"True"
When True, writes a JSON log file to the configured logs_dir_path. Set to False for batch runs where you don’t need per-query artefacts.

Return value

run_evaluation() returns a dictionary (the pipeline context) that always contains metrics and latency. The other keys depend on which technique ran.
metrics
object
required
The computed evaluation scores for this query pair.
latency
float
required
Wall-clock time in seconds from when the pipeline started to when assign_metrics completed. Includes query execution and, for semantic techniques, embedding API calls.
has_error
boolean
required
True if either SQL query failed to execute. When True, all metric values are 0.
error_message
string
Human-readable error message. Only present when has_error is True.
predicted_sql
string
Echo of the predicted_sql argument, stored in the context for log traceability.
ground_truth_sql
string
Echo of the ground_truth_sql argument.
evaluation_timestamp
string
ISO-style timestamp string added when log=True. Format: YYYY-MM-DD_HH-MM-SS.
The context dictionary also carries intermediate pipeline state such as pred_cols, gt_cols, pred_rows, gt_rows, matched_cells, and (for semantic techniques) matched_cols and similarity_matrix. These are useful for debugging but are not part of the stable public API.

Usage examples

Python: basic usage

The example below is taken directly from the __main__ block in src/metrics/evaluation.py.
from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

predicted_sql = """
SELECT T3.Phone, T3.City, T3.State, T3.MailStreet
FROM satscores T1
JOIN schools T3 ON T1.cds = T3.CDSCode
WHERE T1.NumTstTakr IS NOT NULL AND T1.NumGE1500 IS NOT NULL
ORDER BY (T1.NumGE1500 * 1.0 / T1.NumTstTakr) DESC
LIMIT 10;
"""

ground_truth_sql = """
SELECT T1.Phone
FROM schools AS T1
INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds
ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC
LIMIT 10;
"""

db_name = "california_schools"

config = {
    "evaluation_technique": EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL,
    "db_params": {
        "dbms": DBMS.SQLITE,
        "db_path": f"data/benchmarks/Bird/dev_databases/{db_name}/{db_name}.sqlite",
    },
    "penalize_extra_pred_cols": True,
    "embedding_model": OpenAIModel.TEXT_EMBEDDING_3_SMALL,
    "logs_dir_path": "data/evaluation_outputs/",
}

evaluator = Evaluation(config)
res = evaluator.run_evaluation(predicted_sql, ground_truth_sql, log=True)

print("Semantic Evaluation Results:")
print(f"Metrics: {res['metrics']}, Latency: {res['latency']}")

Python: minimal binary check

Use EXECUTION_ACCURACY when you only need a pass/fail score and don’t require column-level detail. This technique does not call an embedding API and runs with the fewest dependencies.
from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS

config = {
    "evaluation_technique": EvaluationTechnique.EXECUTION_ACCURACY,
    "db_params": {
        "dbms": DBMS.SQLITE,
        "db_path": "data/benchmarks/Bird/dev_databases/california_schools/california_schools.sqlite",
    },
    "penalize_extra_pred_cols": True,
    "embedding_model": None,
    "logs_dir_path": "data/evaluation_outputs/",
}

evaluator = Evaluation(config)
res = evaluator.run_evaluation(
    predicted_sql="SELECT Phone FROM schools LIMIT 10;",
    ground_truth_sql="SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;",
    log=False,
)

print(res["metrics"])  # {'EX': 0}

CLI usage via load_config_from_env()

For scripted evaluation runs, you can configure the evaluator entirely through environment variables — typically sourced from scripts/metrics_config.sh — and then invoke evaluation.py as a script.
from src.metrics.evaluation import load_config_from_env, Evaluation

config = load_config_from_env()
evaluator = Evaluation(config)
res = evaluator.run_evaluation(
    predicted_sql=args.predicted_sql,
    ground_truth_sql=args.ground_truth_sql,
    log=config["enable_log"],
)
load_config_from_env() reads the following environment variables and raises ValueError if a required variable is missing or invalid.
VariableRequiredDescription
EVAL_TECHNIQUEYesString value of an EvaluationTechnique enum member
DBMSYesDBMS name, e.g. SQLITE
DB_PATHYesPath to the database file
EMBEDDING_MODELYesOpenAIModel attribute name, e.g. TEXT_EMBEDDING_3_SMALL
LOGS_DIR_PATHYesRoot directory for log files
PENALIZE_EXTRA_PRED_COLSNo"true" or "false". Defaults to "true"
ENABLE_LOGNo"true" or "false". Defaults to "false"
Command-line invocation:
# 1. Set environment variables
source scripts/metrics_config.sh

# 2. Run evaluation
python src/metrics/evaluation.py \
  --predicted-sql "SELECT Phone FROM schools LIMIT 10;" \
  --ground-truth-sql "SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;"
Set ENABLE_LOG=true in metrics_config.sh to persist a timestamped JSON artefact for every CLI run. Log files are written to LOGS_DIR_PATH/<technique_name>/evaluation-<timestamp>.json.

Build docs developers (and LLMs) love