Standard Text-to-SQL benchmarks measure Execution Accuracy (EX): a predicted SQL query is marked correct if and only if its result set exactly matches the ground truth. This binary measure collapses two very different failure modes, returning too many rows and returning too few, into the same score of zero. SQLMorph addresses this by introducing Execution Precision (EXP), Execution Recall (EXR), and their harmonic mean F1, which quantify how much of the predicted result was correct and how much of the ground truth was recovered. Combined with seven configurable evaluation techniques, these metrics enable fine-grained analysis that reveals differences between systems that EX cannot.
Metric definitions
EX — Execution Accuracy
Binary match of result sets. EX is 1 if the set of rows returned by the predicted SQL exactly matches the set of rows returned by the ground-truth SQL; otherwise it is 0. This is the standard BIRD benchmark metric.
EX is computed by execution_accuracy.run_evaluation_pipeline in src/metrics/metrics/execution_accuracy.py. It executes both queries against the live database using DatabaseHandler and compares the row sets directly.
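The set comparison behind EX can be illustrated with a minimal, self-contained sketch. It uses Python's built-in sqlite3 module and a toy in-memory table rather than SQLMorph's DatabaseHandler, and the execution_accuracy helper below is a hypothetical stand-in, not the library's run_evaluation_pipeline.

```python
import sqlite3


def execution_accuracy(conn, predicted_sql, gold_sql):
    """Return 1 if both queries yield the same set of rows, else 0."""
    predicted_rows = set(conn.execute(predicted_sql).fetchall())
    gold_rows = set(conn.execute(gold_sql).fetchall())
    return int(predicted_rows == gold_rows)


# Toy in-memory database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [(1, "Ana", 31), (2, "Ben", 45), (3, "Cy", 52)],
)

gold = "SELECT name FROM singer WHERE age > 40"
same_rows = "SELECT name FROM singer WHERE age >= 45"  # different SQL, same rows
extra_row = "SELECT name FROM singer WHERE age > 30"   # returns one surplus row

ex_same = execution_accuracy(conn, same_rows, gold)
ex_extra = execution_accuracy(conn, extra_row, gold)
```

Here ex_same is 1 because the two queries return identical row sets, while ex_extra is 0 even though the prediction contains every ground-truth row. That loss of information is what EXP and EXR are meant to recover.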
EXP — Execution Precision
Fraction of predicted rows that are correct. EXP measures over-prediction: if the system returns extra rows beyond what the ground truth specifies, EXP penalises that surplus. A system that always returns all rows in the table would have EXP near zero on most queries.
EXR — Execution Recall
Fraction of ground-truth rows that are recovered. EXR measures under-prediction: if the system misses rows that the ground truth requires, EXR penalises the gap. A system that always returns an empty result set would have EXR of zero.
F1 — harmonic mean of EXP and EXR
Balanced measure of precision and recall. F1 combines EXP and EXR into a single score that penalises both over- and under-prediction. It reaches 1 only when both EXP and EXR are 1.
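As a rough illustration of the three definitions above, the sketch below computes EXP, EXR, and F1 over whole rows treated as sets. This is a simplification: SQLMorph's techniques match at the column and cell level, and both the row_level_scores helper and its handling of empty result sets are assumptions made for this example.

```python
def row_level_scores(predicted_rows, gold_rows):
    """Return (EXP, EXR, F1) for two result sets compared row by row."""
    predicted, gold = set(predicted_rows), set(gold_rows)
    matched = len(predicted & gold)

    # Assumption: an empty side scores 0 rather than raising an error.
    exp = matched / len(predicted) if predicted else 0.0
    exr = matched / len(gold) if gold else 0.0
    f1 = 2 * exp * exr / (exp + exr) if (exp + exr) > 0 else 0.0
    return exp, exr, f1


gold_rows = [("Ben",), ("Cy",)]

# Over-prediction: one surplus row lowers EXP but leaves EXR at 1.
over = row_level_scores([("Ana",), ("Ben",), ("Cy",)], gold_rows)

# Under-prediction: one missing row lowers EXR but leaves EXP at 1.
under = row_level_scores([("Ben",)], gold_rows)
```

Both predictions would score EX = 0, but the over-predicting query gets EXP = 2/3 and EXR = 1, while the under-predicting query gets EXP = 1 and EXR = 1/2.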
Evaluation techniques
SQLMorph provides seven evaluation techniques via the EvaluationTechnique enum in src/metrics/evaluation.py. Each technique defines a different strategy for matching columns and cells between the predicted and ground-truth result sets, offering a spectrum from strict exact matching to semantic similarity.
The techniques fall into two groups: exact matching and semantic matching.
Exact-matching techniques compare column names and cell values as plain strings, using either exact equality or substring matching depending on the technique. No embeddings or external models are required. Use these techniques when you want fast, deterministic evaluation with no API calls.
| Technique | Column matching | Cell matching |
|---|---|---|
| EXECUTION_ACCURACY | N/A | Binary set equality (EX only) |
| EXACT_COLUMN_AND_EXACT_CELL | Exact string match | Exact string match |
| EXACT_COLUMN_AND_PARTIAL_CELL | Exact string match | Partial (substring) match |
| NO_COLUMN_AND_PARTIAL_CELL | Ignored | Partial (substring) match |
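The column and cell matching modes in the table can be sketched as two small predicates. The function names are hypothetical, and the direction of the substring test (either value may contain the other) is an assumption; SQLMorph's own implementation may differ.

```python
def cells_match(predicted, gold, partial=False):
    """Compare two cell values as strings.

    partial=False: exact string equality.
    partial=True: substring match in either direction (an assumption).
    """
    p, g = str(predicted), str(gold)
    if partial:
        return p in g or g in p
    return p == g


def columns_match(predicted_name, gold_name, ignore_columns=False):
    """Exact column-name match, or always True when column names are ignored."""
    return ignore_columns or predicted_name == gold_name


exact_hit = cells_match("New York", "New York City")
partial_hit = cells_match("New York", "New York City", partial=True)
alias_ok = columns_match("city_name", "city", ignore_columns=True)
```

With these toy inputs, the exact comparison rejects the pair, the partial comparison accepts it, and ignoring column names lets an aliased column through.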
EXECUTION_ACCURACY is the standard BIRD baseline. EXACT_COLUMN_AND_EXACT_CELL is the strictest relaxed variant. NO_COLUMN_AND_PARTIAL_CELL is the most permissive exact technique, which is useful when predicted queries return equivalent data under different column aliases.
Using the evaluation API
The Evaluation class in src/metrics/evaluation.py is the primary entry point for programmatic evaluation. Instantiate it with a configuration dictionary, then call run_evaluation with the predicted and ground-truth SQL strings.
Basic example (exact matching, no API key required)
Semantic evaluation example (requires OpenAI API key)
CLI usage
For one-off evaluations, use the CLI after sourcing the configuration script.
Choosing the right technique
When should I use EXECUTION_ACCURACY?
Use EXECUTION_ACCURACY when you want a direct comparison with standard BIRD benchmark scores. It produces only the binary EX score, with no EXP, EXR, or F1. It is the fastest technique because it performs a single set comparison with no column alignment.
When should I use semantic techniques?
Use semantic techniques (SEMANTIC_COLUMN_AND_EXACT_CELL, SEMANTIC_COLUMN_AND_PARTIAL_CELL, or UNIFIED_COLUMN_AND_SEMANTIC_ROW) when your Text-to-SQL system may return columns under different aliases, or when the ground-truth SQL and predicted SQL express the same data with different column names. Semantic matching is more forgiving of surface-level naming differences while still penalising incorrect data.
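One plausible way to align aliased columns is to compare embeddings of the column names. The sketch below assumes that approach. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the 0.8 threshold, the greedy pairing, and the align_columns helper are all hypothetical; none of this is taken from SQLMorph's source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_columns(pred_vecs, gold_vecs, threshold=0.8):
    """Greedily pair each gold column with its most similar unused predicted column."""
    alignment, used = {}, set()
    for gold_name, gold_vec in gold_vecs.items():
        candidates = [
            (cosine(vec, gold_vec), name)
            for name, vec in pred_vecs.items()
            if name not in used
        ]
        if not candidates:
            break
        score, best = max(candidates)
        if score >= threshold:  # hypothetical similarity cut-off
            alignment[gold_name] = best
            used.add(best)
    return alignment


# Toy vectors standing in for embeddings of the column names.
gold_vecs = {"name": (1.0, 0.0, 0.1), "age": (0.0, 1.0, 0.0)}
pred_vecs = {"singer_name": (0.9, 0.1, 0.1), "country": (0.1, 0.1, 1.0)}

alignment = align_columns(pred_vecs, gold_vecs)
```

In this toy case the aliased column singer_name is paired with name, while age stays unmatched because no remaining predicted column is similar enough. An exact-name technique would match neither.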
What does penalize_extra_pred_cols do?
When penalize_extra_pred_cols is True, extra columns in the predicted result that are not in the ground truth reduce EXP. This penalises over-specification, that is, returning more information than requested. Set it to False if you want to reward systems that return correct data even when they include additional columns.
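The effect of the flag can be illustrated with a deliberately simplified, column-level precision. How SQLMorph actually folds extra columns into EXP is not specified here, so the column_precision helper and its formula are assumptions chosen only to show the direction of the effect.

```python
def column_precision(pred_cols, gold_cols, penalize_extra_pred_cols=True):
    """Share of counted predicted columns that also appear in the ground truth.

    Assumption: with the flag off, unmatched predicted columns are simply
    left out of the denominator.
    """
    matched = [c for c in pred_cols if c in gold_cols]
    counted = pred_cols if penalize_extra_pred_cols else matched
    return len(matched) / len(counted) if counted else 0.0


gold_cols = ["name", "age"]
pred_cols = ["name", "age", "country"]  # one column more than requested

penalised = column_precision(pred_cols, gold_cols, penalize_extra_pred_cols=True)
lenient = column_precision(pred_cols, gold_cols, penalize_extra_pred_cols=False)
```

Under these assumptions the surplus country column lowers precision to 2/3 when the flag is on and leaves it at 1.0 when the flag is off.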
Where are evaluation logs saved?
When log=True is passed to run_evaluation, the full evaluation context (metrics, latency, predicted and ground-truth rows, and a timestamp) is serialised to a JSON file under logs_dir_path/<technique_name>/evaluation-<timestamp>.json. Each technique has its own subdirectory, making it easy to compare runs across techniques.
The penalize_extra_pred_cols parameter is named penalize_extra_columns in the METRICS.md documentation, but the Evaluation class config dictionary expects penalize_extra_pred_cols. Use penalize_extra_pred_cols in your code.
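The per-technique log layout, logs_dir_path/<technique_name>/evaluation-<timestamp>.json, can be sketched with the standard library. The write_evaluation_log helper, the timestamp format, and the keys in the example context are assumptions for illustration; SQLMorph's actual log schema may differ.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_evaluation_log(logs_dir_path, technique_name, context):
    """Write context to <logs_dir_path>/<technique_name>/evaluation-<timestamp>.json."""
    # Assumed timestamp format; the real one is not documented here.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_dir = Path(logs_dir_path) / technique_name
    log_dir.mkdir(parents=True, exist_ok=True)  # one subdirectory per technique
    log_path = log_dir / f"evaluation-{timestamp}.json"
    log_path.write_text(json.dumps({**context, "timestamp": timestamp}, indent=2))
    return log_path


# Write into a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as logs_dir_path:
    log_path = write_evaluation_log(
        logs_dir_path,
        "EXACT_COLUMN_AND_EXACT_CELL",
        {"metrics": {"exp": 2 / 3, "exr": 1.0, "f1": 0.8}, "latency": 0.12},
    )
    logged = json.loads(log_path.read_text())
```

Because each technique writes to its own subdirectory, runs of different techniques on the same query pair never overwrite one another.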