Binary execution accuracy (EX) is the standard metric for Text-to-SQL benchmarks: a query either produces the exact right rows or it doesn't. This pass/fail view hides important diagnostic information. Two failed queries can fail in completely different ways: one might return every correct row plus hundreds of spurious extras (high recall, low precision), while another might return only a handful of the required rows and nothing extraneous (high precision, low recall). SQLMorph closes this gap by computing three additional metrics, Execution Precision (EXP), Execution Recall (EXR), and F1, that quantify how wrong a failed query is, not just that it failed.
Metric definitions
EX — execution accuracy
EX is a binary indicator. It equals 1 if the set of rows returned by the predicted SQL exactly matches the set of rows returned by the ground-truth SQL, and 0 otherwise.
For every technique, including EXECUTION_ACCURACY, EX is computed independently of the column-matching and cell-counting logic; it always reflects the raw row-set comparison.
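The raw row-set comparison can be sketched as follows. This is a minimal illustration that assumes result rows arrive as tuples; `execution_accuracy` is a hypothetical helper name, not a function from SQLMorph's API.

```python
from typing import Iterable, Tuple

Row = Tuple[object, ...]


def execution_accuracy(predicted_rows: Iterable[Row],
                       ground_truth_rows: Iterable[Row]) -> int:
    """Return 1 if both results contain the same set of rows, else 0."""
    # Converting to sets discards both row order and duplicate rows.
    return int(set(predicted_rows) == set(ground_truth_rows))


ground_truth = [("(650) 329-3700",), ("(415) 749-3500",)]
# Same rows, different order, one row duplicated.
reordered_with_duplicate = [
    ("(415) 749-3500",),
    ("(650) 329-3700",),
    ("(650) 329-3700",),
]
# One required row is missing.
missing_row = [("(650) 329-3700",)]

ex_same = execution_accuracy(reordered_with_duplicate, ground_truth)
ex_missing = execution_accuracy(missing_row, ground_truth)
print(ex_same, ex_missing)
```

Under this sketch the reordered, duplicated result still scores 1, while the result with a missing row scores 0.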
EX compares sets, not multisets. Duplicate rows are deduplicated before comparison. A predicted query that returns the right rows in a different order still receives EX = 1.
EXP — execution precision
EXP measures what fraction of the predicted result is correct. It is computed over matched cells after column alignment, as EXP = matched_cells / predicted_cells, where:
- predicted_cells = number of rows in the prediction × number of predicted columns (or number of matched columns when penalize_extra_pred_cols=False)
- matched_cells = cells from matched rows, counting exact and/or partial matches depending on the technique
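To illustrate how the choice of denominator changes EXP, here is a small sketch. Only the `penalize_extra_pred_cols` name comes from the text above; the function names, their signatures, and the example numbers are assumptions made for illustration.

```python
def predicted_cell_count(n_pred_rows: int, n_pred_cols: int, n_matched_cols: int,
                         penalize_extra_pred_cols: bool = True) -> int:
    """EXP denominator: rows x predicted columns, or rows x matched columns."""
    n_cols = n_pred_cols if penalize_extra_pred_cols else n_matched_cols
    return n_pred_rows * n_cols


def execution_precision(matched_cells: float, predicted_cells: int) -> float:
    """Fraction of predicted cells that were matched (non-empty prediction assumed)."""
    return matched_cells / predicted_cells


# Illustrative numbers: 4 predicted rows, 3 predicted columns, 2 of which align
# with ground-truth columns, and all 8 cells in the aligned columns match.
strict = execution_precision(8, predicted_cell_count(4, 3, 2, penalize_extra_pred_cols=True))
lenient = execution_precision(8, predicted_cell_count(4, 3, 2, penalize_extra_pred_cols=False))
print(round(strict, 2), round(lenient, 2))
```

With the extra column penalized, the denominator is 12 cells and precision drops to about 0.67; without the penalty, only the 8 cells in matched columns count and precision is 1.0.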
EXR — execution recall
EXR measures what fraction of the ground truth the prediction recovered: EXR = matched_cells / ground_truth_cells, where ground_truth_cells = number of ground-truth rows × number of ground-truth columns.
F1 score
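F1 is the harmonic mean of EXP and EXR, balancing both dimensions into a single score: F1 = 2 × EXP × EXR / (EXP + EXR). Because the harmonic mean is pulled toward the smaller of the two values, a prediction with EXP = 1.0 and EXR = 0.1 gets F1 = 0.18, not the 0.55 an arithmetic mean would give.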
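The gap between the harmonic and arithmetic means can be checked with a few lines of Python. `f1_score` here is a hypothetical helper written for this example, not SQLMorph's own function.

```python
def f1_score(exp: float, exr: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are 0."""
    if exp + exr == 0:
        return 0.0
    return 2 * exp * exr / (exp + exr)


exp, exr = 1.0, 0.1
harmonic = f1_score(exp, exr)
arithmetic = (exp + exr) / 2
print(round(harmonic, 2), round(arithmetic, 2))  # prints: 0.18 0.55
```

The harmonic mean keeps a highly precise but low-recall prediction from looking half right.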
Edge case handling
The metric computation in assign_metrics() handles empty result sets consistently across all techniques:
| Condition | EXP | EXR | F1 | Notes |
|---|---|---|---|---|
| Both sets empty | 1.0 | 1.0 | 1.0 | Both queries returned nothing; treated as perfect match |
| Only ground truth empty | 0.0 | 1.0 | 0.0 | System predicted rows when none were expected |
| Only prediction empty | 1.0 | 0.0 | 0.0 | System returned nothing when rows were expected |
| Normal case | matched/predicted | matched/ground_truth | harmonic mean | Standard computation |
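The table above can be read as a small decision procedure. The sketch below mirrors it under the assumption that an empty result set shows up as a cell count of zero; the real assign_metrics() signature is not shown in this page, so `sketch_assign_metrics` and its parameters are hypothetical.

```python
from typing import Tuple


def sketch_assign_metrics(matched_cells: float, predicted_cells: int,
                          ground_truth_cells: int) -> Tuple[float, float, float]:
    """Return (EXP, EXR, F1), following the edge-case table row by row."""
    if predicted_cells == 0 and ground_truth_cells == 0:
        return 1.0, 1.0, 1.0  # both sets empty: treated as a perfect match
    if ground_truth_cells == 0:
        return 0.0, 1.0, 0.0  # rows predicted when none were expected
    if predicted_cells == 0:
        return 1.0, 0.0, 0.0  # nothing returned when rows were expected
    # Normal case: ratios of matched cells, then the harmonic mean.
    exp = matched_cells / predicted_cells
    exr = matched_cells / ground_truth_cells
    f1 = 2 * exp * exr / (exp + exr) if exp + exr > 0 else 0.0
    return exp, exr, f1


both_empty = sketch_assign_metrics(0, 0, 0)
only_gt_empty = sketch_assign_metrics(0, 5, 0)
only_pred_empty = sketch_assign_metrics(0, 0, 5)
normal = sketch_assign_metrics(3, 10, 3)
print(both_empty, only_gt_empty, only_pred_empty, normal)
```

Handling the empty cases before any division avoids a divide-by-zero and keeps F1 at 0.0 whenever exactly one side is empty.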
EX versus EXP/EXR/F1
When EX = 1, all four metrics equal 1.0; the prediction is perfect. When EX = 0, the binary score gives no further information. EXP, EXR, and F1 fill that gap.
| Scenario | EX | EXP | EXR | F1 | What happened |
|---|---|---|---|---|---|
| Perfect prediction | 1 | 1.0 | 1.0 | 1.0 | Exact match |
| Over-prediction | 0 | low | high | medium | Returned correct rows plus many extras |
| Under-prediction | 0 | high | low | medium | Returned only a subset of correct rows |
| Wrong prediction | 0 | ~0 | ~0 | ~0 | Returned mostly wrong data |
| Partially correct | 0 | medium | medium | medium | Some rows correct, some wrong |
Worked example
Consider a schools database where the ground-truth query returns the top-3 schools ranked by SAT performance:
Ground-truth result (3 rows × 1 column = 3 cells)
| Phone |
|---|
| (650) 329-3700 |
| (415) 749-3500 |
| (510) 531-5300 |
Scenario A: over-prediction
The predicted query returns the right 3 rows but also 7 extra rows (wrong schools):
Predicted result (10 rows × 1 column = 10 cells)

| Phone |
|---|
| (650) 329-3700 |
| (415) 749-3500 |
| (510) 531-5300 |
| (213) 555-0100 |
| … 6 more wrong rows … |
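With the EXACT_COLUMN_AND_EXACT_CELL technique:
- matched_cells = 3 (the 3 correct rows × 1 column)
- predicted_cells = 10
- ground_truth_cells = 3
These counts give EXP = 3/10 = 0.30, EXR = 3/3 = 1.00, and F1 = 2 × 0.30 × 1.00 / 1.30 ≈ 0.46, while EX = 0.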
Scenario B: under-prediction
The predicted query returns only 1 of the 3 correct schools, and nothing else:
Predicted result (1 row × 1 column = 1 cell)

| Phone |
|---|
| (650) 329-3700 |
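The two scenarios can be compared side by side from their cell counts. This sketch assumes exact cell matching for both scenarios, and takes Scenario B's counts directly from its table (1 matched cell, 1 predicted cell, 3 ground-truth cells); `cell_metrics` is a hypothetical helper.

```python
from typing import Tuple


def cell_metrics(matched_cells: int, predicted_cells: int,
                 ground_truth_cells: int) -> Tuple[float, float, float]:
    """Return (EXP, EXR, F1) for non-empty prediction and ground truth."""
    exp = matched_cells / predicted_cells
    exr = matched_cells / ground_truth_cells
    f1 = 2 * exp * exr / (exp + exr) if exp + exr > 0 else 0.0
    return exp, exr, f1


scenario_a = cell_metrics(matched_cells=3, predicted_cells=10, ground_truth_cells=3)
scenario_b = cell_metrics(matched_cells=1, predicted_cells=1, ground_truth_cells=3)

for name, (exp, exr, f1) in [("Scenario A", scenario_a), ("Scenario B", scenario_b)]:
    print(f"{name}: EXP={exp:.2f} EXR={exr:.2f} F1={f1:.2f}")
# Scenario A: EXP=0.30 EXR=1.00 F1=0.46
# Scenario B: EXP=1.00 EXR=0.33 F1=0.50
```

Both scenarios have EX = 0, yet their profiles are opposites: Scenario A recovers everything but is imprecise, while Scenario B is fully precise but recovers only a third of the ground truth.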
metrics response field reference
The metrics dictionary is the primary output of run_evaluation(). All values are scalars.
- EX: Binary execution accuracy. 1 if set(predicted_rows) == set(ground_truth_rows), otherwise 0. Always present.
- EXP: Execution precision in the range [0.0, 1.0]. Present for all techniques except EXECUTION_ACCURACY. A value of 1.0 means every predicted cell was matched; 0.0 means no predicted cell was matched.
- EXR: Execution recall in the range [0.0, 1.0]. Present for all techniques except EXECUTION_ACCURACY. A value of 1.0 means every ground-truth cell was recovered; 0.0 means nothing was recovered.
- F1: Harmonic mean of EXP and EXR, in the range [0.0, 1.0]. Present for all techniques except EXECUTION_ACCURACY. 0.0 when either EXP or EXR is 0.