Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt

Use this file to discover all available pages before exploring further.

SQLMorph provides seven evaluation techniques through the EvaluationTechnique enum. Each technique implements a different strategy for comparing the result set of a predicted SQL query against the ground-truth result set. They range from a simple binary row-set equality check (EXECUTION_ACCURACY) to embedding-based approaches that align columns semantically and score rows by cosine similarity (UNIFIED_COLUMN_AND_SEMANTIC_ROW). You select the technique once when you construct an Evaluation instance, and it controls the entire pipeline.

Selecting a technique

from src.metrics.evaluation import EvaluationTechnique

technique = EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL
Pass the selected value in the config dictionary:
config = {
    "evaluation_technique": EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL,
    # ... other config keys
}
Techniques that require an embedding model (SEMANTIC_COLUMN_AND_EXACT_CELL, SEMANTIC_COLUMN_AND_PARTIAL_CELL, and UNIFIED_COLUMN_AND_SEMANTIC_ROW) call the OpenAI Embeddings API during evaluation. Set the OPENAI_API_KEY environment variable before running and ensure your config["embedding_model"] is a valid OpenAIModel value.

All seven techniques

Enum value: EvaluationTechnique.EXECUTION_ACCURACYString value: "execution_accuracy"Embedding required: NoWhat it measuresBinary row-set equality. Returns EX = 1 if set(predicted_rows) == set(ground_truth_rows), otherwise EX = 0. This is the standard metric used by the BIRD benchmark. It produces only an EX score — no EXP, EXR, or F1.When to use itUse when you need a direct comparison against published BIRD leaderboard scores, or as a fast sanity check before running a heavier technique. Because it ignores column names entirely and treats rows as opaque tuples, it is sensitive to column ordering in SELECT clauses and will return EX = 0 if the predicted query returns extra columns — even if the data is correct.Returned metrics
KeyTypeDescription
EXint1 if result sets match exactly, 0 otherwise
Enum value: EvaluationTechnique.EXACT_COLUMN_AND_EXACT_CELLString value: "exact_column_and_exact_cell"Embedding required: NoWhat it measuresColumn names are matched by exact string equality. The pipeline finds the intersection of predicted and ground-truth column names, projects both result sets onto those shared columns, and then counts matched rows using exact tuple equality. Precision and recall are computed at the cell level: matched_cells / predicted_cells and matched_cells / ground_truth_cells respectively.When to use itUse when the Text-to-SQL system is expected to produce column aliases that exactly match the ground truth — common when the ground truth uses bare column names with no renaming. This is the strictest column-aware technique. If the predicted query renames a column (e.g., phone AS contact_phone), that column will not be matched even if the values are identical.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatExecution precision over matched cells
EXRfloatExecution recall over matched cells
F1floatHarmonic mean of EXP and EXR
Enum value: EvaluationTechnique.EXACT_COLUMN_AND_PARTIAL_CELLString value: "exact_column_and_partial_cell"Embedding required: NoWhat it measuresColumns are matched by exact name, the same as EXACT_COLUMN_AND_EXACT_CELL. The difference is in row matching: after exact row matches are counted, the remaining unmatched rows undergo a greedy partial-row matching phase. Each remaining predicted row is paired with the ground-truth row that shares the most cell values in common (normalised by the number of shared columns). This produces fractional cell-match contributions for partially correct rows.When to use itUse when you expect queries that are directionally correct but off by a few cell values — for example, when the system predicts slightly wrong aggregation values or has off-by-one filter errors. The partial matching surface gives credit for rows that are close but not identical.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatExecution precision, including fractional partial matches
EXRfloatExecution recall, including fractional partial matches
F1floatHarmonic mean of EXP and EXR
Enum value: EvaluationTechnique.SEMANTIC_COLUMN_AND_EXACT_CELLString value: "semantic_column_and_exact_cell"Embedding required: YesWhat it measuresColumns are matched semantically using OpenAI embeddings and the Hungarian algorithm. For each column, a text representation is built from the column name, its inferred data type, and up to ten of its most frequent values. Embeddings of those texts are compared with cosine similarity, and the Hungarian algorithm finds the optimal one-to-one pairing. Only pairs with similarity > 0.7 are kept. Row matching on the semantically aligned columns is then performed with exact tuple equality (no partial credit).When to use itUse when the predicted query uses different column aliases than the ground truth but those columns contain the same data — for instance, phone_number vs Phone vs contact. This technique tolerates naming discrepancies that would cause EXACT_COLUMN_AND_EXACT_CELL to miss valid matches.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatExecution precision over semantically matched cells
EXRfloatExecution recall over semantically matched cells
F1floatHarmonic mean of EXP and EXR
Enum value: EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELLString value: "semantic_column_and_partial_cell"Embedding required: YesWhat it measuresCombines semantic column matching (identical to SEMANTIC_COLUMN_AND_EXACT_CELL) with partial row matching (identical to EXACT_COLUMN_AND_PARTIAL_CELL). After semantically aligned columns are determined, rows are matched first exactly, then residual unmatched rows are scored greedily by per-cell value overlap. Both exact and fractional contributions accumulate into the final matched_cells count.When to use itThis is the most commonly recommended technique for evaluating Text-to-SQL systems on BIRD. It handles both renamed columns and partially correct result sets, making it robust to the two most common failure modes: alias mismatch and aggregation error. It is the technique used in the __main__ example in evaluation.py.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatPrecision with semantic column alignment and partial row credit
EXRfloatRecall with semantic column alignment and partial row credit
F1floatHarmonic mean of EXP and EXR
Enum value: EvaluationTechnique.NO_COLUMN_AND_PARTIAL_CELLString value: "no_column_and_partial_cell"Embedding required: NoWhat it measuresColumn names are completely ignored. Each row is converted to a set of cell value strings, discarding column identity entirely. Rows are matched first by exact cell-set equality, then residual rows are paired using Jaccard similarity over their cell-value sets. Precision and recall are computed over individual cell values rather than whole cells in a column-aligned grid.When to use itUse when the ground truth and prediction may return equivalent data under entirely different schemas — for example, after schema renaming experiments in TQA, or when comparing systems that produce differently structured outputs. Because column identity is ignored, this technique cannot distinguish between a query that returns (name, city) and one that returns (city, name) if the values happen to be the same; it treats them as identical.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatPrecision based on cell-value set overlap
EXRfloatRecall based on cell-value set overlap
F1floatHarmonic mean of EXP and EXR
Enum value: EvaluationTechnique.UNIFIED_COLUMN_AND_SEMANTIC_ROWString value: "unified_column_and_semantic_row"Embedding required: YesWhat it measuresThis technique takes a fundamentally different approach from the others. Instead of projecting onto matched columns, it builds a unified column space from the union of all predicted and ground-truth columns (sorted lexicographically). Both result sets are padded with "N/A" for any column they don’t contain, and each row is serialised as a pipe-separated string. The Hungarian algorithm then matches predicted rows to ground-truth rows based on cosine similarity between their row-level embeddings. A coverage penalty — the fraction of ground-truth columns that appear in the predicted result — is multiplied into each match score. EXP and EXR are computed as sum(penalised_similarity) / P and sum(penalised_similarity) / G respectively, where P and G are the row counts.When to use itUse when you want holistic row-level semantic similarity rather than cell-counting. This technique is most appropriate for exploratory analysis where the predicted query may return a structurally different but semantically related result set. It is the most computationally expensive technique because it embeds every row in both result sets.Returned metrics
KeyTypeDescription
EXintBinary row-set equality
EXPfloatMean penalised row-similarity across predicted rows
EXRfloatMean penalised row-similarity across ground-truth rows
F1floatHarmonic mean of EXP and EXR

Technique comparison at a glance

TechniqueColumn matchingRow matchingEmbeddingMetrics
EXECUTION_ACCURACYNoneExact set equalityNoEX
EXACT_COLUMN_AND_EXACT_CELLExact nameExact tupleNoEX, EXP, EXR, F1
EXACT_COLUMN_AND_PARTIAL_CELLExact nameExact + partial greedyNoEX, EXP, EXR, F1
SEMANTIC_COLUMN_AND_EXACT_CELLSemantic (cosine > 0.7)Exact tupleYesEX, EXP, EXR, F1
SEMANTIC_COLUMN_AND_PARTIAL_CELLSemantic (cosine > 0.7)Exact + partial greedyYesEX, EXP, EXR, F1
NO_COLUMN_AND_PARTIAL_CELLNone (ignored)Jaccard cell-setNoEX, EXP, EXR, F1
UNIFIED_COLUMN_AND_SEMANTIC_ROWUnion + coverage penaltyRow embedding cosineYesEX, EXP, EXR, F1

Build docs developers (and LLMs) love