Text-to-SQL systems have improved rapidly with large language models, yet the standard way to measure them has not kept pace. Execution Accuracy (EX) treats every prediction as either fully correct or fully wrong, masking partial errors and hiding systematic model weaknesses. At the same time, building private evaluation sets that reflect real enterprise schemas is expensive, time-consuming, and hard to reproduce. SQLMorph, a research framework accepted at ICDE 2026, addresses both problems at once: it generates targeted evaluation data automatically through query mutation, and it provides fine-grained execution metrics that reveal what binary accuracy obscures.
What SQLMorph does
SQLMorph combines two complementary mutation techniques with a family of relaxed evaluation metrics.
Join Query Expansion (JQE) takes existing SQL queries and systematically adds semantically valid joins, producing a spectrum of queries that range from simple single-table lookups to complex multi-join statements. This creates a controlled stress test for how systems handle increasing structural complexity. Applied to state-of-the-art systems evaluated on the BIRD benchmark, JQE exposes a consistent pattern: accuracy degrades as the number of joins grows.
Textual Query Augmentation (TQA) generates controlled natural language perturbations of existing questions (paraphrases, abbreviations, formal restatements) without changing the intended SQL answer. These perturbations probe how sensitive a system is to linguistic surface variation. Experiments show that moderate variation causes measurable degradation, and forcing heavy abbreviations results in up to a 17% drop in accuracy across evaluated systems.
Fine-grained metrics
In addition to generating evaluation sets, SQLMorph defines three execution-level metrics that go beyond binary EX:
- Execution Precision (EXP): of the rows the predicted query returns, what fraction are correct?
- Execution Recall (EXR): of the rows the ground-truth query returns, what fraction did the prediction recover?
- F1: the harmonic mean of EXP and EXR, giving a single unified score.
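To make the three definitions concrete, here is a minimal Python sketch that computes them for one prediction. It is an illustration under stated assumptions, not SQLMorph's implementation: the function name `execution_prf`, the toy `employees` table, the choice to compare results as multisets of rows, and the convention for empty results are all hypothetical and may differ from what the framework does.

```python
import sqlite3
from collections import Counter


def execution_prf(pred_rows, gold_rows):
    """Return (EXP, EXR, F1) for two result sets, compared as multisets of rows."""
    pred = Counter(tuple(r) for r in pred_rows)
    gold = Counter(tuple(r) for r in gold_rows)
    overlap = sum((pred & gold).values())  # rows present in both results
    n_pred = sum(pred.values())
    n_gold = sum(gold.values())
    # Assumption: an empty result scores 1.0 only when the other side is empty too.
    exp = overlap / n_pred if n_pred else float(n_gold == 0)
    exr = overlap / n_gold if n_gold else float(n_pred == 0)
    f1 = 2 * exp * exr / (exp + exr) if exp + exr else 0.0
    return exp, exr, f1


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Ana', 'eng', 120), ('Bo', 'eng', 95), ('Cy', 'ops', 80), ('Di', 'eng', 110);
""")

gold_sql = "SELECT name FROM employees WHERE dept = 'eng' AND salary > 100"
pred_sql = "SELECT name FROM employees WHERE dept = 'eng'"  # missing salary filter

gold_rows = conn.execute(gold_sql).fetchall()
pred_rows = conn.execute(pred_sql).fetchall()

exp, exr, f1 = execution_prf(pred_rows, gold_rows)
ex = float(Counter(pred_rows) == Counter(gold_rows))  # binary EX for comparison
print(f"EX={ex:.2f} EXP={exp:.2f} EXR={exr:.2f} F1={f1:.2f}")
# prints: EX=0.00 EXP=0.67 EXR=1.00 F1=0.80
```

In this toy case the prediction drops one filter and returns one extra row. Binary EX scores it as a complete failure, while EXP, EXR, and F1 show that every expected row was recovered and only precision suffered.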
Systems and benchmarks
SQLMorph’s experiments target the BIRD benchmark, a large-scale Text-to-SQL dataset covering diverse real-world databases. The framework has been applied to three state-of-the-art systems:
- CHESS: a pipeline-based system using schema linking and candidate generation
- DIN-SQL: a decomposition-based approach using in-context learning
- MAC-SQL: a multi-agent framework for complex schema navigation
Explore the framework
Join Query Expansion
How JQE adds semantically valid joins to increase structural complexity and probe model limits.
Textual Query Augmentation
How TQA perturbs natural language questions to measure robustness to linguistic variation.
Fine-grained metrics
The definitions and behavior of EXP, EXR, and F1 versus binary Execution Accuracy.
Quickstart
Install SQLMorph and run your first evaluation in under five minutes.