SQLMorph: Query Mutation for Text-to-SQL Evaluation

Text-to-SQL systems have improved rapidly with large language models, yet the standard way to measure them has not kept pace. Execution Accuracy (EX) treats every prediction as either fully correct or fully wrong, masking partial errors and hiding systematic model weaknesses. At the same time, building private evaluation sets that reflect real enterprise schemas is expensive, time-consuming, and hard to reproduce. SQLMorph — a research framework accepted at ICDE 2026 — addresses both problems at once: it generates targeted evaluation data automatically through query mutation, and it provides fine-grained execution metrics that reveal what binary accuracy obscures.

What SQLMorph does

SQLMorph combines two complementary mutation techniques with a family of relaxed evaluation metrics.

Join Query Expansion (JQE) takes existing SQL queries and systematically adds semantically valid joins, producing a spectrum of queries that range from simple single-table lookups to complex multi-join statements. This creates a controlled stress test for how systems handle increasing structural complexity. Applied to state-of-the-art systems evaluated on the BIRD benchmark, JQE exposes a consistent pattern: accuracy degrades as the number of joins grows.

Textual Query Augmentation (TQA) generates controlled natural language perturbations of existing questions (paraphrases, abbreviations, formal restatements) without changing the intended SQL answer. These perturbations probe how sensitive a system is to linguistic surface variation. Experiments show that moderate variation causes measurable degradation, and forcing heavy abbreviations results in up to a 17% drop in accuracy across evaluated systems.
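To make the idea of join expansion concrete, the sketch below shows what a base query and a variant with one added join might look like on a toy database. It is only an illustration and not SQLMorph's expansion procedure: the schema, the two queries, and the helpers `count_joins` and `run` are hypothetical names chosen for this example, and the expanded query is written by hand along a foreign key rather than generated.

```python
import re
import sqlite3

# Hypothetical toy schema for illustration; not taken from BIRD or SQLMorph.
SCHEMA = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
INSERT INTO customers VALUES (1, 'Ada', 'Oslo'), (2, 'Bo', 'Rome');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
"""

# A single-table base query and a hand-written variant that adds one join
# along the orders.customer_id -> customers.id foreign key.
BASE_SQL = "SELECT name FROM customers WHERE city = 'Oslo'"
EXPANDED_SQL = (
    "SELECT c.name, o.total "
    "FROM customers AS c JOIN orders AS o ON o.customer_id = c.id "
    "WHERE c.city = 'Oslo' ORDER BY o.total"
)


def count_joins(sql: str) -> int:
    """Count JOIN keywords as a rough proxy for structural complexity."""
    return len(re.findall(r"\bJOIN\b", sql, flags=re.IGNORECASE))


def run(sql: str) -> list:
    """Execute a query against a fresh in-memory copy of the toy database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for sql in (BASE_SQL, EXPANDED_SQL):
        print(count_joins(sql), run(sql))
```

Grouping evaluation queries by a join count like the one `count_joins` returns is one simple way to look at accuracy as a function of structural complexity.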

Fine-grained metrics

In addition to generating evaluation sets, SQLMorph defines three execution-level metrics that go beyond binary EX:

Execution Precision (EXP) — of the rows the predicted query returned, what fraction are correct?
Execution Recall (EXR) — of the rows the ground-truth query returns, what fraction did the prediction recover?
F1 — the harmonic mean of EXP and EXR, giving a single unified score.

These metrics separate over-prediction (a query that returns too many rows scores low on EXP) from under-prediction (a query that misses rows scores low on EXR), revealing differences between systems that binary EX treats as identical.
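The three metrics can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SQLMorph's implementation: the function name `execution_scores` is hypothetical, result rows are compared as a multiset (duplicates count, order is ignored), and two empty results are scored as a perfect match. The framework's own definitions may differ on these points.

```python
from collections import Counter


def execution_scores(predicted_rows, gold_rows):
    """Return (EXP, EXR, F1) for two query results given as lists of rows.

    Assumptions for this sketch: rows are compared as a multiset, and
    two empty results are treated as a perfect match.
    """
    pred = Counter(tuple(row) for row in predicted_rows)
    gold = Counter(tuple(row) for row in gold_rows)

    n_pred = sum(pred.values())
    n_gold = sum(gold.values())
    if n_pred == 0 and n_gold == 0:
        return 1.0, 1.0, 1.0

    # Multiset intersection: rows that appear in both results.
    overlap = sum((pred & gold).values())

    exp = overlap / n_pred if n_pred else 0.0  # fraction of predicted rows that are correct
    exr = overlap / n_gold if n_gold else 0.0  # fraction of gold rows that were recovered
    f1 = 2 * exp * exr / (exp + exr) if (exp + exr) else 0.0
    return exp, exr, f1


if __name__ == "__main__":
    gold = [(1,), (2,), (3,), (4,)]
    over = gold + [(5,), (6,), (7,), (8,)]  # returns too many rows
    under = [(1,), (2,)]                    # misses rows
    print(execution_scores(over, gold))     # EXP = 0.5, EXR = 1.0
    print(execution_scores(under, gold))    # EXP = 1.0, EXR = 0.5
```

Both example predictions would score 0 under binary EX, yet one fails on precision and the other on recall, which is the distinction the relaxed metrics are meant to expose.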

Systems and benchmarks

SQLMorph’s experiments target the BIRD benchmark, a large-scale Text-to-SQL dataset covering diverse real-world databases. The framework has been applied to three state-of-the-art systems:

CHESS — a pipeline-based system using schema linking and candidate generation
DIN-SQL — a decomposition-based approach using in-context learning
MAC-SQL — a multi-agent framework for complex schema navigation

Across all three, SQLMorph’s mutation techniques and fine-grained metrics surface accuracy patterns that standard evaluation would not reveal.

Explore the framework

Join Query Expansion

How JQE adds semantically valid joins to increase structural complexity and probe model limits.

Textual Query Augmentation

How TQA perturbs natural language questions to measure robustness to linguistic variation.

Fine-grained metrics

The definitions and behavior of EXP, EXR, and F1 versus binary Execution Accuracy.

Quickstart

Install SQLMorph and run your first evaluation in under five minutes.
