SQLMorph architecture and data flow overview

SQLMorph addresses a fundamental limitation in Text-to-SQL research: existing benchmarks measure binary pass/fail accuracy on a fixed set of queries, which leaves structural and linguistic blind spots unexamined. To close that gap, the framework operates in two stages. First, its mutation modules — Join Query Expansion (JQE) and Textual Query Augmentation (TQA) — automatically generate new evaluation queries from an existing benchmark. Second, its fine-grained metrics layer evaluates those queries using Execution Precision (EXP), Execution Recall (EXR), and F1 rather than a single binary score. Applied to three state-of-the-art systems — CHESS, DIN-SQL, and MAC-SQL — SQLMorph exposed accuracy degradation as join complexity grows and revealed up to a 17% accuracy drop when natural language queries use heavy abbreviations.

The data foundation: BIRD benchmark

SQLMorph is built around the BIRD benchmark, a large-scale Text-to-SQL dataset that covers real-world enterprise schemas. Each entry in the BIRD development set provides:

A natural language (NL) question from a user
A ground-truth SQL query that answers it
A database schema defining tables, columns, and foreign-key relationships

SQLMorph takes these entries as its starting point and generates expanded evaluation sets without requiring manual annotation.

How data flows through SQLMorph

Every evaluation in SQLMorph follows a three-step pipeline:

Input: NL query, SQL, and schema

A BIRD benchmark entry enters the system as a triple: the NL question a user would ask, the gold-standard SQL that answers it, and the schema graph of the target database. This triple is the unit of mutation for both JQE and TQA.
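As a rough illustration of that triple, the sketch below models one entry as a small Python data structure. The class and field names (`BenchmarkEntry`, `ForeignKey`, `question`, `gold_sql`) and the sample data are hypothetical, chosen for illustration only; they are not SQLMorph's actual types or BIRD's file format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ForeignKey:
    """One foreign-key edge in the schema (hypothetical representation)."""
    table: str
    column: str
    ref_table: str
    ref_column: str


@dataclass
class BenchmarkEntry:
    """The (NL question, gold SQL, schema) triple that a mutation consumes."""
    question: str
    gold_sql: str
    tables: dict[str, list[str]]  # table name -> column names
    foreign_keys: list[ForeignKey] = field(default_factory=list)


# Made-up example entry, not taken from BIRD.
entry = BenchmarkEntry(
    question="How many orders did each customer place?",
    gold_sql=(
        "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id"
    ),
    tables={
        "orders": ["id", "customer_id"],
        "customers": ["id", "name"],
    },
    foreign_keys=[ForeignKey("orders", "customer_id", "customers", "id")],
)
```

Keeping the foreign keys as explicit edges makes it straightforward to turn the schema into a graph in the mutation step.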

Mutation: JQE or TQA generates expanded queries

JQE extends the SQL component. It analyses the database schema as a graph using NetworkX, identifies tables not yet in the original query, and constructs new queries that add those tables via valid joins. The result is a set of structurally more complex SQL queries paired with LLM-generated NL questions.

TQA extends the NL component. It renames schema elements (tables and columns) using a controlled obfuscation mapping, then prompts GPT-4o to rewrite the original NL question so it reflects the renamed schema. The result is a set of linguistically degraded queries that share the same SQL ground truth.

Both modules produce two output buckets: filtered queries (structurally unique, kept for evaluation) and discarded queries (isomorphic duplicates, excluded).
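The two mutation ideas can be sketched in a few lines of Python. This is a minimal, dependency-free illustration rather than SQLMorph's code: it swaps NetworkX for a plain adjacency dictionary, looks only one join away from the tables already in the query, and uses a made-up vowel-dropping rule as a stand-in for the obfuscation mapping. The function names, the sample schema, and the abbreviation rule are all assumptions; the LLM prompting and the isomorphism filter are left out.

```python
import re
from collections import defaultdict

# Hypothetical foreign keys: (table, column, referenced table, referenced column).
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_items", "order_id", "orders", "id"),
    ("order_items", "product_id", "products", "id"),
]


def build_schema_graph(foreign_keys):
    """Undirected schema graph: table -> {neighbor: (own column, neighbor column)}."""
    graph = defaultdict(dict)
    for table, column, ref_table, ref_column in foreign_keys:
        graph[table][ref_table] = (column, ref_column)
        graph[ref_table][table] = (ref_column, column)
    return dict(graph)


def join_candidates(graph, used_tables):
    """JQE-style step: tables not yet in the query that share a
    foreign-key edge with a table the query already uses."""
    candidates = {}
    for table in sorted(used_tables):
        for neighbor, (column, neighbor_column) in sorted(graph.get(table, {}).items()):
            if neighbor in used_tables or neighbor in candidates:
                continue
            candidates[neighbor] = (
                f"JOIN {neighbor} ON {table}.{column} = {neighbor}.{neighbor_column}"
            )
    return candidates


def abbreviate(identifier):
    """TQA-style step (assumed rule): keep the first letter of each
    underscore-separated part and drop the remaining vowels."""
    parts = [p for p in identifier.split("_") if p]
    return "_".join(p[0] + re.sub(r"[aeiou]", "", p[1:]) for p in parts)


graph = build_schema_graph(FOREIGN_KEYS)
candidates = join_candidates(graph, {"orders"})
rename_map = {name: abbreviate(name) for name in ["customers", "customer_name"]}
```

Here `candidates` maps `customers` and `order_items` to join clauses, while `products` is skipped because it is two joins away from `orders`. A real renaming map would also need to guard against two identifiers collapsing to the same abbreviation.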

Evaluation: fine-grained metrics

A Text-to-SQL system generates a predicted SQL query for each expanded query. SQLMorph executes both the predicted SQL and the ground-truth SQL against the live database, then computes execution accuracy (EX), EXP, EXR, and F1 across seven configurable evaluation techniques, ranging from exact cell matching to semantic embedding-based comparison.
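To make the difference between a binary score and the fine-grained metrics concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the simplest possible comparison: result sets are treated as sets of exact rows, precision is the share of predicted rows found in the ground truth, and recall is the share of ground-truth rows found in the prediction. These definitions, the `score` helper, and the sample table are assumptions for illustration; SQLMorph's seven evaluation techniques may compare results differently.

```python
import sqlite3


def fetch_rows(conn, sql):
    """Run a query and return its rows as a set (order and duplicates ignored)."""
    return set(conn.execute(sql).fetchall())


def score(pred_rows, gold_rows):
    """Binary EX plus row-level precision, recall, and F1 (assumed definitions)."""
    overlap = len(pred_rows & gold_rows)
    precision = overlap / len(pred_rows) if pred_rows else 0.0
    recall = overlap / len(gold_rows) if gold_rows else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "EX": 1.0 if pred_rows == gold_rows else 0.0,
        "EXP": precision,
        "EXR": recall,
        "F1": f1,
    }


# Made-up database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singers VALUES (?, ?)",
    [("Ana", 25), ("Ben", 31), ("Cy", 42), ("Dee", 58)],
)

gold = fetch_rows(conn, "SELECT name FROM singers WHERE age > 40")  # 2 rows
pred = fetch_rows(conn, "SELECT name FROM singers WHERE age > 30")  # 3 rows
result = score(pred, gold)
```

In this example the prediction returns every correct row plus one extra, so EX is 0 while recall is 1.0 and precision is 2/3. The binary score hides that the query was nearly right; the precision and recall split shows which way it erred.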

What the SOTA experiments showed

When CHESS, DIN-SQL, and MAC-SQL were evaluated on JQE-generated queries, all three systems showed measurable accuracy degradation as the number of joins in the expanded queries increased — a weakness invisible to standard benchmark scores. TQA experiments revealed that even moderate linguistic changes, such as replacing readable column names with terse abbreviations, caused up to a 17% drop in execution accuracy. These findings demonstrate that SQLMorph’s targeted choke points surface failure modes that binary metrics and static benchmarks cannot.

Explore the three core modules

Join Query Expansion

Learn how JQE uses graph analysis and isomorphism checks to generate structurally diverse SQL queries.

Textual Query Augmentation

Understand how TQA applies schema renaming and LLM prompting to create linguistically challenging queries.

Fine-grained metrics

Explore the EXP, EXR, and F1 metrics and the seven evaluation techniques that power them.

Quickstart

Run your first JQE expansion or metrics evaluation in a few commands.
