Join Query Expansion takes an existing SQL query from the BIRD benchmark and generates a family of new queries by adding tables from the database schema that are not already present in the original query. Each addition is made through a semantically valid join, so the expanded queries are executable and meaningful. The goal is to probe how well a Text-to-SQL system handles structural complexity — and specifically how its accuracy changes as join depth increases. Applied to CHESS, DIN-SQL, and MAC-SQL, JQE demonstrated clear accuracy degradation that standard benchmarks could not reveal.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt
Use this file to discover all available pages before exploring further.
How JQE builds expanded queries
Represent the schema as a graph
JQE loads the target BIRD database schema as a NetworkX directed graph. Each node represents a table, and each edge represents a foreign-key relationship annotated with the join condition (e.g.,
account.district_id = district.district_id). This graph is pre-computed and stored as a .pkl file alongside each supported database.Extract the query join graph (JQ graph)
The original BIRD query is also represented as a subgraph — its join query graph (JQ graph) — containing only the tables and joins that appear in that specific query. This subgraph is the starting point for expansion.
Find candidate expansions
The
find_candidate_expansions function identifies tables in the schema graph that are not yet in the JQ graph but are connected to at least one table that is. For each such candidate table, it enumerates all valid join conditions using the edge labels from the schema graph. The result is an ordered list of candidate expansion combinations, each specifying a new table and the join conditions that attach it to the existing query.Check for redundant cycles
Before accepting an expansion, JQE checks whether the newly added edges create a redundant cycle — a cycle where the join keys are already fully implied by existing joins in the graph. Redundant cycles are discarded via
check_cycle_redundancy, which uses nx.simple_cycles to enumerate cycles and inspect whether all join keys in the closing edge are already present in other cycle edges.Deduplicate with isomorphism checks
JQE uses
nx.is_isomorphic to compare each expanded graph against all previously accepted graphs and all original BIRD dev-set graphs. Expansions whose graph structure is isomorphic to an already-accepted one are placed in the discarded bucket. A configurable max_isomorphic parameter allows a small number of near-duplicate structures (default: 5) before strict deduplication kicks in.Generate SQL and NL for each expansion
Accepted expansions are converted to executable SQL using
extend_old_query (query-first mode) or translate_graph_into_query (graph-first mode). In query-first mode, the original SQL is extended with additional JOIN clauses. The expanded SQL is then executed against the live SQLite database to confirm validity. GPT-4o generates a corresponding NL question for each accepted expansion.Query-first vs graph-first modes
JQE supports two generation modes, selectable with the-gf flag:
- Query-first (default)
- Graph-first
Extends the original SQL query directly by appending new JOIN clauses and updating the WHERE conditions. This mode preserves the original query’s structure and adds minimal new SQL constructs. It is the default mode and is recommended for most use cases.
Running JQE
The main entry point accepts a database ID and a target number of tables:db_id: one of the 11 supported BIRD development databases (see below)num_tables: number of tables the expanded query should involve; must be at least2and cannot exceed the database-specific maximum stored inrule_inputs/jq_augmentation/<db_id>:jq_graphs_n_tables.pkl
european_football_2 database:
Supported BIRD databases
JQE supports all 11 databases from the BIRD development set:View all 11 supported databases
View all 11 supported databases
| Database ID | Domain |
|---|---|
california_schools | Education |
card_games | Games |
codebase_community | Software |
debit_card_specializing | Finance |
european_football_2 | Sports |
financial | Finance |
formula_1 | Sports |
student_club | Education |
superhero | Entertainment |
thrombosis_prediction | Healthcare |
toxicology | Science |
Output structure
JQE writes its results underdata/rule_outputs/jq_augmentation/:
filtered/— unique expansions that passed isomorphism deduplication; these are the queries used for evaluationdiscarded/— expansions that were structurally too similar to existing queries; kept for transparency but excluded from evaluation
Both
filtered and discarded files come in _aug (expanded) and _ori (original) variants, so you can always trace an expanded query back to its source entry in the BIRD dev set.