Join Query Expansion (JQE) increases the structural complexity of existing SQL queries by adding valid joins drawn from the database schema. Starting from a seed query, JQE enumerates candidate table additions, filters out redundant or isomorphic extensions, and produces paired SQL and natural-language outputs ready for Text-to-SQL evaluation. You control the target database, the number of tables in generated queries, and several sampling parameters that shape the expansion space.
Prerequisites
Before running JQE, make sure you have:

- An `OPENAI_API_KEY` set in a `.env` file at the project root (GPT-4o generates the natural-language side of each pair).
- The BIRD dataset downloaded and placed under `data/` in the project root (see step 2 below).
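A quick pre-flight check can catch a missing key or dataset before a run starts. The sketch below is not part of JQE: the helper name `check_prerequisites` is hypothetical, and it assumes the `.env` file holds plain `KEY=VALUE` lines.

```python
from pathlib import Path


def check_prerequisites(project_root="."):
    """Return a list of problems; an empty list means both prerequisites are met.

    Hypothetical helper, assuming .env contains plain KEY=VALUE lines.
    """
    root = Path(project_root)
    problems = []

    # Collect KEY=VALUE pairs from .env, skipping blanks and comments.
    keys = {}
    env_file = root / ".env"
    if env_file.is_file():
        for line in env_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                name, value = line.split("=", 1)
                keys[name.strip()] = value.strip().strip("'\"")
    if not keys.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing from .env")

    # The dataset is expected in a data/ directory at the project root.
    if not (root / "data").is_dir():
        problems.append("data/ directory not found at the project root")

    return problems
```

Calling `check_prerequisites()` from the project root and printing the returned list shows what, if anything, still needs to be set up.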
Full workflow
Source the configuration
Source the configuration to export the environment variables that tell JQE where to find data and where to write outputs. This sets the following variables:
| Variable | Value |
|---|---|
| `DATA_FOLDER` | `data` |
| `RULE_INPUTS_BASE` | `data/rule_inputs/` |
| `GRAPH_DATA_BASE` | `data/graph_data/bird_graphs/pickles` |
| `RULE_OUTPUTS_BASE` | `data/rule_outputs/jq_augmentation` |
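If you want to confirm from Python that these variables resolved as expected, something like the following may help. It is a minimal sketch rather than JQE's own loading code: the function name `load_paths` is hypothetical, and the fallback values are simply copied from the table above.

```python
import os
from pathlib import Path

# Fallbacks mirror the values listed in the table above.
DEFAULTS = {
    "DATA_FOLDER": "data",
    "RULE_INPUTS_BASE": "data/rule_inputs/",
    "GRAPH_DATA_BASE": "data/graph_data/bird_graphs/pickles",
    "RULE_OUTPUTS_BASE": "data/rule_outputs/jq_augmentation",
}


def load_paths(environ=None):
    """Resolve each variable from the environment, falling back to DEFAULTS."""
    environ = os.environ if environ is None else environ
    return {name: Path(environ.get(name, default)) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    # Report each resolved path and whether it exists on disk.
    for name, path in load_paths().items():
        status = "ok" if path.exists() else "missing"
        print(f"{name}={path} ({status})")
```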
Download the data
Download the dataset from Google Drive. Rename the downloaded folder to `data` and place it at the project root.
The `data/` directory contains the BIRD databases, pre-computed join-query graphs, and rule-input pickles required for expansion.
Run the main expansion script
Pass the database ID and the number of tables you want in the generated queries. For example, you can generate 4-table queries for `european_football_2`.
The script loads pre-augmented subgraphs for the chosen database, enumerates valid expansions, executes each candidate query against the database to verify it returns results, generates natural-language questions with GPT-4o, and writes the output files.
Enable graph-first mode (optional)
By default the script follows a query-first strategy, extending existing queries. To also generate queries translated directly from graph structure, add the `-g` flag.
Graph-first output is written alongside the query-first files (see output structure below).
Inspect the outputs
All outputs land under `data/rule_outputs/jq_augmentation/`. The directory is organised as follows:

- `filtered/`: unique expansions that passed the isomorphism filter.
- `discarded/`: queries skipped because they were structurally identical (up to graph isomorphism) to an already-accepted expansion.
- `*_aug.json`: the expanded (augmented) SQL and NL pairs.
- `*_ori.json`: the original seed queries that were expanded.
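To get a quick overview of a finished run, you can count the files in each category. The sketch below assumes only what the list above states (directories named `filtered` and `discarded`, and files ending in `_aug.json` or `_ori.json`). The function name `summarise_outputs` is hypothetical, and the exact nesting below the base directory is deliberately not relied upon.

```python
from collections import Counter
from pathlib import Path


def summarise_outputs(base="data/rule_outputs/jq_augmentation"):
    """Count *_aug.json and *_ori.json files under filtered/ and discarded/.

    Assumption: each file sits somewhere below a directory named
    'filtered' or 'discarded'; deeper nesting is tolerated.
    """
    base = Path(base)
    counts = Counter()
    for path in base.rglob("*.json"):
        parts = path.relative_to(base).parts
        bucket = next((p for p in parts if p in ("filtered", "discarded")), None)
        if path.stem.endswith("_aug"):
            kind = "aug"
        elif path.stem.endswith("_ori"):
            kind = "ori"
        else:
            kind = None
        # Ignore JSON files that do not match the documented layout.
        if bucket and kind:
            counts[(bucket, kind)] += 1
    return counts
```

For instance, `summarise_outputs()[("filtered", "aug")]` gives the number of augmented files that passed the isomorphism filter.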
CLI arguments
- ID of the BIRD dev database to expand queries for. Must be one of the 11 supported databases listed below.
- Number of tables to include in the generated queries. Must be at least `2` and cannot exceed the maximum pre-computed for the database (stored in `rule_inputs/jq_augmentation/<db_id>/jq_graphs_n_tables.pkl`).
- Maximum number of expanded queries to generate across the entire run. The script stops early if this limit is reached before exhausting all candidate expansions.
- Maximum number of structurally isomorphic expansions to retain for any single seed query graph. Expansions beyond this limit are placed in the `discarded/` output.
- When `true`, sort candidate table additions in ascending order by row count before enumerating expansions. This biases generation toward smaller tables first.
- When `true`, also generate queries translated directly from the extended join graph, in addition to the standard query-first expansions.
Supported databases
JQE works with the following 11 databases from the BIRD dev set:

| Database ID |
|---|
| `california_schools` |
| `card_games` |
| `codebase_community` |
| `debit_card_specializing` |
| `european_football_2` |
| `financial` |
| `formula_1` |
| `student_club` |
| `superhero` |
| `thrombosis_prediction` |
| `toxicology` |
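When scripting many runs, it can be useful to reject an invalid request before launching the expansion. The following is a sketch under stated assumptions, not JQE's own validation: `validate_request` and its `max_tables` parameter are hypothetical. The real script takes the per-database upper bound from the pre-computed pickle, whose format is not shown here, so this sketch expects the caller to supply it.

```python
SUPPORTED_DATABASES = (
    "california_schools",
    "card_games",
    "codebase_community",
    "debit_card_specializing",
    "european_football_2",
    "financial",
    "formula_1",
    "student_club",
    "superhero",
    "thrombosis_prediction",
    "toxicology",
)
MIN_TABLES = 2  # documented lower bound for the number of tables


def validate_request(db_id, num_tables, max_tables=None):
    """Raise ValueError if the request cannot be served; otherwise return it.

    max_tables is a hypothetical stand-in for the per-database upper bound.
    """
    if db_id not in SUPPORTED_DATABASES:
        raise ValueError(f"unsupported database: {db_id!r}")
    if num_tables < MIN_TABLES:
        raise ValueError(f"num_tables must be at least {MIN_TABLES}")
    if max_tables is not None and num_tables > max_tables:
        raise ValueError(f"num_tables exceeds the pre-computed maximum ({max_tables})")
    return db_id, num_tables
```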
The `num_tables` upper bound varies per database. Requesting more tables than the pre-computed graphs support causes the script to log an error and exit.
Example command
An example run targets `european_football_2`, keeps up to 1000 unique expansions (the default), and filters out isomorphic duplicates.
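The interaction between the global cap on expansions and the per-seed limit on isomorphic expansions can be illustrated with a small sketch. This is not JQE's implementation. It compares unlabeled graph shapes by brute force, which is only practical for the handful of tables in a join graph, whereas the real tool may compare labeled graphs or use a graph library. The names `canonical_shape` and `filter_expansions` and the default `max_isomorphic=1` are assumptions. Only the default of 1000 comes from the text above.

```python
from collections import defaultdict
from itertools import permutations


def canonical_shape(edges):
    """Return a key that is identical for all relabelings of a small graph."""
    nodes = sorted({n for edge in edges for n in edge})
    best = None
    # Try every relabeling and keep the smallest sorted edge list.
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        key = tuple(sorted(tuple(sorted((relabel[a], relabel[b]))) for a, b in edges))
        if best is None or key < best:
            best = key
    return best


def filter_expansions(expansions, max_isomorphic=1, max_expansions=1000):
    """Split (seed_id, edges) pairs into kept and discarded lists.

    At most max_isomorphic expansions of the same shape are kept per seed,
    and processing stops once max_expansions have been kept.
    """
    kept, discarded = [], []
    seen = defaultdict(int)
    for seed_id, edges in expansions:
        if len(kept) >= max_expansions:
            break
        key = (seed_id, canonical_shape(edges))
        if seen[key] < max_isomorphic:
            seen[key] += 1
            kept.append((seed_id, edges))
        else:
            discarded.append((seed_id, edges))
    return kept, discarded
```

Here each expansion is a seed identifier plus a list of join edges between table names. Two expansions of the same seed that form, say, a three-table chain count as the same shape even if the tables differ.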