The Join Query Expansion (JQE) module generates new SQL queries by extending existing join-query graphs from the BIRD benchmark. Given a database ID and a target table count, it locates candidate tables adjacent to known query graphs, combines them into extended subgraphs, executes the resulting SQL against the live SQLite database to verify correctness, and partitions the output into filtered (novel) and discarded (isomorphic) query sets. You can drive the module either from Python or from the command line.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt
Use this file to discover all available pages before exploring further.
Python API
aug_n_table_sql_queries()
The primary entry point for programmatic use. Loads subgraphs and the schema graph for the given database, generates extended SQL queries, and returns three collections.
The BIRD benchmark database identifier (e.g.,
"financial", "formula_1"). The function resolves all required paths — the SQLite file, schema graph pickle, and rule-inputs directory — using this value together with the RULE_INPUTS_BASE, GRAPH_DATA_BASE, and DATA_FOLDER environment variables.Number of tables that the augmented queries must involve. This controls which pre-augmented subgraphs are loaded from the rule-inputs directory; only subgraphs of exactly
num_tables nodes are used as seeds.When
True, each extended subgraph is also translated into a canonical SQL query by traversing the graph structure directly (graph-first mode), in addition to the query-first extension. Graph-first queries are stored in the graph_first key of the returned dicts and require additional LLM calls to produce natural-language questions and evidence.Maximum total number of valid extended queries to generate before stopping. The generator iterates over seed graphs and candidate combinations in order and stops as soon as this limit is reached.
Maximum number of structurally isomorphic query graphs to keep in the filtered set. Once
max_isomorphic copies of a given graph pattern exist in the filtered output, further isomorphic extensions are moved to the discarded set. The source default differs from the CLI default (see below).Sort candidate tables in ascending order based on their number of rows before enumerating expansions. When
False, the default ordering is used. The Python function default is True; the CLI default is False.(filtered_aug, discarded_aug, all_extensions).
Queries that passed the isomorphism filter and were accepted into the augmented set.
Queries that were rejected because the extended graph was isomorphic to an existing one and the
max_isomorphic cap was already met. Same structure as filtered_aug.Detailed log of every accepted extension attempt (only populated when
graph_first=True). Each entry carries the original and extended query fields plus a status key of either "filtered" or "discarded".find_candidate_expansions()
A lower-level helper used internally by the generator loop. Given the full schema graph and a single join-query subgraph, it finds every table in the schema that is not yet in the subgraph but is connected to at least one table that is, then enumerates all valid edge/join-key combinations for each candidate.
The full schema graph for the database, loaded from
{GRAPH_DATA_BASE}/{db_id}_graph.pkl. Nodes are table names; edges carry join-condition labels.The join-query graph representing the seed SQL query. Nodes are
(table_name, alias) tuples.Passed through to
sort_condition_combinations() to control the ordering of the returned list.candidate_table key (a (table_name, alias) tuple) and a candidate_edges key (a list of edge dicts, each with an edge and a joins list of join-condition strings). Returns an empty list when no adjacent candidates exist.
Command-line interface
Run the full pipeline — SQL generation followed by natural-language question generation — directly from the shell.| Argument | Type | Description |
|---|---|---|
db_id | string | Database identifier (e.g., financial) |
num_tables | integer | Number of tables in the augmented queries |
| Flag | Long form | Type | Default | Description |
|---|---|---|---|---|
-max | --max_queries | int | 1000 | Maximum number of extended queries to generate |
-iso | --max_isomorphic | int | 1 | Maximum isomorphic queries per graph pattern |
-asc | --sort_asc | bool | False | Sort candidate tables ascending by row count |
-g | --graph_first | bool | False | Enable graph-first augmentation mode |
The CLI default for
--max_isomorphic is 1, whereas the Python function default is 5. Pass -iso explicitly if you want consistent behaviour between the two interfaces.true/false, yes/no, t/f, y/n, and 1/0.
Example
Output file structure
Persistence functions write results under$RULE_OUTPUTS_BASE using a {status}/{mode}/ subdirectory layout.
$RULE_OUTPUTS_BASE
{status}
{mode}
{output_file}_ori.json
{output_file}_aug.json
{output_file}.sql
aug_log
| File | Contents |
|---|---|
*_ori.json | Original BIRD benchmark entries that the augmented queries were extended from |
*_aug.json | Augmented query entries with question_id, db_id, question, evidence, SQL, and difficulty |
*.sql | One SQL query per line followed by a tab and the db_id |
augmentation_log.pickle | Cumulative pickle log of all extension attempts across runs (populated in graph-first mode) |
save_graph_first() writes two JSON files per output:
| File | Contents |
|---|---|
*_exp.json | Explicit questions generated directly from the SQL structure |
*_dev.json | Dev-set-like questions with natural-language evidence |