Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt

Use this file to discover all available pages before exploring further.

The Join Query Expansion (JQE) module generates new SQL queries by extending existing join-query graphs from the BIRD benchmark. Given a database ID and a target table count, it locates candidate tables adjacent to known query graphs, combines them into extended subgraphs, executes the resulting SQL against the live SQLite database to verify correctness, and partitions the output into filtered (novel) and discarded (isomorphic) query sets. You can drive the module either from Python or from the command line.

Python API

aug_n_table_sql_queries()

The primary entry point for programmatic use. Loads subgraphs and the schema graph for the given database, generates extended SQL queries, and returns three collections.
from src.join_query_expansion.sql_query_gen import aug_n_table_sql_queries

filtered_aug, discarded_aug, all_extensions = aug_n_table_sql_queries(
    db_id="financial",
    num_tables=3,
    graph_first=False,
    max_queries=1000,
    max_isomorphic=5,
    sort_asc=True,
)
Parameters
db_id
string
required
The BIRD benchmark database identifier (e.g., "financial", "formula_1"). The function resolves all required paths — the SQLite file, schema graph pickle, and rule-inputs directory — using this value together with the RULE_INPUTS_BASE, GRAPH_DATA_BASE, and DATA_FOLDER environment variables.
num_tables
integer
required
Number of tables that the augmented queries must involve. This controls which pre-augmented subgraphs are loaded from the rule-inputs directory; only subgraphs of exactly num_tables nodes are used as seeds.
graph_first
boolean
default:"False"
When True, each extended subgraph is also translated into a canonical SQL query by traversing the graph structure directly (graph-first mode), in addition to the query-first extension. Graph-first queries are stored in the graph_first key of the returned dicts and require additional LLM calls to produce natural-language questions and evidence.
max_queries
integer
default:"1000"
Maximum total number of valid extended queries to generate before stopping. The generator iterates over seed graphs and candidate combinations in order and stops as soon as this limit is reached.
max_isomorphic
integer
default:"5"
Maximum number of structurally isomorphic query graphs to keep in the filtered set. Once max_isomorphic copies of a given graph pattern exist in the filtered output, further isomorphic extensions are moved to the discarded set. The source default differs from the CLI default (see below).
sort_asc
boolean
default:"True"
Sort candidate tables in ascending order based on their number of rows before enumerating expansions. When False, the default ordering is used. The Python function default is True; the CLI default is False.
Return value Returns a 3-tuple: (filtered_aug, discarded_aug, all_extensions).
filtered_aug
dict
Queries that passed the isomorphism filter and were accepted into the augmented set.
discarded_aug
dict
Queries that were rejected because the extended graph was isomorphic to an existing one and the max_isomorphic cap was already met. Same structure as filtered_aug.
all_extensions
list[dict]
Detailed log of every accepted extension attempt (only populated when graph_first=True). Each entry carries the original and extended query fields plus a status key of either "filtered" or "discarded".

find_candidate_expansions()

A lower-level helper used internally by the generator loop. Given the full schema graph and a single join-query subgraph, it finds every table in the schema that is not yet in the subgraph but is connected to at least one table that is, then enumerates all valid edge/join-key combinations for each candidate.
from src.join_query_expansion.sql_query_gen import find_candidate_expansions

combinations = find_candidate_expansions(
    schema_graph=schema,   # networkx.Graph loaded from the schema pickle
    jq_graph=jq_graph,     # networkx.DiGraph representing the seed query
    sort_asc=False,
)
schema_graph
networkx.Graph
required
The full schema graph for the database, loaded from {GRAPH_DATA_BASE}/{db_id}_graph.pkl. Nodes are table names; edges carry join-condition labels.
jq_graph
networkx.DiGraph
required
The join-query graph representing the seed SQL query. Nodes are (table_name, alias) tuples.
sort_asc
boolean
default:"False"
Passed through to sort_condition_combinations() to control the ordering of the returned list.
Returns a list of combination dicts. Each dict has a candidate_table key (a (table_name, alias) tuple) and a candidate_edges key (a list of edge dicts, each with an edge and a joins list of join-condition strings). Returns an empty list when no adjacent candidates exist.

Command-line interface

Run the full pipeline — SQL generation followed by natural-language question generation — directly from the shell.
python -m src/join_query_expansion/main.py <db_id> <num_tables> [options]
Positional arguments
ArgumentTypeDescription
db_idstringDatabase identifier (e.g., financial)
num_tablesintegerNumber of tables in the augmented queries
Options
FlagLong formTypeDefaultDescription
-max--max_queriesint1000Maximum number of extended queries to generate
-iso--max_isomorphicint1Maximum isomorphic queries per graph pattern
-asc--sort_ascboolFalseSort candidate tables ascending by row count
-g--graph_firstboolFalseEnable graph-first augmentation mode
The CLI default for --max_isomorphic is 1, whereas the Python function default is 5. Pass -iso explicitly if you want consistent behaviour between the two interfaces.
Boolean flags accept true/false, yes/no, t/f, y/n, and 1/0. Example
python -m src/join_query_expansion/main.py financial 3 \
  --max_queries 500 \
  --max_isomorphic 2 \
  --sort_asc true \
  --graph_first false

Output file structure

Persistence functions write results under $RULE_OUTPUTS_BASE using a {status}/{mode}/ subdirectory layout.
$RULE_OUTPUTS_BASE
{status}
{mode}
{output_file}_ori.json
{output_file}_aug.json
{output_file}.sql
FileContents
*_ori.jsonOriginal BIRD benchmark entries that the augmented queries were extended from
*_aug.jsonAugmented query entries with question_id, db_id, question, evidence, SQL, and difficulty
*.sqlOne SQL query per line followed by a tab and the db_id
augmentation_log.pickleCumulative pickle log of all extension attempts across runs (populated in graph-first mode)
For graph-first output, save_graph_first() writes two JSON files per output:
FileContents
*_exp.jsonExplicit questions generated directly from the SQL structure
*_dev.jsonDev-set-like questions with natural-language evidence

Build docs developers (and LLMs) love