The Textual Query Augmentation (TQA) module decreases the naturalness of BIRD benchmark questions and evidence strings by replacing human-readable table and column names with obfuscated schema identifiers. It calls an LLM (default: GPT-4o) with a few-shot prompt that instructs the model to rewrite the question and evidence to match a provided schema mapping, then streams results row-by-row into an output CSV so progress is never lost between runs. The module supports resuming interrupted jobs: any (db_id, question_id) pair already present in the output CSV is skipped automatically.
process_nl_queries()
The main entry point. Reads the input queries and schema mapping, calls generate_less_natural_nl() for each unprocessed row, and writes results incrementally to output_csv_path.
queries_csv_path: Path to the input CSV file containing the queries to augment. Must have at minimum the columns question_id, db_id, question, and evidence. The function renames question → original_question and evidence → original_evidence internally before processing.
mapping_csv_path: Path to the schema mapping CSV that describes how table and column names should be rewritten. Must contain the columns db_id, table_name, column_name, new_table_name, and new_column_name.
output_csv_path: Path where the augmented results are written. The function creates parent directories if they do not exist. If the file already exists, completed rows are loaded and skipped, enabling resumable runs.
The OpenAI model to use for generation. Pass any member of the OpenAIModel enum (e.g., OpenAIModel.GPT_4O_MINI) or use the default GPT-4o for highest quality output.
Random seed forwarded to the OpenAI completion call. Use a fixed seed to make outputs more reproducible across retries; OpenAI treats seeded sampling as best-effort, so identical outputs are not guaranteed.
Sampling temperature for the LLM. 0 produces the most deterministic output; higher values introduce more variation.
Returns a dict of processing statistics for the entire run.
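The resume behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the helper names load_completed_keys and pending_rows are hypothetical, and comparing question_id as a string is an assumption made so that values read back from CSV match the input rows.

```python
import csv
from pathlib import Path


def load_completed_keys(output_csv_path):
    """Return the set of (db_id, question_id) pairs already in the output CSV."""
    path = Path(output_csv_path)
    if not path.exists():
        # First run: nothing has been processed yet.
        return set()
    with path.open(newline="", encoding="utf-8") as f:
        # CSV values are read back as strings, so normalise question_id to str.
        return {(row["db_id"], str(row["question_id"])) for row in csv.DictReader(f)}


def pending_rows(rows, completed):
    """Keep only the input rows whose key has not been processed yet."""
    return [r for r in rows if (r["db_id"], str(r["question_id"])) not in completed]
```

Keying on the pair rather than on question_id alone keeps the check correct even if two databases were to reuse the same question number.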
generate_less_natural_nl()
Generates a less natural question and evidence for a single query item. Called internally by process_nl_queries() but also useful for one-off transformations.
The original natural-language question to be rewritten.
The original evidence string accompanying the question.
The schema mapping for the relevant database, as returned by read_schema_mapping() for a single db_id. Structure: {table_name: {"new_name": str, "columns": {old_col: new_col}}}.
The OpenAI model to use for generation.
Random seed passed to the completion call.
Sampling temperature for the LLM.
tuple[str, str]: (new_question, new_evidence). If the LLM response cannot be parsed as JSON, or if an exception occurs, the function returns the original (question, evidence) unchanged rather than raising.
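The fall-back-to-original behaviour might look roughly like the sketch below. The function name parse_rewrite and the JSON keys new_question and new_evidence are assumptions made for illustration; the source does not state the response format beyond it being JSON. The sketch also catches only parsing-related exceptions, whereas the documented function falls back on any exception.

```python
import json


def parse_rewrite(response_text, question, evidence):
    """Return (new_question, new_evidence), or the originals if parsing fails."""
    try:
        payload = json.loads(response_text)
        # The key names below are assumed for illustration only.
        return str(payload["new_question"]), str(payload["new_evidence"])
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, a missing key, or a non-object payload:
        # hand back the inputs unchanged instead of raising.
        return question, evidence
```

Returning the inputs keeps a long batch running, but it also means a failed rewrite is indistinguishable from a successful one unless the caller compares the result with the original strings.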
read_schema_mapping()
Reads the schema mapping CSV and organises it as a nested dict keyed by db_id.
Path to the schema mapping CSV. Required columns: db_id, table_name, column_name, new_table_name, new_column_name.
Returns a dict with the shape: {db_id: {table_name: {"new_name": str, "columns": {old_col: new_col}}}}.
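As a rough sketch of how such a nested dict can be assembled from the five required columns, the example below groups rows with csv.DictReader. The function name build_schema_mapping and the sample rows are made up for illustration; the real read_schema_mapping() takes a file path and may differ in detail.

```python
import csv
import io


def build_schema_mapping(csv_file):
    """Group mapping rows into {db_id: {table_name: {"new_name", "columns"}}}."""
    mapping = {}
    for row in csv.DictReader(csv_file):
        tables = mapping.setdefault(row["db_id"], {})
        # The first row seen for a table fixes its obfuscated name.
        table = tables.setdefault(
            row["table_name"],
            {"new_name": row["new_table_name"], "columns": {}},
        )
        table["columns"][row["column_name"]] = row["new_column_name"]
    return mapping


# Made-up sample rows; real mappings come from the schema mapping CSV.
sample = io.StringIO(
    "db_id,table_name,column_name,new_table_name,new_column_name\n"
    "shop,customers,name,t1,c1\n"
    "shop,customers,city,t1,c2\n"
)
mapping = build_schema_mapping(sample)
print(mapping["shop"]["customers"])
# {'new_name': 't1', 'columns': {'name': 'c1', 'city': 'c2'}}
```

Indexing the result by a single db_id yields the per-database structure that generate_less_natural_nl() expects.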
print_nl_processing_summary()
Prints a formatted summary of the statistics dict returned by process_nl_queries().
The statistics dict returned by process_nl_queries().
CSV schemas
Input queries CSV (queries_csv_path)
| Column | Type | Description |
|---|---|---|
| question_id | integer | Unique identifier matching the BIRD benchmark |
| db_id | string | Database identifier |
| question | string | Natural-language question (renamed original_question by the function) |
| evidence | string | Evidence string (renamed original_evidence by the function) |
| SQL | string | Gold SQL query (passed through unchanged) |
| difficulty | string | Difficulty label (passed through unchanged) |
Schema mapping CSV (mapping_csv_path)
| Column | Type | Description |
|---|---|---|
| db_id | string | Database identifier |
| table_name | string | Original table name |
| column_name | string | Original column name |
| new_table_name | string | Obfuscated table name |
| new_column_name | string | Obfuscated column name |
Output CSV (output_csv_path)
| Column | Type | Description |
|---|---|---|
| question_id | integer | Identifier from the input row |
| db_id | string | Database identifier |
| original_question | string | Unmodified question from the input |
| original_evidence | string | Unmodified evidence from the input |
| new_question | string | LLM-generated less-natural question |
| new_evidence | string | LLM-generated less-natural evidence |
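
Row-by-row writing to a file with these columns can be sketched as follows. This is an illustration under assumptions rather than the module's implementation: append_result and OUTPUT_COLUMNS are hypothetical names, and the column list is taken from the output table above.

```python
import csv
from pathlib import Path

# Column order copied from the output CSV table; assumed, not confirmed.
OUTPUT_COLUMNS = [
    "question_id", "db_id",
    "original_question", "original_evidence",
    "new_question", "new_evidence",
]


def append_result(output_csv_path, row):
    """Append one result row, creating the file and header on first use."""
    path = Path(output_csv_path)
    # Create missing parent directories, as the documented function does.
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=OUTPUT_COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Opening the file in append mode for every row costs a little I/O, but each finished row is on disk before the next LLM call starts, which is what makes an interrupted run resumable.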