Textual Query Augmentation (TQA) rewrites the natural-language questions and evidence strings in a Text-to-SQL dataset to be deliberately less natural — replacing intuitive table and column names with obfuscated ones according to a schema-renaming map. The rewriting is performed by GPT-4o via a few-shot prompt, and the pipeline supports incremental processing so that interrupted runs can resume from where they left off.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
- An
OPENAI_API_KEYenvironment variable available in your shell. - A schema mapping CSV that defines the renaming for each database.
- A queries CSV in the BIRD format (columns:
question_id,db_id,question,evidence,SQL).
Full workflow
Prepare the schema mapping CSV
Create or obtain a CSV file that maps original table and column names to their obfuscated replacements. Each row represents one column mapping. The required columns are:
The default mapping file used in experiments lives at:
| Column | Description |
|---|---|
db_id | Database identifier (e.g. california_schools) |
table_name | Original table name |
new_table_name | Replacement table name |
column_name | Original column name |
new_column_name | Replacement column name |
Prepare the queries CSV
The input queries file must contain the following columns:
The pipeline internally renames
| Column | Description |
|---|---|
question_id | Unique question identifier |
db_id | Database identifier |
question | Original natural-language question |
evidence | Evidence string accompanying the question |
SQL | Ground-truth SQL query |
question → original_question and evidence → original_evidence before processing.API reference
process_nl_queries()
The primary entry point for batch augmentation.
Path to the input CSV containing the queries to augment.
Path to the schema mapping CSV that defines table and column renames.
Path where the augmented output CSV will be written. If the file already exists, entries already present in it are skipped (incremental processing).
The OpenAI model to use for rewriting. Accepts any member of the
OpenAIModel enum, such as OpenAIModel.GPT_4O.Random seed passed to the OpenAI API for reproducible generation.
Sampling temperature passed to the OpenAI API. Use
0 for fully deterministic output.generate_less_natural_nl()
For single-item generation, call this function directly:
process_nl_queries() calls internally for each row.
print_nl_processing_summary()
Prints a formatted summary of a completed run to stdout, including total, skipped, successful, and failed counts broken down per database.
stats dict is the return value of process_nl_queries().
Output CSV structure
The output CSV written tooutput_csv_path contains the following columns (plus any extra columns carried through from the input):
| Column | Description |
|---|---|
question_id | Original question identifier |
db_id | Database identifier |
original_question | Unmodified natural-language question |
original_evidence | Unmodified evidence string |
new_question | Rewritten question with decreased naturalness |
new_evidence | Rewritten evidence with obfuscated schema names |
SQL | Unchanged ground-truth SQL |
TQA supports incremental processing. If
output_csv_path already exists from a previous run, process_nl_queries() reads it on startup and skips any (db_id, question_id) pair already present. You can safely interrupt and restart the pipeline without reprocessing completed entries.