Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt

Use this file to discover all available pages before exploring further.

Textual Query Augmentation addresses a different weakness than structural complexity: linguistic brittleness. Real-world database schemas often use terse, abbreviated, or domain-specific names that don’t match how users phrase their questions. TQA simulates this mismatch by renaming schema elements — tables and columns — using a controlled obfuscation mapping, then asking GPT-4o to rewrite the original natural language question so it reflects the new names. The ground-truth SQL stays the same; only the NL side changes. When applied to CHESS, DIN-SQL, and MAC-SQL, forcing heavy abbreviations caused up to a 17% drop in execution accuracy — a fragility invisible to standard benchmark evaluation.

The decrease-naturalness technique

TQA’s primary technique is decrease naturalness: it systematically degrades the readability of schema identifiers and propagates those changes into the NL queries that reference them. The process has two phases: schema renaming and NL query alteration.

Phase 1: schema renaming

A schema mappings CSV file defines how each table and column in a BIRD database should be renamed. The mapping records the original name and its obfuscated replacement — for example, replacing a readable column name like school_name with a terse abbreviation like schl_nm. The read_schema_mapping function loads this CSV and organises it into a nested dictionary keyed by db_id, then table name, then column name:
# src/textual_query_augmentation/decrease_naturalness/alter_nl_queries.py

def read_schema_mapping(mapping_csv_path):
    df = pd.read_csv(mapping_csv_path)
    db_mappings = defaultdict(lambda: defaultdict(dict))

    for _, row in df.iterrows():
        db_id = row["db_id"]
        table_name = row["table_name"]
        new_table_name = row["new_table_name"]

        if "columns" not in db_mappings[db_id][table_name]:
            db_mappings[db_id][table_name]["new_name"] = new_table_name
            db_mappings[db_id][table_name]["columns"] = {}

    for _, row in df.iterrows():
        db_id = row["db_id"]
        table_name = row["table_name"]
        column_name = row["column_name"]
        new_column_name = row["new_column_name"]
        db_mappings[db_id][table_name]["columns"][column_name] = new_column_name

    return db_mappings

Phase 2: NL query alteration via LLM

Once the schema mapping is loaded, generate_less_natural_nl constructs a prompt that includes the formatted mapping and the original NL question, then sends it to GPT-4o. The prompt template is a few-shot template stored at src/textual_query_augmentation/decrease_naturalness/prompts_templates/fewshot-decrease-nl-naturalness. The model returns a JSON object with an updated_question and updated_evidence field:
def generate_less_natural_nl(
    question,
    evidence,
    schema_mapping,
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
):
    prompt_renderer = PromptRenderer(
        templates_dir_path="src/textual_query_augmentation/decrease_naturalness/prompts_templates/"
    )
    llm = ModelManager.create_model(
        model_provider=ModelProvider.OPENAI,
        model_type=ModelType.COMPLETION,
        model_name=model_name,
        openai_api_key=os.getenv("OPENAI_API_KEY", None),
    )

    formatted_schema_mapping = format_schema_mapping(schema_mapping)
    prompt_context = {
        "_SCHEMA_MAPPING_": formatted_schema_mapping,
        "_QUESTION_": question,
        "_EVIDENCE_": evidence,
    }
    prompt = prompt_renderer.render("fewshot-decrease-nl-naturalness", prompt_context)
    messages = compose_chat_messages(user_messages=[prompt])
    llm_response = llm.get_chat_completion(messages=messages, seed=seed, temperature=temperature)

    response_json = json.loads(llm_response["completion_content"][0].strip())
    new_question = response_json.get("updated_question", question)
    new_evidence = response_json.get("updated_evidence", evidence)
    return new_question, new_evidence
TQA uses temperature=0 and a fixed seed=42 by default to ensure reproducible outputs. The same NL question will always produce the same altered version when using the same schema mapping and model.

The full pipeline: process_nl_queries

The process_nl_queries function orchestrates the end-to-end TQA pipeline for a batch of queries. It reads a CSV of original queries, loads the schema mappings, checks for already-processed entries (so it can resume interrupted runs), and writes results incrementally to an output CSV:
def process_nl_queries(
    queries_csv_path,
    mapping_csv_path,
    output_csv_path,
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
):
    db_mappings = read_schema_mapping(mapping_csv_path)
    df = pd.read_csv(queries_csv_path)
    existing_results = load_existing_results(output_csv_path)

    for idx, row in tqdm(df.iterrows(), total=len(df)):
        db_id = row["db_id"]
        schema_mapping = db_mappings.get(db_id, {})

        new_question, new_evidence = generate_less_natural_nl(
            row["original_question"],
            row["original_evidence"],
            schema_mapping,
            model_name,
            seed,
            temperature,
        )

        write_result_to_csv(output_csv_path, row_data, is_first_write)
The output CSV preserves both the original_question and new_question columns, so you can compare the two versions directly and trace how the LLM rewrote each query.

How the pipeline fits together

1

Prepare schema mappings

Create or obtain the schema mappings CSV (databases_naturalness_decreased_fixed.csv). Each row specifies a db_id, table_name, new_table_name, column_name, and new_column_name. The rename strategy determines the degree of naturalness decrease — partial abbreviations produce moderate difficulty; full obfuscation produces maximum difficulty.
2

Load the source queries

Provide a CSV of queries to augment (new_sql_queries.csv). Each row must have question_id, db_id, original_question, and original_evidence columns. These are typically a subset of the BIRD development set.
3

Run the augmentation

Call process_nl_queries with the paths to the query CSV, the mappings CSV, and the desired output path. The function processes queries one at a time, writing results to disk immediately so progress is not lost if the run is interrupted.
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import process_nl_queries
from src.core.model_manager import OpenAIModel

stats = process_nl_queries(
    queries_csv_path="data/augmentation/decrease_naturalness/new_sql_queries.csv",
    mapping_csv_path="data/augmentation/decrease_naturalness/databases_naturalness_decreased_fixed.csv",
    output_csv_path="data/augmentation/decrease_naturalness/new_sql_nl_queries.csv",
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
)
4

Evaluate a Text-to-SQL system

Feed the augmented NL queries (with the unchanged ground-truth SQL) to a Text-to-SQL system, collect its predicted SQL, and evaluate using the SQLMorph metrics module. Compare EX scores on the original vs augmented queries to measure linguistic sensitivity.

Key finding: abbreviations cause large accuracy drops

Experiments with CHESS, DIN-SQL, and MAC-SQL showed that forcing heavy abbreviations in table and column names — propagated into the NL questions via TQA — caused up to a 17% drop in execution accuracy. This finding indicates that SOTA systems have learned to leverage readable schema names as implicit signals during SQL generation. When those signals are removed or distorted, generation quality degrades significantly.
You can control the degree of linguistic degradation by adjusting the schema mapping CSV. A more aggressive renaming strategy (e.g., replacing full words with single-letter codes) will produce harder perturbations and likely larger accuracy drops.
If process_nl_queries encounters a db_id with no entry in the schema mappings CSV, it keeps the original NL question and evidence unchanged and logs a warning. The row is still written to the output CSV with the original values, so the output is always complete.
Yes. process_nl_queries calls load_existing_results at startup to read any already-processed entries from the output CSV. Rows whose (db_id, question_id) pair is already present are skipped automatically. This makes long batch runs safe to interrupt and restart.
TQA defaults to OpenAIModel.GPT_4O. You can pass a different model_name to process_nl_queries or generate_less_natural_nl to use another supported model. Ensure your OPENAI_API_KEY is set in the environment before running.

Build docs developers (and LLMs) love