The Textual Query Augmentation (TQA) module decreases the naturalness of BIRD benchmark questions and evidence strings by replacing human-readable table and column names with obfuscated schema identifiers. It calls an LLM (default: GPT-4o) with a few-shot prompt that instructs the model to rewrite the question and evidence to match a provided schema mapping, then streams results row-by-row into an output CSV so progress is never lost between runs. The module supports resuming interrupted jobs: any (db_id, question_id) pair already present in the output CSV is skipped automatically.
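
The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration rather than the module's actual code: the helper names `load_completed_keys` and `pending_rows` are hypothetical, and the example reads an in-memory CSV with the standard `csv` module instead of a real output file.

```python
import csv
import io


def load_completed_keys(csv_file):
    """Collect the (db_id, question_id) pairs already present in an output CSV."""
    reader = csv.DictReader(csv_file)
    # csv yields strings, so question_id is kept as a string here.
    return {(row["db_id"], row["question_id"]) for row in reader}


def pending_rows(input_rows, completed):
    """Yield only the input rows whose key has not been written yet."""
    for row in input_rows:
        if (row["db_id"], str(row["question_id"])) not in completed:
            yield row


# In-memory stand-in for a partially written output CSV (illustrative data).
existing_output = io.StringIO(
    "question_id,db_id,new_question\n"
    "1,financial,placeholder\n"
)
completed = load_completed_keys(existing_output)

input_rows = [
    {"question_id": 1, "db_id": "financial", "question": "..."},
    {"question_id": 2, "db_id": "financial", "question": "..."},
]
todo = list(pending_rows(input_rows, completed))
print([row["question_id"] for row in todo])
```

Only the row with `question_id` 2 remains to be processed, so a re-run would issue a single LLM call instead of two.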

process_nl_queries()

The main entry point. Reads the input queries and schema mapping, calls generate_less_natural_nl() for each unprocessed row, and writes results incrementally to output_csv_path.
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import (
    process_nl_queries,
    print_nl_processing_summary,
)
from src.core.model_manager.openai_model import OpenAIModel

stats = process_nl_queries(
    queries_csv_path="data/augmentation/decrease_naturalness/new_sql_queries.csv",
    mapping_csv_path="data/augmentation/decrease_naturalness/databases_naturalness_decreased_fixed.csv",
    output_csv_path="data/augmentation/decrease_naturalness/experiments_dev/new_sql_nl_queries.csv",
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
)

print_nl_processing_summary(stats)
Parameters
queries_csv_path
string
required
Path to the input CSV file containing the queries to augment. Must have at minimum the columns question_id, db_id, question, and evidence. The function renames question → original_question and evidence → original_evidence internally before processing.
mapping_csv_path
string
required
Path to the schema mapping CSV that describes how table and column names should be rewritten. Must contain the columns db_id, table_name, column_name, new_table_name, and new_column_name.
output_csv_path
string
required
Path where the augmented results are written. The function creates parent directories if they do not exist. If the file already exists, completed rows are loaded and skipped, enabling resumable runs.
model_name
OpenAIModel
default:"OpenAIModel.GPT_4O"
The OpenAI model to use for generation. Pass any member of the OpenAIModel enum (e.g., OpenAIModel.GPT_4O_MINI) or use the default GPT-4o for highest quality output.
seed
integer
default:"42"
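Random seed forwarded to the OpenAI completion call. Use a fixed seed to make outputs as reproducible as possible across retries; OpenAI treats seeded determinism as best-effort, so identical outputs are not guaranteed.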
temperature
float
default:"0"
Sampling temperature for the LLM. 0 produces the most deterministic output; higher values introduce more variation.
Return value
stats
dict
Processing statistics for the entire run.

generate_less_natural_nl()

Generates a less natural question and evidence for a single query item. Called internally by process_nl_queries() but also useful for one-off transformations.
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import (
    generate_less_natural_nl,
    read_schema_mapping,
)
from src.core.model_manager.openai_model import OpenAIModel

db_mappings = read_schema_mapping("data/.../mapping.csv")
schema_mapping = db_mappings["financial"]

new_question, new_evidence = generate_less_natural_nl(
    question="How many accounts were opened in the Prague district?",
    evidence="Prague district refers to district_name = 'Prague'",
    schema_mapping=schema_mapping,
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
)
question
string
required
The original natural-language question to be rewritten.
evidence
string
required
The original evidence string accompanying the question.
schema_mapping
dict
required
The schema mapping for the relevant database, as returned by read_schema_mapping() for a single db_id. Structure: {table_name: {"new_name": str, "columns": {old_col: new_col}}}.
model_name
OpenAIModel
default:"OpenAIModel.GPT_4O"
The OpenAI model to use for generation.
seed
integer
default:"42"
Random seed passed to the completion call.
temperature
float
default:"0"
Sampling temperature for the LLM.
Return value
Returns a tuple[str, str]: (new_question, new_evidence). If the LLM response cannot be parsed as JSON, or if an exception occurs, the function returns the original (question, evidence) unchanged rather than raising.
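
The fallback behaviour can be illustrated with a small sketch. The function name `parse_rewrite` and the JSON keys `"question"` and `"evidence"` are assumptions made for this example; the source does not state the response format beyond it being JSON.

```python
import json


def parse_rewrite(raw_response, question, evidence):
    """Return (new_question, new_evidence), or the originals if parsing fails.

    Assumes (hypothetically) that the model replies with a JSON object
    holding "question" and "evidence" keys.
    """
    try:
        payload = json.loads(raw_response)
        return payload["question"], payload["evidence"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON, missing keys, or a non-object payload:
        # fall back to the untouched inputs instead of raising.
        return question, evidence


print(parse_rewrite('{"question": "a", "evidence": "b"}', "q", "e"))
print(parse_rewrite("not json", "q", "e"))
```

Because failures return the inputs unchanged, a caller that needs to tell a fallback apart from a successful rewrite has to compare the result with the original strings.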

read_schema_mapping()

Reads the schema mapping CSV and organises it as a nested dict keyed by db_id.
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import (
    read_schema_mapping,
)

db_mappings = read_schema_mapping("data/.../databases_naturalness_decreased_fixed.csv")
# db_mappings["financial"]["account"]["new_name"]  → "acct"
# db_mappings["financial"]["account"]["columns"]["account_id"]  → "acct_id"
mapping_csv_path
string
required
Path to the schema mapping CSV. Required columns: db_id, table_name, column_name, new_table_name, new_column_name.
Returns a dict with the shape:
{
  "<db_id>": {
    "<table_name>": {
      "new_name": "<new_table_name>",
      "columns": {
        "<column_name>": "<new_column_name>",
        ...
      }
    },
    ...
  },
  ...
}
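
One way to build a dict of this shape from the mapping CSV is sketched below. It is not the module's implementation: `build_schema_mapping` is a hypothetical name, the example uses the standard `csv` module on an in-memory file, and the `district_id` row is made-up sample data.

```python
import csv
import io


def build_schema_mapping(csv_file):
    """Group mapping rows into {db_id: {table: {"new_name", "columns"}}}."""
    mappings = {}
    for row in csv.DictReader(csv_file):
        tables = mappings.setdefault(row["db_id"], {})
        # The first row seen for a table fixes its new name.
        table = tables.setdefault(
            row["table_name"],
            {"new_name": row["new_table_name"], "columns": {}},
        )
        table["columns"][row["column_name"]] = row["new_column_name"]
    return mappings


# Illustrative sample using the five required columns.
sample = io.StringIO(
    "db_id,table_name,column_name,new_table_name,new_column_name\n"
    "financial,account,account_id,acct,acct_id\n"
    "financial,account,district_id,acct,dist_id\n"
)
db_mappings = build_schema_mapping(sample)
print(db_mappings["financial"]["account"]["new_name"])  # acct
print(db_mappings["financial"]["account"]["columns"]["account_id"])  # acct_id
```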

print_nl_processing_summary()

Prints a formatted summary of the statistics dict returned by process_nl_queries().
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import (
    print_nl_processing_summary,
)

print_nl_processing_summary(stats)
stats
dict
required
The statistics dict returned by process_nl_queries().
The function prints overall counts (total, skipped, processed, successful, failed) and per-database breakdowns to stdout. It adds a qualitative assessment line (EXCELLENT / GOOD / MODERATE / POOR) based on the success rate.
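
The qualitative assessment could be derived along these lines. Everything specific here is an assumption for illustration: the stats keys (`processed`, `successful`), the choice to compute the rate over processed rows, and the threshold values are not documented in the source.

```python
def success_rate(stats):
    """Fraction of processed rows that succeeded (0.0 when nothing was processed)."""
    processed = stats["processed"]
    return stats["successful"] / processed if processed else 0.0


# Hypothetical cutoffs; the real thresholds are not documented.
THRESHOLDS = ((0.95, "EXCELLENT"), (0.85, "GOOD"), (0.70, "MODERATE"))


def assess(rate, thresholds=THRESHOLDS):
    """Map a success rate to the first label whose cutoff it meets."""
    for cutoff, label in thresholds:
        if rate >= cutoff:
            return label
    return "POOR"


# Made-up numbers for demonstration only.
example_stats = {"total": 100, "skipped": 20, "processed": 80, "successful": 72, "failed": 8}
rate = success_rate(example_stats)
print(f"{rate:.1%} -> {assess(rate)}")
```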

CSV schemas

Input queries CSV (queries_csv_path)

| Column | Type | Description |
| --- | --- | --- |
| question_id | integer | Unique identifier matching the BIRD benchmark |
| db_id | string | Database identifier |
| question | string | Natural-language question (renamed to original_question by the function) |
| evidence | string | Evidence string (renamed to original_evidence by the function) |
| SQL | string | Gold SQL query (passed through unchanged) |
| difficulty | string | Difficulty label (passed through unchanged) |

Additional columns are preserved as-is in the output.
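
A rough sketch of the column check and renaming for a single input row follows. The helper `prepare_row` is hypothetical and works on plain dicts; the module itself may well do this differently (for example on a whole table at once).

```python
REQUIRED = ("question_id", "db_id", "question", "evidence")
RENAMES = {"question": "original_question", "evidence": "original_evidence"}


def prepare_row(row):
    """Validate required columns, then apply the documented renames."""
    missing = [column for column in REQUIRED if column not in row]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # Columns outside RENAMES (SQL, difficulty, extras) pass through unchanged.
    return {RENAMES.get(key, key): value for key, value in row.items()}


# Illustrative input row.
row = {
    "question_id": "7",
    "db_id": "financial",
    "question": "Q?",
    "evidence": "E",
    "SQL": "SELECT 1",
    "difficulty": "simple",
}
prepared = prepare_row(row)
print(sorted(prepared))
```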

Schema mapping CSV (mapping_csv_path)

| Column | Type | Description |
| --- | --- | --- |
| db_id | string | Database identifier |
| table_name | string | Original table name |
| column_name | string | Original column name |
| new_table_name | string | Obfuscated table name |
| new_column_name | string | Obfuscated column name |

Output CSV (output_csv_path)

| Column | Type | Description |
| --- | --- | --- |
| question_id | integer | Identifier from the input row |
| db_id | string | Database identifier |
| original_question | string | Unmodified question from the input |
| original_evidence | string | Unmodified evidence from the input |
| new_question | string | LLM-generated less-natural question |
| new_evidence | string | LLM-generated less-natural evidence |
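
Writing results one row at a time is what makes interrupted runs recoverable. The sketch below shows one possible way to append rows with the columns listed above; `append_result` is a hypothetical helper, the row values are placeholders, and the real module may write its output differently.

```python
import csv
import os
import tempfile

FIELDS = [
    "question_id", "db_id",
    "original_question", "original_evidence",
    "new_question", "new_evidence",
]


def append_result(path, row):
    """Append one result row, creating parent directories and the header if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)


# Demonstration in a temporary directory with placeholder values.
out_path = os.path.join(tempfile.mkdtemp(), "nested", "out.csv")
for question_id in (1, 2):
    append_result(out_path, {
        "question_id": question_id,
        "db_id": "financial",
        "original_question": "original text",
        "original_evidence": "original evidence",
        "new_question": "rewritten text",
        "new_evidence": "rewritten evidence",
    })

with open(out_path, newline="", encoding="utf-8") as handle:
    print(len(list(csv.DictReader(handle))))
```

Each call opens, writes, and closes the file, so rows written before a crash remain on disk for the next run to detect.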

Complete usage example

import os
from src.textual_query_augmentation.decrease_naturalness.alter_nl_queries import (
    process_nl_queries,
    print_nl_processing_summary,
)
from src.core.model_manager.openai_model import OpenAIModel

# Set the API key before the first call; avoid committing a real key to version control.
os.environ["OPENAI_API_KEY"] = "sk-..."

stats = process_nl_queries(
    queries_csv_path="data/augmentation/decrease_naturalness/new_sql_queries.csv",
    mapping_csv_path="data/augmentation/decrease_naturalness/databases_naturalness_decreased_fixed.csv",
    output_csv_path="data/augmentation/decrease_naturalness/experiments_dev/new_sql_nl_queries.csv",
    model_name=OpenAIModel.GPT_4O,
    seed=42,
    temperature=0,
)

print_nl_processing_summary(stats)
Set temperature=0 and a fixed seed for reproducible augmentation. When you re-run the pipeline after a partial failure, already-written rows are automatically detected and skipped, so you only pay for the remaining LLM calls.
