Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/dais-polymtl/sqlmorph/llms.txt

Use this file to discover all available pages before exploring further.

The SQLMorph evaluation framework goes beyond binary Exact Match by providing a family of relaxed metrics that measure how much of the correct result a predicted SQL query actually recovers. You can evaluate a single query pair from the command line or embed the evaluator in a script to process a full benchmark. Four metrics are available depending on the evaluation technique you choose: Execution Accuracy (EX), Execution Precision (EXP), Execution Recall (EXR), and F1.

CLI usage

The CLI reads its configuration from environment variables set by scripts/metrics_config.sh. Edit that file to select the evaluation technique, database path, and embedding model, then source it before running.

Step 1: Configure settings

Open scripts/metrics_config.sh and edit the exported variables:
# scripts/metrics_config.sh

export EVAL_TECHNIQUE="semantic_column_and_partial_cell"
export DBMS="SQLITE"
export DB_PATH="data/benchmarks/Bird/dev_databases/california_schools/california_schools.sqlite"
export PENALIZE_EXTRA_PRED_COLS="true"
export EMBEDDING_MODEL="TEXT_EMBEDDING_3_SMALL"
export LOGS_DIR_PATH="data/evaluation_logs/"
export ENABLE_LOG="false"
The default scripts/metrics_config.sh exports PENALIZE_EXTRA_COLUMNS, but load_config_from_env() in evaluation.py reads PENALIZE_EXTRA_PRED_COLS. Rename the variable in the script to PENALIZE_EXTRA_PRED_COLS for CLI mode to take effect.
Then source the file:
source scripts/metrics_config.sh

Step 2: Run evaluation

python src/metrics/evaluation.py \
  --predicted-sql "SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;" \
  --ground-truth-sql "SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;"
The script prints the metrics dict and latency to stdout, and optionally writes a timestamped JSON log under LOGS_DIR_PATH when ENABLE_LOG=true.

Configuration options

evaluation_technique
EvaluationTechnique
required
The evaluation strategy to apply. See the table below for available options and which metrics each produces.
db_params
object
required
Database connection parameters. Must contain:
  • dbms — a DBMS enum value (DBMS.SQLITE or DBMS.DUCKDB)
  • db_path — path to the database file
penalize_extra_pred_cols
boolean
default:"true"
When true, columns present in the predicted result but absent from the ground truth reduce the precision score. Has no effect for EXECUTION_ACCURACY or NO_COLUMN_AND_PARTIAL_CELL.
embedding_model
OpenAIModel
default:"OpenAIModel.TEXT_EMBEDDING_3_SMALL"
The OpenAI embedding model used to compute column-name similarity in semantic techniques. Required when using any SEMANTIC_* or UNIFIED_* evaluation technique.
logs_dir_path
string
required
Directory where evaluation logs are written when log=True is passed to run_evaluation(). A subdirectory named after the technique is created automatically.

Evaluation techniques

TechniqueEnum valueEXEXPEXRF1
Execution AccuracyEXECUTION_ACCURACY
Exact Column & Exact CellEXACT_COLUMN_AND_EXACT_CELL
Exact Column & Partial CellEXACT_COLUMN_AND_PARTIAL_CELL
Semantic Column & Exact CellSEMANTIC_COLUMN_AND_EXACT_CELL
Semantic Column & Partial CellSEMANTIC_COLUMN_AND_PARTIAL_CELL
No Column & Partial CellNO_COLUMN_AND_PARTIAL_CELL
Unified Column & Semantic RowUNIFIED_COLUMN_AND_SEMANTIC_ROW

Interpreting results

run_evaluation() returns a dict with at least a metrics key and a latency key:
{
    "metrics": {
        "EX": 0,        # 1 if result sets match exactly, else 0
        "EXP": 0.75,    # fraction of predicted cells that are correct
        "EXR": 1.0,     # fraction of ground-truth cells recovered
        "F1": 0.857,    # harmonic mean of EXP and EXR
    },
    "latency": 0.42,    # seconds taken to run the evaluation
    ...
}
  • EX = 1 means the predicted result set is identical to the ground truth. All other metrics are redundant in this case.
  • High EXR, low EXP indicates the predicted query returns the right data but includes extra columns or rows.
  • High EXP, low EXR indicates the predicted query is precise but misses part of the required output.
  • F1 balances the two; use it when neither over-prediction nor under-prediction is preferable.
Techniques with SEMANTIC_COLUMN or UNIFIED_COLUMN in their name call the OpenAI embeddings API to measure column-name similarity. These require OPENAI_API_KEY to be set and will incur API costs.

Log file structure

When log=True, a JSON file is written to <logs_dir_path>/<technique>/evaluation-<timestamp>.json. The file contains the full evaluation context including the metrics dict, SQL inputs, executed result sets, and the evaluation timestamp.

Build docs developers (and LLMs) love