Get started with SQLMorph in minutes

This guide walks you through setting up SQLMorph from scratch and running your first evaluation. You will need Python 3.12 or later, Git, and an OpenAI API key for semantic metrics. The entire setup takes under five minutes once you have those prerequisites in place.

Clone the repository

Clone the SQLMorph repository from GitHub and move into the project directory.

git clone https://github.com/dais-polymtl/sqlmorph.git && cd sqlmorph

Install uv

SQLMorph uses uv to manage dependencies and Python versions. Follow the official installation instructions for your operating system, then verify it is available.

uv --version

uv automatically selects and fetches the required Python version, so you do not need to manage Python installations separately.

Install dependencies

Install all project dependencies, including optional extras for notebooks and model backends.

uv sync --all-extras

This command creates a virtual environment at ./venv and installs everything declared in pyproject.toml, including duckdb, openai, sqlglot, transformers, and the full suite of model provider clients.

Install pre-commit hooks (optional)

If you plan to contribute or modify source code, install the pre-commit hooks. These run ruff for linting and black for formatting on every commit.

uv run pre-commit install

Configure your environment

Copy the example environment file and add your OpenAI API key. The key is required for any evaluation technique that uses semantic column matching or embedding-based row comparison.

cp .env.example .env

Open .env and replace the placeholder value:

# .env
OPENAI_API_KEY=<your_openai_api_key>

Only the semantic evaluation techniques (SEMANTIC_COLUMN_AND_EXACT_CELL, SEMANTIC_COLUMN_AND_PARTIAL_CELL, UNIFIED_COLUMN_AND_SEMANTIC_ROW) require an OpenAI API key. You can run EXECUTION_ACCURACY and EXACT_COLUMN_AND_PARTIAL_CELL without one.

Source the metrics configuration

Before running evaluations from the CLI, source the metrics configuration script. This sets the environment variables that evaluation.py reads at runtime.

source scripts/metrics_config.sh

The script sets EVAL_TECHNIQUE, DBMS, DB_PATH, EMBEDDING_MODEL, LOGS_DIR_PATH, and PENALIZE_EXTRA_PRED_COLS. Edit the script directly to change these defaults.

Run your first evaluation

You can evaluate a predicted SQL query against a ground-truth query using either the Python API or the CLI.Python APIThe example below reproduces the california_schools evaluation included at the bottom of src/metrics/evaluation.py. It uses SEMANTIC_COLUMN_AND_PARTIAL_CELL, which matches columns by embedding similarity and compares result cells with partial matching — the most informative technique for production use.

from src.metrics.evaluation import Evaluation, EvaluationTechnique
from src.core.database.database_handler import DBMS
from src.core.model_manager import OpenAIModel

predicted_sql = """
SELECT T3.Phone, T3.City, T3.State, T3.MailStreet
FROM satscores T1
JOIN schools T3 ON T1.cds = T3.CDSCode
WHERE T1.NumTstTakr IS NOT NULL AND T1.NumGE1500 IS NOT NULL
ORDER BY (T1.NumGE1500 * 1.0 / T1.NumTstTakr) DESC
LIMIT 10;
"""

ground_truth_sql = """
SELECT T1.Phone
FROM schools AS T1
INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds
ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC
LIMIT 10;
"""

db_name = "california_schools"

config = {
    "evaluation_technique": EvaluationTechnique.SEMANTIC_COLUMN_AND_PARTIAL_CELL,
    "db_params": {
        "dbms": DBMS.SQLITE,
        "db_path": f"data/benchmarks/Bird/dev_databases/{db_name}/{db_name}.sqlite",
    },
    "penalize_extra_pred_cols": True,
    "embedding_model": OpenAIModel.TEXT_EMBEDDING_3_SMALL,
    "logs_dir_path": "data/evaluation_outputs/",
}

evaluator = Evaluation(config)
res = evaluator.run_evaluation(predicted_sql, ground_truth_sql, log=True)

print("Semantic Evaluation Results:")
print(f"Metrics: {res['metrics']}, Latency: {res['latency']}")

The predicted_sql above returns four columns while ground_truth_sql returns only Phone. With penalize_extra_pred_cols=True, this will score low on Execution Precision — exactly the kind of over-prediction that binary EX would pass as correct.CLIAfter sourcing scripts/metrics_config.sh, pass queries directly on the command line.

python src/metrics/evaluation.py \
  --predicted-sql "SELECT phone FROM schools LIMIT 10;" \
  --ground-truth-sql "SELECT T1.Phone FROM schools AS T1 INNER JOIN satscores AS T2 ON T1.CDSCode = T2.cds ORDER BY CAST(T2.NumGE1500 AS REAL) / T2.NumTstTakr DESC LIMIT 10;"

Results are printed to stdout and, when ENABLE_LOG=true, written as a timestamped JSON file under LOGS_DIR_PATH.

Next steps

Core concepts overview

Understand how JQE, TQA, and the fine-grained metrics fit together.

JQE usage guide

Generate structurally diverse evaluation sets by expanding queries with valid joins.

TQA usage guide

Create linguistically varied question perturbations to test robustness.

Metrics usage guide

Run all evaluation techniques and interpret EXP, EXR, and F1 results.

Get Started

Core Concepts

Guides

Configuration

Get started with SQLMorph in minutes

Next steps

Core concepts overview

JQE usage guide

TQA usage guide

Metrics usage guide

Build docs developers (and LLMs) love

Get Started

Core Concepts

Guides

Configuration

Documentation Index

​Next steps

Core concepts overview

JQE usage guide

TQA usage guide

Metrics usage guide

Build docs developers (and LLMs) love

Next steps