The SQLMorph evaluation framework goes beyond binary Exact Match by providing a family of relaxed metrics that measure how much of the correct result a predicted SQL query actually recovers. You can evaluate a single query pair from the command line or embed the evaluator in a script to process a full benchmark. Four metrics are available depending on the evaluation technique you choose: Execution Accuracy (EX), Execution Precision (EXP), Execution Recall (EXR), and F1.
CLI usage
The CLI reads its configuration from environment variables set by scripts/metrics_config.sh. Edit that file to select the evaluation technique, database path, and embedding model, then source it before running.
Step 1: Configure settings
Open scripts/metrics_config.sh and edit the exported variables.
Note: the default scripts/metrics_config.sh exports PENALIZE_EXTRA_COLUMNS, but load_config_from_env() in evaluation.py reads PENALIZE_EXTRA_PRED_COLS. Rename the variable in the script to PENALIZE_EXTRA_PRED_COLS for the setting to take effect in CLI mode.
Step 2: Run evaluation
Evaluation logs are written to LOGS_DIR_PATH when ENABLE_LOG=true.
Configuration options
The evaluation strategy to apply. See the table below for available options and which metrics each produces.
Database connection parameters. Must contain:
- dbms: a DBMS enum value (DBMS.SQLITE or DBMS.DUCKDB)
- db_path: path to the database file
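The two required keys above can be assembled and checked before an evaluation run. The following is a minimal sketch only: the DBMS class here is a local stand-in for the framework's enum (its real import path and member values are not shown in this page), and make_db_params and the example path are hypothetical names used for illustration.

```python
from enum import Enum


class DBMS(Enum):
    """Local stand-in for the framework's DBMS enum (assumed values)."""
    SQLITE = "sqlite"
    DUCKDB = "duckdb"


def make_db_params(dbms, db_path):
    """Build the connection dict with the two required keys, validating both.

    Hypothetical helper: it only mirrors the documented shape of the dict.
    """
    if not isinstance(dbms, DBMS):
        raise TypeError("dbms must be a DBMS enum member")
    if not db_path:
        raise ValueError("db_path must be a non-empty path")
    return {"dbms": dbms, "db_path": str(db_path)}


# Example path is illustrative only.
db_params = make_db_params(DBMS.SQLITE, "data/example.sqlite")
```

Validating the dict up front makes a missing or mistyped key fail early instead of surfacing as a connection error during evaluation.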
When true, columns present in the predicted result but absent from the ground truth reduce the precision score. Has no effect for EXECUTION_ACCURACY or NO_COLUMN_AND_PARTIAL_CELL.
The OpenAI embedding model used to compute column-name similarity in semantic techniques. Required when using any SEMANTIC_* or UNIFIED_* evaluation technique.
Directory where evaluation logs are written when log=True is passed to run_evaluation(). A subdirectory named after the technique is created automatically.
Evaluation techniques
| Technique | Enum value | EX | EXP | EXR | F1 |
|---|---|---|---|---|---|
| Execution Accuracy | EXECUTION_ACCURACY | ✓ | | | |
| Exact Column & Exact Cell | EXACT_COLUMN_AND_EXACT_CELL | ✓ | ✓ | ✓ | ✓ |
| Exact Column & Partial Cell | EXACT_COLUMN_AND_PARTIAL_CELL | ✓ | ✓ | ✓ | ✓ |
| Semantic Column & Exact Cell | SEMANTIC_COLUMN_AND_EXACT_CELL | ✓ | ✓ | ✓ | ✓ |
| Semantic Column & Partial Cell | SEMANTIC_COLUMN_AND_PARTIAL_CELL | ✓ | ✓ | ✓ | ✓ |
| No Column & Partial Cell | NO_COLUMN_AND_PARTIAL_CELL | ✓ | ✓ | ✓ | ✓ |
| Unified Column & Semantic Row | UNIFIED_COLUMN_AND_SEMANTIC_ROW | ✓ | ✓ | ✓ | ✓ |
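The table above, together with the rule that SEMANTIC_* and UNIFIED_* techniques need an embedding model, can be encoded as a small lookup. This is a sketch under assumptions: EvaluationTechnique is a stand-in enum whose member names are copied from the table (the framework's actual class name is not given here), and metrics_for and requires_embedding_model are hypothetical helpers.

```python
from enum import Enum


class EvaluationTechnique(Enum):
    """Stand-in enum; member names are copied from the table above."""
    EXECUTION_ACCURACY = "EXECUTION_ACCURACY"
    EXACT_COLUMN_AND_EXACT_CELL = "EXACT_COLUMN_AND_EXACT_CELL"
    EXACT_COLUMN_AND_PARTIAL_CELL = "EXACT_COLUMN_AND_PARTIAL_CELL"
    SEMANTIC_COLUMN_AND_EXACT_CELL = "SEMANTIC_COLUMN_AND_EXACT_CELL"
    SEMANTIC_COLUMN_AND_PARTIAL_CELL = "SEMANTIC_COLUMN_AND_PARTIAL_CELL"
    NO_COLUMN_AND_PARTIAL_CELL = "NO_COLUMN_AND_PARTIAL_CELL"
    UNIFIED_COLUMN_AND_SEMANTIC_ROW = "UNIFIED_COLUMN_AND_SEMANTIC_ROW"


def metrics_for(technique):
    """Return the metric names the table lists for a technique."""
    if technique is EvaluationTechnique.EXECUTION_ACCURACY:
        return ("EX",)
    return ("EX", "EXP", "EXR", "F1")


def requires_embedding_model(technique):
    """True for techniques whose name starts with SEMANTIC_ or UNIFIED_."""
    return technique.name.startswith(("SEMANTIC_", "UNIFIED_"))
```

A check like requires_embedding_model can be run before an evaluation to confirm that an embedding model has been configured for the chosen technique.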
Interpreting results
run_evaluation() returns a dict with at least a metrics key and a latency key. Read the metrics as follows:
- EX = 1 means the predicted result set is identical to the ground truth. All other metrics are redundant in this case.
- High EXR, low EXP indicates the predicted query returns the right data but includes extra columns or rows.
- High EXP, low EXR indicates the predicted query is precise but misses part of the required output.
- F1 balances the two; use it when neither over-prediction nor under-prediction is preferable.
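The relationship between precision, recall, and F1 described above can be illustrated with a simplified set-based calculation. This is not SQLMorph's implementation: it treats each result as a set of row tuples and ignores column matching entirely. The helper names and the 0.8 threshold in diagnose are assumptions chosen for illustration.

```python
def precision_recall_f1(predicted, expected):
    """Set-based precision, recall, and F1 over result rows (simplified)."""
    overlap = len(predicted & expected)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def diagnose(exp, exr, threshold=0.8):
    """Map a precision/recall pair to the reading given in the list above.

    The threshold separating "high" from "low" is an arbitrary assumption.
    """
    if exr >= threshold and exp < threshold:
        return "over-prediction: extra columns or rows returned"
    if exp >= threshold and exr < threshold:
        return "under-prediction: part of the required output is missing"
    return "balanced: compare F1 across queries"


# Illustrative data: the prediction returns one row too many.
expected_rows = {("Alice", 30), ("Bob", 25)}
predicted_rows = {("Alice", 30), ("Bob", 25), ("Carol", 41)}
exp, exr, f1 = precision_recall_f1(predicted_rows, expected_rows)
verdict = diagnose(exp, exr)
```

In this example every expected row is recovered, so recall is 1.0, while the extra row lowers precision to 2/3 and the verdict is over-prediction.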
Log file structure
When log=True, a JSON file is written to <logs_dir_path>/<technique>/evaluation-<timestamp>.json. The file contains the full evaluation context including the metrics dict, SQL inputs, executed result sets, and the evaluation timestamp.
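Given the path pattern above, the most recent log for a technique can be located and loaded with the standard library. This is a minimal sketch: latest_log is a hypothetical helper, the exact spelling of the technique subdirectory is assumed to be whatever the framework created, and no key names inside the JSON file are assumed.

```python
import json
from pathlib import Path


def latest_log(logs_dir_path, technique):
    """Load the most recently modified evaluation log for a technique.

    Returns the parsed JSON as a dict, or None if no log file exists.
    """
    log_dir = Path(logs_dir_path) / technique
    if not log_dir.is_dir():
        return None
    # Sort by modification time so the timestamp format need not be parsed.
    candidates = sorted(
        log_dir.glob("evaluation-*.json"),
        key=lambda path: path.stat().st_mtime,
    )
    if not candidates:
        return None
    with candidates[-1].open(encoding="utf-8") as handle:
        return json.load(handle)
```

Sorting by modification time avoids relying on the timestamp format in the file name, which this page does not specify.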