The Evaluation class is the main entry point for computing metrics between a predicted SQL query and a ground-truth SQL query. You instantiate it once with a configuration dictionary and then call run_evaluation() for each query pair you want to score. The class routes to one of seven backend pipelines based on the evaluation_technique key, handles database execution, and optionally writes structured JSON logs to disk.
Constructor
Configuration dictionary controlling which pipeline runs and how results are logged.
run_evaluation()
Executes the configured evaluation pipeline against a single query pair and returns a context dictionary containing metrics and intermediate pipeline state.
The SQL query generated by the Text-to-SQL system under evaluation.
The reference SQL query from the benchmark dataset.
When True, writes a JSON log file to the configured logs_dir_path. Set to False for batch runs where you don’t need per-query artefacts.
Return value
run_evaluation() returns a dictionary (the pipeline context) that always contains metrics and latency. The other keys depend on which technique ran.
The computed evaluation scores for this query pair.
Wall-clock time in seconds from when the pipeline started to when assign_metrics completed. Includes query execution and, for semantic techniques, embedding API calls.
True if either SQL query failed to execute. When True, all metric values are 0.
Human-readable error message. Only present when has_error is True.
Echo of the predicted_sql argument, stored in the context for log traceability.
Echo of the ground_truth_sql argument.
ISO-style timestamp string added when log=True. Format: YYYY-MM-DD_HH-MM-SS.
The context dictionary also carries intermediate pipeline state such as pred_cols, gt_cols, pred_rows, gt_rows, matched_cells, and (for semantic techniques) matched_cols and similarity_matrix. These are useful for debugging but are not part of the stable public API.
Usage examples
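Python: reading the returned context (sketch)
As a minimal sketch of how a caller might consume the returned context, the helper below reads only keys named on this page (metrics, latency, has_error). The helper name summarize_context is hypothetical, and treating metrics as a mapping of metric name to number is an assumption, since this page does not spell out its shape. It reads has_error with .get() because only metrics and latency are documented as always present.

```python
from typing import Any, Dict


def summarize_context(context: Dict[str, Any]) -> str:
    """Build a one-line summary from a run_evaluation() context (hypothetical helper)."""
    # 'metrics' and 'latency' are documented as always present.
    metrics = context["metrics"]
    latency = context["latency"]

    # Other keys are not guaranteed, so read 'has_error' defensively.
    if context.get("has_error", False):
        return f"FAILED after {latency:.3f}s (all metric values are 0)"

    # Assumption: 'metrics' maps metric names to numeric scores.
    scores = ", ".join(f"{name}={value:.2f}" for name, value in sorted(metrics.items()))
    return f"OK in {latency:.3f}s: {scores}"
```

Because the intermediate pipeline keys are not part of the stable public API, the sketch deliberately avoids them.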
Python: basic usage
The example below is taken directly from the __main__ block in src/metrics/evaluation.py.
Python: minimal binary check
Use EXECUTION_ACCURACY when you only need a pass/fail score and don’t require column-level detail. This technique does not call an embedding API and runs with the fewest dependencies.
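To make the pass/fail idea concrete, here is a minimal, self-contained sketch of an execution-based comparison using Python's sqlite3 module. It is not the library's implementation: the function name execution_match is hypothetical, and comparing results as order-insensitive multisets of rows is an assumption about what a binary check could do. Scoring 0 when a query fails follows the has_error behaviour described above.

```python
import sqlite3
from collections import Counter


def execution_match(conn: sqlite3.Connection, predicted_sql: str, ground_truth_sql: str) -> int:
    """Return 1 if both queries run and yield the same rows, else 0 (illustrative only)."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gt_rows = conn.execute(ground_truth_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute scores 0.
        return 0
    # Assumption: row order is ignored, duplicate rows are significant.
    return int(Counter(pred_rows) == Counter(gt_rows))


# Tiny in-memory database so the sketch runs without any external files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

same = execution_match(conn, "SELECT name FROM users WHERE id = 1",
                       "SELECT name FROM users WHERE name = 'Ada'")
different = execution_match(conn, "SELECT name FROM users", "SELECT id FROM users")
broken = execution_match(conn, "SELECT nope FROM users", "SELECT id FROM users")
```

Here same is 1 because both queries return the single row for Ada, while different and broken are 0: the first pair returns different rows and the second contains a query that references a missing column.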
CLI usage via load_config_from_env()
For scripted evaluation runs, you can configure the evaluator entirely through environment variables — typically sourced from scripts/metrics_config.sh — and then invoke evaluation.py as a script.
load_config_from_env() reads the following environment variables and raises ValueError if a required variable is missing or invalid.
| Variable | Required | Description |
|---|---|---|
| EVAL_TECHNIQUE | Yes | String value of an EvaluationTechnique enum member |
| DBMS | Yes | DBMS name, e.g. SQLITE |
| DB_PATH | Yes | Path to the database file |
| EMBEDDING_MODEL | Yes | OpenAIModel attribute name, e.g. TEXT_EMBEDDING_3_SMALL |
| LOGS_DIR_PATH | Yes | Root directory for log files |
| PENALIZE_EXTRA_PRED_COLS | No | "true" or "false". Defaults to "true" |
| ENABLE_LOG | No | "true" or "false". Defaults to "false" |
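
The table above can be read as a small validation routine. The sketch below is not the library's load_config_from_env(); it is a hypothetical stand-in, read_eval_env, that mirrors the documented behaviour: required variables raise ValueError when missing, and the two optional flags fall back to their documented defaults. The returned key names and the case-insensitive boolean parsing are assumptions, and the sketch does not check that EVAL_TECHNIQUE or EMBEDDING_MODEL name real enum members.

```python
import os
from typing import Dict, Mapping, Optional, Union

REQUIRED_VARS = ("EVAL_TECHNIQUE", "DBMS", "DB_PATH", "EMBEDDING_MODEL", "LOGS_DIR_PATH")


def _parse_bool(name: str, raw: str) -> bool:
    """Accept only "true" or "false" (case-insensitive here, which is an assumption)."""
    value = raw.strip().lower()
    if value not in ("true", "false"):
        raise ValueError(f"{name} must be 'true' or 'false', got {raw!r}")
    return value == "true"


def read_eval_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, Union[str, bool]]:
    """Hypothetical stand-in mirroring the table; not the real load_config_from_env()."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing required environment variables: {', '.join(missing)}")

    # Assumption: result keys simply reuse the variable names.
    config: Dict[str, Union[str, bool]] = {name: env[name] for name in REQUIRED_VARS}
    config["PENALIZE_EXTRA_PRED_COLS"] = _parse_bool(
        "PENALIZE_EXTRA_PRED_COLS", env.get("PENALIZE_EXTRA_PRED_COLS", "true"))
    config["ENABLE_LOG"] = _parse_bool("ENABLE_LOG", env.get("ENABLE_LOG", "false"))
    return config
```

Passing a plain dictionary instead of os.environ makes the routine easy to exercise in tests without touching the real process environment.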