DSPy-Opt integrates DeepEval as both the optimization objective and the final evaluation harness. Every candidate program explored by an optimizer is scored using the same five DeepEval metrics that are used to measure the finished pipeline, so the optimization target and the reporting metric are identical. The metrics.py module provides two factory functions that wrap DeepEval’s evaluation loop into DSPy-compatible callables: one that returns a plain float (for most optimizers) and one that returns a dspy.Prediction with score and feedback fields (for GEPA’s reflection loop).
Supported Metrics
All five metrics are instantiated from deepeval.metrics and accept a configurable threshold and an evaluator LLM. Each metric is run against an LLMTestCase built from the gold label and the pipeline’s prediction:
| Metric | Class | What it measures |
|---|---|---|
| Answer Relevancy | AnswerRelevancyMetric | How relevant the generated answer is to the input question |
| Faithfulness | FaithfulnessMetric | Whether the answer is grounded in the retrieved context and avoids hallucinations |
| Contextual Precision | ContextualPrecisionMetric | Precision of the retrieved passages — how many are actually relevant |
| Contextual Recall | ContextualRecallMetric | Recall of the retrieved passages — what fraction of the relevant content was retrieved |
| Contextual Relevancy | ContextualRelevancyMetric | Overall relevance of retrieved passages to the question |
Metric Instantiation
Metrics are configured in the optimizer’s YAML file (for example, freshqa_rag_mipro_config.yml) and instantiated in the optimizer script. Each metric accepts a threshold (the minimum passing score) and an async_mode flag. A LocalModel wrapping your evaluator LLM is passed to every metric.
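As a rough Python sketch of the instantiation step described above (not the project's actual optimizer script): the helper build_metrics, the config key names, and the default for async_mode are assumptions made for illustration. The keyword arguments threshold, model, and async_mode are the ones DeepEval metric classes accept. A stand-in class is used so the example runs without DeepEval installed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping, Sequence


def build_metrics(
    metric_configs: Sequence[Mapping[str, Any]],
    metric_classes: Mapping[str, Callable[..., Any]],
    model: Any,
) -> list:
    """Instantiate one metric per config entry, sharing one evaluator model."""
    metrics = []
    for cfg in metric_configs:
        name = cfg["name"]
        if name not in metric_classes:
            raise KeyError(f"Unknown metric class: {name}")
        metrics.append(
            metric_classes[name](
                threshold=cfg["threshold"],
                model=model,
                # Defaulting to False is this sketch's choice, not DeepEval's.
                async_mode=cfg.get("async_mode", False),
            )
        )
    return metrics


@dataclass
class StandInMetric:
    """Placeholder taking the same keyword arguments; used only for this demo."""
    threshold: float
    model: Any
    async_mode: bool


# Hypothetical parsed YAML; the key names are assumptions for illustration.
example_config = [
    {"name": "AnswerRelevancyMetric", "threshold": 0.7, "async_mode": False},
    {"name": "FaithfulnessMetric", "threshold": 0.7},
]

# In a real run the registry would map each name to the matching class in
# deepeval.metrics, and model would be the LocalModel wrapping the evaluator LLM.
registry = {
    "AnswerRelevancyMetric": StandInMetric,
    "FaithfulnessMetric": StandInMetric,
}
metrics = build_metrics(example_config, registry, model="evaluator-llm-placeholder")
```

Keeping the name-to-class registry explicit means a typo in the YAML fails fast with a KeyError instead of silently dropping a metric.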
Metric Function Variants
The metrics.py module exposes two factory functions. Choose the correct one based on the optimizer you are using.
create_metrics_function() — Returns float
Used by MIPROv2, COPRO, BootstrapFewShotWithRandomSearch, and SIMBA. Wraps the DeepEval evaluation loop into a callable with the signature (gold, pred, trace) -> float. Scores from all five metrics are averaged into a single rounded float.
The function extracts the following attributes from its arguments:
- gold.question: the input question
- gold.answer: the expected (gold) answer
- pred.answer: the pipeline’s predicted answer
- pred.retrieved_context: the list of retrieved passages

It then builds an LLMTestCase, calls deepeval.evaluate() with run_async=False and throttle_value=60, and aggregates the resulting scores.
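The attribute extraction can be sketched as a small mapping from the gold and pred objects to LLMTestCase keyword arguments. The helper name to_test_case_fields and the exact field mapping are assumptions; input, expected_output, actual_output, and retrieval_context are real LLMTestCase parameters, but the project's own mapping is not shown in this page.

```python
from types import SimpleNamespace
from typing import Any, Dict


def to_test_case_fields(gold: Any, pred: Any) -> Dict[str, Any]:
    """Map DSPy gold/pred attributes to LLMTestCase keyword arguments."""
    return {
        "input": gold.question,
        "expected_output": gold.answer,
        "actual_output": pred.answer,
        # Copy into a new list; a missing or None context becomes an empty list.
        "retrieval_context": list(pred.retrieved_context or []),
    }


# SimpleNamespace stands in for dspy.Example and dspy.Prediction here.
gold = SimpleNamespace(question="Who wrote Dune?", answer="Frank Herbert")
pred = SimpleNamespace(
    answer="Dune was written by Frank Herbert.",
    retrieved_context=["Dune is a 1965 novel by Frank Herbert."],
)
fields = to_test_case_fields(gold, pred)
# With DeepEval installed, the test case would be LLMTestCase(**fields).
```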
create_gepa_metrics_function() — Returns dspy.Prediction
Used exclusively by GEPA. Returns a dspy.Prediction(score=..., feedback=...) where score is the same averaged float and feedback is a comma-separated string of "<MetricName>: <score>" pairs. GEPA’s reflection LLM reads the feedback string to identify which metrics are underperforming and proposes targeted instruction improvements.
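A minimal sketch of how such a feedback string could be assembled, assuming the per-metric scores are available as a name-to-score mapping. The helper names are hypothetical, the metric names used as keys are an assumption, and SimpleNamespace stands in for dspy.Prediction so the example runs without DSPy.

```python
from types import SimpleNamespace
from typing import Mapping


def format_feedback(scores: Mapping[str, float]) -> str:
    """Join per-metric scores as comma-separated "<MetricName>: <score>" pairs."""
    return ", ".join(f"{name}: {score}" for name, score in scores.items())


def gepa_result(average: float, scores: Mapping[str, float]) -> SimpleNamespace:
    # Stand-in for dspy.Prediction(score=average, feedback=...).
    return SimpleNamespace(score=average, feedback=format_feedback(scores))


result = gepa_result(
    0.75,
    {"AnswerRelevancyMetric": 1.0, "FaithfulnessMetric": 0.5},
)
print(result.feedback)  # AnswerRelevancyMetric: 1.0, FaithfulnessMetric: 0.5
```

In this example a reflection model reading the string would see that faithfulness, not relevancy, is the weaker metric.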
Even when using GEPA during optimization, you should use the standard create_metrics_function() for final evaluation with dspy.Evaluate. The GEPA-specific variant’s dspy.Prediction return type is not compatible with dspy.Evaluate’s aggregation logic.
Score Aggregation
Both functions aggregate metric scores identically: each metric contributes one score in [0, 1], all scores are summed and divided by the total count, and the result is rounded to two decimal places. A pipeline with no retrieved context or an evaluation error returns 0.0.
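The aggregation rule can be illustrated with a short sketch. The function names are hypothetical, run_metrics is a placeholder for the DeepEval evaluation call, and returning 0.0 for an empty score list is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence


def aggregate_scores(scores: Sequence[float]) -> float:
    """Average the per-metric scores and round to two decimal places."""
    if not scores:
        return 0.0  # assumption: nothing to average counts as a zero score
    return round(sum(scores) / len(scores), 2)


def safe_score(
    retrieved_context: Optional[Sequence[str]],
    run_metrics: Callable[[], Sequence[float]],
) -> float:
    """Return 0.0 when there is no context or when evaluation raises."""
    if not retrieved_context:
        return 0.0
    try:
        return aggregate_scores(run_metrics())
    except Exception:
        return 0.0


# Five metric scores: (1.0 + 0.5 + 0.75 + 0.25 + 1.0) / 5 = 0.7
example = aggregate_scores([1.0, 0.5, 0.75, 0.25, 1.0])
```

Because a failure is scored as 0.0 rather than raised, one bad example lowers a candidate's average instead of aborting the optimization run.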
Confident AI Tracing
During optimization runs, metrics and traces can be logged to Confident AI for centralized tracking and visualization. To enable tracing, add your Confident AI API key to a .env.local file in the project root.
Evaluation Script Pattern
After optimization, load the saved pipeline state and evaluate it on the held-out test set using dspy.Evaluate.
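One way this pattern might look, as a sketch rather than the project's actual evaluation script: the helper name evaluate_saved_pipeline and its parameters are assumptions. The load method on DSPy modules and the dspy.Evaluate keyword arguments devset, metric, num_threads, and display_progress are existing DSPy APIs.

```python
from typing import Any, Callable, Sequence


def evaluate_saved_pipeline(
    pipeline: Any,
    state_path: str,
    testset: Sequence[Any],
    metric: Callable[..., float],
    num_threads: int = 1,
) -> Any:
    """Load optimized state into the pipeline and score it on the test set."""
    import dspy  # imported here so the sketch can be read without DSPy installed

    # Module.load restores the state saved after optimization.
    pipeline.load(state_path)

    evaluator = dspy.Evaluate(
        devset=list(testset),
        metric=metric,  # the float-returning metric, not the GEPA variant
        num_threads=num_threads,
        display_progress=True,
    )
    # The return type depends on the DSPy version (a float or a result object).
    return evaluator(pipeline)
```

A call would pass an instance of the pipeline module (for example FreshQARAG), the path to the saved state file, the held-out examples, and the callable produced by create_metrics_function().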
Expected Input/Output Schema
The metric functions derive all needed information from the gold and pred objects produced during a dspy.Evaluate run:
| Attribute | Source | Description |
|---|---|---|
| gold.question | Training/test example | The original question text |
| gold.answer | Training/test example | The expected reference answer |
| pred.answer | Pipeline prediction | The pipeline’s generated answer |
| pred.retrieved_context | Pipeline prediction | List of retrieved passage strings |
The retrieved_context field on pred is populated by FreshQARAG.forward() as the deduplicated list of passages returned from the WeaviateRetriever calls in stage 4 of the pipeline.
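How FreshQARAG.forward() deduplicates passages is not shown here. One common order-preserving approach, given purely as an illustrative assumption with a hypothetical helper name, is:

```python
from typing import Iterable, List


def dedupe_passages(passages: Iterable[str]) -> List[str]:
    """Drop repeated passages while keeping first-seen order."""
    # dict preserves insertion order, so the first occurrence of each passage wins.
    return list(dict.fromkeys(passages))


merged = dedupe_passages(["a", "b", "a", "c", "b"])
```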