The metrics module bridges the DeepEval evaluation library and the DSPy optimizer loop. DSPy optimizers require a callable metric function with a specific signature; DeepEval provides rich, LLM-graded RAG metrics. This module provides two factory functions that wrap any list of BaseMetric instances into the exact shape each optimizer expects: one returning a plain float for most optimizers, and one returning a dspy.Prediction with score and feedback text for GEPA.
Supported DeepEval metrics
The following DeepEval metrics work directly with both factory functions:
AnswerRelevancyMetric
Measures whether the generated answer is relevant to the input question.
FaithfulnessMetric
Checks whether the answer is grounded in the retrieved context without hallucination.
ContextualPrecisionMetric
Evaluates whether retrieved passages that are relevant are ranked above irrelevant ones.
ContextualRecallMetric
Measures how much of the ground-truth answer is covered by the retrieved context.
ContextualRelevancyMetric
Assesses whether the retrieved context is relevant to the question overall.
Expected data shape
Both inner metric functions read attributes from gold and pred using getattr with safe defaults:
| Attribute | Source object | Description |
|---|---|---|
| question | gold | The original input question |
| answer | gold | The ground-truth reference answer |
| answer | pred | The LLM’s generated answer |
| retrieved_context | pred | List[str] of retrieved passage strings |
These values are packed into a deepeval.test_case.LLMTestCase and evaluated synchronously.
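The attribute lookup described above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the helper name `extract_fields`, the empty default values, and the mapping onto the `LLMTestCase` field names (`input`, `expected_output`, `actual_output`, `retrieval_context`) are assumptions, and plain `SimpleNamespace` objects stand in for the DSPy example and prediction.

```python
from types import SimpleNamespace
from typing import Any, Dict


def extract_fields(gold: Any, pred: Any) -> Dict[str, Any]:
    """Read the four attributes with getattr, falling back to empty defaults.

    Hypothetical helper: the real module may use different defaults and
    passes the values to deepeval.test_case.LLMTestCase instead of a dict.
    """
    return {
        "input": getattr(gold, "question", ""),
        "expected_output": getattr(gold, "answer", ""),
        "actual_output": getattr(pred, "answer", ""),
        "retrieval_context": getattr(pred, "retrieved_context", []),
    }


# Stand-ins for a DSPy example (gold) and a DSPy prediction (pred).
gold = SimpleNamespace(question="What is DSPy?", answer="A framework for programming LMs.")
pred = SimpleNamespace(answer="DSPy is a framework.")  # no retrieved_context attribute

fields = extract_fields(gold, pred)
print(fields["retrieval_context"])  # the missing attribute falls back to the default
```

Because every lookup has a default, a prediction that lacks `retrieved_context` does not raise an `AttributeError`; the context-based metrics simply receive an empty list.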
create_metrics_function
Returns a deepeval_metrics(gold, pred, trace=None) -> float function. Use this with MIPROv2, COPRO, BootstrapFewShotWithRandomSearch, and SIMBA, all of which expect a metric returning a single numeric score.
The function takes a list of instantiated DeepEval metric objects (e.g. AnswerRelevancyMetric, FaithfulnessMetric). All metrics in the list are evaluated on every call, and their scores are averaged into a single float rounded to two decimal places.
It returns a Callable[[Any, Any, Optional[bool]], float] with the inner signature deepeval_metrics(gold, pred, trace=None) -> float.
After calling evaluate(), the function collects each metric’s score by name and computes round(sum(scores.values()) / len(scores), 2). If no scores were collected (e.g. because of an evaluation error), it returns 0.0.
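The aggregation step can be reproduced in isolation. The sketch below assumes the scores have already been collected into a dict keyed by metric name; the function name `average_scores` is hypothetical and only illustrates the formula and the empty-result fallback.

```python
from typing import Dict


def average_scores(scores: Dict[str, float]) -> float:
    """Average per-metric scores, rounded to two decimals; 0.0 if empty."""
    if not scores:
        # No scores were collected (for example, the evaluation failed).
        return 0.0
    return round(sum(scores.values()) / len(scores), 2)


print(average_scores({"AnswerRelevancy": 0.8, "Faithfulness": 0.6}))  # 0.7
print(average_scores({}))  # 0.0
```

The empty check comes first so that a failed evaluation yields a score of 0.0 instead of a ZeroDivisionError inside the optimizer loop.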
create_gepa_metrics_function
Returns a metric function for GEPA that produces a dspy.Prediction containing both a numeric score and a human-readable feedback string.
The function takes a list of instantiated DeepEval metric objects. All metrics are evaluated, and their individual scores are included in the feedback string.
The returned dspy.Prediction has two fields:
| Field | Type | Description |
|---|---|---|
| score | float | Averaged score across all metrics, rounded to two decimal places |
| feedback | str | Comma-separated "MetricName: score" pairs, e.g. "AnswerRelevancy: 0.8, Faithfulness: 0.6" |
GEPA uses the feedback string to diagnose which metrics underperformed and to propose updated prompt instructions for the next optimization round.
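A rough sketch of how the two fields could be assembled from per-metric scores is shown below. It is not the module's implementation: `build_gepa_result` is a hypothetical name, `SimpleNamespace` stands in for dspy.Prediction so the snippet runs without DSPy installed, and the 0.0 fallback for an empty score dict is assumed to mirror the float-returning variant.

```python
from types import SimpleNamespace
from typing import Dict


def build_gepa_result(scores: Dict[str, float]) -> SimpleNamespace:
    """Combine per-metric scores into an averaged score plus feedback text."""
    # Assumption: an empty score dict falls back to 0.0, as in the float variant.
    score = round(sum(scores.values()) / len(scores), 2) if scores else 0.0
    # Comma-separated "MetricName: score" pairs, in insertion order.
    feedback = ", ".join(f"{name}: {value}" for name, value in scores.items())
    # The documented function returns a dspy.Prediction with these two fields;
    # SimpleNamespace is only a dependency-free stand-in.
    return SimpleNamespace(score=score, feedback=feedback)


result = build_gepa_result({"AnswerRelevancy": 0.8, "Faithfulness": 0.6})
print(result.score)
print(result.feedback)
```

Keeping the per-metric breakdown in `feedback` is what lets a reflective optimizer distinguish, for example, a relevancy problem from a faithfulness problem when both produce the same averaged score.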
Both factory functions call evaluate() with AsyncConfig(run_async=False, throttle_value=60, max_concurrent=1). Async execution is intentionally disabled to avoid rate-limit errors when many test cases are evaluated in rapid succession during an optimization run. The throttle_value=60 introduces a 60-second inter-request pause when rate limits are approached.