DeepEval Metrics for Agentic Graph RAG Medical Evaluation

Contextual Recall

What it measures: The fraction of relevant context that was actually retrieved from the knowledge sources. A low Contextual Recall score means the pipeline failed to surface evidence that was present in the knowledge graph or document store.

Why it matters for medical QA: Clinical questions often hinge on a specific piece of evidence, such as a diagnostic criterion, a contraindication, or a dosage threshold. Missing that evidence, even if the rest of the retrieved context is correct, can lead to an incomplete or dangerous answer. Contextual Recall checks that the agentic multi-hop loop is not terminating before all relevant evidence has been gathered.
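The recall idea can be illustrated with a small sketch. It is a toy stand-in, not DeepEval's implementation: DeepEval relies on an LLM judge to decide whether each required statement is supported by the retrieved context, whereas this sketch assumes evidence items carry exact IDs. The function name and the evidence IDs are hypothetical.

```python
def contextual_recall(required_evidence, retrieved_evidence):
    """Return (score, missing) where score is the share of required
    evidence IDs that appear in the retrieved set."""
    required = set(required_evidence)
    if not required:
        raise ValueError("required_evidence must not be empty")
    found = required & set(retrieved_evidence)
    missing = sorted(required - found)
    return len(found) / len(required), missing


# Hypothetical evidence IDs, used only for illustration.
required = ["diagnostic_criterion", "contraindication", "dosage_threshold"]
retrieved = ["diagnostic_criterion", "dosage_threshold", "unrelated_trial"]

score, missing = contextual_recall(required, retrieved)
print(f"recall={score:.2f}, missing={missing}")
```

Returning the missing items alongside the score makes it easier to see which piece of evidence the retrieval loop stopped short of.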

Contextual Precision

What it measures: The fraction of retrieved context that is actually relevant to the question. DeepEval's implementation is also rank-aware: relevant items that appear above irrelevant ones in the retrieved list score higher. A low Contextual Precision score means the pipeline is returning a large amount of noisy or tangentially related material alongside, or ranked above, the genuinely useful evidence.

Why it matters for medical QA: Excess irrelevant context can distract the answer-generation step, leading to unfocused or diluted responses. In clinical settings, precision is especially important because mixing in unrelated medical concepts can cause the model to conflate conditions, treatments, or patient populations.
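How noise dilutes useful evidence can be made concrete with a rank-weighted precision score. The sketch below uses an average-precision-style formula, assumed here to match the shape of the formula DeepEval documents for this metric. The relevance flags are hypothetical inputs; in practice they would come from an LLM judge.

```python
def rank_aware_precision(relevance_flags):
    """relevance_flags[i] is True when the item at rank i + 1 is relevant.

    Averages precision-at-k over the ranks that hold a relevant item,
    so relevant items placed early score higher than the same items
    placed after noise.
    """
    total_relevant = sum(relevance_flags)
    if total_relevant == 0:
        return 0.0
    seen_relevant = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            seen_relevant += 1
            weighted_sum += seen_relevant / rank
    return weighted_sum / total_relevant


# Same four items, two relevant, in two different orders.
relevant_first = rank_aware_precision([True, True, False, False])
noise_first = rank_aware_precision([False, False, True, True])
print(relevant_first, round(noise_first, 3))
```

Both orderings contain the same share of relevant items, but the second is penalized because the answer-generation step encounters the noise first.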

Contextual Relevancy

What it measures: The overall relevance of the retrieved context to the question as a whole. While Contextual Precision evaluates individual retrieved items, Contextual Relevancy captures whether the retrieved set is collectively on-topic and appropriate for answering the query.

Why it matters for medical QA: A pipeline may retrieve many individually relevant snippets while still missing the overall clinical framing of the question. Contextual Relevancy checks that the pipeline stays oriented toward the actual question being asked, rather than drifting toward related-but-distinct medical topics during iterative retrieval.
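One way to score the retrieved set as a whole is to pool every statement from every retrieved chunk and take the relevant share. This is a minimal sketch under that assumption. The chunk contents and relevance verdicts are hypothetical placeholders, and in practice an LLM judge would produce the verdicts.

```python
def contextual_relevancy(chunks):
    """chunks is a list of chunks; each chunk is a list of
    (statement, is_relevant) pairs. Returns the relevant share of
    all statements pooled across chunks."""
    verdicts = [is_relevant for chunk in chunks for _, is_relevant in chunk]
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Placeholder statements; the second chunk has drifted partly off-topic.
chunks = [
    [("statement about the queried condition", True),
     ("statement about its first-line treatment", True)],
    [("statement about a related but distinct condition", False),
     ("statement about diagnosing the queried condition", True)],
]

relevancy = contextual_relevancy(chunks)
print(relevancy)
```

Because the score is computed per statement rather than per chunk, a chunk that is only partly on-topic lowers the result without being counted as wholly irrelevant.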

Answer Relevancy

What it measures: How relevant the final generated answer is to the original question. A low Answer Relevancy score indicates that the model has produced a response that does not directly address what was asked, even if that response is internally coherent or factually accurate in isolation.

Why it matters for medical QA: Medical questions are often precise and require equally precise answers. A response that discusses a related condition, treatment, or concept without directly answering the query fails the clinician or patient regardless of its factual content. Answer Relevancy catches hallucinated tangents and off-topic responses before they reach end users.

Faithfulness

What it measures: Whether the final answer is grounded in the retrieved context and does not introduce claims that are unsupported by the retrieved evidence. A low Faithfulness score means the model has generated statements that go beyond, or contradict, what was actually retrieved.

Why it matters for medical QA: Hallucination is the most dangerous failure mode in medical AI. An answer that invents drug interactions, misattributes symptoms, or fabricates study results could directly harm a patient. Faithfulness is therefore the most critical single metric in this evaluation suite, and it is the primary target of the agentic pipeline’s evidence-verification sub-step.
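The grounding check can be sketched as a claim-level ratio with a pass/fail gate. This is an illustration only: the claim labels, the verdicts, and the 0.9 threshold are hypothetical. In a real run an LLM judge would extract the claims from the answer and decide whether the retrieved context supports each one.

```python
def faithfulness(claim_verdicts):
    """claim_verdicts maps each claim in the answer to True when the
    retrieved context supports it. Returns (score, unsupported_claims)."""
    if not claim_verdicts:
        raise ValueError("claim_verdicts must not be empty")
    unsupported = [claim for claim, ok in claim_verdicts.items() if not ok]
    supported_count = len(claim_verdicts) - len(unsupported)
    return supported_count / len(claim_verdicts), unsupported


def passes_gate(score, threshold=0.9):
    """Hypothetical release gate: block answers below the threshold."""
    return score >= threshold


# Placeholder claims and verdicts, used only for illustration.
verdicts = {
    "claim about symptom onset": True,
    "claim about first-line treatment": True,
    "claim about a drug interaction": False,  # not found in retrieved context
    "claim about follow-up interval": True,
}

faith_score, unsupported = faithfulness(verdicts)
print(faith_score, unsupported, passes_gate(faith_score))
```

Returning the unsupported claims gives an evidence-verification step something specific to act on, such as re-retrieving for the drug-interaction claim or removing it from the answer.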
