All four agentic pipelines — LightRAG-backed, MiniRAG-backed, PathRAG-backed, and HyperGraphRAG-backed — are evaluated against their corresponding non-agentic baselines using DeepEval. This design enables a direct measurement of the value added by the agentic multi-hop reasoning layer on top of each Graph RAG backend.

Evaluation Strategy

Each agentic pipeline wraps a Graph RAG backend with an iterative multi-hop reasoning loop: sub-query decomposition, parallel semantic and relational retrieval, evidence verification, and conditional expansion. Evaluation isolates the contribution of this loop by pairing every agentic variant with its raw backend as a baseline.
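
The loop described above can be sketched roughly as follows. This is only an illustration of the control flow, not the project's implementation: every name here (`agentic_answer`, `decompose`, `semantic`, `relational`, `verify`, `expand`, `generate`, `max_hops`) is hypothetical, and the backend-specific behaviour is passed in as plain callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Retriever = Callable[[str], List[str]]


def agentic_answer(
    question: str,
    decompose: Callable[[str], List[str]],          # question -> sub-queries
    semantic: Retriever,                            # sub-query -> passages
    relational: Retriever,                          # sub-query -> graph facts
    verify: Callable[[str, List[str]], bool],       # is the evidence sufficient?
    expand: Callable[[str, List[str]], List[str]],  # follow-up sub-queries
    generate: Callable[[str, List[str]], str],      # final answer generation
    max_hops: int = 3,                              # hypothetical hop budget
) -> str:
    evidence: List[str] = []
    pending = decompose(question)
    for _ in range(max_hops):
        if not pending:
            break
        with ThreadPoolExecutor(max_workers=2) as pool:
            for sub_query in pending:
                # Semantic and relational retrieval run side by side.
                sem = pool.submit(semantic, sub_query)
                rel = pool.submit(relational, sub_query)
                for passage in sem.result() + rel.result():
                    if passage not in evidence:
                        evidence.append(passage)
        # Stop as soon as the collected evidence passes verification.
        if verify(question, evidence):
            break
        # Otherwise expand with new sub-queries for the next hop.
        pending = expand(question, evidence)
    return generate(question, evidence)
```

Under this framing, the baseline corresponds to a single retrieval call followed by generation, with no decomposition, verification, or expansion.
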
1. Select a Graph RAG backend

   Choose one of four supported backends (LightRAG, MiniRAG, PathRAG, or HyperGraphRAG) to act as the retrieval engine.

2. Run the agentic pipeline

   Wrap the backend with the full agentic multi-hop reasoning loop and generate answers across all four medical benchmarks.

3. Run the baseline

   Query the same backend directly, without the agentic loop, to produce baseline answers for the same benchmark questions.

4. Score with DeepEval

   Apply all five DeepEval metrics to both sets of answers, then compute the delta between the agentic pipeline and its baseline.

The goal is to quantify precisely how much the agentic multi-hop layer improves retrieval quality and answer faithfulness relative to each raw Graph RAG backend, without conflating the effect of the backend architecture with the effect of the reasoning loop.
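
As a rough sketch of the delta computation in step 4, the snippet below compares per-metric scores for one agentic pipeline and its baseline. It assumes each metric yields one score per question and that scores are aggregated with a plain mean; the source does not specify the aggregation, and the name `metric_deltas` is hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Metric name -> per-question scores in [0, 1] (assumed layout).
Scores = Dict[str, List[float]]


def metric_deltas(agentic: Scores, baseline: Scores) -> Dict[str, float]:
    """Return mean(agentic) - mean(baseline) for every metric present in both."""
    deltas: Dict[str, float] = {}
    for metric, agentic_scores in agentic.items():
        if metric not in baseline:
            continue  # only compare metrics scored on both sides
        deltas[metric] = mean(agentic_scores) - mean(baseline[metric])
    return deltas
```

A positive delta means the agentic loop scored higher than the raw backend on that metric; because the backend is the same on both sides, the difference can be attributed to the loop.
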

What Is Evaluated

Each agentic pipeline is paired with its direct non-agentic counterpart as a baseline. Each of the four pairwise comparisons keeps the retrieval backend constant and varies only the presence of the agentic reasoning loop.

| Pipeline | Baseline |
| --- | --- |
| Agentic LightRAG pipeline | LightRAG baseline |
| Agentic MiniRAG pipeline | MiniRAG baseline |
| Agentic PathRAG pipeline | PathRAG baseline |
| Agentic HyperGraphRAG pipeline | HyperGraphRAG baseline |

Datasets

Evaluation runs across four medical QA benchmarks that collectively cover multi-turn clinical dialogue, step-by-step case reasoning, standardised USMLE-style questions, and biomedical literature retrieval.

| Dataset | Description |
| --- | --- |
| HealthBench | Multi-turn medical AI benchmark with expert rubric evaluations |
| MedCaseReasoning | Medical case studies with detailed reasoning processes |
| MetaMedQA | Medical QA based on USMLE textbook content |
| PubMedQA | Biomedical QA based on PubMed articles |

See the Benchmarks page for a detailed description of each dataset and why it was chosen.
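
Combining the four backends, the two variants (agentic and baseline), and the four datasets gives 4 × 2 × 4 = 32 evaluation runs. The snippet below simply enumerates that grid; the dictionary layout and the function name `evaluation_runs` are illustrative assumptions, not part of the project's code.

```python
from itertools import product
from typing import Dict, List

BACKENDS = ["LightRAG", "MiniRAG", "PathRAG", "HyperGraphRAG"]
DATASETS = ["HealthBench", "MedCaseReasoning", "MetaMedQA", "PubMedQA"]
VARIANTS = ["agentic", "baseline"]


def evaluation_runs() -> List[Dict[str, str]]:
    """Enumerate every (backend, dataset, variant) combination."""
    return [
        {"backend": backend, "dataset": dataset, "variant": variant}
        for backend, dataset, variant in product(BACKENDS, DATASETS, VARIANTS)
    ]
```
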

Metrics

Five metrics from the DeepEval framework are applied to every pipeline–baseline pair: Contextual Recall, Contextual Precision, Contextual Relevancy, Answer Relevancy, and Faithfulness. Together they measure whether the system retrieves the right context, uses it correctly, and produces faithful answers — all of which are critical in medical settings. See the Metrics page for a detailed explanation of each metric and why it matters for medical QA.
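
A minimal sketch of how the five metrics might be applied with DeepEval is shown below. The metric classes, `LLMTestCase`, and `evaluate` are part of DeepEval's public API, but the helper names (`to_test_case_fields`, `build_metrics`, `score`), the record layout, and the `0.5` threshold are assumptions made for illustration. DeepEval is imported lazily inside the functions, and running `score` requires DeepEval to be installed with an LLM judge configured.

```python
from typing import Any, Dict, Iterable, List


def to_test_case_fields(
    question: str, answer: str, reference: str, contexts: Iterable[str]
) -> Dict[str, Any]:
    """Map one benchmark record onto LLMTestCase keyword arguments."""
    return {
        "input": question,                    # benchmark question
        "actual_output": answer,              # pipeline or baseline answer
        "expected_output": reference,         # gold answer from the dataset
        "retrieval_context": list(contexts),  # passages the backend returned
    }


def build_metrics(threshold: float = 0.5) -> List[Any]:
    """Instantiate the five metrics named above (threshold is an assumption)."""
    from deepeval.metrics import (
        AnswerRelevancyMetric,
        ContextualPrecisionMetric,
        ContextualRecallMetric,
        ContextualRelevancyMetric,
        FaithfulnessMetric,
    )

    return [
        ContextualRecallMetric(threshold=threshold),
        ContextualPrecisionMetric(threshold=threshold),
        ContextualRelevancyMetric(threshold=threshold),
        AnswerRelevancyMetric(threshold=threshold),
        FaithfulnessMetric(threshold=threshold),
    ]


def score(records: List[Dict[str, Any]]) -> Any:
    """Score a list of records; each record holds the four fields above."""
    from deepeval import evaluate
    from deepeval.test_case import LLMTestCase

    cases = [LLMTestCase(**to_test_case_fields(**record)) for record in records]
    return evaluate(test_cases=cases, metrics=build_metrics())
```

The same `score` call would be made once for the agentic answers and once for the baseline answers, so that both sides are judged by identical metric settings.
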

Benchmarks

Detailed profiles of HealthBench, MedCaseReasoning, MetaMedQA, and PubMedQA — including what each benchmark tests and why it complements the others.

Metrics

Full definitions of all five DeepEval metrics, plus an explanation of how baseline comparison enables direct delta measurement.

Evaluation results will be published alongside the code release. No benchmark scores are available yet.
