All four agentic pipelines (LightRAG-backed, MiniRAG-backed, PathRAG-backed, and HyperGraphRAG-backed) are evaluated against their corresponding non-agentic baselines using DeepEval. This design enables a direct measurement of the value added by the agentic multi-hop reasoning layer on top of each Graph RAG backend.
Evaluation Strategy
Each agentic pipeline wraps a Graph RAG backend with an iterative multi-hop reasoning loop: sub-query decomposition, parallel semantic and relational retrieval, evidence verification, and conditional expansion. Evaluation isolates the contribution of this loop by pairing every agentic variant with its raw backend as a baseline.
Select a Graph RAG backend
Choose one of four supported backends — LightRAG, MiniRAG, PathRAG, or HyperGraphRAG — to act as the retrieval engine.
Run the agentic pipeline
Wrap the backend with the full agentic multi-hop reasoning loop and generate answers across all four medical benchmarks.
Run the baseline
Query the same backend directly, without the agentic loop, to produce baseline answers for the same benchmark questions.
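
The three steps above can be sketched as a paired run over a shared question set. This is a minimal illustration rather than the project's code: `run_paired`, `PairedAnswer`, and the two callables standing in for the agentic pipeline and the raw backend are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical type: any function that maps a question to an answer string.
AnswerFn = Callable[[str], str]


@dataclass(frozen=True)
class PairedAnswer:
    """Answers from both variants for one benchmark question."""
    question: str
    agentic: str
    baseline: str


def run_paired(
    questions: Iterable[str], agentic: AnswerFn, baseline: AnswerFn
) -> List[PairedAnswer]:
    """Send every question to both variants so the outputs can be scored side by side."""
    return [PairedAnswer(q, agentic(q), baseline(q)) for q in questions]


if __name__ == "__main__":
    # Stand-in callables; a real run would call the agentic loop and the backend.
    demo = run_paired(
        ["Example benchmark question"],
        agentic=lambda q: f"[agentic] {q}",
        baseline=lambda q: f"[baseline] {q}",
    )
    print(demo[0])
```

Because both variants receive the same questions in the same order, any difference in downstream scores can be attributed to the agentic loop rather than to the question sample.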
What Is Evaluated
Each agentic pipeline is paired with its direct non-agentic counterpart as a baseline. Each of the four pairwise comparisons keeps the retrieval backend constant and varies only the presence of the agentic reasoning loop.

| Pipeline | Baseline |
|---|---|
| Agentic LightRAG pipeline | LightRAG baseline |
| Agentic MiniRAG pipeline | MiniRAG baseline |
| Agentic PathRAG pipeline | PathRAG baseline |
| Agentic HyperGraphRAG pipeline | HyperGraphRAG baseline |
Datasets
Evaluation runs across four medical QA benchmarks that collectively cover multi-turn clinical dialogue, step-by-step case reasoning, standardised USMLE-style questions, and biomedical literature retrieval.

| Dataset | Description |
|---|---|
| HealthBench | Multi-turn medical AI benchmark with expert rubric evaluations |
| MedCaseReasoning | Medical case studies with detailed reasoning processes |
| MetaMedQA | Medical QA based on USMLE textbook content |
| PubMedQA | Biomedical QA based on PubMed articles |
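
Taken together, four backends, two variants per backend, and four datasets imply 32 evaluation runs. The sketch below enumerates that grid. The `EvalRun` record and the lowercase variant labels are illustrative assumptions, not identifiers from the project.

```python
from itertools import product
from typing import List, NamedTuple

BACKENDS = ["LightRAG", "MiniRAG", "PathRAG", "HyperGraphRAG"]
DATASETS = ["HealthBench", "MedCaseReasoning", "MetaMedQA", "PubMedQA"]
VARIANTS = ["agentic", "baseline"]  # assumed labels for the two variants


class EvalRun(NamedTuple):
    """One cell of the evaluation grid (hypothetical record type)."""
    backend: str
    variant: str
    dataset: str


def evaluation_grid() -> List[EvalRun]:
    """Enumerate every backend, variant, and dataset combination."""
    return [EvalRun(b, v, d) for b, v, d in product(BACKENDS, VARIANTS, DATASETS)]


if __name__ == "__main__":
    grid = evaluation_grid()
    print(len(grid))  # 4 backends x 2 variants x 4 datasets = 32
```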
Metrics
Five metrics from the DeepEval framework are applied to every pipeline–baseline pair: Contextual Recall, Contextual Precision, Contextual Relevancy, Answer Relevancy, and Faithfulness. Together they measure whether the system retrieves the right context, uses it correctly, and produces faithful answers, all of which are critical in medical settings. See the Metrics page for a detailed explanation of each metric and why it matters for medical QA.
Benchmarks
Detailed profiles of HealthBench, MedCaseReasoning, MetaMedQA, and PubMedQA — including what each benchmark tests and why it complements the others.
Metrics
Full definitions of all five DeepEval metrics, plus an explanation of how baseline comparison enables direct delta measurement.
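
One way to express the delta measurement mentioned above is to subtract each baseline score from the matching agentic score, metric by metric. The sketch assumes scores arrive as plain dictionaries keyed by metric name. `metric_deltas` and the key names are hypothetical, and the numbers in the demo are placeholders, not results.

```python
from typing import Dict

# Assumed dictionary keys for the five metrics named in this page.
METRICS = (
    "contextual_recall",
    "contextual_precision",
    "contextual_relevancy",
    "answer_relevancy",
    "faithfulness",
)


def metric_deltas(
    agentic: Dict[str, float], baseline: Dict[str, float]
) -> Dict[str, float]:
    """Return agentic minus baseline for each of the five metrics."""
    missing = [m for m in METRICS if m not in agentic or m not in baseline]
    if missing:
        raise KeyError(f"missing scores for: {', '.join(missing)}")
    return {m: agentic[m] - baseline[m] for m in METRICS}


if __name__ == "__main__":
    # Placeholder values only, chosen to show the arithmetic.
    agentic_scores = {m: 0.75 for m in METRICS}
    baseline_scores = {m: 0.5 for m in METRICS}
    print(metric_deltas(agentic_scores, baseline_scores))
```

A positive delta means the agentic variant scored higher on that metric for the same backend and dataset; a negative delta means the loop hurt that metric.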
Evaluation results will be published alongside the code release. No benchmark scores are available yet.