The agentic Graph RAG system is evaluated on four medical QA benchmarks that together cover a broad range of clinical question types, from multi-turn medical AI interactions to textbook-style USMLE questions and biomedical research QA. They stress-test the pipeline across conversational, reasoning-intensive, standardised, and literature-grounded scenarios, giving a broad picture of how the system handles the kinds of questions that arise in clinical practice.
HealthBench
HealthBench is a multi-turn medical AI benchmark with expert rubric evaluations. Rather than simple single-turn question–answer pairs, it presents realistic conversational exchanges with a medical AI, and scores responses against rubrics written by medical experts.

Why it’s relevant: The agentic pipeline’s iterative sub-query decomposition and evidence-verification loop are designed specifically for complex, multi-hop questions. HealthBench stress-tests whether those capabilities translate to multi-turn, conversational clinical questions evaluated by expert-defined rubrics, a setting that closely mirrors real clinical decision-support use cases.

MedCaseReasoning
MedCaseReasoning consists of medical case studies with detailed reasoning processes. Each example requires not just a correct final answer, but the ability to follow and reproduce the stepwise clinical reasoning a physician would use when working through a case.

Why it’s relevant: The agentic pipeline decomposes questions into sub-queries and tracks back-references across hops, mirroring the kind of chained reasoning that clinical case studies demand. MedCaseReasoning directly tests whether this multi-hop structure leads to better-aligned reasoning chains, rather than just improved final answers.

MetaMedQA
MetaMedQA is a medical QA benchmark based on USMLE textbook content. It presents standardised multiple-choice questions drawn from the same material physicians use when preparing for licensing examinations.

Why it’s relevant: USMLE-style questions require precise, factually grounded answers with no room for vague or hallucinated content. MetaMedQA provides a controlled, standardised setting in which the Faithfulness and Contextual Precision metrics are particularly meaningful: any unsupported claim or irrelevant retrieval is immediately penalised.

PubMedQA
PubMedQA is a biomedical QA benchmark built on top of PubMed articles. Questions are grounded in peer-reviewed biomedical literature, and answers must be supported by the corresponding research context.

Why it’s relevant: The Graph RAG backends index knowledge graphs built from medical documents, and PubMedQA specifically tests retrieval over biomedical literature. This makes it a natural complement to the knowledge-graph-based retrieval channels. The Contextual Recall metric, in particular, measures how much of the relevant PubMed evidence the system actually surfaces.

Benchmark Summary
| Benchmark | Type | Source | Focus Area |
|---|---|---|---|
| HealthBench | Multi-turn dialogue | Expert-written rubrics | Conversational clinical AI evaluation |
| MedCaseReasoning | Case study reasoning | Stanford clinical cases | Stepwise clinical reasoning chains |
| MetaMedQA | Multiple choice | USMLE textbook content | Standardised factual medical QA |
| PubMedQA | Biomedical QA | PubMed research articles | Literature-grounded retrieval |
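
The sections above name Contextual Precision and Contextual Recall without defining them. As a rough illustration only, the sketch below assumes a common formulation: recall as the fraction of reference-answer statements supported by the retrieved context, and precision as an average-precision-style score over the ranked retrieved chunks. The function names and the boolean judgment inputs are hypothetical (in practice the judgments would come from an LLM judge or a human annotator), and the project's actual metric implementation may differ.

```python
from typing import Sequence


def contextual_recall(attributable: Sequence[bool]) -> float:
    """Fraction of reference-answer statements supported by retrieved context.

    `attributable[i]` is True when statement i of the reference answer can be
    traced to at least one retrieved chunk (judgment supplied by the caller).
    """
    if not attributable:
        return 0.0
    return sum(attributable) / len(attributable)


def contextual_precision(relevant: Sequence[bool]) -> float:
    """Average-precision-style score over retrieved chunks in rank order.

    `relevant[k]` is True when the chunk at rank k+1 is relevant to the
    question. Relevant chunks ranked early score higher than the same
    chunks ranked late.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Illustrative judgments only, not benchmark results.
print(contextual_recall([True, True, False, True]))        # 0.75
print(round(contextual_precision([True, False, True]), 3)) # 0.833
```

Under this assumed definition, an irrelevant chunk ranked above a relevant one lowers precision, which matches the description of irrelevant retrieval being penalised on MetaMedQA.
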
Evaluation results will be published alongside the code release. No benchmark scores are available yet.