
Obstetrics RAG Benchmark

A systematic evaluation of Retrieval-Augmented Generation (RAG) architectures applied to medical question-answering in the obstetrics domain. This research project benchmarks multiple RAG strategies across various Large Language Models using the RAGAS evaluation framework.

Overview

This project investigates the effectiveness of different RAG retrieval strategies for medical Q&A, specifically focusing on pregnancy and childbirth guidance. We implement and evaluate six distinct RAG architectures, comparing their performance across multiple state-of-the-art language models.

Quick Start

Get up and running with your first evaluation in minutes

RAG Architectures

Learn about the different RAG strategies we benchmark

Evaluation Framework

Understand RAGAS metrics and how we measure performance

API Reference

Explore the complete API documentation

Key Features

Multiple RAG Architectures

6 RAG Strategies: Simple Semantic, Hybrid (BM25 + Semantic), Hybrid-RRF, HyDE, Query Rewriter, and PageIndex
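
To make the hybrid idea concrete, here is a minimal sketch of Reciprocal Rank Fusion, the merging step behind the Hybrid-RRF strategy. The function and document IDs are illustrative, not the project's actual implementation; k=60 is the conventional constant from the original RRF paper.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs (best first) into one list.

    Each document's score is the sum of 1 / (k + rank) over every list
    it appears in, so documents ranked highly by several retrievers
    rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword (BM25) ranking with a semantic ranking.
bm25 = ["doc3", "doc1", "doc7"]
semantic = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([bm25, semantic]))  # doc1 and doc3 lead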

RAGAS Evaluation

4 Core Metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall
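
A minimal sketch of scoring one sample on these four metrics with the ragas library (imports and column names vary across RAGAS versions; the sample content below is illustrative):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluated sample: question, generated answer, retrieved contexts,
# and a reference answer (needed for context precision/recall).
sample = {
    "question": ["When is the fetal anatomy scan usually performed?"],
    "answer": ["Typically between 18 and 22 weeks of gestation."],
    "contexts": [[
        "The fetal anatomy ultrasound is commonly scheduled between "
        "18 and 22 weeks of pregnancy."
    ]],
    "ground_truth": ["Around 18 to 22 weeks of gestation."],
}

scores = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # each metric is reported in [0, 1]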

Multi-Model Support

Multiple LLMs: Default models (GPT-4o, GPT-3.5-turbo) plus an extensible registry supporting GPT-5, GPT-5.2, MediPhi, and MedGemma
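
An extensible registry can be as simple as a mapping from names to factories. A hypothetical sketch of the pattern (the project's real registry may differ):
from langchain_openai import ChatOpenAI

# Hypothetical registry: adding a model means adding one entry here,
# without touching the evaluation code.
MODEL_REGISTRY = {
    "gpt-4o": lambda: ChatOpenAI(model="gpt-4o", temperature=0),
    "gpt-3.5-turbo": lambda: ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
}

def get_model(name):
    if name not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model {name!r}; registered: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name]()

llm = get_model("gpt-4o")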

Vector Search

ChromaDB + OpenAI: Persistent vector store with OpenAI text-embedding-3-small
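
Opening the persistent store and querying it takes a few lines with LangChain's Chroma integration; the collection name and directory below are assumptions, not the project's actual values:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Open (or create) a persistent ChromaDB collection backed by
# OpenAI's text-embedding-3-small model.
store = Chroma(
    collection_name="obstetrics",       # assumed name
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",    # assumed path
)

docs = store.similarity_search("warning signs of preeclampsia", k=4)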

LangChain Pipeline

Production-Ready: Built on LangChain for reliable retrieval and generation
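
A minimal end-to-end retrieve-then-generate chain in LangChain's expression language, reusing the assumed store from the previous sketch (prompt wording and model choice are illustrative):
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

store = Chroma(
    collection_name="obstetrics",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",
)
retriever = store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    # Join retrieved chunks into a single context string for the prompt.
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)
print(chain.invoke("What are the warning signs of preeclampsia?"))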

Comprehensive Results

Detailed Analytics: JSON output with timestamps and a question-by-question breakdown

Research Focus

This project addresses several key questions in the RAG domain:
  • How do different retrieval strategies (semantic, hybrid, hypothetical embeddings, query reformulation) compare in medical Q&A scenarios?
  • What is the impact of model selection on RAG performance in specialized domains?
  • How do we quantitatively assess retrieval quality and generation faithfulness without manual annotation?
  • Which RAG configuration produces the highest quality responses for obstetrics-related questions?

Use Cases

  • Benchmark RAG techniques for healthcare applications, compare retrieval strategies for domain-specific knowledge bases, and establish baseline performance metrics for medical Q&A systems.
  • Evaluate different RAG strategies side-by-side, identify optimal configurations for your use case, and understand trade-offs between retrieval approaches.
  • Compare multiple language models on the same task, assess model-specific performance variations, and identify the best model for your requirements.
  • Learn RAG implementation patterns, understand evaluation methodologies, and explore best practices for knowledge-augmented generation.

Getting Started

1. Install Dependencies

Clone the repository and install Python dependencies including LangChain, ChromaDB, and RAGAS.
git clone https://github.com/JhonHander/obstetrics-rag-benchmark.git
cd obstetrics-rag-benchmark
pip install -r requirements.txt

2. Configure API Keys

Set up your OpenAI API key for embeddings and LLM access.
echo "OPENAI_API_KEY=your_key_here" > .env

3. Create Embeddings

Generate vector embeddings from the medical text corpus.
python scripts/create_embeddings.py
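
Under the hood, an embedding script of this kind typically loads the corpus, splits it into overlapping chunks, and writes them to the persistent store. A sketch of that flow (the file path, chunk sizes, and collection name are assumptions, not the script's actual values):
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = TextLoader("data/obstetrics_corpus.txt").load()  # assumed path
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200  # assumed sizes
).split_documents(docs)

# Embed the chunks and persist them to the ChromaDB collection.
Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="obstetrics",
    persist_directory="./chroma_db",
)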

4. Run Evaluation

Execute your first RAG evaluation and view results.
python scripts/run_evaluation.py hybrid
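
The positional argument selects the retrieval strategy, and results are written to a timestamped JSON file. A hypothetical sketch of that outer loop (the strategy identifiers other than hybrid and the output layout are guesses, not the script's actual interface):
import json
import sys
from datetime import datetime, timezone

STRATEGIES = {"simple", "hybrid", "hybrid-rrf", "hyde", "query-rewriter", "pageindex"}

strategy = sys.argv[1] if len(sys.argv) > 1 else "simple"
if strategy not in STRATEGIES:
    raise SystemExit(f"Unknown strategy {strategy!r}; choose from {sorted(STRATEGIES)}")

results = {
    "strategy": strategy,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Placeholder: a real run fills these from the RAGAS evaluation.
    "metrics": {
        "faithfulness": None,
        "answer_relevancy": None,
        "context_precision": None,
        "context_recall": None,
    },
}
with open(f"results_{strategy}.json", "w") as f:
    json.dump(results, f, indent=2)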

Research Contributions

  • Systematic Evaluation: RAGAS-based assessment of RAG architectures in the medical domain
  • Multiple Architectures: Comparison of 6 distinct RAG retrieval strategies
  • Model Diversity: Evaluation across general-purpose and specialized medical language models
  • Reproducible Benchmark: Complete pipeline from data processing to evaluation with detailed results

Next Steps

Installation

Complete installation guide

Core Concepts

Understand the fundamentals

Evaluation Guide

Run your first benchmark
