
Obstetrics RAG Benchmark

A systematic evaluation of Retrieval-Augmented Generation (RAG) architectures applied to medical question-answering in the obstetrics domain. This research project benchmarks multiple RAG strategies across various Large Language Models using the RAGAS evaluation framework.

Overview

This project investigates the effectiveness of different RAG retrieval strategies for medical Q&A, specifically focusing on pregnancy and childbirth guidance. We implement and evaluate six distinct RAG architectures, comparing their performance across multiple state-of-the-art language models.

Quick Start

Get up and running with your first evaluation in minutes

RAG Architectures

Learn about the different RAG strategies we benchmark

Evaluation Framework

Understand RAGAS metrics and how we measure performance

API Reference

Explore the complete API documentation

Key Features

Multiple RAG Architectures

6 RAG Strategies: Simple Semantic, Hybrid (BM25 + Semantic), Hybrid-RRF, HyDE, Query Rewriter, and PageIndex
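
To make the hybrid idea concrete, here is a minimal sketch of Reciprocal Rank Fusion, the merging step behind the Hybrid-RRF strategy. The function and document IDs are illustrative, not the project's actual implementation; k=60 is the conventional constant from the original RRF paper.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs (best first) into one list.

    Each document's score is the sum of 1 / (k + rank) over every list
    it appears in, so documents ranked highly by several retrievers
    rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a keyword (BM25) ranking with a semantic ranking.
bm25 = ["doc3", "doc1", "doc7"]
semantic = ["doc1", "doc5", "doc3"]
print(reciprocal_rank_fusion([bm25, semantic]))  # doc1 and doc3 lead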

RAGAS Evaluation

4 Core Metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall
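
A minimal sketch of scoring one sample on these four metrics with the ragas library (imports and column names vary across RAGAS versions; the sample content below is illustrative):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluated sample: question, generated answer, retrieved contexts,
# and a reference answer (needed for context precision/recall).
sample = {
    "question": ["When is the fetal anatomy scan usually performed?"],
    "answer": ["Typically between 18 and 22 weeks of gestation."],
    "contexts": [[
        "The fetal anatomy ultrasound is commonly scheduled between "
        "18 and 22 weeks of pregnancy."
    ]],
    "ground_truth": ["Around 18 to 22 weeks of gestation."],
}

scores = evaluate(
    Dataset.from_dict(sample),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # each metric is reported in [0, 1]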

Multi-Model Support

Multiple LLMs: Default models (GPT-4o, GPT-3.5-turbo) plus an extensible registry supporting GPT-5, GPT-5.2, MediPhi, and MedGemma
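
An extensible registry can be as simple as a mapping from names to factories. A hypothetical sketch of the pattern (the project's real registry may differ):
from langchain_openai import ChatOpenAI

# Hypothetical registry: adding a model means adding one entry here,
# without touching the evaluation code.
MODEL_REGISTRY = {
    "gpt-4o": lambda: ChatOpenAI(model="gpt-4o", temperature=0),
    "gpt-3.5-turbo": lambda: ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
}

def get_model(name):
    if name not in MODEL_REGISTRY:
        raise ValueError(f"Unknown model {name!r}; registered: {sorted(MODEL_REGISTRY)}")
    return MODEL_REGISTRY[name]()

llm = get_model("gpt-4o")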

Vector Search

ChromaDB + OpenAI: Persistent vector store with OpenAI text-embedding-3-small
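
Opening the persistent store and querying it takes a few lines with LangChain's Chroma integration; the collection name and directory below are assumptions, not the project's actual values:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Open (or create) a persistent ChromaDB collection backed by
# OpenAI's text-embedding-3-small model.
store = Chroma(
    collection_name="obstetrics",       # assumed name
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",    # assumed path
)

docs = store.similarity_search("warning signs of preeclampsia", k=4)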

LangChain Pipeline

Production-Ready: Built on LangChain for reliable retrieval and generation
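
A minimal end-to-end retrieve-then-generate chain in LangChain's expression language, reusing the assumed store from the previous sketch (prompt wording and model choice are illustrative):
from langchain_chroma import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

store = Chroma(
    collection_name="obstetrics",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db",
)
retriever = store.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    # Join retrieved chunks into a single context string for the prompt.
    return "\n\n".join(d.page_content for d in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)
print(chain.invoke("What are the warning signs of preeclampsia?"))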

Comprehensive Results

Detailed Analytics: JSON output with timestamps and a question-by-question breakdown

Research Focus

This project addresses several key questions in the RAG domain:
  • How do different retrieval strategies (semantic, hybrid, hypothetical embeddings, query reformulation) compare in medical Q&A scenarios?
  • What is the impact of model selection on RAG performance in specialized domains?
  • How do we quantitatively assess retrieval quality and generation faithfulness without manual annotation?
  • Which RAG configuration produces the highest quality responses for obstetrics-related questions?

Use Cases

  • Benchmark RAG techniques for healthcare applications, compare retrieval strategies for domain-specific knowledge bases, and establish baseline performance metrics for medical Q&A systems.
  • Evaluate different RAG strategies side-by-side, identify optimal configurations for your use case, and understand trade-offs between retrieval approaches.
  • Compare multiple language models on the same task, assess model-specific performance variations, and identify the best model for your requirements.
  • Learn RAG implementation patterns, understand evaluation methodologies, and explore best practices for knowledge-augmented generation.

Getting Started

1. Install Dependencies

Clone the repository and install Python dependencies including LangChain, ChromaDB, and RAGAS.
git clone https://github.com/JhonHander/obstetrics-rag-benchmark.git
cd obstetrics-rag-benchmark
pip install -r requirements.txt

2. Configure API Keys

Set up your OpenAI API key for embeddings and LLM access.
echo "OPENAI_API_KEY=your_key_here" > .env

3. Create Embeddings

Generate vector embeddings from the medical text corpus.
python scripts/create_embeddings.py
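
Under the hood, an embedding script of this kind typically loads the corpus, splits it into overlapping chunks, and writes them to the persistent store. A sketch of that flow (the file path, chunk sizes, and collection name are assumptions, not the script's actual values):
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = TextLoader("data/obstetrics_corpus.txt").load()  # assumed path
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200  # assumed sizes
).split_documents(docs)

# Embed the chunks and persist them to the ChromaDB collection.
Chroma.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="obstetrics",
    persist_directory="./chroma_db",
)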

4. Run Evaluation

Execute your first RAG evaluation and view results.
python scripts/run_evaluation.py hybrid
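
The positional argument selects the retrieval strategy, and results are written to a timestamped JSON file. A hypothetical sketch of that outer loop (the strategy identifiers other than hybrid and the output layout are guesses, not the script's actual interface):
import json
import sys
from datetime import datetime, timezone

STRATEGIES = {"simple", "hybrid", "hybrid-rrf", "hyde", "query-rewriter", "pageindex"}

strategy = sys.argv[1] if len(sys.argv) > 1 else "simple"
if strategy not in STRATEGIES:
    raise SystemExit(f"Unknown strategy {strategy!r}; choose from {sorted(STRATEGIES)}")

results = {
    "strategy": strategy,
    "timestamp": datetime.now(timezone.utc).isoformat(),
    # Placeholder: a real run fills these from the RAGAS evaluation.
    "metrics": {
        "faithfulness": None,
        "answer_relevancy": None,
        "context_precision": None,
        "context_recall": None,
    },
}
with open(f"results_{strategy}.json", "w") as f:
    json.dump(results, f, indent=2)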

Research Contributions

  • Systematic Evaluation: RAGAS-based assessment of RAG architectures in the medical domain
  • Multiple Architectures: Comparison of 6 distinct RAG retrieval strategies
  • Model Diversity: Evaluation across general-purpose and specialized medical language models
  • Reproducible Benchmark: Complete pipeline from data processing to evaluation with detailed results

Next Steps

Installation

Complete installation guide

Core Concepts

Understand the fundamentals

Evaluation Guide

Run your first benchmark
