Get your first RAG evaluation running in minutes. This guide walks you through installation, setup, and running your first benchmark.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher installed on your system (a quick check follows below)
  • An OpenAI API key for embeddings and LLM access
  • Git for cloning the repository
  • 10-15 minutes for the initial setup
The initial embedding generation takes 2-3 minutes. Subsequent evaluations run much faster (30 seconds to 2 minutes, depending on the RAG architecture).
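If you are unsure which interpreter your shell resolves to, this standard-library snippet (not part of the repository) verifies the Python 3.8+ prerequisite:

import sys

# Confirm the interpreter that will run the scripts meets the 3.8+ requirement.
print(sys.version)
assert sys.version_info >= (3, 8), "Python 3.8 or newer is required"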

Quick Setup

1. Clone the Repository

Clone the project and navigate to the directory:
git clone https://github.com/JhonHander/obstetrics-rag-benchmark.git
cd obstetrics-rag-benchmark
2. Install Dependencies

Install all required Python packages:
pip install -r requirements.txt
This installs LangChain, ChromaDB, RAGAS, and all necessary dependencies.
3. Configure API Key

Create a .env file in the project root with your OpenAI API key:
echo "OPENAI_API_KEY=your_api_key_here" > .env
Replace your_api_key_here with your actual OpenAI API key from platform.openai.com.
4. Generate Embeddings

Create vector embeddings from the medical text corpus:
python scripts/create_embeddings.py
This step loads text chunks from data/chunks/chunks_final.json, generates embeddings using OpenAI’s text-embedding-3-small model, and stores them in ChromaDB at data/embeddings/chroma_db/.
Expected output:
Loading chunks from: data/chunks/chunks_final.json
Loaded 156 chunks
Creating embeddings and storing in ChromaDB...
Successfully stored 156 chunks in ChromaDB
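The script's flow corresponds roughly to the sketch below, assuming current LangChain/Chroma import paths (older LangChain releases expose Chroma under langchain_community.vectorstores); the chunk field names are an assumption about the JSON layout, not copied from the repository:

import json
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Load the pre-chunked corpus (path taken from the script's own output above).
with open("data/chunks/chunks_final.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Field names ("text", "metadata") are assumptions about the JSON layout.
docs = [Document(page_content=c["text"], metadata=c.get("metadata", {})) for c in chunks]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Persist the vectors to the same ChromaDB location the script reports.
vectordb = Chroma.from_documents(
    docs,
    embedding=embeddings,
    persist_directory="data/embeddings/chroma_db",
)
print(f"Stored {len(docs)} chunks")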
5. Run Your First Evaluation

Execute a RAG evaluation using the Hybrid architecture:
python scripts/run_evaluation.py hybrid
This will:
  • Load 10 medical questions from the evaluation dataset
  • Run the Hybrid RAG (BM25 + Semantic) on each question (see the retrieval sketch after this list)
  • Generate answers using GPT-4o
  • Evaluate with RAGAS metrics
  • Display results and save to results/
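A minimal sketch of how such a BM25 + semantic combination can be wired with LangChain's EnsembleRetriever, assuming the packages pulled in by requirements.txt (including rank_bm25 for the lexical side); retrieval parameters, weights, and the query below are illustrative, not the project's actual settings:

import json
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Semantic side: reuse the persisted Chroma store created in the embeddings step.
vectordb = Chroma(
    persist_directory="data/embeddings/chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
semantic = vectordb.as_retriever(search_kwargs={"k": 5})

# Lexical side: BM25 over the raw chunks (field name "text" is an assumption).
with open("data/chunks/chunks_final.json", encoding="utf-8") as f:
    chunks = json.load(f)
bm25 = BM25Retriever.from_documents([Document(page_content=c["text"]) for c in chunks])
bm25.k = 5

# Blend both rankings; equal weights here are illustrative, not the benchmark's values.
hybrid = EnsembleRetriever(retrievers=[bm25, semantic], weights=[0.5, 0.5])
contexts = hybrid.invoke("What are the warning signs of preeclampsia?")  # illustrative query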
6. View Results

Check the console output for metric scores:
==================== RAGAS Evaluation Results ====================
RAG Type: hybrid
Model: gpt-4o

Faithfulness:       0.8542
Answer Relevancy:   0.7891
Context Precision:  0.9123
Context Recall:     0.7654
Results are also saved as timestamped JSON files in the results/ directory.
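To work with the saved scores programmatically, something like the following is enough; the exact file naming inside results/ is not assumed, so the sketch simply picks the newest JSON file:

import json
from pathlib import Path

# Pick the most recently written results file; the filename pattern is not assumed.
latest = max(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)
with latest.open(encoding="utf-8") as f:
    results = json.load(f)

print(latest.name)
print(json.dumps(results, indent=2)[:500])  # preview the first part of the report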

Try Different RAG Architectures

Now that you have the system running, try evaluating different RAG strategies:
python scripts/run_evaluation.py simple

Understanding the Output

Each evaluation provides four key metrics:
Metric | Range | What It Measures
Faithfulness | 0.0 - 1.0 | How well the answer is grounded in the retrieved context (lower scores indicate more hallucination)
Answer Relevancy | 0.0 - 1.0 | How directly the answer addresses the question
Context Precision | 0.0 - 1.0 | Proportion of retrieved context that is relevant
Context Recall | 0.0 - 1.0 | Completeness of retrieval (did we get all relevant information?)
Higher scores are better for all metrics. Scores above 0.8 indicate excellent performance.
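These scores come from the RAGAS library. Below is a hedged sketch of how a single question/answer/context record is scored; the column names follow RAGAS's classic schema, and newer releases rename them (user_input, response, retrieved_contexts, reference), so adjust to the installed version:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One toy record; the real evaluation scores the 10 benchmark questions.
# evaluate() calls a judge LLM, so OPENAI_API_KEY must be set in the environment.
data = Dataset.from_dict({
    "question": ["<evaluation question>"],
    "answer": ["<answer produced by the RAG pipeline>"],
    "contexts": [["<retrieved chunk 1>", "<retrieved chunk 2>"]],
    "ground_truth": ["<reference answer>"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)  # per-metric averages in the 0.0 - 1.0 range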

Next Steps

  • Compare Architectures: learn about the 6 different RAG strategies and when to use each
  • Run Benchmarks: compare all RAG architectures across multiple models
  • Extend the Research: add your own RAG architectures or models

Available Commands

Here’s a quick reference of evaluation commands:
Command | Description
python scripts/run_evaluation.py simple | Evaluate Simple Semantic RAG
python scripts/run_evaluation.py hybrid | Evaluate Hybrid RAG (BM25 + Semantic)
python scripts/run_evaluation.py hyde | Evaluate HyDE RAG
python scripts/run_evaluation.py rewriter | Evaluate Query Rewriter RAG
python scripts/run_evaluation.py multi-model hybrid | Run Hybrid RAG across all available models
python scripts/run_evaluation.py all-models-all-rags | Comprehensive benchmark (all RAGs × all models)
The comprehensive benchmark (all-models-all-rags) evaluates all RAG architectures across multiple models and can take 15-30 minutes to complete. It also incurs higher API costs.
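If you want to batch several single-architecture runs without the full comprehensive benchmark, a small convenience loop (not part of the repository) works; the architecture names are taken from the table above:

import subprocess

# Run a subset of architectures back to back, stopping on the first failure.
for rag in ["simple", "hybrid", "hyde", "rewriter"]:
    subprocess.run(["python", "scripts/run_evaluation.py", rag], check=True)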

Troubleshooting

If you see OPENAI_API_KEY not found, ensure the following (a quick check is sketched after this list):
  • Your .env file exists in the project root
  • The key is formatted as OPENAI_API_KEY=sk-...
  • There are no quotes around the key value
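A quick way to confirm Python can actually see the key, assuming python-dotenv is installed as part of the requirements:

import os
from dotenv import load_dotenv  # python-dotenv is assumed to be in requirements.txt

load_dotenv()  # reads .env from the current working directory
key = os.getenv("OPENAI_API_KEY")
if key:
    print(f"Key found, starts with: {key[:6]}...")
else:
    print("OPENAI_API_KEY is not visible; check the .env file location and format")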
If embeddings aren’t found, run the embedding creation step:
python scripts/create_embeddings.py
If you get import errors, reinstall dependencies:
pip install --upgrade -r requirements.txt

What You’ve Accomplished

You’ve successfully:
  • ✅ Installed the Obstetrics RAG Benchmark
  • ✅ Generated vector embeddings for medical text
  • ✅ Run your first RAG evaluation
  • ✅ Viewed RAGAS metrics for retrieval and generation quality
Ready to dive deeper? Explore the Core Concepts to understand how RAG architectures work, or jump into Running Evaluations for advanced benchmarking options.
