Get your first RAG evaluation running in minutes. This guide walks you through installation, setup, and running your first benchmark.

Prerequisites

Before you begin, ensure you have:
  • Python 3.8 or higher installed on your system (a quick check follows below)
  • An OpenAI API key for embeddings and LLM access
  • Git for cloning the repository
  • 10-15 minutes for the initial setup
The initial embedding generation takes 2-3 minutes. Subsequent evaluations run much faster (30 seconds to 2 minutes, depending on the RAG architecture).
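If you are unsure which interpreter your shell resolves to, this standard-library snippet (not part of the repository) verifies the Python 3.8+ prerequisite:

import sys

# Confirm the interpreter that will run the scripts meets the 3.8+ requirement.
print(sys.version)
assert sys.version_info >= (3, 8), "Python 3.8 or newer is required"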

Quick Setup

1. Clone the Repository

Clone the project and navigate to the directory:
git clone https://github.com/JhonHander/obstetrics-rag-benchmark.git
cd obstetrics-rag-benchmark
2. Install Dependencies

Install all required Python packages:
pip install -r requirements.txt
This installs LangChain, ChromaDB, RAGAS, and all necessary dependencies.
3. Configure API Key

Create a .env file in the project root with your OpenAI API key:
echo "OPENAI_API_KEY=your_api_key_here" > .env
Replace your_api_key_here with your actual OpenAI API key from platform.openai.com.
4. Generate Embeddings

Create vector embeddings from the medical text corpus:
python scripts/create_embeddings.py
This step loads text chunks from data/chunks/chunks_final.json, generates embeddings using OpenAI’s text-embedding-3-small model, and stores them in ChromaDB at data/embeddings/chroma_db/.
Expected output:
Loading chunks from: data/chunks/chunks_final.json
Loaded 156 chunks
Creating embeddings and storing in ChromaDB...
Successfully stored 156 chunks in ChromaDB
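The script's flow corresponds roughly to the sketch below, assuming current LangChain/Chroma import paths (older LangChain releases expose Chroma under langchain_community.vectorstores); the chunk field names are an assumption about the JSON layout, not copied from the repository:

import json
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Load the pre-chunked corpus (path taken from the script's own output above).
with open("data/chunks/chunks_final.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Field names ("text", "metadata") are assumptions about the JSON layout.
docs = [Document(page_content=c["text"], metadata=c.get("metadata", {})) for c in chunks]

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Persist the vectors to the same ChromaDB location the script reports.
vectordb = Chroma.from_documents(
    docs,
    embedding=embeddings,
    persist_directory="data/embeddings/chroma_db",
)
print(f"Stored {len(docs)} chunks")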
5. Run Your First Evaluation

Execute a RAG evaluation using the Hybrid architecture:
python scripts/run_evaluation.py hybrid
This will:
  • Load 10 medical questions from the evaluation dataset
  • Run the Hybrid RAG (BM25 + Semantic) on each question (see the retrieval sketch after this list)
  • Generate answers using GPT-4o
  • Evaluate with RAGAS metrics
  • Display results and save to results/
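A minimal sketch of how such a BM25 + semantic combination can be wired with LangChain's EnsembleRetriever, assuming the packages pulled in by requirements.txt (including rank_bm25 for the lexical side); retrieval parameters, weights, and the query below are illustrative, not the project's actual settings:

import json
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Semantic side: reuse the persisted Chroma store created in the embeddings step.
vectordb = Chroma(
    persist_directory="data/embeddings/chroma_db",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
)
semantic = vectordb.as_retriever(search_kwargs={"k": 5})

# Lexical side: BM25 over the raw chunks (field name "text" is an assumption).
with open("data/chunks/chunks_final.json", encoding="utf-8") as f:
    chunks = json.load(f)
bm25 = BM25Retriever.from_documents([Document(page_content=c["text"]) for c in chunks])
bm25.k = 5

# Blend both rankings; equal weights here are illustrative, not the benchmark's values.
hybrid = EnsembleRetriever(retrievers=[bm25, semantic], weights=[0.5, 0.5])
contexts = hybrid.invoke("What are the warning signs of preeclampsia?")  # illustrative query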
6. View Results

Check the console output for metric scores:
==================== RAGAS Evaluation Results ====================
RAG Type: hybrid
Model: gpt-4o

Faithfulness:       0.8542
Answer Relevancy:   0.7891
Context Precision:  0.9123
Context Recall:     0.7654
Results are also saved as timestamped JSON files in the results/ directory.
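To work with the saved scores programmatically, something like the following is enough; the exact file naming inside results/ is not assumed, so the sketch simply picks the newest JSON file:

import json
from pathlib import Path

# Pick the most recently written results file; the filename pattern is not assumed.
latest = max(Path("results").glob("*.json"), key=lambda p: p.stat().st_mtime)
with latest.open(encoding="utf-8") as f:
    results = json.load(f)

print(latest.name)
print(json.dumps(results, indent=2)[:500])  # preview the first part of the report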

Try Different RAG Architectures

Now that you have the system running, try evaluating different RAG strategies:
python scripts/run_evaluation.py simple

Understanding the Output

Each evaluation provides four key metrics:
Metric | Range | What It Measures
Faithfulness | 0.0 - 1.0 | How well the answer is grounded in the retrieved context (lower scores indicate more hallucination)
Answer Relevancy | 0.0 - 1.0 | How directly the answer addresses the question
Context Precision | 0.0 - 1.0 | Proportion of retrieved context that is relevant
Context Recall | 0.0 - 1.0 | Completeness of retrieval (did we get all relevant information?)
Higher scores are better for all metrics. Scores above 0.8 indicate excellent performance.
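These scores come from the RAGAS library. Below is a hedged sketch of how a single question/answer/context record is scored; the column names follow RAGAS's classic schema, and newer releases rename them (user_input, response, retrieved_contexts, reference), so adjust to the installed version:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# One toy record; the real evaluation scores the 10 benchmark questions.
# evaluate() calls a judge LLM, so OPENAI_API_KEY must be set in the environment.
data = Dataset.from_dict({
    "question": ["<evaluation question>"],
    "answer": ["<answer produced by the RAG pipeline>"],
    "contexts": [["<retrieved chunk 1>", "<retrieved chunk 2>"]],
    "ground_truth": ["<reference answer>"],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)  # per-metric averages in the 0.0 - 1.0 range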

Next Steps

  • Compare Architectures: learn about the 6 different RAG strategies and when to use each
  • Run Benchmarks: compare all RAG architectures across multiple models
  • Extend the Research: add your own RAG architectures or models

Available Commands

Here’s a quick reference of evaluation commands:
Command | Description
python scripts/run_evaluation.py simple | Evaluate Simple Semantic RAG
python scripts/run_evaluation.py hybrid | Evaluate Hybrid RAG (BM25 + Semantic)
python scripts/run_evaluation.py hyde | Evaluate HyDE RAG
python scripts/run_evaluation.py rewriter | Evaluate Query Rewriter RAG
python scripts/run_evaluation.py multi-model hybrid | Run Hybrid RAG across all available models
python scripts/run_evaluation.py all-models-all-rags | Comprehensive benchmark (all RAGs × all models)
The comprehensive benchmark (all-models-all-rags) evaluates all RAG architectures across multiple models and can take 15-30 minutes to complete. It also incurs higher API costs.
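If you want to batch several single-architecture runs without the full comprehensive benchmark, a small convenience loop (not part of the repository) works; the architecture names are taken from the table above:

import subprocess

# Run a subset of architectures back to back, stopping on the first failure.
for rag in ["simple", "hybrid", "hyde", "rewriter"]:
    subprocess.run(["python", "scripts/run_evaluation.py", rag], check=True)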

Troubleshooting

If you see OPENAI_API_KEY not found, ensure the following (a quick check is sketched after this list):
  • Your .env file exists in the project root
  • The key is formatted as OPENAI_API_KEY=sk-...
  • There are no quotes around the key value
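A quick way to confirm Python can actually see the key, assuming python-dotenv is installed as part of the requirements:

import os
from dotenv import load_dotenv  # python-dotenv is assumed to be in requirements.txt

load_dotenv()  # reads .env from the current working directory
key = os.getenv("OPENAI_API_KEY")
if key:
    print(f"Key found, starts with: {key[:6]}...")
else:
    print("OPENAI_API_KEY is not visible; check the .env file location and format")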
If embeddings aren’t found, run the embedding creation step:
python scripts/create_embeddings.py
If you get import errors, reinstall dependencies:
pip install --upgrade -r requirements.txt

What You’ve Accomplished

You’ve successfully:
  • ✅ Installed the Obstetrics RAG Benchmark
  • ✅ Generated vector embeddings for medical text
  • ✅ Run your first RAG evaluation
  • ✅ Viewed RAGAS metrics for retrieval and generation quality
Ready to dive deeper? Explore the Core Concepts to understand how RAG architectures work, or jump into Running Evaluations for advanced benchmarking options.
