Overview
This project implements and compares six distinct RAG architectures, each representing a different approach to retrieving and using medical knowledge. Understanding these architectures is crucial for selecting the right strategy for your use case.

Performance vs. Complexity Trade-off

More sophisticated architectures often deliver better results, but at the cost of increased latency, token usage, and complexity. The benchmark helps identify which trade-offs are worthwhile for medical Q&A.
Architecture Comparison
Simple Semantic
Baseline vector similarity search
Hybrid Search
BM25 lexical + semantic fusion
Hybrid + RRF
Reciprocal Rank Fusion with MMR
HyDE
Hypothetical document generation
Query Rewriter
Multi-query reformulation
PageIndex
External retrieval API
1. Simple Semantic RAG
Architecture Details
Strategy
Direct vector similarity matching using dense embeddings. This is the baseline approach that all other architectures are compared against.

How It Works
- Convert the user’s question into a dense embedding vector
- Search the ChromaDB vector store for documents with similar embeddings
- Return the top-k most similar documents (k=5 by default)
- Use these documents as context for the LLM to generate an answer
Implementation
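The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration only: a toy in-memory index stands in for the project's ChromaDB store, and `cosine` and `retrieve` are names invented here, not the project's actual API.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, index, k=5):
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

# Toy in-memory "vector store" standing in for ChromaDB
index = [
    {"text": "doc A", "embedding": [1.0, 0.0]},
    {"text": "doc B", "embedding": [0.9, 0.1]},
    {"text": "doc C", "embedding": [0.0, 1.0]},
]
print(retrieve([1.0, 0.0], index, k=2))
```

In the real pipeline the query vector comes from the same embedding model used to index the corpus, and the top-k texts are passed to the LLM as context for answer generation.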
Characteristics
Strengths:
- Fast and efficient (single retrieval operation)
- Simple to implement and maintain
- Low token cost (no additional LLM calls)
- Works well when queries are semantically similar to document content
Weaknesses:
- May miss documents with different wording but same meaning
- Cannot handle vocabulary mismatch between query and documents
- No keyword-based fallback for technical terms
- Limited by embedding model’s understanding
When to Use
- Starting point for any RAG project (baseline)
- When latency is critical
- When token budget is limited
- When queries naturally match document language
Performance Metrics
- Retrieval Time: ~100-200ms
- Token Usage: Only answer generation tokens
- Best For: General medical questions with standard terminology
2. Hybrid RAG (BM25 + Semantic)
Architecture Details
Strategy
Combines lexical search (BM25) with semantic search using ensemble retrieval. This addresses vocabulary mismatch by using both keyword matching and semantic understanding.

How It Works
- BM25 Retrieval: Score documents based on keyword frequency and rarity (TF-IDF-like)
- Semantic Retrieval: Score documents based on embedding similarity
- Ensemble Fusion: Combine both ranked lists with configurable weights
- Final Selection: Return top-k documents from the fused ranking
Implementation
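The fusion step can be sketched as follows, assuming each retriever hands back a raw score per document. `weighted_fusion` and its normalisation strategy are illustrative, not the project's actual ensemble code.

```python
def weighted_fusion(bm25_scores, semantic_scores, w_bm25=0.5, w_sem=0.5, k=5):
    """Min-max normalise each retriever's scores, then combine with weights."""
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    b, s = normalise(bm25_scores), normalise(semantic_scores)
    fused = {doc: w_bm25 * b.get(doc, 0.0) + w_sem * s.get(doc, 0.0)
             for doc in set(b) | set(s)}
    return sorted(fused, key=fused.get, reverse=True)[:k]

# A document scored highly by either retriever can surface in the fused list
print(weighted_fusion({"a": 2.0, "b": 1.0, "c": 0.0},
                      {"c": 1.0, "b": 0.5, "a": 0.0},
                      w_bm25=0.7, w_sem=0.3, k=3))
```

Normalisation matters here because BM25 and cosine scores live on different scales; without it, one retriever silently dominates the ensemble.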
Characteristics
Strengths:
- Handles both semantic similarity and exact keyword matches
- Better coverage for technical medical terms
- More robust to query variations
- Balances precision and recall
Weaknesses:
- Slightly slower than simple semantic (two retrievals)
- Requires loading full document corpus for BM25
- Weight tuning may be needed for optimal results
- No diversity control (may retrieve similar documents)
When to Use
- Medical domains with technical terminology
- When users ask questions with specific terms or acronyms
- When semantic search alone misses relevant documents
- General improvement over baseline at minimal cost
Configuration
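Illustrative parameter values only; the benchmark's actual setting names and weights may differ, and the right balance depends on your corpus.

```python
# Illustrative ensemble settings; tune per corpus
BM25_WEIGHT = 0.4      # lexical (keyword) contribution
SEMANTIC_WEIGHT = 0.6  # embedding contribution
TOP_K = 5              # documents returned after fusion
```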
3. Hybrid RAG + RRF
Architecture Details
Strategy
Enhanced hybrid search using Reciprocal Rank Fusion (RRF) for better rank aggregation, plus Maximal Marginal Relevance (MMR) for diversity.

How It Works
- Retrieve More Candidates: Get top-15 from both BM25 and semantic retrievers
- RRF Fusion: Combine rankings using reciprocal rank scores
- MMR Selection: Apply diversity filter to avoid redundant context
- Final Selection: Return top-5 diverse, relevant documents
Reciprocal Rank Fusion Formula

RRF_score(doc) = Σᵢ 1 / (k + rankᵢ(doc))

where:
- `k` is a constant (typically 60)
- `rankᵢ(doc)` is the rank of the document in retriever `i`
- Documents appearing in multiple rankings get boosted scores
Implementation
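The two stages can be sketched as plain functions. `rrf_fuse` implements the formula above; `mmr_select` greedily trades relevance against redundancy. Both names are invented for this illustration, and the similarity inputs are precomputed stand-ins for real embedding similarities.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank_i))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def mmr_select(candidates, sim_to_query, sim_between, lam=0.7, n=5):
    """Maximal Marginal Relevance: greedily pick docs that are relevant
    to the query but dissimilar to documents already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        best = max(pool, key=lambda d: lam * sim_to_query[d]
                   - (1 - lam) * max((sim_between.get((d, s), 0.0)
                                      for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

Note how a document ranked first by both retrievers accumulates two reciprocal-rank contributions, which is exactly the boosting effect described above; MMR then drops near-duplicates from the fused list.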
Characteristics
Strengths:
- Better rank fusion than simple score averaging
- Boosts documents that appear in multiple retrievers
- MMR ensures diverse context (reduces redundancy)
- More sophisticated than basic hybrid
Weaknesses:
- Additional embedding calls for MMR computation
- More complex to implement and tune
- Slightly higher latency
- Benefits may be marginal for small document sets
When to Use
- When document redundancy is a problem
- Large knowledge bases with overlapping content
- When you need both relevance and diversity
- Research/evaluation scenarios requiring best-possible retrieval
Tuning Parameters
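Illustrative values only; the benchmark's actual parameter names may differ, and each knob interacts with corpus size and redundancy.

```python
# Illustrative tuning values; adjust against your own corpus
RRF_K = 60        # rank-smoothing constant in the RRF formula
FETCH_K = 15      # candidates pulled from each retriever before fusion
FINAL_K = 5       # documents kept after MMR selection
MMR_LAMBDA = 0.7  # 1.0 = pure relevance, 0.0 = pure diversity
```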
4. HyDE RAG
Architecture Details
Strategy
Hypothetical Document Embeddings: generate a hypothetical answer to the question, then search for documents similar to that answer rather than the question itself.

How It Works
- Generate Hypothetical Answer: Use an LLM to write what a perfect answer would look like
- Embed the Answer: Convert this hypothetical document to a vector
- Search with Answer: Find real documents similar to the hypothetical answer
- Generate Final Answer: Use retrieved documents to produce the actual answer
The Key Insight
Problem: Questions and answers often use different language. “What is the ideal number of prenatal visits?” vs. “A primigravida should have 10 prenatal appointments…”

Solution: Search for answer-like text, not question-like text.

Implementation
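A sketch of the HyDE retrieval flow, with toy stand-ins for the LLM, the embedder, and the vector search so it runs self-contained. Every name here is illustrative, not the project's actual code.

```python
def hyde_retrieve(question, generate, embed, search, k=5):
    """HyDE: embed a hypothetical *answer*, then search with that vector."""
    prompt = ("Write a short, plausible medical passage that directly "
              f"answers this question: {question}")
    hypothetical_doc = generate(prompt)  # first LLM call; may hallucinate,
                                         # but is only ever used for retrieval
    return search(embed(hypothetical_doc), k)

# Toy stand-ins so the sketch runs without an LLM or vector store
corpus = {
    "A primigravida should have about 10 prenatal appointments.": [1.0, 0.0],
    "Influenza vaccines are updated annually.": [0.0, 1.0],
}

def generate(prompt):
    return "Prenatal appointments for a primigravida are usually scheduled ten times."

def embed(text):
    return [1.0, 0.0] if "prenatal" in text.lower() else [0.0, 1.0]

def search(vec, k):
    ranked = sorted(corpus, key=lambda t: sum(a * b for a, b in zip(vec, corpus[t])),
                    reverse=True)
    return ranked[:k]

print(hyde_retrieve("What is the ideal number of prenatal visits?",
                    generate, embed, search, k=1))
```

The hypothetical passage is discarded after retrieval; the second LLM call generates the final answer from the real retrieved documents only.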
Characteristics
Strengths:
- Bridges vocabulary gap between questions and answers
- Better retrieval when queries don’t match document language
- Particularly effective for “how-to” and explanatory questions
- Can retrieve more relevant passages
Weaknesses:
- Requires two LLM calls (costly in tokens and time)
- Hypothetical document may contain hallucinations
- More complex error handling
- Higher latency (2x LLM calls + retrieval)
When to Use
- Complex medical questions requiring detailed explanations
- When simple semantic search retrieves poor results
- Questions phrased differently than source documents
- When token cost is not the primary concern
Model Configuration
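A common pattern, sketched with hypothetical setting names, is to use a cheaper model for the hypothetical document and a stronger one for the final answer; the benchmark's actual model choices may differ.

```python
# Hypothetical settings; the benchmark's real model names may differ
HYDE_CONFIG = {
    "hypothesis_model": "small-instruct-model",  # cheap model drafts the fake answer
    "answer_model": "larger-instruct-model",     # stronger model writes the final answer
    "hypothesis_max_tokens": 256,                # keep the hypothetical doc short
}
```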
Cost Analysis
- Token Overhead: ~2-3x compared to simple semantic
- Latency Overhead: ~2x due to sequential LLM calls
- Quality Improvement: Typically 5-15% on retrieval metrics
5. Query Rewriter RAG
Architecture Details
Strategy
Multi-Query Reformulation: generate multiple variations of the user’s question, retrieve documents for each variation, then fuse the results with relevance-based ranking.

How It Works
- Generate Query Variations: Create 3 different phrasings of the question
- Standalone reformulation
- Synonym replacement
- Expanded version with related aspects
- Parallel Retrieval: Retrieve documents for each query variation
- Deduplication & Ranking: Combine results with weighted relevance scoring
- Generate Answer: Use the diverse, high-quality context
Three Rewriting Strategies
Implementation
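The reformulate-retrieve-fuse loop can be sketched as follows, with toy stand-ins for the rewriting LLM and the retriever. Names, the rank-based scoring, and the weights are illustrative, not the project's actual implementation.

```python
def multi_query_retrieve(question, rewrite, retrieve_fn, weights=(1.0, 0.8, 0.6), k=5):
    """Retrieve once per query variation, then deduplicate with
    rank- and weight-based scoring (earlier variations count more)."""
    scores = {}
    for weight, query in zip(weights, rewrite(question)):
        for rank, doc in enumerate(retrieve_fn(query), start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy stand-ins so the sketch runs without an LLM or vector store
def rewrite(question):
    return ["standalone form", "synonym form"]

hits = {"standalone form": ["a", "b"], "synonym form": ["b", "c"]}

def retrieve_fn(query):
    return hits[query]

print(multi_query_retrieve("original question?", rewrite, retrieve_fn,
                           weights=(1.0, 0.8), k=3))
```

A document returned by several variations accumulates score from each, so consensus hits rise to the top even if no single query ranked them first.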
Characteristics
Strengths:
- Comprehensive coverage of different query interpretations
- Discovers documents missed by single query
- Handles ambiguous questions better
- More robust to query phrasing
Weaknesses:
- Multiple LLM calls for rewriting (3 rewrites + 1 answer)
- Multiple retrieval operations (3-4 retrievals)
- Higher token and latency costs
- Complexity in managing multiple retrievals
When to Use
- Complex, ambiguous medical questions
- When single-query retrieval is insufficient
- Research scenarios requiring maximum recall
- When you need diverse perspectives on a topic
Query Weighting Strategy
Earlier query reformulations are weighted higher when fusing the retrieved results.

Performance Metrics
- LLM Calls: 4 total (3 rewrites + 1 answer)
- Retrieval Operations: 3 parallel retrievals
- Token Overhead: ~3-4x vs baseline
- Latency: ~2-3x vs baseline
6. PageIndex RAG
Architecture Details
Strategy
Use an external retrieval API (PageIndex) that provides pre-indexed, optimized document retrieval with relevance highlighting.

How It Works
- Submit Query: Send question to PageIndex API
- Wait for Processing: Poll for completion (async retrieval)
- Extract Contexts: Parse retrieved nodes and relevant content snippets
- Generate Answer: Use PageIndex contexts with local LLM
Implementation
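The submit-then-poll pattern can be sketched as below. The real PageIndex client calls, endpoint names, and response shapes are not shown here, so `submit`, `poll`, and the result fields are stand-ins invented for this illustration.

```python
import time

def pageindex_retrieve(question, submit, poll, interval=0.5, timeout=30.0):
    """Submit a query to an async retrieval service, then poll for the result.
    `submit` and `poll` stand in for the real PageIndex client calls."""
    job_id = submit(question)
    waited = 0.0
    while True:
        result = poll(job_id)
        if result["status"] == "completed":
            return [node["content"] for node in result["nodes"]]
        if waited >= timeout:
            raise TimeoutError("retrieval did not complete in time")
        time.sleep(interval)
        waited += interval

# Toy stand-ins so the sketch runs without the external service
def submit(question):
    return "job-1"

_polls = {"n": 0}
def poll(job_id):
    _polls["n"] += 1
    if _polls["n"] < 2:
        return {"status": "processing"}
    return {"status": "completed", "nodes": [{"content": "relevant snippet"}]}

print(pageindex_retrieve("What causes anemia?", submit, poll, interval=0.01))
```

The extracted contexts are then handed to the local LLM exactly as retrieved documents are in the other architectures.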
Characteristics
Strengths:
- Offloads retrieval complexity to external service
- Potentially optimized retrieval algorithms
- Relevant content highlighting
- No local vector store management
Weaknesses:
- Depends on external service availability
- API latency and rate limits
- Requires document pre-indexing with PageIndex
- Additional cost for API usage
- Less control over retrieval process
When to Use
- When you want managed retrieval infrastructure
- Large document sets that are hard to manage locally
- When PageIndex’s retrieval is demonstrably better
- Production systems needing reliability and scale
Configuration
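Illustrative only; consult the PageIndex documentation for the real setting names. Credentials should come from the environment rather than source code.

```python
import os

# Illustrative setting names; the real PageIndex client may differ
PAGEINDEX_API_KEY = os.environ.get("PAGEINDEX_API_KEY", "")
POLL_INTERVAL_S = 2.0   # seconds between status checks
TIMEOUT_S = 60.0        # give up on the async job after this long
```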
Architecture Comparison Matrix
| Architecture | LLM Calls | Retrievals | Complexity | Token Cost | Latency | Best Use Case |
|---|---|---|---|---|---|---|
| Simple Semantic | 1 | 1 | Low | Low | Fast | Baseline, simple queries |
| Hybrid | 1 | 2 | Medium | Low | Fast | Technical terminology |
| Hybrid + RRF | 1 | 2 | High | Medium | Medium | Diverse context needed |
| HyDE | 2 | 1 | Medium | High | Slow | Question-answer gap |
| Query Rewriter | 4 | 3 | High | High | Slow | Complex/ambiguous queries |
| PageIndex | 1 | 1 (API) | Low | Low + API | Variable | Managed infrastructure |
Which Architecture Should You Choose?
Start Simple
Begin with Simple Semantic RAG to establish baseline performance. It works well for most cases.
Add Lexical Search
Upgrade to Hybrid RAG if you need better handling of technical terms and keywords.
Optimize Quality
Try HyDE or Query Rewriter if retrieval quality is insufficient and token cost is acceptable.
Maximize Performance
Use Hybrid + RRF for the best balance of quality and diversity in research scenarios.
Next Steps
Evaluation Framework
Learn how these architectures are evaluated with RAGAS
Run Benchmarks
Compare architectures on your own dataset
