Overview
This project implements and compares six distinct RAG architectures, each representing a different approach to retrieving and using medical knowledge. Understanding these architectures is crucial for selecting the right strategy for your use case.

Performance vs. Complexity Trade-off

More sophisticated architectures often deliver better results, but at the cost of increased latency, token usage, and complexity. The benchmark helps identify which trade-offs are worthwhile for medical Q&A.
Architecture Comparison
Simple Semantic
Baseline vector similarity search
Hybrid Search
BM25 lexical + semantic fusion
Hybrid + RRF
Reciprocal Rank Fusion with MMR
HyDE
Hypothetical document generation
Query Rewriter
Multi-query reformulation
PageIndex
External retrieval API
1. Simple Semantic RAG
Architecture Details
Strategy
Direct vector similarity matching using dense embeddings. This is the baseline approach that all other architectures are compared against.

How It Works
- Convert the user’s question into a dense embedding vector
- Search the ChromaDB vector store for documents with similar embeddings
- Return the top-k most similar documents (k=5 by default)
- Use these documents as context for the LLM to generate an answer
Implementation
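The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration only: a toy in-memory index stands in for the project's ChromaDB store, and `cosine` and `retrieve` are names invented here, not the project's actual API.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, index, k=5):
    """Return the texts of the k documents most similar to the query vector."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return [d["text"] for d in ranked[:k]]

# Toy in-memory "vector store" standing in for ChromaDB
index = [
    {"text": "doc A", "embedding": [1.0, 0.0]},
    {"text": "doc B", "embedding": [0.9, 0.1]},
    {"text": "doc C", "embedding": [0.0, 1.0]},
]
print(retrieve([1.0, 0.0], index, k=2))
```

In the real pipeline the query vector comes from the same embedding model used to index the corpus, and the top-k texts are passed to the LLM as context for answer generation.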
Characteristics
Strengths:
- Fast and efficient (single retrieval operation)
- Simple to implement and maintain
- Low token cost (no additional LLM calls)
- Works well when queries are semantically similar to document content
Weaknesses:
- May miss documents with different wording but same meaning
- Cannot handle vocabulary mismatch between query and documents
- No keyword-based fallback for technical terms
- Limited by embedding model’s understanding
When to Use
- Starting point for any RAG project (baseline)
- When latency is critical
- When token budget is limited
- When queries naturally match document language
Performance Metrics
- Retrieval Time: ~100-200ms
- Token Usage: Only answer generation tokens
- Best For: General medical questions with standard terminology
2. Hybrid RAG (BM25 + Semantic)
Architecture Details
Strategy
Combines lexical search (BM25) with semantic search using ensemble retrieval. This addresses vocabulary mismatch by using both keyword matching and semantic understanding.

How It Works
- BM25 Retrieval: Score documents based on keyword frequency and rarity (TF-IDF-like)
- Semantic Retrieval: Score documents based on embedding similarity
- Ensemble Fusion: Combine both ranked lists with configurable weights
- Final Selection: Return top-k documents from the fused ranking
Implementation
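The fusion step can be sketched as follows, assuming each retriever hands back a raw score per document. `weighted_fusion` and its normalisation strategy are illustrative, not the project's actual ensemble code.

```python
def weighted_fusion(bm25_scores, semantic_scores, w_bm25=0.5, w_sem=0.5, k=5):
    """Min-max normalise each retriever's scores, then combine with weights."""
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    b, s = normalise(bm25_scores), normalise(semantic_scores)
    fused = {doc: w_bm25 * b.get(doc, 0.0) + w_sem * s.get(doc, 0.0)
             for doc in set(b) | set(s)}
    return sorted(fused, key=fused.get, reverse=True)[:k]

# A document scored highly by either retriever can surface in the fused list
print(weighted_fusion({"a": 2.0, "b": 1.0, "c": 0.0},
                      {"c": 1.0, "b": 0.5, "a": 0.0},
                      w_bm25=0.7, w_sem=0.3, k=3))
```

Normalisation matters here because BM25 and cosine scores live on different scales; without it, one retriever silently dominates the ensemble.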
Characteristics
Strengths:
- Handles both semantic similarity and exact keyword matches
- Better coverage for technical medical terms
- More robust to query variations
- Balances precision and recall
Weaknesses:
- Slightly slower than simple semantic (two retrievals)
- Requires loading full document corpus for BM25
- Weight tuning may be needed for optimal results
- No diversity control (may retrieve similar documents)
When to Use
- Medical domains with technical terminology
- When users ask questions with specific terms or acronyms
- When semantic search alone misses relevant documents
- General improvement over baseline at minimal cost
Configuration
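Illustrative parameter values only; the benchmark's actual setting names and weights may differ, and the right balance depends on your corpus.

```python
# Illustrative ensemble settings; tune per corpus
BM25_WEIGHT = 0.4      # lexical (keyword) contribution
SEMANTIC_WEIGHT = 0.6  # embedding contribution
TOP_K = 5              # documents returned after fusion
```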
3. Hybrid RAG + RRF
Architecture Details
Strategy
Enhanced hybrid search using Reciprocal Rank Fusion (RRF) for better rank aggregation, plus Maximal Marginal Relevance (MMR) for diversity.

How It Works
- Retrieve More Candidates: Get top-15 from both BM25 and semantic retrievers
- RRF Fusion: Combine rankings using reciprocal rank scores
- MMR Selection: Apply diversity filter to avoid redundant context
- Final Selection: Return top-5 diverse, relevant documents
Reciprocal Rank Fusion Formula

RRF_score(doc) = Σᵢ 1 / (k + rankᵢ(doc))

where:
- `k` is a constant (typically 60)
- `rankᵢ(doc)` is the rank of the document in retriever `i`
- Documents appearing in multiple rankings get boosted scores
Implementation
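The two stages can be sketched as plain functions. `rrf_fuse` implements the formula above; `mmr_select` greedily trades relevance against redundancy. Both names are invented for this illustration, and the similarity inputs are precomputed stand-ins for real embedding similarities.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each doc scores sum(1 / (k + rank_i))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def mmr_select(candidates, sim_to_query, sim_between, lam=0.7, n=5):
    """Maximal Marginal Relevance: greedily pick docs that are relevant
    to the query but dissimilar to documents already selected."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < n:
        best = max(pool, key=lambda d: lam * sim_to_query[d]
                   - (1 - lam) * max((sim_between.get((d, s), 0.0)
                                      for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

Note how a document ranked first by both retrievers accumulates two reciprocal-rank contributions, which is exactly the boosting effect described above; MMR then drops near-duplicates from the fused list.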
Characteristics
Strengths:
- Better rank fusion than simple score averaging
- Boosts documents that appear in multiple retrievers
- MMR ensures diverse context (reduces redundancy)
- More sophisticated than basic hybrid
Weaknesses:
- Additional embedding calls for MMR computation
- More complex to implement and tune
- Slightly higher latency
- Benefits may be marginal for small document sets
When to Use
- When document redundancy is a problem
- Large knowledge bases with overlapping content
- When you need both relevance and diversity
- Research/evaluation scenarios requiring best-possible retrieval
Tuning Parameters
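Illustrative values only; the benchmark's actual parameter names may differ, and each knob interacts with corpus size and redundancy.

```python
# Illustrative tuning values; adjust against your own corpus
RRF_K = 60        # rank-smoothing constant in the RRF formula
FETCH_K = 15      # candidates pulled from each retriever before fusion
FINAL_K = 5       # documents kept after MMR selection
MMR_LAMBDA = 0.7  # 1.0 = pure relevance, 0.0 = pure diversity
```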
4. HyDE RAG
Architecture Details
Strategy
Hypothetical Document Embeddings: generate a hypothetical answer to the question, then search for documents similar to that answer rather than the question itself.

How It Works
- Generate Hypothetical Answer: Use an LLM to write what a perfect answer would look like
- Embed the Answer: Convert this hypothetical document to a vector
- Search with Answer: Find real documents similar to the hypothetical answer
- Generate Final Answer: Use retrieved documents to produce the actual answer
The Key Insight
Problem: Questions and answers often use different language. “What is the ideal number of prenatal visits?” vs. “A primigravida should have 10 prenatal appointments…”

Solution: Search for answer-like text, not question-like text.

Implementation
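A sketch of the HyDE retrieval flow, with toy stand-ins for the LLM, the embedder, and the vector search so it runs self-contained. Every name here is illustrative, not the project's actual code.

```python
def hyde_retrieve(question, generate, embed, search, k=5):
    """HyDE: embed a hypothetical *answer*, then search with that vector."""
    prompt = ("Write a short, plausible medical passage that directly "
              f"answers this question: {question}")
    hypothetical_doc = generate(prompt)  # first LLM call; may hallucinate,
                                         # but is only ever used for retrieval
    return search(embed(hypothetical_doc), k)

# Toy stand-ins so the sketch runs without an LLM or vector store
corpus = {
    "A primigravida should have about 10 prenatal appointments.": [1.0, 0.0],
    "Influenza vaccines are updated annually.": [0.0, 1.0],
}

def generate(prompt):
    return "Prenatal appointments for a primigravida are usually scheduled ten times."

def embed(text):
    return [1.0, 0.0] if "prenatal" in text.lower() else [0.0, 1.0]

def search(vec, k):
    ranked = sorted(corpus, key=lambda t: sum(a * b for a, b in zip(vec, corpus[t])),
                    reverse=True)
    return ranked[:k]

print(hyde_retrieve("What is the ideal number of prenatal visits?",
                    generate, embed, search, k=1))
```

The hypothetical passage is discarded after retrieval; the second LLM call generates the final answer from the real retrieved documents only.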
Characteristics
Strengths:
- Bridges vocabulary gap between questions and answers
- Better retrieval when queries don’t match document language
- Particularly effective for “how-to” and explanatory questions
- Can retrieve more relevant passages
Weaknesses:
- Requires two LLM calls (costly in tokens and time)
- Hypothetical document may contain hallucinations
- More complex error handling
- Higher latency (2x LLM calls + retrieval)
When to Use
- Complex medical questions requiring detailed explanations
- When simple semantic search retrieves poor results
- Questions phrased differently than source documents
- When token cost is not the primary concern
Model Configuration
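A common pattern, sketched with hypothetical setting names, is to use a cheaper model for the hypothetical document and a stronger one for the final answer; the benchmark's actual model choices may differ.

```python
# Hypothetical settings; the benchmark's real model names may differ
HYDE_CONFIG = {
    "hypothesis_model": "small-instruct-model",  # cheap model drafts the fake answer
    "answer_model": "larger-instruct-model",     # stronger model writes the final answer
    "hypothesis_max_tokens": 256,                # keep the hypothetical doc short
}
```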
Cost Analysis
- Token Overhead: ~2-3x compared to simple semantic
- Latency Overhead: ~2x due to sequential LLM calls
- Quality Improvement: Typically 5-15% on retrieval metrics
5. Query Rewriter RAG
Architecture Details
Strategy
Multi-Query Reformulation: generate multiple variations of the user’s question, retrieve documents for each variation, then fuse the results with relevance-based ranking.

How It Works
- Generate Query Variations: Create 3 different phrasings of the question
- Standalone reformulation
- Synonym replacement
- Expanded version with related aspects
- Parallel Retrieval: Retrieve documents for each query variation
- Deduplication & Ranking: Combine results with weighted relevance scoring
- Generate Answer: Use the diverse, high-quality context
Three Rewriting Strategies
Implementation
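The reformulate-retrieve-fuse loop can be sketched as follows, with toy stand-ins for the rewriting LLM and the retriever. Names, the rank-based scoring, and the weights are illustrative, not the project's actual implementation.

```python
def multi_query_retrieve(question, rewrite, retrieve_fn, weights=(1.0, 0.8, 0.6), k=5):
    """Retrieve once per query variation, then deduplicate with
    rank- and weight-based scoring (earlier variations count more)."""
    scores = {}
    for weight, query in zip(weights, rewrite(question)):
        for rank, doc in enumerate(retrieve_fn(query), start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy stand-ins so the sketch runs without an LLM or vector store
def rewrite(question):
    return ["standalone form", "synonym form"]

hits = {"standalone form": ["a", "b"], "synonym form": ["b", "c"]}

def retrieve_fn(query):
    return hits[query]

print(multi_query_retrieve("original question?", rewrite, retrieve_fn,
                           weights=(1.0, 0.8), k=3))
```

A document returned by several variations accumulates score from each, so consensus hits rise to the top even if no single query ranked them first.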
Characteristics
Strengths:
- Comprehensive coverage of different query interpretations
- Discovers documents missed by single query
- Handles ambiguous questions better
- More robust to query phrasing
Weaknesses:
- Multiple LLM calls for rewriting (3 rewrites + 1 answer)
- Multiple retrieval operations (3-4 retrievals)
- Higher token and latency costs
- Complexity in managing multiple retrievals
When to Use
- Complex, ambiguous medical questions
- When single-query retrieval is insufficient
- Research scenarios requiring maximum recall
- When you need diverse perspectives on a topic
Query Weighting Strategy
Earlier query reformulations are weighted higher when fusing the retrieved results.

Performance Metrics
- LLM Calls: 4 total (3 rewrites + 1 answer)
- Retrieval Operations: 3 parallel retrievals
- Token Overhead: ~3-4x vs baseline
- Latency: ~2-3x vs baseline
6. PageIndex RAG
Architecture Details
Strategy
Use an external retrieval API (PageIndex) that provides pre-indexed, optimized document retrieval with relevance highlighting.

How It Works
- Submit Query: Send question to PageIndex API
- Wait for Processing: Poll for completion (async retrieval)
- Extract Contexts: Parse retrieved nodes and relevant content snippets
- Generate Answer: Use PageIndex contexts with local LLM
Implementation
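The submit-then-poll pattern can be sketched as below. The real PageIndex client calls, endpoint names, and response shapes are not shown here, so `submit`, `poll`, and the result fields are stand-ins invented for this illustration.

```python
import time

def pageindex_retrieve(question, submit, poll, interval=0.5, timeout=30.0):
    """Submit a query to an async retrieval service, then poll for the result.
    `submit` and `poll` stand in for the real PageIndex client calls."""
    job_id = submit(question)
    waited = 0.0
    while True:
        result = poll(job_id)
        if result["status"] == "completed":
            return [node["content"] for node in result["nodes"]]
        if waited >= timeout:
            raise TimeoutError("retrieval did not complete in time")
        time.sleep(interval)
        waited += interval

# Toy stand-ins so the sketch runs without the external service
def submit(question):
    return "job-1"

_polls = {"n": 0}
def poll(job_id):
    _polls["n"] += 1
    if _polls["n"] < 2:
        return {"status": "processing"}
    return {"status": "completed", "nodes": [{"content": "relevant snippet"}]}

print(pageindex_retrieve("What causes anemia?", submit, poll, interval=0.01))
```

The extracted contexts are then handed to the local LLM exactly as retrieved documents are in the other architectures.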
Characteristics
Strengths:
- Offloads retrieval complexity to external service
- Potentially optimized retrieval algorithms
- Relevant content highlighting
- No local vector store management
Weaknesses:
- Depends on external service availability
- API latency and rate limits
- Requires document pre-indexing with PageIndex
- Additional cost for API usage
- Less control over retrieval process
When to Use
- When you want managed retrieval infrastructure
- Large document sets that are hard to manage locally
- When PageIndex’s retrieval is demonstrably better
- Production systems needing reliability and scale
Configuration
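Illustrative only; consult the PageIndex documentation for the real setting names. Credentials should come from the environment rather than source code.

```python
import os

# Illustrative setting names; the real PageIndex client may differ
PAGEINDEX_API_KEY = os.environ.get("PAGEINDEX_API_KEY", "")
POLL_INTERVAL_S = 2.0   # seconds between status checks
TIMEOUT_S = 60.0        # give up on the async job after this long
```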
Architecture Comparison Matrix
| Architecture | LLM Calls | Retrievals | Complexity | Token Cost | Latency | Best Use Case |
|---|---|---|---|---|---|---|
| Simple Semantic | 1 | 1 | Low | Low | Fast | Baseline, simple queries |
| Hybrid | 1 | 2 | Medium | Low | Fast | Technical terminology |
| Hybrid + RRF | 1 | 2 | High | Medium | Medium | Diverse context needed |
| HyDE | 2 | 1 | Medium | High | Slow | Question-answer gap |
| Query Rewriter | 4 | 3 | High | High | Slow | Complex/ambiguous queries |
| PageIndex | 1 | 1 (API) | Low | Low + API | Variable | Managed infrastructure |
Which Architecture Should You Choose?
Start Simple
Begin with Simple Semantic RAG to establish baseline performance. It works well for most cases.
Add Lexical Search
Upgrade to Hybrid RAG if you need better handling of technical terms and keywords.
Optimize Quality
Try HyDE or Query Rewriter if retrieval quality is insufficient and token cost is acceptable.
Maximize Performance
Use Hybrid + RRF for the best balance of quality and diversity in research scenarios.
Next Steps
Evaluation Framework
Learn how these architectures are evaluated with RAGAS
Run Benchmarks
Compare architectures on your own dataset
