
Benchmarking

This guide covers best practices for running comprehensive benchmarks to compare RAG architectures and identify optimal configurations.

Benchmarking Goals

A comprehensive benchmark should answer:

Quality

Which RAG architecture produces the highest quality answers?

Performance

Which architecture is fastest and most cost-effective?

Robustness

Which architecture handles diverse questions best?

Scalability

Which architecture scales best to production?

Benchmark Types

1. Single RAG Benchmark

Evaluate one RAG architecture in depth.
python scripts/run_evaluation.py hybrid
Use when:
  • Testing a new RAG implementation
  • Debugging a specific architecture
  • Quick quality check
Output: ragas_evaluation_[rag_type]_[timestamp].json

2. Multi-Model Benchmark

Compare how different LLMs perform with the same RAG architecture.
python scripts/run_evaluation.py multi-model hybrid
Use when:
  • Selecting the best LLM for your use case
  • Understanding model-specific strengths
  • Cost-benefit analysis across models
Output: ragas_multimodel_[rag_type]_[timestamp].json

3. Comprehensive Benchmark

Test all RAG architectures with all available models.
python scripts/run_evaluation.py all-models-all-rags
Use when:
  • Conducting research
  • Selecting production configuration
  • Publishing results
Output: ragas_comprehensive_all_rags_all_models_[timestamp].json

Running a Comprehensive Benchmark

1. Prepare Environment

Ensure stable conditions for fair comparison:
# Clean environment
rm -rf data/embeddings/chroma_db/

# Recreate embeddings
python scripts/create_embeddings.py

# Verify API keys
cat .env | grep OPENAI_API_KEY
2. Run Comprehensive Evaluation

Start the full benchmark:
python scripts/run_evaluation.py all-models-all-rags > benchmark_log.txt 2>&1 &
This runs in the background and logs all output.
3. Monitor Progress

Watch the log file:
tail -f benchmark_log.txt
You’ll see progress through RAG types:
========================= SIMPLE SEMANTIC RAG =========================
Starting RAGAS evaluation
...
============================= HYDE RAG =============================
Starting RAGAS evaluation
...
4. Wait for Completion

Typical duration:
  • 6 RAG types × 4 models × 10 questions = 240 evaluations
  • ~5-10 seconds per question
  • Total: 2-4 hours
Do not interrupt the benchmark. Results are only saved at the end.
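If you want a notification when the run completes, a small watcher can poll the results directory for the new output file. This is a sketch: it assumes results are written to results/ with the comprehensive filename prefix shown above.
# watch_benchmark.py -- sketch: poll for the comprehensive results file
import time
from pathlib import Path

RESULTS_DIR = Path("results")
PREFIX = "ragas_comprehensive_all_rags_all_models_"

existing = set(RESULTS_DIR.glob(f"{PREFIX}*.json"))
while True:
    new_files = set(RESULTS_DIR.glob(f"{PREFIX}*.json")) - existing
    if new_files:
        print(f"Benchmark finished: {sorted(new_files)[-1]}")
        break
    time.sleep(60)  # check once a minute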

Understanding Benchmark Results

The comprehensive benchmark produces a detailed JSON file:
{
  "metadata": {
    "timestamp": "2026-03-11T11:15:57.123456",
    "evaluation_type": "comprehensive_rag_comparison",
    "dataset_size": 10,
    "rags_evaluated": [
      "simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"
    ],
    "models_evaluated": [
      "gpt-4o", "gpt-5", "gpt-5.2", "google/medgemma-1.5-4b-it"
    ]
  },
  "summary": { /* Metrics for each RAG × Model combination */ },
  "best_performers": { /* Top performers for each metric */ },
  "question_by_question": [ /* Detailed results */ ]
}
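Before digging into the metrics, it is worth confirming that the run covered the RAGs and models you expected. A quick sketch (the filename is the example used throughout this guide):
import json

# Sketch: confirm what a finished run actually covered
with open("results/ragas_comprehensive_all_rags_all_models_20260311_111557.json") as f:
    results = json.load(f)

meta = results["metadata"]
print(f"{len(meta['rags_evaluated'])} RAGs x {len(meta['models_evaluated'])} models "
      f"on {meta['dataset_size']} questions ({meta['timestamp']})")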

Summary Section

Compare all RAG architectures:
{
  "summary": {
    "simple": {
      "rag_name": "Simple Semantic RAG",
      "metrics": {
        "faithfulness": 0.850,
        "answer_relevancy": 0.265,
        "context_precision": 0.779,
        "context_recall": 0.600
      },
      "performance": {
        "average_execution_time": 8.234,
        "total_cost": 0.021433
      }
    },
    "hybrid": {
      "rag_name": "Hybrid RAG (BM25 + Semantic)",
      "metrics": {
        "faithfulness": 0.912,
        "answer_relevancy": 0.783,
        "context_precision": 0.891,
        "context_recall": 0.845
      },
      "performance": {
        "average_execution_time": 12.567,
        "total_cost": 0.038921
      }
    }
    // ... more RAGs
  }
}

Best Performers

Identify winners for each metric:
{
  "best_performers": {
    "faithfulness": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.924
    },
    "answer_relevancy": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.887
    },
    "context_precision": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.901
    },
    "context_recall": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.894
    },
    "best_avg_execution_time": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 8.234
    },
    "best_total_cost": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 0.021
    }
  }
}
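A short loop turns this block into a readable leaderboard (assuming the results file was loaded into results as in the earlier snippet):
# Sketch: print the winning RAG for each metric
for metric, winner in results["best_performers"].items():
    print(f"{metric:26s} -> {winner['rag_name']} ({winner['score']})")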

Comparing RAG Architectures

Quality Comparison

Rank by overall average score:
| Rank | RAG Architecture | Avg Score | Faithfulness | Answer Rel. | Ctx Prec. | Ctx Recall |
|------|------------------|-----------|--------------|-------------|-----------|------------|
| 1 | Hybrid-RRF | 0.891 | 0.924 | 0.875 | 0.901 | 0.864 |
| 2 | Rewriter | 0.872 | 0.901 | 0.887 | 0.834 | 0.894 |
| 3 | Hybrid | 0.858 | 0.912 | 0.783 | 0.891 | 0.845 |
| 4 | HyDE | 0.812 | 0.867 | 0.745 | 0.823 | 0.812 |
| 5 | PageIndex | 0.781 | 0.834 | 0.698 | 0.801 | 0.791 |
| 6 | Simple | 0.623 | 0.850 | 0.265 | 0.779 | 0.600 |
Key insights:
  • Hybrid-RRF offers the best overall quality
  • Rewriter excels at recall (finding all relevant info)
  • Simple is fast and cheap but scores poorly on answer relevancy (0.265)

Performance Comparison

Rank by speed and cost:
| Rank | RAG Architecture | Avg Time (s) | Total Cost | Cost per Q |
|------|------------------|--------------|------------|------------|
| 1 | Simple | 8.2 | $0.021 | $0.0021 |
| 2 | Hybrid | 12.6 | $0.039 | $0.0039 |
| 3 | PageIndex | 13.1 | $0.041 | $0.0041 |
| 4 | Hybrid-RRF | 14.8 | $0.048 | $0.0048 |
| 5 | HyDE | 18.3 | $0.067 | $0.0067 |
| 6 | Rewriter | 21.7 | $0.089 | $0.0089 |
Key insights:
  • Simple is 2.6× faster than Rewriter
  • Simple costs 4.2× less than Rewriter
  • Advanced RAGs trade cost/speed for quality

Trade-off Analysis

Best Overall Quality

Hybrid-RRF
  • Average score: 0.891
  • Excels in all metrics
  • Cost: $0.048 per 10 questions
Use for: Production systems where quality matters most

Best Balance

Hybrid RAG
  • Average score: 0.858 (only 3.7% lower)
  • 15% faster than Hybrid-RRF
  • 19% cheaper than Hybrid-RRF
Use for: Most production use cases

Best Performance

Simple Semantic
  • Fastest: 8.2s average
  • Cheapest: $0.021 total
  • Score: 0.623 (acceptable)
Use for: High-volume, cost-sensitive applications

Best Recall

Rewriter RAG
  • Context recall: 0.894
  • Answer relevancy: 0.887
  • Most thorough retrieval
Use for: Critical applications requiring completeness
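One way to make these trade-offs concrete is to compute an average quality score per dollar from the summary block. This is a sketch of one possible weighting, not a definitive ranking; it assumes results was loaded as shown earlier.
# Sketch: rank RAGs by average quality per dollar (one possible trade-off view)
METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

ranking = []
for rag_type, rag_data in results["summary"].items():
    avg = sum(rag_data["metrics"][m] for m in METRICS) / len(METRICS)
    cost = rag_data["performance"]["total_cost"]
    ranking.append((rag_type, avg, cost, avg / cost))

for rag_type, avg, cost, value in sorted(ranking, key=lambda r: r[3], reverse=True):
    print(f"{rag_type:12s} avg={avg:.3f}  cost=${cost:.3f}  quality per $={value:.1f}")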

Cross-Model Analysis

For multi-model benchmarks, analyze how models perform across RAGs:
{
  "question_by_question": [
    {
      "question_id": 1,
      "question": "¿En qué momento y quien debe reevaluar...",
      "rag_results": {
        "gpt-4o": { /* metrics */ },
        "gpt-5": { /* metrics */ },
        "gpt-5.2": { /* metrics */ },
        "google/medgemma-1.5-4b-it": { /* metrics */ }
      }
    }
  ]
}
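From this structure you can aggregate per-model averages across all questions. A sketch, assuming each per-model entry carries the same four RAGAS metric keys as the summary block:
from collections import defaultdict

# Sketch: average each model's metrics across all questions
METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
per_model = defaultdict(lambda: defaultdict(list))

for item in results["question_by_question"]:
    for model, scores in item["rag_results"].items():
        for metric in METRICS:
            if metric in scores:
                per_model[model][metric].append(scores[metric])

for model, metric_values in per_model.items():
    averages = {m: round(sum(v) / len(v), 3) for m, v in metric_values.items() if v}
    print(model, averages)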

Model Performance Matrix

| Model | Avg Score | Faithfulness | Answer Rel. | Cost |
|-------|-----------|--------------|-------------|------|
| gpt-5.2 | 0.892 | 0.934 | 0.901 | $0.045 |
| gpt-5 | 0.876 | 0.912 | 0.887 | $0.038 |
| gpt-4o | 0.854 | 0.889 | 0.834 | $0.041 |
| medgemma-1.5-4b | 0.623 | 0.850 | 0.265 | $0.021 |
Insights:
  • GPT-5.2 offers best quality but at higher cost
  • GPT-5 provides best value (quality/cost ratio)
  • Medical-specialized models (medgemma) need more tuning
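The value comparison can be reproduced from the matrix above by dividing the average score by cost. A raw ratio rewards very cheap models regardless of quality, so the sketch below applies a quality floor; the 0.8 threshold is an arbitrary assumption, not part of the evaluation.
# Sketch: quality/cost ratio using the numbers from the matrix above
# The 0.8 quality floor is an arbitrary assumption to exclude low-quality models
models = {
    "gpt-5.2": (0.892, 0.045),
    "gpt-5": (0.876, 0.038),
    "gpt-4o": (0.854, 0.041),
    "medgemma-1.5-4b": (0.623, 0.021),
}
candidates = {m: (s, c) for m, (s, c) in models.items() if s >= 0.8}
for name, (score, cost) in sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:10s} score/cost = {score / cost:.1f}")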

Best Practices

Fair Comparison

All RAGs should be evaluated on the exact same questions:
# From ragas_evaluator.py:49-90
DATA_GT = [  # Same 10 questions for all evaluations
    {"question": "...", "ground_truth": "..."},
    # ...
]
Don’t change embeddings between RAG evaluations:
# All RAGs use: OpenAI text-embedding-3-small
# Stored in: data/embeddings/chroma_db/
Keep k (number of chunks) consistent:
# Default k=5 for all RAGs
results = vectorstore.similarity_search(query, k=5)
  • Run evaluations on the same hardware
  • Use the same API tier (avoid rate limits)
  • Don’t run in parallel (can affect timing)
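One way to keep these settings from drifting between runs is to pin them in a single configuration that every evaluation reads. A hypothetical sketch (EVAL_CONFIG and retrieve are illustrative names, not part of the repo's scripts):
# Hypothetical sketch: pin shared settings so every RAG sees identical conditions
EVAL_CONFIG = {
    "embedding_model": "text-embedding-3-small",       # same embeddings for all RAGs
    "vectorstore_path": "data/embeddings/chroma_db/",  # shared Chroma store
    "k": 5,                                            # chunks retrieved per query
}

def retrieve(vectorstore, query, config=EVAL_CONFIG):
    # every RAG retrieves the same number of chunks
    return vectorstore.similarity_search(query, k=config["k"])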

Result Storage Organization

Organize results for easy comparison:
results/
├── benchmarks/
│   ├── 2026-03-11_comprehensive/
│   │   ├── ragas_comprehensive_all_rags_all_models_20260311_111557.json
│   │   ├── analysis.ipynb
│   │   └── visualizations/
│   │       ├── metrics_comparison.png
│   │       ├── cost_analysis.png
│   │       └── performance_heatmap.png
│   └── 2026-03-10_hybrid_only/
│       └── ragas_multimodel_hybrid_20260310_153022.json
└── single_runs/
    ├── ragas_evaluation_simple_20260311_093843.json
    └── ragas_evaluation_hybrid_20260311_095023.json
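A small script can file new results into this layout automatically (a sketch; the paths follow the tree above and may need adjusting to your repo):
import shutil
from datetime import date
from pathlib import Path

# Sketch: move freshly generated result JSONs into a dated benchmark folder
run_dir = Path("results/benchmarks") / f"{date.today().isoformat()}_comprehensive"
run_dir.mkdir(parents=True, exist_ok=True)

for result_file in Path("results").glob("ragas_comprehensive_*.json"):
    shutil.move(str(result_file), run_dir / result_file.name)
    print(f"Moved {result_file.name} -> {run_dir}")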

Documentation

Document your benchmark methodology:
# Benchmark Report: 2026-03-11

## Setup
- Date: March 11, 2026
- Environment: Python 3.11, MacBook Pro M2
- Models tested: gpt-4o, gpt-5, gpt-5.2, medgemma-1.5-4b
- RAGs tested: simple, hybrid, hybrid-rrf, hyde, rewriter, pageindex
- Dataset: 10 obstetric questions (Spanish)
- Embeddings: OpenAI text-embedding-3-small

## Key Findings
1. Hybrid-RRF achieved highest quality (0.891 avg)
2. Simple RAG was fastest (8.2s avg) and cheapest ($0.021)
3. GPT-5 offered best quality/cost ratio
4. Medical models need further tuning

## Recommendations
- Use Hybrid RAG for production (good balance)
- Consider Hybrid-RRF for critical applications
- Use Simple RAG for high-volume, cost-sensitive use cases
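The setup and findings sections can also be generated from the results JSON, so the report never drifts from the numbers (a sketch; it uses only the fields shown earlier and assumes results was loaded as above):
# Sketch: write a minimal benchmark report from the results JSON
meta = results["metadata"]
best = results["best_performers"]

report = [
    f"# Benchmark Report: {meta['timestamp'][:10]}",
    "",
    "## Setup",
    f"- Models tested: {', '.join(meta['models_evaluated'])}",
    f"- RAGs tested: {', '.join(meta['rags_evaluated'])}",
    f"- Dataset size: {meta['dataset_size']} questions",
    "",
    "## Key Findings",
    f"1. Best faithfulness: {best['faithfulness']['rag_name']} ({best['faithfulness']['score']})",
    f"2. Best answer relevancy: {best['answer_relevancy']['rag_name']} ({best['answer_relevancy']['score']})",
    f"3. Cheapest run: {best['best_total_cost']['rag_name']} (${best['best_total_cost']['score']})",
]

with open("benchmark_report.md", "w") as f:
    f.write("\n".join(report) + "\n")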

Analyzing Results Programmatically

Load and Compare

import json
import pandas as pd

# Load comprehensive results
with open('results/ragas_comprehensive_all_rags_all_models_20260311_111557.json') as f:
    data = json.load(f)

# Extract summary data
rows = []
for rag_type, rag_data in data['summary'].items():
    row = {
        'rag_type': rag_type,
        'rag_name': rag_data['rag_name'],
        **rag_data['metrics'],
        **rag_data['performance']
    }
    rows.append(row)

df = pd.DataFrame(rows)

# Calculate overall average
df['avg_score'] = df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']].mean(axis=1)

# Sort by quality
df_by_quality = df.sort_values('avg_score', ascending=False)
print("\nRAGs ranked by quality:")
print(df_by_quality[['rag_name', 'avg_score', 'total_cost', 'average_execution_time']])

# Sort by cost
df_by_cost = df.sort_values('total_cost')
print("\nRAGs ranked by cost:")
print(df_by_cost[['rag_name', 'total_cost', 'avg_score']])

Visualize Results

import matplotlib.pyplot as plt
import seaborn as sns

# Metrics comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    df_sorted = df.sort_values(metric, ascending=False)
    ax.barh(df_sorted['rag_type'], df_sorted[metric])
    ax.set_xlabel('Score')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_xlim(0, 1)

plt.tight_layout()
plt.savefig('metrics_comparison.png', dpi=300)

# Cost vs Quality scatter
plt.figure(figsize=(10, 6))
plt.scatter(df['total_cost'], df['avg_score'], s=100)
for idx, row in df.iterrows():
    plt.annotate(row['rag_type'], (row['total_cost'], row['avg_score']), 
                xytext=(5, 5), textcoords='offset points')
plt.xlabel('Total Cost ($)')
plt.ylabel('Average Quality Score')
plt.title('Cost vs Quality Trade-off')
plt.grid(True, alpha=0.3)
plt.savefig('cost_vs_quality.png', dpi=300)

Publishing Results

For research or internal documentation:

LaTeX Table

\begin{table}[h]
\centering
\caption{RAGAS Evaluation Results Across RAG Architectures}
\begin{tabular}{lcccccc}
\hline
RAG Type & Faith. & Ans.Rel. & Ctx.Prec. & Ctx.Rec. & Avg & Cost \\
\hline
Hybrid-RRF & 0.924 & 0.875 & 0.901 & 0.864 & 0.891 & \$0.048 \\
Rewriter & 0.901 & 0.887 & 0.834 & 0.894 & 0.872 & \$0.089 \\
Hybrid & 0.912 & 0.783 & 0.891 & 0.845 & 0.858 & \$0.039 \\
HyDE & 0.867 & 0.745 & 0.823 & 0.812 & 0.812 & \$0.067 \\
PageIndex & 0.834 & 0.698 & 0.801 & 0.791 & 0.781 & \$0.041 \\
Simple & 0.850 & 0.265 & 0.779 & 0.600 & 0.623 & \$0.021 \\
\hline
\end{tabular}
\end{table}
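If you built the df DataFrame in the programmatic analysis above, pandas can emit a comparable table directly (a sketch; to_latex output usually needs minor adjustment to match a specific template):
# Sketch: generate a LaTeX table from the summary DataFrame built earlier
cols = ["rag_name", "faithfulness", "answer_relevancy",
        "context_precision", "context_recall", "avg_score", "total_cost"]
latex = (
    df.sort_values("avg_score", ascending=False)[cols]
      .rename(columns={"rag_name": "RAG Type"})
      .to_latex(index=False, float_format="%.3f")
)
print(latex)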

Next Steps

RAGAS Metrics

Understand what each metric measures

Interpreting Results

Detailed guide to analyzing evaluation results
