
Benchmarking

This guide covers best practices for running comprehensive benchmarks to compare RAG architectures and identify optimal configurations.

Benchmarking Goals

A comprehensive benchmark should answer:

Quality

Which RAG architecture produces the highest quality answers?

Performance

Which architecture is fastest and most cost-effective?

Robustness

Which architecture handles diverse questions best?

Scalability

Which architecture scales best to production?

Benchmark Types

1. Single RAG Benchmark

Evaluate one RAG architecture in depth.
python scripts/run_evaluation.py hybrid
Use when:
  • Testing a new RAG implementation
  • Debugging a specific architecture
  • Quick quality check
Output: ragas_evaluation_[rag_type]_[timestamp].json

2. Multi-Model Benchmark

Compare how different LLMs perform with the same RAG architecture.
python scripts/run_evaluation.py multi-model hybrid
Use when:
  • Selecting the best LLM for your use case
  • Understanding model-specific strengths
  • Cost-benefit analysis across models
Output: ragas_multimodel_[rag_type]_[timestamp].json

3. Comprehensive Benchmark

Test all RAG architectures with all available models.
python scripts/run_evaluation.py all-models-all-rags
Use when:
  • Conducting research
  • Selecting production configuration
  • Publishing results
Output: ragas_comprehensive_all_rags_all_models_[timestamp].json

Running a Comprehensive Benchmark

1. Prepare Environment

Ensure stable conditions for fair comparison:
# Clean environment
rm -rf data/embeddings/chroma_db/

# Recreate embeddings
python scripts/create_embeddings.py

# Verify API keys
cat .env | grep OPENAI_API_KEY
2. Run Comprehensive Evaluation

Start the full benchmark:
python scripts/run_evaluation.py all-models-all-rags > benchmark_log.txt 2>&1 &
This runs in the background and logs all output.
3. Monitor Progress

Watch the log file:
tail -f benchmark_log.txt
You’ll see progress through RAG types:
========================= SIMPLE SEMANTIC RAG =========================
Starting RAGAS evaluation
...
============================= HYDE RAG =============================
Starting RAGAS evaluation
...
4. Wait for Completion

Typical duration:
  • 6 RAG types × 4 models × 10 questions = 240 evaluations
  • ~5-10 seconds per question
  • Total: 2-4 hours
Do not interrupt the benchmark. Results are only saved at the end.
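If you want a notification when the run completes, a small watcher can poll the results directory for the new output file. This is a sketch: it assumes results are written to results/ with the comprehensive filename prefix shown above.
# watch_benchmark.py -- sketch: poll for the comprehensive results file
import time
from pathlib import Path

RESULTS_DIR = Path("results")
PREFIX = "ragas_comprehensive_all_rags_all_models_"

existing = set(RESULTS_DIR.glob(f"{PREFIX}*.json"))
while True:
    new_files = set(RESULTS_DIR.glob(f"{PREFIX}*.json")) - existing
    if new_files:
        print(f"Benchmark finished: {sorted(new_files)[-1]}")
        break
    time.sleep(60)  # check once a minute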

Understanding Benchmark Results

The comprehensive benchmark produces a detailed JSON file:
{
  "metadata": {
    "timestamp": "2026-03-11T11:15:57.123456",
    "evaluation_type": "comprehensive_rag_comparison",
    "dataset_size": 10,
    "rags_evaluated": [
      "simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"
    ],
    "models_evaluated": [
      "gpt-4o", "gpt-5", "gpt-5.2", "google/medgemma-1.5-4b-it"
    ]
  },
  "summary": { /* Metrics for each RAG × Model combination */ },
  "best_performers": { /* Top performers for each metric */ },
  "question_by_question": [ /* Detailed results */ ]
}
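Before digging into the metrics, it is worth confirming that the run covered the RAGs and models you expected. A quick sketch (the filename is the example used throughout this guide):
import json

# Sketch: confirm what a finished run actually covered
with open("results/ragas_comprehensive_all_rags_all_models_20260311_111557.json") as f:
    results = json.load(f)

meta = results["metadata"]
print(f"{len(meta['rags_evaluated'])} RAGs x {len(meta['models_evaluated'])} models "
      f"on {meta['dataset_size']} questions ({meta['timestamp']})")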

Summary Section

Compare all RAG architectures:
{
  "summary": {
    "simple": {
      "rag_name": "Simple Semantic RAG",
      "metrics": {
        "faithfulness": 0.850,
        "answer_relevancy": 0.265,
        "context_precision": 0.779,
        "context_recall": 0.600
      },
      "performance": {
        "average_execution_time": 8.234,
        "total_cost": 0.021433
      }
    },
    "hybrid": {
      "rag_name": "Hybrid RAG (BM25 + Semantic)",
      "metrics": {
        "faithfulness": 0.912,
        "answer_relevancy": 0.783,
        "context_precision": 0.891,
        "context_recall": 0.845
      },
      "performance": {
        "average_execution_time": 12.567,
        "total_cost": 0.038921
      }
    }
    // ... more RAGs
  }
}

Best Performers

Identify winners for each metric:
{
  "best_performers": {
    "faithfulness": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.924
    },
    "answer_relevancy": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.887
    },
    "context_precision": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.901
    },
    "context_recall": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.894
    },
    "best_avg_execution_time": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 8.234
    },
    "best_total_cost": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 0.021
    }
  }
}
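A short loop turns this block into a readable leaderboard (assuming the results file was loaded into results as in the earlier snippet):
# Sketch: print the winning RAG for each metric
for metric, winner in results["best_performers"].items():
    print(f"{metric:26s} -> {winner['rag_name']} ({winner['score']})")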

Comparing RAG Architectures

Quality Comparison

Rank by overall average score:
| Rank | RAG Architecture | Avg Score | Faithfulness | Answer Rel. | Ctx Prec. | Ctx Recall |
|------|------------------|-----------|--------------|-------------|-----------|------------|
| 1 | Hybrid-RRF | 0.891 | 0.924 | 0.875 | 0.901 | 0.864 |
| 2 | Rewriter | 0.872 | 0.901 | 0.887 | 0.834 | 0.894 |
| 3 | Hybrid | 0.858 | 0.912 | 0.783 | 0.891 | 0.845 |
| 4 | HyDE | 0.812 | 0.867 | 0.745 | 0.823 | 0.812 |
| 5 | PageIndex | 0.781 | 0.834 | 0.698 | 0.801 | 0.791 |
| 6 | Simple | 0.623 | 0.850 | 0.265 | 0.779 | 0.600 |
Key insights:
  • Hybrid-RRF offers the best overall quality
  • Rewriter excels at recall (finding all relevant info)
  • Simple is fast and cheap but scores poorly on answer relevancy (0.265)

Performance Comparison

Rank by speed and cost:
| Rank | RAG Architecture | Avg Time (s) | Total Cost | Cost per Q |
|------|------------------|--------------|------------|------------|
| 1 | Simple | 8.2 | $0.021 | $0.0021 |
| 2 | Hybrid | 12.6 | $0.039 | $0.0039 |
| 3 | PageIndex | 13.1 | $0.041 | $0.0041 |
| 4 | Hybrid-RRF | 14.8 | $0.048 | $0.0048 |
| 5 | HyDE | 18.3 | $0.067 | $0.0067 |
| 6 | Rewriter | 21.7 | $0.089 | $0.0089 |
Key insights:
  • Simple is 2.6× faster than Rewriter
  • Simple costs 4.2× less than Rewriter
  • Advanced RAGs trade cost/speed for quality

Trade-off Analysis

Best Overall Quality

Hybrid-RRF
  • Average score: 0.891
  • Excels in all metrics
  • Cost: $0.048 per 10 questions
Use for: Production systems where quality matters most

Best Balance

Hybrid RAG
  • Average score: 0.858 (only 3.7% lower)
  • 15% faster than Hybrid-RRF
  • 19% cheaper than Hybrid-RRF
Use for: Most production use cases

Best Performance

Simple Semantic
  • Fastest: 8.2s average
  • Cheapest: $0.021 total
  • Score: 0.623 (acceptable)
Use for: High-volume, cost-sensitive applications

Best Recall

Rewriter RAG
  • Context recall: 0.894
  • Answer relevancy: 0.887
  • Most thorough retrieval
Use for: Critical applications requiring completeness
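One way to make these trade-offs concrete is to compute an average quality score per dollar from the summary block. This is a sketch of one possible weighting, not a definitive ranking; it assumes results was loaded as shown earlier.
# Sketch: rank RAGs by average quality per dollar (one possible trade-off view)
METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

ranking = []
for rag_type, rag_data in results["summary"].items():
    avg = sum(rag_data["metrics"][m] for m in METRICS) / len(METRICS)
    cost = rag_data["performance"]["total_cost"]
    ranking.append((rag_type, avg, cost, avg / cost))

for rag_type, avg, cost, value in sorted(ranking, key=lambda r: r[3], reverse=True):
    print(f"{rag_type:12s} avg={avg:.3f}  cost=${cost:.3f}  quality per $={value:.1f}")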

Cross-Model Analysis

For multi-model benchmarks, analyze how models perform across RAGs:
{
  "question_by_question": [
    {
      "question_id": 1,
      "question": "¿En qué momento y quien debe reevaluar...",
      "rag_results": {
        "gpt-4o": { /* metrics */ },
        "gpt-5": { /* metrics */ },
        "gpt-5.2": { /* metrics */ },
        "google/medgemma-1.5-4b-it": { /* metrics */ }
      }
    }
  ]
}
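From this structure you can aggregate per-model averages across all questions. A sketch, assuming each per-model entry carries the same four RAGAS metric keys as the summary block:
from collections import defaultdict

# Sketch: average each model's metrics across all questions
METRICS = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]
per_model = defaultdict(lambda: defaultdict(list))

for item in results["question_by_question"]:
    for model, scores in item["rag_results"].items():
        for metric in METRICS:
            if metric in scores:
                per_model[model][metric].append(scores[metric])

for model, metric_values in per_model.items():
    averages = {m: round(sum(v) / len(v), 3) for m, v in metric_values.items() if v}
    print(model, averages)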

Model Performance Matrix

| Model | Avg Score | Faithfulness | Answer Rel. | Cost |
|-------|-----------|--------------|-------------|------|
| gpt-5.2 | 0.892 | 0.934 | 0.901 | $0.045 |
| gpt-5 | 0.876 | 0.912 | 0.887 | $0.038 |
| gpt-4o | 0.854 | 0.889 | 0.834 | $0.041 |
| medgemma-1.5-4b | 0.623 | 0.850 | 0.265 | $0.021 |
Insights:
  • GPT-5.2 offers best quality but at higher cost
  • GPT-5 provides best value (quality/cost ratio)
  • Medical-specialized models (medgemma) need more tuning
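The value comparison can be reproduced from the matrix above by dividing the average score by cost. A raw ratio rewards very cheap models regardless of quality, so the sketch below applies a quality floor; the 0.8 threshold is an arbitrary assumption, not part of the evaluation.
# Sketch: quality/cost ratio using the numbers from the matrix above
# The 0.8 quality floor is an arbitrary assumption to exclude low-quality models
models = {
    "gpt-5.2": (0.892, 0.045),
    "gpt-5": (0.876, 0.038),
    "gpt-4o": (0.854, 0.041),
    "medgemma-1.5-4b": (0.623, 0.021),
}
candidates = {m: (s, c) for m, (s, c) in models.items() if s >= 0.8}
for name, (score, cost) in sorted(candidates.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True):
    print(f"{name:10s} score/cost = {score / cost:.1f}")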

Best Practices

Fair Comparison

All RAGs should be evaluated on the exact same questions:
# From ragas_evaluator.py:49-90
DATA_GT = [  # Same 10 questions for all evaluations
    {"question": "...", "ground_truth": "..."},
    # ...
]
Don’t change embeddings between RAG evaluations:
# All RAGs use: OpenAI text-embedding-3-small
# Stored in: data/embeddings/chroma_db/
Keep k (number of chunks) consistent:
# Default k=5 for all RAGs
results = vectorstore.similarity_search(query, k=5)
  • Run evaluations on the same hardware
  • Use the same API tier (avoid rate limits)
  • Don’t run in parallel (can affect timing)
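One way to keep these settings from drifting between runs is to pin them in a single configuration that every evaluation reads. A hypothetical sketch (EVAL_CONFIG and retrieve are illustrative names, not part of the repo's scripts):
# Hypothetical sketch: pin shared settings so every RAG sees identical conditions
EVAL_CONFIG = {
    "embedding_model": "text-embedding-3-small",       # same embeddings for all RAGs
    "vectorstore_path": "data/embeddings/chroma_db/",  # shared Chroma store
    "k": 5,                                            # chunks retrieved per query
}

def retrieve(vectorstore, query, config=EVAL_CONFIG):
    # every RAG retrieves the same number of chunks
    return vectorstore.similarity_search(query, k=config["k"])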

Result Storage Organization

Organize results for easy comparison:
results/
├── benchmarks/
│   ├── 2026-03-11_comprehensive/
│   │   ├── ragas_comprehensive_all_rags_all_models_20260311_111557.json
│   │   ├── analysis.ipynb
│   │   └── visualizations/
│   │       ├── metrics_comparison.png
│   │       ├── cost_analysis.png
│   │       └── performance_heatmap.png
│   └── 2026-03-10_hybrid_only/
│       └── ragas_multimodel_hybrid_20260310_153022.json
└── single_runs/
    ├── ragas_evaluation_simple_20260311_093843.json
    └── ragas_evaluation_hybrid_20260311_095023.json
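A small script can file new results into this layout automatically (a sketch; the paths follow the tree above and may need adjusting to your repo):
import shutil
from datetime import date
from pathlib import Path

# Sketch: move freshly generated result JSONs into a dated benchmark folder
run_dir = Path("results/benchmarks") / f"{date.today().isoformat()}_comprehensive"
run_dir.mkdir(parents=True, exist_ok=True)

for result_file in Path("results").glob("ragas_comprehensive_*.json"):
    shutil.move(str(result_file), run_dir / result_file.name)
    print(f"Moved {result_file.name} -> {run_dir}")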

Documentation

Document your benchmark methodology:
# Benchmark Report: 2026-03-11

## Setup
- Date: March 11, 2026
- Environment: Python 3.11, MacBook Pro M2
- Models tested: gpt-4o, gpt-5, gpt-5.2, medgemma-1.5-4b
- RAGs tested: simple, hybrid, hybrid-rrf, hyde, rewriter, pageindex
- Dataset: 10 obstetric questions (Spanish)
- Embeddings: OpenAI text-embedding-3-small

## Key Findings
1. Hybrid-RRF achieved highest quality (0.891 avg)
2. Simple RAG was fastest (8.2s avg) and cheapest ($0.021)
3. GPT-5 offered best quality/cost ratio
4. Medical models need further tuning

## Recommendations
- Use Hybrid RAG for production (good balance)
- Consider Hybrid-RRF for critical applications
- Use Simple RAG for high-volume, cost-sensitive use cases
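The setup and findings sections can also be generated from the results JSON, so the report never drifts from the numbers (a sketch; it uses only the fields shown earlier and assumes results was loaded as above):
# Sketch: write a minimal benchmark report from the results JSON
meta = results["metadata"]
best = results["best_performers"]

report = [
    f"# Benchmark Report: {meta['timestamp'][:10]}",
    "",
    "## Setup",
    f"- Models tested: {', '.join(meta['models_evaluated'])}",
    f"- RAGs tested: {', '.join(meta['rags_evaluated'])}",
    f"- Dataset size: {meta['dataset_size']} questions",
    "",
    "## Key Findings",
    f"1. Best faithfulness: {best['faithfulness']['rag_name']} ({best['faithfulness']['score']})",
    f"2. Best answer relevancy: {best['answer_relevancy']['rag_name']} ({best['answer_relevancy']['score']})",
    f"3. Cheapest run: {best['best_total_cost']['rag_name']} (${best['best_total_cost']['score']})",
]

with open("benchmark_report.md", "w") as f:
    f.write("\n".join(report) + "\n")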

Analyzing Results Programmatically

Load and Compare

import json
import pandas as pd

# Load comprehensive results
with open('results/ragas_comprehensive_all_rags_all_models_20260311_111557.json') as f:
    data = json.load(f)

# Extract summary data
rows = []
for rag_type, rag_data in data['summary'].items():
    row = {
        'rag_type': rag_type,
        'rag_name': rag_data['rag_name'],
        **rag_data['metrics'],
        **rag_data['performance']
    }
    rows.append(row)

df = pd.DataFrame(rows)

# Calculate overall average
df['avg_score'] = df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']].mean(axis=1)

# Sort by quality
df_by_quality = df.sort_values('avg_score', ascending=False)
print("\nRAGs ranked by quality:")
print(df_by_quality[['rag_name', 'avg_score', 'total_cost', 'average_execution_time']])

# Sort by cost
df_by_cost = df.sort_values('total_cost')
print("\nRAGs ranked by cost:")
print(df_by_cost[['rag_name', 'total_cost', 'avg_score']])

Visualize Results

import matplotlib.pyplot as plt
import seaborn as sns

# Metrics comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']

for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    df_sorted = df.sort_values(metric, ascending=False)
    ax.barh(df_sorted['rag_type'], df_sorted[metric])
    ax.set_xlabel('Score')
    ax.set_title(metric.replace('_', ' ').title())
    ax.set_xlim(0, 1)

plt.tight_layout()
plt.savefig('metrics_comparison.png', dpi=300)

# Cost vs Quality scatter
plt.figure(figsize=(10, 6))
plt.scatter(df['total_cost'], df['avg_score'], s=100)
for idx, row in df.iterrows():
    plt.annotate(row['rag_type'], (row['total_cost'], row['avg_score']), 
                xytext=(5, 5), textcoords='offset points')
plt.xlabel('Total Cost ($)')
plt.ylabel('Average Quality Score')
plt.title('Cost vs Quality Trade-off')
plt.grid(True, alpha=0.3)
plt.savefig('cost_vs_quality.png', dpi=300)

Publishing Results

For research or internal documentation:

LaTeX Table

\begin{table}[h]
\centering
\caption{RAGAS Evaluation Results Across RAG Architectures}
\begin{tabular}{lcccccc}
\hline
RAG Type & Faith. & Ans.Rel. & Ctx.Prec. & Ctx.Rec. & Avg & Cost \\
\hline
Hybrid-RRF & 0.924 & 0.875 & 0.901 & 0.864 & 0.891 & \$0.048 \\
Rewriter & 0.901 & 0.887 & 0.834 & 0.894 & 0.872 & \$0.089 \\
Hybrid & 0.912 & 0.783 & 0.891 & 0.845 & 0.858 & \$0.039 \\
HyDE & 0.867 & 0.745 & 0.823 & 0.812 & 0.812 & \$0.067 \\
PageIndex & 0.834 & 0.698 & 0.801 & 0.791 & 0.781 & \$0.041 \\
Simple & 0.850 & 0.265 & 0.779 & 0.600 & 0.623 & \$0.021 \\
\hline
\end{tabular}
\end{table}
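If you built the df DataFrame in the programmatic analysis above, pandas can emit a comparable table directly (a sketch; to_latex output usually needs minor adjustment to match a specific template):
# Sketch: generate a LaTeX table from the summary DataFrame built earlier
cols = ["rag_name", "faithfulness", "answer_relevancy",
        "context_precision", "context_recall", "avg_score", "total_cost"]
latex = (
    df.sort_values("avg_score", ascending=False)[cols]
      .rename(columns={"rag_name": "RAG Type"})
      .to_latex(index=False, float_format="%.3f")
)
print(latex)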

Next Steps

RAGAS Metrics

Understand what each metric measures

Interpreting Results

Detailed guide to analyzing evaluation results
