Benchmarking
This guide covers best practices for running comprehensive benchmarks to compare RAG architectures and identify optimal configurations.

Benchmarking Goals
A comprehensive benchmark should answer:

- Quality: Which RAG architecture produces the highest quality answers?
- Performance: Which architecture is fastest and most cost-effective?
- Robustness: Which architecture handles diverse questions best?
- Scalability: Which architecture scales best to production?
Benchmark Types
1. Single RAG Benchmark
Evaluate one RAG architecture in depth. Use this when:

- Testing a new RAG implementation
- Debugging a specific architecture
- Running a quick quality check

Results file: `ragas_evaluation_[rag_type]_[timestamp].json`
2. Multi-Model Benchmark
Compare how different LLMs perform with the same RAG architecture. Use this when:

- Selecting the best LLM for your use case
- Understanding model-specific strengths
- Running a cost-benefit analysis across models

Results file: `ragas_multimodel_[rag_type]_[timestamp].json`
3. Comprehensive Benchmark
Test all RAG architectures with all available models. Use this when:

- Conducting research
- Selecting a production configuration
- Publishing results

Results file: `ragas_comprehensive_all_rags_all_models_[timestamp].json`
Running a Comprehensive Benchmark
Run Comprehensive Evaluation
Start the full benchmark. It runs in the background and logs all output:
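The exact entry point depends on your setup; as a minimal sketch, assuming the project exposes a run_comprehensive_evaluation.py script (a hypothetical name), you could launch it like this:

```python
# Sketch only: "run_comprehensive_evaluation.py" is a placeholder for your
# project's actual benchmark entry point; adjust the script name and flags.
import subprocess

with open("benchmark.log", "w") as log:
    proc = subprocess.Popen(
        ["python", "run_comprehensive_evaluation.py"],
        stdout=log,                  # write all output to benchmark.log
        stderr=subprocess.STDOUT,    # interleave errors with normal output
    )

print(f"Benchmark started in the background (PID {proc.pid}).")
print("Follow progress with: tail -f benchmark.log")
```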
Understanding Benchmark Results
The comprehensive benchmark produces a detailed JSON file.

Summary Section
Compare all RAG architectures.

Best Performers

Identify winners for each metric.
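The exact schema depends on the evaluation script; as a rough sketch (the field names below are assumptions, not a guaranteed schema), the loaded file might look like this:

```python
import json

# Path and field names are illustrative assumptions; adapt to your output file.
with open("ragas_comprehensive_all_rags_all_models_20250101_120000.json") as f:
    results = json.load(f)

# Hypothetical layout:
#   results["summary"]          -> per-RAG averages for each RAGAS metric
#   results["best_performers"]  -> the winning architecture per metric
print(results["summary"]["hybrid_rrf"]["avg_score"])      # e.g. 0.891
print(results["best_performers"]["faithfulness"])          # e.g. "hybrid_rrf"
```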
Comparing RAG Architectures
Quality Comparison
Rank by overall average score:

| Rank | RAG Architecture | Avg Score | Faithfulness | Answer Rel. | Ctx Prec. | Ctx Recall |
|---|---|---|---|---|---|---|
| 1 | Hybrid-RRF | 0.891 | 0.924 | 0.875 | 0.901 | 0.864 |
| 2 | Rewriter | 0.872 | 0.901 | 0.887 | 0.834 | 0.894 |
| 3 | Hybrid | 0.858 | 0.912 | 0.783 | 0.891 | 0.845 |
| 4 | HyDE | 0.812 | 0.867 | 0.745 | 0.823 | 0.812 |
| 5 | PageIndex | 0.781 | 0.834 | 0.698 | 0.801 | 0.791 |
| 6 | Simple | 0.623 | 0.850 | 0.265 | 0.779 | 0.600 |
Performance Comparison
Rank by speed and cost:

| Rank | RAG Architecture | Avg Time (s) | Total Cost (10 questions) | Cost per Question |
|---|---|---|---|---|
| 1 | Simple | 8.2 | $0.021 | $0.0021 |
| 2 | Hybrid | 12.6 | $0.039 | $0.0039 |
| 3 | PageIndex | 13.1 | $0.041 | $0.0041 |
| 4 | Hybrid-RRF | 14.8 | $0.048 | $0.0048 |
| 5 | HyDE | 18.3 | $0.067 | $0.0067 |
| 6 | Rewriter | 21.7 | $0.089 | $0.0089 |
Trade-off Analysis
Best Overall Quality
Hybrid-RRF
- Average score: 0.891
- Excels in all metrics
- Cost: $0.048 per 10 questions
Best Balance
Hybrid RAG
- Average score: 0.858 (only 3.7% lower)
- 15% faster than Hybrid-RRF
- 19% cheaper than Hybrid-RRF
Best Performance
Simple Semantic
- Fastest: 8.2s average
- Cheapest: $0.021 total
- Score: 0.623 (acceptable)
Best Recall
Rewriter RAG
- Context recall: 0.894
- Answer relevancy: 0.887
- Most thorough retrieval
Cross-Model Analysis
For multi-model benchmarks, analyze how models perform across RAGs.

Model Performance Matrix
| Model | Avg Score | Faithfulness | Answer Rel. | Cost |
|---|---|---|---|---|
| gpt-5.2 | 0.892 | 0.934 | 0.901 | $0.045 |
| gpt-5 | 0.876 | 0.912 | 0.887 | $0.038 |
| gpt-4o | 0.854 | 0.889 | 0.834 | $0.041 |
| medgemma-1.5-4b | 0.623 | 0.850 | 0.265 | $0.021 |
Insights:
- GPT-5.2 offers the best quality but at a higher cost
- GPT-5 provides the best value (quality/cost ratio)
- Medical-specialized models (medgemma) need more tuning
Best Practices
Fair Comparison
Use identical test data
All RAGs should be evaluated on the exact same questions:
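A minimal sketch of the idea, assuming the questions live in a test_questions.json file (the path and RAG type names are placeholders):

```python
import json

# Hypothetical path: one fixed question set, reused for every architecture.
with open("test_questions.json") as f:
    questions = json.load(f)

RAG_TYPES = ["simple", "hybrid", "hybrid_rrf", "hyde", "rewriter", "pageindex"]

for rag_type in RAG_TYPES:
    # Every architecture is evaluated on exactly the same questions, in the same order.
    print(f"Evaluating {rag_type} on {len(questions)} shared questions")
    # ... call your evaluation entry point here, passing `questions` unchanged ...
```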
Same embedding model
Don’t change embeddings between RAG evaluations:
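For example (the model name is only an example; the point is that it stays fixed for every run):

```python
# Pin the embedding model once; every RAG configuration reuses the same value,
# so retrieval differences come from the architecture, not the embeddings.
EMBEDDING_MODEL = "text-embedding-3-small"  # example value, not a recommendation

def make_config(rag_type: str) -> dict:
    # Only the retrieval strategy varies between runs; the embeddings stay fixed.
    return {"rag_type": rag_type, "embedding_model": EMBEDDING_MODEL}

configs = [make_config(r) for r in ["simple", "hybrid", "hybrid_rrf"]]
print(configs)
```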
Consistent retrieval parameters
Keep k (number of chunks) consistent:
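For instance (k=5 is just an example value):

```python
# Use one shared k for every architecture; otherwise context precision and
# recall are measured over differently sized contexts and are not comparable.
TOP_K = 5  # example value

retrieval_params = {
    rag_type: {"k": TOP_K}
    for rag_type in ["simple", "hybrid", "hybrid_rrf", "hyde", "rewriter", "pageindex"]
}
print(retrieval_params)
```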
Same evaluation conditions
- Run evaluations on the same hardware
- Use the same API tier (avoid rate limits)
- Don’t run in parallel (can affect timing)
Result Storage Organization
Organize results for easy comparison:
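One possible layout, using the filename patterns above (the folder names and timestamps are only illustrative):

```
results/
├── single_rag/
│   └── ragas_evaluation_hybrid_20250101_120000.json
├── multi_model/
│   └── ragas_multimodel_hybrid_20250102_090000.json
└── comprehensive/
    └── ragas_comprehensive_all_rags_all_models_20250103_150000.json
```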
Documentation

Document your benchmark methodology:
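A short methodology note stored next to the results is usually enough; the structure below is only a suggestion:

```markdown
# Benchmark Methodology

- Date: YYYY-MM-DD
- Test set: test_questions.json (same set for every run)
- RAG architectures: Simple, Hybrid, Hybrid-RRF, HyDE, Rewriter, PageIndex
- Models: gpt-5.2, gpt-5, gpt-4o, medgemma-1.5-4b
- Embedding model: fixed across all runs
- Retrieval parameters: identical k for all architectures
- Hardware / API tier: same for all runs
```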
Analyzing Results Programmatically

Load and Compare
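A minimal sketch, assuming results are stored under results/ and that each file exposes per-RAG averages under a summary-style key (both are assumptions; adjust to your schema):

```python
import glob
import json

# Collect every comprehensive results file (path pattern is an assumption).
rows = []
for path in glob.glob("results/comprehensive/*.json"):
    with open(path) as f:
        data = json.load(f)
    # "summary" and "avg_score" are assumed field names.
    for rag_type, metrics in data.get("summary", {}).items():
        rows.append((rag_type, metrics.get("avg_score", 0.0)))

# Rank architectures by average score, best first.
for rag_type, score in sorted(rows, key=lambda r: r[1], reverse=True):
    print(f"{rag_type:12s} {score:.3f}")
```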
Visualize Results
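A simple bar chart of average scores is usually enough for a first look; the sketch below uses matplotlib and the scores from the quality comparison table above:

```python
import matplotlib.pyplot as plt

# Average scores taken from the quality comparison table above.
rags = ["Hybrid-RRF", "Rewriter", "Hybrid", "HyDE", "PageIndex", "Simple"]
scores = [0.891, 0.872, 0.858, 0.812, 0.781, 0.623]

plt.figure(figsize=(8, 4))
plt.bar(rags, scores)
plt.ylabel("Average RAGAS score")
plt.ylim(0, 1)
plt.title("RAG architecture comparison")
plt.tight_layout()
plt.savefig("rag_comparison.png")  # or plt.show() for interactive use
```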
Publishing Results
For research or internal documentation:

LaTeX Table
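One convenient route is pandas, which can emit a LaTeX tabular directly from the comparison data; the values below come from the quality table above, and the styling is left to you:

```python
import pandas as pd

# Subset of the quality comparison table above.
df = pd.DataFrame({
    "RAG Architecture": ["Hybrid-RRF", "Rewriter", "Hybrid", "HyDE", "PageIndex", "Simple"],
    "Avg Score": [0.891, 0.872, 0.858, 0.812, 0.781, 0.623],
    "Faithfulness": [0.924, 0.901, 0.912, 0.867, 0.834, 0.850],
})

# to_latex() returns a ready-to-paste tabular environment.
print(df.to_latex(index=False, float_format="%.3f"))
```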
Next Steps
- RAGAS Metrics: Understand what each metric measures
- Interpreting Results: Detailed guide to analyzing evaluation results
