Overview
The RAGAS Evaluator module provides comprehensive evaluation capabilities for RAG systems using the RAGAS (Retrieval-Augmented Generation Assessment) framework. It supports multiple RAG architectures, various LLM models, and generates detailed performance metrics and cost analysis.
Module Path: src/evaluation/ragas_evaluator.py
Key Features
- Multiple RAG Support: Evaluate Simple, HyDE, Rewriter, Hybrid, Hybrid-RRF, and PageIndex RAG systems
- RAGAS Metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall
- Multi-Model Evaluation: Compare performance across different LLM models
- Performance Tracking: Token usage, execution time, and cost analysis
- Obstetric Dataset: Built-in specialized medical dataset with 10 ground truth questions
- Comprehensive Reporting: JSON export with detailed question-by-question analysis
RAGASEvaluator Class
The main orchestrator class for RAG evaluation workflows.
Constructor
RAGASEvaluator(rag_type: str = "rewriter", debug: bool = False)
Initialize a RAGAS evaluator for a specific RAG architecture.
rag_type - RAG architecture to evaluate. Supported values:
"simple" - Simple Semantic RAG
"hybrid" - Hybrid RAG (BM25 + Semantic)
"hybrid-rrf" - Hybrid RAG with Reciprocal Rank Fusion
"hyde" - HyDE RAG (Hypothetical Documents)
"rewriter" - Multi-Query Rewriter RAG
"pageindex" - PageIndex RAG
debug - Enable debug output for detailed logging and error traces
Instance Attributes:
metrics - List of RAGAS metric objects (faithfulness, answer_relevancy, context_precision, context_recall)
results_dir - Path to results directory (project_root/results)
query_function - RAG-specific query function for evaluation
rag_name - Descriptive name of the RAG system
rag_type - RAG type identifier
llm_model - Default LLM model name (initially "gpt-4o")
performance_metadata - List storing execution metrics per query
Example:
from src.evaluation.ragas_evaluator import RAGASEvaluator
# Initialize evaluator for HyDE RAG
evaluator = RAGASEvaluator(rag_type="hyde", debug=True)
# Initialize with simple RAG
simple_evaluator = RAGASEvaluator(rag_type="simple")
set_models
set_models(llm_model: str = None, embeddings_model: str = None)
Update the LLM model used by the evaluator.
llm_model - New LLM model name (e.g., "gpt-4o", "claude-3-5-sonnet")
embeddings_model - New embeddings model name (currently unused by the implementation)
Example:
evaluator = RAGASEvaluator(rag_type="hybrid")
evaluator.set_models(llm_model="claude-3-5-sonnet")
load_test_queries
load_test_queries(use_obstetric_dataset: bool = True) -> List[Dict]
Load test queries from the obstetric dataset.
use_obstetric_dataset - Whether to use the built-in obstetric dataset (currently the only option)
Returns: List of query dictionaries, each containing:
question (str) - The test question
ground_truth (str) - The expected answer
Example:
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
print(f"Loaded {len(queries)} test queries")
# Output: Loaded 10 test queries
prepare_dataset
prepare_dataset(test_queries: List[Dict]) -> Dataset
Prepare the RAGAS dataset format by executing RAG queries and collecting results.
test_queries - List of test query dictionaries with "question" and "ground_truth" keys
Returns: RAGAS-compatible HuggingFace Dataset with columns:
question - User queries
answer - RAG-generated answers
contexts - Retrieved context chunks
ground_truth - Expected answers
Side Effects:
- Populates self.performance_metadata with execution metrics
- Prints progress information for each query
- Handles errors gracefully and continues processing
Example:
evaluator = RAGASEvaluator(rag_type="rewriter")
test_queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(test_queries)
print(f"Dataset size: {len(dataset)}")
# Output: Dataset size: 10
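The per-query loop behind prepare_dataset can be sketched with the standard library alone. This is an illustrative reconstruction, not the module's actual code: it assumes the RAG query function returns an (answer, contexts) pair and may raise on failure, matching the documented graceful-error behavior.

```python
import time

def prepare_records(test_queries, query_function):
    """Run each query through the RAG and collect RAGAS-style records.

    Illustrative sketch: failed queries are logged and skipped so one
    bad query does not abort the whole run.
    """
    records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    performance_metadata = []
    for i, item in enumerate(test_queries, 1):
        start = time.perf_counter()
        try:
            answer, contexts = query_function(item["question"])
        except Exception as exc:  # log and continue, per the documented behavior
            print(f"Error processing query {i}: {exc}")
            continue
        elapsed = time.perf_counter() - start
        records["question"].append(item["question"])
        records["answer"].append(answer)
        records["contexts"].append(contexts)
        records["ground_truth"].append(item["ground_truth"])
        performance_metadata.append(
            {"question": item["question"], "execution_time": elapsed}
        )
    return records, performance_metadata
```

The resulting column dictionary is the shape `datasets.Dataset.from_dict` accepts.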
evaluate_rag
evaluate_rag(dataset: Dataset) -> Dict[str, Any]
Evaluate RAG using RAGAS metrics with automatic fallback to synchronous mode.
dataset - Prepared RAGAS dataset with questions, answers, contexts, and ground truth
Returns: Evaluation results object with metric scores. Can be accessed as:
- Object attributes (e.g., results.faithfulness)
- Via the to_pandas() method for DataFrame conversion
Behavior:
- Executes RAGAS evaluation with 8 parallel workers
- Disables per-call timeouts for Python 3.14+ compatibility
- Automatically detects all-NaN results and falls back to synchronous evaluation
- Handles async context issues gracefully
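The all-NaN detection that triggers the synchronous fallback can be sketched as below. The row/metric shapes are assumptions for illustration; the real implementation inspects the RAGAS result object.

```python
import math

def all_metrics_nan(rows, metric_names):
    """Return True when every metric score in every row is NaN or missing.

    Illustrative sketch of the fallback trigger: async RAGAS runs can
    occasionally yield all-NaN scores, in which case the evaluator
    re-runs the evaluation synchronously.
    """
    scores = [row.get(name) for row in rows for name in metric_names]
    return bool(scores) and all(
        s is None or (isinstance(s, float) and math.isnan(s)) for s in scores
    )
```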
Example:
evaluator = RAGASEvaluator(rag_type="hybrid-rrf")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)
# Access metrics
print(f"Faithfulness: {results.faithfulness:.3f}")
print(f"Answer Relevancy: {results.answer_relevancy:.3f}")
# Convert to DataFrame
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy']].describe())
display_results
display_results(results)
Display evaluation results in formatted console output.
results - Evaluation results from evaluate_rag()
Output Format:
- Individual metric scores (0-1 scale)
- Average score across all metrics
- Performance assessment (Excellent/Good/Needs improvement/Significant improvements needed)
Example:
evaluator = RAGASEvaluator(rag_type="simple")
evaluator.run_evaluation() # Automatically calls display_results
# Manual display
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
Sample Output:
============================================================
RAGAS EVALUATION RESULTS
============================================================
Faithfulness: 0.872
Answer Relevancy: 0.845
Context Precision: 0.790
Context Recall: 0.823
Average Score: 0.833
Performance: Excellent
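The mapping from average score to the qualitative label can be sketched as follows. The cut-offs are illustrative assumptions consistent with the sample output (0.833 rated "Excellent"); the module's actual thresholds may differ.

```python
def assess_performance(average_score):
    """Map an average RAGAS score (0-1) to a qualitative label.

    The thresholds below are assumed for illustration.
    """
    if average_score >= 0.8:
        return "Excellent"
    if average_score >= 0.7:
        return "Good"
    if average_score >= 0.5:
        return "Needs improvement"
    return "Significant improvements needed"
```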
save_results
save_results(
results,
filename: str = None,
return_data_only: bool = False,
model_name: str = None
) -> Union[Path, Dict, None]
Save evaluation results to a JSON file or return them as a dictionary.
results - Evaluation results from evaluate_rag()
filename - Output filename. If not provided, a timestamped filename is generated
return_data_only - If True, returns a dictionary instead of saving to file
model_name - LLM model name to include in metadata
Returns:
- File path if saved to disk
- Dictionary if return_data_only=True
- None if no results or an error occurred
Output Structure:
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "single_rag_evaluation_hybrid",
"dataset_size": 10,
"rags_evaluated": ["hybrid"],
"model_used": "gpt-4o"
},
"pricing_config": { ... },
"summary": {
"hybrid": {
"rag_name": "Hybrid RAG (BM25 + Semantic)",
"metrics": {
"faithfulness": 0.872,
"answer_relevancy": 0.845,
"context_precision": 0.790,
"context_recall": 0.823
},
"performance": {
"average_execution_time": 2.456,
"total_input_tokens": 12450,
"total_output_tokens": 3200,
"total_cost": 0.045678,
"average_cost_per_question": 0.004568,
"overall_average_score": 0.833
}
}
},
"question_by_question": [ ... ]
}
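The timestamped default filename can be generated with the standard library. The %Y%m%d_%H%M%S pattern is inferred from the sample output path (ragas_evaluation_pageindex_20260311_103045.json); treat it as an assumption about the implementation.

```python
from datetime import datetime

def default_results_filename(rag_type, when=None):
    """Build the timestamped default filename used when none is given."""
    when = when or datetime.now()
    return f"ragas_evaluation_{rag_type}_{when.strftime('%Y%m%d_%H%M%S')}.json"
```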
Example:
# Save to file
evaluator = RAGASEvaluator(rag_type="hyde")
results = evaluator.evaluate_rag(dataset)
filepath = evaluator.save_results(results, model_name="gpt-4o")
print(f"Results saved to: {filepath}")
# Return as dictionary
data = evaluator.save_results(results, return_data_only=True)
print(f"Overall average: {data['summary']['hyde']['performance']['overall_average_score']}")
run_evaluation
run_evaluation()
Execute the complete evaluation pipeline with the obstetric dataset.
Returns: Evaluation results object
Pipeline Steps:
- Load test queries from obstetric dataset
- Prepare RAGAS dataset by executing RAG queries
- Run RAGAS evaluation with configured metrics
- Display formatted results
- Save results to JSON file
Example:
# Complete evaluation workflow
evaluator = RAGASEvaluator(rag_type="pageindex", debug=False)
results = evaluator.run_evaluation()
# Results are automatically displayed and saved
Console Output:
Starting RAGAS evaluation
System: PageIndex RAG
Dataset: Obstetric queries (10 questions)
============================================================
Processing 10 queries with PageIndex RAG
Dataset prepared: 10 queries processed
Starting RAGAS evaluation...
Evaluation completed
============================================================
RAGAS EVALUATION RESULTS
============================================================
...
Results saved to: results/ragas_evaluation_pageindex_20260311_103045.json
Evaluation completed - 10 queries processed
run_multi_model_evaluation
run_multi_model_evaluation(models_to_test: list = None)
Run evaluation for the current RAG type against multiple LLM models.
models_to_test - List of model keys from MODELS_REGISTRY. If None, all registered models are tested.
Behavior:
- Iterates through each model in the list
- Creates custom LLM instances for each model
- Wraps RAG query functions to use the custom model
- Collects results from all models
- Generates consolidated multi-model report
Output File: ragas_multimodel_{rag_type}_{timestamp}.json
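Wrapping the RAG query function to use a custom model can be sketched as a closure. This assumes the underlying query function accepts an optional llm_model keyword, which is a hypothetical interface chosen for illustration.

```python
def wrap_query_with_model(query_function, model_name):
    """Bind a model choice into a RAG query function.

    Illustrative closure: per-model evaluation can reuse the same call
    sites while routing each query to a different LLM.
    """
    def wrapped(question):
        return query_function(question, llm_model=model_name)
    return wrapped
```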
Example:
# Test specific models
evaluator = RAGASEvaluator(rag_type="simple")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)
# Test all registered models
evaluator.run_multi_model_evaluation()
Output Structure:
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "multi_model_rag_comparison",
"rag_type_evaluated": "simple",
"dataset_size": 10,
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
},
"pricing_config": { ... },
"summary": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... },
"gemini-1.5-pro": { ... }
},
"question_by_question": [ ... ]
}
SyncEvaluationResult Class
Lightweight wrapper for synchronous evaluation results, compatible with RAGAS result format.
Constructor
SyncEvaluationResult(dataframe: pd.DataFrame, metrics: List[Any])
dataframe - DataFrame containing metric scores for each evaluation sample
metrics - List of RAGAS metric objects used in the evaluation
Behavior:
- Calculates mean value for each metric and stores as instance attribute
- Provides a to_pandas() method for DataFrame access
- Compatible with downstream result processing pipelines
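The core idea, per-metric means exposed as attributes, can be shown with a stdlib stand-in (the real SyncEvaluationResult wraps a pandas DataFrame):

```python
class MeanResult:
    """Stdlib stand-in for SyncEvaluationResult: stores per-sample
    scores and exposes each metric's mean as an instance attribute."""

    def __init__(self, rows, metric_names):
        self._rows = rows
        for name in metric_names:
            values = [row[name] for row in rows]
            # mirror the documented behavior: one mean attribute per metric
            setattr(self, name, sum(values) / len(values))

    def to_rows(self):
        return list(self._rows)
```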
Example:
from src.evaluation.ragas_evaluator import SyncEvaluationResult
import pandas as pd
# Create result wrapper
df = pd.DataFrame({
'faithfulness': [0.9, 0.8, 0.85],
'answer_relevancy': [0.85, 0.9, 0.88]
})
result = SyncEvaluationResult(df, metrics)  # metrics: RAGAS metric objects from an evaluator
# Access mean scores
print(f"Faithfulness: {result.faithfulness}")
print(f"Answer Relevancy: {result.answer_relevancy}")
# Get DataFrame
df = result.to_pandas()
Helper Functions
evaluate_rewriter_rag
evaluate_rewriter_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Rewriter RAG specifically with optional analysis export.
export_analysis - Export detailed analysis files (CSV, charts) using the export_ragas_analysis utility
Example:
from src.evaluation.ragas_evaluator import evaluate_rewriter_rag
# Basic evaluation
results = evaluate_rewriter_rag()
# With detailed analysis export
results = evaluate_rewriter_rag(export_analysis=True, debug=True)
evaluate_hybrid_rag
evaluate_hybrid_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG (BM25 + Semantic) specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rag
results = evaluate_hybrid_rag(export_analysis=True)
evaluate_hybrid_rrf_rag
evaluate_hybrid_rrf_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG with Reciprocal Rank Fusion.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rrf_rag
results = evaluate_hybrid_rrf_rag()
evaluate_hyde_rag
evaluate_hyde_rag(export_analysis: bool = False, debug: bool = False)
Evaluate HyDE RAG (Hypothetical Documents) specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hyde_rag
results = evaluate_hyde_rag(export_analysis=True)
evaluate_simple_rag
evaluate_simple_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Simple Semantic RAG specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_simple_rag
results = evaluate_simple_rag()
evaluate_pageindex_rag
evaluate_pageindex_rag(export_analysis: bool = False, debug: bool = False)
Evaluate PageIndex RAG specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_pageindex_rag
results = evaluate_pageindex_rag()
evaluate_both_rags
evaluate_both_rags(export_analysis: bool = False, debug: bool = False)
Evaluate both original RAG systems sequentially (Rewriter and Hybrid).
export_analysis - Export detailed analysis files for both systems
Returns: Dictionary with keys "rewriter" and "hybrid" containing the respective results
Example:
from src.evaluation.ragas_evaluator import evaluate_both_rags
results = evaluate_both_rags(export_analysis=True)
print(f"Rewriter score: {results['rewriter'].faithfulness}")
print(f"Hybrid score: {results['hybrid'].faithfulness}")
evaluate_all_rags
evaluate_all_rags(export_analysis: bool = False, debug: bool = False)
Evaluate all 6 RAG systems sequentially with comprehensive comparison report.
export_analysis - Export detailed analysis files for all systems
Returns: Dictionary with keys for each RAG type ("simple", "hyde", "rewriter", "hybrid", "hybrid-rrf", "pageindex")
Behavior:
- Evaluates all 6 RAG systems in sequence
- Generates individual result files for each RAG
- Creates consolidated comparison report with best performer analysis
- Includes 2-second pause between evaluations
Output Files:
- Individual: ragas_evaluation_{rag_type}_{timestamp}.json
- Comparison: ragas_comparison_all_rags_{timestamp}.json
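The "best performer" step of the comparison report can be sketched as a mean-score ranking. The summary shape (rag_type mapped to a dict of metric scores) follows the output examples; the real report may weight metrics differently.

```python
def best_performer(summary):
    """Pick the RAG with the highest mean metric score.

    summary maps rag_type -> {metric_name: score}, matching the
    summary layout shown in the output examples.
    """
    averages = {
        rag: sum(metrics.values()) / len(metrics)
        for rag, metrics in summary.items()
    }
    best = max(averages, key=averages.get)
    return best, averages[best]
```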
Example:
from src.evaluation.ragas_evaluator import evaluate_all_rags
results = evaluate_all_rags(export_analysis=True, debug=False)
# Access individual results
for rag_type, result in results.items():
print(f"{rag_type}: {result.faithfulness:.3f}")
run_all_models_all_rags_evaluation
run_all_models_all_rags_evaluation(export_analysis: bool = False, debug: bool = False)
Comprehensive evaluation: ALL RAG types against ALL LLM models.
export_analysis - Export detailed analysis files (currently unused by this function)
Behavior:
- Tests all 6 RAG types with all models in MODELS_REGISTRY
- Total evaluations: 6 RAGs × N models
- Generates comprehensive consolidated report
- Provides progress indicators and success statistics
Output File: ragas_comprehensive_all_rags_all_models_{timestamp}.json
Example:
from src.evaluation.ragas_evaluator import run_all_models_all_rags_evaluation
# Run comprehensive evaluation
results = run_all_models_all_rags_evaluation(debug=True)
Console Output:
🚀 Starting comprehensive evaluation: ALL RAGs vs ALL Models
RAG types to evaluate: ['simple', 'hybrid', 'hybrid-rrf', 'hyde', 'rewriter', 'pageindex']
Models to test: ['gpt-4o', 'claude-3-5-sonnet', 'gemini-1.5-pro', ...]
Total evaluations: 6 × 8 = 48
============================================================
...
================================================================================
🎉 COMPREHENSIVE EVALUATION COMPLETED!
📄 Report saved to: results/ragas_comprehensive_all_rags_all_models_20260311_103045.json
📊 Total evaluations: 6 RAGs × 8 models = 48
✅ Successful evaluations: 46/48
================================================================================
DATA_GT Dataset
Built-in obstetric and pregnancy-specific evaluation dataset with 10 ground truth questions.
Structure:
DATA_GT = [
{
"question": "¿En qué momento y quien debe reevaluar el riesgo clínico...",
"ground_truth": "El Ginecobstetra en la semana 28 - 30 y semana 34 – 36."
},
# ... 9 more questions
]
Topics Covered:
- Prenatal care scheduling and timing
- Risk assessment protocols
- Clinical evaluation tools (Herrera & Hurtado scale)
- Postpartum depression screening
- Weight gain recommendations by BMI
- VBAC (Vaginal Birth After Cesarean) probabilities
- Nausea and vomiting treatment options
Language: Spanish (Colombian clinical guidelines)
Usage:
from src.evaluation.ragas_evaluator import DATA_GT
print(f"Dataset size: {len(DATA_GT)}")
for i, item in enumerate(DATA_GT, 1):
print(f"Q{i}: {item['question'][:50]}...")
RAGAS Metrics
The evaluator uses four fundamental RAGAS metrics:
Faithfulness
Measures whether the answer is factually consistent with the retrieved contexts. Score range: 0-1 (higher is better).
Calculation: Checks if claims in the answer can be inferred from the contexts without hallucination.
Answer Relevancy
Measures how relevant the answer is to the original question. Score range: 0-1 (higher is better).
Calculation: Uses embeddings to compute semantic similarity between question and answer.
Context Precision
Measures the proportion of relevant contexts in the retrieved set. Score range: 0-1 (higher is better).
Calculation: Evaluates whether retrieved contexts are actually useful for answering the question.
Context Recall
Measures whether all necessary information from ground truth is present in retrieved contexts. Score range: 0-1 (higher is better).
Calculation: Checks if ground truth answer can be derived from the retrieved contexts.
Output File Formats
Single RAG Evaluation
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "single_rag_evaluation_hybrid",
"dataset_size": 10,
"rags_evaluated": ["hybrid"],
"model_used": "gpt-4o"
},
"pricing_config": {
"gpt-4o": {
"input": 0.0025,
"output": 0.01
}
},
"summary": {
"hybrid": {
"rag_name": "Hybrid RAG (BM25 + Semantic)",
"metrics": {
"faithfulness": 0.872,
"answer_relevancy": 0.845,
"context_precision": 0.790,
"context_recall": 0.823
},
"performance": {
"average_execution_time": 2.456,
"total_input_tokens": 12450,
"total_output_tokens": 3200,
"total_cost": 0.045678,
"average_cost_per_question": 0.004568,
"overall_average_score": 0.833
}
}
},
"question_by_question": [
{
"question_id": 1,
"question": "¿En qué momento y quien debe reevaluar...",
"ground_truth": "El Ginecobstetra en la semana 28 - 30...",
"rag_results": {
"hybrid": {
"answer": "Según las guías clínicas...",
"contexts_count": 5,
"metrics": {
"faithfulness": 0.900,
"answer_relevancy": 0.875,
"context_precision": 0.800,
"context_recall": 0.850
},
"performance": {
"question": "¿En qué momento y quien debe reevaluar...",
"execution_time": 2.345,
"input_tokens": 1245,
"output_tokens": 320,
"total_cost": 0.004567,
"cost_source": "precise"
}
}
}
}
]
}
Multi-Model Comparison
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "multi_model_rag_comparison",
"rag_type_evaluated": "simple",
"dataset_size": 10,
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
},
"summary": {
"gpt-4o": {
"model_name": "gpt-4o",
"metrics": { ... },
"performance": { ... }
},
"claude-3-5-sonnet": {
"model_name": "claude-3-5-sonnet",
"metrics": { ... },
"performance": { ... }
}
},
"question_by_question": [
{
"question_id": 1,
"question": "...",
"ground_truth": "...",
"rag_results": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
}
}
]
}
Comprehensive All RAGs All Models
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "comprehensive_all_rags_all_models",
"dataset_size": 10,
"rags_evaluated": ["simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"],
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", ...],
"total_evaluations": 48
},
"summary": {
"simple": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
},
"hybrid": { ... }
},
"question_by_question": [
{
"question_id": 1,
"question": "...",
"ground_truth": "...",
"rag_results": {
"simple": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
},
"hybrid": { ... }
}
}
]
}
Command-Line Usage
The module can be executed directly from the command line:
# Evaluate specific RAG type
python src/evaluation/ragas_evaluator.py simple
python src/evaluation/ragas_evaluator.py hybrid-rrf
# Evaluate with detailed analysis export
python src/evaluation/ragas_evaluator.py hyde --export
# Evaluate with debug output
python src/evaluation/ragas_evaluator.py rewriter --debug
# Evaluate multiple RAGs
python src/evaluation/ragas_evaluator.py both # Rewriter + Hybrid
python src/evaluation/ragas_evaluator.py all # All 6 RAGs
# Multi-model evaluation
python src/evaluation/ragas_evaluator.py multi-model simple
# Comprehensive evaluation (all RAGs × all models)
python src/evaluation/ragas_evaluator.py all-models-all-rags
Available Flags:
--export or -e - Export detailed analysis files
--debug or -d - Enable debug output
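The documented CLI shape (one positional mode/RAG type plus optional flags) can be parsed with a few lines of stdlib code. This is an illustrative sketch; the module's actual argument handling may differ.

```python
def parse_cli(argv):
    """Parse the documented CLI: a positional mode plus optional
    --export/-e and --debug/-d flags (illustrative only)."""
    flags = {a for a in argv if a.startswith("-")}
    positionals = [a for a in argv if not a.startswith("-")]
    return {
        "mode": positionals[0] if positionals else "rewriter",  # assumed default
        "export": bool(flags & {"--export", "-e"}),
        "debug": bool(flags & {"--debug", "-d"}),
    }
```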
Complete Usage Example
from src.evaluation.ragas_evaluator import RAGASEvaluator
# 1. Basic Evaluation
evaluator = RAGASEvaluator(rag_type="hybrid", debug=False)
results = evaluator.run_evaluation()
# 2. Custom Workflow
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
filepath = evaluator.save_results(results, model_name="gpt-4o")
# 3. Multi-Model Comparison
evaluator = RAGASEvaluator(rag_type="hyde")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)
# 4. Change Model During Evaluation
evaluator = RAGASEvaluator(rag_type="pageindex")
evaluator.set_models(llm_model="claude-3-5-sonnet")
results = evaluator.run_evaluation()
# 5. Access Detailed Metrics
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']])
print(f"Mean Faithfulness: {df['faithfulness'].mean():.3f}")
print(f"Std Faithfulness: {df['faithfulness'].std():.3f}")
# 6. Programmatic Comparison
from src.evaluation.ragas_evaluator import evaluate_all_rags
all_results = evaluate_all_rags(export_analysis=True)
for rag_type, result in all_results.items():
print(f"\n{rag_type.upper()} Performance:")
print(f" Faithfulness: {result.faithfulness:.3f}")
print(f" Answer Relevancy: {result.answer_relevancy:.3f}")
Execution Time
- Single RAG evaluation: ~3-5 minutes for 10 questions
- Multi-model evaluation: ~5-10 minutes per model
- All RAGs evaluation: ~20-30 minutes total
- Comprehensive (all RAGs × all models): 1-2 hours depending on number of models
Cost Implications
Evaluation costs depend on:
- LLM model pricing (input/output tokens)
- Number of RAG queries
- Retrieved context length
- Answer generation length
Example costs per 10-question evaluation:
- GPT-4o: ~$0.04-0.06
- Claude-3.5-Sonnet: ~$0.05-0.08
- GPT-3.5-turbo: ~$0.01-0.02
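Per-query cost follows from token counts and the pricing_config rates. The sketch below assumes those rates are USD per 1,000 tokens (consistent with the gpt-4o sample: input 0.0025, output 0.01); verify the unit against src.common.pricing before relying on it.

```python
def query_cost(input_tokens, output_tokens, pricing):
    """Estimate one query's cost, assuming per-1K-token USD rates."""
    return (input_tokens / 1000) * pricing["input"] \
         + (output_tokens / 1000) * pricing["output"]
```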
Async vs Sync Mode
The evaluator automatically handles async context issues:
- Default: Async evaluation with 8 workers (faster)
- Fallback: Synchronous evaluation with 1 worker (more stable)
- Trigger: All-NaN metrics or async exceptions
Error Handling
Common Issues
1. All-NaN Metrics
Detected all-NaN async metric output. Switching to synchronous fallback...
Solution: Automatically handled by fallback mechanism.
2. Model Not Found
Model 'invalid-model' not found in registry. Skipping.
Solution: Use valid model keys from MODELS_REGISTRY.
3. Query Execution Failure
Error processing query 5: Connection timeout
Solution: Individual query failures are logged but don't stop the evaluation.
4. Unsupported RAG Type
ValueError: Unsupported RAG type: custom. Use 'rewriter', 'hybrid', ...
Solution: Use one of the 6 supported RAG types.
Dependencies
Required Packages:
ragas - RAGAS evaluation framework
datasets - HuggingFace datasets library
pandas - Data manipulation
numpy - Numerical operations
Internal Dependencies:
src.rag.* - RAG system implementations
src.common.model_provider - LLM model management
src.common.pricing - Cost tracking utilities
src.common.utils - Analysis export utilities
See Also