Overview

The RAGAS Evaluator module provides comprehensive evaluation capabilities for RAG systems using the RAGAS (Retrieval-Augmented Generation Assessment) framework. It supports multiple RAG architectures and LLM models, and generates detailed performance metrics and cost analysis.

Module Path: src/evaluation/ragas_evaluator.py

Key Features

  • Multiple RAG Support: Evaluate Simple, HyDE, Rewriter, Hybrid, Hybrid-RRF, and PageIndex RAG systems
  • RAGAS Metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall
  • Multi-Model Evaluation: Compare performance across different LLM models
  • Performance Tracking: Token usage, execution time, and cost analysis
  • Obstetric Dataset: Built-in specialized medical dataset with 10 ground truth questions
  • Comprehensive Reporting: JSON export with detailed question-by-question analysis

RAGASEvaluator Class

The main orchestrator class for RAG evaluation workflows.

Constructor

RAGASEvaluator(rag_type: str = "rewriter", debug: bool = False)
Initialize a RAGAS evaluator for a specific RAG architecture.
rag_type (str, default "rewriter") - RAG architecture to evaluate. Supported values:
  • "simple" - Simple Semantic RAG
  • "hybrid" - Hybrid RAG (BM25 + Semantic)
  • "hybrid-rrf" - Hybrid RAG with Reciprocal Rank Fusion
  • "hyde" - HyDE RAG (Hypothetical Documents)
  • "rewriter" - Multi-Query Rewriter RAG
  • "pageindex" - PageIndex RAG
debug (bool, default False) - Enable debug output for detailed logging and error traces
Instance Attributes:
  • metrics - List of RAGAS metric objects (faithfulness, answer_relevancy, context_precision, context_recall)
  • results_dir - Path to results directory (project_root/results)
  • query_function - RAG-specific query function for evaluation
  • rag_name - Descriptive name of the RAG system
  • rag_type - RAG type identifier
  • llm_model - Default LLM model name (initially "gpt-4o")
  • performance_metadata - List storing execution metrics per query
Example:
from src.evaluation.ragas_evaluator import RAGASEvaluator

# Initialize evaluator for HyDE RAG
evaluator = RAGASEvaluator(rag_type="hyde", debug=True)

# Initialize with simple RAG
simple_evaluator = RAGASEvaluator(rag_type="simple")

set_models

set_models(llm_model: str = None, embeddings_model: str = None)
Update the LLM model used by the evaluator.
llm_model (str) - New LLM model name (e.g., "gpt-4o", "claude-3-5-sonnet")
embeddings_model (str) - New embeddings model name (currently not used in implementation)
Example:
evaluator = RAGASEvaluator(rag_type="hybrid")
evaluator.set_models(llm_model="claude-3-5-sonnet")

load_test_queries

load_test_queries(use_obstetric_dataset: bool = True) -> List[Dict]
Load test queries from the obstetric dataset.
use_obstetric_dataset (bool, default True) - Whether to use the built-in obstetric dataset (currently the only option)
Returns (List[Dict]) - List of query dictionaries, each containing:
  • question (str) - The test question
  • ground_truth (str) - The expected answer
Example:
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
print(f"Loaded {len(queries)} test queries")
# Output: Loaded 10 test queries

prepare_dataset

prepare_dataset(test_queries: List[Dict]) -> Dataset
Prepare RAGAS dataset format by executing RAG queries and collecting results.
test_queries (List[Dict], required) - List of test query dictionaries with "question" and "ground_truth" keys
Returns (Dataset) - RAGAS-compatible HuggingFace Dataset with columns:
  • question - User queries
  • answer - RAG-generated answers
  • contexts - Retrieved context chunks
  • ground_truth - Expected answers
Side Effects:
  • Populates self.performance_metadata with execution metrics
  • Prints progress information for each query
  • Handles errors gracefully and continues processing
Example:
evaluator = RAGASEvaluator(rag_type="rewriter")
test_queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(test_queries)
print(f"Dataset size: {len(dataset)}")
# Output: Dataset size: 10
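Continuing the example above, the collected per-query metrics can be inspected directly. This is a minimal sketch; the field names (execution_time, input_tokens, output_tokens) are assumptions taken from the "performance" entries in the Result Format section, not from the module's source.

# Sketch only: inspect per-query metrics gathered during prepare_dataset().
# Field names are assumed from the "performance" entries in the result format.
for entry in evaluator.performance_metadata:
    question = entry.get("question", "")[:40]
    print(f"{question}... | {entry.get('execution_time')}s | "
          f"{entry.get('input_tokens')} in / {entry.get('output_tokens')} out tokens")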

evaluate_rag

evaluate_rag(dataset: Dataset) -> Dict[str, Any]
Evaluate RAG using RAGAS metrics with automatic fallback to synchronous mode.
dataset (Dataset, required) - Prepared RAGAS dataset with questions, answers, contexts, and ground truth
Returns (Dict[str, Any]) - Evaluation results object with metric scores. Can be accessed as:
  • Object attributes (e.g., results.faithfulness)
  • Via to_pandas() method for DataFrame conversion
Behavior:
  • Executes RAGAS evaluation with 8 parallel workers
  • Disables per-call timeouts for Python 3.14+ compatibility
  • Automatically detects all-NaN results and falls back to synchronous evaluation
  • Handles async context issues gracefully
Example:
evaluator = RAGASEvaluator(rag_type="hybrid-rrf")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)

# Access metrics
print(f"Faithfulness: {results.faithfulness:.3f}")
print(f"Answer Relevancy: {results.answer_relevancy:.3f}")

# Convert to DataFrame
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy']].describe())

display_results

display_results(results)
Display evaluation results in a formatted console output.
results (required) - Evaluation results from evaluate_rag()
Output Format:
  • Individual metric scores (0-1 scale)
  • Average score across all metrics
  • Performance assessment (Excellent/Good/Needs improvement/Significant improvements needed)
Example:
evaluator = RAGASEvaluator(rag_type="simple")
evaluator.run_evaluation()  # Automatically calls display_results

# Manual display
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
Sample Output:
============================================================
RAGAS EVALUATION RESULTS
============================================================
Faithfulness: 0.872
Answer Relevancy: 0.845
Context Precision: 0.790
Context Recall: 0.823

Average Score: 0.833
Performance: Excellent
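The score thresholds behind the assessment labels are not documented here; the sketch below uses assumed cut-offs purely for illustration (consistent with the 0.833 = Excellent sample above), not the module's actual values.

def assess_performance(avg_score: float) -> str:
    # Assumed thresholds for illustration only; the module's real cut-offs may differ.
    if avg_score >= 0.8:
        return "Excellent"
    if avg_score >= 0.7:
        return "Good"
    if avg_score >= 0.5:
        return "Needs improvement"
    return "Significant improvements needed"

print(assess_performance(0.833))  # Excellent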

save_results

save_results(
    results,
    filename: str = None,
    return_data_only: bool = False,
    model_name: str = None
) -> Union[Path, Dict, None]
Save evaluation results to JSON file or return as dictionary.
results (required) - Evaluation results from evaluate_rag()
filename (str) - Output filename. If not provided, generates timestamped filename
return_data_only (bool, default False) - If True, returns dictionary instead of saving to file
model_name (str) - LLM model name to include in metadata
Returns (Union[Path, Dict, None]):
  • File path if saved to disk
  • Dictionary if return_data_only=True
  • None if no results or error occurred
Output Structure:
{
  "metadata": {
    "timestamp": "2026-03-11T10:30:45.123456",
    "evaluation_type": "single_rag_evaluation_hybrid",
    "dataset_size": 10,
    "rags_evaluated": ["hybrid"],
    "model_used": "gpt-4o"
  },
  "pricing_config": { ... },
  "summary": {
    "hybrid": {
      "rag_name": "Hybrid RAG (BM25 + Semantic)",
      "metrics": {
        "faithfulness": 0.872,
        "answer_relevancy": 0.845,
        "context_precision": 0.790,
        "context_recall": 0.823
      },
      "performance": {
        "average_execution_time": 2.456,
        "total_input_tokens": 12450,
        "total_output_tokens": 3200,
        "total_cost": 0.045678,
        "average_cost_per_question": 0.004568,
        "overall_average_score": 0.833
      }
    }
  },
  "question_by_question": [ ... ]
}
Example:
# Save to file
evaluator = RAGASEvaluator(rag_type="hyde")
results = evaluator.evaluate_rag(dataset)
filepath = evaluator.save_results(results, model_name="gpt-4o")
print(f"Results saved to: {filepath}")

# Return as dictionary
data = evaluator.save_results(results, return_data_only=True)
print(f"Overall average: {data['summary']['hyde']['performance']['overall_average_score']}")

run_evaluation

run_evaluation() -> Any
Execute complete evaluation pipeline with the obstetric dataset.
Returns: Evaluation results object
Pipeline Steps:
  1. Load test queries from obstetric dataset
  2. Prepare RAGAS dataset by executing RAG queries
  3. Run RAGAS evaluation with configured metrics
  4. Display formatted results
  5. Save results to JSON file
Example:
# Complete evaluation workflow
evaluator = RAGASEvaluator(rag_type="pageindex", debug=False)
results = evaluator.run_evaluation()

# Results are automatically displayed and saved
Console Output:
Starting RAGAS evaluation
System: PageIndex RAG
Dataset: Obstetric queries (10 questions)
============================================================
Processing 10 queries with PageIndex RAG
Dataset prepared: 10 queries processed
Starting RAGAS evaluation...
Evaluation completed

============================================================
RAGAS EVALUATION RESULTS
============================================================
...
Results saved to: results/ragas_evaluation_pageindex_20260311_103045.json

Evaluation completed - 10 queries processed

run_multi_model_evaluation

run_multi_model_evaluation(models_to_test: list = None)
Run evaluation for the current RAG type against multiple LLM models.
models_to_test (list) - List of model keys from MODELS_REGISTRY. If None, tests all registered models.
Behavior:
  • Iterates through each model in the list
  • Creates custom LLM instances for each model
  • Wraps RAG query functions to use the custom model
  • Collects results from all models
  • Generates consolidated multi-model report
Output File: ragas_multimodel_{rag_type}_{timestamp}.json
Example:
# Test specific models
evaluator = RAGASEvaluator(rag_type="simple")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)

# Test all registered models
evaluator.run_multi_model_evaluation()
Output Structure:
{
  "metadata": {
    "timestamp": "2026-03-11T10:30:45.123456",
    "evaluation_type": "multi_model_rag_comparison",
    "rag_type_evaluated": "simple",
    "dataset_size": 10,
    "models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
  },
  "pricing_config": { ... },
  "summary": {
    "gpt-4o": { ... },
    "claude-3-5-sonnet": { ... },
    "gemini-1.5-pro": { ... }
  },
  "question_by_question": [ ... ]
}

SyncEvaluationResult Class

Lightweight wrapper for synchronous evaluation results, compatible with RAGAS result format.

Constructor

SyncEvaluationResult(dataframe: pd.DataFrame, metrics: List[Any])
dataframe (pd.DataFrame, required) - DataFrame containing metric scores for each evaluation sample
metrics (List[Any], required) - List of RAGAS metric objects used in evaluation
Behavior:
  • Calculates mean value for each metric and stores as instance attribute
  • Provides to_pandas() method for DataFrame access
  • Compatible with downstream result processing pipelines
Example:
from src.evaluation.ragas_evaluator import SyncEvaluationResult
import pandas as pd

# Create result wrapper
df = pd.DataFrame({
    'faithfulness': [0.9, 0.8, 0.85],
    'answer_relevancy': [0.85, 0.9, 0.88]
})
result = SyncEvaluationResult(df, metrics)

# Access mean scores
print(f"Faithfulness: {result.faithfulness}")
print(f"Answer Relevancy: {result.answer_relevancy}")

# Get DataFrame
df = result.to_pandas()

Helper Functions

evaluate_rewriter_rag

evaluate_rewriter_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Rewriter RAG specifically with optional analysis export.
export_analysis (bool, default False) - Export detailed analysis files (CSV, charts) using export_ragas_analysis utility
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_rewriter_rag

# Basic evaluation
results = evaluate_rewriter_rag()

# With detailed analysis export
results = evaluate_rewriter_rag(export_analysis=True, debug=True)

evaluate_hybrid_rag

evaluate_hybrid_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG (BM25 + Semantic) specifically.
export_analysis (bool, default False) - Export detailed analysis files
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rag

results = evaluate_hybrid_rag(export_analysis=True)

evaluate_hybrid_rrf_rag

evaluate_hybrid_rrf_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG with Reciprocal Rank Fusion.
export_analysis (bool, default False) - Export detailed analysis files
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rrf_rag

results = evaluate_hybrid_rrf_rag()

evaluate_hyde_rag

evaluate_hyde_rag(export_analysis: bool = False, debug: bool = False)
Evaluate HyDE RAG (Hypothetical Documents) specifically.
export_analysis (bool, default False) - Export detailed analysis files
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_hyde_rag

results = evaluate_hyde_rag(export_analysis=True)

evaluate_simple_rag

evaluate_simple_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Simple Semantic RAG specifically.
export_analysis (bool, default False) - Export detailed analysis files
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_simple_rag

results = evaluate_simple_rag()

evaluate_pageindex_rag

evaluate_pageindex_rag(export_analysis: bool = False, debug: bool = False)
Evaluate PageIndex RAG specifically.
export_analysis (bool, default False) - Export detailed analysis files
debug (bool, default False) - Enable debug output
Example:
from src.evaluation.ragas_evaluator import evaluate_pageindex_rag

results = evaluate_pageindex_rag()

evaluate_both_rags

evaluate_both_rags(export_analysis: bool = False, debug: bool = False)
Evaluate both original RAG systems sequentially (Rewriter and Hybrid).
export_analysis (bool, default False) - Export detailed analysis files for both systems
debug (bool, default False) - Enable debug output
Returns (Dict) - Dictionary with keys "rewriter" and "hybrid" containing respective results
Example:
from src.evaluation.ragas_evaluator import evaluate_both_rags

results = evaluate_both_rags(export_analysis=True)
print(f"Rewriter score: {results['rewriter'].faithfulness}")
print(f"Hybrid score: {results['hybrid'].faithfulness}")

evaluate_all_rags

evaluate_all_rags(export_analysis: bool = False, debug: bool = False)
Evaluate all 6 RAG systems sequentially with comprehensive comparison report.
export_analysis (bool, default False) - Export detailed analysis files for all systems
debug (bool, default False) - Enable debug output
Returns (Dict) - Dictionary with keys for each RAG type ("simple", "hyde", "rewriter", "hybrid", "hybrid-rrf", "pageindex")
Behavior:
  • Evaluates all 6 RAG systems in sequence
  • Generates individual result files for each RAG
  • Creates consolidated comparison report with best performer analysis
  • Includes 2-second pause between evaluations
Output Files:
  • Individual: ragas_evaluation_{rag_type}_{timestamp}.json
  • Comparison: ragas_comparison_all_rags_{timestamp}.json
Example:
from src.evaluation.ragas_evaluator import evaluate_all_rags

results = evaluate_all_rags(export_analysis=True, debug=False)

# Access individual results
for rag_type, result in results.items():
    print(f"{rag_type}: {result.faithfulness:.3f}")

run_all_models_all_rags_evaluation

run_all_models_all_rags_evaluation(export_analysis: bool = False, debug: bool = False)
Comprehensive evaluation: ALL RAG types against ALL LLM models.
export_analysis (bool, default False) - Export detailed analysis files (currently not used in this function)
debug (bool, default False) - Enable debug output
Behavior:
  • Tests all 6 RAG types with all models in MODELS_REGISTRY
  • Total evaluations: 6 RAGs × N models
  • Generates comprehensive consolidated report
  • Provides progress indicators and success statistics
Output File: ragas_comprehensive_all_rags_all_models_{timestamp}.json
Example:
from src.evaluation.ragas_evaluator import run_all_models_all_rags_evaluation

# Run comprehensive evaluation
results = run_all_models_all_rags_evaluation(debug=True)
Console Output:
🚀 Starting comprehensive evaluation: ALL RAGs vs ALL Models
RAG types to evaluate: ['simple', 'hybrid', 'hybrid-rrf', 'hyde', 'rewriter', 'pageindex']
Models to test: ['gpt-4o', 'claude-3-5-sonnet', 'gemini-1.5-pro', ...]
Total evaluations: 6 × 8 = 48
============================================================
...
================================================================================
🎉 COMPREHENSIVE EVALUATION COMPLETED!
📄 Report saved to: results/ragas_comprehensive_all_rags_all_models_20260311_103045.json
📊 Total evaluations: 6 RAGs × 8 models = 48
✅ Successful evaluations: 46/48
================================================================================

DATA_GT Dataset

Built-in obstetric and pregnancy-specific evaluation dataset with 10 ground truth questions.
Structure:
DATA_GT = [
    {
        "question": "¿En qué momento y quien debe reevaluar el riesgo clínico...",
        "ground_truth": "El Ginecobstetra en la semana 28 - 30 y semana 34 – 36."
    },
    # ... 9 more questions
]
Topics Covered:
  • Prenatal care scheduling and timing
  • Risk assessment protocols
  • Clinical evaluation tools (Herrera & Hurtado scale)
  • Postpartum depression screening
  • Weight gain recommendations by BMI
  • VBAC (Vaginal Birth After Cesarean) probabilities
  • Nausea and vomiting treatment options
Language: Spanish (Colombian clinical guidelines)
Usage:
from src.evaluation.ragas_evaluator import DATA_GT

print(f"Dataset size: {len(DATA_GT)}")
for i, item in enumerate(DATA_GT, 1):
    print(f"Q{i}: {item['question'][:50]}...")

RAGAS Metrics

The evaluator uses four fundamental RAGAS metrics:

Faithfulness

Measures whether the answer is factually consistent with the retrieved contexts. Score range: 0-1 (higher is better). Calculation: Checks if claims in the answer can be inferred from the contexts without hallucination.

Answer Relevancy

Measures how relevant the answer is to the original question. Score range: 0-1 (higher is better). Calculation: Uses embeddings to compute semantic similarity between question and answer.

Context Precision

Measures the proportion of relevant contexts in the retrieved set. Score range: 0-1 (higher is better). Calculation: Evaluates whether retrieved contexts are actually useful for answering the question.

Context Recall

Measures whether all necessary information from ground truth is present in retrieved contexts. Score range: 0-1 (higher is better). Calculation: Checks if ground truth answer can be derived from the retrieved contexts.
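For reference, the snippet below is a minimal, self-contained sketch of scoring these four metrics with the ragas library directly (classic ragas.metrics API assumed); the evaluator wires up the same metrics internally through prepare_dataset() and evaluate_rag().

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# Toy single-row dataset with the four columns RAGAS expects.
toy = Dataset.from_dict({
    "question": ["When is clinical risk reassessed?"],
    "answer": ["At weeks 28-30 and 34-36, by the obstetrician."],
    "contexts": [["The obstetrician reassesses risk at weeks 28-30 and 34-36."]],
    "ground_truth": ["The obstetrician, at weeks 28-30 and 34-36."],
})

scores = evaluate(toy, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(scores)              # aggregate 0-1 scores per metric
print(scores.to_pandas())  # per-sample scores
Note that ragas needs an LLM and embeddings backend configured (an OpenAI API key by default) to compute these scores.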

Result Format

Single RAG Evaluation

{
  "metadata": {
    "timestamp": "2026-03-11T10:30:45.123456",
    "evaluation_type": "single_rag_evaluation_hybrid",
    "dataset_size": 10,
    "rags_evaluated": ["hybrid"],
    "model_used": "gpt-4o"
  },
  "pricing_config": {
    "gpt-4o": {
      "input": 0.0025,
      "output": 0.01
    }
  },
  "summary": {
    "hybrid": {
      "rag_name": "Hybrid RAG (BM25 + Semantic)",
      "metrics": {
        "faithfulness": 0.872,
        "answer_relevancy": 0.845,
        "context_precision": 0.790,
        "context_recall": 0.823
      },
      "performance": {
        "average_execution_time": 2.456,
        "total_input_tokens": 12450,
        "total_output_tokens": 3200,
        "total_cost": 0.045678,
        "average_cost_per_question": 0.004568,
        "overall_average_score": 0.833
      }
    }
  },
  "question_by_question": [
    {
      "question_id": 1,
      "question": "¿En qué momento y quien debe reevaluar...",
      "ground_truth": "El Ginecobstetra en la semana 28 - 30...",
      "rag_results": {
        "hybrid": {
          "answer": "Según las guías clínicas...",
          "contexts_count": 5,
          "metrics": {
            "faithfulness": 0.900,
            "answer_relevancy": 0.875,
            "context_precision": 0.800,
            "context_recall": 0.850
          },
          "performance": {
            "question": "¿En qué momento y quien debe reevaluar...",
            "execution_time": 2.345,
            "input_tokens": 1245,
            "output_tokens": 320,
            "total_cost": 0.004567,
            "cost_source": "precise"
          }
        }
      }
    }
  ]
}
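The saved file can be post-processed with plain json. The following sketch (not part of the module) loads a single-RAG result like the one above and prints its aggregate metrics; the file name is illustrative.

import json
from pathlib import Path

path = Path("results/ragas_evaluation_hybrid_20260311_103045.json")  # illustrative name
report = json.loads(path.read_text(encoding="utf-8"))

for rag_type, entry in report["summary"].items():
    print(f"{entry['rag_name']} ({rag_type})")
    for metric, score in entry["metrics"].items():
        print(f"  {metric}: {score:.3f}")
    perf = entry["performance"]
    print(f"  overall average: {perf['overall_average_score']:.3f}")
    print(f"  total cost: ${perf['total_cost']:.4f}")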

Multi-Model Comparison

{
  "metadata": {
    "timestamp": "2026-03-11T10:30:45.123456",
    "evaluation_type": "multi_model_rag_comparison",
    "rag_type_evaluated": "simple",
    "dataset_size": 10,
    "models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
  },
  "summary": {
    "gpt-4o": {
      "model_name": "gpt-4o",
      "metrics": { ... },
      "performance": { ... }
    },
    "claude-3-5-sonnet": {
      "model_name": "claude-3-5-sonnet",
      "metrics": { ... },
      "performance": { ... }
    }
  },
  "question_by_question": [
    {
      "question_id": 1,
      "question": "...",
      "ground_truth": "...",
      "rag_results": {
        "gpt-4o": { ... },
        "claude-3-5-sonnet": { ... }
      }
    }
  ]
}

Comprehensive All RAGs All Models

{
  "metadata": {
    "timestamp": "2026-03-11T10:30:45.123456",
    "evaluation_type": "comprehensive_all_rags_all_models",
    "dataset_size": 10,
    "rags_evaluated": ["simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"],
    "models_evaluated": ["gpt-4o", "claude-3-5-sonnet", ...],
    "total_evaluations": 48
  },
  "summary": {
    "simple": {
      "gpt-4o": { ... },
      "claude-3-5-sonnet": { ... }
    },
    "hybrid": { ... }
  },
  "question_by_question": [
    {
      "question_id": 1,
      "question": "...",
      "ground_truth": "...",
      "rag_results": {
        "simple": {
          "gpt-4o": { ... },
          "claude-3-5-sonnet": { ... }
        },
        "hybrid": { ... }
      }
    }
  ]
}

Command-Line Usage

The module can be executed directly from the command line:
# Evaluate specific RAG type
python src/evaluation/ragas_evaluator.py simple
python src/evaluation/ragas_evaluator.py hybrid-rrf

# Evaluate with detailed analysis export
python src/evaluation/ragas_evaluator.py hyde --export

# Evaluate with debug output
python src/evaluation/ragas_evaluator.py rewriter --debug

# Evaluate multiple RAGs
python src/evaluation/ragas_evaluator.py both        # Rewriter + Hybrid
python src/evaluation/ragas_evaluator.py all         # All 6 RAGs

# Multi-model evaluation
python src/evaluation/ragas_evaluator.py multi-model simple

# Comprehensive evaluation (all RAGs × all models)
python src/evaluation/ragas_evaluator.py all-models-all-rags
Available Flags:
  • --export or -e - Export detailed analysis files
  • --debug or -d - Enable debug output

Complete Usage Example

from src.evaluation.ragas_evaluator import RAGASEvaluator

# 1. Basic Evaluation
evaluator = RAGASEvaluator(rag_type="hybrid", debug=False)
results = evaluator.run_evaluation()

# 2. Custom Workflow
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
filepath = evaluator.save_results(results, model_name="gpt-4o")

# 3. Multi-Model Comparison
evaluator = RAGASEvaluator(rag_type="hyde")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)

# 4. Change Model During Evaluation
evaluator = RAGASEvaluator(rag_type="pageindex")
evaluator.set_models(llm_model="claude-3-5-sonnet")
results = evaluator.run_evaluation()

# 5. Access Detailed Metrics
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']])
print(f"Mean Faithfulness: {df['faithfulness'].mean():.3f}")
print(f"Std Faithfulness: {df['faithfulness'].std():.3f}")

# 6. Programmatic Comparison
from src.evaluation.ragas_evaluator import evaluate_all_rags
all_results = evaluate_all_rags(export_analysis=True)

for rag_type, result in all_results.items():
    print(f"\n{rag_type.upper()} Performance:")
    print(f"  Faithfulness: {result.faithfulness:.3f}")
    print(f"  Answer Relevancy: {result.answer_relevancy:.3f}")

Performance Considerations

Execution Time

  • Single RAG evaluation: ~3-5 minutes for 10 questions
  • Multi-model evaluation: ~5-10 minutes per model
  • All RAGs evaluation: ~20-30 minutes total
  • Comprehensive (all RAGs × all models): 1-2 hours depending on number of models

Cost Implications

Evaluation costs depend on:
  • LLM model pricing (input/output tokens)
  • Number of RAG queries
  • Retrieved context length
  • Answer generation length
Example costs per 10-question evaluation:
  • GPT-4o: ~$0.04-0.06
  • Claude-3.5-Sonnet: ~$0.05-0.08
  • GPT-3.5-turbo: ~$0.01-0.02
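As a rough sanity check, per-evaluation cost follows directly from token counts and the pricing_config rates. The sketch below assumes those rates are USD per 1,000 tokens and uses illustrative token counts.

# Assumed: pricing_config rates are USD per 1,000 tokens; token counts are illustrative.
price_in, price_out = 0.0025, 0.01           # gpt-4o rates from pricing_config
input_tokens, output_tokens = 10_000, 2_500  # roughly a 10-question evaluation
cost = input_tokens / 1000 * price_in + output_tokens / 1000 * price_out
print(f"Estimated cost: ${cost:.3f}")        # $0.050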

Async vs Sync Mode

The evaluator automatically handles async context issues:
  • Default: Async evaluation with 8 workers (faster)
  • Fallback: Synchronous evaluation with 1 worker (more stable)
  • Trigger: All-NaN metrics or async exceptions
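The sketch below illustrates this fallback pattern, assuming the classic ragas evaluate()/RunConfig API; the module's internal implementation may differ in detail.

from ragas import evaluate
from ragas.run_config import RunConfig

def evaluate_with_fallback(dataset, metrics):
    # Fast path: parallel evaluation with 8 workers.
    results = evaluate(dataset, metrics=metrics, run_config=RunConfig(max_workers=8))
    df = results.to_pandas()
    metric_cols = [m.name for m in metrics]
    # If every metric came back NaN, the async run failed silently;
    # retry with a single worker, which is slower but more stable.
    if df[metric_cols].isna().all().all():
        results = evaluate(dataset, metrics=metrics, run_config=RunConfig(max_workers=1))
    return results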

Error Handling

Common Issues

1. All-NaN Metrics
Detected all-NaN async metric output. Switching to synchronous fallback...
Solution: Automatically handled by the fallback mechanism.

2. Model Not Found
Model 'invalid-model' not found in registry. Skipping.
Solution: Use valid model keys from MODELS_REGISTRY.

3. Query Execution Failure
Error processing query 5: Connection timeout
Solution: Individual query failures are logged but don't stop the evaluation.

4. Unsupported RAG Type
ValueError: Unsupported RAG type: custom. Use 'rewriter', 'hybrid', ...
Solution: Use one of the 6 supported RAG types.

Dependencies

Required Packages:
  • ragas - RAGAS evaluation framework
  • datasets - HuggingFace datasets library
  • pandas - Data manipulation
  • numpy - Numerical operations
Internal Dependencies:
  • src.rag.* - RAG system implementations
  • src.common.model_provider - LLM model management
  • src.common.pricing - Cost tracking utilities
  • src.common.utils - Analysis export utilities
