Overview
The RAGAS Evaluator module provides comprehensive evaluation capabilities for RAG systems using the RAGAS (Retrieval-Augmented Generation Assessment) framework. It supports multiple RAG architectures, various LLM models, and generates detailed performance metrics and cost analysis.
Module Path: src/evaluation/ragas_evaluator.py
Key Features
- Multiple RAG Support: Evaluate Simple, HyDE, Rewriter, Hybrid, Hybrid-RRF, and PageIndex RAG systems
- RAGAS Metrics: Faithfulness, Answer Relevancy, Context Precision, and Context Recall
- Multi-Model Evaluation: Compare performance across different LLM models
- Performance Tracking: Token usage, execution time, and cost analysis
- Obstetric Dataset: Built-in specialized medical dataset with 10 ground truth questions
- Comprehensive Reporting: JSON export with detailed question-by-question analysis
RAGASEvaluator Class
The main orchestrator class for RAG evaluation workflows.
Constructor
RAGASEvaluator(rag_type: str = "rewriter", debug: bool = False)
Initialize a RAGAS evaluator for a specific RAG architecture.
rag_type - RAG architecture to evaluate. Supported values:
"simple" - Simple Semantic RAG
"hybrid" - Hybrid RAG (BM25 + Semantic)
"hybrid-rrf" - Hybrid RAG with Reciprocal Rank Fusion
"hyde" - HyDE RAG (Hypothetical Documents)
"rewriter" - Multi-Query Rewriter RAG
"pageindex" - PageIndex RAG
debug - Enable debug output for detailed logging and error traces
Instance Attributes:
metrics - List of RAGAS metric objects (faithfulness, answer_relevancy, context_precision, context_recall)
results_dir - Path to results directory (project_root/results)
query_function - RAG-specific query function for evaluation
rag_name - Descriptive name of the RAG system
rag_type - RAG type identifier
llm_model - Default LLM model name (initially "gpt-4o")
performance_metadata - List storing execution metrics per query
Example:
from src.evaluation.ragas_evaluator import RAGASEvaluator
# Initialize evaluator for HyDE RAG
evaluator = RAGASEvaluator(rag_type="hyde", debug=True)
# Initialize with simple RAG
simple_evaluator = RAGASEvaluator(rag_type="simple")
set_models
set_models(llm_model: str = None, embeddings_model: str = None)
Update the LLM model used by the evaluator.
llm_model - New LLM model name (e.g., "gpt-4o", "claude-3-5-sonnet")
embeddings_model - New embeddings model name (currently unused by the implementation)
Example:
evaluator = RAGASEvaluator(rag_type="hybrid")
evaluator.set_models(llm_model="claude-3-5-sonnet")
load_test_queries
load_test_queries(use_obstetric_dataset: bool = True) -> List[Dict]
Load test queries from the obstetric dataset.
use_obstetric_dataset - Whether to use the built-in obstetric dataset (currently the only option)
Returns: List of query dictionaries, each containing:
question (str) - The test question
ground_truth (str) - The expected answer
Example:
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
print(f"Loaded {len(queries)} test queries")
# Output: Loaded 10 test queries
prepare_dataset
prepare_dataset(test_queries: List[Dict]) -> Dataset
Prepare the RAGAS dataset format by executing RAG queries and collecting results.
test_queries - List of test query dictionaries with "question" and "ground_truth" keys
Returns: RAGAS-compatible HuggingFace Dataset with columns:
question - User queries
answer - RAG-generated answers
contexts - Retrieved context chunks
ground_truth - Expected answers
Side Effects:
- Populates self.performance_metadata with execution metrics
- Prints progress information for each query
- Handles errors gracefully and continues processing
Example:
evaluator = RAGASEvaluator(rag_type="rewriter")
test_queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(test_queries)
print(f"Dataset size: {len(dataset)}")
# Output: Dataset size: 10
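The per-query loop behind prepare_dataset can be sketched with the standard library alone. This is an illustrative reconstruction, not the module's actual code: it assumes the RAG query function returns an (answer, contexts) pair and may raise on failure, matching the documented graceful-error behavior.

```python
import time

def prepare_records(test_queries, query_function):
    """Run each query through the RAG and collect RAGAS-style records.

    Illustrative sketch: failed queries are logged and skipped so one
    bad query does not abort the whole run.
    """
    records = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    performance_metadata = []
    for i, item in enumerate(test_queries, 1):
        start = time.perf_counter()
        try:
            answer, contexts = query_function(item["question"])
        except Exception as exc:  # log and continue, per the documented behavior
            print(f"Error processing query {i}: {exc}")
            continue
        elapsed = time.perf_counter() - start
        records["question"].append(item["question"])
        records["answer"].append(answer)
        records["contexts"].append(contexts)
        records["ground_truth"].append(item["ground_truth"])
        performance_metadata.append(
            {"question": item["question"], "execution_time": elapsed}
        )
    return records, performance_metadata
```

The resulting column dictionary is the shape `datasets.Dataset.from_dict` accepts.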
evaluate_rag
evaluate_rag(dataset: Dataset) -> Dict[str, Any]
Evaluate RAG using RAGAS metrics with automatic fallback to synchronous mode.
dataset - Prepared RAGAS dataset with questions, answers, contexts, and ground truth
Returns: Evaluation results object with metric scores. Can be accessed as:
- Object attributes (e.g., results.faithfulness)
- Via the to_pandas() method for DataFrame conversion
Behavior:
- Executes RAGAS evaluation with 8 parallel workers
- Disables per-call timeouts for Python 3.14+ compatibility
- Automatically detects all-NaN results and falls back to synchronous evaluation
- Handles async context issues gracefully
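The all-NaN detection that triggers the synchronous fallback can be sketched as below. The row/metric shapes are assumptions for illustration; the real implementation inspects the RAGAS result object.

```python
import math

def all_metrics_nan(rows, metric_names):
    """Return True when every metric score in every row is NaN or missing.

    Illustrative sketch of the fallback trigger: async RAGAS runs can
    occasionally yield all-NaN scores, in which case the evaluator
    re-runs the evaluation synchronously.
    """
    scores = [row.get(name) for row in rows for name in metric_names]
    return bool(scores) and all(
        s is None or (isinstance(s, float) and math.isnan(s)) for s in scores
    )
```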
Example:
evaluator = RAGASEvaluator(rag_type="hybrid-rrf")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)
# Access metrics
print(f"Faithfulness: {results.faithfulness:.3f}")
print(f"Answer Relevancy: {results.answer_relevancy:.3f}")
# Convert to DataFrame
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy']].describe())
display_results
display_results(results)
Display evaluation results in formatted console output.
results - Evaluation results from evaluate_rag()
Output Format:
- Individual metric scores (0-1 scale)
- Average score across all metrics
- Performance assessment (Excellent/Good/Needs improvement/Significant improvements needed)
Example:
evaluator = RAGASEvaluator(rag_type="simple")
evaluator.run_evaluation() # Automatically calls display_results
# Manual display
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
Sample Output:
============================================================
RAGAS EVALUATION RESULTS
============================================================
Faithfulness: 0.872
Answer Relevancy: 0.845
Context Precision: 0.790
Context Recall: 0.823
Average Score: 0.833
Performance: Excellent
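The mapping from average score to the qualitative label can be sketched as follows. The cut-offs are illustrative assumptions consistent with the sample output (0.833 rated "Excellent"); the module's actual thresholds may differ.

```python
def assess_performance(average_score):
    """Map an average RAGAS score (0-1) to a qualitative label.

    The thresholds below are assumed for illustration.
    """
    if average_score >= 0.8:
        return "Excellent"
    if average_score >= 0.7:
        return "Good"
    if average_score >= 0.5:
        return "Needs improvement"
    return "Significant improvements needed"
```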
save_results
save_results(
results,
filename: str = None,
return_data_only: bool = False,
model_name: str = None
) -> Union[Path, Dict, None]
Save evaluation results to a JSON file or return them as a dictionary.
results - Evaluation results from evaluate_rag()
filename - Output filename. If not provided, a timestamped filename is generated
return_data_only - If True, returns a dictionary instead of saving to file
model_name - LLM model name to include in metadata
Returns:
- File path if saved to disk
- Dictionary if return_data_only=True
- None if no results or an error occurred
Output Structure:
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "single_rag_evaluation_hybrid",
"dataset_size": 10,
"rags_evaluated": ["hybrid"],
"model_used": "gpt-4o"
},
"pricing_config": { ... },
"summary": {
"hybrid": {
"rag_name": "Hybrid RAG (BM25 + Semantic)",
"metrics": {
"faithfulness": 0.872,
"answer_relevancy": 0.845,
"context_precision": 0.790,
"context_recall": 0.823
},
"performance": {
"average_execution_time": 2.456,
"total_input_tokens": 12450,
"total_output_tokens": 3200,
"total_cost": 0.045678,
"average_cost_per_question": 0.004568,
"overall_average_score": 0.833
}
}
},
"question_by_question": [ ... ]
}
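The timestamped default filename can be generated with the standard library. The %Y%m%d_%H%M%S pattern is inferred from the sample output path (ragas_evaluation_pageindex_20260311_103045.json); treat it as an assumption about the implementation.

```python
from datetime import datetime

def default_results_filename(rag_type, when=None):
    """Build the timestamped default filename used when none is given."""
    when = when or datetime.now()
    return f"ragas_evaluation_{rag_type}_{when.strftime('%Y%m%d_%H%M%S')}.json"
```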
Example:
# Save to file
evaluator = RAGASEvaluator(rag_type="hyde")
results = evaluator.evaluate_rag(dataset)
filepath = evaluator.save_results(results, model_name="gpt-4o")
print(f"Results saved to: {filepath}")
# Return as dictionary
data = evaluator.save_results(results, return_data_only=True)
print(f"Overall average: {data['summary']['hyde']['performance']['overall_average_score']}")
run_evaluation
run_evaluation()
Execute the complete evaluation pipeline with the obstetric dataset.
Returns: Evaluation results object
Pipeline Steps:
- Load test queries from obstetric dataset
- Prepare RAGAS dataset by executing RAG queries
- Run RAGAS evaluation with configured metrics
- Display formatted results
- Save results to JSON file
Example:
# Complete evaluation workflow
evaluator = RAGASEvaluator(rag_type="pageindex", debug=False)
results = evaluator.run_evaluation()
# Results are automatically displayed and saved
Console Output:
Starting RAGAS evaluation
System: PageIndex RAG
Dataset: Obstetric queries (10 questions)
============================================================
Processing 10 queries with PageIndex RAG
Dataset prepared: 10 queries processed
Starting RAGAS evaluation...
Evaluation completed
============================================================
RAGAS EVALUATION RESULTS
============================================================
...
Results saved to: results/ragas_evaluation_pageindex_20260311_103045.json
Evaluation completed - 10 queries processed
run_multi_model_evaluation
run_multi_model_evaluation(models_to_test: list = None)
Run evaluation for the current RAG type against multiple LLM models.
models_to_test - List of model keys from MODELS_REGISTRY. If None, all registered models are tested.
Behavior:
- Iterates through each model in the list
- Creates custom LLM instances for each model
- Wraps RAG query functions to use the custom model
- Collects results from all models
- Generates consolidated multi-model report
Output File: ragas_multimodel_{rag_type}_{timestamp}.json
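Wrapping the RAG query function to use a custom model can be sketched as a closure. This assumes the underlying query function accepts an optional llm_model keyword, which is a hypothetical interface chosen for illustration.

```python
def wrap_query_with_model(query_function, model_name):
    """Bind a model choice into a RAG query function.

    Illustrative closure: per-model evaluation can reuse the same call
    sites while routing each query to a different LLM.
    """
    def wrapped(question):
        return query_function(question, llm_model=model_name)
    return wrapped
```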
Example:
# Test specific models
evaluator = RAGASEvaluator(rag_type="simple")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)
# Test all registered models
evaluator.run_multi_model_evaluation()
Output Structure:
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "multi_model_rag_comparison",
"rag_type_evaluated": "simple",
"dataset_size": 10,
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
},
"pricing_config": { ... },
"summary": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... },
"gemini-1.5-pro": { ... }
},
"question_by_question": [ ... ]
}
SyncEvaluationResult Class
Lightweight wrapper for synchronous evaluation results, compatible with RAGAS result format.
Constructor
SyncEvaluationResult(dataframe: pd.DataFrame, metrics: List[Any])
dataframe - DataFrame containing metric scores for each evaluation sample
metrics - List of RAGAS metric objects used in the evaluation
Behavior:
- Calculates mean value for each metric and stores as instance attribute
- Provides a to_pandas() method for DataFrame access
- Compatible with downstream result processing pipelines
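The core idea, per-metric means exposed as attributes, can be shown with a stdlib stand-in (the real SyncEvaluationResult wraps a pandas DataFrame):

```python
class MeanResult:
    """Stdlib stand-in for SyncEvaluationResult: stores per-sample
    scores and exposes each metric's mean as an instance attribute."""

    def __init__(self, rows, metric_names):
        self._rows = rows
        for name in metric_names:
            values = [row[name] for row in rows]
            # mirror the documented behavior: one mean attribute per metric
            setattr(self, name, sum(values) / len(values))

    def to_rows(self):
        return list(self._rows)
```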
Example:
from src.evaluation.ragas_evaluator import SyncEvaluationResult
import pandas as pd
# Create result wrapper
df = pd.DataFrame({
'faithfulness': [0.9, 0.8, 0.85],
'answer_relevancy': [0.85, 0.9, 0.88]
})
result = SyncEvaluationResult(df, metrics)  # metrics: RAGAS metric objects from an evaluator
# Access mean scores
print(f"Faithfulness: {result.faithfulness}")
print(f"Answer Relevancy: {result.answer_relevancy}")
# Get DataFrame
df = result.to_pandas()
Helper Functions
evaluate_rewriter_rag
evaluate_rewriter_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Rewriter RAG specifically with optional analysis export.
export_analysis - Export detailed analysis files (CSV, charts) using the export_ragas_analysis utility
Example:
from src.evaluation.ragas_evaluator import evaluate_rewriter_rag
# Basic evaluation
results = evaluate_rewriter_rag()
# With detailed analysis export
results = evaluate_rewriter_rag(export_analysis=True, debug=True)
evaluate_hybrid_rag
evaluate_hybrid_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG (BM25 + Semantic) specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rag
results = evaluate_hybrid_rag(export_analysis=True)
evaluate_hybrid_rrf_rag
evaluate_hybrid_rrf_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Hybrid RAG with Reciprocal Rank Fusion.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hybrid_rrf_rag
results = evaluate_hybrid_rrf_rag()
evaluate_hyde_rag
evaluate_hyde_rag(export_analysis: bool = False, debug: bool = False)
Evaluate HyDE RAG (Hypothetical Documents) specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_hyde_rag
results = evaluate_hyde_rag(export_analysis=True)
evaluate_simple_rag
evaluate_simple_rag(export_analysis: bool = False, debug: bool = False)
Evaluate Simple Semantic RAG specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_simple_rag
results = evaluate_simple_rag()
evaluate_pageindex_rag
evaluate_pageindex_rag(export_analysis: bool = False, debug: bool = False)
Evaluate PageIndex RAG specifically.
export_analysis - Export detailed analysis files
Example:
from src.evaluation.ragas_evaluator import evaluate_pageindex_rag
results = evaluate_pageindex_rag()
evaluate_both_rags
evaluate_both_rags(export_analysis: bool = False, debug: bool = False)
Evaluate both original RAG systems sequentially (Rewriter and Hybrid).
export_analysis - Export detailed analysis files for both systems
Returns: Dictionary with keys "rewriter" and "hybrid" containing the respective results
Example:
from src.evaluation.ragas_evaluator import evaluate_both_rags
results = evaluate_both_rags(export_analysis=True)
print(f"Rewriter score: {results['rewriter'].faithfulness}")
print(f"Hybrid score: {results['hybrid'].faithfulness}")
evaluate_all_rags
evaluate_all_rags(export_analysis: bool = False, debug: bool = False)
Evaluate all 6 RAG systems sequentially with comprehensive comparison report.
export_analysis - Export detailed analysis files for all systems
Returns: Dictionary with keys for each RAG type ("simple", "hyde", "rewriter", "hybrid", "hybrid-rrf", "pageindex")
Behavior:
- Evaluates all 6 RAG systems in sequence
- Generates individual result files for each RAG
- Creates consolidated comparison report with best performer analysis
- Includes 2-second pause between evaluations
Output Files:
- Individual: ragas_evaluation_{rag_type}_{timestamp}.json
- Comparison: ragas_comparison_all_rags_{timestamp}.json
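The "best performer" step of the comparison report can be sketched as a mean-score ranking. The summary shape (rag_type mapped to a dict of metric scores) follows the output examples; the real report may weight metrics differently.

```python
def best_performer(summary):
    """Pick the RAG with the highest mean metric score.

    summary maps rag_type -> {metric_name: score}, matching the
    summary layout shown in the output examples.
    """
    averages = {
        rag: sum(metrics.values()) / len(metrics)
        for rag, metrics in summary.items()
    }
    best = max(averages, key=averages.get)
    return best, averages[best]
```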
Example:
from src.evaluation.ragas_evaluator import evaluate_all_rags
results = evaluate_all_rags(export_analysis=True, debug=False)
# Access individual results
for rag_type, result in results.items():
print(f"{rag_type}: {result.faithfulness:.3f}")
run_all_models_all_rags_evaluation
run_all_models_all_rags_evaluation(export_analysis: bool = False, debug: bool = False)
Comprehensive evaluation: ALL RAG types against ALL LLM models.
export_analysis - Export detailed analysis files (currently unused by this function)
Behavior:
- Tests all 6 RAG types with all models in MODELS_REGISTRY
- Total evaluations: 6 RAGs × N models
- Generates comprehensive consolidated report
- Provides progress indicators and success statistics
Output File: ragas_comprehensive_all_rags_all_models_{timestamp}.json
Example:
from src.evaluation.ragas_evaluator import run_all_models_all_rags_evaluation
# Run comprehensive evaluation
results = run_all_models_all_rags_evaluation(debug=True)
Console Output:
🚀 Starting comprehensive evaluation: ALL RAGs vs ALL Models
RAG types to evaluate: ['simple', 'hybrid', 'hybrid-rrf', 'hyde', 'rewriter', 'pageindex']
Models to test: ['gpt-4o', 'claude-3-5-sonnet', 'gemini-1.5-pro', ...]
Total evaluations: 6 × 8 = 48
============================================================
...
================================================================================
🎉 COMPREHENSIVE EVALUATION COMPLETED!
📄 Report saved to: results/ragas_comprehensive_all_rags_all_models_20260311_103045.json
📊 Total evaluations: 6 RAGs × 8 models = 48
✅ Successful evaluations: 46/48
================================================================================
DATA_GT Dataset
Built-in obstetric and pregnancy-specific evaluation dataset with 10 ground truth questions.
Structure:
DATA_GT = [
{
"question": "¿En qué momento y quien debe reevaluar el riesgo clínico...",
"ground_truth": "El Ginecobstetra en la semana 28 - 30 y semana 34 – 36."
},
# ... 9 more questions
]
Topics Covered:
- Prenatal care scheduling and timing
- Risk assessment protocols
- Clinical evaluation tools (Herrera & Hurtado scale)
- Postpartum depression screening
- Weight gain recommendations by BMI
- VBAC (Vaginal Birth After Cesarean) probabilities
- Nausea and vomiting treatment options
Language: Spanish (Colombian clinical guidelines)
Usage:
from src.evaluation.ragas_evaluator import DATA_GT
print(f"Dataset size: {len(DATA_GT)}")
for i, item in enumerate(DATA_GT, 1):
print(f"Q{i}: {item['question'][:50]}...")
RAGAS Metrics
The evaluator uses four fundamental RAGAS metrics:
Faithfulness
Measures whether the answer is factually consistent with the retrieved contexts. Score range: 0-1 (higher is better).
Calculation: Checks if claims in the answer can be inferred from the contexts without hallucination.
Answer Relevancy
Measures how relevant the answer is to the original question. Score range: 0-1 (higher is better).
Calculation: Uses embeddings to compute semantic similarity between question and answer.
Context Precision
Measures the proportion of relevant contexts in the retrieved set. Score range: 0-1 (higher is better).
Calculation: Evaluates whether retrieved contexts are actually useful for answering the question.
Context Recall
Measures whether all necessary information from ground truth is present in retrieved contexts. Score range: 0-1 (higher is better).
Calculation: Checks if ground truth answer can be derived from the retrieved contexts.
Output File Formats
Single RAG Evaluation
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "single_rag_evaluation_hybrid",
"dataset_size": 10,
"rags_evaluated": ["hybrid"],
"model_used": "gpt-4o"
},
"pricing_config": {
"gpt-4o": {
"input": 0.0025,
"output": 0.01
}
},
"summary": {
"hybrid": {
"rag_name": "Hybrid RAG (BM25 + Semantic)",
"metrics": {
"faithfulness": 0.872,
"answer_relevancy": 0.845,
"context_precision": 0.790,
"context_recall": 0.823
},
"performance": {
"average_execution_time": 2.456,
"total_input_tokens": 12450,
"total_output_tokens": 3200,
"total_cost": 0.045678,
"average_cost_per_question": 0.004568,
"overall_average_score": 0.833
}
}
},
"question_by_question": [
{
"question_id": 1,
"question": "¿En qué momento y quien debe reevaluar...",
"ground_truth": "El Ginecobstetra en la semana 28 - 30...",
"rag_results": {
"hybrid": {
"answer": "Según las guías clínicas...",
"contexts_count": 5,
"metrics": {
"faithfulness": 0.900,
"answer_relevancy": 0.875,
"context_precision": 0.800,
"context_recall": 0.850
},
"performance": {
"question": "¿En qué momento y quien debe reevaluar...",
"execution_time": 2.345,
"input_tokens": 1245,
"output_tokens": 320,
"total_cost": 0.004567,
"cost_source": "precise"
}
}
}
}
]
}
Multi-Model Comparison
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "multi_model_rag_comparison",
"rag_type_evaluated": "simple",
"dataset_size": 10,
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
},
"summary": {
"gpt-4o": {
"model_name": "gpt-4o",
"metrics": { ... },
"performance": { ... }
},
"claude-3-5-sonnet": {
"model_name": "claude-3-5-sonnet",
"metrics": { ... },
"performance": { ... }
}
},
"question_by_question": [
{
"question_id": 1,
"question": "...",
"ground_truth": "...",
"rag_results": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
}
}
]
}
Comprehensive All RAGs All Models
{
"metadata": {
"timestamp": "2026-03-11T10:30:45.123456",
"evaluation_type": "comprehensive_all_rags_all_models",
"dataset_size": 10,
"rags_evaluated": ["simple", "hybrid", "hybrid-rrf", "hyde", "rewriter", "pageindex"],
"models_evaluated": ["gpt-4o", "claude-3-5-sonnet", ...],
"total_evaluations": 48
},
"summary": {
"simple": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
},
"hybrid": { ... }
},
"question_by_question": [
{
"question_id": 1,
"question": "...",
"ground_truth": "...",
"rag_results": {
"simple": {
"gpt-4o": { ... },
"claude-3-5-sonnet": { ... }
},
"hybrid": { ... }
}
}
]
}
Command-Line Usage
The module can be executed directly from the command line:
# Evaluate specific RAG type
python src/evaluation/ragas_evaluator.py simple
python src/evaluation/ragas_evaluator.py hybrid-rrf
# Evaluate with detailed analysis export
python src/evaluation/ragas_evaluator.py hyde --export
# Evaluate with debug output
python src/evaluation/ragas_evaluator.py rewriter --debug
# Evaluate multiple RAGs
python src/evaluation/ragas_evaluator.py both # Rewriter + Hybrid
python src/evaluation/ragas_evaluator.py all # All 6 RAGs
# Multi-model evaluation
python src/evaluation/ragas_evaluator.py multi-model simple
# Comprehensive evaluation (all RAGs × all models)
python src/evaluation/ragas_evaluator.py all-models-all-rags
Available Flags:
--export or -e - Export detailed analysis files
--debug or -d - Enable debug output
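The documented CLI shape (one positional mode/RAG type plus optional flags) can be parsed with a few lines of stdlib code. This is an illustrative sketch; the module's actual argument handling may differ.

```python
def parse_cli(argv):
    """Parse the documented CLI: a positional mode plus optional
    --export/-e and --debug/-d flags (illustrative only)."""
    flags = {a for a in argv if a.startswith("-")}
    positionals = [a for a in argv if not a.startswith("-")]
    return {
        "mode": positionals[0] if positionals else "rewriter",  # assumed default
        "export": bool(flags & {"--export", "-e"}),
        "debug": bool(flags & {"--debug", "-d"}),
    }
```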
Complete Usage Example
from src.evaluation.ragas_evaluator import RAGASEvaluator
# 1. Basic Evaluation
evaluator = RAGASEvaluator(rag_type="hybrid", debug=False)
results = evaluator.run_evaluation()
# 2. Custom Workflow
evaluator = RAGASEvaluator(rag_type="simple")
queries = evaluator.load_test_queries()
dataset = evaluator.prepare_dataset(queries)
results = evaluator.evaluate_rag(dataset)
evaluator.display_results(results)
filepath = evaluator.save_results(results, model_name="gpt-4o")
# 3. Multi-Model Comparison
evaluator = RAGASEvaluator(rag_type="hyde")
models = ["gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"]
evaluator.run_multi_model_evaluation(models_to_test=models)
# 4. Change Model During Evaluation
evaluator = RAGASEvaluator(rag_type="pageindex")
evaluator.set_models(llm_model="claude-3-5-sonnet")
results = evaluator.run_evaluation()
# 5. Access Detailed Metrics
df = results.to_pandas()
print(df[['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']])
print(f"Mean Faithfulness: {df['faithfulness'].mean():.3f}")
print(f"Std Faithfulness: {df['faithfulness'].std():.3f}")
# 6. Programmatic Comparison
from src.evaluation.ragas_evaluator import evaluate_all_rags
all_results = evaluate_all_rags(export_analysis=True)
for rag_type, result in all_results.items():
print(f"\n{rag_type.upper()} Performance:")
print(f" Faithfulness: {result.faithfulness:.3f}")
print(f" Answer Relevancy: {result.answer_relevancy:.3f}")
Execution Time
- Single RAG evaluation: ~3-5 minutes for 10 questions
- Multi-model evaluation: ~5-10 minutes per model
- All RAGs evaluation: ~20-30 minutes total
- Comprehensive (all RAGs × all models): 1-2 hours depending on number of models
Cost Implications
Evaluation costs depend on:
- LLM model pricing (input/output tokens)
- Number of RAG queries
- Retrieved context length
- Answer generation length
Example costs per 10-question evaluation:
- GPT-4o: ~$0.04-0.06
- Claude-3.5-Sonnet: ~$0.05-0.08
- GPT-3.5-turbo: ~$0.01-0.02
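Per-query cost follows from token counts and the pricing_config rates. The sketch below assumes those rates are USD per 1,000 tokens (consistent with the gpt-4o sample: input 0.0025, output 0.01); verify the unit against src.common.pricing before relying on it.

```python
def query_cost(input_tokens, output_tokens, pricing):
    """Estimate one query's cost, assuming per-1K-token USD rates."""
    return (input_tokens / 1000) * pricing["input"] \
         + (output_tokens / 1000) * pricing["output"]
```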
Async vs Sync Mode
The evaluator automatically handles async context issues:
- Default: Async evaluation with 8 workers (faster)
- Fallback: Synchronous evaluation with 1 worker (more stable)
- Trigger: All-NaN metrics or async exceptions
Error Handling
Common Issues
1. All-NaN Metrics
Detected all-NaN async metric output. Switching to synchronous fallback...
Solution: Automatically handled by fallback mechanism.
2. Model Not Found
Model 'invalid-model' not found in registry. Skipping.
Solution: Use valid model keys from MODELS_REGISTRY.
3. Query Execution Failure
Error processing query 5: Connection timeout
Solution: Individual query failures are logged but don't stop the evaluation.
4. Unsupported RAG Type
ValueError: Unsupported RAG type: custom. Use 'rewriter', 'hybrid', ...
Solution: Use one of the 6 supported RAG types.
Dependencies
Required Packages:
ragas - RAGAS evaluation framework
datasets - HuggingFace datasets library
pandas - Data manipulation
numpy - Numerical operations
Internal Dependencies:
src.rag.* - RAG system implementations
src.common.model_provider - LLM model management
src.common.pricing - Cost tracking utilities
src.common.utils - Analysis export utilities
See Also