RAGAS Evaluation Framework
The benchmark uses RAGAS (Retrieval-Augmented Generation Assessment) for automated, LLM-based evaluation. RAGAS provides four core metrics, sketched in the example after this list:
- Faithfulness: Measures factual consistency between the answer and the retrieved context
- Answer Relevancy: Assesses how well the answer addresses the question
- Context Precision: Evaluates relevance of retrieved context
- Context Recall: Measures completeness of retrieved information
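A minimal end-to-end sketch of running these four metrics with the ragas library. The toy dataset and column names follow the ragas 0.1.x schema; the benchmark's own pipeline may differ:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Toy single-row dataset; "contexts" holds a list of strings per row.
dataset = Dataset.from_dict({
    "question":     ["What is a common first-line treatment for hypertension?"],
    "answer":       ["Thiazide diuretics are a common first-line option."],
    "contexts":     [["Guidelines list thiazide diuretics as first-line."]],
    "ground_truth": ["Thiazide diuretics."],
})

# Requires an evaluation LLM to be configured (e.g. via OPENAI_API_KEY).
results = evaluate(dataset, metrics=[faithfulness, answer_relevancy,
                                     context_precision, context_recall])
print(results)
```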
Current Evaluation Configuration
The evaluator is configured in src/evaluation/ragas_evaluator.py:118-123.
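The original snippet is not reproduced here; the following is a plausible shape for such a configuration, with illustrative model names rather than the project's actual choices:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Plausible shape of the configuration near lines 118-123;
# consult the repository for the authoritative version.
RAGAS_METRICS = [faithfulness, answer_relevancy, context_precision, context_recall]
EVAL_LLM = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)       # illustrative
EVAL_EMBEDDINGS = OpenAIEmbeddings(model="text-embedding-3-small")  # illustrative
```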
Adding Custom RAGAS Metrics
RAGAS supports additional metrics that can be added to the evaluation.
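For example, answer correctness and entity-level context recall ship with ragas and can be appended to the metric list (exact module paths are version-dependent):

```python
from ragas.metrics import (
    faithfulness, answer_relevancy, context_precision, context_recall,
    answer_correctness, context_entity_recall,
)

# Extended metric list; the last two are the optional additions.
RAGAS_METRICS = [
    faithfulness, answer_relevancy, context_precision, context_recall,
    answer_correctness, context_entity_recall,
]
```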
Update Result Processing
Ensure new metrics are included in result extraction. The framework handles this automatically in save_results(), but verify in display_results().
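A hypothetical sketch of what that check might look like; the real display_results() lives in src/evaluation/ragas_evaluator.py and may differ:

```python
def display_results(self, results: dict) -> None:
    # Any metric present in the results dict returned by ragas.evaluate()
    # is printed here, so new metrics show up without extra wiring.
    for metric_name, score in sorted(results.items()):
        print(f"{metric_name:>30}: {score:.4f}")
```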
Creating Custom Metrics
Medical-Specific Metric Example
Let's create a custom metric for medical citation verification.
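A library-agnostic sketch is shown below: it scores the fraction of citation markers in the answer that can be traced back to the retrieved contexts. Wiring it into ragas's Metric base classes is version-dependent, so only the scoring logic is sketched here:

```python
import re
from dataclasses import dataclass

@dataclass
class MedicalCitationVerification:
    """Hypothetical metric: fraction of citation markers in the answer
    (e.g. "[1]" or "(Smith 2020)") that also appear in the retrieved
    contexts. Returns a score in [0, 1], matching RAGAS conventions."""
    name: str = "medical_citation_verification"

    def score(self, answer: str, contexts: list[str]) -> float:
        citations = re.findall(r"\[\d+\]|\([A-Z][a-z]+ \d{4}\)", answer)
        if not citations:
            return 1.0  # nothing claimed, nothing to verify
        joined = " ".join(contexts)
        supported = sum(1 for c in citations if c in joined)
        return supported / len(citations)
```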
Aspect-Based Evaluation
RAGAS supports aspect-based evaluation using AspectCritic for domain-specific assessment.
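AspectCritic (available in ragas 0.2+) turns a free-text definition into a binary LLM judgment. A domain-specific example; the aspect name and definition wording are illustrative:

```python
from ragas.metrics import AspectCritic

# Hypothetical aspect for this benchmark; adjust wording to your domain.
guideline_adherence = AspectCritic(
    name="guideline_adherence",
    definition=(
        "Does the answer follow established clinical guidelines "
        "when making treatment recommendations?"
    ),
)
```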
Modifying Evaluation Logic
Custom Dataset Preparation
Modify how queries are processed before evaluation in ragas_evaluator.py:181-248.
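The original code is not reproduced here; a simplified sketch of that kind of preparation, assuming per-query records with question, answer, contexts, and ground-truth fields:

```python
from datasets import Dataset

def prepare_dataset(records: list[dict]) -> Dataset:
    # Illustrative sketch of the preparation done in
    # ragas_evaluator.py:181-248; the real code may differ.
    return Dataset.from_dict({
        "question":     [r["question"] for r in records],
        "answer":       [r["answer"] for r in records],
        "contexts":     [r["contexts"] for r in records],  # list[str] per row
        "ground_truth": [r["ground_truth"] for r in records],
    })
```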
Domain-Specific Evaluation Considerations
Medical Q&A Specific Metrics
For medical applications, consider these additional evaluation dimensions:
- Clinical Accuracy: Verify medical facts against clinical guidelines
- Safety: Ensure answers include appropriate warnings and contraindications
- Clarity: Assess whether medical terminology is appropriately explained
- Completeness: Check that all relevant factors are addressed
Example: Medical Safety Metric
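A hedged sketch of a safety metric built on AspectCritic; the definition wording is illustrative and should be validated against clinical review:

```python
from ragas.metrics import AspectCritic

# Hypothetical safety metric (ragas >= 0.2); the definition is illustrative.
medical_safety = AspectCritic(
    name="medical_safety",
    definition=(
        "Does the answer include appropriate warnings and contraindications, "
        "and advise consulting a healthcare professional where relevant?"
    ),
)
```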
Output Format Customization
Custom Result Formatting
Modify how results are saved in ragas_evaluator.py:434-573.
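The original code is not reproduced here; a minimal sketch of custom result formatting, assuming the results are a flat dict of metric scores:

```python
import json
import time
from pathlib import Path

def save_results(self, results: dict, out_dir: str = "results") -> Path:
    # Simplified sketch; the real save_results() spans roughly 140 lines
    # and records more detail (per-sample scores, run metadata, ...).
    path = Path(out_dir) / f"ragas_results_{int(time.time())}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    return path
```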
Evaluation Configuration
Run Configuration
Customize RAGAS evaluation behavior in ragas_evaluator.py:269.
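ragas exposes a RunConfig for controlling timeouts, retries, and parallelism (field names may vary across versions). An illustrative sketch, reusing the dataset and metric list from the earlier examples:

```python
from ragas import evaluate
from ragas.run_config import RunConfig

run_config = RunConfig(
    timeout=180,      # seconds allowed per LLM call
    max_retries=10,   # retries for transient API failures
    max_workers=8,    # parallel evaluation workers
)
results = evaluate(dataset, metrics=RAGAS_METRICS, run_config=run_config)
```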
Testing Custom Metrics
Unit Tests
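For example, the citation-verification sketch above can be unit-tested directly, since its scoring logic is pure Python:

```python
def test_supported_citations_score_one():
    metric = MedicalCitationVerification()
    answer = "Aspirin reduces cardiovascular risk [1]."
    contexts = ["Trial [1] found that aspirin reduces cardiovascular risk."]
    assert metric.score(answer, contexts) == 1.0

def test_unsupported_citations_score_zero():
    metric = MedicalCitationVerification()
    assert metric.score("See [2].", ["unrelated context"]) == 0.0

def test_no_citations_is_vacuously_supported():
    metric = MedicalCitationVerification()
    assert metric.score("No citations here.", ["some context"]) == 1.0
```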
Integration Tests
Test metrics with full evaluation, as in the sketch below.
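A hypothetical end-to-end check; the evaluator class name, fixture, and result shape are assumptions, not the project's confirmed API:

```python
def test_custom_metric_appears_in_full_evaluation(sample_dataset):
    # sample_dataset: a small pytest fixture dataset (assumed to exist).
    from src.evaluation.ragas_evaluator import RagasEvaluator  # assumed name
    evaluator = RagasEvaluator()
    results = evaluator.evaluate(sample_dataset)
    assert "medical_citation_verification" in results
    assert 0.0 <= results["medical_citation_verification"] <= 1.0
```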
Best Practices
Metric Design
- Clear Definition: Define exactly what the metric measures
- Bounded Scores: Use 0-1 range for consistency with RAGAS
- Reproducibility: Ensure deterministic behavior when possible
- Performance: Optimize for evaluation speed
Evaluation Strategy
- Baseline First: Establish baseline with standard metrics
- Incremental Addition: Add custom metrics one at a time
- Validation: Validate custom metrics against human judgment
- Documentation: Document metric definitions and interpretation
Next Steps
- Interpreting Results: Understand evaluation outputs
- Extending Research: Contribute new evaluation methods
- API Reference: Complete API documentation
- Adding RAG Architectures: Implement new retrieval strategies
