Interpreting Results
This guide helps you understand the JSON output from RAGAS evaluations and extract actionable insights.
Result File Structure
All evaluation results are saved as JSON files in the results/ directory with this structure:
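A minimal sketch of the top-level layout, assuming key names that mirror the sections described below (the real schema may differ):

```python
# Hypothetical top-level layout of a results file; key names are assumptions
# based on the sections described in this guide, not an exact schema.
example_result = {
    "metadata": {},        # run context: timestamp, evaluation type, dataset size
    "pricing_config": {},  # cost configuration for the models used
    "summary": {},         # aggregated metrics and performance statistics
    "questions": [],       # question-by-question breakdown
}
```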
Metadata Section
Provides context about the evaluation run:
- timestamp - When the evaluation was run (ISO 8601 format)
- evaluation_type - Type of evaluation (single RAG, multi-model, comprehensive)
- dataset_size - Number of questions evaluated
- rags_evaluated - List of RAG architectures tested
- model_used - LLM model used for generation (single RAG only)
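For illustration only, a metadata block might look like this (the values are invented):

```python
# Invented example values; field names follow the list above.
metadata = {
    "timestamp": "2024-06-01T12:34:56Z",
    "evaluation_type": "single_rag",        # assumed label; may differ in practice
    "dataset_size": 20,
    "rags_evaluated": ["simple_semantic"],
    "model_used": "gpt-4o-mini",
}
```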
Pricing Configuration
Tracks the cost configuration for different models:
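A hedged sketch of what a pricing entry could look like (the exact keys live in src/common/pricing.py and may differ):

```python
# Hypothetical pricing entry; check src/common/pricing.py for the real field names.
pricing_config = {
    "gpt-4o-mini": {
        "input_cost_per_1k_tokens": 0.00015,
        "output_cost_per_1k_tokens": 0.0006,
    },
}
```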
Pricing data is automatically captured from src/common/pricing.py and helps you understand the cost implications of different model choices.
Summary Section
Contains aggregated metrics and performance statistics.
Metrics Analysis
Faithfulness: 0.850
Excellent - Answers are well-grounded in retrieved context with minimal hallucination
Answer Relevancy: 0.265
Poor - Answers may not directly address the questions asked
Context Precision: 0.779
Good - Retrieval is mostly pulling relevant information
Context Recall: 0.600
Fair - Some relevant information may be missing from retrieval
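To spot weak areas quickly, you can map each aggregated score to a qualitative band. The thresholds below are illustrative assumptions chosen to match the ratings above, not values defined by RAGAS:

```python
# Illustrative thresholds; tune them to your own quality bar.
def rate(score: float) -> str:
    if score >= 0.80:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.50:
        return "Fair"
    return "Poor"

summary_metrics = {
    "faithfulness": 0.850,
    "answer_relevancy": 0.265,
    "context_precision": 0.779,
    "context_recall": 0.600,
}

for name, score in summary_metrics.items():
    print(f"{name}: {score:.3f} ({rate(score)})")
```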
Performance Analysis
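The exact fields here depend on the evaluation type, but execution time and cost are usually the figures you want to compare. A hedged sketch (field names are assumptions; the numbers echo the trade-offs section later in this guide):

```python
# Assumed field names; the values mirror the Simple Semantic figures quoted below.
performance = {
    "avg_execution_time_s": 8.2,
    "total_cost_usd": 0.021,
}
```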
Question-by-Question Breakdown
The most detailed section showing individual question performance:
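Each entry pairs a question with its answer, retrieved contexts, and per-question scores. A hedged sketch (field names are assumptions; the scores match Question #1 analysed below):

```python
# Hypothetical per-question entry; field names may differ in the actual output.
question_entry = {
    "question": "...",
    "answer": "...",
    "contexts_count": 1,
    "metrics": {
        "faithfulness": 0.9,
        "answer_relevancy": 0.0,
        "context_precision": 0.75,
        "context_recall": 0.0,
    },
}
```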
Analyzing Individual Questions
Examine the Answer
Read the generated answer to understand what went wrong:
- Is it hallucinating?
- Is it off-topic?
- Is it incomplete?
Check Retrieved Context
The contexts_count field shows how many chunks were retrieved (a quick check is sketched after this list):
- Too few contexts (1-2): May be missing information → Low context recall
- Too many contexts (5+): May include noise → Low context precision
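A minimal sketch for flagging questions whose retrieval looks suspicious, assuming each per-question entry carries a contexts_count field as in the sketch above:

```python
# Flag entries with unusually few or many retrieved chunks for manual review.
def flag_retrieval_issues(questions, min_contexts=3, max_contexts=5):
    flagged = []
    for q in questions:
        n = q["contexts_count"]
        if n < min_contexts:
            flagged.append((q["question"], n, "few contexts - check recall"))
        elif n > max_contexts:
            flagged.append((q["question"], n, "many contexts - check precision"))
    return flagged
```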
Example Analysis: Question #1
- The system retrieved relevant context (precision 0.75)
- But it didn’t retrieve all necessary context (recall 0.0)
- The answer is faithful to what was retrieved (faithfulness 0.9)
- However, the answer doesn’t actually answer the question (relevancy 0.0)
- Only 1 context chunk was retrieved (contexts_count: 1)
- The retrieved chunk didn’t contain the specific answer
- The LLM couldn’t answer based on incomplete information
Recommendations:
- Increase the retrieval count (k) from its default to a higher value
- Improve embedding quality for this type of question
- Consider using a more advanced RAG architecture (hybrid, rewriter)
Comparing Across RAG Architectures
For multi-RAG evaluations, compare performance side-by-side:
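A sketch of such a comparison, assuming the multi-RAG results file keeps one block of average metric scores per architecture under the summary (the file name and key layout are assumptions):

```python
import json

# Hypothetical file name and layout: summary maps each RAG name to its average scores.
with open("results/multi_rag_evaluation.json") as f:
    results = json.load(f)

rags = list(results["summary"].keys())
metrics = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

print(f"{'metric':<20}" + "".join(f"{rag:>18}" for rag in rags))
for m in metrics:
    row = "".join(f"{results['summary'][rag][m]:>18.2f}" for rag in rags)
    print(f"{m:<20}{row}")
```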
Analysis
Hybrid RAG significantly outperforms Simple RAG:
- Better retrieval (precision 0.88 vs 0.75, recall 0.90 vs 0.0)
- More relevant answers (relevancy 0.85 vs 0.0)
- Maintains faithfulness (0.95 vs 0.9)
Best Performers Analysis
For comprehensive evaluations, the results include a best_performers section:
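Its exact shape depends on the implementation; a hedged sketch, with the picks taken from the trade-offs below:

```python
# Hypothetical shape of the best_performers section; the real keys may differ.
best_performers = {
    "quality": {"rag": "hybrid_rrf", "note": "highest accuracy metrics"},
    "speed": {"rag": "simple_semantic", "avg_execution_time_s": 8.2},
    "cost": {"rag": "simple_semantic", "total_cost_usd": 0.021},
}
```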
Trade-offs Analysis
Quality
Best: Hybrid-RRF & Rewriter
- Highest accuracy metrics
- Best retrieval quality
- Most relevant answers
Performance
Best: Simple Semantic
- Fastest execution (8.2s)
- Lowest cost ($0.021)
- Simplest architecture
Actionable Insights
Low Faithfulness
Symptoms
- Generated answers contain information not in the retrieved context
- LLM is hallucinating or using prior knowledge
Solutions
- Improve prompts: Add explicit instructions to only use retrieved context (see the prompt sketch after this list)
- Use better models: Some models are better at staying grounded
- Add citations: Require the LLM to cite which chunks support each claim
- Post-processing: Filter out unsupported claims
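One way to push faithfulness up is to make the grounding constraint explicit in the system prompt. A minimal sketch; the wording is illustrative, not the project's actual prompt:

```python
# Illustrative grounding prompt; adapt the wording to your own pipeline.
GROUNDED_SYSTEM_PROMPT = """\
Answer the user's question using ONLY the context passages provided below.
If the context does not contain the answer, say you don't know.
Do not use prior knowledge, and do not add information that is not in the context.
For each claim, cite the passage number that supports it, e.g. [2].

Context:
{context}
"""
```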
Low Answer Relevancy
Symptoms
- Answers don’t directly address the question
- Responses are too general or off-topic
Solutions
- Refine prompts: Make answer format more specific
- Better retrieval: Ensure retrieved context is relevant to the question
- Query preprocessing: Rephrase questions for better retrieval (a rewriting sketch follows this list)
- Use query-focused RAG: Try Rewriter or HyDE architectures
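A sketch of a simple query-rewriting step; `llm` stands for any callable that takes a prompt string and returns a string, so plug in your own client:

```python
# Hypothetical rewrite prompt; tune the wording for your domain.
REWRITE_PROMPT = (
    "Rewrite the following question so it is specific and self-contained, "
    "optimised for retrieving relevant passages:\n\n"
    "Question: {question}\n\nRewritten question:"
)

def rewrite_query(question: str, llm) -> str:
    return llm(REWRITE_PROMPT.format(question=question)).strip()
```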
Low Context Precision
Symptoms
- Retrieved chunks contain irrelevant information
- Too much noise in the context
Solutions
- Increase similarity threshold: Only retrieve highly relevant chunks (a filtering sketch follows this list)
- Add reranking: Use a reranker to filter retrieved chunks
- Better embeddings: Use domain-specific embedding models
- Try hybrid search: Combine BM25 + semantic search
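A minimal sketch of a similarity-threshold filter, assuming each retrieval hit is a (chunk_text, score) pair where higher scores mean more similar:

```python
# Keep only chunks whose similarity score clears a threshold before generation.
def filter_by_similarity(hits, threshold=0.75):
    return [(text, score) for text, score in hits if score >= threshold]
```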
Low Context Recall
Symptoms
- Important information is missing from retrieved context
- Answers are incomplete
Solutions
- Retrieve more chunks: Increase k (e.g., from 5 to 10; see the sketch after this list)
- Better chunking: Ensure chunks contain complete information
- Multi-query retrieval: Try Rewriter RAG for diverse retrieval
- Check embeddings: Ensure embedding model captures domain semantics
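A sketch of raising k, assuming the retriever exposes a `search(query, k)` method; adapt the call to match your own RAG implementation:

```python
# Retrieve more chunks per question (k raised from 5 to 10) to improve recall.
def retrieve_for_recall(retriever, question: str, k: int = 10):
    return retriever.search(question, k=k)
```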
Visualization Examples
While the project saves results as JSON, you can create visualizations:
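For example, a quick bar chart of the aggregated scores (the file name and the summary layout are assumptions; adjust the key path to your actual JSON):

```python
import json
import matplotlib.pyplot as plt

# Hypothetical file name and layout; adjust to match your results file.
with open("results/evaluation.json") as f:
    results = json.load(f)

metrics = results["summary"]["metrics"]  # e.g. {"faithfulness": 0.85, ...}

plt.bar(list(metrics.keys()), list(metrics.values()))
plt.ylim(0, 1)
plt.ylabel("Score")
plt.title("RAGAS metrics")
plt.tight_layout()
plt.savefig("results/metrics.png")
```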
Next Steps
- RAGAS Metrics - Learn more about what each metric measures
- Benchmarking - Best practices for comprehensive RAG benchmarking
