
Interpreting Results

This guide helps you understand the JSON output from RAGAS evaluations and extract actionable insights.

Result File Structure

All evaluation results are saved as JSON files in the results/ directory with this structure:
{
  "metadata": { /* Evaluation context */ },
  "pricing_config": { /* Cost tracking configuration */ },
  "summary": { /* Aggregated metrics */ },
  "question_by_question": [ /* Detailed per-question results */ ]
}
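
To orient yourself, load a result file and inspect its top-level keys. A minimal sketch; the filename below is a placeholder for any file in your results/ directory:
import json

# Placeholder path - use any file from your results/ directory
with open("results/example_evaluation.json") as f:
    results = json.load(f)

print(list(results.keys()))
# ['metadata', 'pricing_config', 'summary', 'question_by_question']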

Metadata Section

Provides context about the evaluation run:
{
  "metadata": {
    "timestamp": "2026-03-11T09:38:43.085121",
    "evaluation_type": "single_rag_evaluation_simple",
    "dataset_size": 10,
    "rags_evaluated": ["simple"],
    "model_used": "google/medgemma-1.5-4b-it"
  }
}
Fields:
  • timestamp - When the evaluation was run (ISO 8601 format)
  • evaluation_type - Type of evaluation (single RAG, multi-model, comprehensive)
  • dataset_size - Number of questions evaluated
  • rags_evaluated - List of RAG architectures tested
  • model_used - LLM model used for generation (single RAG only)

Pricing Configuration

Tracks the cost configuration for different models:
{
  "pricing_config": {
    "openai_models": {
      "gpt-4o": {
        "input_rate_per_1m": 2.5,
        "output_rate_per_1m": 10.0,
        "pricing_source_url": "https://developers.openai.com/api/pricing/",
        "pricing_updated_at": "2026-03-11"
      }
    },
    "huggingface_endpoints": {
      "medgemma": {
        "cloud_provider": "aws",
        "instance_family": "nvidia-A10G",
        "hourly_rate_usd": 1.8,
        "allocation_mode": "runtime_proportional"
      }
    }
  }
}
Pricing data is automatically captured from src/common/pricing.py and helps you understand the cost implications of different model choices.
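
As a rough sketch of how these rates translate into the costs reported below: token-priced models multiply token counts by the per-million rates, while runtime_proportional endpoints bill execution time against the hourly rate. The helpers below only illustrate those two formulas; they are not the exact logic in src/common/pricing.py:
def estimate_token_cost(input_tokens, output_tokens, input_rate_per_1m, output_rate_per_1m):
    # Token-based pricing (OpenAI-style models)
    return (input_tokens / 1_000_000) * input_rate_per_1m + \
           (output_tokens / 1_000_000) * output_rate_per_1m

def estimate_runtime_cost(execution_time_seconds, hourly_rate_usd):
    # Runtime-proportional pricing (dedicated endpoints)
    return (execution_time_seconds / 3600) * hourly_rate_usd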

Summary Section

Contains aggregated metrics and performance statistics:
{
  "summary": {
    "simple": {
      "rag_name": "Simple Semantic RAG",
      "metrics": {
        "faithfulness": 0.8503661560317907,
        "answer_relevancy": 0.26467160516971017,
        "context_precision": 0.7788888888660509,
        "context_recall": 0.6
      },
      "performance": {
        "average_execution_time": 11.316,
        "total_input_tokens": 9348,
        "total_output_tokens": 5120,
        "total_cost": 0.031433,
        "average_cost_per_question": 0.003143,
        "overall_average_score": 0.623
      }
    }
  }
}
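
A quick way to scan these numbers is to loop over the summary and print the headline figures per RAG. Minimal sketch, assuming results holds the loaded JSON:
for rag_type, data in results["summary"].items():
    metrics = data["metrics"]
    perf = data["performance"]
    print(f"{data['rag_name']} ({rag_type}): "
          f"avg {perf['overall_average_score']:.3f}, "
          f"faithfulness {metrics['faithfulness']:.3f}, "
          f"total cost ${perf['total_cost']:.4f}")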

Metrics Analysis

Faithfulness: 0.850

Excellent - Answers are well-grounded in retrieved context with minimal hallucination

Answer Relevancy: 0.265

Poor - Answers may not directly address the questions asked

Context Precision: 0.779

Good - Retrieval is mostly pulling relevant information

Context Recall: 0.600

Fair - Some relevant information may be missing from retrieval
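
These labels follow rough score bands. A small helper makes the bands explicit; the thresholds below (0.8 / 0.7 / 0.5) are assumptions for illustration, so adjust them to your own quality bar:
def interpret_score(score):
    # Assumed bands: >= 0.8 Excellent, >= 0.7 Good, >= 0.5 Fair, else Poor
    if score >= 0.8:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"

for name, value in results["summary"]["simple"]["metrics"].items():
    print(f"{name}: {value:.3f} ({interpret_score(value)})")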

Performance Analysis

"performance": {
  "average_execution_time": 11.316,      // seconds per question
  "total_input_tokens": 9348,            // total tokens sent to LLM
  "total_output_tokens": 5120,           // total tokens generated
  "total_cost": 0.031433,                // total cost in USD
  "average_cost_per_question": 0.003143, // cost per question
  "overall_average_score": 0.623         // average of all 4 metrics
}
Cost optimization insights:
  • This evaluation cost $0.03 for 10 questions
  • Average $0.003 per question
  • At scale (1000 questions): ~$3.14
  • Consider cheaper models for high-volume use cases
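
The 1000-question projection above is a simple proportion. A sketch, assuming per-question cost stays roughly constant at scale:
perf = results["summary"]["simple"]["performance"]
projected_questions = 1000
print(f"Projected cost for {projected_questions} questions: "
      f"~${perf['average_cost_per_question'] * projected_questions:.2f}")   # ~$3.14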

Question-by-Question Breakdown

The most detailed section showing individual question performance:
{
  "question_by_question": [
    {
      "question_id": 1,
      "question": "¿En qué momento y quien debe reevaluar el riesgo clínico...",
      "ground_truth": "El Ginecobstetra en la semana 28 - 30 y semana 34 – 36.",
      "rag_results": {
        "simple": {
          "answer": "<unused94>thought\n1. **Identify the core question:**...",
          "contexts_count": 1,
          "metrics": {
            "faithfulness": 0.9,
            "answer_relevancy": 0.0,
            "context_precision": 0.7499999999625,
            "context_recall": 0.0
          },
          "performance": {
            "execution_time": 16.701202154159546,
            "input_tokens": 920,
            "output_tokens": 512,
            "total_cost": 0.0046392228
          }
        }
      }
    }
  ]
}

Analyzing Individual Questions

Step 1: Identify Problem Questions

Look for questions with low overall scores:
from statistics import mean

rag_type = "simple"  # the RAG architecture under analysis

# Questions with avg score < 0.5 need investigation
problem_questions = [
    q for q in results["question_by_question"]
    if mean(q["rag_results"][rag_type]["metrics"].values()) < 0.5
]

Step 2: Examine the Answer

Read the generated answer to understand what went wrong:
  • Is it hallucinating?
  • Is it off-topic?
  • Is it incomplete?

Step 3: Check Retrieved Context

The contexts_count shows how many chunks were retrieved:
  • Too few contexts (1-2): May be missing information → Low context recall
  • Too many contexts (5+): May include noise → Low context precision

Step 4: Compare to Ground Truth

Compare the answer to the ground truth to understand the gap:
"ground_truth": "El Ginecobstetra en la semana 28 - 30 y semana 34 – 36."

Example Analysis: Question #1

"metrics": {
  "faithfulness": 0.9,           // ✅ Good - not hallucinating
  "answer_relevancy": 0.0,       // ❌ Problem - answer doesn't address question
  "context_precision": 0.75,     // ✅ Good - retrieved relevant context
  "context_recall": 0.0          // ❌ Problem - missing necessary information
}
Diagnosis:
  • The system retrieved relevant context (precision 0.75)
  • But it didn’t retrieve all necessary context (recall 0.0)
  • The answer is faithful to what was retrieved (faithfulness 0.9)
  • However, the answer doesn’t actually answer the question (relevancy 0.0)
Root Cause:
  • Only 1 context chunk was retrieved (contexts_count: 1)
  • The retrieved chunk didn’t contain the specific answer
  • The LLM couldn’t answer based on incomplete information
Fix:
  • Increase retrieval count (k) from default to higher value
  • Improve embedding quality for this type of question
  • Consider using a more advanced RAG architecture (hybrid, rewriter)

Comparing Across RAG Architectures

For multi-RAG evaluations, compare performance side-by-side:
{
  "question_id": 1,
  "question": "¿En qué momento y quien debe reevaluar...",
  "rag_results": {
    "simple": {
      "metrics": {
        "faithfulness": 0.9,
        "answer_relevancy": 0.0,
        "context_precision": 0.75,
        "context_recall": 0.0
      }
    },
    "hybrid": {
      "metrics": {
        "faithfulness": 0.95,
        "answer_relevancy": 0.85,
        "context_precision": 0.88,
        "context_recall": 0.90
      }
    }
  }
}

Analysis

Hybrid RAG significantly outperforms Simple RAG:
  • Better retrieval (precision 0.88 vs 0.75, recall 0.90 vs 0.0)
  • More relevant answers (relevancy 0.85 vs 0.0)
  • Maintains faithfulness (0.95 vs 0.9)
Recommendation: Use Hybrid RAG for production
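
A per-question delta makes this comparison concrete. Minimal sketch, assuming both "simple" and "hybrid" appear under rag_results:
for q in results["question_by_question"]:
    simple = q["rag_results"]["simple"]["metrics"]
    hybrid = q["rag_results"]["hybrid"]["metrics"]
    deltas = ", ".join(f"{m} {hybrid[m] - simple[m]:+.2f}" for m in simple)
    print(f"Q{q['question_id']}: {deltas}")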

Best Performers Analysis

For comprehensive evaluations, the results include a best_performers section:
{
  "best_performers": {
    "faithfulness": {
      "rag_type": "hybrid-rrf",
      "rag_name": "Hybrid RAG + RRF (BM25 + Semantic)",
      "score": 0.924
    },
    "answer_relevancy": {
      "rag_type": "rewriter",
      "rag_name": "Rewriter RAG (Multi-Query)",
      "score": 0.887
    },
    "best_avg_execution_time": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 8.234
    },
    "best_total_cost": {
      "rag_type": "simple",
      "rag_name": "Simple Semantic RAG",
      "score": 0.021
    }
  }
}
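
The same per-metric winners can be recomputed from the summary section of a multi-RAG result, which is handy for spot-checking. A minimal sketch:
metric_names = ["faithfulness", "answer_relevancy", "context_precision", "context_recall"]

for metric in metric_names:
    best_type, best = max(results["summary"].items(),
                          key=lambda item: item[1]["metrics"][metric])
    print(f"{metric}: {best['rag_name']} ({best['metrics'][metric]:.3f})")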

Trade-offs Analysis

Quality

Best: Hybrid-RRF & Rewriter
  • Highest accuracy metrics
  • Best retrieval quality
  • Most relevant answers
Trade-off: Higher cost and latency

Performance

Best: Simple Semantic
  • Fastest execution (8.2s)
  • Lowest cost ($0.021)
  • Simplest architecture
Trade-off: Lower quality metrics

Actionable Insights

Low Faithfulness

Symptoms:
  • Generated answers contain information not in the retrieved context
  • LLM is hallucinating or using prior knowledge
Fixes:
  1. Improve prompts: Add explicit instructions to only use retrieved context
  2. Use better models: Some models are better at staying grounded
  3. Add citations: Require the LLM to cite which chunks support each claim
  4. Post-processing: Filter out unsupported claims

Low Answer Relevancy

Symptoms:
  • Answers don’t directly address the question
  • Responses are too general or off-topic
Fixes:
  1. Refine prompts: Make answer format more specific
  2. Better retrieval: Ensure retrieved context is relevant to the question
  3. Query preprocessing: Rephrase questions for better retrieval
  4. Use query-focused RAG: Try Rewriter or HyDE architectures

Low Context Precision

Symptoms:
  • Retrieved chunks contain irrelevant information
  • Too much noise in the context
Fixes:
  1. Increase similarity threshold: Only retrieve highly relevant chunks
  2. Add reranking: Use a reranker to filter retrieved chunks
  3. Better embeddings: Use domain-specific embedding models
  4. Try hybrid search: Combine BM25 + semantic search

Low Context Recall

Symptoms:
  • Important information is missing from retrieved context
  • Answers are incomplete
Fixes:
  1. Retrieve more chunks: Increase k (e.g., from 5 to 10)
  2. Better chunking: Ensure chunks contain complete information
  3. Multi-query retrieval: Try Rewriter RAG for diverse retrieval
  4. Check embeddings: Ensure embedding model captures domain semantics

Visualization Examples

While the project saves results as JSON, you can create visualizations:
import json
import matplotlib.pyplot as plt

# Load results
with open('results/ragas_evaluation_hybrid_20260311_095023.json') as f:
    results = json.load(f)

# Extract metrics
metrics = results['summary']['hybrid']['metrics']

# Create bar chart
plt.bar(metrics.keys(), metrics.values())
plt.xlabel('Metrics')
plt.ylabel('Score')
plt.title('Hybrid RAG Performance')
plt.ylim(0, 1)
plt.axhline(y=0.8, color='g', linestyle='--', label='Excellent')
plt.axhline(y=0.6, color='orange', linestyle='--', label='Good')
plt.legend()
plt.tight_layout()
plt.savefig('hybrid_rag_metrics.png')
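
For multi-RAG results you can extend this to a grouped bar chart that compares architectures per metric. A sketch, assuming a comprehensive result file (the filename below is a placeholder):
import json
import numpy as np
import matplotlib.pyplot as plt

with open('results/ragas_evaluation_comprehensive_example.json') as f:  # placeholder filename
    results = json.load(f)

metric_names = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
rag_types = list(results['summary'].keys())
x = np.arange(len(metric_names))
width = 0.8 / len(rag_types)

plt.figure()
for i, rag_type in enumerate(rag_types):
    scores = [results['summary'][rag_type]['metrics'][m] for m in metric_names]
    plt.bar(x + i * width, scores, width, label=results['summary'][rag_type]['rag_name'])

plt.xticks(x + width * (len(rag_types) - 1) / 2, metric_names, rotation=20)
plt.ylim(0, 1)
plt.ylabel('Score')
plt.title('Metric Comparison Across RAG Architectures')
plt.legend()
plt.tight_layout()
plt.savefig('rag_comparison_metrics.png')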

Next Steps

RAGAS Metrics

Learn more about what each metric measures

Benchmarking

Best practices for comprehensive RAG benchmarking
