
Overview

The LLMJudge class evaluates generated responses for hallucinations by extracting claims, verifying them against source documents, detecting contradictions, and computing confidence scores. It implements a structured rubric-based evaluation system.

Class Definition

class LLMJudge:
    async def evaluate(self, response: str, context: List[Dict]) -> Dict

Methods

evaluate

Performs hallucination detection with structured rubric evaluation.
async def evaluate(self, response: str, context: List[Dict]) -> Dict
Parameters:
  • response (str, required): The generated response text to evaluate
  • context (List[Dict], required): Retrieved document sections used to generate the response. Each section should contain:
    • title (str): Section title
    • content (str): Section content text

Returns (Dict):
  • claims (List[Dict]): List of extracted and evaluated claims. Each claim contains:
    • text (str): The claim statement text
    • type (str): Claim type: quantitative, temporal, obligation, or general
    • found_in_source (bool): Whether a supporting quote was found
    • source_quote (str | None): The supporting quote from source documents (if found)
    • status (str): Claim status: supported, unsupported, or contradicted
  • confidence_score (float): Confidence score between 0.0 and 1.0 (rounded to 2 decimals)
  • is_hallucinated (bool): True if the response contains contradictions or confidence_score < 0.5
  • should_return (bool): Whether the response should be returned to the user (the inverse of is_hallucinated)
  • summary (Dict): Aggregate statistics:
    • total_claims (int): Total number of claims extracted
    • supported (int): Number of supported claims
    • unsupported (int): Number of unsupported claims
    • contradicted (int): Number of contradicted claims
  • reasoning (str): Human-readable explanation of the evaluation
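Taken together, a returned verdict has roughly this shape (illustrative values, not actual output):

```python
verdict = {
    "claims": [
        {
            "text": "Payment is due within 30 days of invoice receipt.",
            "type": "temporal",
            "found_in_source": True,
            "source_quote": "Payment is due within 30 days of invoice receipt.",
            "status": "supported",
        }
    ],
    "confidence_score": 1.0,
    "is_hallucinated": False,
    "should_return": True,
    "summary": {"total_claims": 1, "supported": 1, "unsupported": 0, "contradicted": 0},
    "reasoning": "All claims are supported by the source documents.",
}
```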

Example

judge = LLMJudge()

response = "The late payment penalty is 2% of the outstanding balance. Payment is due within 30 days of invoice receipt. (See Late Payment Penalties, page 5)"

context = [
    {
        "title": "Late Payment Penalties",
        "content": "A late fee of 1.5% per month (18% annually) will apply to outstanding balances. Payment is due within 30 days of invoice receipt.",
        "page_num": 5
    }
]

verdict = await judge.evaluate(response, context)

print(f"Confidence: {verdict['confidence_score']}")
print(f"Hallucinated: {verdict['is_hallucinated']}")
print(f"Should return: {verdict['should_return']}")

# Output:
# Confidence: 0.2  (low due to contradicted claim)
# Hallucinated: True (2% contradicts 1.5% in source)
# Should return: False

Evaluation Process

The judge uses a 4-phase evaluation process:

Phase 1: Extract Claims

Claims are extracted using pattern matching for different claim types.

Phase 2: Ground Claims

Each claim is verified against source documents to find supporting evidence.

Phase 3: Detect Contradictions

Claims are checked for contradictions with source content (especially quantitative values).

Phase 4: Calculate Confidence

A weighted scoring system computes overall confidence:
confidence_score = 1.0
confidence_score -= (contradicted_count / total_claims) * 0.8  # Heavy penalty
confidence_score -= (unsupported_count / total_claims) * 0.3   # Moderate penalty
confidence_score = max(0.0, min(1.0, confidence_score))
Scoring Weights:
  • Contradicted claim: -0.8 penalty (severe)
  • Unsupported claim: -0.3 penalty (moderate)
  • Supported claim: No penalty
Hallucination Threshold: contradicted_count > 0 OR confidence_score < 0.5
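The scoring rule and threshold above can be expressed as small standalone functions (a sketch; the function names and the zero-claims edge case are illustrative, not part of the class API):

```python
def confidence_score(total: int, contradicted: int, unsupported: int) -> float:
    """Weighted confidence: heavy penalty for contradictions, moderate for unsupported."""
    if total == 0:
        return 1.0  # assumed edge case: no claims, nothing to penalize
    score = 1.0
    score -= (contradicted / total) * 0.8  # heavy penalty
    score -= (unsupported / total) * 0.3   # moderate penalty
    return round(max(0.0, min(1.0, score)), 2)

def is_hallucinated(contradicted: int, score: float) -> bool:
    """Hallucination threshold: any contradiction, or confidence below 0.5."""
    return contradicted > 0 or score < 0.5
```

For instance, 5 claims with 2 unsupported yield 0.88, while a single contradicted claim forces is_hallucinated regardless of the score.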

Private Helper Methods

_extract_claims

Extracts claims from the response using pattern matching.
def _extract_claims(self, response: str) -> List[Dict]
Claim Types and Patterns:
  1. Quantitative Claims (numbers, percentages)
    • Pattern: r'([^.]*\d+(?:\.\d+)?%?[^.]*\.)'
    • Example: “The fee is 2% of the balance.”
  2. Temporal Claims (timeframes, deadlines)
    • Pattern: r'([^.]*(?:within|after|before|\d+\s*days?|\d+\s*months?|\d+\s*years?)[^.]*\.)'
    • Example: “Payment is due within 30 days.”
  3. Obligation Claims (requirements, mandates)
    • Pattern: r'([^.]*(?:shall|must|will|is required)[^.]*\.)'
    • Example: “The vendor shall provide weekly updates.”
  4. General Claims (fallback)
    • Sentences longer than 20 characters when no structured claims found
Example:
response = "The late fee is 2% per month. Payment must be made within 30 days."
claims = judge._extract_claims(response)
# [
#   {"text": "The late fee is 2% per month.", "type": "quantitative"},
#   {"text": "Payment must be made within 30 days.", "type": "temporal"}
# ]
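A simplified, runnable sketch of this extraction logic. Note one assumption: the % sign is made mandatory in the quantitative pattern here so the second sentence classifies as temporal, as in the example; the actual implementation may order and deduplicate matches differently:

```python
import re
from typing import Dict, List

# Pattern order determines which type a sentence is assigned first
PATTERNS = [
    ("quantitative", r"[^.]*\d+(?:\.\d+)?%[^.]*\."),
    ("temporal", r"[^.]*(?:within|after|before|\d+\s*days?|\d+\s*months?|\d+\s*years?)[^.]*\."),
    ("obligation", r"[^.]*(?:shall|must|will|is required)[^.]*\."),
]

def extract_claims(response: str) -> List[Dict]:
    claims, seen = [], set()
    for claim_type, pattern in PATTERNS:
        for match in re.findall(pattern, response):
            text = match.strip()
            if text not in seen:  # first matching type wins
                seen.add(text)
                claims.append({"text": text, "type": claim_type})
    # Fallback: general claims are sentences longer than 20 characters
    if not claims:
        claims = [{"text": s.strip(), "type": "general"}
                  for s in re.split(r"(?<=\.)\s+", response) if len(s.strip()) > 20]
    return claims
```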

_find_supporting_quote

Searches for supporting evidence in source documents.
def _find_supporting_quote(self, claim: str, context: List[Dict]) -> Optional[str]
Matching Strategy:
  1. Number Matching (Strict) - For quantitative claims
    • Extracts numbers from claim and content
    • Finds sentences containing matching numbers
    • Example: “2%” in claim must match “2%” in source
  2. Key Phrase Matching - For all claims
    • Extracts significant words (4+ characters)
    • Requires 3+ overlapping words between claim and content
    • Requires 2+ overlapping words in a specific sentence
Example:
claim = "Payment is due within 30 days"
context = [{"content": "Invoice payment is due within 30 days of receipt."}]
quote = judge._find_supporting_quote(claim, context)
# "Invoice payment is due within 30 days of receipt."

_check_contradiction

Detects contradictions between claims and source documents.
def _check_contradiction(self, claim: str, context: List[Dict]) -> bool
Contradiction Detection:
  1. Percentage Contradictions
    • Extracts percentages from claim and content
    • Checks for mismatches in payment/fee contexts
    • Example: Claim says “2%” but source says “1.5%”
  2. Timeframe Contradictions
    • Extracts day counts from claim and content
    • Checks for mismatches in payment contexts
    • Example: Claim says “30 days” but source says “15 days”
Context-Aware Matching:
# Only flags contradiction if both claim and source discuss same topic
payment_keywords = ["payment", "pay", "due", "within", "invoice", "receipt"]
Example:
claim = "The late fee is 2% per month"
context = [{"content": "A late fee of 1.5% per month will apply"}]
is_contradicted = judge._check_contradiction(claim, context)
# True (2% != 1.5%)
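The percentage branch can be sketched like this (illustrative only; "fee" is added to the topic keywords here so the fee example triggers, which is an assumption about the real keyword list):

```python
import re
from typing import Dict, List

# Topic keywords gating the check; "fee" added for illustration
PAYMENT_KEYWORDS = ["payment", "pay", "due", "within", "invoice", "receipt", "fee"]

def check_percentage_contradiction(claim: str, context: List[Dict]) -> bool:
    claim_pcts = set(re.findall(r"(\d+(?:\.\d+)?)\s*%", claim))
    if not claim_pcts:
        return False  # no percentages in the claim, nothing to contradict
    if not any(k in claim.lower() for k in PAYMENT_KEYWORDS):
        return False  # claim is off-topic for this check
    for section in context:
        content = section.get("content", "")
        # Context-aware: the source must discuss the same payment/fee topic
        if not any(k in content.lower() for k in PAYMENT_KEYWORDS):
            continue
        content_pcts = set(re.findall(r"(\d+(?:\.\d+)?)\s*%", content))
        # Mismatch: the source mentions percentages, none matching the claim's
        if content_pcts and not (claim_pcts & content_pcts):
            return True
    return False
```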

Confidence Scoring Examples

Example 1: Perfect Response

# 3 claims, all supported
confidence = 1.0 - (0/3)*0.8 - (0/3)*0.3 = 1.0
# Result: 1.0, not hallucinated

Example 2: Unsupported Claims

# 5 claims: 3 supported, 2 unsupported
confidence = 1.0 - (0/5)*0.8 - (2/5)*0.3 = 0.88
# Result: 0.88, not hallucinated (>0.5)

Example 3: Contradiction

# 4 claims: 2 supported, 1 unsupported, 1 contradicted
confidence = 1.0 - (1/4)*0.8 - (1/4)*0.3 = 0.725
# Result: 0.72, HALLUCINATED (contradicted_count > 0)

Example 4: Low Confidence

# 3 claims: 1 supported, 2 unsupported
confidence = 1.0 - (0/3)*0.8 - (2/3)*0.3 = 0.8
# 3 claims: 0 supported, 3 unsupported
confidence = 1.0 - (0/3)*0.8 - (3/3)*0.3 = 0.7
# 2 claims: 0 supported, 2 unsupported
confidence = 1.0 - (0/2)*0.8 - (2/2)*0.3 = 0.7
# Note: unsupported claims alone can only lower confidence to 0.7,
# so the < 0.5 hallucination threshold is reached only when
# contradicted claims are present.

Usage Example

from components import LLMJudge, ResponseGenerator, AgenticRetriever

# Generate response
generator = ResponseGenerator()
response = generator.generate(retrieved_sections)

# Evaluate for hallucinations
judge = LLMJudge()
verdict = await judge.evaluate(response, retrieved_sections)

if verdict["should_return"]:
    print("Response is grounded and safe to return")
    print(f"Confidence: {verdict['confidence_score']}")
    print(response)
else:
    print("Response contains hallucinations")
    print(f"Confidence: {verdict['confidence_score']}")
    print(f"Contradicted: {verdict['summary']['contradicted']}")
    print(f"Unsupported: {verdict['summary']['unsupported']}")
    
    # Show problematic claims
    for claim in verdict["claims"]:
        if claim["status"] != "supported":
            print(f"\n{claim['status'].upper()}: {claim['text']}")

Integration

The judge is the final validation step before returning responses to users:
# Full RAG pipeline with hallucination detection
decomposition = await decomposer.decompose(query)
sections = await retriever.retrieve(query, decomposition)
response = generator.generate(sections)
verdict = await judge.evaluate(response, sections)  # <- LLMJudge

if verdict["should_return"]:
    return response
else:
    return "I don't have enough information to answer confidently."
