
The Problem

LLMs generate fluent text even when making up facts. A response might say “late fee is 5%” when the document says “1.5%” - both sound plausible, but only one is correct. From README.md:29-30:
“The problem: LLMs generate fluent text even when making up facts. We need to validate each claim against source documents.”
The LLMJudge validates every factual claim in the generated response against retrieved document sections.

Implementation

Location: components.py:147-308

Four-Phase Validation

From README.md:31-36:
# Instead of asking "is this response correct?", decompose the problem:

1. Extract each factual claim from the response
2. Find evidence in documents
3. Detect contradictions
4. Calculate confidence

Phase 1: Claim Extraction

Method: _extract_claims(response: str) -> List[Dict]

The judge categorizes claims because different types need different validation:

Quantitative Claims

Numbers, percentages, amounts:
# Pattern: Any sentence containing numbers or percentages
number_patterns = re.findall(r'([^.]*\d+(?:\.\d+)?%?[^.]*\.)', response)

for match in number_patterns:
    claims.append({"text": match.strip(), "type": "quantitative"})
Location: components.py:152-155

Examples:
  • “Client shall pay a late fee of 1.5% per month.”
  • “Payment is due within 30 days.”
From README.md:38:
“Quantitative claims (numbers, percentages) are easy to verify and dangerous if wrong. ‘1.5% late fee’ is verifiable; ‘5% late fee’ is a detectable hallucination.”

Temporal Claims

Timeframes, deadlines, durations:
# Pattern: Sentences with time-related keywords
time_patterns = re.findall(
    r'([^.]*(?:within|after|before|\d+\s*days?|\d+\s*months?|\d+\s*years?)[^.]*\.)',
    response,
    re.IGNORECASE
)

for match in time_patterns:
    if match.strip() not in [c["text"] for c in claims]:
        claims.append({"text": match.strip(), "type": "temporal"})
Location: components.py:158-161

Examples:
  • “Either party may terminate upon 30 days’ written notice.”
  • “Confidentiality obligations survive for 3 years.”

Obligation Claims

Contract terms with “shall”, “must”, “will”:
# Pattern: Sentences with obligation keywords
obligation_patterns = re.findall(
    r'([^.]*(?:shall|must|will|is required)[^.]*\.)',
    response,
    re.IGNORECASE
)

for match in obligation_patterns:
    if match.strip() not in [c["text"] for c in claims]:
        claims.append({"text": match.strip(), "type": "obligation"})
Location: components.py:164-167

From README.md:38:
“Obligations (shall/must) indicate contract terms that should exist in the document.”
Examples:
  • “ABC Corporation shall indemnify Client against third-party claims.”
  • “Client must maintain confidentiality of proprietary information.”

General Claims

Fallback for other factual statements:
# If no structured claims found, split by sentences
if not claims:
    sentences = re.split(r'(?<=[.!?])\s+', response)
    for sent in sentences:
        if len(sent) > 20:  # Skip very short sentences
            claims.append({"text": sent.strip(), "type": "general"})
Location: components.py:170-174
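The four extraction passes above can be assembled into a self-contained sketch. The standalone function name extract_claims is illustrative; the actual logic lives in the _extract_claims method:

```python
import re
from typing import Dict, List


def extract_claims(response: str) -> List[Dict]:
    """Categorize factual claims using the four patterns described above."""
    claims: List[Dict] = []

    def add(text: str, claim_type: str) -> None:
        text = text.strip()
        if text and text not in [c["text"] for c in claims]:
            claims.append({"text": text, "type": claim_type})

    # Quantitative: sentences containing numbers or percentages
    for m in re.findall(r'([^.]*\d+(?:\.\d+)?%?[^.]*\.)', response):
        add(m, "quantitative")

    # Temporal: sentences with time-related keywords
    for m in re.findall(
        r'([^.]*(?:within|after|before|\d+\s*days?|\d+\s*months?|\d+\s*years?)[^.]*\.)',
        response, re.IGNORECASE,
    ):
        add(m, "temporal")

    # Obligation: contract-term keywords
    for m in re.findall(r'([^.]*(?:shall|must|will|is required)[^.]*\.)',
                        response, re.IGNORECASE):
        add(m, "obligation")

    # Fallback: plain sentences when nothing structured matched
    if not claims:
        for sent in re.split(r'(?<=[.!?])\s+', response):
            if len(sent) > 20:  # skip very short sentences
                add(sent, "general")
    return claims


extract_claims("The late fee is 1.5% per month.")
# → [{'text': 'The late fee is 1.5% per month.', 'type': 'quantitative'}]
```

Note that the passes run in a fixed order with deduplication, so a sentence containing both a number and a time keyword is classified by whichever pattern matches it first.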

Phase 2: Evidence Grounding

Method: _find_supporting_quote(claim: str, context: List[Dict]) -> Optional[str]

For each claim, search retrieved sections for supporting evidence.

Number Matching (Strict)

For quantitative claims, exact number matches are required:
numbers_in_claim = re.findall(r'\d+(?:\.\d+)?%?', claim)

for section in context:
    content = section.get("content", "")
    numbers_in_content = re.findall(r'\d+(?:\.\d+)?%?', content)
    
    for num in numbers_in_claim:
        if num in numbers_in_content:
            # Find the sentence containing this number
            sentences = re.split(r'(?<=[.!?])\s+', content)
            for sent in sentences:
                if num in sent:
                    return sent.strip()  # Supporting evidence found
Location: components.py:188-195
Strict number matching prevents subtle hallucinations like “1.5%” becoming “5%” or “30 days” becoming “60 days”.
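As a standalone sketch of this matcher (the function name find_number_quote is illustrative; the logic mirrors the excerpt above):

```python
import re
from typing import Dict, List, Optional


def find_number_quote(claim: str, context: List[Dict]) -> Optional[str]:
    """Return the first context sentence sharing an exact number with the claim."""
    numbers_in_claim = re.findall(r'\d+(?:\.\d+)?%?', claim)
    for section in context:
        content = section.get("content", "")
        numbers_in_content = re.findall(r'\d+(?:\.\d+)?%?', content)
        for num in numbers_in_claim:
            if num in numbers_in_content:
                # Return the sentence that contains the matching number
                for sent in re.split(r'(?<=[.!?])\s+', content):
                    if num in sent:
                        return sent.strip()
    return None


ctx = [{"content": "Client shall pay a late fee of 1.5% per month."}]
find_number_quote("The late fee is 1.5% per month.", ctx)  # exact match → the sentence
find_number_quote("The late fee is 5% per month.", ctx)    # "5%" ≠ "1.5%" → None
```

Because matching is on the exact number string, "5%" never matches "1.5%", which is the whole point of the strict mode.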

Key Phrase Matching (Flexible)

For non-quantitative claims, use word overlap:
# Extract significant words (4+ characters)
claim_words = set(re.findall(r'\b\w{4,}\b', claim_lower))
content_words = set(re.findall(r'\b\w{4,}\b', content_lower))
overlap = claim_words & content_words

if len(overlap) >= 3:  # Require at least 3 matching words
    sentences = re.split(r'(?<=[.!?])\s+', content)
    for sent in sentences:
        sent_words = set(re.findall(r'\b\w{4,}\b', sent.lower()))
        if len(claim_words & sent_words) >= 2:
            return sent.strip()
Location: components.py:198-207
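The same idea as a self-contained function (find_phrase_quote is an illustrative name; claim_lower and content_lower in the excerpt above are the lowercased inputs):

```python
import re
from typing import Dict, List, Optional


def find_phrase_quote(claim: str, context: List[Dict]) -> Optional[str]:
    """Word-overlap grounding for non-quantitative claims."""
    # Significant words only: 4+ characters, lowercased
    claim_words = set(re.findall(r'\b\w{4,}\b', claim.lower()))
    for section in context:
        content = section.get("content", "")
        content_words = set(re.findall(r'\b\w{4,}\b', content.lower()))
        if len(claim_words & content_words) >= 3:  # section-level gate
            for sent in re.split(r'(?<=[.!?])\s+', content):
                sent_words = set(re.findall(r'\b\w{4,}\b', sent.lower()))
                if len(claim_words & sent_words) >= 2:  # sentence-level gate
                    return sent.strip()
    return None


ctx = [{"content": "Client shall maintain strict confidentiality of all proprietary information."}]
find_phrase_quote("Client must maintain confidentiality of proprietary information.", ctx)
# returns the matching sentence from the section
```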

Phase 3: Contradiction Detection

Method: _check_contradiction(claim: str, context: List[Dict]) -> bool

From README.md:40-41:
“Contradiction detection compares numbers in claims against numbers in context. If the response says ‘late fee is 5%’ but the document says ‘late fee of 1.5%’, that’s a contradiction.”

Percentage Contradictions

claim_percentages = re.findall(r'(\d+(?:\.\d+)?)\s*%', claim)

for section in context:
    content = section.get("content", "")
    content_percentages = re.findall(r'(\d+(?:\.\d+)?)\s*%', content)
    
    if content_percentages:
        # Check if discussing the same topic
        if ("late" in claim_lower or "fee" in claim_lower) and \
           ("late" in content_lower or "fee" in content_lower):
            for claim_pct in claim_percentages:
                if claim_pct not in content_percentages:
                    return True  # CONTRADICTION!
Location: components.py:221-228

From README.md:41:
“I check that both discuss the same topic (late/fee keywords) before flagging.”
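A runnable sketch of this check, including the topic gate (percent_contradiction is an illustrative name; claim_lower and content_lower in the excerpt are the lowercased inputs):

```python
import re
from typing import Dict, List


def percent_contradiction(claim: str, context: List[Dict]) -> bool:
    """Flag mismatched percentages when claim and context share the late/fee topic."""
    claim_lower = claim.lower()
    claim_pcts = re.findall(r'(\d+(?:\.\d+)?)\s*%', claim)
    for section in context:
        content = section.get("content", "")
        content_lower = content.lower()
        content_pcts = re.findall(r'(\d+(?:\.\d+)?)\s*%', content)
        # Topic gate: only compare if both sides discuss late fees
        if content_pcts and \
           ("late" in claim_lower or "fee" in claim_lower) and \
           ("late" in content_lower or "fee" in content_lower):
            for pct in claim_pcts:
                if pct not in content_pcts:
                    return True  # contradiction
    return False


percent_contradiction("Late fee is 5% per month.",
                      [{"content": "a late fee of 1.5% per month"}])    # → True
percent_contradiction("Late fee is 1.5% per month.",
                      [{"content": "a late fee of 1.5% per month"}])    # → False
```

The topic gate means an unrelated percentage (say, a discount rate) is never compared against the late-fee figure.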

Day/Timeline Contradictions

claim_days = re.findall(r'(\d+)\s*days?', claim_lower)

for section in context:
    content = section.get("content", "")
    content_days_list = re.findall(r'(\d+)\s*\)?\s*days?', content_lower)
    
    if content_days_list:
        payment_keywords = ["payment", "pay", "due", "within", "invoice", "receipt"]
        claim_has_payment = any(kw in claim_lower for kw in payment_keywords)
        content_has_payment = any(kw in content_lower for kw in payment_keywords)
        
        if claim_has_payment and content_has_payment:
            for claim_d in claim_days:
                if claim_d not in content_days_list:
                    return True  # CONTRADICTION!
Location: components.py:231-240
Context-aware contradiction detection prevents false positives. “30 days” in payment terms vs “3 years” in confidentiality is NOT a contradiction because they discuss different topics.
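The day check can be sketched the same way; note the payment-keyword gate that implements the context awareness described above (day_contradiction is an illustrative name):

```python
import re
from typing import Dict, List


def day_contradiction(claim: str, context: List[Dict]) -> bool:
    """Flag mismatched day counts only when both sides discuss payment terms."""
    claim_lower = claim.lower()
    claim_days = re.findall(r'(\d+)\s*days?', claim_lower)
    payment_keywords = ["payment", "pay", "due", "within", "invoice", "receipt"]
    for section in context:
        content_lower = section.get("content", "").lower()
        # \)? tolerates the legal style "thirty (30) days"
        content_days = re.findall(r'(\d+)\s*\)?\s*days?', content_lower)
        if content_days and \
           any(kw in claim_lower for kw in payment_keywords) and \
           any(kw in content_lower for kw in payment_keywords):
            for d in claim_days:
                if d not in content_days:
                    return True  # contradiction
    return False


ctx = [{"content": "Payment is due within thirty (30) days of receipt."}]
day_contradiction("Payment is due within 60 days.", ctx)          # → True
# "3 years" has no day count and no payment keywords, so it is never flagged
day_contradiction("Confidentiality survives for 3 years.", ctx)   # → False
```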

Phase 4: Confidence Scoring

Method: evaluate(response: str, context: List[Dict]) -> Dict

From README.md:43-44:
# Weighted scoring system:

confidence_score = 1.0
confidence_score -= (contradicted_count / total_claims) * 0.8  # Heavy penalty
confidence_score -= (unsupported_count / total_claims) * 0.3   # Moderate penalty
confidence_score = max(0.0, min(1.0, confidence_score))
Location: components.py:285-288

Penalty Weights

| Status | Penalty | Rationale |
|---|---|---|
| Contradicted | -0.8 per claim | Stating something provably wrong is severe |
| Unsupported | -0.3 per claim | Might be valid inference, less severe |
| Supported | 0.0 | No penalty for correct claims |
From README.md:43-44:
“Confidence scoring: contradictions get heavy penalty (0.8 per claim) because stating something wrong is serious. Unsupported claims get moderate penalty (0.3) because they might be valid inferences.”

Decision Threshold

is_hallucinated = contradicted_count > 0 or confidence_score < 0.5
Location: components.py:291

From README.md:44:
“Threshold at 0.5 means if more than half the claims are problematic, the response is rejected.”
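Scoring and threshold combine into a few lines. A self-contained sketch (score_verdict is an illustrative name; the real logic lives inside evaluate):

```python
from typing import Dict


def score_verdict(total: int, contradicted: int, unsupported: int) -> Dict:
    """Apply the weighted penalties and the 0.5 rejection threshold."""
    score = 1.0
    if total:
        score -= (contradicted / total) * 0.8  # heavy penalty
        score -= (unsupported / total) * 0.3   # moderate penalty
    score = max(0.0, min(1.0, score))
    return {
        "confidence_score": score,
        # Any contradiction rejects the response, regardless of score
        "is_hallucinated": contradicted > 0 or score < 0.5,
    }


score_verdict(total=2, contradicted=1, unsupported=0)
# score ≈ 0.6, but is_hallucinated is True because of the contradiction
```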

Verdict Structure

The judge returns a structured verdict (a Python dict, serializable to JSON):
{
    "claims": [
        {
            "text": "Client shall pay invoices within 30 days.",
            "type": "temporal",
            "found_in_source": True,
            "source_quote": "Client shall pay invoices within thirty (30) days of receipt.",
            "status": "supported"
        },
        {
            "text": "Late fee is 5% per month.",
            "type": "quantitative",
            "found_in_source": False,
            "source_quote": None,
            "status": "contradicted"
        }
    ],
    "confidence_score": 0.15,
    "is_hallucinated": True,
    "should_return": False,
    "summary": {
        "total_claims": 2,
        "supported": 1,
        "unsupported": 0,
        "contradicted": 1
    },
    "reasoning": "Found 1 supported, 0 unsupported, 1 contradicted claims."
}
Location: components.py:293-308

Example: Detecting Hallucination

Generated Response

"The late payment fee is 5% per month. Payment is due within 30 days."

Retrieved Context

Section: Late Payment Penalties (page 8)
"If payment is not received within thirty (30) days, Client shall be assessed
a late fee of 1.5% per month (18% annually) on the outstanding balance."

Judge Evaluation

Claim 1: “The late payment fee is 5% per month.”
  • Type: quantitative
  • Numbers in claim: [“5”]
  • Numbers in context: [“30”, “1.5”, “18”]
  • Topic match: Both mention “late” and “fee”
  • Status: CONTRADICTED (5% ≠ 1.5%)
Claim 2: “Payment is due within 30 days.”
  • Type: temporal
  • Numbers in claim: [“30”]
  • Numbers in context: [“30”, “1.5”, “18”]
  • Supporting quote: “payment is not received within thirty (30) days”
  • Status: SUPPORTED

Confidence Calculation

total_claims = 2
contradicted_count = 1
supported_count = 1
unsupported_count = 0

confidence_score = 1.0
confidence_score -= (1 / 2) * 0.8  # -0.4 for contradicted claim
confidence_score -= (0 / 2) * 0.3  # -0.0 for unsupported
confidence_score = 0.6

is_hallucinated = True  # contradicted_count > 0
should_return = False
Even though confidence is 0.6 (above 0.5), ANY contradicted claim triggers is_hallucinated = True.

Integration with Workflow

# nodes.py:29-37
async def judge_node(state: DocMindState) -> DocMindState:
    judge = LLMJudge()
    verdict = await judge.evaluate(state["generated_response"], state["retrieved_sections"])
    state["judge_verdict"] = verdict
    state["node_history"] = state.get("node_history", []) + ["judge"]
    
    # Increment retry count if hallucinated
    if verdict.get("is_hallucinated", False):
        state["retry_count"] = state.get("retry_count", 0) + 1
    
    return state
Location: nodes.py:29-37

Retry Logic

# nodes.py:47-55
def should_retry(state: DocMindState) -> str:
    verdict = state.get("judge_verdict", {})
    retry_count = state.get("retry_count", 0)
    
    # Retry if hallucinated and haven't exceeded max retries (2 attempts max)
    if verdict.get("is_hallucinated", False) and retry_count < 2:
        log_retry_attempt(retry_count + 1, 2)
        return "retry"
    return "output"
Location: nodes.py:47-55

From README.md:47-48:
“If the judge detects hallucination, the system retries retrieval. Maximum 2 retries to avoid infinite loops. If retrieval fails twice, the information probably doesn’t exist in the documents.”
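Stripped of the graph machinery, the retry semantics reduce to a plain loop. A sketch under the assumption that retrieve, generate, and judge_fn are stand-ins for the real pipeline stages (the actual flow is wired through LangGraph edges):

```python
def answer_with_retries(query, retrieve, generate, judge_fn, max_retries=2):
    """Re-run retrieval and generation when the judge flags a hallucination,
    giving up after max_retries extra attempts."""
    retries = 0
    while True:
        sections = retrieve(query)
        response = generate(query, sections)
        verdict = judge_fn(response, sections)
        if not verdict["is_hallucinated"] or retries >= max_retries:
            return response, verdict, retries
        retries += 1
```

With max_retries=2 the pipeline runs at most three times, matching the "retrieval fails twice" rule from the README.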

Design Limitations

From README.md:61:
“Judge makes single-pass decisions without revision.”
In production, you could:
  • Allow the judge to request additional context
  • Support multi-hop reasoning across sections
  • Implement chain-of-thought validation
  • Use an LLM for more nuanced claim extraction

Testing

import asyncio

from components import LLMJudge

judge = LLMJudge()

# Test claim extraction
response = "The late fee is 1.5% per month. Payment is due within 30 days."
claims = judge._extract_claims(response)
assert len(claims) == 2
assert claims[0]["type"] == "quantitative"
assert all(c["type"] in ("quantitative", "temporal") for c in claims)

# Test contradiction detection
claim = "Late fee is 5% per month"
context = [{"content": "late fee of 1.5% per month"}]
assert judge._check_contradiction(claim, context) is True

# Test evaluation: evaluate() is async, so drive it with asyncio.run
# when calling from synchronous code. Use a response that contradicts
# the context to trigger the hallucination verdict.
bad_response = "The late fee is 5% per month."
verdict = asyncio.run(judge.evaluate(bad_response, context))
assert verdict["is_hallucinated"] is True
assert verdict["confidence_score"] < 0.5

Next Steps

Agentic Retrieval

Understand how sections are retrieved

LangGraph Workflow

See the complete orchestration
