Overview
The model-as-judge approach uses a separate, independent language model to evaluate and compare responses from the system under test. This provides an automated, scalable way to detect factual inconsistencies without requiring human annotators.
Why use a judge model?
The judge model provides several advantages:
Advantages of model-as-judge
- Scalability: Can process thousands of queries without human intervention
- Consistency: Applies the same evaluation criteria to all responses
- Detail: Provides structured reasoning and identifies specific conflicting facts
- Speed: Analyzes multiple responses in seconds
- Cost-effectiveness: More affordable than human evaluation at scale
PAS2 uses OpenAI’s o3-mini model as the judge because of its strong reasoning capabilities and reliability in structured output generation.
Judge method implementation
The judge_hallucination() method orchestrates the judgment process:
Method signature
Parameters:
- original_query (str): The first query that was asked
- original_response (str): Response to the original query
- paraphrased_queries (List[str]): All semantic paraphrases of the original
- paraphrased_responses (List[str]): Responses to each paraphrase
Returns:
- HallucinationJudgment: Structured judgment with detection results and reasoning
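A minimal sketch of the signature and return type follows. The real return type is a Pydantic model; a dataclass stands in here so the sketch is self-contained, and every field name except hallucination_detected is an assumption based on the judgment components described later on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class HallucinationJudgment:
    # The real PAS2 class is a Pydantic model; a dataclass stands in here.
    # Only hallucination_detected is confirmed by this page; the other
    # field names are illustrative assumptions.
    hallucination_detected: bool = False
    confidence_score: float = 0.0
    conflicting_facts: List[Dict[str, Any]] = field(default_factory=list)
    reasoning: str = ""
    summary: str = ""

def judge_hallucination(
    original_query: str,
    original_response: str,
    paraphrased_queries: List[str],
    paraphrased_responses: List[str],
) -> HallucinationJudgment:
    """Orchestrates context building, the judge call, and response parsing."""
    raise NotImplementedError  # the process is described in the sections below
```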
Context preparation
The judge receives a formatted context containing all queries and responses (pas2.py:314-324). The combined context allows the judge to:
- Compare the original response against all paraphrased responses
- Identify patterns of inconsistency across multiple variations
- Trace conflicting facts back to specific query-response pairs
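One way the combined context might be assembled is sketched below; the exact layout used in pas2.py:314-324 may differ, and the function name is illustrative.

```python
from typing import List

def build_judgment_context(
    original_query: str,
    original_response: str,
    paraphrased_queries: List[str],
    paraphrased_responses: List[str],
) -> str:
    """Format every query/response pair so the judge can cross-compare them.

    Illustrative sketch; the exact layout in pas2.py may differ.
    """
    parts = [
        f"Original Query: {original_query}",
        f"Original Response: {original_response}",
    ]
    # Number each paraphrase so conflicting facts can be traced back
    # to a specific query-response pair.
    for i, (q, r) in enumerate(zip(paraphrased_queries, paraphrased_responses), 1):
        parts.append(f"Paraphrased Query {i}: {q}")
        parts.append(f"Response {i}: {r}")
    return "\n\n".join(parts)
```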
System prompt engineering
The judge uses a carefully designed system prompt to ensure accurate evaluation (pas2.py:326-338).
Key prompt elements
Role definition
Establishes the judge’s purpose: detecting hallucinations through cross-response comparison.
Evaluation criteria
Focuses on factual inconsistencies while ignoring stylistic variations that don’t affect meaning.
Hallucination definition
Clearly defines what constitutes a hallucination: stating different facts for the same question.
The prompt explicitly instructs the judge to focus on factual discrepancies, not stylistic differences. This prevents false positives from variations in tone, length, or phrasing.
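An illustrative prompt capturing the elements above (role definition, evaluation criteria, hallucination definition, and the factual-vs-stylistic instruction) might look like this; the actual wording in pas2.py:326-338 is not reproduced here.

```python
# Illustrative only -- the exact prompt text in pas2.py:326-338 may differ.
JUDGE_SYSTEM_PROMPT = """\
You are an expert judge that detects hallucinations in language model
outputs. You will receive an original query, its response, several
paraphrased queries, and their responses.

A hallucination occurs when the model states different facts for the
same underlying question. Compare the responses and identify factual
inconsistencies. Ignore differences in tone, length, or phrasing that
do not change meaning.

Respond with a JSON object containing: hallucination_detected (bool),
confidence_score (float 0-1), conflicting_facts (list), reasoning (str),
and summary (str).
"""
```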
API call and response parsing
The judge model is called with structured output mode.
Response format enforcement
Using response_format={"type": "json_object"} ensures the model always returns valid JSON, making parsing reliable and reducing error-handling complexity.
Judgment object creation
The JSON response is parsed and converted to a typed Pydantic model (pas2.py:354-361). This conversion provides:
- Type safety through Pydantic validation
- Default values for missing fields
- Consistent structure across all judgments
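A sketch of how the structured call, defaulted parsing, and the error fallback might fit together is shown here. It assumes the openai>=1.x chat-completions client, and all field names other than hallucination_detected are assumptions; the real implementation lives in pas2.py.

```python
import json

# Safe defaults, used both for missing fields and for the error fallback.
# Field names other than hallucination_detected are assumptions.
_DEFAULT_JUDGMENT = {
    "hallucination_detected": False,
    "confidence_score": 0.0,
    "conflicting_facts": [],
    "reasoning": "",
    "summary": "",
}

def call_judge(client, context: str, system_prompt: str) -> dict:
    """Call the judge model and parse its JSON verdict.

    Illustrative sketch assuming an openai>=1.x client instance.
    """
    try:
        response = client.chat.completions.create(
            model="o3-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": context},
            ],
            response_format={"type": "json_object"},  # forces valid JSON
        )
        data = json.loads(response.choices[0].message.content)
    except Exception:
        # On any failure, assume no hallucination with zero confidence
        return dict(_DEFAULT_JUDGMENT)
    # Apply defaults for any fields the judge omitted
    return {key: data.get(key, default) for key, default in _DEFAULT_JUDGMENT.items()}
```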
Judgment components explained
Hallucination detection flag
Type: bool
Description: Primary binary indicator of whether hallucinations were found
Confidence score
Type: float (0.0 to 1.0)
Description: Judge’s confidence in its determination
- 0.0-0.3: Low confidence, borderline cases
- 0.4-0.6: Moderate confidence, some evidence
- 0.7-1.0: High confidence, clear evidence
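The bands above could be mapped to labels with a small helper like the following; this is an illustrative convenience, not part of PAS2, and it resolves the gaps between the stated bands (0.3-0.4 and 0.6-0.7) by using inclusive upper thresholds.

```python
def confidence_band(score: float) -> str:
    """Map a judge confidence score to a coarse band.

    Illustrative helper, not part of PAS2. Thresholds follow the bands
    described above, with gaps resolved via inclusive upper bounds.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    if score <= 0.3:
        return "low"       # borderline cases
    if score <= 0.6:
        return "moderate"  # some evidence
    return "high"          # clear evidence
```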
Conflicting facts
Type: List[Dict[str, Any]]
Description: Structured list of specific factual contradictions found
Example structure:
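One plausible shape for an entry in this list is shown below; the exact keys used by PAS2 are not documented on this page, so the key names and values here are assumptions.

```python
# Assumed shape of one conflicting-facts entry; PAS2's actual keys may differ.
conflicting_facts = [
    {
        "fact": "Year the company was founded",
        "original_claim": "Founded in 1998",
        "conflicting_claims": ["Founded in 2001 (response to paraphrase 2)"],
    },
]
```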
Reasoning
Type: str
Description: Detailed explanation of the judge’s analysis process and findings
Typically includes:
- Comparison methodology
- Specific examples of inconsistencies
- Explanation of why differences are or aren’t hallucinations
Summary
Type: str
Description: Concise summary suitable for display to end users
Example judgment
Fallback mechanism
If the judge model fails or returns an error, a safe fallback judgment is provided (pas2.py:368-377). The fallback assumes no hallucination (hallucination_detected=False) with zero confidence, following the principle of “innocent until proven guilty” when evidence is unavailable.
Judge model selection
PAS2 uses OpenAI’s o3-mini model (pas2.py:58).
Why o3-mini?
- Reasoning capability: Strong analytical and comparison abilities
- Structured output: Reliable JSON generation
- Cost-efficiency: More affordable than larger models
- Speed: Fast response times for real-time applications
Performance characteristics
Judgment typically completes in 2-5 seconds, depending on:
- Number of responses to compare (more responses = longer analysis)
- Response length (longer text requires more processing)
- API response time (network and model availability)
All judgment timing is logged with millisecond precision for performance monitoring (pas2.py:364).
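The timing could be captured with a wrapper along these lines; pas2.py logs timing inline rather than via a wrapper, so this is only a sketch of the same idea.

```python
import logging
import time

logger = logging.getLogger("pas2.judge")

def timed_judgment(judge_fn, *args, **kwargs):
    """Run a judgment call and log its duration in milliseconds.

    Illustrative sketch; pas2.py logs timing inline rather than via a wrapper.
    """
    start = time.perf_counter()
    result = judge_fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    logger.info("Judgment completed in %.3f ms", elapsed_ms)
    return result
```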