Running Tests
Test Output
The test runner provides formatted output.
Test Set A: Strategic Retrieval
These tests verify that the AgenticRetriever uses intent-based scoring to retrieve the most relevant sections.
test_retrieval_specificity
Verifies that the retriever returns specific sections, not the entire document:
test_starter.py:13-24
What it tests:
- Retriever must find page 8 (Late Payment Penalties)
- Should return fewer than 4 sections (strategic, not exhaustive)
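The two checks above can be sketched as a small helper. This is only an illustration: `Section`, its `page` and `title` fields, and `assert_specific_retrieval` are hypothetical names, not the project's actual API, and the results are hand-built rather than produced by the real retriever.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """Hypothetical stand-in for a retrieved document section."""
    page: int
    title: str


def assert_specific_retrieval(sections, required_page=8, max_sections=4):
    """Check that retrieval is strategic: the required page is present
    and fewer than `max_sections` sections were returned."""
    pages = [s.page for s in sections]
    assert required_page in pages, f"page {required_page} missing from {pages}"
    assert len(sections) < max_sections, f"too many sections: {len(sections)}"


# Hand-built results standing in for real retriever output.
results = [Section(8, "Late Payment Penalties"), Section(3, "Example section")]
assert_specific_retrieval(results)
```

Keeping the page and count checks in one helper makes a failure message say which of the two expectations was violated.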
test_retrieval_payment_deadlines
Tests retrieval for payment deadline queries:
test_starter.py:28-38
Validates:
- Payment-related sections are retrieved
- Result set is limited to 5 sections maximum
test_retrieval_indemnification
Tests retrieval quality with relevance scoring:
test_starter.py:42-53
Checks:
- Indemnification section is retrieved
- All results meet relevance threshold (score >= 2.0)
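A minimal sketch of the relevance-threshold check follows. It assumes each result exposes a title and a numeric score. `ScoredSection`, `has_section`, and `below_threshold` are hypothetical names, and the scores shown are illustrative values, not output from the real retriever.

```python
from dataclasses import dataclass

RELEVANCE_THRESHOLD = 2.0  # threshold stated in the checks above


@dataclass
class ScoredSection:
    """Hypothetical stand-in for a retrieved section with a relevance score."""
    title: str
    score: float


def has_section(sections, keyword):
    """True if any section title contains the keyword (case-insensitive)."""
    return any(keyword.lower() in s.title.lower() for s in sections)


def below_threshold(sections, threshold=RELEVANCE_THRESHOLD):
    """Return the sections whose score falls below the threshold."""
    return [s for s in sections if s.score < threshold]


# Illustrative results; a score of exactly 2.0 still passes (score >= 2.0).
results = [ScoredSection("Indemnification", 3.5), ScoredSection("Example section", 2.0)]
assert has_section(results, "indemnification")
assert below_threshold(results) == []
```

Returning the offending sections, rather than a bare boolean, makes it easier to see which result dragged the set below the threshold.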
test_retrieval_ip_infringement
Tests IP-related query retrieval:
test_starter.py:57-66
Test Set B: LLM-as-Judge
These tests verify that the judge correctly identifies hallucinations, contradictions, and unsupported claims.
test_judge_hallucination_catch
Verifies that the judge catches obvious hallucinations:
test_starter.py:72-81
Tests:
- Source says 1.5%, response claims 5%
- Judge must detect the contradiction
- Confidence score must be below 0.5
- is_hallucinated must be True
test_judge_catches_unsupported_claims
Tests detection of claims not found in the source:
test_starter.py:85-94
Validates:
- The “grace period” claim is not in the source
- Judge flags it as “unsupported”
test_judge_catches_contradictions
Tests contradiction detection with multiple incorrect claims:
test_starter.py:98-107
Checks:
- 60 days contradicts 30 days
- 5% contradicts 1.5%
- Judge flags as contradicted or hallucinated
test_judge_catches_made_up_numbers
Critical test for numeric hallucinations:
test_starter.py:111-119
Critical for:
- Contract compliance (wrong numbers can have legal implications)
- Financial accuracy (fees, dates, percentages)
test_judge_passes_valid_inferences
Ensures the judge doesn’t over-penalize valid responses:
test_starter.py:123-132
Validates:
- Correct numbers (1.5%, 30 days) are recognized
- Reasonable paraphrasing (“must be made” vs “is due”) is accepted
- High confidence score for accurate responses
test_judge_structured_output
Tests that the judge returns properly structured verdicts:
test_starter.py:136-149
Required fields:
- claims: List of evaluated claims
- confidence_score: Float between 0 and 1
- is_hallucinated: Boolean flag
- should_return: Whether to return the response
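The required fields can be sketched as a dataclass with a small structural validator. This is an assumption about the verdict's shape, not the project's actual class: `Verdict`, `validate_verdict`, the dict keys inside `claims`, and the choice of `bool` for `should_return` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verdict:
    """Hypothetical verdict shape mirroring the four required fields."""
    claims: List[dict] = field(default_factory=list)
    confidence_score: float = 0.0
    is_hallucinated: bool = False
    should_return: bool = True


def validate_verdict(verdict):
    """Return a list of problems; an empty list means the structure is valid."""
    problems = []
    if not isinstance(verdict.claims, list):
        problems.append("claims must be a list")
    score = verdict.confidence_score
    if not isinstance(score, float) or not 0.0 <= score <= 1.0:
        problems.append("confidence_score must be a float between 0 and 1")
    if not isinstance(verdict.is_hallucinated, bool):
        problems.append("is_hallucinated must be a bool")
    if not isinstance(verdict.should_return, bool):
        problems.append("should_return must be a bool")
    return problems


# Illustrative verdict for a response that contradicts its source.
verdict = Verdict(
    claims=[{"text": "Late fee is 5%", "status": "contradicted"}],  # hypothetical keys
    confidence_score=0.2,
    is_hallucinated=True,
    should_return=False,
)
assert validate_verdict(verdict) == []
```

Collecting every problem before failing reports all malformed fields in a single run instead of stopping at the first one.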
Test Set C: End-to-End
These tests validate the complete workflow from query to final output.
test_e2e_payment_terms_query
Full workflow test with latency check:
test_starter.py:155-164
Validates:
- Response is generated
- Judge evaluates the response
- Total latency is under 10 seconds
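A latency check of this kind can be sketched with a monotonic timer. `fake_workflow` is a hypothetical placeholder for the real query, retrieval, generation, and judging pipeline, and the shape of its return value is an assumption made for illustration.

```python
import time

MAX_LATENCY_SECONDS = 10.0  # budget stated in the checks above


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_workflow(query):
    """Hypothetical placeholder for the real end-to-end pipeline."""
    return {"response": f"Answer to: {query}", "verdict": {"confidence_score": 0.9}}


output, elapsed = timed(fake_workflow, "What are the payment terms?")
assert output["response"], "a response should be generated"
assert output["verdict"] is not None, "the judge should evaluate the response"
assert elapsed < MAX_LATENCY_SECONDS, f"workflow took {elapsed:.2f}s"
```

`time.perf_counter` is used rather than `time.time` because it is monotonic and unaffected by system clock adjustments during the run.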
test_e2e_termination_query
Tests termination-related queries:
test_starter.py:168-174
test_e2e_confidentiality_query
Validates confidentiality section retrieval:
test_starter.py:178-185
Checks:
- Confidentiality section is retrieved
- Response is generated from correct sections
test_e2e_intellectual_property_query
Tests intent detection for IP queries:
test_starter.py:189-195
Validates:
- Query is correctly decomposed with intent “intellectual_property”
- Appropriate sections are retrieved
test_e2e_indemnification_query
Tests indemnification workflow:
test_starter.py:199-205
test_e2e_late_payment_penalties
Critical test for penalty queries with latency check:
test_starter.py:209-218
test_e2e_services_scope_query
Tests scope of services retrieval limits:
test_starter.py:222-228
Ensures:
- Retrieval is limited to 5 sections
- Prevents information overload
Running Specific Test Sets
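One possible way to select a single test set is by name prefix, since every test in Set A starts with `test_retrieval`, Set B with `test_judge`, and Set C with `test_e2e`. The sketch below assumes the suite can be run with pytest, which this page does not confirm. `TEST_SET_PREFIXES` and `pytest_args` are hypothetical helpers that only build the argument list.

```python
# Prefixes taken from the test names listed in this document.
TEST_SET_PREFIXES = {"A": "test_retrieval", "B": "test_judge", "C": "test_e2e"}


def pytest_args(test_set, test_file="test_starter.py"):
    """Build pytest CLI arguments selecting one test set by name prefix.

    Assumes pytest is the runner; -k filters tests by name, -v is verbose.
    """
    return [test_file, "-k", TEST_SET_PREFIXES[test_set.upper()], "-v"]


print(" ".join(["pytest"] + pytest_args("B")))
# prints: pytest test_starter.py -k test_judge -v
```

If pytest is installed, the same list could be passed to `pytest.main(pytest_args("A"))` to run the selection from Python.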
Writing Custom Tests
Add your own tests to validate custom behavior.
Next Steps
Customization
Customize components and extend the system
Running Queries
Learn how to run queries and interpret results