DocMind includes a comprehensive test suite with 17 tests covering strategic retrieval, LLM-as-Judge evaluation, and end-to-end workflows.

Running Tests

pytest test_starter.py -v

Test Output

The test runner provides formatted output:
============================================================
TEST SET A: Strategic Retrieval
============================================================
 Retrieval Specificity Passed
 Payment Deadlines Passed
 Indemnification Passed
 IP Infringement Passed

============================================================
TEST SET B: LLM-as-Judge
============================================================
 Hallucination Catch Passed
 Unsupported Claims Passed
 Contradictions Passed
 Made-up Numbers Passed
 Valid Inferences Passed
 Structured Output Passed

============================================================
TEST SET C: End-to-End
============================================================
 Payment Terms Query Passed
 Termination Query Passed
 Confidentiality Query Passed
 IP Query Passed
 Indemnification Query Passed
 Late Payment Penalties Passed
 Services Scope Query Passed

Test Set A: Strategic Retrieval

These tests verify that the AgenticRetriever uses intent-based scoring to retrieve the most relevant sections.
Verifies the retriever returns specific sections, not the entire document:
test_starter.py:13-24
@pytest.mark.asyncio
async def test_retrieval_specificity():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    
    results = await retriever.retrieve(
        "What are the penalties for late payment?", 
        {"intent": "penalty"}
    )
    
    page_nums = [r["page_num"] for r in results]
    assert 8 in page_nums, "Must retrieve page 8 (Penalties)"
    assert len(results) < 4, "Should not retrieve entire document"
What it tests:
  • Retriever must find page 8 (Late Payment Penalties)
  • Should return fewer than 4 sections (strategic, not exhaustive)
Why it matters: Prevents over-retrieval, which can introduce noise and hallucinations.
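The intent-based scoring these tests exercise can be sketched roughly as follows. This is a hypothetical illustration only: the real AgenticRetriever's scoring logic, weights, and thresholds may differ.

```python
# Hypothetical sketch of intent-based relevance scoring.
# The actual AgenticRetriever implementation may differ.

def score_section(section: dict, query_terms: set, intent: str) -> float:
    """Score a section by keyword overlap, with a boost when it matches the intent."""
    text = (section["title"] + " " + section["content"]).lower()
    score = sum(1.0 for term in query_terms if term in text)
    if intent and intent in text:  # boost sections matching the detected intent
        score += 2.0
    return score

def retrieve(sections, query: str, intent: str, min_score: float = 2.0, limit: int = 5):
    """Return at most `limit` sections scoring at or above `min_score`."""
    terms = {t.strip("?.,!") for t in query.lower().split() if len(t) > 3}
    scored = [(score_section(s, terms, intent), s) for s in sections]
    top = sorted((x for x in scored if x[0] >= min_score), key=lambda x: -x[0])
    return [s for _, s in top[:limit]]

sections = [
    {"title": "Late Payment Penalties", "content": "Late fee is 1.5% per month.", "page_num": 8},
    {"title": "Confidentiality", "content": "Both parties keep information secret.", "page_num": 5},
]
results = retrieve(sections, "What are the penalties for late payment?", "penalty")
```

The key property the tests check for is the combination of a relevance threshold and a hard result limit, so a broad query cannot pull in the whole document.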
Tests retrieval for payment deadline queries:
test_starter.py:28-38
@pytest.mark.asyncio
async def test_retrieval_payment_deadlines():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("Payment deadlines")
    results = await retriever.retrieve("Payment deadlines", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Payment" in t for t in titles), "Must retrieve payment-related sections"
    assert len(results) <= 5, "Should return at most 5 sections"
Validates:
  • Payment-related sections are retrieved
  • Result set is limited to 5 sections maximum
Tests retrieval quality with relevance scoring:
test_starter.py:42-53
@pytest.mark.asyncio
async def test_retrieval_indemnification():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("Indemnification obligations")
    results = await retriever.retrieve("Indemnification obligations", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Indemnification" in t for t in titles), "Must retrieve indemnification section"
    for result in results:
        assert result["_relevance_score"] >= 2.0, "All results should be highly relevant"
Checks:
  • Indemnification section is retrieved
  • All results meet relevance threshold (score >= 2.0)
Tests IP-related query retrieval:
test_starter.py:57-66
@pytest.mark.asyncio
async def test_retrieval_ip_infringement():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("What happens if IP is infringed?")
    results = await retriever.retrieve("What happens if IP is infringed?", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Intellectual Property" in t or "IP" in t for t in titles), "Must retrieve IP section"

Test Set B: LLM-as-Judge

These tests verify the judge correctly identifies hallucinations, contradictions, and unsupported claims.
Verifies the judge catches obvious hallucinations:
test_starter.py:72-81
@pytest.mark.asyncio
async def test_judge_hallucination_catch():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month."}]
    hallucinated_response = "The late fee is 5% per month."
    
    verdict = await judge.evaluate(hallucinated_response, source_context)
    
    assert verdict["confidence_score"] < 0.5
    assert verdict["is_hallucinated"] == True
Tests:
  • Source says 1.5%, response claims 5%
  • Judge must detect the contradiction
  • Confidence score must be below 0.5
  • is_hallucinated must be True
Tests detection of claims not found in source:
test_starter.py:85-94
@pytest.mark.asyncio
async def test_judge_catches_unsupported_claims():
    judge = LLMJudge()
    
    source_context = [{"content": "Payment is due within 30 days of invoice receipt."}]
    response_with_unsupported = "Payment is due within 30 days. Additionally, a grace period of 15 days is provided."
    
    verdict = await judge.evaluate(response_with_unsupported, source_context)
    
    unsupported = [c for c in verdict["claims"] if c["status"] == "unsupported"]
    assert len(unsupported) > 0, "Judge should identify unsupported claims"
Validates:
  • The “grace period” claim is not in the source
  • Judge flags it as “unsupported”
Tests contradiction detection with multiple incorrect claims:
test_starter.py:98-107
@pytest.mark.asyncio
async def test_judge_catches_contradictions():
    judge = LLMJudge()
    
    source_context = [{"content": "Payment is due within 30 days of invoice receipt. Late fee is 1.5% per month."}]
    contradicted_response = "Payment is due within 60 days. Late fee is 5% per month."
    
    verdict = await judge.evaluate(contradicted_response, source_context)
    
    contradicted = [c for c in verdict["claims"] if c["status"] == "contradicted"]
    assert len(contradicted) > 0 or verdict["is_hallucinated"], "Judge should catch contradictions"
Checks:
  • 60 days contradicts 30 days
  • 5% contradicts 1.5%
  • Judge flags as contradicted or hallucinated
Critical test for numeric hallucinations:
test_starter.py:111-119
@pytest.mark.asyncio
async def test_judge_catches_made_up_numbers():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month. Payment is due within 30 days."}]
    response_with_fake_numbers = "Late fee is 10% per month. Payment is due within 15 days."
    
    verdict = await judge.evaluate(response_with_fake_numbers, source_context)
    
    assert verdict["is_hallucinated"] == True, "Judge should flag made-up numbers as hallucination"
Critical for:
  • Contract compliance (wrong numbers can have legal implications)
  • Financial accuracy (fees, dates, percentages)
Ensures the judge doesn’t over-penalize valid responses:
test_starter.py:123-132
@pytest.mark.asyncio
async def test_judge_passes_valid_inferences():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month. Payment is due within 30 days of invoice receipt."}]
    valid_response = "Late fee is 1.5% per month. Payment must be made within 30 days."
    
    verdict = await judge.evaluate(valid_response, source_context)
    
    assert verdict["confidence_score"] >= 0.5, "Valid inferences should have high confidence"
    assert verdict["should_return"] == True, "Valid responses should be returned"
Validates:
  • Correct numbers (1.5%, 30 days) are recognized
  • Reasonable paraphrasing (“must be made” vs “is due”) is accepted
  • High confidence score for accurate responses
Tests the judge returns properly structured verdicts:
test_starter.py:136-149
@pytest.mark.asyncio
async def test_judge_structured_output():
    judge = LLMJudge()
    
    source_context = [{"content": "The contract is governed by California law."}]
    response = "This contract follows California law."
    
    verdict = await judge.evaluate(response, source_context)
    
    assert "claims" in verdict, "Verdict must contain claims"
    assert "confidence_score" in verdict, "Verdict must contain confidence_score"
    assert "is_hallucinated" in verdict, "Verdict must contain is_hallucinated"
    assert "should_return" in verdict, "Verdict must contain should_return"
    assert isinstance(verdict["claims"], list), "Claims must be a list"
    assert 0 <= verdict["confidence_score"] <= 1, "Confidence score must be between 0 and 1"
Required fields:
  • claims: List of evaluated claims
  • confidence_score: Float between 0 and 1
  • is_hallucinated: Boolean flag
  • should_return: Whether to return the response
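A verdict satisfying these structure checks might look like the following. The values are illustrative only; the actual judge output depends on the model and the evaluated response.

```python
# Illustrative example of a verdict dict that satisfies the structure checks above.
# Field values are made up for demonstration.
example_verdict = {
    "claims": [
        {"claim": "This contract follows California law.", "status": "supported"},
    ],
    "confidence_score": 0.9,   # float in [0, 1]
    "is_hallucinated": False,  # boolean flag
    "should_return": True,     # whether the response should be surfaced
}

# The same checks as in test_judge_structured_output:
assert "claims" in example_verdict
assert isinstance(example_verdict["claims"], list)
assert 0 <= example_verdict["confidence_score"] <= 1
```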

Test Set C: End-to-End

These tests validate the complete workflow from query to final output.
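The state keys asserted in these tests (generated_response, retrieved_sections, decomposition, judge_verdict) suggest a pipeline of decompose, retrieve, generate, and judge steps. A minimal plain-Python sketch of such a flow is shown below; the node names and placeholder values are hypothetical, and the real build_graph_workflow may wire things differently.

```python
# Hypothetical sketch of the end-to-end pipeline that build_graph_workflow runs.
# State keys mirror the test assertions; node bodies are placeholders.
import asyncio

async def decompose(state):
    state["decomposition"] = {"intent": "payment"}  # placeholder decomposition
    return state

async def retrieve(state):
    state["retrieved_sections"] = [{"title": "Payment Terms", "page_num": 3}]
    return state

async def generate(state):
    state["generated_response"] = "Payment is due within 30 days."
    return state

async def judge(state):
    state["judge_verdict"] = {"should_return": True, "confidence_score": 0.9}
    return state

async def run(query: str) -> dict:
    state = {"query": query}
    for node in (decompose, retrieve, generate, judge):
        state = await node(state)
    return state

result = asyncio.run(run("What are the payment terms?"))
```

Each end-to-end test below asserts on some subset of this final state.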
Full workflow test with latency check:
test_starter.py:155-164
@pytest.mark.asyncio
async def test_e2e_payment_terms_query():
    graph = build_graph_workflow()
    
    start_time = time.time()
    result = await graph.ainvoke({"query": "What are the payment terms?"})
    latency = time.time() - start_time
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"], "Should have judge verdict"
    assert latency < 10, "Should complete within reasonable time"
Validates:
  • Response is generated
  • Judge evaluates the response
  • Total latency is under 10 seconds
Tests termination-related queries:
test_starter.py:168-174
@pytest.mark.asyncio
async def test_e2e_termination_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "How can the contract be terminated?"})
    
    assert result["generated_response"], "Should generate a response"
    assert len(result["retrieved_sections"]) > 0, "Should retrieve sections"
Validates confidentiality section retrieval:
test_starter.py:178-185
@pytest.mark.asyncio
async def test_e2e_confidentiality_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What are the confidentiality obligations?"})
    
    assert result["generated_response"], "Should generate a response"
    titles = [s["title"] for s in result["retrieved_sections"]]
    assert any("Confidentiality" in t for t in titles), "Should retrieve confidentiality section"
Checks:
  • Confidentiality section is retrieved
  • Response is generated from correct sections
Tests intent detection for IP queries:
test_starter.py:189-195
@pytest.mark.asyncio
async def test_e2e_intellectual_property_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "Who owns the intellectual property?"})
    
    assert result["generated_response"], "Should generate a response"
    assert result["decomposition"]["intent"] == "intellectual_property"
Validates:
  • Query is correctly decomposed with intent “intellectual_property”
  • Appropriate sections are retrieved
Tests indemnification workflow:
test_starter.py:199-205
@pytest.mark.asyncio
async def test_e2e_indemnification_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What are the indemnification requirements?"})
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"]["should_return"] in [True, False]
Critical test for penalty queries with latency check:
test_starter.py:209-218
@pytest.mark.asyncio
async def test_e2e_late_payment_penalties():
    graph = build_graph_workflow()
    
    start_time = time.time()
    result = await graph.ainvoke({"query": "What happens if I pay late?"})
    latency = time.time() - start_time
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"], "Should have judge evaluation"
    assert latency < 10, "Latency should be reasonable"
Tests retrieval limits for scope-of-services queries:
test_starter.py:222-228
@pytest.mark.asyncio
async def test_e2e_services_scope_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What services are provided under this agreement?"})
    
    assert result["generated_response"], "Should generate a response"
    assert len(result["retrieved_sections"]) <= 5, "Should not retrieve too many sections"
Ensures:
  • Retrieval is limited to 5 sections
  • Prevents information overload

Running Specific Test Sets

pytest test_starter.py -k "test_retrieval" -v
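The other test sets can be selected the same way by their name prefixes, and a single test can be run directly (assuming the test names shown above):

```shell
# Run only the LLM-as-Judge tests
pytest test_starter.py -k "test_judge" -v

# Run only the end-to-end tests
pytest test_starter.py -k "test_e2e" -v

# Run one specific test by name
pytest test_starter.py::test_judge_hallucination_catch -v
```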

Writing Custom Tests

Add your own tests to validate custom behavior:
import pytest
from components import AgenticRetriever, QueryDecomposer
from mock_data import MockDocumentStore

@pytest.mark.asyncio
async def test_custom_intent():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    # Test your custom query
    decomposition = await decomposer.decompose("Your custom query")
    results = await retriever.retrieve("Your custom query", decomposition)
    
    # Add your assertions
    assert len(results) > 0, "Should retrieve sections"
    assert results[0]["_relevance_score"] >= 2.0, "Should be relevant"

Next Steps

Customization

Customize components and extend the system

Running Queries

Learn how to run queries and interpret results
