DocMind includes a comprehensive test suite with 17 tests covering strategic retrieval, LLM-as-Judge evaluation, and end-to-end workflows.

Running Tests

pytest test_starter.py -v

Test Output

The test runner provides formatted output:
============================================================
TEST SET A: Strategic Retrieval
============================================================
 Retrieval Specificity Passed
 Payment Deadlines Passed
 Indemnification Passed
 IP Infringement Passed

============================================================
TEST SET B: LLM-as-Judge
============================================================
 Hallucination Catch Passed
 Unsupported Claims Passed
 Contradictions Passed
 Made-up Numbers Passed
 Valid Inferences Passed
 Structured Output Passed

============================================================
TEST SET C: End-to-End
============================================================
 Payment Terms Query Passed
 Termination Query Passed
 Confidentiality Query Passed
 IP Query Passed
 Indemnification Query Passed
 Late Payment Penalties Passed
 Services Scope Query Passed

Test Set A: Strategic Retrieval

These tests verify that the AgenticRetriever uses intent-based scoring to retrieve the most relevant sections.
Verifies the retriever returns specific sections, not the entire document:
test_starter.py:13-24
@pytest.mark.asyncio
async def test_retrieval_specificity():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    
    results = await retriever.retrieve(
        "What are the penalties for late payment?", 
        {"intent": "penalty"}
    )
    
    page_nums = [r["page_num"] for r in results]
    assert 8 in page_nums, "Must retrieve page 8 (Penalties)"
    assert len(results) < 4, "Should not retrieve entire document"
What it tests:
  • Retriever must find page 8 (Late Payment Penalties)
  • Should return fewer than 4 sections (strategic, not exhaustive)
Why it matters: Prevents over-retrieval, which can introduce noise and hallucinations.
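The intent-based scoring these tests exercise can be sketched roughly as follows. This is a hypothetical illustration only: the real AgenticRetriever's scoring logic, weights, and thresholds may differ.

```python
# Hypothetical sketch of intent-based relevance scoring.
# The actual AgenticRetriever implementation may differ.

def score_section(section: dict, query_terms: set, intent: str) -> float:
    """Score a section by keyword overlap, with a boost when it matches the intent."""
    text = (section["title"] + " " + section["content"]).lower()
    score = sum(1.0 for term in query_terms if term in text)
    if intent and intent in text:  # boost sections matching the detected intent
        score += 2.0
    return score

def retrieve(sections, query: str, intent: str, min_score: float = 2.0, limit: int = 5):
    """Return at most `limit` sections scoring at or above `min_score`."""
    terms = {t.strip("?.,!") for t in query.lower().split() if len(t) > 3}
    scored = [(score_section(s, terms, intent), s) for s in sections]
    top = sorted((x for x in scored if x[0] >= min_score), key=lambda x: -x[0])
    return [s for _, s in top[:limit]]

sections = [
    {"title": "Late Payment Penalties", "content": "Late fee is 1.5% per month.", "page_num": 8},
    {"title": "Confidentiality", "content": "Both parties keep information secret.", "page_num": 5},
]
results = retrieve(sections, "What are the penalties for late payment?", "penalty")
```

The key property the tests check for is the combination of a relevance threshold and a hard result limit, so a broad query cannot pull in the whole document.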
Tests retrieval for payment deadline queries:
test_starter.py:28-38
@pytest.mark.asyncio
async def test_retrieval_payment_deadlines():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("Payment deadlines")
    results = await retriever.retrieve("Payment deadlines", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Payment" in t for t in titles), "Must retrieve payment-related sections"
    assert len(results) <= 5, "Should return at most 5 sections"
Validates:
  • Payment-related sections are retrieved
  • Result set is limited to 5 sections maximum
Tests retrieval quality with relevance scoring:
test_starter.py:42-53
@pytest.mark.asyncio
async def test_retrieval_indemnification():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("Indemnification obligations")
    results = await retriever.retrieve("Indemnification obligations", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Indemnification" in t for t in titles), "Must retrieve indemnification section"
    for result in results:
        assert result["_relevance_score"] >= 2.0, "All results should be highly relevant"
Checks:
  • Indemnification section is retrieved
  • All results meet relevance threshold (score >= 2.0)
Tests IP-related query retrieval:
test_starter.py:57-66
@pytest.mark.asyncio
async def test_retrieval_ip_infringement():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    decomposition = await decomposer.decompose("What happens if IP is infringed?")
    results = await retriever.retrieve("What happens if IP is infringed?", decomposition)
    
    titles = [r["title"] for r in results]
    assert any("Intellectual Property" in t or "IP" in t for t in titles), "Must retrieve IP section"

Test Set B: LLM-as-Judge

These tests verify the judge correctly identifies hallucinations, contradictions, and unsupported claims.
Verifies the judge catches obvious hallucinations:
test_starter.py:72-81
@pytest.mark.asyncio
async def test_judge_hallucination_catch():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month."}]
    hallucinated_response = "The late fee is 5% per month."
    
    verdict = await judge.evaluate(hallucinated_response, source_context)
    
    assert verdict["confidence_score"] < 0.5
    assert verdict["is_hallucinated"] == True
Tests:
  • Source says 1.5%, response claims 5%
  • Judge must detect the contradiction
  • Confidence score must be below 0.5
  • is_hallucinated must be True
Tests detection of claims not found in source:
test_starter.py:85-94
@pytest.mark.asyncio
async def test_judge_catches_unsupported_claims():
    judge = LLMJudge()
    
    source_context = [{"content": "Payment is due within 30 days of invoice receipt."}]
    response_with_unsupported = "Payment is due within 30 days. Additionally, a grace period of 15 days is provided."
    
    verdict = await judge.evaluate(response_with_unsupported, source_context)
    
    unsupported = [c for c in verdict["claims"] if c["status"] == "unsupported"]
    assert len(unsupported) > 0, "Judge should identify unsupported claims"
Validates:
  • The “grace period” claim is not in the source
  • Judge flags it as “unsupported”
Tests contradiction detection with multiple incorrect claims:
test_starter.py:98-107
@pytest.mark.asyncio
async def test_judge_catches_contradictions():
    judge = LLMJudge()
    
    source_context = [{"content": "Payment is due within 30 days of invoice receipt. Late fee is 1.5% per month."}]
    contradicted_response = "Payment is due within 60 days. Late fee is 5% per month."
    
    verdict = await judge.evaluate(contradicted_response, source_context)
    
    contradicted = [c for c in verdict["claims"] if c["status"] == "contradicted"]
    assert len(contradicted) > 0 or verdict["is_hallucinated"], "Judge should catch contradictions"
Checks:
  • 60 days contradicts 30 days
  • 5% contradicts 1.5%
  • Judge flags as contradicted or hallucinated
Critical test for numeric hallucinations:
test_starter.py:111-119
@pytest.mark.asyncio
async def test_judge_catches_made_up_numbers():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month. Payment is due within 30 days."}]
    response_with_fake_numbers = "Late fee is 10% per month. Payment is due within 15 days."
    
    verdict = await judge.evaluate(response_with_fake_numbers, source_context)
    
    assert verdict["is_hallucinated"] == True, "Judge should flag made-up numbers as hallucination"
Critical for:
  • Contract compliance (wrong numbers can have legal implications)
  • Financial accuracy (fees, dates, percentages)
Ensures the judge doesn’t over-penalize valid responses:
test_starter.py:123-132
@pytest.mark.asyncio
async def test_judge_passes_valid_inferences():
    judge = LLMJudge()
    
    source_context = [{"content": "Late fee is 1.5% per month. Payment is due within 30 days of invoice receipt."}]
    valid_response = "Late fee is 1.5% per month. Payment must be made within 30 days."
    
    verdict = await judge.evaluate(valid_response, source_context)
    
    assert verdict["confidence_score"] >= 0.5, "Valid inferences should have high confidence"
    assert verdict["should_return"] == True, "Valid responses should be returned"
Validates:
  • Correct numbers (1.5%, 30 days) are recognized
  • Reasonable paraphrasing (“must be made” vs “is due”) is accepted
  • High confidence score for accurate responses
Tests the judge returns properly structured verdicts:
test_starter.py:136-149
@pytest.mark.asyncio
async def test_judge_structured_output():
    judge = LLMJudge()
    
    source_context = [{"content": "The contract is governed by California law."}]
    response = "This contract follows California law."
    
    verdict = await judge.evaluate(response, source_context)
    
    assert "claims" in verdict, "Verdict must contain claims"
    assert "confidence_score" in verdict, "Verdict must contain confidence_score"
    assert "is_hallucinated" in verdict, "Verdict must contain is_hallucinated"
    assert "should_return" in verdict, "Verdict must contain should_return"
    assert isinstance(verdict["claims"], list), "Claims must be a list"
    assert 0 <= verdict["confidence_score"] <= 1, "Confidence score must be between 0 and 1"
Required fields:
  • claims: List of evaluated claims
  • confidence_score: Float between 0 and 1
  • is_hallucinated: Boolean flag
  • should_return: Whether to return the response
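A verdict satisfying these structure checks might look like the following. The values are illustrative only; the actual judge output depends on the model and the evaluated response.

```python
# Illustrative example of a verdict dict that satisfies the structure checks above.
# Field values are made up for demonstration.
example_verdict = {
    "claims": [
        {"claim": "This contract follows California law.", "status": "supported"},
    ],
    "confidence_score": 0.9,   # float in [0, 1]
    "is_hallucinated": False,  # boolean flag
    "should_return": True,     # whether the response should be surfaced
}

# The same checks as in test_judge_structured_output:
assert "claims" in example_verdict
assert isinstance(example_verdict["claims"], list)
assert 0 <= example_verdict["confidence_score"] <= 1
```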

Test Set C: End-to-End

These tests validate the complete workflow from query to final output.
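The state keys asserted in these tests (generated_response, retrieved_sections, decomposition, judge_verdict) suggest a pipeline of decompose, retrieve, generate, and judge steps. A minimal plain-Python sketch of such a flow is shown below; the node names and placeholder values are hypothetical, and the real build_graph_workflow may wire things differently.

```python
# Hypothetical sketch of the end-to-end pipeline that build_graph_workflow runs.
# State keys mirror the test assertions; node bodies are placeholders.
import asyncio

async def decompose(state):
    state["decomposition"] = {"intent": "payment"}  # placeholder decomposition
    return state

async def retrieve(state):
    state["retrieved_sections"] = [{"title": "Payment Terms", "page_num": 3}]
    return state

async def generate(state):
    state["generated_response"] = "Payment is due within 30 days."
    return state

async def judge(state):
    state["judge_verdict"] = {"should_return": True, "confidence_score": 0.9}
    return state

async def run(query: str) -> dict:
    state = {"query": query}
    for node in (decompose, retrieve, generate, judge):
        state = await node(state)
    return state

result = asyncio.run(run("What are the payment terms?"))
```

Each end-to-end test below asserts on some subset of this final state.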
Full workflow test with latency check:
test_starter.py:155-164
@pytest.mark.asyncio
async def test_e2e_payment_terms_query():
    graph = build_graph_workflow()
    
    start_time = time.time()
    result = await graph.ainvoke({"query": "What are the payment terms?"})
    latency = time.time() - start_time
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"], "Should have judge verdict"
    assert latency < 10, "Should complete within reasonable time"
Validates:
  • Response is generated
  • Judge evaluates the response
  • Total latency is under 10 seconds
Tests termination-related queries:
test_starter.py:168-174
@pytest.mark.asyncio
async def test_e2e_termination_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "How can the contract be terminated?"})
    
    assert result["generated_response"], "Should generate a response"
    assert len(result["retrieved_sections"]) > 0, "Should retrieve sections"
Validates confidentiality section retrieval:
test_starter.py:178-185
@pytest.mark.asyncio
async def test_e2e_confidentiality_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What are the confidentiality obligations?"})
    
    assert result["generated_response"], "Should generate a response"
    titles = [s["title"] for s in result["retrieved_sections"]]
    assert any("Confidentiality" in t for t in titles), "Should retrieve confidentiality section"
Checks:
  • Confidentiality section is retrieved
  • Response is generated from correct sections
Tests intent detection for IP queries:
test_starter.py:189-195
@pytest.mark.asyncio
async def test_e2e_intellectual_property_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "Who owns the intellectual property?"})
    
    assert result["generated_response"], "Should generate a response"
    assert result["decomposition"]["intent"] == "intellectual_property"
Validates:
  • Query is correctly decomposed with intent “intellectual_property”
  • Appropriate sections are retrieved
Tests indemnification workflow:
test_starter.py:199-205
@pytest.mark.asyncio
async def test_e2e_indemnification_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What are the indemnification requirements?"})
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"]["should_return"] in [True, False]
Critical test for penalty queries with latency check:
test_starter.py:209-218
@pytest.mark.asyncio
async def test_e2e_late_payment_penalties():
    graph = build_graph_workflow()
    
    start_time = time.time()
    result = await graph.ainvoke({"query": "What happens if I pay late?"})
    latency = time.time() - start_time
    
    assert result["generated_response"], "Should generate a response"
    assert result["judge_verdict"], "Should have judge evaluation"
    assert latency < 10, "Latency should be reasonable"
Tests retrieval limits for scope-of-services queries:
test_starter.py:222-228
@pytest.mark.asyncio
async def test_e2e_services_scope_query():
    graph = build_graph_workflow()
    
    result = await graph.ainvoke({"query": "What services are provided under this agreement?"})
    
    assert result["generated_response"], "Should generate a response"
    assert len(result["retrieved_sections"]) <= 5, "Should not retrieve too many sections"
Ensures:
  • Retrieval is limited to 5 sections
  • Prevents information overload

Running Specific Test Sets

pytest test_starter.py -k "test_retrieval" -v
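The other test sets can be selected the same way by their name prefixes, and a single test can be run directly (assuming the test names shown above):

```shell
# Run only the LLM-as-Judge tests
pytest test_starter.py -k "test_judge" -v

# Run only the end-to-end tests
pytest test_starter.py -k "test_e2e" -v

# Run one specific test by name
pytest test_starter.py::test_judge_hallucination_catch -v
```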

Writing Custom Tests

Add your own tests to validate custom behavior:
import pytest
from components import AgenticRetriever, QueryDecomposer
from mock_data import MockDocumentStore

@pytest.mark.asyncio
async def test_custom_intent():
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    decomposer = QueryDecomposer()
    
    # Test your custom query
    decomposition = await decomposer.decompose("Your custom query")
    results = await retriever.retrieve("Your custom query", decomposition)
    
    # Add your assertions
    assert len(results) > 0, "Should retrieve sections"
    assert results[0]["_relevance_score"] >= 2.0, "Should be relevant"

Next Steps

Customization

Customize components and extend the system

Running Queries

Learn how to run queries and interpret results
