
The Problem

Semantic search returns “similar” documents but not always the right ones. Searching for “late payment” might return generic “payment terms” sections that don’t mention penalties. From README.md:14-16:
“The problem: semantic search returns ‘similar’ documents but not always the right ones. Searching ‘late payment’ might return generic ‘payment terms’ sections.”
AgenticRetriever solves this with strategy-based retrieval that combines intent mapping, entity matching, and relevance scoring.

Implementation

Location: components.py:66-144

Search Strategies

From README.md:17-21:
# Three search strategies based on intent:

1. full_text - For IP and indemnification
   → Exact term matches required
"Indemnification" should NOT return "liability"

2. hybrid - For payment, penalties, termination
   → Combines exact matches with variations
"Late payment" = "overdue payment"

3. vector - For unknown intents
   → Fallback when intent is unclear
Currently, the system uses full_text search as the primary mechanism:
# components.py:119
all_sections = await self.doc_store.full_text_search(query)
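As a hypothetical sketch (not present in components.py, which currently always calls full_text_search), the three strategies could be dispatched from the decomposed intent like this:

```python
# Hypothetical strategy dispatch based on the README's three strategies.
# These sets and the function are illustrative, not part of components.py.
FULL_TEXT_INTENTS = {"intellectual_property", "indemnification"}
HYBRID_INTENTS = {"penalty", "payment_terms", "termination"}

def choose_strategy(intent: str) -> str:
    if intent in FULL_TEXT_INTENTS:
        return "full_text"  # exact term matches required
    if intent in HYBRID_INTENTS:
        return "hybrid"     # exact matches plus variations
    return "vector"         # fallback when intent is unclear
```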

Intent-to-Section Mapping

The retriever maintains explicit mappings from query intents to document sections:
INTENT_SECTION_MAP = {
    "penalty": ["Late Payment Penalties", "Payment Terms"],
    "payment_terms": ["Payment Terms", "Late Payment Penalties"],
    "intellectual_property": ["Intellectual Property Rights"],
    "indemnification": ["Indemnification"],
    "termination": ["Termination for Convenience"],
    "confidentiality": ["Confidentiality"],
    "scope_of_services": ["Scope of Services"],
}
Location: components.py:67-75
Every section listed for an intent receives a +5.0 base score. The first section in each list is the primary match and receives an additional +2.0 bonus (+7.0 total); secondary sections get the +5.0 base but no primary bonus.
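The primary/secondary split can be illustrated in isolation (intent_score is a hypothetical helper that mirrors the documented weights; it is not a function in components.py):

```python
# Illustrative helper mirroring the documented intent weights
# (a subset of the real INTENT_SECTION_MAP is used here).
INTENT_SECTION_MAP = {
    "penalty": ["Late Payment Penalties", "Payment Terms"],
}

def intent_score(title: str, intent: str) -> float:
    targets = INTENT_SECTION_MAP.get(intent, [])
    if title not in targets:
        return 0.0
    # +5.0 base for any intent match, +2.0 extra for the primary section
    return 7.0 if title == targets[0] else 5.0
```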

Multi-Signal Scoring

Method: _score_section(section, query, decomposition) -> float

From README.md:23-25:
“The scoring system combines multiple signals. Intent mapping has the highest weight (5-7 points) because if we know the user wants ‘penalty’, the ‘Late Payment Penalties’ section is almost certainly relevant.”

Scoring Formula

def _score_section(self, section: Dict, query: str, decomposition: Dict) -> float:
    score = 0.0
    content_lower = section["content"].lower()  # lowercased once for matching
    title_lower = section["title"].lower()
    
    # 1. Intent-based scoring (highest priority)
    intent = decomposition.get("intent", "unknown")
    target_sections = self.INTENT_SECTION_MAP.get(intent, [])
    
    if section["title"] in target_sections:
        score += 5.0  # Base intent match
        if section["title"] == target_sections[0]:  # Primary match
            score += 2.0
    
    # 2. Entity matching
    entities = decomposition.get("entities", [])
    for entity in entities:
        if entity in content_lower:
            score += 1.0  # Entity in content
        if entity in title_lower:
            score += 1.5  # Entity in title (more descriptive)
    
    # 3. Query term matching (tiebreaker)
    query_terms = query.lower().split()
    for term in query_terms:
        if len(term) > 3 and term in content_lower:
            score += 0.5
    
    return score
Location: components.py:80-107

Scoring Breakdown

| Signal | Weight | Rationale |
| --- | --- | --- |
| Primary intent match | +7.0 | 5.0 base + 2.0 bonus; section is exactly what the user wants |
| Secondary intent match | +5.0 | Section is relevant to the intent |
| Entity in title | +1.5 | Titles are more descriptive than content |
| Entity in content | +1.0 | Entity appears in the section text |
| Query term match | +0.5 | Tiebreaker between similar sections |
From README.md:24:
“Entity matches in titles get 1.5 points (titles are more descriptive), in content 1.0. Query terms get 0.5 for tiebreaking.”

Relevance Filtering

Method: _filter_irrelevant(sections, threshold=1.0) -> List[Dict]
def _filter_irrelevant(self, sections: List[Dict], threshold: float = 1.0) -> List[Dict]:
    return [s for s in sections if s.get("_relevance_score", 0) >= threshold]
Location: components.py:109-111

The retriever applies a 2.0 threshold to ensure quality:
relevant = self._filter_irrelevant(scored_sections, threshold=2.0)
Location: components.py:135

From README.md:25-26:
“I set the relevance threshold at 2.0. This ensures at least one entity appears in the title or there’s an intent match. Determined empirically from the test cases.”

What 2.0 Means

A section must have at least one of:
  • One entity in title (1.5) + one entity in content (1.0) = 2.5 ✅
  • Two entities in title (1.5 × 2) = 3.0 ✅
  • Intent match (5.0+) = 5.0+ ✅
  • One entity in title (1.5) + one query term (0.5) = 2.0 ✅
Sections with only content matches (score < 2.0) are filtered as likely noise.
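The effect of the threshold can be seen with a toy standalone version of the filter (re-implemented here for illustration, not imported from components.py):

```python
# Standalone re-implementation of _filter_irrelevant for demonstration.
def filter_irrelevant(sections, threshold=2.0):
    return [s for s in sections if s.get("_relevance_score", 0) >= threshold]

scored = [
    {"title": "Late Payment Penalties", "_relevance_score": 13.5},
    {"title": "Payment Terms", "_relevance_score": 7.5},
    {"title": "Scope of Services", "_relevance_score": 1.5},  # content-only match
    {"title": "Confidentiality", "_relevance_score": 0.0},
]
kept = filter_irrelevant(scored)  # only the first two sections survive
```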

Retrieval Algorithm

Method: retrieve(query, decomposition) -> List[Dict]
async def retrieve(self, query: str, decomposition: Dict) -> List[Dict]:
    # Step 1: Get candidate sections from document store
    all_sections = await self.doc_store.full_text_search(query)
    
    if not all_sections:
        all_sections = await self.doc_store.get_document_sections(SAMPLE_CONTRACT["doc_id"])
    
    # Step 2: Score each section
    scored_sections = []
    for section in all_sections:
        score = self._score_section(section, query, decomposition)
        score += section.get("_text_boost", 0.0)  # Apply text search boost
        section_copy = section.copy()
        section_copy["_relevance_score"] = score
        scored_sections.append(section_copy)
    
    # Step 3: Sort by score (highest first)
    scored_sections.sort(key=lambda x: x["_relevance_score"], reverse=True)
    
    # Step 4: Filter irrelevant (score < 2.0)
    relevant = self._filter_irrelevant(scored_sections, threshold=2.0)
    
    # Step 5: Return the top results (cap at 5, always return at least 1)
    if not relevant:
        relevant = scored_sections[:1]  # At least return something
    else:
        relevant = relevant[:5]  # Cap at 5 sections
    
    return relevant
Location: components.py:114-144

Example: Late Payment Query

query = "What are the late payment penalties?"
decomposition = {
    "intent": "penalty",
    "entities": ["late", "payment", "penalties"],
    "constraints": {},
    "temporals": []
}

Scoring Process

Section: “Late Payment Penalties” (page 8)
  • Intent match (primary): +7.0
  • “late” in title: +1.5
  • “payment” in title: +1.5
  • “penalties” in title: +1.5
  • “late” in content: +1.0
  • “payment” in content: +1.0
  • Total: 13.5
Section: “Payment Terms” (page 3)
  • Intent match (secondary): +5.0
  • “payment” in title: +1.5
  • “payment” in content: +1.0
  • Total: 7.5
Section: “Confidentiality” (page 20)
  • No intent match: 0.0
  • No entity matches: 0.0
  • Total: 0.0 ❌ (filtered)
Note: this walkthrough omits the +0.5 query-term tiebreaker points for brevity; under the full formula, query terms such as “late” and “payment” appearing in content would add slightly to the first two totals.
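The ranking above can be reproduced with a standalone sketch of the documented weights (toy section contents, so the totals differ from the walkthrough; this version also applies the +0.5 query-term tiebreaker):

```python
# Standalone re-implementation of the documented scoring weights
# (mirrors _score_section, not imported from components.py).
INTENT_SECTION_MAP = {"penalty": ["Late Payment Penalties", "Payment Terms"]}

def score_section(title, content, query, intent, entities):
    score = 0.0
    targets = INTENT_SECTION_MAP.get(intent, [])
    if title in targets:
        score += 5.0                       # base intent match
        if title == targets[0]:
            score += 2.0                   # primary-match bonus
    title_lower, content_lower = title.lower(), content.lower()
    for entity in entities:
        if entity in content_lower:
            score += 1.0                   # entity in content
        if entity in title_lower:
            score += 1.5                   # entity in title
    for term in query.lower().split():
        if len(term) > 3 and term in content_lower:
            score += 0.5                   # query-term tiebreaker
    return score

entities = ["late", "payment", "penalties"]
primary = score_section("Late Payment Penalties",
                        "late fees accrue at 1.5% per month",
                        "late payment penalties", "penalty", entities)
secondary = score_section("Payment Terms",
                          "invoices are due within 30 days",
                          "late payment penalties", "penalty", entities)
```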

Final Result

[
  {"title": "Late Payment Penalties", "page_num": 8, "_relevance_score": 13.5},
  {"title": "Payment Terms", "page_num": 3, "_relevance_score": 7.5}
]

Integration with Workflow

# nodes.py:14-20
async def retrieve_node(state: DocMindState) -> DocMindState:
    store = MockDocumentStore()
    retriever = AgenticRetriever(store)
    sections = await retriever.retrieve(state["query"], state["decomposition"])
    state["retrieved_sections"] = sections
    state["node_history"] = state.get("node_history", []) + ["retrieve"]
    return state
Location: nodes.py:14-20

Retry Mechanism

If the LLMJudge detects hallucinations, the workflow retries retrieval:
# workflow.py:23-29
workflow.add_conditional_edges(
    "judge",
    should_retry,
    {
        "retry": "retrieve",  # Try again with same query
        "output": "output"
    }
)
Location: workflow.py:23-29

From README.md:47-48:
“If the judge detects hallucination, the system retries retrieval. Maximum 2 retries to avoid infinite loops.”
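A minimal sketch of what should_retry could look like, assuming the state carries a judge_result and a retry_count (the actual implementation is not shown in this section, and these field names are assumptions):

```python
# Hypothetical should_retry consistent with the README's
# "maximum 2 retries" rule; state field names are assumed.
MAX_RETRIES = 2

def should_retry(state: dict) -> str:
    judge = state.get("judge_result", {})
    retries = state.get("retry_count", 0)
    if judge.get("hallucination_detected") and retries < MAX_RETRIES:
        return "retry"   # loop back to the retrieve node
    return "output"      # proceed to the output node
```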

Document Store Interface

The retriever depends on a document store providing:
class MockDocumentStore:
    async def full_text_search(self, query: str, top_k: int = 5) -> List[Dict]:
        # Returns sections matching query terms
        pass
    
    async def get_document_sections(self, doc_id: str) -> List[Dict]:
        # Returns all sections in document
        pass
Location: mock_data.py:19-31

Each section includes:
  • section_id - Unique identifier
  • title - Section heading
  • page_num - Page number for citation
  • content - Section text
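A minimal in-memory stand-in that satisfies this interface might look like the following (illustrative only; the real mock_data.py differs in its data and details):

```python
import asyncio
from typing import Dict, List

# Illustrative in-memory store matching the documented interface.
class MockDocumentStore:
    def __init__(self):
        self._sections = [
            {"section_id": "s1", "title": "Late Payment Penalties",
             "page_num": 8, "content": "A late fee of 1.5% applies."},
            {"section_id": "s2", "title": "Payment Terms",
             "page_num": 3, "content": "Invoices are due in 30 days."},
        ]

    async def full_text_search(self, query: str, top_k: int = 5) -> List[Dict]:
        # Keep sections where any query term appears in title or content
        terms = query.lower().split()
        hits = [s for s in self._sections
                if any(t in s["title"].lower() or t in s["content"].lower()
                       for t in terms)]
        return hits[:top_k]

    async def get_document_sections(self, doc_id: str) -> List[Dict]:
        return list(self._sections)
```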

Design Trade-offs

Fixed Thresholds vs Adaptive

From README.md:58:
“Why fixed threshold instead of adaptive: simplicity for this scope. In production it would be calibrated with a validation dataset.”

Manual Scoring vs Learning

From README.md:61:
“Manual scoring doesn’t learn from feedback.”
In production, you could:
  • Track user feedback (helpful/not helpful)
  • A/B test different scoring weights
  • Train a ranking model on historical queries

Testing

import asyncio

from components import AgenticRetriever
from mock_data import MockDocumentStore

# Test scoring
retriever = AgenticRetriever(MockDocumentStore())
section = {"title": "Late Payment Penalties", "content": "late fee of 1.5%", "page_num": 8}
decomp = {"intent": "penalty", "entities": ["late", "payment"]}

score = retriever._score_section(section, "late payment", decomp)
assert score >= 7.0  # Should have intent match + entity matches

# Test retrieval (retrieve is async, so drive it with an event loop)
sections = asyncio.run(retriever.retrieve("What are the penalties?", decomp))
assert len(sections) > 0
assert sections[0]["title"] == "Late Payment Penalties"

Next Steps

  • Query Decomposition: understand how queries are parsed
  • LLM Judge: see how responses are validated
