
The Problem

Natural language queries are inherently ambiguous. “What are the penalties?” could refer to:
  • Late payment fees
  • Contractual penalties
  • Legal sanctions
  • Performance penalties
Without structure, search becomes unreliable. The QueryDecomposer solves this by extracting explicit intent, entities, constraints, and temporal references before retrieval begins.

Implementation

Location: components.py:8-63

For the query "What are the late payment penalties?", the decomposer returns a structured dictionary:
{
  "intent": "penalty",
  "entities": ["late", "payment", "penalties"],
  "constraints": {},
  "temporals": []
}

Intent Extraction

Method: _extract_intent(query: str) -> str

Uses a priority-ordered intent map with regex patterns:
intent_map = [
    ("penalty", r"\blate fee\b|\boverdue\b|\bpenalt(y|ies)\b"),
    ("payment_terms", r"\bpay\b|\binvoice\b|\bpayment\b"),
    ("intellectual_property", r"\bowned by\b|\blicense\b|\bintellectual property\b|\bIP\b|\binfring"),
    ("indemnification", r"\bindemnif(y|ication)\b|\bthird-party claims\b"),
    ("termination", r"\bterminate\b|\btermination\b|\bwritten notice\b"),
    ("confidentiality", r"\bconfidential\b|\bproprietary information\b"),
    ("scope_of_services", r"\bshall provide\b|\bservices\b"),
]
Location: components.py:10-18

Pattern Design

Each pattern uses:
  • Word boundaries (\b) - Ensures whole-word matches, so \bpay\b matches “pay” but not the “pay” inside “repayment”
  • Alternation (|) - Captures variations (“penalty” or “penalties”)
  • Case-insensitive matching (re.IGNORECASE) - Handles “Penalty”, “PENALTY”, “penalty”
Patterns are checked in order. More specific patterns (like “indemnification”) should appear before generic ones (like “payment_terms”) to ensure accurate classification.
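The matching loop itself is not shown above; a minimal sketch consistent with the map (abbreviated here, and the "general" fallback is an assumption — the real default may differ):

```python
import re

# Abbreviated copy of the intent map; list order encodes priority.
INTENT_MAP = [
    ("penalty", r"\blate fee\b|\boverdue\b|\bpenalt(y|ies)\b"),
    ("payment_terms", r"\bpay\b|\binvoice\b|\bpayment\b"),
    ("termination", r"\bterminate\b|\btermination\b|\bwritten notice\b"),
]

def extract_intent(query: str) -> str:
    # The first matching pattern wins, which is why specific
    # patterns must precede generic ones in the list.
    for intent, pattern in INTENT_MAP:
        if re.search(pattern, query, re.IGNORECASE):
            return intent
    return "general"  # assumed fallback when nothing matches
```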

Entity Extraction

Method: _extract_entities(query: str) -> List[str]
stopwords = {"the", "is", "in", "at", "of", "for", "and", "a", "to", "what", "are", "on", "with", "by"}
words = re.findall(r"\b\w+\b", query.lower())
return [w for w in words if w not in stopwords]
Location: components.py:24-27

Extracts meaningful terms for matching:
  • Splits query into words
  • Filters common stopwords
  • Returns lowercase terms for case-insensitive matching

Example

Query: "What are the late payment penalties?"
Entities: ["late", "payment", "penalties"]
These entities are later used by the AgenticRetriever for scoring sections.
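The method is self-contained enough to run directly; pasted into a script (stopword set copied verbatim from above) it reproduces the example:

```python
import re

STOPWORDS = {"the", "is", "in", "at", "of", "for", "and", "a", "to",
             "what", "are", "on", "with", "by"}

def extract_entities(query: str) -> list[str]:
    # Lowercase, tokenize on word characters, drop stopwords.
    words = re.findall(r"\b\w+\b", query.lower())
    return [w for w in words if w not in STOPWORDS]
```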

Constraint Extraction

Method: _extract_constraints(query: str) -> Dict

Timeframe Patterns

time_match = re.search(r"\bwithin (\d+ (days|weeks|months|years))\b", query, re.IGNORECASE)
if time_match:
    constraints["timeframe"] = time_match.group(1)
Location: components.py:32-33

Examples:
  • “within 30 days” → {"timeframe": "30 days"}
  • “within 2 months” → {"timeframe": "2 months"}

Percentage Patterns

percentage_match = re.search(r"\b(\d+(\.\d+)?)%", query)
if percentage_match:
    constraints["percentage"] = percentage_match.group(1)
Location: components.py:34-36

Examples:
  • “1.5% late fee” → {"percentage": "1.5"}
  • “5% penalty” → {"percentage": "5"}
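Putting the two patterns together, the whole method is roughly (a sketch, not the verbatim source):

```python
import re

def extract_constraints(query: str) -> dict:
    constraints = {}
    # "within 30 days" -> {"timeframe": "30 days"}
    time_match = re.search(r"\bwithin (\d+ (days|weeks|months|years))\b",
                           query, re.IGNORECASE)
    if time_match:
        constraints["timeframe"] = time_match.group(1)
    # "1.5% late fee" -> {"percentage": "1.5"} (the '%' itself is dropped)
    pct_match = re.search(r"\b(\d+(\.\d+)?)%", query)
    if pct_match:
        constraints["percentage"] = pct_match.group(1)
    return constraints
```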

Temporal Extraction

Method: _extract_temporals(query: str) -> List[str]
date_match = re.search(r"\b(on|by|before|after) (\w+ \d{1,2}(, \d{4})?)\b", query, re.IGNORECASE)
if date_match:
    temporals.append(date_match.group(2))
Location: components.py:41-44

Examples:
  • “by December 15, 2024” → ["December 15, 2024"]
  • “before March 1” → ["March 1"]
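As a runnable sketch of the same logic:

```python
import re

def extract_temporals(query: str) -> list[str]:
    temporals = []
    # Matches "on/by/before/after <Month> <day>[, <year>]"
    date_match = re.search(r"\b(on|by|before|after) (\w+ \d{1,2}(, \d{4})?)\b",
                           query, re.IGNORECASE)
    if date_match:
        temporals.append(date_match.group(2))
    return temporals
```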

Design Rationale

Why Regex Instead of Embeddings?

From README.md:9:
“I used regex instead of embeddings because legal documents use consistent terminology. Regex gives deterministic control and avoids false positives from semantic similarity. ‘Penalty’ and ‘penalize’ might be similar vectors but mean different things in context. Also faster than API calls.”

Trade-offs

From README.md:11:
“Trade-off: if vocabulary grows significantly, this would need a trained classifier.”
Advantages:
  • ✅ Deterministic and testable
  • ✅ No API latency (200-500ms saved)
  • ✅ No token costs
  • ✅ Exact control over matching logic
Disadvantages:
  • ❌ Doesn’t generalize beyond defined patterns
  • ❌ Requires manual updates for new vocabulary
  • ❌ Can’t handle paraphrasing or synonyms
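The last limitation is easy to demonstrate with the penalty pattern from the intent map (both queries are illustrative):

```python
import re

PENALTY_PATTERN = r"\blate fee\b|\boverdue\b|\bpenalt(y|ies)\b"

def mentions_penalty(query: str) -> bool:
    return re.search(PENALTY_PATTERN, query, re.IGNORECASE) is not None

# An in-vocabulary query matches, but a paraphrase that uses
# none of the pattern's vocabulary is silently missed.
```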

Usage Example

import asyncio

from components import QueryDecomposer

async def main():
    decomposer = QueryDecomposer()
    result = await decomposer.decompose("What are the late payment penalties?")
    print(result)
    # Output:
    # {
    #   "intent": "penalty",
    #   "entities": ["late", "payment", "penalties"],
    #   "constraints": {},
    #   "temporals": []
    # }

asyncio.run(main())

Integration with Workflow

The decomposition result flows to the AgenticRetriever:
# nodes.py:7-12
async def decompose_node(state: DocMindState) -> DocMindState:
    decomposer = QueryDecomposer()
    decomposition = await decomposer.decompose(state["query"])
    state["decomposition"] = decomposition
    state["node_history"] = state.get("node_history", []) + ["decompose"]
    return state
Location: nodes.py:7-12

The retriever uses:
  • intent - Maps to relevant document sections
  • entities - Scores content matches
  • constraints - Validates numerical claims
  • temporals - Validates temporal claims
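A minimal sketch of the intent-to-section lookup on the retriever side (the map contents below are invented for illustration; only the INTENT_SECTION_MAP name comes from the codebase):

```python
# Hypothetical contents; the real map lives on AgenticRetriever.
INTENT_SECTION_MAP = {
    "penalty": ["Late Payments", "Fees and Penalties"],
    "termination": ["Term and Termination"],
}

def candidate_sections(decomposition: dict) -> list[str]:
    # Unknown intents yield no targeted sections in this sketch.
    return INTENT_SECTION_MAP.get(decomposition["intent"], [])
```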

Testing

Each extraction method is independently testable:
# Test intent extraction
assert decomposer._extract_intent("What are the penalties?") == "penalty"
assert decomposer._extract_intent("Who owns the IP?") == "intellectual_property"

# Test entity extraction
assert "payment" in decomposer._extract_entities("late payment terms")
assert "the" not in decomposer._extract_entities("What is the penalty?")

# Test constraint extraction
result = decomposer._extract_constraints("within 30 days")
assert result["timeframe"] == "30 days"

Extending the Decomposer

To add new intents:
  1. Add pattern to intent_map in priority order
  2. Update AgenticRetriever.INTENT_SECTION_MAP
  3. Add test cases
# Add warranty intent
("warranty", r"\bwarrant(y|ies)\b|\bguarantee\b|\brepresentation\b"),
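A quick sanity check for the new pattern before wiring it into the map (example queries invented):

```python
import re

WARRANTY_PATTERN = r"\bwarrant(y|ies)\b|\bguarantee\b|\brepresentation\b"

def is_warranty_query(query: str) -> bool:
    return re.search(WARRANTY_PATTERN, query, re.IGNORECASE) is not None
```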

Next Steps

  • Agentic Retrieval - See how decomposition enables strategic retrieval
  • Architecture - Understand the full system design
