Natural language queries are inherently ambiguous. “What are the penalties?” could refer to:
Late payment fees
Contractual penalties
Legal sanctions
Performance penalties
Without structure, search becomes unreliable. The QueryDecomposer solves this by extracting explicit intent, entities, constraints, and temporal references before retrieval begins.
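To make the four outputs concrete, here is a minimal sketch of what a decomposed query might look like as a data structure. The class name DecomposedQuery and its field names are assumptions for illustration; the actual QueryDecomposer may return a plain dict or a differently shaped object.

```python
from dataclasses import dataclass, field


@dataclass
class DecomposedQuery:
    """Hypothetical container for the four parts extracted before retrieval."""
    intent: str                                                 # e.g. "penalty"
    entities: list[str] = field(default_factory=list)           # meaningful query terms
    constraints: dict[str, str] = field(default_factory=dict)   # e.g. {"timeframe": "30 days"}
    temporal_refs: list[str] = field(default_factory=list)      # raw time expressions


# Illustrative values for "What are the penalties for late payment within 30 days?"
example = DecomposedQuery(
    intent="penalty",
    entities=["penalties", "late", "payment"],
    constraints={"timeframe": "30 days"},
    temporal_refs=["30 days"],
)
```

A retriever can then filter on example.intent and example.constraints instead of matching against the raw question text.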
Patterns are checked in order. More specific patterns (like “indemnification”) should appear before generic ones (like “payment_terms”) to ensure accurate classification.
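The ordering rule can be sketched as a list of (intent, pattern) pairs scanned from top to bottom, where the first match wins. The pattern table below and the function name extract_intent are assumptions for illustration, not the actual contents of components.py.

```python
import re

# Hypothetical pattern table. Order matters: specific intents come first,
# generic ones such as "payment_terms" come last.
INTENT_PATTERNS = [
    ("indemnification", re.compile(r"\bindemnif\w*|\bhold harmless\b", re.IGNORECASE)),
    ("penalty", re.compile(r"\bpenalt\w*|\blate fees?\b", re.IGNORECASE)),
    ("intellectual_property", re.compile(r"\bintellectual property\b|\bIP\b", re.IGNORECASE)),
    ("payment_terms", re.compile(r"\bpay(?:ment|ments|able)?\b|\binvoice\w*", re.IGNORECASE)),
]


def extract_intent(query: str, default: str = "general") -> str:
    """Return the first intent whose pattern matches the query."""
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(query):
            return intent
    return default


# This query matches both "indemnification" and "payment_terms";
# the more specific intent wins because it is listed first.
print(extract_intent("Is payment owed under the indemnification clause?"))
```

If "payment_terms" were listed first, the same query would be classified as a payment question, which is the misclassification the ordering is meant to prevent.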
stopwords = {"the", "is", "in", "at", "of", "for", "and", "a", "to", "what", "are", "on", "with", "by"}
words = re.findall(r"\b\w+\b", query.lower())
return [w for w in words if w not in stopwords]
Location: components.py:24-27
Extracts meaningful terms for matching:
Splits query into words
Filters common stopwords
Returns lowercase terms for case-insensitive matching
“I used regex instead of embeddings because legal documents use consistent terminology. Regex gives deterministic control and avoids false positives from semantic similarity. ‘Penalty’ and ‘penalize’ might be similar vectors but mean different things in context. Also faster than API calls.”
# Test intent extraction
assert decomposer._extract_intent("What are the penalties?") == "penalty"
assert decomposer._extract_intent("Who owns the IP?") == "intellectual_property"

# Test entity extraction
assert "payment" in decomposer._extract_entities("late payment terms")
assert "the" not in decomposer._extract_entities("What is the penalty?")

# Test constraint extraction
result = decomposer._extract_constraints("within 30 days")
assert result["timeframe"] == "30 days"
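The constraint test above implies that timeframes such as "30 days" are captured under a "timeframe" key. One way that behavior could be implemented with a regex is sketched below. The pattern, the list of trigger words, and the standalone function name extract_constraints are assumptions, not the original method.

```python
import re

# Hypothetical pattern: a trigger word followed by "<number> <unit>".
TIMEFRAME = re.compile(
    r"\b(?:within|in|after|before)\s+(\d+\s+(?:day|week|month|year)s?)\b",
    re.IGNORECASE,
)


def extract_constraints(query: str) -> dict[str, str]:
    """Return a dict of constraints found in the query (empty if none)."""
    constraints: dict[str, str] = {}
    match = TIMEFRAME.search(query)
    if match:
        # Keep only the "<number> <unit>" part, lowercased for consistency.
        constraints["timeframe"] = match.group(1).lower()
    return constraints


print(extract_constraints("What penalties apply within 30 days?"))
```

Returning an empty dict when nothing matches lets callers check for a key without handling None.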