Multi-level routing system with TF-IDF, density indexing, and semantic scoring for precise document selection
SIAA uses a multi-level document retrieval system to find the most relevant documents for each query. The system combines TF-IDF scoring, density indexing, filename matching, and chunk-level scoring to deliver precise, context-aware responses.
Builds an inverted index mapping terms to documents by density (term frequency / total tokens).
siaa_proxy.py:818-829
```python
# Density index over alphanumeric tokens
nuevo_indice = defaultdict(list)
for nombre_doc, doc in todos_los_docs.items():
    total = doc["total_tokens"]
    if total == 0:
        continue
    for termino, freq in doc["token_count"].items():
        if len(termino) >= MIN_LEN_KEYWORD:
            nuevo_indice[termino].append((freq / total, nombre_doc))
for t in nuevo_indice:
    nuevo_indice[t].sort(reverse=True)
```
Example: If “psaa16” appears 15 times in a 500-token document, density = 15/500 = 0.03
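The indexing step above can be sketched end-to-end with toy data (the document names and token counts here are illustrative, not SIAA's real corpus):

```python
from collections import defaultdict

MIN_LEN_KEYWORD = 3  # same threshold as in the configuration constants

# Toy corpus: per-document token counts (illustrative data only)
todos_los_docs = {
    "manual_psaa16.pdf": {"total_tokens": 500,
                          "token_count": {"psaa16": 15, "de": 40, "reporte": 5}},
    "circular_10476.pdf": {"total_tokens": 200,
                           "token_count": {"psaa16": 2, "sancion": 8}},
}

nuevo_indice = defaultdict(list)
for nombre_doc, doc in todos_los_docs.items():
    total = doc["total_tokens"]
    if total == 0:
        continue
    for termino, freq in doc["token_count"].items():
        if len(termino) >= MIN_LEN_KEYWORD:  # "de" is skipped (len < 3)
            nuevo_indice[termino].append((freq / total, nombre_doc))
for t in nuevo_indice:
    nuevo_indice[t].sort(reverse=True)  # highest density first

print(nuevo_indice["psaa16"])
# [(0.03, 'manual_psaa16.pdf'), (0.01, 'circular_10476.pdf')]
```

The manual ranks first for "psaa16" because its density (15/500 = 0.03) beats the circular's (2/200 = 0.01), even though both documents contain the term.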
Extracts tokens from filenames and matches against query terms.
siaa_proxy.py:723-728
```python
def _tokens_nombre_archivo(nombre_clave: str) -> set:
    sin_ext = os.path.splitext(nombre_clave)[0]
    partes = re.split(r'[_\s\-\.]+', sin_ext.lower())
    # [FIX-1] include parts that contain digits (psaa16, 10476)
    return {p for p in partes if len(p) >= 3}
```
```python
scores_combinados = defaultdict(float)
for doc, s in scores_tfidf.items():
    scores_combinados[doc] += s * 2.0   # TF-IDF weight
for doc, s in scores_densidad.items():
    scores_combinados[doc] += s * 1.0   # Density weight
for doc, s in scores_nombre.items():
    scores_combinados[doc] += s * 1.5   # Filename weight
```
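A worked example of the weighted combination (the per-signal scores below are invented for illustration):

```python
from collections import defaultdict

# Illustrative per-signal scores for two documents
scores_tfidf    = {"manual.pdf": 0.8, "circular.pdf": 0.5}
scores_densidad = {"manual.pdf": 0.03}
scores_nombre   = {"circular.pdf": 1.0}

scores_combinados = defaultdict(float)
for doc, s in scores_tfidf.items():
    scores_combinados[doc] += s * 2.0   # TF-IDF weight
for doc, s in scores_densidad.items():
    scores_combinados[doc] += s * 1.0   # Density weight
for doc, s in scores_nombre.items():
    scores_combinados[doc] += s * 1.5   # Filename weight

# manual.pdf:   0.8*2.0 + 0.03*1.0 = 1.63
# circular.pdf: 0.5*2.0 + 1.0*1.5  = 2.50
```

Here the filename match (weight 1.5) lets circular.pdf overtake a document with a higher TF-IDF score, which is the point of keeping the three signals separate.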
Query expansion helps bridge the gap between natural-language questions (“cuándo debo reportar”, "when must I report") and technical document language (“periodicidad quincenal”, "biweekly periodicity").
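A minimal sketch of how such an expansion could work, assuming a hand-maintained synonym map; the map, its entries, and the function name are illustrative, not SIAA's actual expansion table:

```python
# Hypothetical synonym map: natural-language term -> document vocabulary
SINONIMOS = {
    "cuándo": ["periodicidad", "plazo"],
    "reportar": ["reporte", "informar"],
}

def expandir_consulta(palabras: set) -> set:
    """Return the query terms plus any mapped document-language synonyms."""
    expandidas = set(palabras)
    for w in palabras:
        expandidas.update(SINONIMOS.get(w, []))
    return expandidas

print(expandir_consulta({"cuándo", "reportar"}))
# original terms plus "periodicidad", "plazo", "reporte", "informar"
```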
```python
def puntuar_chunk(chunk: dict, palabras: set, pregunta_norm: str,
                  terminos_prio: set, idf_local: dict = None) -> float:
    texto = chunk["texto"].lower()
    puntos = 0.0
    # TF-IDF scoring per word
    for w in palabras:
        count = texto.count(w)
        if count > 0:
            tf = 1.0 + math.log(count)  # log-normalized
            base = 3.0 if w in terminos_prio else 1.0
            if idf_local and w in idf_local:
                base *= idf_local[w]
            puntos += tf * base
    # Exact query match: +15 points
    if pregunta_norm in texto:
        puntos += 15.0
    # Article with degree symbol: +10 points
    if PATRON_ARTICULO_GRADO.search(chunk["texto"]):
        puntos += 10.0
    elif PATRON_ARTICULO_SIMPLE.search(chunk["texto"]):
        puntos += 5.0
    # Numbered list: +4 points
    if re.search(r'^\s*\d+[\.\)]\s+\S', chunk["texto"], re.MULTILINE):
        puntos += 4.0
    # Proximity bonus ("Sharpshooter" strategy)
    if len(palabras) >= 2:
        VENTANA = 150  # chars ≈ 1-2 short sentences
        max_densidad = 0.0
        for i in range(0, max(1, len(texto) - VENTANA), 50):
            v = texto[i:i+VENTANA]
            matches = sum(1 for w in palabras if w in v)
            if matches >= 2:
                d = matches / len(palabras)
                if d > max_densidad:
                    max_densidad = d
        if max_densidad >= 0.90:
            puntos += 20.0   # 90%+ keywords together
        elif max_densidad >= 0.70:
            puntos += 12.0
        elif max_densidad >= 0.50:
            puntos += 6.0
        elif max_densidad >= 0.30:
            puntos += 2.0
    return puntos
```
The proximity scorer uses a sliding window to find regions where query keywords cluster together:
1. Slide a 150-char window across the chunk, moving in 50-char steps (67% overlap between consecutive windows).
2. Count keyword matches in each window and compute density = matches / total_keywords.
3. Award a bonus based on the maximum density found:
   - ≥90% density: +20 points (exact answer)
   - ≥70% density: +12 points (very likely)
   - ≥50% density: +6 points (probable)
   - ≥30% density: +2 points (weak signal)
Why proximity matters: A chunk with “incumplimiento disciplinario sanción reportar” in 120 chars gets density ≈1.0 and max bonus (+20), while the same words scattered across 800 chars gets density ≈0.25 and minimal bonus.
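The contrast can be reproduced by isolating the window loop from `puntuar_chunk` into a standalone helper (the helper name and the two sample texts are illustrative):

```python
def densidad_maxima(texto: str, palabras: set,
                    ventana: int = 150, paso: int = 50) -> float:
    """Max fraction of query keywords co-occurring in any sliding window."""
    texto = texto.lower()
    max_d = 0.0
    for i in range(0, max(1, len(texto) - ventana), paso):
        v = texto[i:i + ventana]
        matches = sum(1 for w in palabras if w in v)
        if matches >= 2:  # single hits earn no proximity bonus
            max_d = max(max_d, matches / len(palabras))
    return max_d

palabras = {"incumplimiento", "sancion", "reportar"}

# All keywords within one short passage -> density 1.0 (+20 bonus tier)
junto = "todo incumplimiento genera una sancion que se debe reportar"
print(densidad_maxima(junto, palabras))   # 1.0

# Same keywords scattered hundreds of chars apart -> no window holds two
disperso = "incumplimiento " + "x" * 400 + " sancion " + "x" * 400 + " reportar"
print(densidad_maxima(disperso, palabras))  # 0.0
```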
```python
MAX_DOCS_CONTEXTO = 2       # Maximum documents per query
TOP_KEYWORDS_POR_DOC = 20   # Top TF-IDF keywords per document
MIN_FREQ_KEYWORD = 2        # Minimum term frequency
MIN_LEN_KEYWORD = 3         # Minimum term length
CHUNK_SIZE = 800            # Maximum chars per chunk
CHUNK_OVERLAP = 300         # Shared chars between chunks
MAX_CHUNKS_CONTEXTO = 3     # Max chunks per document
```
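Taken together, `CHUNK_SIZE` and `CHUNK_OVERLAP` imply a splitter along these lines; this is a sketch under those two constants, not SIAA's actual chunking function:

```python
CHUNK_SIZE = 800     # maximum chars per chunk
CHUNK_OVERLAP = 300  # shared chars between consecutive chunks

def dividir_en_chunks(texto: str) -> list:
    """Split text into CHUNK_SIZE-char windows sharing CHUNK_OVERLAP chars."""
    paso = CHUNK_SIZE - CHUNK_OVERLAP  # advance 500 chars per step
    chunks = []
    for i in range(0, len(texto), paso):
        chunks.append(texto[i:i + CHUNK_SIZE])
        if i + CHUNK_SIZE >= len(texto):
            break  # last window already reaches the end of the text
    return chunks

chunks = dividir_en_chunks("x" * 2000)
# windows start at 0, 500, 1000, 1500 -> 4 chunks;
# each consecutive pair shares its last/first 300 chars
```

The 300-char overlap keeps sentences that straddle a chunk boundary fully visible to the chunk scorer in at least one of the two neighboring chunks.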