SIAA implements a thread-safe LRU (Least Recently Used) cache that stores complete responses to frequent document queries. The cache can deliver responses in ~5ms compared to 44 seconds for uncached queries—an 8,800x speedup.

Overview

The cache system is designed specifically for the judicial environment where 26 departments ask similar questions repeatedly.

Cache Configuration

siaa_proxy.py:61-63
CACHE_MAX_ENTRADAS = 200    # Maximum stored responses
CACHE_TTL_SEGUNDOS = 3600   # 1 hour — lifetime per entry
CACHE_SOLO_DOC     = True   # Only cache document queries (not greetings)
CACHE_MAX_ENTRADAS (int, default: 200)
Maximum number of cached responses. When full, the least recently used entry is evicted.

CACHE_TTL_SEGUNDOS (int, default: 3600)
Time-to-live in seconds. Entries older than this are considered stale and removed.

CACHE_SOLO_DOC (bool, default: true)
When true, only document queries are cached. Conversational queries (greetings, small talk) are always processed fresh.

Cache Entry Structure

Each cache entry contains:
siaa_proxy.py:65-66
# Structure of each cache entry:
# { "respuesta": str, "cita": str, "ts": float, "hits": int }

respuesta (string)
Complete response text from the AI model.

cita (string)
Source citation with document links (e.g., "📄 Fuente: PSAA16-10476").

ts (float)
Unix timestamp of when the entry was created or last updated.

hits (int)
Number of times this entry has been retrieved (a usage counter; recency itself is tracked by the entry's position in the OrderedDict).
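For readers who prefer explicit typing, the entry shape can also be written as a TypedDict. This is only a sketch for illustration; siaa_proxy.py itself uses plain dicts, and the example values below are invented:

```python
from typing import TypedDict

class EntradaCache(TypedDict):
    respuesta: str   # complete response text
    cita: str        # source citation
    ts: float        # Unix timestamp of creation/last update
    hits: int        # retrieval counter

entrada: EntradaCache = {
    "respuesta": "Debe reportar antes del quinto día hábil...",
    "cita": "📄 Fuente: PSAA16-10476",
    "ts": 1740000000.0,
    "hits": 0,
}
print(sorted(entrada))   # ['cita', 'hits', 'respuesta', 'ts']
```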

Cache Key Generation

The cache key is generated from a normalized version of the query to ensure variations of the same question hit the cache.

Normalization Algorithm

import hashlib
import re
import unicodedata

def _clave_cache(texto: str) -> str:
    """
    Generates a normalized cache key, insensitive to accents, punctuation, and case.
    "¿Cuándo debo reportar?" == "cuando debo reportar" == "CUANDO DEBO REPORTAR"
    """
    t = texto.lower()
    t = re.sub(r'[^\w\s]', '', t)  # Remove punctuation
    # Remove accents: "cuándo" → "cuando", "información" → "informacion"
    t = ''.join(c for c in unicodedata.normalize('NFD', t)
                if unicodedata.category(c) != 'Mn')
    t = re.sub(r'\s+', ' ', t).strip()
    return hashlib.sha256(t.encode()).hexdigest()[:16]

Normalization Steps

1. Lowercase: "CUÁNDO" → "cuándo"
2. Remove punctuation: "¿Cuándo debo reportar?" → "Cuándo debo reportar"
3. Remove accents: "cuándo" → "cuando", "información" → "informacion"
4. Normalize whitespace: "cuando  debo reportar" → "cuando debo reportar"
5. SHA-256 hash: "cuando debo reportar" → "a7f3c8e9b2d14f56" (first 16 hex characters)

Equivalent Queries

All these variations produce the same cache key:
  • ¿Cuándo debo reportar?
  • CUÁNDO DEBO REPORTAR
  • cuando debo reportar
  • ¿¿¿Cuándo... debo reportar???
Cache key: a7f3c8e9b2d14f56
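The equivalence can be verified with a self-contained copy of the normalization function (reproduced from the listing above; the exact 16-character digest depends on the input, so only key equality is checked here):

```python
import hashlib
import re
import unicodedata

def _clave_cache(texto: str) -> str:
    """Normalized cache key: case-, accent- and punctuation-insensitive."""
    t = texto.lower()
    t = re.sub(r'[^\w\s]', '', t)                      # strip punctuation
    t = ''.join(c for c in unicodedata.normalize('NFD', t)
                if unicodedata.category(c) != 'Mn')    # strip accents
    t = re.sub(r'\s+', ' ', t).strip()                 # collapse whitespace
    return hashlib.sha256(t.encode()).hexdigest()[:16]

variantes = [
    "¿Cuándo debo reportar?",
    "CUÁNDO DEBO REPORTAR",
    "cuando debo reportar",
    "¿¿¿Cuándo... debo reportar???",
]
claves = {_clave_cache(v) for v in variantes}
print(len(claves))   # 1: all four variants share a single key
```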

Cache Operations

Cache Get (Retrieval)

def cache_get(pregunta: str) -> dict | None:
    """
    Busca la pregunta en el caché.

    Returns:
        dict con {respuesta, cita} si hay hit válido, None si miss o expirado.
    """
    global _cache_hits, _cache_misses
    clave = _clave_cache(pregunta)

    with _cache_lock:
        if clave not in _cache_respuestas:
            _cache_misses += 1
            return None

        entrada = _cache_respuestas[clave]

        # Check the TTL
        if time.time() - entrada["ts"] > CACHE_TTL_SEGUNDOS:
            del _cache_respuestas[clave]
            _cache_misses += 1
            return None

        # HIT: move to the end (LRU: most recently used)
        _cache_respuestas.move_to_end(clave)
        entrada["hits"] += 1
        _cache_hits += 1
        return {"respuesta": entrada["respuesta"], "cita": entrada["cita"]}
Behavior:
  1. Generate normalized cache key
  2. Check if key exists in cache
  3. If exists, verify TTL hasn’t expired
  4. If valid, move entry to end of OrderedDict (LRU bookkeeping)
  5. Increment hit counter
  6. Return response and citation
The LRU mechanism uses Python’s OrderedDict.move_to_end() to track recency. Most recently accessed items move to the end; when evicting, the first item (least recent) is removed.
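The recency mechanics can be seen in isolation with a minimal OrderedDict-based LRU (an illustrative sketch only; the real cache adds TTL checks, hit counters, and locking):

```python
from collections import OrderedDict

MAX_ENTRADAS = 3
cache = OrderedDict()

def lru_set(clave, valor):
    if clave in cache:
        cache.move_to_end(clave)       # refresh recency on update
    elif len(cache) >= MAX_ENTRADAS:
        cache.popitem(last=False)      # evict least recently used (front)
    cache[clave] = valor

def lru_get(clave):
    if clave not in cache:
        return None
    cache.move_to_end(clave)           # mark as most recently used
    return cache[clave]

lru_set("a", 1); lru_set("b", 2); lru_set("c", 3)
lru_get("a")                           # "a" becomes most recent
lru_set("d", 4)                        # evicts "b", the least recent
print(list(cache))                     # ['c', 'a', 'd']
```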

Cache Set (Storage)

def cache_set(pregunta: str, respuesta: str, cita: str):
    """
    Guarda una respuesta en el caché.
    Si el caché está lleno, desaloja la entrada menos usada (la del frente).
    """
    if not respuesta.strip():
        return  # Don't cache empty responses

    # Don't cache "no encontré" responses: negative results may change
    if "no encontré esa información" in respuesta.lower():
        return

    clave = _clave_cache(pregunta)
    with _cache_lock:
        # If the key already exists, update it in place
        if clave in _cache_respuestas:
            _cache_respuestas.move_to_end(clave)
            _cache_respuestas[clave].update({"respuesta": respuesta, "cita": cita, "ts": time.time()})
            return

        # If full, evict the oldest entry (front of the OrderedDict)
        while len(_cache_respuestas) >= CACHE_MAX_ENTRADAS:
            _cache_respuestas.popitem(last=False)

        _cache_respuestas[clave] = {
            "respuesta": respuesta,
            "cita":      cita,
            "ts":        time.time(),
            "hits":      0,
        }
Storage Rules:
  • ❌ Don’t cache empty responses
  • ❌ Don’t cache “no encontré esa información” (negative results may change)
  • ✅ Update existing entries (refresh timestamp)
  • ✅ Evict oldest entry when cache is full
Negative results (“no encontré”) are not cached because they may become outdated when documents are updated. A query that returns no results today might have a valid answer tomorrow.
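The two "don't cache" guards can be expressed as a standalone predicate. The helper name `debe_cachear` is hypothetical (introduced here for illustration; siaa_proxy.py inlines these checks in `cache_set`):

```python
def debe_cachear(respuesta: str) -> bool:
    """Sketch of the storage guards: skip empty and negative responses."""
    if not respuesta.strip():
        return False                     # empty response
    if "no encontré esa información" in respuesta.lower():
        return False                     # negative result, may change later
    return True

print(debe_cachear(""))                                                 # False
print(debe_cachear("No encontré esa información en los documentos."))   # False
print(debe_cachear("Debe reportar antes del quinto día hábil."))        # True
```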

Cache Statistics

def cache_stats() -> dict:
    """Estadísticas del caché para el endpoint /siaa/status."""
    with _cache_lock:
        total   = _cache_hits + _cache_misses
        hit_rate = round(_cache_hits / total * 100, 1) if total > 0 else 0
        return {
            "entradas":  len(_cache_respuestas),
            "max":       CACHE_MAX_ENTRADAS,
            "hits":      _cache_hits,
            "misses":    _cache_misses,
            "hit_rate":  f"{hit_rate}%",
            "ttl_seg":   CACHE_TTL_SEGUNDOS,
        }

When Caching Happens

Conversational vs. Document Queries

The system distinguishes between two query types:
def es_conversacion_general(texto: str) -> bool:
    t = texto.lower().strip()
    # If contains technical/judicial term → ALWAYS document query
    if any(term in t for term in TERMINOS_SIEMPRE_DOCUMENTAL):
        return False
    # Ultra-short phrases are greetings
    if len(t) < 8:
        return True
    return any(p in t for p in PATRONES_CONVERSACION)

Document Queries (cached)
  • "¿Cuándo debo reportar al SIERJU?"
  • "¿Qué es el PSAA16?"
  • "Consecuencias por no reportar"
These are deterministic and benefit from caching.

Conversational Queries (not cached)
  • "Hola"
  • "Buenos días"
  • "Gracias"
  • "¿Quién eres?"
These are context-dependent and change each time.
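The classifier above depends on two term lists defined elsewhere in siaa_proxy.py. A self-contained sketch with illustrative lists (the real lists are longer and are not shown in this page):

```python
# Illustrative term lists; the real ones in siaa_proxy.py are longer.
TERMINOS_SIEMPRE_DOCUMENTAL = ["sierju", "psaa", "reportar", "acuerdo"]
PATRONES_CONVERSACION = ["hola", "buenos días", "gracias", "quién eres"]

def es_conversacion_general(texto: str) -> bool:
    t = texto.lower().strip()
    # A technical/judicial term always forces the document path
    if any(term in t for term in TERMINOS_SIEMPRE_DOCUMENTAL):
        return False
    # Ultra-short phrases are treated as greetings
    if len(t) < 8:
        return True
    return any(p in t for p in PATRONES_CONVERSACION)

print(es_conversacion_general("Hola"))                              # True (short greeting)
print(es_conversacion_general("¿Cuándo debo reportar al SIERJU?"))  # False (document query)
```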

Cache Check Flow

siaa_proxy.py:1486-1523
# Only for document queries
if not es_conv and CACHE_SOLO_DOC:
    hit = cache_get(ultima_pregunta)
    if hit:
        print(
            f"[CACHÉ HIT] pregunta={ultima_pregunta[:50]!r} "
            f"stats={cache_stats()}",
            flush=True
        )
        registrar_consulta(
            tipo="DOC", pregunta=ultima_pregunta,
            respuesta=hit["respuesta"], docs=[],
            ctx_chars=0, tiempo_seg=0.0, cache_hit=True
        )
        # Return cached response immediately
        return Response(
            stream_with_context(_stream_cached()),
            content_type="text/event-stream",
            headers={"X-Cache": "HIT"}
        )

Cache Response Delivery

Cached responses are delivered via Server-Sent Events (SSE) with simulated streaming:
siaa_proxy.py:1505-1516
def _stream_cached():
    # Send response token by token to simulate streaming
    chunk_size = 40  # chars per "token"
    for i in range(0, len(respuesta_cached), chunk_size):
        trozo = respuesta_cached[i:i+chunk_size]
        safe  = json.dumps(trozo)[1:-1]
        yield f'data: {{"choices":[{{"delta":{{"content":"{safe}"}}}}]}}\n\n'
    if cita_cached:
        safe_cita = json.dumps(cita_cached)[1:-1]
        yield f'data: {{"choices":[{{"delta":{{"content":"{safe_cita}"}}}}]}}\n\n'
    yield "data: [DONE]\n\n"
Although the response is pre-computed, it’s sent in chunks to maintain compatibility with the streaming UI. This creates a smooth typing effect even for instant cache hits.
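The framing can be exercised in isolation. This sketch rebuilds the SSE data lines for a short response (a standalone re-implementation of the chunking loop above, with a smaller chunk size for readability; the `json.dumps(...)[1:-1]` trick JSON-escapes the chunk and drops the surrounding quotes so it can be spliced into a hand-built frame):

```python
import json

def frames_sse(respuesta: str, chunk_size: int = 10):
    """Yield OpenAI-style SSE data lines for a pre-computed response."""
    for i in range(0, len(respuesta), chunk_size):
        trozo = respuesta[i:i + chunk_size]
        safe = json.dumps(trozo)[1:-1]   # JSON-escape, drop surrounding quotes
        yield f'data: {{"choices":[{{"delta":{{"content":"{safe}"}}}}]}}\n\n'
    yield "data: [DONE]\n\n"

lineas = list(frames_sse('Debe reportar "pronto".'))
print(len(lineas))   # 4: three content chunks + the [DONE] terminator
print(json.loads(lineas[0][6:].strip())["choices"][0]["delta"]["content"])  # Debe repor
```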

Performance Impact

Expected Metrics

  • Cache hit time: ~5 milliseconds (in-memory lookup + streaming delivery)
  • Cache miss time: ~25-45 seconds (document routing + chunk scoring + model inference)
  • Hit rate (estimated): 30-40% (26 departments asking similar questions)
  • Speedup factor: ~8,800x (5 ms vs 44 s for identical queries)

Real-World Example

Scenario: 26 judicial departments all ask “¿Cuándo debo reportar al SIERJU?”
Query #     Cache Status   Time        Resource Usage
1st         Miss           38 s        Full: routing + chunks + model
2nd-26th    Hit            ~5 ms each  Zero: memory lookup only
Total time saved: 25 queries × 38s = 950 seconds (15.8 minutes)
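The savings figure is simple arithmetic; a quick check (treating the ~5 ms hit time as negligible):

```python
consultas_repetidas = 25   # queries 2 through 26 hit the cache
tiempo_sin_cache = 38      # seconds per uncached query

ahorro_seg = consultas_repetidas * tiempo_sin_cache
print(ahorro_seg, "s =", round(ahorro_seg / 60, 1), "min")   # 950 s = 15.8 min
```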

Thread Safety

All cache operations are protected by a lock:
siaa_proxy.py:67-70
_cache_respuestas = OrderedDict()
_cache_lock       = threading.Lock()
_cache_hits       = 0
_cache_misses     = 0
This is essential in a multi-threaded environment serving up to 16 concurrent requests.
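The locking pattern can be demonstrated with a sketch (the worker code here is hypothetical, not from siaa_proxy.py): because the check-evict-insert sequence runs entirely under the lock, the size bound holds even with several writers racing:

```python
import threading
from collections import OrderedDict

cache = OrderedDict()
lock = threading.Lock()
MAX_ENTRADAS = 200

def set_con_lock(clave, valor):
    # The lock makes the check-evict-insert sequence atomic
    with lock:
        while len(cache) >= MAX_ENTRADAS:
            cache.popitem(last=False)
        cache[clave] = valor

def trabajador(n):
    for i in range(500):
        set_con_lock(f"q{(n * 500 + i) % 300}", i)

hilos = [threading.Thread(target=trabajador, args=(n,)) for n in range(8)]
for h in hilos:
    h.start()
for h in hilos:
    h.join()
print(len(cache) <= MAX_ENTRADAS)   # True even under concurrent writers
```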

Cache Management Endpoints

View Cache Statistics

curl http://localhost:5000/siaa/cache
Response:
{
  "entradas": 47,
  "max": 200,
  "hits": 152,
  "misses": 89,
  "hit_rate": "63.1%",
  "ttl_seg": 3600
}

Clear Cache

curl -X DELETE http://localhost:5000/siaa/cache
Response:
{
  "vaciado": true,
  "mensaje": "Caché limpiado correctamente"
}
Clear the cache after updating documents to ensure users don’t receive outdated cached responses.

Cache Status in Headers

Cached responses include a header:
HTTP/1.1 200 OK
Content-Type: text/event-stream
X-Cache: HIT

Quality Logging

Cache hits are logged separately for monitoring:
siaa_proxy.py:1496-1500
registrar_consulta(
    tipo="DOC", pregunta=ultima_pregunta,
    respuesta=hit["respuesta"], docs=[],
    ctx_chars=0, tiempo_seg=0.0, cache_hit=True
)
Log entry:
{
  "ts": "2026-03-08T14:23:45",
  "tipo": "CACHE_HIT",
  "alerta": "OK",
  "pregunta": "cuando debo reportar al sierju",
  "respuesta": "Debe reportar antes del quinto día hábil...",
  "docs": [],
  "ctx_chars": 0,
  "tiempo_s": 0.0
}

Best Practices

Cache Size (CACHE_MAX_ENTRADAS)

Default: 200 entries is appropriate for 20-30 departments.
Increase CACHE_MAX_ENTRADAS if:
  • You have >50 departments
  • Users ask many unique but frequent questions
  • You have >10 GB RAM available
Decrease it if:
  • Memory is constrained (<4 GB)
  • The hit rate is consistently <20%
  • Document updates are very frequent
Memory estimate: ~5 KB per entry → 200 entries ≈ 1 MB

TTL (CACHE_TTL_SEGUNDOS)

Default: 3600s (1 hour) balances freshness and performance.
Increase the TTL to 7200-14400s (2-4 hours) if:
  • Documents rarely change
  • You want maximum cache efficiency
  • Questions are highly repetitive
Decrease the TTL to 1800s (30 min) if:
  • Documents update frequently
  • Regulatory content changes often
  • You prioritize freshness over speed

Cache Invalidation

After updating documents, clear the cache:
# After updating documents
python3 convertidor.py
curl -X DELETE http://localhost:5000/siaa/cache
curl http://localhost:5000/siaa/recargar
Scheduled: clear the cache nightly if documents update daily
# Cron job: daily at 2 AM
0 2 * * * curl -X DELETE http://localhost:5000/siaa/cache

Monitoring

Track these metrics:
  • Hit rate: should be >30% after a warm-up period
  • Entry count: should stay below the maximum (otherwise the cache is thrashing)
  • Average response time: cache hits should take <10ms
# Monitor continuously
watch -n 5 'curl -s http://localhost:5000/siaa/cache | jq .'
A low hit rate (<20%) may indicate that:
  • Questions are too diverse (consider increasing the maximum entries)
  • The TTL is too short (responses expire before re-use)
  • Most queries are conversational (expected: conversational queries are never cached)

Implementation Notes

Why OrderedDict? Python’s OrderedDict maintains insertion order and provides move_to_end() for efficient LRU tracking without a separate linked list.
Why not Redis? For a single-server deployment with <1000 entries, an in-memory LRU cache is simpler and faster than a separate Redis instance. The entire cache fits in <5 MB of RAM.
Why normalize accents? Spanish queries like “información” and “informacion” should hit the same cache entry. NFD normalization removes diacritical marks for consistent hashing.
