Overview
The cache system is designed specifically for the judicial environment, where 26 departments ask similar questions repeatedly.

Cache Configuration
siaa_proxy.py:61-63
- Maximum number of cached responses. When full, the least recently used entry is evicted.
- Time-to-live in seconds. Entries older than this are considered stale and removed.
- When true, only document queries are cached. Conversational queries (greetings, small talk) are always processed fresh.
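As a sketch, the three settings above might be plain module-level constants. CACHE_MAX_ENTRADAS is referenced later in this document; the other two names are illustrative assumptions:

```python
# Illustrative configuration constants; CACHE_MAX_ENTRADAS appears elsewhere
# in this document, the other two names are assumed for the sketch.
CACHE_MAX_ENTRADAS = 200       # max cached responses before LRU eviction
CACHE_TTL_SEGUNDOS = 3600      # entries older than 1 hour are stale
CACHE_SOLO_DOCUMENTOS = True   # cache document queries only, not small talk
```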
Cache Entry Structure
Each cache entry contains:
siaa_proxy.py:65-66
Complete response text from the AI model
- Source citation with document links (e.g., “📄 Fuente: PSAA16-10476”)
Timestamp when entry was created/updated (Unix time)
Number of times this entry was retrieved (LRU tracking)
Cache Key Generation
The cache key is generated from a normalized version of the query, so that variations of the same question hit the same cache entry.

Normalization Algorithm
Normalization Steps
Equivalent Queries
All these variations produce the same cache key: a7f3c8e9b2d14f56
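A minimal sketch of the normalization and key generation, assuming the steps implied by this document (lowercasing, NFD accent stripping, punctuation removal, whitespace collapsing, then a truncated hash; the exact steps, hash, and function name are assumptions):

```python
import hashlib
import re
import unicodedata

def clave_cache(consulta: str) -> str:
    """Build a cache key from a normalized query (illustrative steps)."""
    texto = consulta.lower().strip()
    # NFD splits accented letters into base letter + combining mark,
    # then the combining marks (category "Mn") are dropped
    texto = unicodedata.normalize("NFD", texto)
    texto = "".join(c for c in texto if unicodedata.category(c) != "Mn")
    texto = re.sub(r"[^\w\s]", "", texto)   # drop punctuation such as ¿ and ?
    texto = re.sub(r"\s+", " ", texto)      # collapse runs of whitespace
    return hashlib.md5(texto.encode("utf-8")).hexdigest()[:16]
```

Under these assumed steps, accent, case, and punctuation variants collapse to one key, e.g. `clave_cache("¿Cuándo debo reportar al SIERJU?")` equals `clave_cache("cuando debo reportar al sierju")`.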
Cache Operations
Cache Get (Retrieval)
- Generate normalized cache key
- Check if key exists in cache
- If exists, verify TTL hasn’t expired
- If valid, move entry to end of OrderedDict (LRU bookkeeping)
- Increment hit counter
- Return response and citation
The LRU mechanism uses Python’s OrderedDict.move_to_end() to track recency. Most recently accessed items move to the end; when evicting, the first item (least recent) is removed.

Cache Set (Storage)
- ❌ Don’t cache empty responses
- ❌ Don’t cache “no encontré esa información” (negative results may change)
- ✅ Update existing entries (refresh timestamp)
- ✅ Evict oldest entry when cache is full
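The get and set rules above can be sketched with Python’s OrderedDict (constant values, field names, and function names are assumptions based on the configuration described earlier):

```python
import time
from collections import OrderedDict

CACHE_MAX_ENTRADAS = 200   # assumed defaults from the configuration section
CACHE_TTL_SEGUNDOS = 3600

cache = OrderedDict()  # clave -> {"respuesta", "fuente", "timestamp", "hits"}

def cache_get(clave: str):
    entrada = cache.get(clave)
    if entrada is None:
        return None                      # miss
    if time.time() - entrada["timestamp"] > CACHE_TTL_SEGUNDOS:
        del cache[clave]                 # stale: TTL expired
        return None
    cache.move_to_end(clave)             # LRU bookkeeping: mark as most recent
    entrada["hits"] += 1
    return entrada["respuesta"], entrada["fuente"]

def cache_set(clave: str, respuesta: str, fuente: str) -> None:
    # never cache empty responses or negative results, which may change
    if not respuesta or "no encontré esa información" in respuesta.lower():
        return
    if clave in cache:
        cache.move_to_end(clave)         # refresh existing entry
    elif len(cache) >= CACHE_MAX_ENTRADAS:
        cache.popitem(last=False)        # evict least recently used
    cache[clave] = {"respuesta": respuesta, "fuente": fuente,
                    "timestamp": time.time(), "hits": 0}
```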
Cache Statistics
When Caching Happens
Conversational vs. Document Queries
The system distinguishes between two query types:

Document Queries
Cached ✅
- “¿Cuándo debo reportar al SIERJU?”
- “¿Qué es el PSAA16?”
- “Consecuencias por no reportar”
Conversational
Not Cached ❌
- “Hola”
- “Buenos días”
- “Gracias”
- “¿Quién eres?”
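The distinction might be made with a simple heuristic like the following sketch; the greeting list and function name are assumptions, and the real classifier in siaa_proxy.py may differ:

```python
import unicodedata

# Assumed greeting list (illustrative, not the production list)
SALUDOS = {"hola", "buenos dias", "buenas tardes", "gracias", "quien eres"}

def es_conversacional(consulta: str) -> bool:
    """Heuristic: short courtesy phrases bypass the cache entirely."""
    texto = consulta.lower().strip("¿?¡! ")
    # normalize accents the same way as the cache key
    texto = unicodedata.normalize("NFD", texto)
    texto = "".join(c for c in texto if unicodedata.category(c) != "Mn")
    return texto in SALUDOS
```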
Cache Check Flow
siaa_proxy.py:1486-1523
Cache Response Delivery
Cached responses are delivered via Server-Sent Events (SSE) with simulated streaming:
siaa_proxy.py:1505-1516
Although the response is pre-computed, it’s sent in chunks to maintain compatibility with the streaming UI. This creates a smooth typing effect even for instant cache hits.
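A minimal sketch of this simulated streaming, assuming SSE `data:` framing; the `delta` field, chunk size, and function name are illustrative:

```python
import json

def stream_cached(respuesta: str, tam_chunk: int = 40):
    """Yield a pre-computed response in small SSE chunks so the UI shows
    the same typing effect as a real model stream."""
    for i in range(0, len(respuesta), tam_chunk):
        chunk = respuesta[i:i + tam_chunk]
        yield f"data: {json.dumps({'delta': chunk})}\n\n"
    yield "data: [DONE]\n\n"
```

A real handler would also insert a short sleep between chunks to pace the typing effect; it is omitted here to keep the sketch testable.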
Performance Impact
Expected Metrics
| Metric | Value | Notes |
|---|---|---|
| Cache Hit Time | ~5 ms | In-memory lookup + streaming delivery |
| Cache Miss Time | ~25-45 s | Document routing + chunk scoring + model inference |
| Hit Rate (Estimated) | 30-40% | 26 departments asking similar questions |
| Speedup Factor | ~8,800x | 5 ms vs 44 s for identical queries |
Real-World Example
Scenario: 26 judicial departments all ask “¿Cuándo debo reportar al SIERJU?”

| Query # | Cache Status | Time | Resource Usage |
|---|---|---|---|
| 1st | Miss | 38s | Full: routing + chunks + model |
| 2nd-26th | Hit | 5ms each | Zero: memory lookup only |
Thread Safety
All cache operations are protected by a lock:
siaa_proxy.py:67-70
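A minimal sketch of lock-protected access, assuming a single threading.Lock guarding the whole OrderedDict (variable and function names are illustrative):

```python
import threading
import time
from collections import OrderedDict

cache = OrderedDict()
cache_lock = threading.Lock()  # guards every read and write of the cache

def cache_set_seguro(clave: str, respuesta: str) -> None:
    with cache_lock:                 # one reader/writer at a time
        cache[clave] = {"respuesta": respuesta, "timestamp": time.time()}

def cache_get_seguro(clave: str):
    with cache_lock:
        entrada = cache.get(clave)
        if entrada is not None:
            cache.move_to_end(clave)  # LRU bookkeeping under the lock
        return entrada
```

Holding one lock for both lookup and move_to_end keeps the LRU ordering consistent even when many department sessions hit the proxy concurrently.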
Cache Management Endpoints
View Cache Statistics
Clear Cache
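The logic behind these two endpoints might look like the following sketch; payload fields, counter names, and function names are assumptions, and the handlers are shown as plain functions rather than tied to a specific web framework:

```python
from collections import OrderedDict

cache = OrderedDict()
stats = {"hits": 0, "misses": 0}  # assumed counters updated on each lookup

def ver_estadisticas() -> dict:
    """Illustrative handler for a stats endpoint: snapshot of cache health."""
    total = stats["hits"] + stats["misses"]
    return {
        "entradas": len(cache),
        "hits": stats["hits"],
        "misses": stats["misses"],
        "hit_rate": round(stats["hits"] / total, 3) if total else 0.0,
    }

def limpiar_cache() -> dict:
    """Illustrative handler for a clear endpoint, e.g. after document updates."""
    n = len(cache)
    cache.clear()
    return {"eliminadas": n}
```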
Cache Status in Headers
Cached responses include a header:

Quality Logging
Cache hits are logged separately for monitoring:
siaa_proxy.py:1496-1500
Best Practices
Cache Size Tuning
Default: 200 entries is appropriate for 20-30 departments.

Increase CACHE_MAX_ENTRADAS if:
- You have >50 departments
- Users ask many unique but frequent questions
- You have >10 GB RAM available

Decrease CACHE_MAX_ENTRADAS if:
- Memory is constrained (<4 GB)
- Hit rate is consistently <20%
- Document updates are very frequent
TTL Configuration
Default: 3600s (1 hour) balances freshness and performance.

Increase TTL to 7200-14400s (2-4 hours) if:
- Documents rarely change
- You want maximum cache efficiency
- Questions are highly repetitive

Decrease TTL if:
- Documents update frequently
- Regulatory content changes often
- You prioritize freshness over speed
Cache Invalidation Strategy
- Automatic: Clear cache after document updates
- Scheduled: Clear cache nightly if documents update daily
Monitoring Cache Health
Track these metrics:
- Hit rate: Should be >30% after warm-up period
- Entry count: Should stay below max (indicates cache isn’t thrashing)
- Avg response time: Cache hits should be <10ms

A low hit rate (<20%) may indicate:
- Questions are too diverse (consider increasing max entries)
- TTL is too short (responses expire before re-use)
- Most queries are conversational (expected — not cached)
Implementation Notes
Why OrderedDict? Python’s OrderedDict maintains insertion order and provides move_to_end() for efficient LRU tracking without a separate linked list.

Why not Redis? For a single-server deployment with <1000 entries, an in-memory LRU cache is simpler and faster than a separate Redis instance. The entire cache fits in <5 MB of RAM.
Why normalize accents? Spanish queries like “información” and “informacion” should hit the same cache entry. NFD normalization removes diacritical marks for consistent hashing.