Overview
SIAA implements a high-performance LRU (Least Recently Used) cache that stores responses to document-based queries. Cache hits return responses in ~5 ms, compared to ~44 s without the cache: an 8,800x speedup.

Cache Operations

The cache is managed through the /siaa/cache endpoint:
Get Cache Statistics

Retrieve current cache metrics:

- Current number of cached responses (entradas). Maximum is CACHE_MAX_ENTRADAS (default: 200).
- Maximum cache capacity (configured via CACHE_MAX_ENTRADAS).
- Total number of queries served from cache (hits) since server start. Cumulative counter.
- Total number of queries that required full AI processing (misses). Cumulative counter.
- Percentage of queries served from cache (hit_rate): hits / (hits + misses) * 100.
- Time-to-live for cache entries, in seconds (configured via CACHE_TTL_SEGUNDOS).

Clear Cache
Delete all cached responses.

Cache Key Generation

How Keys Are Created

Cache keys are generated from normalized questions using the _clave_cache() function:
Normalization Process
- Convert to lowercase: "REPORTAR" → "reportar"
- Remove punctuation: "¿Cuándo?" → "Cuándo"
- Remove accents: "Cuándo información" → "Cuando informacion"
- Normalize whitespace: multiple spaces → single space
- Hash with SHA256: take the first 16 characters
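The steps above can be sketched in Python. This is an illustrative reimplementation; SIAA's actual _clave_cache() may differ in details such as exactly which punctuation it strips:

```python
import hashlib
import string
import unicodedata

def clave_cache(pregunta: str) -> str:
    # 1. Convert to lowercase
    texto = pregunta.lower()
    # 2. Remove punctuation (string.punctuation is ASCII-only, so the
    #    Spanish inverted marks are added explicitly)
    texto = "".join(c for c in texto if c not in string.punctuation + "¿¡")
    # 3. Remove accents: decompose, then drop combining marks
    texto = unicodedata.normalize("NFD", texto)
    texto = "".join(c for c in texto if not unicodedata.combining(c))
    # 4. Collapse runs of whitespace into single spaces
    texto = " ".join(texto.split())
    # 5. SHA256, keeping the first 16 characters of the hex digest
    return hashlib.sha256(texto.encode("utf-8")).hexdigest()[:16]
```

With this sketch, "¿Cuándo reportar?" and "cuando  REPORTAR" normalize to the same string and therefore yield the same 16-character key.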
Equivalent Questions
Variations that differ only in case, accents, punctuation, or extra whitespace produce the same cache key. This maximizes cache hits for semantically identical questions.
Cache Entry Structure
Internal Data Model
Each cache entry contains:

Example Entry
Entry Lifecycle
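As a sketch of the lifecycle, assuming each entry records its response text, its tipo, and a creation timestamp used for TTL expiry (the field names here are illustrative, not SIAA's actual internals):

```python
import time
from dataclasses import dataclass, field

@dataclass
class EntradaCache:
    respuesta: str                 # cached response text
    tipo: str                      # "DOC" responses are the cacheable kind
    creado_en: float = field(default_factory=time.time)

    def expirada(self, ttl_segundos: int) -> bool:
        # An entry expires once it is older than the configured TTL
        # (CACHE_TTL_SEGUNDOS, default 3600)
        return (time.time() - self.creado_en) > ttl_segundos
```

An entry created ~67 minutes ago is expired under the default 3600 s TTL but still valid under a 7200 s one.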
LRU Eviction Policy
How LRU Works
SIAA uses Python's OrderedDict to implement LRU:
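A minimal OrderedDict-based LRU in the spirit described here (an illustrative sketch, not SIAA's actual class; the method names are assumptions):

```python
from collections import OrderedDict

class CacheLRU:
    def __init__(self, max_entradas: int = 200):
        self.max_entradas = max_entradas
        self.datos = OrderedDict()

    def obtener(self, clave):
        if clave not in self.datos:
            return None                      # cache miss
        self.datos.move_to_end(clave)        # hit: mark as most recently used
        return self.datos[clave]

    def guardar(self, clave, respuesta):
        if clave in self.datos:
            self.datos.move_to_end(clave)
        elif len(self.datos) >= self.max_entradas:
            self.datos.popitem(last=False)   # full: evict least recently used
        self.datos[clave] = respuesta
```

With a capacity of 3, storing a, b, c, then reading a, then storing d evicts b: the read moved a to the most-recently-used end, leaving b as the oldest.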
Eviction Behavior

- Cache Not Full (entradas < CACHE_MAX_ENTRADAS): new entries are added without eviction.
- Cache Full: the least recently used entry is evicted to make room for the new one.
- Cache Hit: the accessed entry is moved to the most recently used position.

Eviction Example
Cache capacity: 3 entries

Hit Rate Monitoring
Calculating Hit Rate
Hit rate is calculated as hits / (hits + misses) * 100.

Interpreting Hit Rates
70%+ Hit Rate (Excellent)

Optimal performance. Most queries are repeat questions; the cache is highly effective.

Recommendation: No action needed. Monitor for stability.
40-70% Hit Rate (Good)

Healthy cache utilization. This is the expected range for 26 judicial offices with similar workflows.

Recommendation: Continue monitoring. Consider increasing CACHE_MAX_ENTRADAS if the hit rate trends downward.

20-40% Hit Rate (Fair)
Moderate cache effectiveness. Queries may be too diverse, or the cache may be too small.

Recommendation:

- Increase CACHE_MAX_ENTRADAS from 200 to 400
- Increase CACHE_TTL_SEGUNDOS from 3600 to 7200 (2 hours)
<20% Hit Rate (Poor)

Low cache utilization. Investigate the root cause.

Possible causes:
- Highly diverse queries (each user asks unique questions)
- Cache too small for user base
- Frequent cache clearing
- Very short TTL causing premature expiration
Recommendation: Increase CACHE_MAX_ENTRADAS and CACHE_TTL_SEGUNDOS significantly.

Monitoring Hit Rate Over Time
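The tiers above can be tracked over time with a small helper (a sketch; the raw hits and misses counters come from the /siaa/cache statistics endpoint):

```python
def hit_rate(hits: int, misses: int) -> float:
    # hits / (hits + misses) * 100, guarding against division by zero
    total = hits + misses
    return 0.0 if total == 0 else hits / total * 100

def interpretar(rate: float) -> str:
    # Thresholds mirror the interpretation tiers above
    if rate >= 70:
        return "Excellent"
    if rate >= 40:
        return "Good"
    if rate >= 20:
        return "Fair"
    return "Poor"
```

For example, 40 hits against 60 misses gives a 40.0% hit rate, which lands in the "Good" tier.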
When to Clear Cache
Required Cache Clear Scenarios
1. Document Updates
When modifying source documents: the /siaa/recargar endpoint automatically clears the cache after reloading documents.

2. Adding New Documents

When adding documents that may answer previously unanswerable questions.

3. Configuration Changes

After changing chunking or extraction parameters.

4. Quality Issues

If log analysis reveals systematic hallucinations.

Optional Cache Clear Scenarios

Consider clearing the cache to gather fresh performance data:

- Before performance benchmarking
- After system upgrades
- Monthly maintenance (if desired)
Cache Exclusions
What is NOT Cached
SIAA deliberately excludes certain responses from caching:

1. Conversational Queries
2. Empty Responses
3. “No encontré” Responses

These are not cached because:
- Document additions may later answer the question
- Extraction improvements may find relevant content
- These responses have minimal computational cost
Cache-Only Query Types
✅ Cached: tipo == "DOC" and the response contains actual information

❌ Not Cached:

- tipo == "CONV" (conversational)
- Empty responses
- “No encontré” responses
- Error responses
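The rules above can be expressed as a single predicate. This is a sketch: the function name and the error-flag parameter are assumptions, while the tipo values and the "No encontré" prefix come from the rules themselves:

```python
def es_cacheable(tipo: str, respuesta: str, es_error: bool = False) -> bool:
    if es_error:                    # error responses are never cached
        return False
    if tipo != "DOC":               # excludes "CONV" (conversational) queries
        return False
    if not respuesta.strip():       # excludes empty responses
        return False
    if respuesta.startswith("No encontré"):
        return False                # may become answerable after doc updates
    return True
```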
Performance Impact
Cache Hit Performance
Cache hits deliver dramatic performance improvements:

| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Response time | ~44 seconds | ~5 milliseconds | 8,800x faster |
| AI processing | Full model inference | None | 100% reduction |
| RAM usage | ~2 GB (model active) | Minimal | 99% reduction |
| CPU usage | ~100% on 6 cores | <1% | 99% reduction |
Resource Savings
With a 40% hit rate across 100 daily queries, roughly 40 queries per day skip full AI processing.

Cache Memory Usage
Estimate the cache memory footprint as entry count × average entry size. Cache memory usage is negligible (~300 KB for 200 entries), so CACHE_MAX_ENTRADAS can be increased significantly without RAM concerns.

Optimizing Cache Settings
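Before tuning, the footprint of a larger cache can be sanity-checked. The ~1.5 KB average entry size below is an assumption derived from the ~300 KB / 200 entries figure above:

```python
def memoria_cache_kb(entradas: int, kb_por_entrada: float = 1.5) -> float:
    # 1.5 KB/entry is assumed, from ~300 KB observed at 200 entries
    return entradas * kb_por_entrada
```

Even at 1,000 entries the estimate is only ~1.5 MB, confirming that capacity increases are cheap.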
Increasing Cache Size
For larger user bases or higher query diversity, increase CACHE_MAX_ENTRADAS.

Extending TTL

For more stable document sets, increase CACHE_TTL_SEGUNDOS.

Disabling Cache

For testing or debugging only (not recommended for production).

Monitoring Best Practices
Daily Cache Health Check
Cache Performance Alerts
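A hypothetical alert check against the "Poor" tier. The function, the minimum-sample guard, and the alert text are all illustrative choices; only the 20% threshold comes from the tiers above:

```python
def alerta_cache(hits: int, misses: int, umbral: float = 20.0):
    """Return an alert message when the hit rate falls below `umbral`."""
    total = hits + misses
    rate = 0.0 if total == 0 else hits / total * 100
    if total >= 50 and rate < umbral:   # require enough samples before alerting
        return f"ALERTA: hit rate {rate:.1f}% por debajo de {umbral:.0f}%"
    return None                         # healthy, or too few samples to judge
```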
Troubleshooting
Cache Not Working
Symptom: hit_rate: "0.0%" despite repeat queries
Check:
Low Hit Rate Despite Repeat Queries
Symptom: Users report asking the same questions, but the hit rate is low.

Cause: Question variations beyond case, accents, punctuation, and whitespace prevent key matching.

Cache Memory Concerns
Symptom: Concern about the cache consuming too much RAM.

Reality: The cache uses minimal memory (~300 KB for 200 entries).

Verification: Check the entry count via the /siaa/cache statistics endpoint.

Next Steps
Configuration
Adjust cache settings in system configuration
Monitoring
Track cache performance in real-time