System Architecture Overview

SIAA (Sistema Inteligente de Apoyo Administrativo) is an intelligent judicial document management system built for the Seccional Bucaramanga of Colombia’s Judicial Branch. It uses AI-powered document routing and retrieval to answer queries about judicial procedures, regulations, and administrative processes.

System Components

Component Details

Flask Proxy Server

The proxy server (siaa_proxy.py) acts as the central orchestrator, handling:
  • Request routing and validation
  • Cache management
  • Document retrieval coordination
  • Ollama API communication
  • Quality monitoring and logging
The proxy runs on Waitress WSGI server with 16 threads (HILOS_SERVIDOR=16) for production deployment.
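The serving pattern can be sketched as follows. This is a minimal, hypothetical entry point: the real siaa_proxy.py defines the full Flask app, so a bare WSGI callable stands in for it here to keep the example self-contained.

```python
# Sketch of the production entry point; the `app` callable is a stand-in
# for the real Flask application defined in siaa_proxy.py.
HILOS_SERVIDOR = 16

def app(environ, start_response):
    # Placeholder WSGI app returning a fixed body.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"SIAA proxy activo"]

if __name__ == "__main__":
    from waitress import serve
    # Waitress dispatches each request to one of 16 worker threads.
    serve(app, host="0.0.0.0", port=5000, threads=HILOS_SERVIDOR)
```

The 16 threads handle I/O-bound work (cache lookups, streaming responses); actual inference concurrency is limited separately by the Ollama semaphore described below.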

Ollama LLM Engine

SIAA uses the Qwen2.5:3b model via Ollama’s local API:
siaa_proxy.py
OLLAMA_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"

# Model parameters
options = {
    "temperature": 0.0,        # Deterministic responses
    "num_predict": 150,        # Max tokens per response (300 for lists)
    "num_ctx": 2048,           # Context window (adaptive: 1024-3072)
    "num_thread": 6,           # Physical cores only (Ryzen 5 2600)
    "num_batch": 512,          # Large batch → lower TTFT
    "repeat_penalty": 1.1,
}
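One plausible way these options get assembled into an /api/chat request body is shown below. The adaptive num_ctx formula and the helper name construir_payload are assumptions for illustration; only the option values themselves come from the configuration above.

```python
# Hypothetical payload builder mirroring the documented model options.
def construir_payload(pregunta: str, es_lista: bool = False,
                      ctx_chars: int = 0) -> dict:
    # Adaptive context window (1024-3072); the exact scaling rule
    # in siaa_proxy.py is an assumption here.
    num_ctx = min(3072, max(1024, 1024 + ctx_chars // 4))
    return {
        "model": "qwen2.5:3b",
        "messages": [{"role": "user", "content": pregunta}],
        "stream": False,
        "options": {
            "temperature": 0.0,
            "num_predict": 300 if es_lista else 150,  # longer budget for lists
            "num_ctx": num_ctx,
            "num_thread": 6,
            "num_batch": 512,
            "repeat_penalty": 1.1,
        },
    }
```

The resulting dict would be POSTed to `{OLLAMA_URL}/api/chat`, as the health-check warm-up code later in this page does.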

Document Store

Documents are loaded from /opt/siaa/fuentes at startup:
siaa_proxy.py
CARPETA_FUENTES = "/opt/siaa/fuentes"

# Document structure
doc_entry = {
    "ruta": "/opt/siaa/fuentes/acuerdo_psaa16.md",
    "nombre_original": "acuerdo_psaa16.md",
    "contenido": "...",
    "palabras": set(),           # Tokenized vocabulary
    "tamano": 45231,             # Character count
    "coleccion": "general",
    "token_count": Counter(),    # Term frequency index
    "total_tokens": 1542,
    "tokens_nombre": {"acuerdo", "psaa16"},
    "num_chunks": 38,            # Pre-calculated chunks
}
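A loader producing that structure could look like the sketch below. The tokenization rule and the chunk-count formula are assumptions (the real chunker may split on sentence boundaries); the field names match the doc_entry structure above.

```python
import os
import re
from collections import Counter

CARPETA_FUENTES = "/opt/siaa/fuentes"
CHUNK_SIZE, CHUNK_OVERLAP = 800, 300

def tokenizar(texto: str) -> list:
    # Assumed rule: lowercase alphanumeric runs, Spanish letters included.
    return re.findall(r"[a-záéíóúñü0-9]+", texto.lower())

def cargar_documento(ruta: str) -> dict:
    with open(ruta, encoding="utf-8") as f:
        contenido = f.read()
    nombre = os.path.basename(ruta)
    tokens = tokenizar(contenido)
    paso = CHUNK_SIZE - CHUNK_OVERLAP  # 500-char stride between chunks
    return {
        "ruta": ruta,
        "nombre_original": nombre,
        "contenido": contenido,
        "palabras": set(tokens),
        "tamano": len(contenido),
        "coleccion": "general",
        "token_count": Counter(tokens),
        "total_tokens": len(tokens),
        "tokens_nombre": set(tokenizar(os.path.splitext(nombre)[0])),
        "num_chunks": max(1, -(-max(0, len(contenido) - CHUNK_OVERLAP) // paso)),
    }
```

Pre-computing palabras, token_count, and tokens_nombre at startup keeps per-query routing cheap: matching a question against a document is set and Counter arithmetic, with no disk I/O.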

LRU Cache System

High-performance response cache with thread-safe LRU eviction:
siaa_proxy.py
CACHE_MAX_ENTRADAS = 200    # Maximum entries
CACHE_TTL_SEGUNDOS = 3600   # 1 hour TTL
CACHE_SOLO_DOC = True       # Only cache documentary queries

# Cache entry structure
{
    "respuesta": "El SIERJU es un sistema de información...",
    "cita": "📄 Fuente: ACUERDO PSAA16-10476",
    "ts": 1709856234.5,
    "hits": 12,
}
Cache hits provide 8,800x speedup (~5ms vs 44s) and reduce Ollama load by 30-40% across 26 court offices.
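The documented semantics (capacity bound, TTL expiry, LRU eviction, hit counting, thread safety) can be captured in a few lines with an OrderedDict; this is a minimal sketch, and the real implementation in siaa_proxy.py may differ in detail.

```python
import threading
import time
from collections import OrderedDict

CACHE_MAX_ENTRADAS = 200
CACHE_TTL_SEGUNDOS = 3600

_cache = OrderedDict()
_cache_lock = threading.Lock()

def cache_get(clave: str):
    with _cache_lock:
        entrada = _cache.get(clave)
        if entrada is None:
            return None
        if time.time() - entrada["ts"] > CACHE_TTL_SEGUNDOS:
            del _cache[clave]          # expired: evict and treat as miss
            return None
        _cache.move_to_end(clave)      # mark as most recently used
        entrada["hits"] += 1
        return entrada

def cache_put(clave: str, respuesta: str, cita: str):
    with _cache_lock:
        _cache[clave] = {"respuesta": respuesta, "cita": cita,
                         "ts": time.time(), "hits": 0}
        _cache.move_to_end(clave)
        while len(_cache) > CACHE_MAX_ENTRADAS:
            _cache.popitem(last=False)  # drop the least recently used entry
```

Because CACHE_SOLO_DOC=True, only documentary answers (which are deterministic at temperature 0.0) would pass through cache_put; conversational turns bypass the cache.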

Data Flow: Query to Response

Each query passes through the proxy in stages: request validation, cache lookup, document routing, context extraction, and finally Ollama inference behind a concurrency semaphore. With MAX_OLLAMA_SIMULTANEOS=2, a 3rd concurrent request waits up to 30 seconds for a slot; if the queue is still full, the user receives a “Sistema ocupado” message.

Why Limit to 2 Concurrent Requests?

  1. RAM constraints: Qwen2.5:3b requires ~4GB per instance
  2. CPU bottleneck: Ryzen 5 2600 (6 cores) thrashes with >2 parallel inferences
  3. Response quality: More concurrency = slower per-token generation
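The admission control described above amounts to a counting semaphore with a bounded wait. A sketch, assuming the wrapper name con_ollama and the busy message are illustrative:

```python
import threading

MAX_OLLAMA_SIMULTANEOS = 2
ESPERA_SEMAFORO = 30  # seconds a queued request may wait for a slot

_sem = threading.Semaphore(MAX_OLLAMA_SIMULTANEOS)

def con_ollama(funcion, *args, **kwargs):
    # Block up to 30 s for one of the 2 inference slots.
    if not _sem.acquire(timeout=ESPERA_SEMAFORO):
        return {"error": "Sistema ocupado, intente de nuevo"}
    try:
        return funcion(*args, **kwargs)
    finally:
        _sem.release()  # always free the slot, even on exceptions
```

The try/finally guarantees a crashed inference call cannot leak a slot, which would otherwise permanently halve (or exhaust) the system's throughput.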

Health Monitoring System

Automatic health checks run every 15 seconds:
siaa_proxy.py
ollama_estado = {
    "disponible": False,
    "ultimo_check": 0,
    "fallos": 0,
    "warmup_done": None
}

def verificar_ollama() -> bool:
    try:
        r = requests.get(f"{OLLAMA_URL}/api/tags", timeout=TIMEOUT_HEALTH)
        ok = (r.status_code == 200)
    except Exception:
        ok = False
    
    with ollama_lock:
        ollama_estado["disponible"] = ok
        ollama_estado["ultimo_check"] = time.time()
        ollama_estado["fallos"] = 0 if ok else ollama_estado["fallos"] + 1
        warmup_pendiente = ok and ollama_estado["warmup_done"] is None
    
    # Warm-up: load model into RAM on first success
    if warmup_pendiente:
        print(f"  [Ollama] Precargando {MODEL} en RAM...", flush=True)
        requests.post(
            f"{OLLAMA_URL}/api/chat",
            json={"model": MODEL, "messages": [{"role": "user", "content": "ok"}],
                  "stream": False, "options": {"num_predict": 1, "num_ctx": 64}},
            timeout=TIMEOUT_RESPUESTA,  # avoid blocking the monitor forever
        )
        with ollama_lock:
            ollama_estado["warmup_done"] = True
    
    return ok

# Background monitoring thread
def _monitor_loop():
    while True:
        verificar_ollama()
        time.sleep(15)

threading.Thread(target=_monitor_loop, daemon=True).start()

Warm-up Process

On first successful connection, the monitor sends a minimal query ("ok" with 1 token prediction) to:
  1. Load the model into RAM (prevents 30s delay on first real query)
  2. Initialize CUDA/ROCm context
  3. Verify model availability

Check System Status

curl http://localhost:5000/siaa/status
Returns health metrics including warmup_completado, usuarios_activos, cache stats, and Ollama availability.

Quality Monitoring and Logging

Every query is logged to /opt/siaa/logs/calidad.jsonl (JSONL format for easy analysis):
siaa_proxy.py
def registrar_consulta(
    tipo: str,          # "CONV", "DOC", "CACHE_HIT", "ERROR"
    pregunta: str,
    respuesta: str,
    docs: list,
    ctx_chars: int,
    tiempo_seg: float,
    cache_hit: bool = False,
):
    # Detect issues automatically
    no_encontro = "no encontré esa información" in respuesta.lower()
    habia_docs = len(docs) > 0 and ctx_chars > 100
    
    if no_encontro and habia_docs:
        alerta = "POSIBLE_ALUCINACION"   # Had docs but said "not found"
    elif no_encontro and not habia_docs:
        alerta = "SIN_CONTEXTO"           # No docs available (correct)
    elif tipo == "ERROR":
        alerta = "ERROR"
    else:
        alerta = "OK"
    
    entrada = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "tipo": "CACHE_HIT" if cache_hit else tipo,
        "alerta": alerta,
        "pregunta": pregunta[:200],
        "respuesta": respuesta[:300],
        "docs": docs,
        "ctx_chars": ctx_chars,
        "tiempo_s": round(tiempo_seg, 2),
    }
    
    # Append one JSON line; rotation at LOG_MAX_LINEAS=5000 is handled separately
    with _log_lock:
        with open(LOG_ARCHIVO, "a", encoding="utf-8") as f:
            f.write(json.dumps(entrada, ensure_ascii=False) + "\n")

Hallucination Detection

The system automatically flags potential hallucinations:
  • POSIBLE_ALUCINACION: Model said “no encontré” despite receiving relevant documents
  • SIN_CONTEXTO: No documents found (expected “no encontré”)

View Quality Logs

# Last 50 queries
curl http://localhost:5000/siaa/log

# Filter by alert type
curl "http://localhost:5000/siaa/log?alerta=POSIBLE_ALUCINACION"

# Text format for terminal (quote the URL so the shell does not interpret ? and &)
curl "http://localhost:5000/siaa/log?n=20&formato=txt"
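The JSONL format also makes offline analysis trivial. A small helper (not part of siaa_proxy.py) that tallies alert labels from a log file's lines:

```python
import json
from collections import Counter

def contar_alertas(lineas) -> Counter:
    """Count alert labels across an iterable of JSONL lines."""
    conteo = Counter()
    for linea in lineas:
        linea = linea.strip()
        if not linea:
            continue
        conteo[json.loads(linea).get("alerta", "OK")] += 1
    return conteo

# Typical use:
#   with open("/opt/siaa/logs/calidad.jsonl", encoding="utf-8") as f:
#       print(contar_alertas(f))
```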

Configuration Reference

| Parameter | Value | Purpose |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL | qwen2.5:3b | LLM model identifier |
| MAX_OLLAMA_SIMULTANEOS | 2 | Concurrent Ollama requests |
| HILOS_SERVIDOR | 16 | Waitress worker threads |
| TIMEOUT_CONEXION | 8 | Connection timeout (seconds) |
| TIMEOUT_RESPUESTA | 180 | Response timeout (seconds) |
| CARPETA_FUENTES | /opt/siaa/fuentes | Document source directory |
| MAX_DOCS_CONTEXTO | 2 | Max documents per query |
| CHUNK_SIZE | 800 | Characters per chunk |
| CHUNK_OVERLAP | 300 | Overlap between chunks |
| MAX_CHUNKS_CONTEXTO | 3 | Max chunks per document |
| CACHE_MAX_ENTRADAS | 200 | Cache capacity |
| CACHE_TTL_SEGUNDOS | 3600 | Cache entry lifetime |
| LOG_ARCHIVO | /opt/siaa/logs/calidad.jsonl | Quality log path |
| LOG_MAX_LINEAS | 5000 | Log rotation threshold |
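To see how CHUNK_SIZE and CHUNK_OVERLAP interact: with a 500-character stride, consecutive chunks share 300 characters, so a passage near a chunk boundary still appears whole in one of them. The splitter below is a character-window sketch; the real chunker may align on sentence or paragraph boundaries.

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 300

def trocear(texto: str) -> list:
    """Split text into overlapping character windows."""
    paso = CHUNK_SIZE - CHUNK_OVERLAP  # 500-char stride
    chunks = []
    i = 0
    while i < len(texto):
        chunks.append(texto[i:i + CHUNK_SIZE])
        if i + CHUNK_SIZE >= len(texto):
            break  # last window reached the end of the text
        i += paso
    return chunks
```

With MAX_DOCS_CONTEXTO=2 and MAX_CHUNKS_CONTEXTO=3, a query's context is bounded by roughly 2 × 3 × 800 = 4,800 characters, which is what keeps num_ctx within its 1024-3072 adaptive range.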

Performance Characteristics

  • Cache hit: ~5ms response time
  • Cache miss: 20-45s response time (depending on context size)
  • TTFT (Time To First Token): 3-8s with warm model
  • Token generation: ~15-20 tokens/second
  • Max throughput: 2 concurrent users (semaphore limit)
  • Cache hit rate: 30-40% (across 26 court offices)

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /siaa/chat | POST | Main chat interface (SSE streaming) |
| /siaa/status | GET | System health and statistics |
| /siaa/ver/&lt;doc&gt; | GET | View document as HTML |
| /siaa/log | GET | Quality monitoring log |
| /siaa/cache | GET/DELETE | Cache statistics / clear cache |
| /siaa/enrutar | GET | Test document routing |
| /siaa/fragmento | GET | View extracted fragment |
| /siaa/recargar | GET | Reload documents from disk |
