System Architecture Overview

SIAA (Sistema Inteligente de Apoyo Administrativo) is an intelligent judicial document management system built for the Seccional Bucaramanga of Colombia’s Judicial Branch. It uses AI-powered document routing and retrieval to answer queries about judicial procedures, regulations, and administrative processes.

System Components

Component Details

Flask Proxy Server

The proxy server (siaa_proxy.py) acts as the central orchestrator, handling:
  • Request routing and validation
  • Cache management
  • Document retrieval coordination
  • Ollama API communication
  • Quality monitoring and logging
The proxy runs on Waitress WSGI server with 16 threads (HILOS_SERVIDOR=16) for production deployment.
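The serving pattern can be sketched as follows. This is a minimal, hypothetical entry point: the real siaa_proxy.py defines the full Flask app, so a bare WSGI callable stands in for it here to keep the example self-contained.

```python
# Sketch of the production entry point; the `app` callable is a stand-in
# for the real Flask application defined in siaa_proxy.py.
HILOS_SERVIDOR = 16

def app(environ, start_response):
    # Placeholder WSGI app returning a fixed body.
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"SIAA proxy activo"]

if __name__ == "__main__":
    from waitress import serve
    # Waitress dispatches each request to one of 16 worker threads.
    serve(app, host="0.0.0.0", port=5000, threads=HILOS_SERVIDOR)
```

The 16 threads handle I/O-bound work (cache lookups, streaming responses); actual inference concurrency is limited separately by the Ollama semaphore described below.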

Ollama LLM Engine

SIAA uses the Qwen2.5:3b model via Ollama’s local API:
siaa_proxy.py
OLLAMA_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"

# Model parameters
options = {
    "temperature": 0.0,        # Deterministic responses
    "num_predict": 150,        # Max tokens per response (300 for lists)
    "num_ctx": 2048,           # Context window (adaptive: 1024-3072)
    "num_thread": 6,           # Physical cores only (Ryzen 5 2600)
    "num_batch": 512,          # Large batch → lower TTFT
    "repeat_penalty": 1.1,
}
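One plausible way these options get assembled into an /api/chat request body is shown below. The adaptive num_ctx formula and the helper name construir_payload are assumptions for illustration; only the option values themselves come from the configuration above.

```python
# Hypothetical payload builder mirroring the documented model options.
def construir_payload(pregunta: str, es_lista: bool = False,
                      ctx_chars: int = 0) -> dict:
    # Adaptive context window (1024-3072); the exact scaling rule
    # in siaa_proxy.py is an assumption here.
    num_ctx = min(3072, max(1024, 1024 + ctx_chars // 4))
    return {
        "model": "qwen2.5:3b",
        "messages": [{"role": "user", "content": pregunta}],
        "stream": False,
        "options": {
            "temperature": 0.0,
            "num_predict": 300 if es_lista else 150,  # longer budget for lists
            "num_ctx": num_ctx,
            "num_thread": 6,
            "num_batch": 512,
            "repeat_penalty": 1.1,
        },
    }
```

The resulting dict would be POSTed to `{OLLAMA_URL}/api/chat`, as the health-check warm-up code later in this page does.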

Document Store

Documents are loaded from /opt/siaa/fuentes at startup:
siaa_proxy.py
CARPETA_FUENTES = "/opt/siaa/fuentes"

# Document structure
doc_entry = {
    "ruta": "/opt/siaa/fuentes/acuerdo_psaa16.md",
    "nombre_original": "acuerdo_psaa16.md",
    "contenido": "...",
    "palabras": set(),           # Tokenized vocabulary
    "tamano": 45231,             # Character count
    "coleccion": "general",
    "token_count": Counter(),    # Term frequency index
    "total_tokens": 1542,
    "tokens_nombre": {"acuerdo", "psaa16"},
    "num_chunks": 38,            # Pre-calculated chunks
}
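A loader producing that structure could look like the sketch below. The tokenization rule and the chunk-count formula are assumptions (the real chunker may split on sentence boundaries); the field names match the doc_entry structure above.

```python
import os
import re
from collections import Counter

CARPETA_FUENTES = "/opt/siaa/fuentes"
CHUNK_SIZE, CHUNK_OVERLAP = 800, 300

def tokenizar(texto: str) -> list:
    # Assumed rule: lowercase alphanumeric runs, Spanish letters included.
    return re.findall(r"[a-záéíóúñü0-9]+", texto.lower())

def cargar_documento(ruta: str) -> dict:
    with open(ruta, encoding="utf-8") as f:
        contenido = f.read()
    nombre = os.path.basename(ruta)
    tokens = tokenizar(contenido)
    paso = CHUNK_SIZE - CHUNK_OVERLAP  # 500-char stride between chunks
    return {
        "ruta": ruta,
        "nombre_original": nombre,
        "contenido": contenido,
        "palabras": set(tokens),
        "tamano": len(contenido),
        "coleccion": "general",
        "token_count": Counter(tokens),
        "total_tokens": len(tokens),
        "tokens_nombre": set(tokenizar(os.path.splitext(nombre)[0])),
        "num_chunks": max(1, -(-max(0, len(contenido) - CHUNK_OVERLAP) // paso)),
    }
```

Pre-computing palabras, token_count, and tokens_nombre at startup keeps per-query routing cheap: matching a question against a document is set and Counter arithmetic, with no disk I/O.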

LRU Cache System

High-performance response cache with thread-safe LRU eviction:
siaa_proxy.py
CACHE_MAX_ENTRADAS = 200    # Maximum entries
CACHE_TTL_SEGUNDOS = 3600   # 1 hour TTL
CACHE_SOLO_DOC = True       # Only cache documentary queries

# Cache entry structure
{
    "respuesta": "El SIERJU es un sistema de información...",
    "cita": "📄 Fuente: ACUERDO PSAA16-10476",
    "ts": 1709856234.5,
    "hits": 12,
}
Cache hits provide 8,800x speedup (~5ms vs 44s) and reduce Ollama load by 30-40% across 26 court offices.
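The documented semantics (capacity bound, TTL expiry, LRU eviction, hit counting, thread safety) can be captured in a few lines with an OrderedDict; this is a minimal sketch, and the real implementation in siaa_proxy.py may differ in detail.

```python
import threading
import time
from collections import OrderedDict

CACHE_MAX_ENTRADAS = 200
CACHE_TTL_SEGUNDOS = 3600

_cache = OrderedDict()
_cache_lock = threading.Lock()

def cache_get(clave: str):
    with _cache_lock:
        entrada = _cache.get(clave)
        if entrada is None:
            return None
        if time.time() - entrada["ts"] > CACHE_TTL_SEGUNDOS:
            del _cache[clave]          # expired: evict and treat as miss
            return None
        _cache.move_to_end(clave)      # mark as most recently used
        entrada["hits"] += 1
        return entrada

def cache_put(clave: str, respuesta: str, cita: str):
    with _cache_lock:
        _cache[clave] = {"respuesta": respuesta, "cita": cita,
                         "ts": time.time(), "hits": 0}
        _cache.move_to_end(clave)
        while len(_cache) > CACHE_MAX_ENTRADAS:
            _cache.popitem(last=False)  # drop the least recently used entry
```

Because CACHE_SOLO_DOC=True, only documentary answers (which are deterministic at temperature 0.0) would pass through cache_put; conversational turns bypass the cache.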

Data Flow: Query to Response

Each query passes through the proxy in stages: request validation, cache lookup, document routing, context extraction, and finally Ollama inference behind a concurrency semaphore. With MAX_OLLAMA_SIMULTANEOS=2, a 3rd concurrent request waits up to 30 seconds for a slot; if the queue is still full, the user receives a “Sistema ocupado” message.

Why Limit to 2 Concurrent Requests?

  1. RAM constraints: Qwen2.5:3b requires ~4GB per instance
  2. CPU bottleneck: Ryzen 5 2600 (6 cores) thrashes with >2 parallel inferences
  3. Response quality: More concurrency = slower per-token generation
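The admission control described above amounts to a counting semaphore with a bounded wait. A sketch, assuming the wrapper name con_ollama and the busy message are illustrative:

```python
import threading

MAX_OLLAMA_SIMULTANEOS = 2
ESPERA_SEMAFORO = 30  # seconds a queued request may wait for a slot

_sem = threading.Semaphore(MAX_OLLAMA_SIMULTANEOS)

def con_ollama(funcion, *args, **kwargs):
    # Block up to 30 s for one of the 2 inference slots.
    if not _sem.acquire(timeout=ESPERA_SEMAFORO):
        return {"error": "Sistema ocupado, intente de nuevo"}
    try:
        return funcion(*args, **kwargs)
    finally:
        _sem.release()  # always free the slot, even on exceptions
```

The try/finally guarantees a crashed inference call cannot leak a slot, which would otherwise permanently halve (or exhaust) the system's throughput.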

Health Monitoring System

Automatic health checks run every 15 seconds:
siaa_proxy.py
ollama_estado = {
    "disponible": False,
    "ultimo_check": 0,
    "fallos": 0,
    "warmup_done": None
}

def verificar_ollama() -> bool:
    try:
        r = requests.get(f"{OLLAMA_URL}/api/tags", timeout=TIMEOUT_HEALTH)
        ok = (r.status_code == 200)
    except Exception:
        ok = False
    
    with ollama_lock:
        ollama_estado["disponible"] = ok
        ollama_estado["ultimo_check"] = time.time()
        ollama_estado["fallos"] = 0 if ok else ollama_estado["fallos"] + 1
        warmup_pendiente = ok and ollama_estado["warmup_done"] is None
    
    # Warm-up: load model into RAM on first success
    if warmup_pendiente:
        print(f"  [Ollama] Precargando {MODEL} en RAM...", flush=True)
        requests.post(
            f"{OLLAMA_URL}/api/chat",
            json={"model": MODEL, "messages": [{"role": "user", "content": "ok"}],
                  "stream": False, "options": {"num_predict": 1, "num_ctx": 64}},
            timeout=TIMEOUT_RESPUESTA,  # avoid blocking the monitor forever
        )
        with ollama_lock:
            ollama_estado["warmup_done"] = True
    
    return ok

# Background monitoring thread
def _monitor_loop():
    while True:
        verificar_ollama()
        time.sleep(15)

threading.Thread(target=_monitor_loop, daemon=True).start()

Warm-up Process

On first successful connection, the monitor sends a minimal query ("ok" with 1 token prediction) to:
  1. Load the model into RAM (prevents 30s delay on first real query)
  2. Initialize CUDA/ROCm context
  3. Verify model availability

Check System Status

curl http://localhost:5000/siaa/status
Returns health metrics including warmup_completado, usuarios_activos, cache stats, and Ollama availability.

Quality Monitoring and Logging

Every query is logged to /opt/siaa/logs/calidad.jsonl (JSONL format for easy analysis):
siaa_proxy.py
def registrar_consulta(
    tipo: str,          # "CONV", "DOC", "CACHE_HIT", "ERROR"
    pregunta: str,
    respuesta: str,
    docs: list,
    ctx_chars: int,
    tiempo_seg: float,
    cache_hit: bool = False,
):
    # Detect issues automatically
    no_encontro = "no encontré esa información" in respuesta.lower()
    habia_docs = len(docs) > 0 and ctx_chars > 100
    
    if no_encontro and habia_docs:
        alerta = "POSIBLE_ALUCINACION"   # Had docs but said "not found"
    elif no_encontro and not habia_docs:
        alerta = "SIN_CONTEXTO"           # No docs available (correct)
    elif tipo == "ERROR":
        alerta = "ERROR"
    else:
        alerta = "OK"
    
    entrada = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "tipo": "CACHE_HIT" if cache_hit else tipo,
        "alerta": alerta,
        "pregunta": pregunta[:200],
        "respuesta": respuesta[:300],
        "docs": docs,
        "ctx_chars": ctx_chars,
        "tiempo_s": round(tiempo_seg, 2),
    }
    
    # Append one JSON line; rotation at LOG_MAX_LINEAS=5000 is handled separately
    with _log_lock:
        with open(LOG_ARCHIVO, "a", encoding="utf-8") as f:
            f.write(json.dumps(entrada, ensure_ascii=False) + "\n")

Hallucination Detection

The system automatically flags potential hallucinations:
  • POSIBLE_ALUCINACION: Model said “no encontré” despite receiving relevant documents
  • SIN_CONTEXTO: No documents found (expected “no encontré”)

View Quality Logs

# Last 50 queries
curl http://localhost:5000/siaa/log

# Filter by alert type
curl "http://localhost:5000/siaa/log?alerta=POSIBLE_ALUCINACION"

# Text format for terminal (quote the URL so the shell does not interpret ? and &)
curl "http://localhost:5000/siaa/log?n=20&formato=txt"
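The JSONL format also makes offline analysis trivial. A small helper (not part of siaa_proxy.py) that tallies alert labels from a log file's lines:

```python
import json
from collections import Counter

def contar_alertas(lineas) -> Counter:
    """Count alert labels across an iterable of JSONL lines."""
    conteo = Counter()
    for linea in lineas:
        linea = linea.strip()
        if not linea:
            continue
        conteo[json.loads(linea).get("alerta", "OK")] += 1
    return conteo

# Typical use:
#   with open("/opt/siaa/logs/calidad.jsonl", encoding="utf-8") as f:
#       print(contar_alertas(f))
```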

Configuration Reference

| Parameter | Value | Purpose |
| --- | --- | --- |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL | qwen2.5:3b | LLM model identifier |
| MAX_OLLAMA_SIMULTANEOS | 2 | Concurrent Ollama requests |
| HILOS_SERVIDOR | 16 | Waitress worker threads |
| TIMEOUT_CONEXION | 8 | Connection timeout (seconds) |
| TIMEOUT_RESPUESTA | 180 | Response timeout (seconds) |
| CARPETA_FUENTES | /opt/siaa/fuentes | Document source directory |
| MAX_DOCS_CONTEXTO | 2 | Max documents per query |
| CHUNK_SIZE | 800 | Characters per chunk |
| CHUNK_OVERLAP | 300 | Overlap between chunks |
| MAX_CHUNKS_CONTEXTO | 3 | Max chunks per document |
| CACHE_MAX_ENTRADAS | 200 | Cache capacity |
| CACHE_TTL_SEGUNDOS | 3600 | Cache entry lifetime |
| LOG_ARCHIVO | /opt/siaa/logs/calidad.jsonl | Quality log path |
| LOG_MAX_LINEAS | 5000 | Log rotation threshold |
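To see how CHUNK_SIZE and CHUNK_OVERLAP interact: with a 500-character stride, consecutive chunks share 300 characters, so a passage near a chunk boundary still appears whole in one of them. The splitter below is a character-window sketch; the real chunker may align on sentence or paragraph boundaries.

```python
CHUNK_SIZE = 800
CHUNK_OVERLAP = 300

def trocear(texto: str) -> list:
    """Split text into overlapping character windows."""
    paso = CHUNK_SIZE - CHUNK_OVERLAP  # 500-char stride
    chunks = []
    i = 0
    while i < len(texto):
        chunks.append(texto[i:i + CHUNK_SIZE])
        if i + CHUNK_SIZE >= len(texto):
            break  # last window reached the end of the text
        i += paso
    return chunks
```

With MAX_DOCS_CONTEXTO=2 and MAX_CHUNKS_CONTEXTO=3, a query's context is bounded by roughly 2 × 3 × 800 = 4,800 characters, which is what keeps num_ctx within its 1024-3072 adaptive range.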

Performance Characteristics

  • Cache hit: ~5ms response time
  • Cache miss: 20-45s response time (depending on context size)
  • TTFT (Time To First Token): 3-8s with warm model
  • Token generation: ~15-20 tokens/second
  • Max throughput: 2 concurrent users (semaphore limit)
  • Cache hit rate: 30-40% (across 26 court offices)

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /siaa/chat | POST | Main chat interface (SSE streaming) |
| /siaa/status | GET | System health and statistics |
| /siaa/ver/&lt;doc&gt; | GET | View document as HTML |
| /siaa/log | GET | Quality monitoring log |
| /siaa/cache | GET/DELETE | Cache statistics / clear cache |
| /siaa/enrutar | GET | Test document routing |
| /siaa/fragmento | GET | View extracted fragment |
| /siaa/recargar | GET | Reload documents from disk |
