Epidemiological Analytics from SIVIGILA Data

Salud IA Bot’s public health module lets any Telegram user query Colombia’s official SIVIGILA epidemiological dataset with plain-language questions. No SQL or spreadsheet skills are needed: the user asks a question, the bot’s NLP engine routes it to the matching analytic method, and the reply comes back as a structured answer with percentages, emoji, and actionable conclusions, all in Spanish.

The public health module loads SIVIGILA data from a local XML file at startup using xml2js. It makes no live network calls to SIVIGILA: all data is parsed from the bundled XML file and held in memory for the lifetime of the process.

What SIVIGILA is and why it matters

SIVIGILA (Sistema Nacional de Vigilancia en Salud Pública) is Colombia’s national public health surveillance system, managed by the Instituto Nacional de Salud (INS). It aggregates mandatory disease event reports from every health provider across the country, covering hundreds of notifiable conditions, from dengue and tuberculosis to occupational accidents and perinatal mortality. The dataset loaded into Salud IA Bot (Eventos_de_Interés_en_Salud_Pública_20260514.xml) contains aggregate national case counts broken down by:

Urban/rural zone (urbano / rural)
Six life-cycle age groups: primera infancia (0-4), infancia (5-9), adolescencia (10-14), juventud (15-19), adulto joven (20-49), adulto mayor (50+)
Sex: femenino / masculino
Total case count per notifiable event (total_de_eventos)
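The breakdown above suggests a record shape along the following lines. This is a hypothetical sketch rather than the bot's actual type: `total_de_eventos` and the zone, sex, and age-group field names appear elsewhere on this page, while `nombre_del_evento`, the `EventoSivigila` type, and the `toEvento` helper are assumptions made for illustration. It also assumes the parsed XML delivers every value as a string, so counts are coerced to numbers.

```typescript
// Hypothetical in-memory shape for one SIVIGILA event row.
interface EventoSivigila {
  nombre_del_evento: string; // assumed field name, not confirmed by the source
  urbano: number;
  rural: number;
  primera_infancia: number;
  infancia: number;
  adolescencia: number;
  juventud: number;
  adulto_j_ven: number;
  adulto_mayor: number;
  femenino: number;
  masculino: number;
  total_de_eventos: number;
}

const CAMPOS_NUMERICOS = [
  "urbano", "rural", "primera_infancia", "infancia", "adolescencia",
  "juventud", "adulto_j_ven", "adulto_mayor", "femenino", "masculino",
  "total_de_eventos",
] as const;

// Coerce a raw string record into typed counts; missing or
// non-numeric values become 0.
function toEvento(raw: Record<string, string | undefined>): EventoSivigila {
  const evento = { nombre_del_evento: raw.nombre_del_evento ?? "" } as EventoSivigila;
  for (const campo of CAMPOS_NUMERICOS) {
    const n = Number(raw[campo]);
    evento[campo] = Number.isFinite(n) ? n : 0;
  }
  return evento;
}
```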

How `procesarPregunta` routes intents

SaludPublicaService.procesarPregunta(texto) is the primary entry point for free-text queries. It normalizes the input, scans a synonym dictionary of approximately 35 entries, and falls back to a partial-match search over event names when no synonym matches. BotUpdate then delegates to a specific analytic method depending on the detected intent. The stages are:

Text normalization

Input is lowercased, diacritics are stripped via Unicode NFD decomposition, punctuation is removed, and whitespace is collapsed. This makes queries like "DENGUE", "dengüe", and "dengue" resolve to the same key.
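The four normalization steps can be sketched as below. The function name `normalizeText` is taken from this page, but the body is an assumption about how the steps might be implemented, not the bot's actual code. In particular, replacing punctuation with a space (rather than deleting it) is a choice made here so that names such as "VIH/SIDA" keep their word boundary.

```typescript
// A minimal sketch of the normalization pipeline described above.
function normalizeText(input: string): string {
  return input
    .toLowerCase()
    .normalize("NFD")                  // split letters from their accents
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining diacritical marks
    .replace(/[^a-z0-9\s]/g, " ")      // punctuation and symbols become spaces
    .replace(/\s+/g, " ")              // collapse runs of whitespace
    .trim();
}
```

With this sketch, "DENGUE", "dengüe", and "dengue" all normalize to the same string, and "niños" becomes "ninos".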

Synonym resolution

A static map translates common shorthand to the canonical SIVIGILA event name. For example: dengue -> DENGUE, mordeduras -> AGRESIONES POR ANIMALES POTENCIALMENTE TRANSMISORES DE RABIA, vih -> VIH/SIDA - MORTALIDAD POR SIDA, chikungunya -> CHIKUNGUYA, drogas -> CONSUMO DE SPA.
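A lookup of this kind might look like the following sketch. The five entries are the examples listed above (the real dictionary has roughly 35); the names `SINONIMOS` and `resolverSinonimo` are hypothetical, and matching token by token on already-normalized text is an assumption about how the scan works.

```typescript
// Subset of the synonym dictionary, using only the examples given above.
const SINONIMOS = new Map<string, string>([
  ["dengue", "DENGUE"],
  ["mordeduras", "AGRESIONES POR ANIMALES POTENCIALMENTE TRANSMISORES DE RABIA"],
  ["vih", "VIH/SIDA - MORTALIDAD POR SIDA"],
  ["chikungunya", "CHIKUNGUYA"],
  ["drogas", "CONSUMO DE SPA"],
]);

// Return the canonical event name for the first token that is a known
// synonym, or undefined when nothing matches. Expects normalized input.
function resolverSinonimo(textoNormalizado: string): string | undefined {
  for (const token of textoNormalizado.split(" ")) {
    const canonico = SINONIMOS.get(token);
    if (canonico !== undefined) return canonico;
  }
  return undefined;
}
```

Matching whole tokens rather than substrings avoids a short key accidentally firing inside a longer, unrelated word.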

Ambiguous search fallback

If no synonym matches, buscarEventosAmbigua(nombre) performs partial matching across all event names. The first match is used to retrieve the full event record.
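The fallback can be illustrated with a small sketch. The real buscarEventosAmbigua(nombre) takes only the query; the second parameter here (the list of event names) and the local `normalizar` helper are assumptions added to keep the example self-contained.

```typescript
// Local normalization helper (same steps as described under "Text normalization").
function normalizar(s: string): string {
  return s
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9\s]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Return every event name whose normalized form contains the normalized query.
function buscarEventosAmbigua(nombre: string, nombresEventos: string[]): string[] {
  const consulta = normalizar(nombre);
  if (consulta === "") return [];
  return nombresEventos.filter((n) => normalizar(n).indexOf(consulta) !== -1);
}

// Event names taken from the examples on this page.
const NOMBRES_EJEMPLO = ["DENGUE", "ZIKA", "CONSUMO DE SPA", "VIH/SIDA - MORTALIDAD POR SIDA"];
const primeraCoincidencia = buscarEventosAmbigua("Consumo", NOMBRES_EJEMPLO)[0];
```

Because only the first match is used, the order of the event list decides which record wins when several names contain the query.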

Structured NLG output

_formatearRespuesta(datos, tipo) converts raw query results into a human-readable Telegram message with headings, bullet points, percentages, and emoji. Raw JSON is never sent to the user.

Example Telegram queries

"¿Cuáles son los eventos de salud pública más frecuentes?"
-> Calls topEventos(5), which returns a ranked list of the 5 events with the most total cases

"Compara dengue vs zika"
-> Calls compararEventos('DENGUE', 'ZIKA'), which shows the total cases for each event, which one is higher, and the absolute difference

"¿Qué enfermedad afecta más a los adolescentes?"
-> Calls eventoPrincipalPorGrupoEtario('adolescencia'), which returns the SIVIGILA event with the highest adolescencia count

"Proporción global por sexo"
-> Calls proporcionSexoGlobal(), which returns femenino%, masculino%, and the total across all events

"¿Qué enfermedades son más rurales?"
-> Calls eventoMasRural(), which returns the event with the highest rural/total_de_eventos ratio

"Dame un resumen de salud pública"
-> Calls obtenerResumenGeneral(), which returns totalCasos, totalEventos, the top 3 events, and a breakdown by category (infecciosos, mental, materno, violencia)

Flexible search engine

The search engine never compares raw strings. A dedicated normalizeText() function strips accents, punctuation, and casing before any comparison. When an exact synonym match fails, buscarEventosAmbigua(nombre) performs partial matching across all normalized event names. If that still returns nothing, buscarPorSimilitud(query, threshold=0.6) applies a Levenshtein-distance similarity score, so misspelled queries such as "dgngue" or "chikunguña" still resolve correctly.
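The similarity step can be sketched as follows. The source only states that the score is based on Levenshtein distance with a 0.6 threshold; normalizing it as 1 - distance / max(length) is an assumption, as are the `similitud` helper and the explicit `candidatos` parameter. Inputs are expected to be normalized already.

```typescript
// Classic Levenshtein edit distance using two rolling rows.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed scoring: 1.0 for identical strings, 0.0 for entirely different ones.
function similitud(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}

// Return the best-scoring candidate at or above the threshold, if any.
function buscarPorSimilitud(
  query: string,
  candidatos: string[],
  threshold = 0.6,
): string | undefined {
  let mejor: string | undefined;
  let mejorPuntaje = threshold;
  for (const c of candidatos) {
    const puntaje = similitud(query, c);
    if (puntaje >= mejorPuntaje) {
      mejor = c;
      mejorPuntaje = puntaje;
    }
  }
  return mejor;
}
```

Under this scoring, "dgngue" is one edit away from "dengue" (six characters), which gives a score of about 0.83 and clears the 0.6 threshold.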

NLP-based demographic queries with `CYCLE_KEYWORDS`

Demographic queries map natural-language age descriptors to the corresponding SIVIGILA age-group fields. The CYCLE_KEYWORDS constant (defined in constants/keywords.ts) drives this mapping:

| CYCLE_KEYWORDS entry | Life-cycle key | Underlying field(s) |
| --- | --- | --- |
| `['ninos', 'nino', 'nena']` | `niños` | `primera_infancia`, `infancia`, `de_5_a_9` |
| `['adolescente', 'adolescentes']` | `adolescentes` | `adolescencia` |
| `['jovenes', 'joven']` | `jovenes` | `juventud`, `adulto_j_ven` |
| `['adultos', 'adulto']` | `adultos` | `adulto_j_ven` |
| `['mayores', 'mayor']` | `mayores` | `adulto_mayor` |

These keywords trigger eventoPrincipalPorGrupoEtario(grupo) with the matching field name, returning the single event with the highest case count for that age group.
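The keyword lookup might be sketched like this. The keyword lists, keys, and field names are copied from the table above; the object layout, the `CAMPOS_POR_CICLO` constant, and the `detectarCiclo` and `camposParaTexto` helpers are assumptions, since the actual structure of constants/keywords.ts is not shown here. The input is expected to be normalized, which is why the keywords carry no accents.

```typescript
// Keywords per life-cycle key, as listed in the table above.
const CYCLE_KEYWORDS: Record<string, string[]> = {
  "niños": ["ninos", "nino", "nena"],
  adolescentes: ["adolescente", "adolescentes"],
  jovenes: ["jovenes", "joven"],
  adultos: ["adultos", "adulto"],
  mayores: ["mayores", "mayor"],
};

// Dataset fields per life-cycle key (hypothetical constant name).
const CAMPOS_POR_CICLO: Record<string, string[]> = {
  "niños": ["primera_infancia", "infancia", "de_5_a_9"],
  adolescentes: ["adolescencia"],
  jovenes: ["juventud", "adulto_j_ven"],
  adultos: ["adulto_j_ven"],
  mayores: ["adulto_mayor"],
};

// Return the first life-cycle key whose keyword appears as a whole token.
function detectarCiclo(textoNormalizado: string): string | undefined {
  const tokens = textoNormalizado.split(" ");
  for (const ciclo of Object.keys(CYCLE_KEYWORDS)) {
    if (CYCLE_KEYWORDS[ciclo].some((p) => tokens.indexOf(p) !== -1)) return ciclo;
  }
  return undefined;
}

// Resolve a query to the dataset fields it refers to (empty when no match).
function camposParaTexto(textoNormalizado: string): string[] {
  const ciclo = detectarCiclo(textoNormalizado);
  return ciclo === undefined ? [] : CAMPOS_POR_CICLO[ciclo];
}
```

In this sketch the first matching key wins, so a phrase such as "adulto mayor" would resolve to `adultos` before `mayores` is checked; a real implementation would need to decide how to order or disambiguate such cases.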

RAG bypass — zero hallucination for structured data

When SaludPublicaService returns a valid result, BotUpdate sends that response directly to the user without ever calling the LLaMA 3.1 model. The LLM is invoked only as a fallback for open-ended questions where no structured data path exists. This design ensures that numerical statistics (case counts, percentages, rankings) always come from the actual SIVIGILA dataset and are never generated or embellished by the AI.

BYPASS_MARKERS (defined in constants/keywords.ts) is a list of sentinel strings, such as '--- ANÁLISIS', '--- RANKING', and '--- DISTRIBUCIÓN', that BotUpdate checks against pre-formatted response strings. When a response begins with one of these markers, the bot sends it directly to the user, bypassing the LLM entirely.
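The marker check amounts to a prefix test, sketched below. The three markers are the ones named on this page (the real list may contain more); `debeOmitirLlm` and `elegirRuta` are hypothetical names, and the actual control flow inside BotUpdate is not shown in the source.

```typescript
// Sentinel prefixes named above; the full list lives in constants/keywords.ts.
const BYPASS_MARKERS = ["--- ANÁLISIS", "--- RANKING", "--- DISTRIBUCIÓN"];

// True when a pre-formatted response starts with one of the sentinel markers.
function debeOmitirLlm(respuesta: string | null | undefined): boolean {
  if (!respuesta) return false;
  return BYPASS_MARKERS.some((marcador) => respuesta.startsWith(marcador));
}

// Decide whether a response goes straight to the user or to the LLM fallback.
function elegirRuta(respuesta: string | null | undefined): "directo" | "llm" {
  return debeOmitirLlm(respuesta) ? "directo" : "llm";
}
```

Checking the start of the string, rather than searching anywhere in it, keeps a user's free text from triggering the bypass just by quoting a marker mid-message.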

Top events ranking

topEventos(n) and bottomEventos(n) return sorted slices of the event list by total case count.
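A sort-and-slice of this kind could look like the sketch below. The real methods take only `n` and read the in-memory dataset; the explicit `eventos` parameter, the `EventoTotal` type, and its `nombre` field are assumptions for illustration.

```typescript
interface EventoTotal {
  nombre: string; // hypothetical field name
  total_de_eventos: number;
}

// The n events with the most total cases, highest first.
function topEventos(eventos: EventoTotal[], n: number): EventoTotal[] {
  return eventos
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.total_de_eventos - a.total_de_eventos)
    .slice(0, Math.max(0, n));
}

// The n events with the fewest total cases, lowest first.
function bottomEventos(eventos: EventoTotal[], n: number): EventoTotal[] {
  return eventos
    .slice()
    .sort((a, b) => a.total_de_eventos - b.total_de_eventos)
    .slice(0, Math.max(0, n));
}
```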

Comparative analysis

compararEventos(A, B) returns both events side by side with the difference in total cases and which event is higher.
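The comparison result might be shaped as in this sketch. The real method takes two event names and looks the records up itself; passing the records directly, the `Comparacion` type, and returning `null` for a tie are assumptions made to keep the example small.

```typescript
interface EventoComparable {
  nombre: string; // hypothetical field name
  total_de_eventos: number;
}

interface Comparacion {
  a: EventoComparable;
  b: EventoComparable;
  mayor: string | null; // name of the event with more cases, null on a tie
  diferencia: number;   // absolute difference in total cases
}

// Put two events side by side and report which has more cases.
function compararEventos(a: EventoComparable, b: EventoComparable): Comparacion {
  const delta = a.total_de_eventos - b.total_de_eventos;
  const mayor = delta === 0 ? null : delta > 0 ? a.nombre : b.nombre;
  return { a, b, mayor, diferencia: Math.abs(delta) };
}
```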

Gender distribution

proporcionSexoGlobal() and eventosMayorBrechaSexo(n) surface gender imbalances across all or individual events.
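The global proportion can be sketched as a pair of sums. This assumes the percentages are computed over femenino + masculino; whether the bot instead divides by the sum of total_de_eventos is not stated in the source. The explicit `eventos` parameter and the result field names are also hypothetical.

```typescript
interface ConteoSexo {
  femenino: number;
  masculino: number;
}

interface ProporcionSexo {
  femeninoPct: number;
  masculinoPct: number;
  total: number;
}

// Sum both sexes across all events and express each as a percentage.
function proporcionSexoGlobal(eventos: ConteoSexo[]): ProporcionSexo {
  let femenino = 0;
  let masculino = 0;
  for (const e of eventos) {
    femenino += e.femenino;
    masculino += e.masculino;
  }
  const total = femenino + masculino;
  if (total === 0) return { femeninoPct: 0, masculinoPct: 0, total: 0 };
  return {
    femeninoPct: (femenino * 100) / total,
    masculinoPct: (masculino * 100) / total,
    total,
  };
}
```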

Urban / rural split

eventoMasRural() and eventoMasUrbano() identify events with the highest proportion of cases in each zone type.
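Both methods reduce to finding the maximum of a ratio. The sketch below uses the rural/total_de_eventos ratio mentioned in the example queries and assumes the urban case mirrors it; the shared helper, the explicit `eventos` parameter, and skipping events with a zero total are assumptions.

```typescript
interface EventoZona {
  nombre: string; // hypothetical field name
  urbano: number;
  rural: number;
  total_de_eventos: number;
}

// Event with the highest share of its cases in the given zone.
// Events with no cases are skipped to avoid dividing by zero.
function eventoConMayorProporcion(
  eventos: EventoZona[],
  zona: "rural" | "urbano",
): EventoZona | undefined {
  let mejor: EventoZona | undefined;
  let mejorRatio = -1;
  for (const e of eventos) {
    if (e.total_de_eventos <= 0) continue;
    const ratio = e[zona] / e.total_de_eventos;
    if (ratio > mejorRatio) {
      mejor = e;
      mejorRatio = ratio;
    }
  }
  return mejor;
}

const eventoMasRural = (eventos: EventoZona[]) => eventoConMayorProporcion(eventos, "rural");
const eventoMasUrbano = (eventos: EventoZona[]) => eventoConMayorProporcion(eventos, "urbano");
```

Ranking by ratio rather than by raw rural count keeps a very common, mostly urban event from dominating the result.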
