The AI layer of Salud IA Bot is intentionally thin. Rather than building a dedicated LLM framework, the project uses the official OpenAI Node SDK pointed at OpenRouter’s base URL. This gives the application a standard, well-maintained client, retains full access to every model available on OpenRouter, and removes any vendor lock-in. The heavy intelligence work happens in the domain services that construct the prompts, not in the model itself.
GenkitService — The Central AI Client
GenkitService is a NestJS @Injectable() that wraps the OpenAI SDK. It is the only place in the codebase where AI calls are made: every domain service that needs generative text injects GenkitService and calls a single method, generateResponse(prompt: string). Exponential-backoff retry logic is built directly into generateResponse.
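As a rough illustration of how the OpenAI SDK can be pointed at OpenRouter, the sketch below builds the client options and the chat request as plain objects. The helper names, the OPENROUTER_API_KEY variable, and the single user-message layout are assumptions made for illustration and are not taken from the project; only OPENROUTER_MODEL and the default model identifier appear on this page. The SDK import is left out so the snippet runs without dependencies.

```typescript
// Sketch only: helper names and OPENROUTER_API_KEY are assumptions, not project code.
type Env = Record<string, string | undefined>;

interface ClientOptions {
  baseURL: string;
  apiKey: string;
}

interface ChatRequest {
  model: string;
  messages: { role: "user"; content: string }[];
}

// OpenRouter serves its OpenAI-compatible API under this base URL.
const OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1";
// Default model identifier as documented on this page.
const DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-70B-Instruct";

// Options that would be handed to the OpenAI SDK constructor.
function buildClientOptions(env: Env): ClientOptions {
  const apiKey = env["OPENROUTER_API_KEY"];
  if (!apiKey) {
    throw new Error("OPENROUTER_API_KEY is not set");
  }
  return { baseURL: OPENROUTER_BASE_URL, apiKey };
}

// Request body for a chat completion; the model comes from the environment.
function buildChatRequest(env: Env, prompt: string): ChatRequest {
  return {
    model: env["OPENROUTER_MODEL"] ?? DEFAULT_MODEL,
    messages: [{ role: "user", content: prompt }],
  };
}
```

With the openai package installed, these objects would typically be passed to `new OpenAI(buildClientOptions(process.env))` and `client.chat.completions.create(buildChatRequest(process.env, prompt))`.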
You can swap the model by setting OPENROUTER_MODEL to any OpenRouter-supported identifier, such as anthropic/claude-3.5-sonnet or google/gemini-flash-1.5. The OpenAI-compatible API ensures no code changes are needed.
Retry Logic with Exponential Backoff
generateResponse wraps the API call in a retry loop that handles transient errors gracefully. HTTP 429 (rate limit) and 503 (service unavailable) are treated as retryable; all other errors are surfaced immediately. The delay doubles on each attempt: 1 s -> 2 s -> 4 s.
The loop starts at attempt = 0 and continues while attempt <= MAX_RETRIES, so with MAX_RETRIES = 3 it makes up to 4 total attempts. On each retryable failure the backoff delay is computed as Math.pow(2, attempt) * 1000 milliseconds. A non-retryable error, or a retryable error on the final attempt, rethrows the last captured error to the caller.
This makes the bot resilient to short bursts of API throttling without surfacing confusing errors to end users.
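The retry behaviour described above can be sketched as a small generic helper. This is a minimal illustration, not the project's code: withRetry, statusOf, and the injectable sleep parameter are hypothetical names, and the sketch assumes the thrown error exposes a numeric status property carrying the HTTP status code.

```typescript
// Sketch only: every name except MAX_RETRIES is an illustrative assumption.
const MAX_RETRIES = 3; // attempts 0..3, so up to 4 calls in total

type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reads a numeric HTTP status from an unknown error, if it carries one.
function statusOf(error: unknown): number | undefined {
  if (typeof error === "object" && error !== null && "status" in error) {
    const status = (error as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

async function withRetry<T>(
  call: () => Promise<T>,
  sleep: Sleep = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      const status = statusOf(error);
      // Only rate limiting (429) and service unavailable (503) are retried.
      const retryable = status === 429 || status === 503;
      // Stop on non-retryable errors or once the final attempt has failed.
      if (!retryable || attempt === MAX_RETRIES) {
        break;
      }
      // Backoff doubles per attempt: 1000 ms, 2000 ms, 4000 ms.
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw lastError;
}
```

Passing the sleep function as a parameter is a design choice of this sketch: it lets tests substitute a no-op delay instead of waiting several seconds.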
RAG Pattern — Grounding Responses in Real Data
Salud IA Bot implements Retrieval-Augmented Generation (RAG) entirely in application code. Before calling GenkitService, domain services query SQLite and external APIs to build a context block, then inject that block into the prompt. The LLM’s job is to narrate data it is handed, not to recall facts from pre-training weights.
Context retrieval
SaludAnaliticaService calls SaludPublicaService, AirQualityService, and VaccinationService in parallel to collect SIVIGILA statistics, air quality indicators, and vaccination coverage for the detected region and disease.
Prompt augmentation
The retrieved context is prepended to the user’s message inside a clearly delimited block.
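One way to picture the augmentation step is a function that wraps the retrieved context in delimiters and places it ahead of the user's message. The delimiter strings, the instruction wording, and the function name below are hypothetical, since the project's actual prompt text is not shown on this page.

```typescript
// Sketch only: delimiters, wording, and names are hypothetical.
const CONTEXT_START = "### CONTEXT START ###";
const CONTEXT_END = "### CONTEXT END ###";

// Prepends the retrieved context, inside a delimited block, to the user message.
function buildAugmentedPrompt(contextData: string, userMessage: string): string {
  return [
    CONTEXT_START,
    contextData.trim(),
    CONTEXT_END,
    "Answer using only the information inside the context block above.",
    `User question: ${userMessage.trim()}`,
  ].join("\n");
}
```

Explicit delimiters make it unambiguous to the model where the retrieved data ends and the user's question begins.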
Response generation
The augmented prompt is sent to GenkitService.generateResponse(). The model receives grounded statistics and is explicitly instructed not to generate any information not present in the context block.
Fallback Instruction (No Context Available)
When contextData is absent or contains the [INFO] marker, handleGeneralQuery in bot.update.ts switches to a scope-limiting fallback instruction instead of the RAG context block. This instruction is embedded inline in the prompt string (there is no separate system prompt variable in GenkitService) and sets the following:
- Role: expert in Colombian public health (not a general-purpose assistant)
- Scope: health services, SIVIGILA statistics, mental health, sexual health
- Constraint: return the exact RESPONSE_NO_INFORMATION string for out-of-scope queries
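The fallback decision and the three elements listed above can be sketched as follows. The value of RESPONSE_NO_INFORMATION, the instruction wording, and the function names are placeholders chosen for illustration; treating an empty or whitespace-only context as "absent" is also an assumption.

```typescript
// Sketch only: the constant's value, the wording, and the names are hypothetical.
const RESPONSE_NO_INFORMATION = "NO_INFORMATION_PLACEHOLDER";
const INFO_MARKER = "[INFO]";

// True when there is no usable RAG context for this query.
function needsFallback(contextData: string | undefined): boolean {
  if (contextData === undefined || contextData.trim() === "") {
    return true;
  }
  return contextData.includes(INFO_MARKER);
}

// Inline instruction carrying role, scope, and out-of-scope constraint.
function buildFallbackPrompt(userMessage: string): string {
  return [
    "You are an expert in Colombian public health, not a general-purpose assistant.",
    "Scope: health services, SIVIGILA statistics, mental health, sexual health.",
    `If the question is out of scope, reply with exactly: ${RESPONSE_NO_INFORMATION}`,
    `User question: ${userMessage.trim()}`,
  ].join("\n");
}
```

Because the out-of-scope reply is an exact string, the caller can compare the model's output against the constant instead of parsing free text.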
GenkitService itself contains no system prompt. All prompt engineering, both the RAG context block and the fallback instruction, is constructed by BotUpdate.handleGeneralQuery() in bot.update.ts before the call to GenkitService.generateResponse().
Bypass Pattern — Zero Hallucination for Structured Queries
For queries where the answer is a lookup rather than a synthesis, services return data directly without ever calling the LLM. This is called the bypass pattern and is the primary hallucination-prevention mechanism for high-precision use cases.
Cali urgency detection
CaliHealthService.processCaliQuery() detects urgency intent, complexity level, specific sede names, and service types (odontologia, ginecologia, farmacia) entirely through normalised-text matching against the SQLite provider table. The formatted response is returned directly to BotUpdate; GenkitService is never called.
SIVIGILA statistics
SaludPublicaService.procesarPregunta() runs its full NLP pipeline (normalisation -> synonym search -> statistical analysis -> NLG formatter) in pure TypeScript. For ranking, comparative, and demographic queries it generates a complete Markdown response from real data without invoking the AI.
BYPASS_MARKERS sentinel
When StatsService.getSummary() returns a string that contains one of the BYPASS_MARKERS constants, BotUpdate delivers the string directly instead of forwarding it to GenkitService. This ensures that fully-resolved data responses bypass the LLM even when they arrive through the general stats path.
Provider listings
AntioquiaQuestionsService, YopalQuestionsService, and BoyacaHealthService all return structured provider lists directly from TypeORM queries. No generative step is involved for “find hospitals near me” or “list clinics in Tunja” queries.
Why OpenRouter
Using OpenRouter rather than a direct model API offers three concrete advantages for this project:
| Benefit | Detail |
|---|---|
| No vendor lock-in | The OPENROUTER_MODEL environment variable is all that needs to change to switch from LLaMA to Claude to Gemini. The openai SDK client code is identical for every model. |
| OpenAI-compatible API | The same chat.completions.create call works across all supported models. No SDK swap, no breaking changes. |
| Cost flexibility | Free-tier models (e.g. nvidia/nemotron-3-super-120b-a12b:free) can be used during development; production deployments can switch to a premium model by changing one env var. |
Model Choice: Meta LLaMA 3.1 70B Instruct
The default model is meta-llama/Meta-Llama-3.1-70B-Instruct. According to the project’s CRISP-ML documentation, this model was selected for two reasons:
Response latency
The 70B Instruct variant hits the project’s target of under 3 seconds for typical RAG prompts. Larger frontier models introduce additional latency that degrades the conversational UX on mobile Telegram clients.
Spanish health domain quality
LLaMA 3.1 70B Instruct performs well on Spanish-language text and handles medical terminology, epidemiological phrasing, and Colombian regional context without requiring fine-tuning. Its instruction-following reliability is critical for the “respond only from the context block” RAG constraint.
Composite Risk Scoring (ML Layer)
For predictive queries, PredictiveQuestionsService orchestrates three services that run entirely without the LLM.
MlPredictionService computes a weighted score across four dimensions sourced from the SQLite database:
| Dimension | Weight | Data source |
|---|---|---|
| Case volume (SIVIGILA events) | 40% | HealthEvent table |
| Rurality index | 20% | HealthEvent urban/rural columns |
| Vaccination coverage gap | 25% | Vaccination table |
| Vulnerable population share | 15% | HealthEvent age-group columns |
The resulting score is classified by PredictiveQuestionsService.clasificarRiesgo() and returned directly to BotUpdate, so no LLM call is needed for risk classification. The LLM is only invoked if the user asks a follow-up open-ended question about the result.
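The weighting in the table above can be sketched as a plain weighted sum. The weights are taken from the table; the field names, the function name, and the assumption that each dimension is already normalised to the range 0 to 1 are illustrative and may differ from how MlPredictionService actually prepares its inputs.

```typescript
// Sketch only: field names and the 0..1 normalisation are assumptions.
interface RiskInputs {
  caseVolume: number;      // SIVIGILA case volume
  ruralityIndex: number;   // share of rural cases
  vaccinationGap: number;  // shortfall in vaccination coverage
  vulnerableShare: number; // share of vulnerable age groups
}

// Weights from the table; they sum to 1.
const WEIGHTS: RiskInputs = {
  caseVolume: 0.4,
  ruralityIndex: 0.2,
  vaccinationGap: 0.25,
  vulnerableShare: 0.15,
};

// Keeps a value inside the assumed 0..1 range.
function clamp01(value: number): number {
  return Math.min(1, Math.max(0, value));
}

// Weighted sum of the four dimensions; the result lies between 0 and 1.
function compositeRiskScore(inputs: RiskInputs): number {
  return (
    WEIGHTS.caseVolume * clamp01(inputs.caseVolume) +
    WEIGHTS.ruralityIndex * clamp01(inputs.ruralityIndex) +
    WEIGHTS.vaccinationGap * clamp01(inputs.vaccinationGap) +
    WEIGHTS.vulnerableShare * clamp01(inputs.vulnerableShare)
  );
}
```

As a worked example under these assumptions, inputs of 0.8, 0.5, 0.4, and 0.2 give 0.40 × 0.8 + 0.20 × 0.5 + 0.25 × 0.4 + 0.15 × 0.2 = 0.55.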