AI Engine: OpenRouter, LLaMA 3.1, and RAG Architecture

The AI layer of Salud IA Bot is intentionally thin. Rather than adopting a dedicated LLM framework, the project uses the official OpenAI Node SDK pointed at OpenRouter’s base URL. This gives the application a standard, well-maintained client, keeps every model available on OpenRouter within reach, and avoids lock-in to a single model vendor. Most of the intelligence lives in the domain services that construct the prompts, not in the model itself.

GenkitService — The Central AI Client

GenkitService is a NestJS @Injectable() that wraps the OpenAI SDK. It is the only place in the codebase where AI calls are made. All domain services that need generative text inject GenkitService and call a single method: generateResponse(prompt: string). The full implementation includes exponential-backoff retry logic built directly into generateResponse:

// src/bot/genkit.service.ts
import { Injectable, Logger } from '@nestjs/common';
import OpenAI from 'openai';

@Injectable()
export class GenkitService {
  private readonly logger = new Logger(GenkitService.name);
  private readonly openai = new OpenAI({
    apiKey: process.env.OPENROUTER_API_KEY ?? 'test',
    baseURL: 'https://openrouter.ai/api/v1',
  });

  private async sleep(ms: number) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  async generateResponse(prompt: string): Promise<string> {
    const MAX_RETRIES = 3;
    let lastError: any;
    const model =
      process.env.OPENROUTER_MODEL || 'meta-llama/Meta-Llama-3.1-70B-Instruct';

    for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
      try {
        const response = await this.openai.chat.completions.create({
          model,
          messages: [{ role: 'user', content: prompt }],
        });
        return response.choices[0].message.content ?? '';
      } catch (error: any) {
        lastError = error;
        const isTransient = error?.status === 429 || error?.status === 503;
        if (isTransient && attempt < MAX_RETRIES) {
          const delay = Math.pow(2, attempt) * 1000;
          this.logger.warn(
            `OpenRouter transient error (${error.status || error.code}). Retrying in ${delay}ms... (Attempt ${attempt + 1}/${MAX_RETRIES})`,
          );
          await this.sleep(delay);
          continue;
        }
        this.logger.error(`OpenRouter API failed after ${attempt} retries: ${error.message}`);
        throw error;
      }
    }
    throw lastError;
  }
}

You can swap the model by setting OPENROUTER_MODEL to any OpenRouter-supported identifier, such as anthropic/claude-3.5-sonnet or google/gemini-flash-1.5. The OpenAI-compatible API ensures no code changes are needed.
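One detail worth noting in the lookup above: the fallback uses `||` rather than `??`, so an empty `OPENROUTER_MODEL` also selects the default. The sketch below isolates that behaviour as a pure function. The names `resolveModel` and `Env` are illustrative only; GenkitService performs the lookup inline.

```typescript
// Illustrative helper, not part of the project.
type Env = Record<string, string | undefined>;

const DEFAULT_MODEL = 'meta-llama/Meta-Llama-3.1-70B-Instruct';

function resolveModel(env: Env): string {
  // `||` treats both undefined and '' as "not set", matching generateResponse.
  return env.OPENROUTER_MODEL || DEFAULT_MODEL;
}
```

Calling `resolveModel(process.env)` with `OPENROUTER_MODEL=anthropic/claude-3.5-sonnet` would return that identifier unchanged.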

Retry Logic with Exponential Backoff

generateResponse wraps the API call in a retry loop for transient errors. HTTP 429 (rate limit) and 503 (service unavailable) are treated as retryable; all other errors are rethrown immediately. With MAX_RETRIES = 3 the loop runs attempt = 0 through 3, so at most 4 requests are made. After each retryable failure the delay is Math.pow(2, attempt) * 1000 milliseconds, which doubles each time: 1 s -> 2 s -> 4 s. Once the retries are exhausted, the last captured error is thrown to the caller. This keeps the bot resilient to short bursts of API throttling without surfacing transient errors to end users.
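The schedule can be checked in isolation. This is a minimal sketch that reproduces only the delay arithmetic and the retryable-status test from generateResponse; the helper names (`isTransient`, `backoffDelayMs`, `backoffSchedule`) are illustrative and do not exist in the project.

```typescript
// Statuses that generateResponse treats as retryable.
const TRANSIENT_STATUSES = [429, 503];

function isTransient(status: number | undefined): boolean {
  return status !== undefined && TRANSIENT_STATUSES.indexOf(status) !== -1;
}

// Delay before the next request, mirroring Math.pow(2, attempt) * 1000.
function backoffDelayMs(attempt: number): number {
  return Math.pow(2, attempt) * 1000;
}

// Every wait that occurs when all retries are consumed.
function backoffSchedule(maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(backoffDelayMs(attempt));
  }
  return delays;
}

const schedule = backoffSchedule(3); // [1000, 2000, 4000]
const worstCaseWaitMs = schedule.reduce((sum, d) => sum + d, 0); // 7000
```

In the worst case a caller therefore waits about 7 seconds in backoff, plus the duration of the four requests themselves, before the error is thrown.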

RAG Pattern — Grounding Responses in Real Data

Salud IA Bot implements Retrieval-Augmented Generation (RAG) entirely in application code. Before calling GenkitService, domain services query SQLite and external APIs to build a context block, then inject that block into the prompt. The LLM’s job is to narrate data it is handed, not to recall facts from pre-training weights.

Context retrieval

SaludAnaliticaService calls SaludPublicaService, AirQualityService, and VaccinationService in parallel to collect SIVIGILA statistics, air quality indicators, and vaccination coverage for the detected region and disease.
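The parallel fan-out might look roughly like the sketch below. It assumes each service can be reduced to an async function returning a text fragment, and it assumes that a failing source should be skipped rather than abort the whole lookup; neither the real method signatures nor the project's failure policy is shown in this document, and `ContextSource` and `collectContext` are hypothetical names.

```typescript
// Hypothetical shape: one async lookup per data source.
type ContextSource = (region: string, disease: string) => Promise<string>;

async function collectContext(
  sources: Record<string, ContextSource>,
  region: string,
  disease: string,
): Promise<string> {
  const names = Object.keys(sources);
  // All lookups start before any is awaited, so they run concurrently.
  const sections = await Promise.all(
    names.map((name) =>
      sources[name](region, disease).then(
        (value) => `[${name}] ${value}`,
        () => null, // assumption: a failed source is left out of the context
      ),
    ),
  );
  return sections.filter((s): s is string => s !== null).join('\n');
}
```

The returned string plays the role of `contextData` in the prompt template; an empty string would mean that no source produced usable data.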

Prompt augmentation

The retrieved context is prepended to the user’s message inside a clearly delimited block:

// src/bot/bot.update.ts — handleGeneralQuery (context-available branch)
augmentedPrompt = `
### CONTEXTO DE DATOS REALES (COLOMBIA) ###
${contextData}
### FIN DEL CONTEXTO ###

INSTRUCCIÓN: Responde a la consulta del usuario utilizando EXCLUSIVAMENTE los datos del contexto anterior.
Si el contexto no contiene información relevante para responder la consulta, responde EXACTAMENTE con este mensaje: "${RESPONSE_NO_INFORMATION}"
Si el contexto contiene estadísticas, limítate a analizarlas y presentarlas. NO generes información que no esté presente en el contexto.

Consulta: ${text}
`;

Response generation

The augmented prompt is sent to GenkitService.generateResponse(). The model receives grounded statistics and is explicitly instructed not to generate any information not present in the context block.

Delivery

The response is passed through sendLongMessage() and delivered to the user as one or more Telegram messages.
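Telegram limits a text message to 4096 characters, which is why long answers have to be split. The implementation of sendLongMessage() is not shown here, so the following is only a sketch of one plausible chunking strategy: cut at the last newline inside the window and fall back to a hard cut. The name `splitMessage` is hypothetical.

```typescript
const TELEGRAM_MAX_LENGTH = 4096; // Telegram Bot API limit for one text message

function splitMessage(text: string, limit: number = TELEGRAM_MAX_LENGTH): string[] {
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > limit) {
    // Prefer the last newline at or before the limit; otherwise cut hard.
    const breakAt = rest.lastIndexOf('\n', limit);
    const cut = breakAt > 0 ? breakAt : limit;
    chunks.push(rest.slice(0, cut));
    // Drop the newline that was used as the split point, if any.
    rest = rest.slice(cut).replace(/^\n/, '');
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

A hard cut can land in the middle of a Markdown entity, so a production version would likely need to handle formatting boundaries as well.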

Fallback Instruction (No Context Available)

When contextData is absent or contains the [INFO] marker, handleGeneralQuery in bot.update.ts switches to a scope-limiting fallback instruction instead of the RAG context block. This instruction is embedded inline in the prompt string — there is no separate system prompt variable in GenkitService:

// src/bot/bot.update.ts — handleGeneralQuery (no-context branch)
augmentedPrompt = `Consulta: ${text}

INSTRUCCIÓN: Como asistente experto en salud pública colombiana, si la consulta no está relacionada
con tus capacidades (servicios de salud, estadísticas de salud pública, salud mental o sexual),
responde EXACTAMENTE con este mensaje: "${RESPONSE_NO_INFORMATION}"`;

This fallback establishes:

Role: expert in Colombian public health (not a general-purpose assistant)
Scope: health services, public health statistics, mental health, sexual health
Constraint: return the exact RESPONSE_NO_INFORMATION string for out-of-scope queries

GenkitService itself contains no system prompt. All prompt engineering — both the RAG context block and the fallback instruction — is constructed by BotUpdate.handleGeneralQuery() in bot.update.ts before the call to GenkitService.generateResponse().
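The choice between the two prompt shapes can be summarised as a single function. This is a condensed sketch, not the code in bot.update.ts: the instruction text is abbreviated from the templates shown earlier, `buildPrompt` and `hasUsableContext` are hypothetical names, and the value of `RESPONSE_NO_INFORMATION` is a placeholder because the real constant is not listed in this document.

```typescript
// Placeholder; the project defines the real message elsewhere.
const RESPONSE_NO_INFORMATION = '<no-information message>';
const INFO_MARKER = '[INFO]';

// Context is usable when it is non-empty and does not carry the [INFO] marker.
function hasUsableContext(contextData?: string | null): contextData is string {
  return !!contextData && contextData.indexOf(INFO_MARKER) === -1;
}

function buildPrompt(text: string, contextData?: string | null): string {
  if (hasUsableContext(contextData)) {
    // RAG branch: delimited context block first, then the constraint, then the query.
    return [
      '### CONTEXTO DE DATOS REALES (COLOMBIA) ###',
      contextData,
      '### FIN DEL CONTEXTO ###',
      '',
      'INSTRUCCIÓN: Responde a la consulta del usuario utilizando EXCLUSIVAMENTE los datos del contexto anterior.',
      `Si el contexto no contiene información relevante, responde EXACTAMENTE con este mensaje: "${RESPONSE_NO_INFORMATION}"`,
      '',
      `Consulta: ${text}`,
    ].join('\n');
  }
  // Fallback branch: no context block, only the scope-limiting instruction.
  return [
    `Consulta: ${text}`,
    '',
    'INSTRUCCIÓN: Como asistente experto en salud pública colombiana, si la consulta no está relacionada con tus capacidades,',
    `responde EXACTAMENTE con este mensaje: "${RESPONSE_NO_INFORMATION}"`,
  ].join('\n');
}
```

Keeping both branches in one function makes it easy to unit-test that a context containing `[INFO]` never reaches the model as if it were real data.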

Bypass Pattern — Zero Hallucination for Structured Queries

For queries where the answer is a lookup rather than a synthesis, services return data directly without ever calling the LLM. This is called the bypass pattern and is the primary hallucination-prevention mechanism for high-precision use cases.

Cali urgency detection

CaliHealthService.processCaliQuery() detects urgency intent, complexity level, specific sede names, and service types (odontologia, ginecologia, farmacia) entirely through normalised-text matching against the SQLite provider table. The formatted response is returned directly to BotUpdate — GenkitService is never called.
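Normalised-text matching of this kind usually means lowercasing the query and stripping diacritics before looking for keywords. The sketch below illustrates the idea under stated assumptions: the three service types come from the paragraph above, while the urgency vocabulary and the names `normalise`, `detectIntent`, and `CaliIntent` are hypothetical, and the real service also matches against the SQLite provider table.

```typescript
// Lowercase, strip diacritics, and collapse whitespace so that
// "Odontología" and "odontologia" compare equal.
function normalise(text: string): string {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
    .replace(/\s+/g, ' ')
    .trim();
}

const SERVICE_TYPES = ['odontologia', 'ginecologia', 'farmacia'];
// Hypothetical urgency vocabulary; the project's keyword list is not shown here.
const URGENCY_TERMS = ['urgencia', 'emergencia'];

interface CaliIntent {
  urgent: boolean;
  services: string[];
}

function detectIntent(query: string): CaliIntent {
  const q = normalise(query);
  return {
    urgent: URGENCY_TERMS.some((term) => q.indexOf(term) !== -1),
    services: SERVICE_TYPES.filter((service) => q.indexOf(service) !== -1),
  };
}
```

Because every step is deterministic string handling, the same query always yields the same intent, which is the property the bypass pattern relies on.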

SIVIGILA statistics

SaludPublicaService.procesarPregunta() runs its full NLP pipeline (normalisation -> synonym search -> statistical analysis -> NLG formatter) in pure TypeScript. For ranking, comparative, and demographic queries it generates a complete Markdown response from real data without invoking the AI.
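To make the four stages concrete, here is a deliberately small sketch of a ranking query handled without a model. Everything in it is illustrative: the synonym table, the `EventRow` shape, and the function names are assumptions, and the real pipeline in SaludPublicaService is considerably richer.

```typescript
interface EventRow {
  department: string;
  event: string;
  cases: number;
}

// Hypothetical synonym table: normalised phrase -> canonical event name.
const SYNONYMS: Record<string, string> = {
  dengue: 'dengue',
  malaria: 'malaria',
  paludismo: 'malaria',
};

// Stage 1: normalisation (lowercase, no diacritics).
function normalise(text: string): string {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').toLowerCase().trim();
}

// Stage 2: synonym search; returns the first known phrase found in the question.
function findEvent(question: string): string | null {
  const q = normalise(question);
  const keys = Object.keys(SYNONYMS);
  for (let i = 0; i < keys.length; i++) {
    if (q.indexOf(keys[i]) !== -1) return SYNONYMS[keys[i]];
  }
  return null;
}

// Stage 3: statistical step; departments ordered by case count, highest first.
function rankDepartments(rows: EventRow[], event: string, top: number): EventRow[] {
  return rows
    .filter((r) => r.event === event) // filter copies, so sort does not mutate the input
    .sort((a, b) => b.cases - a.cases)
    .slice(0, top);
}

// Stage 4: NLG formatter; renders the ranking as Markdown.
function formatRanking(event: string, ranked: EventRow[]): string {
  const lines = ranked.map((r, i) => `${i + 1}. ${r.department}: ${r.cases} cases`);
  return [`**Top departments for ${event}**`].concat(lines).join('\n');
}

function answerRanking(question: string, rows: EventRow[], top: number = 3): string | null {
  const event = findEvent(question);
  return event === null ? null : formatRanking(event, rankDepartments(rows, event, top));
}
```

A `null` result signals that the question could not be resolved deterministically, at which point a caller could fall through to the RAG path.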

BYPASS_MARKERS sentinel

When StatsService.getSummary() returns a string that contains one of the BYPASS_MARKERS constants, BotUpdate delivers the string directly instead of forwarding it to GenkitService. This ensures that fully-resolved data responses bypass the LLM even when they arrive through the general stats path.
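The sentinel check reduces to a substring test before routing. In this sketch the marker values are invented for illustration, since the actual BYPASS_MARKERS constants are not listed in this document, and `shouldBypassLlm` and `routeSummary` are hypothetical names.

```typescript
// Hypothetical marker values; the project defines the real constants.
const BYPASS_MARKERS: readonly string[] = ['[BYPASS]', '[DIRECT]'];

function shouldBypassLlm(
  summary: string,
  markers: readonly string[] = BYPASS_MARKERS,
): boolean {
  return markers.some((marker) => summary.indexOf(marker) !== -1);
}

type Route = { via: 'direct' | 'llm'; payload: string };

// Decide whether a stats summary is delivered as-is or forwarded to the model.
function routeSummary(summary: string): Route {
  return shouldBypassLlm(summary)
    ? { via: 'direct', payload: summary }
    : { via: 'llm', payload: summary };
}
```

Whether the marker is stripped before delivery is not stated in this document, so the sketch leaves the payload untouched.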

Provider listings

AntioquiaQuestionsService, YopalQuestionsService, and BoyacaHealthService all return structured provider lists directly from TypeORM queries. No generative step is involved for “find hospitals near me” or “list clinics in Tunja” queries.

Why OpenRouter

Using OpenRouter rather than a direct model API offers three concrete advantages for this project:

| Benefit | Detail |
| --- | --- |
| No vendor lock-in | The `OPENROUTER_MODEL` environment variable is all that needs to change to switch from LLaMA to Claude to Gemini. The `openai` SDK client code is identical for every model. |
| OpenAI-compatible API | The same `chat.completions.create` call works across all supported models. No SDK swap, no breaking changes. |
| Cost flexibility | Free-tier models (e.g. `nvidia/nemotron-3-super-120b-a12b:free`) can be used during development; production deployments can switch to a premium model by changing one env var. |

Model Choice: Meta LLaMA 3.1 70B Instruct

The default model is meta-llama/Meta-Llama-3.1-70B-Instruct. According to the project’s CRISP-ML documentation, this model was selected for two reasons:

Response latency

The 70B Instruct variant hits the project’s target of under 3 seconds for typical RAG prompts. Larger frontier models introduce additional latency that degrades the conversational UX on mobile Telegram clients.

Spanish health domain quality

LLaMA 3.1 70B Instruct performs well on Spanish-language text and handles medical terminology, epidemiological phrasing, and Colombian regional context without requiring fine-tuning. Its instruction-following reliability is critical for the “respond only from the context block” RAG constraint.

If you switch to a model that does not reliably follow instructions embedded in the prompt (e.g. a base model rather than an instruct variant), the RAG constraint may be ignored and the bot could generate hallucinated statistics. Always use an instruct or chat-tuned model variant.

Composite Risk Scoring (ML Layer)

For predictive queries, PredictiveQuestionsService orchestrates three services that run entirely without the LLM:

MlPredictionService      -> Composite risk score (BAJO / MEDIO / ALTO / CRITICO)
AdvancedPredictionService -> Holt-Winters-inspired time-series projection
EarlyWarningService       -> Threshold-based outbreak alert generation

MlPredictionService computes a weighted score across four dimensions sourced from the SQLite database:

| Dimension | Weight | Data source |
| --- | --- | --- |
| Case volume (SIVIGILA events) | 40% | `HealthEvent` table |
| Rurality index | 20% | `HealthEvent` urban/rural columns |
| Vaccination coverage gap | 25% | `Vaccination` table |
| Vulnerable population share | 15% | `HealthEvent` age-group columns |

The structured score output (numerical breakdown, risk level, and recommendations) is formatted by PredictiveQuestionsService.clasificarRiesgo() and returned directly to BotUpdate, so no LLM call is needed for risk classification. The LLM is invoked only if the user asks an open-ended follow-up question about the result.
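As a rough illustration of the weighting, the sketch below combines four inputs using the 40/20/25/15 split from the table. It assumes each dimension has already been normalised to the range 0 to 1, and the cut-offs that map a score to BAJO, MEDIO, ALTO, or CRITICO are invented for the example; this document does not state how MlPredictionService normalises its inputs or where it draws the thresholds.

```typescript
// Each dimension is assumed to be pre-normalised to [0, 1].
interface RiskInputs {
  caseVolume: number;
  rurality: number;
  vaccinationGap: number;
  vulnerableShare: number;
}

// Weights taken from the table above.
const WEIGHTS: RiskInputs = {
  caseVolume: 0.4,
  rurality: 0.2,
  vaccinationGap: 0.25,
  vulnerableShare: 0.15,
};

type RiskLevel = 'BAJO' | 'MEDIO' | 'ALTO' | 'CRITICO';

function compositeScore(x: RiskInputs): number {
  return (
    WEIGHTS.caseVolume * x.caseVolume +
    WEIGHTS.rurality * x.rurality +
    WEIGHTS.vaccinationGap * x.vaccinationGap +
    WEIGHTS.vulnerableShare * x.vulnerableShare
  );
}

// Hypothetical cut-offs, chosen only to make the example complete.
function classify(score: number): RiskLevel {
  if (score >= 0.75) return 'CRITICO';
  if (score >= 0.5) return 'ALTO';
  if (score >= 0.25) return 'MEDIO';
  return 'BAJO';
}
```

For example, inputs of 0.8, 0.5, 0.6, and 0.2 give 0.32 + 0.10 + 0.15 + 0.03 = 0.60, which these illustrative cut-offs would label ALTO.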
