XML-to-SQLite Data Pipeline and Persistence Strategy

Salud IA Bot separates data ingestion from data consumption into two clearly distinct phases. XML files from SIVIGILA, the Ministerio de Salud, regional provider registries, and the PAI vaccination programme are parsed once on a developer’s machine and stored in a portable SQLite database. In production the application only ever opens that pre-built database — no XML parsing, no in-memory trees, no startup delay. This design keeps RAM consumption low and delivers sub-3-second responses even on shared-memory hosting tiers such as Render’s free tier.

Two-Phase Approach

Migration Phase (one-time, local)

Run the seed and import scripts from the scripts/ directory. Each script reads one or more XML files with fast-xml-parser or xml2js, maps the parsed records to TypeORM entities, and bulk-saves them to data/salud-ia-bot.db in chunks of 100 rows. This phase runs on the developer’s machine before deployment.

npm run seed:antioquia      # Parses Prestadores_de_Salud_Departamento_de_Antioquia.xml
npm run seed:vaccination    # Parses three PAI XML files
npm run import:data         # Imports all remaining datasets

Production Phase (read-only)

The NestJS application starts, TypeORM opens the SQLite file with synchronize: false, and all service queries run as standard TypeORM find, findOne, and createQueryBuilder calls against the pre-populated tables. No XML is loaded into memory at runtime.

The data/salud-ia-bot.db file should be committed or transferred to your deployment environment. On Render and similar platforms, mount a persistent disk at the project root to preserve the database across deploys.

XML Data Sources

Each XML file maps to a seed or import script and to a runtime service. The table below lists all source files verified against the data/ directory:

| XML File | Source | Content | Migration Script + Service |
| --- | --- | --- | --- |
| `Eventos_de_Interés_en_Salud_Pública_20260514.xml` | SIVIGILA | Transmissible disease events (dengue, zika, malaria, tuberculosis, etc.) | `scripts/import-data.ts` + `HealthDataService` |
| `Salud_Mental.xml` | Ministerio de Salud | CIE-10 mental health diagnoses and care records | `scripts/import-data.ts` + `MentalHealthService` |
| `Salud_sexual_-_preguntas.xml` | Internal | Sexual and reproductive health Q&A | `scripts/import-data.ts` + `SexualHealthService` |
| `Prestadores_de_Salud_Departamento_de_Antioquia.xml` | Regions | Antioquia health providers | `scripts/seed-antioquia.ts` + `AntioquiaHealthService` |
| `Centros_de_salud_Yopal._.xml` | Regions | Yopal health centres with GPS coordinates | `scripts/import-data.ts` + `YopalHealthService` |
| `SERVICIOS_OFERTADOS_RED_DE_SALUD_DEL_CENTRO_ESE_POR_SEDE_CALI.xml` | Regions | Cali services by sede and complexity level | `scripts/import-data.ts` + `CaliHealthService` |
| `servicios_salud_boyaca.xml` | Regions | Boyacá provider catalogue | `scripts/import-data.ts` + `BoyacaHealthService` |
| `Coberturas_administrativas_de_vacunación_por_departamento_20260528.xml` | PAI | Departmental vaccination coverage | `scripts/seed-vaccination.ts` + `VaccinationService` |
| `Cobertura_de_Vacunación_PAI_en_el_Valle_del_Cauca.xml` | PAI | Valle del Cauca PAI coverage | `scripts/seed-vaccination.ts` + `VaccinationService` |
| `DATOS_DE_VACUNACIÓN_EN_NIÑOS_Y_NIÑAS.xml` | PAI | Children’s vaccination data | `scripts/seed-vaccination.ts` + `VaccinationService` |
| `Calidad_del_Aire_en_Colombia_(Promedio_Anual)_20260528.xml` | External API | Annual average air quality indicators by municipality | `AirQualityService` |

TypeORM Configuration

The database module configures TypeORM to use the better-sqlite3 driver. The synchronize: false flag is critical — it ensures the schema is never auto-modified at startup and that the tables seeded by the migration scripts remain intact:

// database.module.ts
TypeOrmModule.forRoot({
  type: 'better-sqlite3',
  database: process.cwd() + '/data/salud-ia-bot.db',
  entities: entities,
  synchronize: false, // schema managed by seed/migration scripts
  logging: false,
});

The entities array is imported from src/entities/index.ts and includes all eight entity classes registered in DataModule:

// data.module.ts — TypeOrmModule.forFeature registration
TypeOrmModule.forFeature([
  BoyacaProvider,
  AntioquiaProvider,
  CaliProvider,
  YopalProvider,
  Vaccination,
  MentalHealth,
  SexualHealth,
  HealthEvent,
])

Seed Script Pattern

All seed and import scripts follow the same three-step pattern: parse the XML with fast-xml-parser (or xml2js for complex nested structures), map each row through a typed mapper function, then bulk-save to SQLite using TypeORM’s chunked save:

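import * as fs from 'fs';
import * as path from 'path';
import { XMLParser } from 'fast-xml-parser';

// 1. Parse the XML file into a plain object
const parser = new XMLParser();
const xmlContent = fs.readFileSync(path.join(__dirname, '../../data/source.xml'), 'utf-8');
const data = parser.parse(xmlContent);

// 2. Extract and map the rows. Element names vary per dataset (Eventos/Evento is the
//    SIVIGILA case); a lone child parses as an object, so normalise to an array.
const rows = [data.Eventos.Evento].flat();
const entities = rows.map(mapper); // typed mapper function per dataset

// 3. Bulk-save in batches of 100 rows
await repo.save(entities, { chunk: 100 });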
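
The batching behind the chunk option in the save call above can be pictured with a small helper. This is only an illustration of the idea, not TypeORM’s internal code; the function name chunk is made up for the example.

```typescript
// Split an array into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 250 rows become three batches: 100, 100 and 50.
const sizes = chunk(Array.from({ length: 250 }, (_, i) => i), 100).map((b) => b.length);
console.log(sizes); // [ 100, 100, 50 ]
```

Each batch would then be written as its own insert, so no single statement carries more than 100 rows’ worth of bound parameters.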

The chunk: 100 option splits large inserts into batches of 100 rows, which keeps each statement under SQLite’s parameter-binding limit on datasets with thousands of records.

An example SIVIGILA record in the source XML looks like this:

<Eventos>
  <Evento>
    <nombre_del_evento>Dengue</nombre_del_evento>
    <total_de_eventos>15420</total_de_eventos>
    <femenino>8200</femenino>
    <masculino>7220</masculino>
    <urbano>9800</urbano>
    <rural>5620</rural>
    <fecha_notificaci_n>2024-01-15</fecha_notificaci_n>
  </Evento>
</Eventos>
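
A mapper for a record of this shape might look like the sketch below. The names RawEvento, HealthEventRow, toCount, and mapEvento are hypothetical, and the target field names are assumptions; the real HealthEvent entity may name its columns differently. Counts are accepted as either strings or numbers because the parsed type depends on the XML parser’s options.

```typescript
// Shape of one <Evento> element after XML parsing.
interface RawEvento {
  nombre_del_evento: string;
  total_de_eventos: string | number;
  femenino: string | number;
  masculino: string | number;
  urbano: string | number;
  rural: string | number;
  fecha_notificaci_n: string;
}

// Hypothetical target shape (assumed field names, not the project's entity).
interface HealthEventRow {
  eventName: string;
  totalCases: number;
  female: number;
  male: number;
  urban: number;
  rural: number;
  notificationDate: string;
}

// Convert a count to an integer, falling back to 0 for missing or non-numeric input.
function toCount(value: string | number | undefined): number {
  const n = Number(value);
  return Number.isFinite(n) ? Math.trunc(n) : 0;
}

// Map one parsed XML record to the row shape that would be saved.
function mapEvento(raw: RawEvento): HealthEventRow {
  return {
    eventName: String(raw.nombre_del_evento).trim(),
    totalCases: toCount(raw.total_de_eventos),
    female: toCount(raw.femenino),
    male: toCount(raw.masculino),
    urban: toCount(raw.urbano),
    rural: toCount(raw.rural),
    notificationDate: String(raw.fecha_notificaci_n).trim(),
  };
}
```

Keeping the type coercion in one place means a malformed count in the source file produces a zero rather than a NaN in the database; whether zero is the right fallback is a design choice for each dataset.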

Data Models

The eight TypeORM entities cover five conceptual domains:

HealthEvent

Maps SIVIGILA transmissible disease records. Fields include event name, total cases, female/male split, urban/rural split, age groups (infant through elderly), and notification date. Queried by HealthDataService and SaludPublicaService.

MentalHealth (Diagnosis)

Stores CIE-10 mental health diagnosis entries from Salud_Mental.xml. Fields include diagnosis code, diagnosis name, total cases, and demographic breakdowns. Queried by MentalHealthService.

SexualHealth (QA)

A question-and-answer store from Salud_sexual_-_preguntas.xml. Each row holds a question string and a pre-written respuesta text. SexualHealthService runs keyword search across question fields.
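
The matching logic of a keyword search over questions could be sketched as follows. This is an in-memory illustration under assumptions, not the service’s actual query: the QaRow shape, the question field name, and the scoring rule (count of matched keywords) are all hypothetical. Accents are stripped so that a query typed without them still matches Spanish text.

```typescript
interface QaRow {
  question: string;  // assumed field name
  respuesta: string; // pre-written answer text
}

// Lower-case and strip diacritics so "anticoncepción" matches "anticoncepcion".
function normalize(text: string): string {
  return text
    .toLowerCase()
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
}

// Return the row whose question contains the most query keywords, or null if none match.
function bestMatch(query: string, rows: QaRow[]): QaRow | null {
  // Split on anything that is not a letter or digit and ignore very short words.
  const keywords = normalize(query)
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 2);
  let best: QaRow | null = null;
  let bestScore = 0;
  for (const row of rows) {
    const question = normalize(row.question);
    const score = keywords.filter((w) => question.includes(w)).length;
    if (score > bestScore) {
      best = row;
      bestScore = score;
    }
  }
  return best;
}
```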

Provider entities

Four separate entities — AntioquiaProvider, BoyacaProvider, CaliProvider, YopalProvider — reflect the different schemas of each regional dataset. YopalProvider includes latitud and longitud columns to support the Haversine geosearch.
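
For reference, a Haversine nearest-centre lookup over latitud and longitud values can be sketched as below. The formula itself is standard; the GeoPoint interface and the haversineKm and nearest function names are made up for the example, and the project’s own geosearch may be structured differently.

```typescript
const EARTH_RADIUS_KM = 6371; // mean Earth radius

interface GeoPoint {
  latitud: number;  // degrees
  longitud: number; // degrees
}

const toRadians = (deg: number): number => (deg * Math.PI) / 180;

// Great-circle distance in kilometres between two points (Haversine formula).
function haversineKm(a: GeoPoint, b: GeoPoint): number {
  const dLat = toRadians(b.latitud - a.latitud);
  const dLon = toRadians(b.longitud - a.longitud);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.latitud)) * Math.cos(toRadians(b.latitud)) * Math.sin(dLon / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot before asin.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Return up to `limit` centres sorted by distance from the user, nearest first.
function nearest<T extends GeoPoint>(user: GeoPoint, centres: T[], limit = 3): T[] {
  return [...centres]
    .sort((x, y) => haversineKm(user, x) - haversineKm(user, y))
    .slice(0, limit);
}
```

Sorting a copy of the array leaves the caller’s list untouched, which matters if the rows come from a cached query result.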

Vaccination

Stores PAI departmental and municipal coverage records from all three vaccination XML files. Fields include department name, vaccine type, and coverage percentage. VaccinationService exposes getAllDepartament() and per-vaccine queries consumed by MlPredictionService for the composite risk score.

Benefits of the SQLite Approach

Zero parse overhead

No XML is loaded at application startup. TypeORM opens the SQLite file in milliseconds, and SQLite resolves each query against on-disk tables and indexes instead of traversing a full in-memory XML tree.

Reduced RAM usage

The large Antioquia dataset (thousands of providers) and the vaccination data stay on disk in SQLite instead of being held in memory as XML trees. Services such as AntioquiaHealthService and VaccinationService use TypeORM repository queries instead of loading arrays into memory.

Faster response times

The NestJS CacheModule, used in DataModule and BotModule, keeps frequently queried results in memory. DatasetBuilderService additionally maintains a 24-hour in-process cache for the data tensors fed to the ML prediction models.
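
A 24-hour in-process cache of the kind described for DatasetBuilderService can be sketched as a small time-to-live wrapper. The TtlCache class below is hypothetical and is not taken from the project; it only illustrates the rebuild-on-expiry idea, with an injectable clock so the behaviour can be checked without waiting.

```typescript
// Minimal in-process cache holding one value with a fixed time-to-live (TTL).
class TtlCache<T> {
  private value!: T;
  private hasValue = false;
  private expiresAt = 0;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful in tests
  }

  // Return the cached value, rebuilding it when missing or expired.
  get(build: () => T): T {
    if (!this.hasValue || this.now() >= this.expiresAt) {
      this.value = build();
      this.hasValue = true;
      this.expiresAt = this.now() + this.ttlMs;
    }
    return this.value;
  }
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Example: cache a (placeholder) dataset for 24 hours.
const datasetCache = new TtlCache<number[][]>(DAY_MS);
const dataset = datasetCache.get(() => [[1, 2], [3, 4]]);
```

Because the cache lives in process memory, it is emptied on every restart or redeploy, and the first request afterwards pays the rebuild cost.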

The migration scripts are standalone TypeScript files (not NestJS modules) and must be run with ts-node outside the NestJS application lifecycle. They connect directly to TypeORM via DataSource and terminate after the import is complete. They do not run when npm run start:dev or npm run start:prod is executed.
