AgroIA uses PostgreSQL with the pgvector extension to store agronomic reports alongside their 768-dimensional semantic embeddings, enabling cosine-similarity search at query time. The schema is defined inDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/sdarionicolas-boop/AgroIA-RAG/llms.txt
Use this file to discover all available pages before exploring further.
01_migrate_schema.sql and is fully idempotent — it can be run against a fresh database for initial setup or against an existing one to apply incremental migrations without data loss.
Running the migration
Start the PostgreSQL container
pgvector/pgvector image so the vector extension is pre-installed.The migration is idempotent. All
CREATE TABLE, CREATE INDEX, and ALTER TABLE statements use IF NOT EXISTS, and the duplicate-cleanup DELETE is safe to re-run on an empty table. You can apply it repeatedly without side effects.Table: informes_lotes
One row per lot. Holds the latest consolidated report, the full technical content used by the RAG engine, and the vector embedding for semantic search.
Unique constraint: UNIQUE (lote_id) — each lot has exactly one current report. Subsequent ingestions perform an ON CONFLICT (lote_id) DO UPDATE.
| Column | Type | Notes |
|---|---|---|
id | SERIAL | Primary key |
lote_id | VARCHAR(100) | Business identifier for the lot |
fecha | DATE | Report date |
ndvi_promedio | DOUBLE PRECISION | Average NDVI for the critical month |
gdd_acumulados | DOUBLE PRECISION | Accumulated growing degree days |
score_total | INTEGER | AgroIA Score 0–100 |
cv_espacial | DOUBLE PRECISION | Spatial coefficient of variation (NDVI) |
zona_activa | BOOLEAN | Whether zone-C delineation is active |
puntos_zona_c | INTEGER | Number of points classified as zone C |
cultivo | VARCHAR(50) | Crop type (e.g. maiz, soja) |
superficie_ha | DOUBLE PRECISION | Lot area in hectares |
contenido_tecnico | TEXT | Full agronomic report text; source for RAG retrieval |
metadata | JSONB | Arbitrary key-value data from the pipeline |
embedding | vector(768) | Semantic embedding generated by nomic-embed-text |
created_at | TIMESTAMP | Row creation time |
updated_at | TIMESTAMP | Last upsert time |
Table: lote_historial
Time series table. One row per lot per campaign year. Populated from the historial_anos array in the ingestion payload. Used by the Streamlit dashboard for historical charts and by the RAG engine for temporal context.
Unique constraint: UNIQUE (lote_id, anio) — one record per lot per year. Subsequent ingestions perform an ON CONFLICT (lote_id, anio) DO UPDATE.
| Column | Type | Notes |
|---|---|---|
id | SERIAL | Primary key |
lote_id | VARCHAR(100) | Foreign reference to informes_lotes.lote_id |
anio | INTEGER | Campaign year (ASCII column name — see note below) |
cultivo | VARCHAR(50) | Crop type for that campaign |
ndvi_critico | DOUBLE PRECISION | NDVI at the critical phenological stage |
horas_calor | DOUBLE PRECISION | Accumulated heat hours (NASA POWER) |
score_total | INTEGER | AgroIA Score for that year |
score_vigor | DOUBLE PRECISION | Vigor sub-score (40% weight) |
score_estabilidad | DOUBLE PRECISION | Stability sub-score (30% weight) |
score_limpieza | DOUBLE PRECISION | Cleanliness sub-score (20% weight) |
score_clima | DOUBLE PRECISION | Climate sub-score (10% weight) |
valido_para_score | BOOLEAN | Whether the row is used in score calculations |
superficie_ha | DOUBLE PRECISION | Field area in hectares for this campaign |
cv_espacial | DOUBLE PRECISION | Spatial coefficient of variation for the campaign |
zonificacion_activa | BOOLEAN | Whether zone delineation was active |
puntos_zona_c | INTEGER | Zone-C point count for that year |
Indexes
Both tables have indexes on the columns most frequently filtered or ordered by the Streamlit dashboard and the RAG query layer:| Index | Table | Column(s) |
|---|---|---|
idx_lotes_lote_id | informes_lotes | lote_id |
idx_lotes_cultivo | informes_lotes | cultivo |
idx_lotes_score | informes_lotes | score_total |
idx_historial_lote | lote_historial | lote_id |
idx_historial_anio | lote_historial | anio |
idx_historial_cultivo | lote_historial | cultivo |
idx_historial_score | lote_historial | score_total |
Advanced topics
pgvector cosine similarity search
pgvector cosine similarity search
The For large datasets, add an
embedding column in informes_lotes stores 768-dimensional vectors produced by nomic-embed-text. The RAG engine queries using the <=> cosine distance operator:ivfflat or hnsw index on the embedding column to speed up approximate nearest-neighbor search.Changing the embedding dimension
Changing the embedding dimension
If you switch to a model that produces a different vector size, you must drop and recreate the
embedding column with the new dimension, then re-ingest all lots to regenerate embeddings. The current schema is fixed at vector(768) for nomic-embed-text.