PostgreSQL and pgvector database schema

AgroIA uses PostgreSQL with the pgvector extension to store agronomic reports alongside their 768-dimensional semantic embeddings, which enables cosine-similarity search at query time. The schema is defined in 01_migrate_schema.sql and is idempotent: it can be run against a fresh database for initial setup, or against an existing one to apply incremental migrations without data loss.

Running the migration

Start the PostgreSQL container

docker run -d \
  --name postgres-agri \
  -e POSTGRES_USER=agri_user \
  -e POSTGRES_PASSWORD=your_password \
  -e POSTGRES_DB=agri_db \
  -p 5432:5432 \
  pgvector/pgvector:pg16

Use the official pgvector/pgvector image: it ships PostgreSQL with the pgvector extension already installed, so CREATE EXTENSION vector works without a separate build step.

Apply the schema

Linux / macOS

psql -h localhost -p 5432 -U agri_user -d agri_db -f 01_migrate_schema.sql

Windows (PowerShell, via docker exec)

Get-Content 01_migrate_schema.sql | docker exec -i postgres-agri psql -U agri_user -d agri_db

Verify the tables exist

The final SELECT in the migration script prints row counts for both tables:

 tabla            | registros
------------------+-----------
 informes_lotes   |         0
 lote_historial   |         0

The migration is idempotent. All CREATE TABLE, CREATE INDEX, and ALTER TABLE statements use IF NOT EXISTS, and the duplicate-cleanup DELETE removes nothing when the table is empty or already free of duplicates, so you can apply the script repeatedly without side effects.

Table: `informes_lotes`

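One row per lot. The table holds the latest consolidated report, the full technical content used by the RAG engine, and the vector embedding for semantic search. A UNIQUE (lote_id) constraint guarantees that each lot has exactly one current report; subsequent ingestions of the same lot run INSERT ... ON CONFLICT (lote_id) DO UPDATE and update the existing row in place. ON CONFLICT (lote_id) only works while that unique constraint exists, and the CREATE TABLE statement shown below does not declare it inline.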

Column	Type	Notes
`id`	`SERIAL`	Primary key
`lote_id`	`VARCHAR(100)`	Business identifier for the lot
`fecha`	`DATE`	Report date
`ndvi_promedio`	`DOUBLE PRECISION`	Average NDVI for the critical month
`gdd_acumulados`	`DOUBLE PRECISION`	Accumulated growing degree days
`score_total`	`INTEGER`	AgroIA Score 0–100
`cv_espacial`	`DOUBLE PRECISION`	Spatial coefficient of variation (NDVI)
`zona_activa`	`BOOLEAN`	Whether zone-C delineation is active
`puntos_zona_c`	`INTEGER`	Number of points classified as zone C
`cultivo`	`VARCHAR(50)`	Crop type (e.g. `maiz`, `soja`)
`superficie_ha`	`DOUBLE PRECISION`	Lot area in hectares
`contenido_tecnico`	`TEXT`	Full agronomic report text; source for RAG retrieval
`metadata`	`JSONB`	Arbitrary key-value data from the pipeline
`embedding`	`vector(768)`	Semantic embedding generated by `nomic-embed-text`
`created_at`	`TIMESTAMP`	Row creation time
`updated_at`	`TIMESTAMP`	Last upsert time

CREATE TABLE IF NOT EXISTS informes_lotes (
    id               SERIAL PRIMARY KEY,
    lote_id          VARCHAR(100) NOT NULL,
    fecha            DATE,
    ndvi_promedio    DOUBLE PRECISION,
    gdd_acumulados   DOUBLE PRECISION,
    score_total      INTEGER,
    cv_espacial      DOUBLE PRECISION,
    zona_activa      BOOLEAN DEFAULT FALSE,
    puntos_zona_c    INTEGER DEFAULT 0,
    cultivo          VARCHAR(50),
    superficie_ha    DOUBLE PRECISION,
    contenido_tecnico TEXT,
    metadata         JSONB,
    embedding        vector(768),
    created_at       TIMESTAMP DEFAULT now(),
    updated_at       TIMESTAMP DEFAULT now()
);
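
The upsert described for this table can be sketched in Python as a small statement builder. This is an illustration only: `build_upsert` is a hypothetical helper, not part of AgroIA, the `%s` placeholders assume a psycopg-style driver, and the column subset is chosen for brevity. Table and column names are interpolated directly, so they must come from trusted, hard-coded values.

```python
def build_upsert(table, columns, conflict_cols):
    """Build an INSERT ... ON CONFLICT ... DO UPDATE statement (hypothetical helper)."""
    placeholders = ", ".join(["%s"] * len(columns))
    # Every non-key column is overwritten with the incoming value (EXCLUDED.col).
    updates = [f"{c} = EXCLUDED.{c}" for c in columns if c not in conflict_cols]
    # Refresh the upsert timestamp on every conflict.
    updates.append("updated_at = now()")
    return (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT ({', '.join(conflict_cols)}) "
        f"DO UPDATE SET {', '.join(updates)}"
    )


sql = build_upsert(
    "informes_lotes",
    ["lote_id", "fecha", "score_total", "contenido_tecnico"],
    ["lote_id"],
)
print(sql)
```

The same builder would cover `lote_historial` by passing `["lote_id", "anio"]` as the conflict columns.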

Table: `lote_historial`

Time series table with one row per lot per campaign year, populated from the historial_anos array in the ingestion payload. The Streamlit dashboard reads it for historical charts, and the RAG engine reads it for temporal context. The unique_lote_anio constraint, UNIQUE (lote_id, anio), allows one record per lot per year; subsequent ingestions run INSERT ... ON CONFLICT (lote_id, anio) DO UPDATE.

Column	Type	Notes
`id`	`SERIAL`	Primary key
`lote_id`	`VARCHAR(100)`	Logical reference to `informes_lotes.lote_id` (no FOREIGN KEY constraint is declared in the DDL)
`anio`	`INTEGER`	Campaign year (ASCII column name; see note below)
`cultivo`	`VARCHAR(50)`	Crop type for that campaign
`ndvi_critico`	`DOUBLE PRECISION`	NDVI at the critical phenological stage
`horas_calor`	`DOUBLE PRECISION`	Accumulated heat hours (NASA POWER)
`score_total`	`INTEGER`	AgroIA Score for that year
`score_vigor`	`DOUBLE PRECISION`	Vigor sub-score (40% weight)
`score_estabilidad`	`DOUBLE PRECISION`	Stability sub-score (30% weight)
`score_limpieza`	`DOUBLE PRECISION`	Cleanliness sub-score (20% weight)
`score_clima`	`DOUBLE PRECISION`	Climate sub-score (10% weight)
`valido_para_score`	`BOOLEAN`	Whether the row is used in score calculations
`superficie_ha`	`DOUBLE PRECISION`	Field area in hectares for this campaign
`cv_espacial`	`DOUBLE PRECISION`	Spatial coefficient of variation for the campaign
`zonificacion_activa`	`BOOLEAN`	Whether zone delineation was active
`puntos_zona_c`	`INTEGER`	Zone-C point count for that year

CREATE TABLE IF NOT EXISTS lote_historial (
    id                  SERIAL PRIMARY KEY,
    lote_id             VARCHAR(100) NOT NULL,
    anio                INTEGER      NOT NULL,
    cultivo             VARCHAR(50),
    ndvi_critico        DOUBLE PRECISION,
    horas_calor         DOUBLE PRECISION,
    score_total         INTEGER,
    score_vigor         DOUBLE PRECISION,
    score_estabilidad   DOUBLE PRECISION,
    score_limpieza      DOUBLE PRECISION,
    score_clima         DOUBLE PRECISION,
    valido_para_score   BOOLEAN DEFAULT TRUE,
    superficie_ha       DOUBLE PRECISION,
    cv_espacial         DOUBLE PRECISION,
    zonificacion_activa BOOLEAN DEFAULT FALSE,
    puntos_zona_c       INTEGER DEFAULT 0,
    created_at          TIMESTAMP DEFAULT now(),
    updated_at          TIMESTAMP DEFAULT now(),
    CONSTRAINT unique_lote_anio UNIQUE (lote_id, anio)
);
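
The sub-score weights listed in the column notes (40/30/20/10) sum to 100. As a rough illustration of how such weights combine, the sketch below computes a weighted total. It rests on two assumptions that this page does not confirm: that each sub-score is on a 0 to 100 scale, and that `score_total` is the rounded weighted sum. The actual AgroIA scoring code may differ.

```python
# Weights in percent, taken from the column notes for lote_historial.
WEIGHTS = {
    "score_vigor": 40,
    "score_estabilidad": 30,
    "score_limpieza": 20,
    "score_clima": 10,
}


def weighted_total(row):
    """Assumed formula: weighted sum of 0-100 sub-scores, rounded to an integer."""
    total = sum(WEIGHTS[name] * row[name] for name in WEIGHTS) / 100
    return round(total)


# Hypothetical sub-scores for one campaign year.
row = {
    "score_vigor": 80.0,
    "score_estabilidad": 70.0,
    "score_limpieza": 90.0,
    "score_clima": 60.0,
}
print(weighted_total(row))  # 77
```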

The year column is named anio, not año. All JSON payload keys and SQL column names must use ASCII characters only. Using accented characters (e.g. historial_años, año) will cause a key mismatch and silent data loss. Always use historial_anos in payloads and anio in queries.
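
One way to catch this class of error before ingestion is to scan the payload for non-ASCII keys. The following is a minimal sketch, not part of AgroIA: `non_ascii_keys` is a hypothetical helper, and the sample payloads (including the inner `anio` key) are illustrative rather than the documented payload format.

```python
def non_ascii_keys(obj, path=""):
    """Return the paths of all dict keys that contain non-ASCII characters."""
    found = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            here = f"{path}.{key}" if path else key
            if not key.isascii():
                found.append(here)
            # Keep walking so nested offenders are reported too.
            found.extend(non_ascii_keys(value, here))
    elif isinstance(obj, list):
        for i, item in enumerate(obj):
            found.extend(non_ascii_keys(item, f"{path}[{i}]"))
    return found


# Hypothetical payloads for illustration only.
bad = {"lote_id": "L-001", "historial_años": [{"año": 2023, "score_total": 71}]}
good = {"lote_id": "L-001", "historial_anos": [{"anio": 2023, "score_total": 71}]}

print(non_ascii_keys(bad))   # ['historial_años', 'historial_años[0].año']
print(non_ascii_keys(good))  # []
```

Rejecting a payload when the returned list is non-empty turns the silent mismatch into an explicit error.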

Indexes

Both tables have indexes on the columns most frequently filtered or ordered by the Streamlit dashboard and the RAG query layer:

Index	Table	Column(s)
`idx_lotes_lote_id`	`informes_lotes`	`lote_id`
`idx_lotes_cultivo`	`informes_lotes`	`cultivo`
`idx_lotes_score`	`informes_lotes`	`score_total`
`idx_historial_lote`	`lote_historial`	`lote_id`
`idx_historial_anio`	`lote_historial`	`anio`
`idx_historial_cultivo`	`lote_historial`	`cultivo`
`idx_historial_score`	`lote_historial`	`score_total`
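
The table above maps directly onto single-column CREATE INDEX IF NOT EXISTS statements. The sketch below generates them from Python, assuming plain single-column B-tree indexes; the exact statements in 01_migrate_schema.sql may differ (for example in sort order), and `index_statements` is a hypothetical helper.

```python
# Index name -> (table, column), copied from the index table.
INDEXES = {
    "idx_lotes_lote_id": ("informes_lotes", "lote_id"),
    "idx_lotes_cultivo": ("informes_lotes", "cultivo"),
    "idx_lotes_score": ("informes_lotes", "score_total"),
    "idx_historial_lote": ("lote_historial", "lote_id"),
    "idx_historial_anio": ("lote_historial", "anio"),
    "idx_historial_cultivo": ("lote_historial", "cultivo"),
    "idx_historial_score": ("lote_historial", "score_total"),
}


def index_statements(indexes):
    """Return one idempotent CREATE INDEX statement per entry."""
    return [
        f"CREATE INDEX IF NOT EXISTS {name} ON {table} ({column});"
        for name, (table, column) in indexes.items()
    ]


for stmt in index_statements(INDEXES):
    print(stmt)
```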

Advanced topics

pgvector cosine similarity search

The embedding column in informes_lotes stores 768-dimensional vectors produced by nomic-embed-text. The RAG engine queries using the <=> cosine distance operator:

SELECT lote_id, contenido_tecnico
FROM informes_lotes
ORDER BY embedding <=> $1::vector
LIMIT 5;
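
The `<=>` operator returns cosine distance, which is 1 minus cosine similarity, so smaller values mean closer matches and ORDER BY ... LIMIT 5 returns the five most similar reports. The sketch below reproduces that metric in plain Python and formats a list of floats as the bracketed text literal that pgvector accepts for a `::vector` cast. Both function names are hypothetical, and the short vectors stand in for real 768-dimensional embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance as computed by pgvector's <=> operator: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def to_vector_literal(values):
    """Format floats as pgvector text input, e.g. '[0.1,2.0,-3.5]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
print(to_vector_literal([0.1, 2, -3.5]))         # [0.1,2.0,-3.5]
```

The string returned by `to_vector_literal` is what would be bound to `$1` in the query above.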

For large datasets, add an ivfflat or hnsw index on the embedding column, built with the vector_cosine_ops operator class so that it matches the <=> operator, to speed up approximate nearest-neighbor search.
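
As a sketch of what such an index could look like, the helper below returns the DDL for either index type using pgvector's documented `USING hnsw` / `USING ivfflat` syntax. The index names, the helper itself, and the `lists = 100` default are assumptions for illustration; none of them come from 01_migrate_schema.sql.

```python
def ann_index_statement(method="hnsw", lists=100):
    """Return DDL for an approximate nearest-neighbor index on informes_lotes.embedding."""
    if method == "hnsw":
        return (
            "CREATE INDEX IF NOT EXISTS idx_lotes_embedding_hnsw "
            "ON informes_lotes USING hnsw (embedding vector_cosine_ops);"
        )
    if method == "ivfflat":
        return (
            "CREATE INDEX IF NOT EXISTS idx_lotes_embedding_ivfflat "
            "ON informes_lotes USING ivfflat (embedding vector_cosine_ops) "
            f"WITH (lists = {int(lists)});"
        )
    raise ValueError(f"unsupported index method: {method!r}")


print(ann_index_statement("hnsw"))
print(ann_index_statement("ivfflat", lists=100))
```

An ivfflat index derives its clusters from existing rows, so it is best created after the table holds data; an hnsw index can be created on an empty table.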

Changing the embedding dimension

If you switch to a model that produces a different vector size, you must drop and recreate the embedding column with the new dimension, then re-ingest all lots to regenerate embeddings. The current schema is fixed at vector(768) for nomic-embed-text.
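
The drop-and-recreate step can be sketched as two ALTER TABLE statements. The helper below is hypothetical and only builds the SQL. Dropping the column discards every stored embedding and any index defined on it, which is why all lots must be re-ingested afterwards.

```python
def embedding_resize_statements(new_dim, table="informes_lotes", column="embedding"):
    """Return the statements that replace the embedding column with a new dimension."""
    if not isinstance(new_dim, int) or new_dim <= 0:
        raise ValueError("new_dim must be a positive integer")
    return [
        # Removes the old vectors together with any index on the column.
        f"ALTER TABLE {table} DROP COLUMN IF EXISTS {column};",
        # Recreates the column empty (NULL) until lots are re-ingested.
        f"ALTER TABLE {table} ADD COLUMN {column} vector({new_dim});",
    ]


# Example: switching to a hypothetical 1024-dimensional model.
for stmt in embedding_resize_statements(1024):
    print(stmt)
```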
