AgroIA system architecture and data flow

AgroIA is composed of several loosely coupled subsystems that together automate the full agronomic diagnostic cycle. At the entry point, a GPS coordinate or shapefile is converted into a precise GeoJSON polygon by the SAM delineator. That polygon feeds into the local analysis pipeline, which queries Google Earth Engine for Sentinel-2 NDVI history, calls NASA POWER for climate data, computes the AgroIA Score, generates PDF and HTML reports, and pushes a vector-embedded summary into PostgreSQL. From there, a FastAPI backend exposes the data via REST, a Streamlit dashboard renders it visually, a Telegram bot makes it queryable on mobile, and a RAG engine powered by Ollama enables natural-language lot interrogation.

End-to-end data flow

The diagram below maps the complete flow from raw field data to consumable outputs, matching the architecture described in AGENTS.md:

[GPS points / shapefile]          [Bulk GeoJSON file]
         │                                │
         ▼  (SAM delineation)             ▼  (API ingestion)
[SAM Polygon delineator]     [FastAPI — POST /ingesta/geojson]
  Output: GeoJSON polygon                 │
         │                                │
         └──────────────┬─────────────────┘
                        ▼
         [Analysis Pipeline — Motor Local v2.5]
          python start.py --pipeline <path.shp> [crop]
          ├─ GEE Sentinel-2 SR  (historical NDVI, 6 years)
          ├─ NASA POWER         (heat stress, sinusoidal formula)
          ├─ IsolationForest    (satellite outlier cleanup)
          ├─ AgroIA Score       (0–100) + K-Means zoning A/B/C
          ├─ build_report()     → PDF  in src/outputs/
          ├─ generar_mapa_offline() → HTML map in outputs/
          └─ enviar_al_rag()    → upsert to PostgreSQL + pgvector
                        │
                        ▼
              [PostgreSQL + pgvector]
               informes_lotes  (UNIQUE lote_id)
               lote_historial  (UNIQUE lote_id, anio)
                        │
          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
    [FastAPI REST]  [Streamlit]  [Telegram bot]
     port 8000      port 8501     polling
          │             │
          └──────────────┤
                         ▼
                  [Ollama — RAG engine]
                   nomic-embed-text (embeddings)
                   gemma3:4b        (generation)

Component breakdown

FastAPI backend

Runs on port 8000. Handles ingestion (/ingesta, /ingesta/geojson), lot queries (/lotes, /lotes/{lote_id}), and health checks. Authentication uses a bearer token from INGESTA_SECRET_KEY.
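
As a rough illustration of the bearer-token scheme, the sketch below builds (without sending) an authenticated POST to /ingesta. The payload shape is hypothetical, since the real request schema is not described on this page, and the assumption that the client reads the token from the INGESTA_SECRET_KEY environment variable is ours.

```python
import json
import os
import urllib.request

API_BASE = "http://localhost:8000"  # port 8000, as stated above


def build_ingest_request(payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST to /ingesta."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/ingesta",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    token = os.environ.get("INGESTA_SECRET_KEY", "")
    # Hypothetical payload: the real schema is defined by the API, not here.
    req = build_ingest_request({"lote_id": "lote-demo"}, token)
    print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen() would perform the call against a running backend.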

PostgreSQL + pgvector

Stores lot reports (informes_lotes) and time-series data (lote_historial). The pgvector extension enables cosine similarity search over 768-dimensional embeddings generated by nomic-embed-text.
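
To make the similarity search concrete, here is a minimal sketch. The SQL string uses pgvector's `<=>` operator, which returns cosine distance (1 minus cosine similarity); the exact query AgroIA runs is not shown on this page, so the statement below is an assumption built from the table and column names in the schema. The pure-Python function shows the same metric on plain lists.

```python
import math

# Assumed query shape; placeholders follow the psycopg "%s" convention.
TOP_K_SQL = """
SELECT lote_id, 1 - (embedding <=> %s::vector) AS similarity
FROM informes_lotes
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_pgvector_literal(vector):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```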

Google Earth Engine

Provides Sentinel-2 SR imagery for NDVI extraction over the last six years. Authenticated via the earthengine CLI. Requires GEE_PROJECT_ID in .env.
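
For reference, NDVI is the normalized difference of near-infrared and red reflectance; on Sentinel-2 these are bands B8 and B4. The sketch below computes it on plain numbers and does not call Earth Engine. How gee_extractor.py masks clouds or aggregates pixels is not covered here, and the helper names are illustrative.

```python
import math


def ndvi(nir: float, red: float) -> float:
    """NDVI = (NIR - Red) / (NIR + Red); NaN when both bands are zero."""
    total = nir + red
    if total == 0:
        return math.nan
    return (nir - red) / total


def mean_ndvi(samples):
    """Average NDVI over (nir, red) pairs, ignoring undefined pixels."""
    values = [ndvi(nir, red) for nir, red in samples]
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan
```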

SAM delineator

Segment Anything Model converts GPS points into crop field polygons. Two production runs are stored: 268 maize polygons (TAYPE zone) and 340 pivot-irrigated polygons (Tandil/Balcarce).

Ollama (LLM)

Runs nomic-embed-text for embedding generation and gemma3:4b for RAG response generation. Both models run locally for full data sovereignty.

Streamlit dashboard

Runs on port 8501. Displays lot rankings, NDVI time series, Folium HTML maps, score breakdowns, and the RAG chat interface. Sourced from src/streamlit_app.py.

Telegram bot

Polling-based bot defined in src/bot/telegram_main.py. Exposes RAG queries and lot lookups to mobile users. Started via python start.py --bot.

NASA POWER

Provides historical climate data for heat stress calculation. Called via get_nasa_climate_safe() in src/pipeline/nasa_power.py. No authentication required.
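
The sketch below shows how a request to the public NASA POWER daily point endpoint can be assembled. Which parameters get_nasa_climate_safe() actually requests is not stated on this page; T2M_MAX and T2M_MIN (daily maximum and minimum temperature at 2 m) and the AG community are assumptions chosen because they suit a heat stress calculation.

```python
from urllib.parse import urlencode

POWER_DAILY_POINT = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_url(lat, lon, start, end, parameters=("T2M_MAX", "T2M_MIN")):
    """Return a NASA POWER URL for one point; dates use the YYYYMMDD format."""
    query = {
        "parameters": ",".join(parameters),  # assumed parameter set
        "community": "AG",                   # assumed community
        "latitude": lat,
        "longitude": lon,
        "start": start,
        "end": end,
        "format": "JSON",
    }
    # Keep commas unescaped so the parameter list stays readable.
    return f"{POWER_DAILY_POINT}?{urlencode(query, safe=',')}"
```

For example, build_power_url(-37.32, -59.13, "20200101", "20201231") yields a URL that can be fetched with any HTTP client, with no API key.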

Pipeline module internals

The analysis pipeline lives in src/pipeline/ and is invoked via run_full_analysis(). Each step is a discrete module:

Module	File	Responsibility
GEE extractor	`src/pipeline/gee_extractor.py`	`init_gee()`, Sentinel-2 SR NDVI queries
NASA POWER	`src/pipeline/nasa_power.py`	`get_nasa_climate_safe()`, six-year climate history
Agro math	`src/pipeline/agro_math.py`	`calcular_score()`, `get_gee_ndvi_validado()`, crop config
Reporter	`src/pipeline/reporter.py`	`build_report()` PDF, `generar_mapa_offline()` HTML
Comparative reporter	`src/pipeline/comparative_reporter.py`	Multi-lot ranking PDF
Ingestion	`src/pipeline/ingesta.py`	`construir_payload_v2()`, `enviar_al_rag()`
Utilities	`src/pipeline/utils.py`	`validar_shapefile()` CRS and geometry checks

src/pipeline_local.py is a legacy entry point retained for compatibility. Use python start.py --pipeline for all new work — it injects src/ into the Python path and calls run_full_analysis() from src/pipeline/__init__.py.
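
When the pipeline has to be launched from another script, one option is to wrap the documented command line rather than import internals. This is a sketch, not part of the codebase: the helper names are hypothetical, and the crop value in the usage note is a made-up example.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_command(shapefile, crop=None):
    """Return the argv list for `python start.py --pipeline <shp> [crop]`."""
    cmd = [sys.executable, "start.py", "--pipeline", str(shapefile)]
    if crop:
        cmd.append(crop)
    return cmd


def run_pipeline(shapefile, crop=None, repo_root="."):
    """Run the pipeline from the repository root; raises if it exits non-zero."""
    if Path(shapefile).suffix.lower() != ".shp":
        raise ValueError(f"expected a .shp file, got: {shapefile}")
    return subprocess.run(pipeline_command(shapefile, crop), cwd=repo_root, check=True)
```

A call such as run_pipeline("lotes/campo.shp", "maiz") would then block until the reports have been written.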

AgroIA Score formula

The score aggregates four dimensions into a single 0–100 index. The weights reflect agronomic importance, with crop vigor carrying the highest weight:

Score (0–100) = Vigor (40%) + Stability (30%) + Cleanliness (20%) + Climate (10%)

Component	Weight	Source	Method
Vigor	40%	Sentinel-2 SR NDVI	Normalized mean NDVI during the critical crop month
Stability	30%	GEE NDVI history (6 years)	Inverse of the coefficient of variation
Cleanliness	20%	GEE NDVI series	IsolationForest (`contamination=0.2`) penalizes satellite outliers
Climate	10%	NASA POWER	Accumulated heat hours using sinusoidal formula
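
The weighting can be sketched as a plain weighted sum. Two assumptions are made here and may not match calcular_score(): each component is already normalized to 0–100 before weighting, and "inverse of the coefficient of variation" is mapped as 100 × (1 − CV), clamped to the valid range.

```python
from statistics import mean, pstdev

WEIGHTS = {"vigor": 0.40, "stability": 0.30, "cleanliness": 0.20, "climate": 0.10}


def clamp(value, low=0.0, high=100.0):
    return max(low, min(high, value))


def stability_from_ndvi(yearly_ndvi):
    """Assumed mapping: lower year-to-year variation gives a higher score."""
    mu = mean(yearly_ndvi)
    if mu <= 0:
        return 0.0
    cv = pstdev(yearly_ndvi) / mu  # coefficient of variation
    return clamp(100.0 * (1.0 - cv))


def agroia_score(components):
    """Weighted sum of four 0-100 component scores, rounded to one decimal."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    total = sum(WEIGHTS[name] * clamp(components[name]) for name in WEIGHTS)
    return round(total, 1)
```

With vigor 80, stability 70, cleanliness 90, and climate 50, the contributions are 32, 21, 18, and 5, for a score of 76.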

Lots are also classified into zones A, B, and C using K-Means clustering on the NDVI spatial distribution within the polygon.
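
The zoning idea can be illustrated with a dependency-free, one-dimensional K-Means over per-pixel NDVI values. The project lists scikit-learn in its stack, so the real implementation most likely differs. The feature set (NDVI only) and the labeling rule (A = highest-NDVI cluster) are assumptions of this sketch.

```python
def kmeans_1d(values, k=3, iterations=50):
    """Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    if len(values) < k:
        raise ValueError("need at least k values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids, labels


def zone_pixels(ndvi_values):
    """Label each NDVI value 'A', 'B', or 'C' (assumed: A = highest NDVI)."""
    centroids, labels = kmeans_1d(ndvi_values, k=3)
    ranking = sorted(range(3), key=lambda c: centroids[c], reverse=True)
    zone_of = {cluster: "ABC"[rank] for rank, cluster in enumerate(ranking)}
    return [zone_of[lab] for lab in labels]
```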

Database schema

`informes_lotes`

Primary store for aggregated lot reports. One row per lot (UNIQUE(lote_id)).

Column	Type	Description
`lote_id`	`text`	Unique lot identifier (primary key)
`metadata`	`jsonb`	Score breakdown, crop type, area, zone classification
`embedding`	`vector(768)`	nomic-embed-text embedding for semantic search
`created_at`	`timestamptz`	Ingestion timestamp

`lote_historial`

Time-series table for annual NDVI and climate records. One row per lot per year (UNIQUE(lote_id, anio)).

Column	Type	Description
`lote_id`	`text`	Foreign reference to `informes_lotes`
`anio`	`integer`	Year (ASCII field — no diacritics)
`ndvi_promedio`	`float`	Mean NDVI for the year
`stress_termico`	`float`	Accumulated heat stress hours
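
The UNIQUE(lote_id, anio) constraint is what makes re-running the pipeline idempotent: a second run for the same lot and year updates the row instead of duplicating it. The sketch below demonstrates this with an in-memory SQLite database as a stand-in for PostgreSQL. The `INSERT ... ON CONFLICT ... DO UPDATE` form is also valid PostgreSQL, where a driver such as psycopg would use `%s` placeholders instead of `?`. The statement AgroIA actually issues is not shown on this page.

```python
import sqlite3

UPSERT_HISTORIAL = """
INSERT INTO lote_historial (lote_id, anio, ndvi_promedio, stress_termico)
VALUES (?, ?, ?, ?)
ON CONFLICT (lote_id, anio) DO UPDATE SET
    ndvi_promedio = excluded.ndvi_promedio,
    stress_termico = excluded.stress_termico
"""


def make_demo_db():
    """Create an in-memory table mirroring the documented columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE lote_historial (
            lote_id TEXT NOT NULL,
            anio INTEGER NOT NULL,
            ndvi_promedio REAL,
            stress_termico REAL,
            UNIQUE (lote_id, anio)
        )
        """
    )
    return conn


def upsert_year(conn, lote_id, anio, ndvi_promedio, stress_termico):
    """Insert one lot-year row, or update it if the pair already exists."""
    conn.execute(UPSERT_HISTORIAL, (lote_id, anio, ndvi_promedio, stress_termico))
    conn.commit()
```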

All JSON payload keys must be ASCII. The anio field uses the ASCII spelling (no tilde) throughout the codebase. Using año in any payload or query will cause a schema mismatch. See the v2 migration notes in AGENTS.md for the full list of renamed keys.
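
A small guard can catch non-ASCII keys before a payload reaches the API. The function below is a hypothetical helper, not part of the codebase; it walks nested dictionaries and lists and reports the path of every offending key.

```python
def non_ascii_keys(payload, path=""):
    """Return the paths of all dict keys that contain non-ASCII characters."""
    found = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            here = f"{path}.{key}" if path else str(key)
            if not str(key).isascii():
                found.append(here)
            found.extend(non_ascii_keys(value, here))
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            found.extend(non_ascii_keys(item, f"{path}[{index}]"))
    return found
```

An empty result means the payload is safe to send; anything else names the keys to rename, for example año to anio.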

Technology stack

Layer	Technology	Version
API framework	FastAPI	≥ 0.100.0
Web server	Uvicorn	≥ 0.23.0
Database	PostgreSQL + pgvector	pg16
Vector search	pgvector Python client	≥ 0.2.0
Settings management	pydantic-settings	≥ 2.0.0
Satellite imagery	Google Earth Engine API	≥ 0.1.340
SAM	segment-anything	≥ 1.0
CV backend	OpenCV (headless)	≥ 4.8.0
ML / anomaly detection	scikit-learn	≥ 1.3.0
Deep learning runtime	PyTorch	≥ 2.0.0
Geospatial processing	GeoPandas + Shapely	≥ 0.13.0 / ≥ 2.0.0
Map rendering	Folium	≥ 0.14.0
LLM runtime	Ollama	latest
Frontend	Streamlit	≥ 1.28.0
Bot framework	python-telegram-bot	≥ 20.0
Container runtime	Docker + Compose	—
Python runtime	Python	3.10 (slim)

RAG engine

The RAG module lives in src/rag/core.py. It uses pgvector cosine similarity to retrieve the most relevant lot reports and passes them as context to gemma3:4b via Ollama. Key exported functions:

consultar_agente(lote_id, pregunta, top_k=3) — returns an LLM response with RAG context for a specific lot.
fetch_context(lote_id, pregunta, top_k=3) — retrieval only, no LLM call.
listar_lotes() — lists all lots stored in the database.
get_historial_lote_raw(lote_id) — returns the raw time series for a lot.
get_datos_lote_raw(lote_id) — returns the full report for a lot.

Import functions directly from src/rag/core.py rather than re-implementing retrieval logic. The module is the single source of truth for all RAG operations.
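
For orientation only, the sketch below shows the two Ollama HTTP calls that a retrieve-then-generate flow of this kind relies on: /api/embeddings to embed a question and /api/generate to produce an answer. It builds the requests without sending them. The prompt wording and helper names are hypothetical; the internals of src/rag/core.py are not documented here and may differ, so application code should still call consultar_agente() rather than this sketch.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # host-side Ollama, per the deployment topology


def _post(path, body):
    """Build (but do not send) a JSON POST request to the Ollama HTTP API."""
    return urllib.request.Request(
        url=f"{OLLAMA_URL}{path}",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def embedding_request(text):
    """Request an embedding from nomic-embed-text."""
    return _post("/api/embeddings", {"model": "nomic-embed-text", "prompt": text})


def build_prompt(pregunta, context_chunks):
    """Assemble retrieved lot reports and the question (wording is illustrative)."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the lot reports below.\n\n"
        f"Reports:\n{context}\n\n"
        f"Question: {pregunta}"
    )


def generation_request(prompt):
    """Request a single, non-streamed completion from gemma3:4b."""
    return _post("/api/generate", {"model": "gemma3:4b", "prompt": prompt, "stream": False})
```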

Deployment topology

Host machine
├── Ollama (localhost:11434)
│   ├── nomic-embed-text
│   └── gemma3:4b
├── PostgreSQL + pgvector (localhost:5432)
│   ├── informes_lotes
│   └── lote_historial
└── Docker Compose
    ├── agroia_api  (→ 0.0.0.0:8000)
    └── agroia_ui   (→ 0.0.0.0:8501)

The containers reach the host-side services via host.docker.internal. Pipeline runs (python start.py --pipeline) execute on the host directly, outside Docker, to maintain file system access to shapefile inputs and output directories.
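
Because the same code may run inside a container or directly on the host, the service hostname has to switch between host.docker.internal and localhost. The sketch below is one way to express that. The function names are hypothetical, and checking for /.dockerenv is a common heuristic rather than something this project is known to do.

```python
import os


def running_in_docker():
    """Heuristic: Docker creates /.dockerenv inside containers."""
    return os.path.exists("/.dockerenv")


def service_host(in_docker):
    """Containers reach host services via host.docker.internal."""
    return "host.docker.internal" if in_docker else "localhost"


def service_urls(in_docker=None):
    """Return base addresses for the host-side services listed above."""
    if in_docker is None:
        in_docker = running_in_docker()
    host = service_host(in_docker)
    return {
        "ollama": f"http://{host}:11434",
        "postgres": f"{host}:5432",
    }
```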

Next steps

AgroIA Score concepts

Detailed explanation of each score component, normalization formulas, and crop-specific parameters.

SAM delineation

How SAM converts GPS points to field polygons and what the two production runs cover.

API reference

Full endpoint documentation for ingestion, lot queries, and pipeline module APIs.

Configuration reference

Complete .env variable reference with defaults and validation rules.
