AgroIA uses two Ollama models running locally: one to convert agronomic report text into dense vector embeddings for semantic search, and one to generate natural-language answers grounded in those retrieved reports. Both models are served by Ollama and communicate with the application over HTTP. No external API calls or cloud GPUs are required.
Models at a glance
| Role | Model | Output | Used by |
|---|---|---|---|
| Embeddings | nomic-embed-text | 768-dimensional float32 vector | src/utils/loader.py, src/rag/core.py |
| Generation | gemma3:4b | Agronomic expert text response | src/rag/core.py |
Pulling the models
Run both ollama pull commands before starting AgroIA; the models must be available locally before the application attempts to use them:
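```bash
ollama pull nomic-embed-text
ollama pull gemma3:4b
```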
Embedding model: nomic-embed-text
nomic-embed-text converts the contenido_tecnico text field of each lot report into a 768-dimensional vector. This vector is stored in the embedding column of informes_lotes as a vector(768) type (pgvector). At query time, the RAG engine embeds the user’s question using the same model and retrieves the most semantically similar lot reports via cosine distance.
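As an illustration, the similarity search could be issued with pgvector's cosine-distance operator (<=>) against informes_lotes. This is only a sketch: the real query lives in src/rag/core.py, and the connection settings and any columns other than embedding and contenido_tecnico are assumptions.

```python
import psycopg2

def retrieve_similar_reports(question_embedding, top_k=5):
    """Return the top_k most similar lot reports by cosine distance."""
    # Connection parameters here are placeholders, not AgroIA's real settings.
    conn = psycopg2.connect("dbname=agroia user=agroia")
    vec_literal = "[" + ",".join(str(x) for x in question_embedding) + "]"
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT contenido_tecnico,
                   embedding <=> %s::vector AS distancia  -- pgvector cosine distance
            FROM informes_lotes
            ORDER BY distancia
            LIMIT %s
            """,
            (vec_literal, top_k),
        )
        return cur.fetchall()
```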
The embedding call is made through the generate_embedding function in src/utils/loader.py:
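A minimal sketch of what that function might look like, assuming it calls Ollama's /api/embeddings endpoint directly over HTTP; the actual implementation in src/utils/loader.py may differ.

```python
import os
import requests

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")

def generate_embedding(text):
    """Embed text (a report's contenido_tecnico or a user question) with nomic-embed-text."""
    response = requests.post(
        f"{OLLAMA_URL}/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["embedding"]  # list of 768 floats
```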
Generation model: gemma3:4b
gemma3:4b is the agronomic expert LLM. The RAG engine in src/rag/core.py builds a prompt from the BASE_PROMPT system message, injects the retrieved lot context, and calls Ollama with the following inference options:
| Option | Value | Effect |
|---|---|---|
| temperature | 0.2 | Low randomness; produces consistent, factual agronomic answers |
| num_predict | 1024 | Maximum number of tokens in the generated response |
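As a rough sketch, such a call against Ollama's /api/generate endpoint could look like the following. BASE_PROMPT and the context-injection format shown here are simplified stand-ins for whatever src/rag/core.py actually builds.

```python
import os
import requests

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
BASE_PROMPT = "You are an agronomic expert..."  # placeholder for the real system prompt

def generate_answer(question, lot_context):
    """Ask gemma3:4b for an answer grounded in the retrieved lot reports."""
    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": "gemma3:4b",
            "system": BASE_PROMPT,
            "prompt": f"Context:\n{lot_context}\n\nQuestion: {question}",
            "stream": False,
            "options": {"temperature": 0.2, "num_predict": 1024},
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]
```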
Local LLM inference latency on CPU typically ranges from 14 to 71 seconds per query depending on hardware. On machines with a supported GPU, Ollama will use it automatically, significantly reducing latency.
Configuring Ollama
OLLAMA_URL
AgroIA connects to Ollama at the URL defined by the OLLAMA_URL environment variable (default: http://localhost:11434). When running inside Docker Compose, this is automatically overridden to http://host.docker.internal:11434. See Environment variables for details.
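For example, assuming OLLAMA_URL is read from config/.env like the model variables, the default looks like this:

```bash
# config/.env
OLLAMA_URL=http://localhost:11434
```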
Running Ollama as a service
Start Ollama in the background before launching AgroIA; the application will try to reach Ollama at OLLAMA_URL when the API or pipeline starts.
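With the stock Ollama CLI, for example:

```bash
# Run the Ollama server in the background (the Linux installer typically
# registers an "ollama" systemd service that does this for you).
ollama serve &
```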
Changing models
To use different models, update the corresponding variables in config/.env:
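For instance, something along these lines; EMBEDDING_MODEL is referenced below, while the name of the generation-model variable is an assumption here.

```bash
# config/.env
EMBEDDING_MODEL=nomic-embed-text   # swap for another Ollama embedding model if desired
LLM_MODEL=gemma3:4b                # hypothetical variable name; check config/.env for the real one
```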
Re-ingest all data (embedding model only)
If you changed EMBEDDING_MODEL, you must re-ingest every lot to regenerate embeddings in the new model's vector space. Re-running the pipeline or posting existing payloads to /ingesta will overwrite the stored embeddings via the upsert logic.
Advanced configuration
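A hypothetical re-ingestion call might look like this. The request body is only a placeholder: lote_id is an assumed field name and the API host/port are whatever your deployment uses, so check the ingestion endpoint's documentation for the real schema.

```bash
curl -X POST http://localhost:8000/ingesta \
  -H "Content-Type: application/json" \
  -d '{
        "lote_id": "L-001",
        "contenido_tecnico": "Technical report for the lot..."
      }'
```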
Running Ollama on a remote machine
Set OLLAMA_URL to the remote server's address. Ensure the remote machine has both models pulled and that port 11434 is reachable from your AgroIA host.
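For example, pointing config/.env at a hypothetical remote host:

```bash
# config/.env
OLLAMA_URL=http://192.168.1.50:11434   # example address; substitute your own Ollama server
```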
GPU acceleration
Ollama detects compatible NVIDIA and Apple Silicon GPUs automatically. Install the appropriate CUDA drivers (NVIDIA) or ensure you are running a native macOS Ollama build (Apple Silicon). No configuration change is needed in AgroIA — the speed improvement is transparent.
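If you want to verify which device Ollama is using, the ollama ps command lists currently loaded models along with their processor placement:

```bash
# The PROCESSOR column indicates CPU vs. GPU placement for each loaded model.
ollama ps
```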