Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/gcapella0/agente-inteligente-expedientes/llms.txt

Use this file to discover all available pages before exploring further.

The system supports two interchangeable LLM backends: OpenRouter (cloud, recommended for production) and Ollama (local, free, no API key required). Both implement the same BaseLlmProvider interface, so ClassifierAgent works identically regardless of which one is active. LLM configuration is persisted in MongoDB (sistema_config collection, document _id: "llm_config"). This means you can switch providers at runtime via PUT /config/llm without restarting the service — the factory reads MongoDB first and falls back to .env only if no database config exists.

OpenRouter

OpenRouter is the recommended provider for production. It routes your requests to hosted models with no local GPU required, and the system automatically rotates through fallback models if rate limits are hit.

Setup

  1. Create an account at openrouter.ai and generate an API key.
  2. Set the following variables in your .env:
LLM_PROVIDER=openrouter
OPENROUTER_API_KEY=sk-or-...
OPENROUTER_MODEL=minimax/minimax-m2.5:free
OPENROUTER_FALLBACK_MODELS=google/gemma-3-27b-it:free,meta-llama/llama-4-scout:free,qwen/qwen3-8b:free

Automatic model rotation

OpenRouterProvider builds a deduplicated model list at startup: the primary model (OPENROUTER_MODEL) first, followed by each entry in OPENROUTER_FALLBACK_MODELS. When a classification call is made:
  • It tries the primary model first, up to 3 retries for transient errors (connection timeouts, non-JSON responses).
  • If a RateLimitError is received, it immediately rotates to the next model in the list — no retry on the rate-limited model.
  • The cycle repeats through all models. If all are exhausted, the agent returns a structured error result with "valido": false rather than crashing the pipeline.
This means the full model rotation order is controlled by the order you declare models in .env:
OPENROUTER_MODEL        → tried first
OPENROUTER_FALLBACK_MODELS (left to right) → tried in sequence on rate limit

Testing connectivity

Use the built-in health endpoint to verify your API key and model are reachable before processing documents:
curl -X POST http://localhost:8000/config/llm/probar \
  -H "Authorization: Bearer $TOKEN"
The endpoint calls OpenRouterProvider.health_check(), which lists available models from the API and returns:
{ "status": "ok", "provider": "openrouter", "model": "minimax/minimax-m2.5:free" }

Ollama

Ollama is ideal for local development: no API key, no cost, and no data leaves your machine. Performance depends entirely on your hardware — a modern CPU can classify documents in 30–60 seconds with the right model.

Setup

  1. Install Ollama: ollama.com/download
  2. Pull the model you want to use:
ollama pull phi3:mini
  1. Set the following variables in your .env:
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=phi3:mini
OLLAMA_TIMEOUT_SECONDS=120
OLLAMA_NUM_PREDICT=400

Model recommendations

Choose your Ollama model based on available RAM and acceptable classification latency:
ModelRAM approx.Estimated time (i5 CPU)Quality
phi3:mini~2.2 GB30–60 sGood
qwen2.5:0.5b~0.8 GB10–20 sBasic
gemma4:e4b~4B params quantizedRecommended for serverBest quality
Do not use mistral on a CPU without a GPU. Its inference time consistently exceeds the OLLAMA_TIMEOUT_SECONDS=120 limit, causing the classification step to always fail. Use phi3:mini or qwen2.5:0.5b on CPU-only machines.

How OllamaProvider works

OllamaProvider calls Ollama’s POST /api/chat endpoint (not /api/generate) with stream: false. This allows sending the system prompt and user message as separate roles, which improves classification accuracy and reduces context length. Key behaviour:
  • OCR text is truncated to 1500 characters before sending to Ollama. This keeps CPU inference fast while preserving the most relevant content from the document.
  • The context window is limited to 2048 tokens (num_ctx).
  • If the response is not valid JSON, the provider retries once before returning an error result.
  • On a timeout, the log message recommends switching to a smaller model.

Runtime switching (no restart required)

The create_llm_provider() factory in src/services/llm/llm_factory.py implements a MongoDB-first lookup:
  1. It queries sistema_config.find_one({"_id": "llm_config"}).
  2. If the document exists, it overrides config.LLM_PROVIDER, config.OLLAMA_MODEL, etc. in memory.
  3. It then instantiates the appropriate provider class.
  4. If MongoDB is unavailable, it silently falls back to the values loaded from .env.
This means you can switch providers — or change the active model — without restarting the service:
curl -X PUT http://localhost:8000/config/llm \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"provider": "ollama", "model": "phi3:mini", "host": "http://localhost:11434"}'
The endpoint writes the new config to MongoDB. The next document processed by ClassifierAgent will use the new provider.

Test before saving

You can test a provider configuration without persisting it by passing optional provider and host fields to the probe endpoint:
curl -X POST http://localhost:8000/config/llm/probar \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"provider": "ollama", "host": "http://localhost:11434"}'
This runs a health_check() against the specified provider and returns a status without modifying the stored config.
Docker + Ollama: If the service is running inside Docker and Ollama is running on the host machine, you must explicitly set OLLAMA_BASE_URL=http://host.docker.internal:11434 in your .env file — that is the variable OllamaProvider reads from src/config.py. Note that docker-compose.yml includes OLLAMA_HOST: http://host.docker.internal:11434 in its environment block, but OLLAMA_HOST is not read by the application code and has no effect on routing. You must update OLLAMA_BASE_URL in .env (or use PUT /config/llm at runtime) to point the provider at the host machine.

Build docs developers (and LLMs) love