Each of the four pipeline stages is implemented as a standalone Python class living under `src/agents/`. Every agent receives its dependencies through constructor injection (`__init__`), which makes each one easy to unit-test and swap out in isolation. When the pipeline runs end-to-end, the output dict of each agent is passed directly as the input to the next, accumulating keys as it travels through the chain.
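The chaining and constructor injection described above can be sketched as follows. This is a minimal illustration rather than the project's code: `FakeOcrService`, `DemoOcrAgent`, `demo_classifier`, and `run_pipeline` are hypothetical names, and the real agents' constructors and method signatures may differ.

```python
from typing import Any, Callable, Dict, Iterable

Payload = Dict[str, Any]
Stage = Callable[[Payload], Payload]


class FakeOcrService:
    """Hypothetical test double standing in for a real OCR service."""

    def extract(self, filename: str) -> str:
        return f"text of {filename}"


class DemoOcrAgent:
    """Hypothetical agent: its only dependency arrives through __init__."""

    def __init__(self, ocr_service: FakeOcrService) -> None:
        self._ocr = ocr_service

    def process(self, payload: Payload) -> Payload:
        text = self._ocr.extract(payload["archivo_nombre"])
        # Build a new dict so earlier keys are preserved and one key is added.
        return {**payload, "ocr_resultado": {"texto_completo": text}}


def demo_classifier(payload: Payload) -> Payload:
    """Hypothetical second stage that enriches the dict produced by the first."""
    has_text = bool(payload["ocr_resultado"]["texto_completo"])
    return {**payload, "clasificacion": {"valido": has_text}}


def run_pipeline(payload: Payload, stages: Iterable[Stage]) -> Payload:
    """Feed each stage's output dict into the next stage."""
    for stage in stages:
        payload = stage(payload)
    return payload


agent = DemoOcrAgent(FakeOcrService())
result = run_pipeline({"archivo_nombre": "titulo.pdf"}, [agent.process, demo_classifier])
```

Because the service is injected, a unit test can pass a stub like `FakeOcrService` instead of loading a real OCR model.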
All agent classes follow the same convention: they expose a single primary method (`run`, `process_directory`, `classify`, or `process`) that accepts a data payload and returns an enriched dictionary. This uniformity is what makes composing them into a pipeline straightforward.

Agent Reference
WatcherAgent — IMAP email monitor
Source: `src/agents/watcher_agent.py`

`WatcherAgent` runs in an infinite polling loop. On each cycle it connects to an IMAP mailbox over SSL, searches for emails that carry academic dossier attachments, and downloads those attachments to a per-teacher input folder.

IMAP connection

The agent connects with `imaplib.IMAP4_SSL` using credentials and host from environment variables. On each cycle it selects the configured folder (`MAIL_FOLDER`, default `INBOX`), searches for matching emails, downloads them, then disconnects cleanly before sleeping until the next poll interval.

Three-tier email search

Because different IMAP providers handle keyword search differently, the agent tries three strategies in order:

| Tier | Method | Notes |
|---|---|---|
| 1 | Gmail `X-GM-RAW` query with `subject:`, body terms, and `has:attachment` | Fastest; only works on Gmail |
| 2 | Standard IMAP `SUBJECT` / `BODY` search for each ASCII keyword variant | Works on all providers; skips non-ASCII keywords |
| 3 | Fallback `SINCE` search over the last 7 days with local keyword filtering | Catches accented subjects (e.g., Currículum) that fail IMAP keyword search |

Accent-insensitive matching is applied locally via `unicodedata.normalize("NFKD")` so that accented and unaccented spellings of a keyword (for example, Currículum and Curriculum) are treated as equivalent.

Teacher name extraction

The teacher’s name is parsed from the email subject using a separator-based pattern. Supported separators: `-`, `–`, `—`, `:`. The extracted name becomes the subdirectory name under `data/input/` (e.g., `data/input/Juan_Perez/`).

Accepted attachment formats

Any attachment whose extension is not in the accepted set is silently ignored. At least one qualifying attachment must be present or the email is discarded.
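The accent-insensitive matching and the name extraction described in this section can be sketched with the standard library. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: that the subject has the shape "keyword, separator, teacher name", and that spaces in the name become underscores (as the `Juan_Perez` example suggests).

```python
import re
import unicodedata
from typing import Iterable, Optional


def strip_accents(text: str) -> str:
    """Decompose characters with NFKD, drop combining marks, and casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def subject_matches(subject: str, keywords: Iterable[str]) -> bool:
    """True when any keyword occurs in the subject, ignoring accents and case."""
    normalized = strip_accents(subject)
    return any(strip_accents(keyword) in normalized for keyword in keywords)


# Hyphen, en dash (U+2013), em dash (U+2014), or colon, with optional spaces.
_SEPARATOR = re.compile(r"\s*[-\u2013\u2014:]\s*")


def extract_teacher_folder(subject: str) -> Optional[str]:
    """Assumed layout: '<keyword> <separator> <teacher name>'."""
    parts = _SEPARATOR.split(subject, maxsplit=1)
    if len(parts) < 2 or not parts[1].strip():
        return None
    # Collapse whitespace and join with underscores to form a folder name.
    return "_".join(parts[1].split())
```

Splitting only on the first separator keeps hyphenated surnames intact, since everything after that separator is treated as the name.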
Configuration parameters
| Variable | Default | Description |
|---|---|---|
| `MAIL_HOST` | (none) | IMAP server hostname |
| `MAIL_USER` | (none) | Mailbox username / email address |
| `MAIL_PASS` | (none) | Mailbox password or app password |
| `MAIL_FOLDER` | `INBOX` | Folder to monitor |
| `POLL_INTERVAL_SECONDS` | from config | Seconds between polling cycles |
| `SUBJECT_KEYWORD` | `Expediente Docente` | Comma-separated subject keywords |
| `BODY_KEYWORD` | (empty) | Comma-separated body keywords |
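One way to read these variables into a typed settings object is sketched below. `WatcherSettings` and `load_watcher_settings` are hypothetical names. Treating the three variables without a default as mandatory is an assumption, and the poll interval fallback is passed in by the caller because the table only says it comes from config.

```python
from dataclasses import dataclass
from typing import List, Mapping


def split_keywords(raw: str) -> List[str]:
    """Split a comma-separated value into trimmed, non-empty keywords."""
    return [part.strip() for part in raw.split(",") if part.strip()]


@dataclass(frozen=True)
class WatcherSettings:
    host: str
    user: str
    password: str
    folder: str
    poll_interval_seconds: int
    subject_keywords: List[str]
    body_keywords: List[str]


def load_watcher_settings(env: Mapping[str, str], default_poll_interval: int) -> WatcherSettings:
    # Assumption: variables listed without a default must be set.
    missing = [name for name in ("MAIL_HOST", "MAIL_USER", "MAIL_PASS") if not env.get(name)]
    if missing:
        raise ValueError("missing required variables: " + ", ".join(missing))
    return WatcherSettings(
        host=env["MAIL_HOST"],
        user=env["MAIL_USER"],
        password=env["MAIL_PASS"],
        folder=env.get("MAIL_FOLDER", "INBOX"),
        poll_interval_seconds=int(env.get("POLL_INTERVAL_SECONDS", default_poll_interval)),
        subject_keywords=split_keywords(env.get("SUBJECT_KEYWORD", "Expediente Docente")),
        body_keywords=split_keywords(env.get("BODY_KEYWORD", "")),
    )
```

In a running service the first argument would typically be `os.environ`; accepting any mapping keeps the loader easy to test with a plain dict.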
OcrAgent — document text extraction
Source: `src/agents/ocr_agent.py`

`OcrAgent` wraps the `OcrService` (which in turn uses `python-doctr[torch]`) to extract text from every supported file found under `data/input/`. The docTR model is approximately 500 MB and is loaded once at service startup to avoid reloading it on every invocation.

Primary method

`process_directory` accepts the following parameters:

- `directory`: root directory to scan; defaults to `config.INPUT_DIR` (`data/input/`)
- `skip_hashes`: set of SHA-256 hex strings; files whose hash is in this set are skipped without running OCR

Supported file types

`.txt` email body files saved by `WatcherAgent` are intentionally excluded.

Per-file result dictionary

Each dict returned by `process_directory` contains:

| Key | Type | Description |
|---|---|---|
| `archivo_path` | `Path` | Absolute path to the source file |
| `archivo_nombre` | `str` | Filename |
| `carpeta_origen` | `str` | Name of the subdirectory (teacher folder) |
| `formato` | `str` | File extension without dot (e.g., `pdf`) |
| `tamano_bytes` | `int` | Raw file size in bytes |
| `hash_sha256` | `str` | SHA-256 hex digest of file contents |
| `ocr_resultado` | `dict \| None` | OCR output (see below) or `None` on failure |
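The file-level keys in this table, together with the `skip_hashes` behaviour, can be sketched without any OCR dependency. `sha256_of_file` and `describe_files` are hypothetical helpers. The sketch excludes only `.txt` files, since that is the one exclusion stated here; the real agent also restricts itself to its supported file types. `ocr_resultado` is left as `None` because OCR is out of scope for this example.

```python
import hashlib
from pathlib import Path
from typing import Any, Dict, Iterator, Optional, Set


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large scans are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_files(directory: Path, skip_hashes: Optional[Set[str]] = None) -> Iterator[Dict[str, Any]]:
    """Yield one metadata dict per file, skipping .txt files and known hashes."""
    known = skip_hashes or set()
    for path in sorted(p for p in directory.rglob("*") if p.is_file()):
        if path.suffix.lower() == ".txt":
            continue  # email body files are not documents
        file_hash = sha256_of_file(path)
        if file_hash in known:
            continue  # already processed, so OCR would be wasted work
        yield {
            "archivo_path": path.resolve(),
            "archivo_nombre": path.name,
            "carpeta_origen": path.parent.name,
            "formato": path.suffix.lstrip(".").lower(),
            "tamano_bytes": path.stat().st_size,
            "hash_sha256": file_hash,
            "ocr_resultado": None,  # placeholder; OCR is not run in this sketch
        }
```

Hashing before OCR is what makes the skip cheap: the expensive model call is avoided for any file whose digest is already known.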
OCR result sub-dictionary (ocr_resultado)
| Key | Type | Description |
|---|---|---|
| `texto_completo` | `str` | Full extracted text as a single string |
| `json_ligero` | `dict` | Structured block representation optimised for LLM prompts |
| `confianza_promedio` | `float` | Average word-level confidence (0–1) |
| `paginas` | `int` | Number of pages / images processed |
| `idioma_detectado` | `str` | Detected language code |
| `palabras_detectadas` | `int` | Total word count across all pages |
The `json_ligero` field contains only the document’s structural blocks (lines, words, and bounding-box hints) rather than raw pixel data. Sending this compact representation to the LLM instead of `texto_completo` reduces token usage while preserving enough context for accurate classification.
ClassifierAgent — LLM document classification
Source: `src/agents/classifier_agent.py`

`ClassifierAgent` sends OCR output to a large language model and receives a structured JSON response identifying the document type, extracting key fields, and flagging whether the document is valid for storage.

Primary method

Accepts the dict produced by `OcrAgent` for a single file and returns that same dict enriched with a `clasificacion` key.

LLM input selection

The agent prefers `json_ligero` (compact block structure) over `texto_completo` when both are available, because it is more token-efficient. If neither contains usable content, the agent short-circuits and returns `valido=False` immediately; no LLM call is made for blank documents.

`clasificacion` output keys
| Key | Type | Description |
|---|---|---|
| `valido` | `bool` | Whether the document is a recognisable academic document |
| `tipo` | `TipoDocumento` | One of the 22 document type values (see Data Models) |
| `campos_extraidos` | `dict` | Structured fields pulled from the document (name, cédula, dates, etc.) |
| `confianza_clasificacion` | `float` | LLM-reported confidence score, 0–1 |
| `razon_rechazo` | `str \| None` | Human-readable rejection reason when `valido=False` |
| `modelo_llm` | `str` | Model identifier used for this classification |
| `tokens_usados` | `int` | Total tokens consumed by the request |
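The input-selection rule and the short-circuit for blank documents can be sketched as below. `select_llm_input` and `rejected` are hypothetical helpers. Serialising `json_ligero` with `json.dumps` and the placeholder values used for the remaining keys on rejection are assumptions; the source only states that `valido=False` is returned.

```python
import json
from typing import Any, Dict, Optional


def select_llm_input(ocr_resultado: Optional[Dict[str, Any]]) -> Optional[str]:
    """Prefer the compact block structure, fall back to plain text, else None."""
    if not ocr_resultado:
        return None
    json_ligero = ocr_resultado.get("json_ligero")
    if json_ligero:
        return json.dumps(json_ligero, ensure_ascii=False)
    texto = (ocr_resultado.get("texto_completo") or "").strip()
    return texto or None


def rejected(reason: str) -> Dict[str, Any]:
    """Build a clasificacion dict for a document that was not classified."""
    return {
        "valido": False,
        "tipo": None,  # placeholder values below are assumptions
        "campos_extraidos": {},
        "confianza_clasificacion": 0.0,
        "razon_rechazo": reason,
        "modelo_llm": None,
        "tokens_usados": 0,
    }


blank = {"json_ligero": {}, "texto_completo": "   "}
clasificacion = rejected("empty OCR output") if select_llm_input(blank) is None else None
```

Returning `None` from the selector gives the caller a single check to decide whether an LLM request is worth making.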
LLM temperature
The LLM is called with `temperature=0.1` to keep classification as consistent and reproducible as possible. Higher temperatures introduce unnecessary variability in document type predictions.

If the LLM service raises an exception (network error, rate limit, etc.), `ClassifierAgent` catches it and returns `valido=False` with the error message in `razon_rechazo`. The pipeline continues, and `StorageAgent` will skip the document gracefully.
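The error-handling behaviour described above amounts to wrapping the LLM call so that a failure becomes a rejection instead of an exception. The sketch below assumes a hypothetical `call_llm` callable that returns a `clasificacion` dict; the real service interface is not shown in this document.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]


def classify_safely(item: Payload, call_llm: Callable[[Payload], Dict[str, Any]]) -> Payload:
    """Return the item enriched with clasificacion, even when the LLM call fails."""
    try:
        clasificacion = call_llm(item)
    except Exception as exc:  # network errors, rate limits, malformed responses
        clasificacion = {"valido": False, "razon_rechazo": f"LLM error: {exc}"}
    return {**item, "clasificacion": clasificacion}


def failing_llm(item: Payload) -> Dict[str, Any]:
    """Hypothetical stand-in that simulates a provider outage."""
    raise RuntimeError("rate limit exceeded")


outcome = classify_safely({"archivo_nombre": "titulo.pdf"}, failing_llm)
```

Because the result still carries a `clasificacion` key, the next stage can apply its normal validity check without special-casing failures.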
StorageAgent — MongoDB persistence and file management
Source: `src/agents/storage_agent.py`

`StorageAgent` is the terminal stage of the pipeline. It takes the fully enriched result dict from `ClassifierAgent` and persists it to MongoDB, then moves the physical file to permanent storage under `data/storage/{cedula}/`.

Primary method

Returns `{"exito": bool, "accion": "insert" | "skip" | "error", "docente_id": str | None, "documento_id": str | None}`.

Seven-step processing flow

Step 1: Validate document

Checks `clasificacion.valido == True`. Documents flagged as invalid by the classifier are skipped with `accion: "skip"`.

Step 2: Extract and normalise cédula

The agent resolves the teacher’s national ID in priority order:

- `cedula_titular` field from `campos_extraidos`
- Derived from `numero_rif` (strips the check digit from Venezuelan RIF `V-XXXXXXXX-D`)
- MongoDB lookup by teacher folder name (exact match required to avoid ambiguity)
- Folder name used as a provisional identifier (flagged so it can be updated later)
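The priority order above can be sketched as a single resolver. All names here are hypothetical, the MongoDB lookup is represented by an injected callable, and returning the cédula as bare digits (without the `V-` prefix) is an assumption about the normalised form.

```python
import re
from typing import Any, Callable, Dict, Optional, Tuple

# Letter V, up to eight digits, then one check digit; hyphens are optional.
_RIF_PATTERN = re.compile(r"^V-?(\d{1,8})-?(\d)$")


def cedula_from_rif(numero_rif: str) -> Optional[str]:
    """Drop the prefix and the trailing check digit from a RIF like V-XXXXXXXX-D."""
    match = _RIF_PATTERN.match(numero_rif.strip().upper())
    return match.group(1) if match else None


def resolve_cedula(
    campos_extraidos: Dict[str, Any],
    carpeta_origen: str,
    lookup_by_folder: Callable[[str], Optional[str]],
) -> Tuple[str, bool]:
    """Return (identifier, is_provisional) following the documented priority."""
    cedula = campos_extraidos.get("cedula_titular")
    if cedula:
        return str(cedula).strip(), False
    rif = campos_extraidos.get("numero_rif")
    if rif:
        derived = cedula_from_rif(str(rif))
        if derived:
            return derived, False
    found = lookup_by_folder(carpeta_origen)  # stands in for the exact-match query
    if found:
        return found, False
    # Last resort: the folder name, flagged so it can be replaced later.
    return carpeta_origen, True
```

Returning the provisional flag alongside the identifier lets the caller record which records still need a real cédula.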
Step 3: Check for duplicates

Queries the `documentos` collection for an existing record with the same `hash_sha256`. Skips the document if found.

Step 4: Create or retrieve docente record

Looks up the `docentes` collection by cédula. If not found, creates a new record using `campos_extraidos` and the folder name as a fallback for the teacher’s name. If a provisional record exists under the folder name, it is upgraded with the real cédula. Before inserting the document, the file is optionally compressed (PDF via Ghostscript with `-dPDFSETTINGS=/ebook -r150x150`; images via Pillow JPEG `quality=85, optimize=True`). If the compressed file is not smaller than the original, the original is kept.

Step 5: Insert document in MongoDB

Constructs a `DocumentoModel`-compatible dict including `ArchivoInfo`, `OcrInfo`, `ValidacionDocumento`, and `MetadataDocumento`, then inserts it into the `documentos` collection.

Step 6: Update dossier completeness

Calls `MongoService.update_completitud(cedula)` to recalculate the teacher’s completeness percentage based on the documents now present in the collection.

Step 7: Move file to storage

Moves the file (compressed version if applicable) to `data/storage/{cedula}/`. If any MongoDB step fails before this point, the file is not moved, so the pipeline can retry it on the next cycle.

Agent Configuration via API
Per-agent runtime parameters are stored in MongoDB and exposed through the configuration REST API:

| Agent | Parameter | Default |
|---|---|---|
| `watcher` | `timeout_segundos` | 60 |
| `watcher` | `retry_veces` | 3 |
| `ocr` | `timeout_segundos` | 120 |
| `ocr` | `retry_veces` | 2 |
| `classifier` | `temperatura` | 0.7 |
| `classifier` | `max_tokens` | 2000 |
| `storage` | `timeout_segundos` | 30 |
Changes made through the API are applied on the next agent execution cycle without requiring a service restart. The MongoDB-backed configuration store makes it possible to tune agent behaviour from the web UI without touching environment variables or redeploying the application.
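A simple way to picture how stored values interact with the defaults in the table is an overlay: anything saved through the API replaces the default, and everything else falls back. The sketch below is an assumption about that behaviour, not the project's implementation; `DEFAULTS` mirrors the table, and `effective_config` is a hypothetical helper that receives the stored overrides as a plain mapping.

```python
from typing import Any, Dict, Mapping, Optional

# Defaults copied from the table above.
DEFAULTS: Dict[str, Dict[str, Any]] = {
    "watcher": {"timeout_segundos": 60, "retry_veces": 3},
    "ocr": {"timeout_segundos": 120, "retry_veces": 2},
    "classifier": {"temperatura": 0.7, "max_tokens": 2000},
    "storage": {"timeout_segundos": 30},
}


def effective_config(agent: str, stored: Optional[Mapping[str, Any]] = None) -> Dict[str, Any]:
    """Overlay stored overrides on the defaults, ignoring unknown parameters."""
    defaults = DEFAULTS[agent]
    overrides = {key: value for key, value in (stored or {}).items() if key in defaults}
    return {**defaults, **overrides}


# Example: an operator lowered the classifier temperature through the API.
classifier_config = effective_config("classifier", {"temperatura": 0.1})
```

If each agent calls a helper like this at the start of a cycle, a change saved in the store takes effect on the next run, which matches the no-restart behaviour described above.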