Four autonomous agents: IMAP, OCR, LLM, and storage

Each of the four pipeline stages is implemented as a standalone Python class living under src/agents/. Every agent receives its dependencies through constructor injection (__init__), making them easy to unit-test and swap out in isolation. When the pipeline runs end-to-end, the output dict of each agent is passed directly as the input to the next, accumulating keys as it travels through the chain.

All agent classes follow the same convention: they expose a single primary method (run, process_directory, classify, or process) that accepts a data payload and returns a dictionary (or, in the case of OcrAgent.process_directory, a list of dictionaries, one per file). This uniformity is what makes composing them into a pipeline straightforward.
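The chaining described above can be sketched as follows. This is a minimal illustration, not the project's actual orchestrator: run_pipeline is a hypothetical helper, and the only things taken from this page are the three method names and the fact that each stage's output feeds the next.

```python
from pathlib import Path
from typing import Any, Optional


def run_pipeline(
    ocr_agent: Any,
    classifier_agent: Any,
    storage_agent: Any,
    directory: Optional[Path] = None,
) -> list[dict]:
    """Chain the three downstream agents over every file found by OCR.

    Hypothetical helper: only the method names come from the documentation.
    """
    outcomes = []
    # OcrAgent yields one dict per file; each dict then flows down the chain.
    for ocr_result in ocr_agent.process_directory(directory):
        classified = classifier_agent.classify(ocr_result)
        outcomes.append(storage_agent.process(classified))
    return outcomes
```

Because the agents are passed in as arguments, any of them can be replaced by a stub in a unit test.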

Agent Reference

WatcherAgent — IMAP email monitor

Source: src/agents/watcher_agent.py

WatcherAgent runs in an infinite polling loop, connecting to an IMAP mailbox over SSL on each cycle, searching for emails that carry academic dossier attachments, and downloading those attachments to a per-teacher input folder.

IMAP connection

The agent connects with imaplib.IMAP4_SSL using credentials and host from environment variables. On each cycle it selects the configured folder (MAIL_FOLDER, default INBOX), searches for matching emails, downloads them, then disconnects cleanly before sleeping until the next poll interval.
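A single polling cycle might look roughly like the sketch below. It assumes the environment variables listed under Configuration parameters; poll_once, the imap_factory parameter, and the hard-coded SUBJECT search are illustrative choices, not the agent's real code.

```python
import imaplib
import os


def poll_once(imap_factory=imaplib.IMAP4_SSL) -> list[bytes]:
    """Run one polling cycle and return the ids of matching messages.

    Hypothetical helper: imap_factory exists only so tests can pass a fake.
    """
    host = os.environ["MAIL_HOST"]
    user = os.environ["MAIL_USER"]
    password = os.environ["MAIL_PASS"]
    folder = os.environ.get("MAIL_FOLDER", "INBOX")

    conn = imap_factory(host)
    try:
        conn.login(user, password)
        conn.select(folder)
        # Simplified: a single standard SUBJECT search (tier 2 style).
        status, data = conn.search(None, "SUBJECT", '"Expediente Docente"')
        if status != "OK" or not data or not data[0]:
            return []
        return data[0].split()
    finally:
        # Disconnect cleanly even when login or search failed.
        try:
            conn.close()
        except imaplib.IMAP4.error:
            pass  # no mailbox was selected, nothing to close
        conn.logout()
```

The caller would then sleep for the poll interval and call poll_once again.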

Three-tier email search

Because different IMAP providers handle keyword search differently, the agent tries three strategies in order:

Tier	Method	Notes
1	Gmail X-GM-RAW query with `subject:`, body terms, and `has:attachment`	Fastest; only works on Gmail
2	Standard IMAP `SUBJECT` / `BODY` search for each ASCII keyword variant	Works on all providers; skips non-ASCII keywords
3	Fallback `SINCE` search over the last 7 days with local keyword filtering	Catches accented subjects (e.g., `Currículum`) that fail IMAP keyword search

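Accent-insensitive matching is applied locally via unicodedata.normalize("NFKD"), which decomposes accented characters so their combining marks can be dropped. As a result, accented and unaccented spellings (for example, Currículum and Curriculum) are treated as equivalent.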
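The local matching step can be illustrated with a short sketch. The function names fold and subject_matches are hypothetical, and the case-insensitive comparison is an assumption; only the use of NFKD normalisation comes from the description above.

```python
import unicodedata


def fold(text: str) -> str:
    """Return text with accents removed and case folded."""
    decomposed = unicodedata.normalize("NFKD", text)
    # After NFKD, "í" becomes "i" plus a combining acute accent; drop the mark.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()  # case-insensitivity is an assumption


def subject_matches(subject: str, keywords: list[str]) -> bool:
    """True when any keyword appears in the subject, ignoring accents."""
    folded_subject = fold(subject)
    return any(fold(keyword) in folded_subject for keyword in keywords)
```

For example, subject_matches("Currículum - Ana", ["curriculum"]) evaluates to True.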

Teacher name extraction

The teacher’s name is parsed from the email subject using the pattern:

{SUBJECT_KEYWORD} - Teacher Name

Supported separators: -, –, —, :. The extracted name becomes the subdirectory name under data/input/ (e.g., data/input/Juan_Perez/).
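A regular expression along these lines could implement the pattern above. This is a sketch under assumptions: extract_teacher_folder is a hypothetical name, and replacing whitespace with underscores is inferred from the Juan_Perez example rather than confirmed by the source.

```python
import re
from typing import Optional

# The four separators listed above: hyphen, en dash, em dash, colon.
_SEPARATORS = "-\u2013\u2014:"


def extract_teacher_folder(
    subject: str, keyword: str = "Expediente Docente"
) -> Optional[str]:
    """Return a folder-safe teacher name, or None if the subject does not match."""
    pattern = re.compile(
        rf"{re.escape(keyword)}\s*[{re.escape(_SEPARATORS)}]\s*(?P<name>.+)$",
        re.IGNORECASE,
    )
    match = pattern.search(subject.strip())
    if match is None:
        return None
    name = match.group("name").strip()
    # Assumption: spaces become underscores, as in data/input/Juan_Perez/.
    return re.sub(r"\s+", "_", name) or None
```

Subjects without a name after the separator return None, so the caller can decide how to handle them.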

Accepted attachment formats

ATTACHMENT_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}

Any attachment whose extension is not in this set is silently ignored. At least one qualifying attachment must be present or the email is discarded.
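The filtering rule can be sketched in a few lines. The helper name qualifying_attachments is hypothetical, and comparing extensions case-insensitively is an assumption not stated above.

```python
from pathlib import PurePath

ATTACHMENT_EXTENSIONS = {".pdf", ".jpg", ".jpeg"}


def qualifying_attachments(filenames: list[str]) -> list[str]:
    """Keep only filenames whose extension is in ATTACHMENT_EXTENSIONS."""
    return [
        name
        for name in filenames
        # Assumption: ".PDF" and ".pdf" are treated alike.
        if PurePath(name).suffix.lower() in ATTACHMENT_EXTENSIONS
    ]
```

An email would be discarded when this function returns an empty list for its attachments.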

Configuration parameters

Variable	Default	Description
`MAIL_HOST`	—	IMAP server hostname
`MAIL_USER`	—	Mailbox username / email address
`MAIL_PASS`	—	Mailbox password or app password
`MAIL_FOLDER`	`INBOX`	Folder to monitor
`POLL_INTERVAL_SECONDS`	from `config`	Seconds between polling cycles
`SUBJECT_KEYWORD`	`Expediente Docente`	Comma-separated subject keywords
`BODY_KEYWORD`	(empty)	Comma-separated body keywords

For Gmail accounts, generate an App Password under Google Account → Security → 2-Step Verification → App passwords. The standard account password will not work with IMAP when 2FA is enabled.
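One way to read the variables from the table above is sketched below. WatcherSettings and load_watcher_settings are hypothetical names; POLL_INTERVAL_SECONDS is left out because its default comes from the project's config module, which is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _split_csv(raw: str) -> list[str]:
    """Split a comma-separated value, dropping empty entries."""
    return [part.strip() for part in raw.split(",") if part.strip()]


@dataclass(frozen=True)
class WatcherSettings:
    host: str
    user: str
    password: str
    folder: str
    subject_keywords: list
    body_keywords: list


def load_watcher_settings(env: Mapping[str, str] = os.environ) -> WatcherSettings:
    """Build settings from the environment; required variables raise KeyError."""
    return WatcherSettings(
        host=env["MAIL_HOST"],
        user=env["MAIL_USER"],
        password=env["MAIL_PASS"],
        folder=env.get("MAIL_FOLDER", "INBOX"),
        subject_keywords=_split_csv(env.get("SUBJECT_KEYWORD", "Expediente Docente")),
        body_keywords=_split_csv(env.get("BODY_KEYWORD", "")),
    )
```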

OcrAgent — document text extraction

Source: src/agents/ocr_agent.py

OcrAgent wraps the OcrService (which in turn uses python-doctr[torch]) to extract text from every supported file found under data/input/. The docTR model is approximately 500 MB and is loaded once at service startup to avoid reloading it on every invocation.

Primary method

OcrAgent.process_directory(
    directory: Path | None = None,
    skip_hashes: set[str] | None = None,
) -> list[dict]

directory — root directory to scan; defaults to config.INPUT_DIR (data/input/)
skip_hashes — set of SHA-256 hex strings; files whose hash is in this set are skipped without running OCR

The method iterates over every subdirectory of the root (one per teacher) and processes each qualifying file inside. Results are returned as a flat list — one dict per processed file.

Supported file types

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}

.txt email body files saved by WatcherAgent are intentionally excluded.

Per-file result dictionary

Each dict returned by process_directory contains:

Key	Type	Description
`archivo_path`	`Path`	Absolute path to the source file
`archivo_nombre`	`str`	Filename
`carpeta_origen`	`str`	Name of the subdirectory (teacher folder)
`formato`	`str`	File extension without dot (e.g., `pdf`)
`tamano_bytes`	`int`	Raw file size in bytes
`hash_sha256`	`str`	SHA-256 hex digest of file contents
`ocr_resultado`	`dict | None`	OCR output (see below) or `None` on failure
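The directory scan, hash-based skipping, and file metadata described above can be sketched as follows. iter_pending_files and sha256_of are hypothetical helpers; the OCR call itself is omitted, so the ocr_resultado key is not produced here.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Optional

SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png"}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def iter_pending_files(
    root: Path, skip_hashes: Optional[set] = None
) -> Iterator[dict]:
    """Yield metadata for each supported file that has not been seen before."""
    skip = skip_hashes or set()
    for teacher_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for file_path in sorted(teacher_dir.iterdir()):
            if not file_path.is_file():
                continue
            if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
                continue  # e.g. the .txt email bodies
            file_hash = sha256_of(file_path)
            if file_hash in skip:
                continue  # already processed, skip before any OCR work
            yield {
                "archivo_path": file_path.resolve(),
                "archivo_nombre": file_path.name,
                "carpeta_origen": teacher_dir.name,
                "formato": file_path.suffix.lstrip(".").lower(),
                "tamano_bytes": file_path.stat().st_size,
                "hash_sha256": file_hash,
            }
```

Checking the hash before OCR is what makes the skip cheap: the expensive model is never invoked for a file that is already known.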

OCR result sub-dictionary (`ocr_resultado`)

Key	Type	Description
`texto_completo`	`str`	Full extracted text as a single string
`json_ligero`	`dict`	Structured block representation optimised for LLM prompts
`confianza_promedio`	`float`	Average word-level confidence (0–1)
`paginas`	`int`	Number of pages / images processed
`idioma_detectado`	`str`	Detected language code
`palabras_detectadas`	`int`	Total word count across all pages

The json_ligero field contains only the document’s structural blocks — lines, words, and bounding-box hints — rather than raw pixel data. Sending this compact representation to the LLM instead of texto_completo reduces token usage while preserving enough context for accurate classification.

ClassifierAgent — LLM document classification

Source: src/agents/classifier_agent.py

ClassifierAgent sends OCR output to a large language model and receives a structured JSON response identifying the document type, extracting key fields, and flagging whether the document is valid for storage.

Primary method

ClassifierAgent.classify(ocr_result: dict) -> dict

Accepts the dict produced by OcrAgent for a single file and returns that same dict enriched with a clasificacion key.

LLM input selection

The agent prefers json_ligero (compact block structure) over texto_completo when both are available, because it is more token-efficient. If neither contains usable content, the agent short-circuits and returns valido=False immediately — no LLM call is made for blank documents.
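The selection and short-circuit logic might be sketched like this. select_llm_input and classify_or_short_circuit are hypothetical names, call_llm stands in for the real LLM service, and the rejection message text is invented for illustration.

```python
import json
from typing import Callable, Optional


def select_llm_input(ocr_result: dict) -> Optional[str]:
    """Prefer json_ligero, fall back to texto_completo, else return None."""
    ocr = ocr_result.get("ocr_resultado") or {}
    ligero = ocr.get("json_ligero")
    if ligero:
        return json.dumps(ligero, ensure_ascii=False)
    texto = (ocr.get("texto_completo") or "").strip()
    return texto or None


def classify_or_short_circuit(
    ocr_result: dict, call_llm: Callable[[str], dict]
) -> dict:
    """Skip the LLM call entirely when there is nothing to classify."""
    payload = select_llm_input(ocr_result)
    if payload is None:
        clasificacion = {
            "valido": False,
            "razon_rechazo": "empty OCR output",  # illustrative message
        }
    else:
        clasificacion = call_llm(payload)
    return {**ocr_result, "clasificacion": clasificacion}
```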

`clasificacion` output keys

Key	Type	Description
`valido`	`bool`	Whether the document is a recognisable academic document
`tipo`	`TipoDocumento`	One of the 22 document type values (see Data Models)
`campos_extraidos`	`dict`	Structured fields pulled from the document (name, cédula, dates, etc.)
`confianza_clasificacion`	`float`	LLM-reported confidence score, 0–1
`razon_rechazo`	`str | None`	Human-readable rejection reason when `valido=False`
`modelo_llm`	`str`	Model identifier used for this classification
`tokens_usados`	`int`	Total tokens consumed by the request

LLM temperature

The LLM is called with temperature=0.1 so that classification is close to deterministic and largely reproducible across runs. Higher temperatures introduce unnecessary variability in document type predictions.

If the LLM service raises an exception (network error, rate limit, etc.), ClassifierAgent catches it and returns valido=False with the error message in razon_rechazo. The pipeline continues — StorageAgent will skip the document gracefully.
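The error-handling behaviour can be illustrated with a small wrapper. safe_classify is a hypothetical name and call_llm stands in for the real LLM service; only the outcome (valido=False with the error text in razon_rechazo) is taken from the description above.

```python
from typing import Callable


def safe_classify(ocr_result: dict, call_llm: Callable[[dict], dict]) -> dict:
    """Classify a document, turning any LLM failure into a rejection."""
    try:
        clasificacion = call_llm(ocr_result)
    except Exception as exc:  # network error, rate limit, malformed reply, ...
        clasificacion = {"valido": False, "razon_rechazo": str(exc)}
    return {**ocr_result, "clasificacion": clasificacion}
```

Because the result always carries a clasificacion key, the next stage never has to distinguish between a rejected document and a failed call.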

StorageAgent — MongoDB persistence and file management

Source: src/agents/storage_agent.py

StorageAgent is the terminal stage of the pipeline. It takes the fully enriched result dict from ClassifierAgent and persists it to MongoDB, then moves the physical file to permanent storage under data/storage/{cedula}/.

Primary method

StorageAgent.process(classified_result: dict) -> dict

Returns {"exito": bool, "accion": "insert" | "skip" | "error", "docente_id": str | None, "documento_id": str | None}.

Seven-step processing flow

Step 1 — Validate document
Checks clasificacion.valido == True. Documents flagged as invalid by the classifier are skipped with accion: "skip".

Step 2 — Extract and normalise cédula
The agent resolves the teacher’s national ID in priority order:

cedula_titular field from campos_extraidos
Derived from numero_rif (strips the check digit from Venezuelan RIF V-XXXXXXXX-D)
MongoDB lookup by teacher folder name (exact match required to avoid ambiguity)
Folder name used as a provisional identifier (flagged so it can be updated later)
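The priority order above can be sketched as a chain of fallbacks. resolve_cedula, cedula_from_rif, and the lookup_by_folder callback are hypothetical; the RIF pattern assumes the V-XXXXXXXX-D layout mentioned above, and no further normalisation of the cédula is attempted.

```python
import re
from typing import Callable, Optional

# Assumed layout: letter V, digits, then a single check digit after a hyphen.
_RIF_RE = re.compile(r"^\s*V-?(\d+)-(\d)\s*$", re.IGNORECASE)


def cedula_from_rif(rif: Optional[str]) -> Optional[str]:
    """Strip the prefix and check digit from a RIF such as V-12345678-9."""
    match = _RIF_RE.match(rif or "")
    return match.group(1) if match else None


def resolve_cedula(
    campos: dict,
    folder_name: str,
    lookup_by_folder: Callable[[str], Optional[str]],
) -> tuple:
    """Return (identifier, is_provisional) following the four-step priority."""
    cedula = campos.get("cedula_titular")
    if cedula:
        return str(cedula), False
    derived = cedula_from_rif(campos.get("numero_rif"))
    if derived:
        return derived, False
    found = lookup_by_folder(folder_name)  # exact-match lookup in MongoDB
    if found:
        return found, False
    # Last resort: the folder name, flagged so it can be upgraded later.
    return folder_name, True
```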

Step 3 — Duplicate hash check
Queries MongoDB documentos collection for an existing record with the same hash_sha256. Skips the document if found.

Step 4 — Create or retrieve docente record
Looks up the docentes collection by cédula. If not found, creates a new record using campos_extraidos and the folder name as a fallback for the teacher’s name. If a provisional record exists under the folder name, it is upgraded with the real cédula. Before inserting the document, the file is optionally compressed (PDF via Ghostscript with -dPDFSETTINGS=/ebook -r150x150; images via Pillow JPEG quality=85, optimize=True). If the compressed file is not smaller than the original, the original is kept.

Step 5 — Insert document in MongoDB
Constructs a DocumentoModel-compatible dict including ArchivoInfo, OcrInfo, ValidacionDocumento, and MetadataDocumento, then inserts it into the documentos collection.

Step 6 — Update dossier completeness
Calls MongoService.update_completitud(cedula) to recalculate the teacher’s completeness percentage based on the documents now present in the collection.

Step 7 — Move file to storage
Moves the file (compressed version if applicable) to data/storage/{cedula}/. If any MongoDB step fails before this point, the file is not moved, so the pipeline can retry it on the next cycle.

The file-move-last ordering is an intentional safety guarantee: if MongoDB is unavailable, the source file remains in data/input/ and will be picked up again on the next pipeline execution without any manual intervention.
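The move-last guarantee can be demonstrated with a small sketch. persist_then_move is a hypothetical helper, and insert_document stands in for the MongoDB steps; it is assumed to return a (docente_id, documento_id) pair or raise on failure.

```python
import shutil
from pathlib import Path
from typing import Callable


def persist_then_move(
    source: Path,
    storage_root: Path,
    cedula: str,
    insert_document: Callable[[], tuple],
) -> dict:
    """Persist first, move second, so a database failure leaves the file in place."""
    try:
        docente_id, documento_id = insert_document()
    except Exception:
        # The file stays in the input folder and is retried on the next run.
        return {
            "exito": False,
            "accion": "error",
            "docente_id": None,
            "documento_id": None,
        }
    target_dir = storage_root / cedula
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target_dir / source.name))
    return {
        "exito": True,
        "accion": "insert",
        "docente_id": docente_id,
        "documento_id": documento_id,
    }
```

Reversing the order would risk a file sitting in permanent storage with no database record pointing at it.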

Agent Configuration via API

Per-agent runtime parameters are stored in MongoDB and exposed through the configuration REST API:

GET  /config/agentes   # retrieve all agent configs
PUT  /config/agentes   # update all agent configs (full upsert)

Default values shipped with the system:

Agent	Parameter	Default
`watcher`	`timeout_segundos`	`60`
`watcher`	`retry_veces`	`3`
`ocr`	`timeout_segundos`	`120`
`ocr`	`retry_veces`	`2`
`classifier`	`temperatura`	`0.7`
`classifier`	`max_tokens`	`2000`
`storage`	`timeout_segundos`	`30`

Changes made through the API are applied on the next agent execution cycle without requiring a service restart. The MongoDB-backed configuration store makes it possible to tune agent behaviour from the web UI without touching environment variables or redeploying the application.
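As a rough illustration, a client could prepare an update request as shown below. Only the PUT /config/agentes path comes from this page: the base URL, the build_config_update helper, and the nested payload shape (agent name mapped to its parameters) are assumptions, so check the actual API schema before relying on them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # hypothetical; use your deployment's address


def build_config_update(agent_configs: dict) -> urllib.request.Request:
    """Prepare a PUT /config/agentes request without sending it."""
    body = json.dumps(agent_configs).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/config/agentes",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed payload shape: one entry per agent, using parameter names from the table.
request = build_config_update(
    {
        "ocr": {"timeout_segundos": 120, "retry_veces": 2},
        "classifier": {"temperatura": 0.1, "max_tokens": 2000},
    }
)
# Sending it would be: urllib.request.urlopen(request)
```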
