Chilean elementary school students do mathematics by hand in paper notebooks. Requiring digital input would lose adoption, so Innova supports a “Subir foto” (upload photo) mode where students photograph their handwritten work and submit it directly. The OCR Vision Pipeline converts those images into structured LaTeX step sequences that the downstream error-classification pipeline can process. To balance cost and accuracy the pipeline uses a dual-model strategy: Gemini 2.5 Flash handles all requests first (free tier covers the pilot; ~10× cheaper than Claude at scale per ADR-004), and only escalates to Claude vision when the primary model’s confidence score falls below a configurable threshold. The entire strategy is hidden behind theDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/vruizz22/innova-ai-engine/llms.txt
Use this file to discover all available pages before exploring further.
MathOCRPort adapter interface, so either backend can be swapped or extended without touching orchestration logic.
Orchestration Flow
TheOcrOrchestrator is the single entry point for all image extraction. Its extract method implements the dual-model strategy:
Call Gemini (primary)
GeminiAdapter.extract sends the image bytes to gemini-2.5-flash via the async Google GenAI SDK (client.aio.models.generate_content). Gemini returns a JSON payload containing latex_steps, final_answer, overall_confidence, and an optional topic_hint.Evaluate confidence
If
primary.overall_confidence ≥ threshold (default 0.7, controlled by the OCR_CONFIDENCE_THRESHOLD environment variable), the Gemini result is returned immediately. No Claude call is made.Escalate to Claude (fallback)
If Gemini’s confidence is below the threshold,
ClaudeAdapter.extract is called with the same image bytes. Claude receives the image as a base64-encoded JPEG alongside a compact instruction prompt.Both
GeminiAdapter and ClaudeAdapter implement the MathOCRPort protocol, so the orchestrator handles them identically. If a parse failure occurs (malformed JSON response from either model), the adapter returns an OcrResult with an empty latex_steps list and overall_confidence = 0.0 rather than raising an exception.The MathOCRPort Adapter Interface
The adapter pattern (Clean Architecture port) decouples the orchestration logic from any specific vision model. Adding a new OCR backend (e.g. a self-hosted LaTeX-OCR model) requires only implementing this protocol:
GeminiAdapter and ClaudeAdapter satisfy MathOCRPort structurally (Python structural subtyping via runtime_checkable). The orchestrator instantiates concrete adapters directly, but any conforming implementation can be substituted.
Model Details
Gemini 2.5 Flash (Primary)
client.aio.models.generate_content) with genai_types.Part.from_bytes to pass the JPEG image as a multimodal content part alongside the text prompt. The model name is read from the GEMINI_MODEL environment variable, defaulting to gemini-2.5-flash.
Claude Haiku (Fallback)
The Claude adapter usesclaude-haiku-4-5-20251001 with the Anthropic Messages API. The image is base64-encoded and sent as a {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", ...}} content block followed by a text extraction prompt. Claude extracts a JSON fragment by scanning the response text for the outermost {...} boundaries.
The OcrResult Schema
Ordered list of LaTeX strings representing each step of the student’s solution as transcribed from the image. An empty list signals a complete parse failure. These steps are passed directly to the error-classification pipeline as
raw_steps.Model-self-reported confidence that the transcription is accurate, in
[0.0, 1.0]. This is the value compared against OCR_CONFIDENCE_THRESHOLD to decide whether escalation is needed. A value of 0.0 indicates a failed parse.Which model produced this result. One of
"GEMINI" or "CLAUDE". Stored in the database for observability and cost tracking.An optional short label (e.g.
"subtraction_borrow") that the OCR model inferred from visual context. Can pre-populate the domain routing step in the downstream classifier; not guaranteed to be present or accurate.Estimated USD cost of the API call that produced this result. Set to
0.0 by the Gemini adapter (free tier during pilot) and not yet populated by the Claude adapter. Used by the observability layer for cost accounting.Confidence Threshold Configuration
The escalation threshold is set via theOCR_CONFIDENCE_THRESHOLD environment variable (loaded by pydantic-settings):
| Setting | Behaviour |
|---|---|
OCR_CONFIDENCE_THRESHOLD=0.7 (default) | Escalate to Claude when Gemini confidence < 0.7 |
OCR_CONFIDENCE_THRESHOLD=1.0 | Always escalate (Gemini never returns exactly 1.0 in practice) |
OCR_CONFIDENCE_THRESHOLD=0.0 | Never escalate — always accept Gemini result |
Cost context (ADR-004): During the pilot, Gemini’s free tier (15 RPM, 1M TPM) covers all OCR at zero cost. At production scale with 1,000 daily submissions, Gemini costs ~1,000/month for Claude vision — a ~10× difference. The adapter pattern ensures the cost-optimal provider is always primary with the higher-quality provider available as a fallback.
Output Format Example
A successful extraction from a 4th-grade subtraction problem might return:raw_steps. The topic_hint may assist domain routing; in this example it correctly identifies the subtraction-with-borrowing domain.