TheDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/vruizz22/innova-ai-engine/llms.txt
Use this file to discover all available pages before exploring further.
guideIngest worker is the first stage of the v9 guides pipeline. When a teacher uploads a math worksheet PDF, the backend enqueues a message to guide-ingest-queue. This worker picks it up and orchestrates a full multi-step extraction pipeline: a cheap Gemini Flash precheck validates the PDF’s quality, pypdfium2 renders pages to images and extracts figure regions, the PDF is sliced into overlapping chunks, and Claude Sonnet runs a forced extract_guide tool call per chunk to produce structured questions. All chunks are then merged, deduplicated, figure crops are stored in S3, a .tex render is persisted, and each question is published downstream to the solution-generation-queue so the solutionGenerator worker can build the answer key.
Lambda configuration
Handler
src.pipeline.guide_ingest_worker.handlerTrigger
SQS
guide-ingest-queue — ARN via SQS_GUIDE_INGEST_ARNBatch size
1 — one guide per invocation (heavy multi-call PDF job)
Resources
Timeout: 600 s · Memory: 2048 MB
Batch size is intentionally 1. Each guide triggers multiple Gemini and Claude API calls plus in-memory PDF rendering with
pypdfium2, so a larger batch would risk Lambda timeout and memory exhaustion.SQS message schema
The backend publishes aGuideIngestMessage to guide-ingest-queue. The worker deserializes it with Pydantic before starting extraction.
UUID of the
Guide record in Postgres. Used as the primary key for all S3 paths and DB writes.S3 object key pointing to the uploaded worksheet PDF inside
S3_GUIDES_BUCKET.Curriculum grade level of the course the guide belongs to. Passed to the Sonnet extraction prompt to calibrate difficulty expectations.
Optional correlation ID forwarded as an SQS message attribute to the next worker. Bound to the structlog context via
bind_trace_id for end-to-end log tracing.Extraction pipeline
Mark guide as EXTRACTING
The repository writes the guide’s status to
EXTRACTING in Postgres before any work begins, preventing duplicate processing if the message is retried.Gemini Precheck (GeminiPrecheck)
The full PDF bytes are sent to Gemini Flash with a structured JSON prompt. The model returns:If
quality is below GUIDE_MIN_EXTRACTION_QUALITY (default 0.5), the guide is immediately marked EXTRACTION_FAILED in Postgres — no Sonnet tokens are spent. The failure reason is taken from notes if present, otherwise defaults to "PDF quality {score:.2f} below threshold — rescan at 300dpi and re-upload.".PDF rendering (Pypdfium2Processor)
Pypdfium2Processor opens the PDF in memory using pypdfium2. It provides three operations used throughout the pipeline:page_count(pdf_bytes)— returns total page count for chunk range computation.slice_pages(pdf_bytes, start, end)— produces a new in-memory PDF containing only the[start, end)page range, sent to Sonnet as adocumentcontent block.crop_figure(pdf_bytes, bbox)— renders a single page at2×scale (~144 dpi) withpypdfium2and crops the bounding box region to a PNG using Pillow.
Chunking
The PDF is split into overlapping page ranges using
The step size is
compute_chunk_ranges:| Parameter | Env var | Default |
|---|---|---|
chunk_size | GUIDE_INGEST_CHUNK_PAGES | 20 |
overlap | GUIDE_INGEST_CHUNK_OVERLAP | 1 |
chunk_size - overlap, so consecutive chunks share overlap pages. This ensures a question straddling a chunk boundary is seen whole in at least one chunk. After extraction, page indices in figure bounding boxes are offset by the chunk’s start page so they point to absolute positions in the source PDF.Sonnet extraction (SonnetExtractor)
For each chunk, The system prompt is marked
SonnetExtractor sends the sliced PDF as a native Anthropic document content block — not base64 images — together with the grade level and page count. Claude Sonnet 4.6 is forced to call the extract_guide tool, returning a structured ExtractGuideResult:cache_control: ephemeral so it is prompt-cached across all chunk calls for the same guide, reducing cost and latency. Temperature is fixed at 0.0 for deterministic extraction.Merge and deduplication
All per-chunk
ExtractGuideResult lists are merged by merge_chunks. Questions flagged continues_previous=True are appended to the last question of the preceding chunk rather than inserted as a new entry, preventing duplicates at chunk boundaries. Each surviving question receives a monotonically increasing sequence number starting from 1.Figure cropping and S3 upload
For every The S3 key is appended to the question’s
FigureBBox on each merged question, Pypdfium2Processor.crop_figure renders and crops the region. The PNG is uploaded to:figure_keys list, which is later stored in Postgres alongside the question record.LaTeX render and S3 upload
The full list of merged questions is passed to The
render_guide_tex, which produces a .tex file. It is uploaded to:latex_key is stored in the guide’s Postgres record.Persist questions to Postgres
repo.complete(...) writes all MergedQuestion records to the guide_questions table and updates the guide row with source_kind, pages, latex_key, extraction_confidence, and extraction_model (claude-sonnet-4-6).Outcome schema
The handler logs the outcome of every record and returns it in the Lambda response.status | Meaning |
|---|---|
GENERATING_SOLUTIONS | Extraction succeeded; questions persisted; downstream message published |
EXTRACTION_FAILED | Precheck quality below threshold; guide marked failed; no Sonnet call made |
Extraction tuning environment variables
Maximum number of PDF pages per Sonnet chunk. Larger values reduce the number of API calls but increase token usage per call. Must be positive.
Number of pages shared between consecutive chunks. Prevents questions that span a page boundary from being split across two calls. Set to
0 to disable overlap.Minimum Gemini precheck quality score (range
0.0–1.0) required to proceed with Sonnet extraction. PDFs scoring below this threshold are rejected immediately and the guide is marked EXTRACTION_FAILED. Raise this value to enforce higher-quality uploads; lower it to accept degraded scans.Error handling and partial batch failure
The worker implements the SQSReportBatchItemFailures protocol. Each record is processed independently inside a try/except block. On failure, the record’s messageId is added to batchItemFailures, leaving that message in the queue for SQS to retry (subject to the queue’s visibility timeout and DLQ configuration). Successfully processed records are not retried.
Two distinct failure modes are handled:
| Exception | Behavior |
|---|---|
PausedError | SSM killswitch active — record returned to queue silently; no metric emitted |
| Any other exception | Record returned to queue; M_EXTRACTION_FAILED metric emitted via CloudWatch |
SSM killswitch
The killswitch is checked twice per guide — once insideGeminiPrecheck.precheck and once inside SonnetExtractor.extract_chunk — so it takes effect even mid-extraction.
SSM Parameter Store path. Set the parameter value to any truthy string to pause the worker.
Observability
| Signal | Detail |
|---|---|
M_EXTRACTION_FAILED | CloudWatch custom metric; emitted as Count: 1 on any non-PausedError exception |
M_INGEST_COST_USD | CloudWatch custom metric; emitted on successful extraction with the computed USD cost of all Sonnet tokens used |
| Structured logs | guide_precheck_done, guide_extract_chunk_done, guide_ingested, guide_extraction_failed — all include guide_id and trace_id |
| Token / cost accounting | Per-invocation TokenUsage is accumulated across all chunks; total input/output/cache tokens and cost in USD are written to the cost_events table in Postgres via repo.save_cost_event |