Overview
ThinkEx provides powerful PDF processing capabilities using Azure Mistral Document AI for OCR (Optical Character Recognition). The system supports:
- Large PDF handling: Up to 100MB files
- Batch processing: Automatic chunking for parallel OCR
- Rich extraction: Text, tables, headers, footers, hyperlinks, and images
- Markdown output: Structured markdown with preserved formatting
- Durable workflows: Long-running OCR with status polling
- Password protection detection: Client-side validation before upload
OCR Endpoints
Start Durable OCR Workflow
Recommended for large PDFs or when you need non-blocking processing.
Parameters:
- fileUrl: URL of the PDF file to process. Must be from an allowed host (Supabase, app domain, or localhost in development)
- Client-provided ID for tracking this OCR operation (typically the database item ID)
Response:
- Workflow run ID for polling status
- Echo of the provided item ID
Poll OCR Status
Check the status of a running OCR workflow.
Parameters:
- Workflow run ID returned from /api/pdf/ocr/start
Response:
- Current status: running, completed, or failed
- OCR results (only present when status is completed)
  - Full extracted text from all pages, joined with double newlines
  - ocrPages: Array of page-level OCR data with rich metadata
- Error message (only present when status is failed)
Direct OCR Processing
Synchronous OCR for smaller PDFs (blocks until complete).
Parameters:
- fileUrl: URL of the PDF file to process (max 100MB)
Response:
- Full extracted text from all pages
- ocrPages: Array of page objects with extracted data
OCR Page Schema
Each page in the ocrPages array contains:
- Zero-based page number
- Extracted text in Markdown format with preserved structure
- images: Embedded images with base64 data and unique IDs. Format: [{ id: "p0-img-0", ... }]
- Detected tables with structure information
- Page header text (if detected)
- Page footer text (if detected)
- Detected hyperlinks with URL and position data
Client SDK
Upload and OCR (Blocking)
Upload and OCR (Non-blocking)
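A minimal sketch of the polling half of the non-blocking flow, not the ThinkEx SDK itself. It assumes the caller supplies a `fetchStatus` function that wraps the status request; `pollOcr`, `OcrStatusResponse`, and the `result` and `error` field names are hypothetical. Only the three status values come from the endpoint description.

```typescript
// Status values as listed under "Poll OCR Status".
type OcrStatus = "running" | "completed" | "failed";

// Hypothetical response shape; the real field names may differ.
interface OcrStatusResponse {
  status: OcrStatus;
  result?: unknown;
  error?: string;
}

// Polls until the workflow leaves the "running" state or attempts run out.
// `fetchStatus` is injected so this sketch does not assume a status URL.
async function pollOcr(
  fetchStatus: () => Promise<OcrStatusResponse>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<OcrStatusResponse> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetchStatus();
    if (response.status !== "running") {
      return response;
    }
    // Wait before asking again so the server is not hammered.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`OCR still running after ${maxAttempts} polls`);
}
```

Capping the number of attempts keeps a stuck workflow from polling forever; the interval and cap shown here are placeholders to tune against real processing times.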
Password Protection Detection
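One common client-side heuristic, shown here as a sketch and not necessarily how ThinkEx detects protection: an encrypted PDF declares an `/Encrypt` entry in its trailer dictionary, so scanning the raw bytes for that token can flag a file before upload. The function name `looksEncrypted` is hypothetical.

```typescript
// Heuristic pre-check: returns true if the bytes contain the "/Encrypt" token.
// This can give false positives (the token may appear in ordinary content),
// and an encrypted file does not always require a password to open.
function looksEncrypted(pdfBytes: Uint8Array): boolean {
  const needle = "/Encrypt";
  outer: for (let i = 0; i + needle.length <= pdfBytes.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (pdfBytes[i + j] !== needle.charCodeAt(j)) {
        continue outer; // mismatch, try the next start offset
      }
    }
    return true;
  }
  return false;
}
```

In a browser, the bytes could come from `new Uint8Array(await file.arrayBuffer())` on the selected `File`. A full PDF parser gives a more reliable answer than a byte scan.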
OCR Processing Details
Chunking Strategy
PDFs are automatically split into chunks for parallel processing. Chunk size adapts to total page count:
| Total Pages | Pages per Chunk | Reason |
|---|---|---|
| 1-10 | 1 | Maximum parallelism |
| 11-30 | 3 | Balance speed and API calls |
| 31-100 | 6 | Reduce API overhead |
| 100+ | 12 | Minimize rate limit risk |
- Maximum chunk size: 30MB
- Maximum concurrent requests: 5 (prevents Azure timeouts)
- Chunks processed in batches with detailed logging
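The table and limits above can be sketched as follows. This is an illustration, not the ThinkEx implementation: the function names are hypothetical, a document of exactly 100 pages is assumed to fall in the 31-100 tier, and the 30MB per-chunk size cap is not modeled.

```typescript
// Pages per chunk, following the tiers in the table.
// Assumption: exactly 100 pages uses the 31-100 tier.
function pagesPerChunk(totalPages: number): number {
  if (totalPages <= 10) return 1;
  if (totalPages <= 30) return 3;
  if (totalPages <= 100) return 6;
  return 12;
}

// Returns inclusive, zero-based [start, end] page ranges, one per chunk.
function planChunks(totalPages: number): Array<[number, number]> {
  const size = pagesPerChunk(totalPages);
  const chunks: Array<[number, number]> = [];
  for (let start = 0; start < totalPages; start += size) {
    chunks.push([start, Math.min(start + size, totalPages) - 1]);
  }
  return chunks;
}

// Groups items into batches of at most `limit`, matching the
// "maximum concurrent requests: 5" rule.
function toBatches<T>(items: T[], limit = 5): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += limit) {
    batches.push(items.slice(i, i + limit));
  }
  return batches;
}
```

For a 25-page PDF, `planChunks(25)` yields nine chunks of up to three pages, which `toBatches` groups into two batches (five chunks, then four).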
Image Handling
Images embedded in PDFs are:
- Extracted with base64 encoding (if OCR_INCLUDE_IMAGES is not false)
- Given globally unique IDs: p{pageIndex}-{imageId} (e.g., p0-img-0)
- Referenced in markdown with the unique ID
- Included in the images array of each page
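A sketch of the ID scheme above. `globalImageId` and `rewriteImageRefs` are hypothetical names, and the assumption that page markdown references images as `![alt](img-0)` is mine; the real reference format may differ.

```typescript
// Builds the page-scoped ID described above: p{pageIndex}-{imageId}.
function globalImageId(pageIndex: number, imageId: string): string {
  return `p${pageIndex}-${imageId}`;
}

// Rewrites markdown image targets so they use the page-scoped ID.
// Assumes targets appear as "](imageId)"; matching the closing parenthesis
// keeps "img-1" from also matching "img-10".
function rewriteImageRefs(markdown: string, pageIndex: number, imageIds: string[]): string {
  let out = markdown;
  for (const id of imageIds) {
    out = out.split(`](${id})`).join(`](${globalImageId(pageIndex, id)})`);
  }
  return out;
}
```

Prefixing with the page index is what makes IDs unique across chunks, since per-page IDs such as `img-0` repeat from page to page.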
Retry Logic
OCR requests automatically retry on:
- 408 Request Timeout: Retries once after 3 seconds
- 429 Rate Limit: Retries once after 3-6 seconds (exponential backoff)
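The retry rules above might look like this in code. It is a sketch under assumptions: the function names are hypothetical, and because the text describes a single retry with a 3-6 second window, the 429 delay is modeled as a uniformly random value in that range, which may not match the actual backoff calculation.

```typescript
// Delay before the single retry, or null when the status is not retried.
// `random` is injectable so the 429 jitter can be tested deterministically.
function retryDelayMs(status: number, random: () => number = Math.random): number | null {
  if (status === 408) return 3000;
  if (status === 429) return 3000 + Math.floor(random() * 3000); // assumed jitter
  return null;
}

// Runs `request` and retries at most once on 408 or 429.
// The second response is returned as-is, whatever its status.
async function withSingleRetry<T extends { status: number }>(
  request: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const first = await request();
  const delay = retryDelayMs(first.status);
  if (delay === null) return first;
  await sleep(delay);
  return request();
}
```

Because only one retry is attempted, a second 408 or 429 surfaces to the caller as a failure.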
Environment Configuration
URL Validation (SSRF Prevention)
For security, fileUrl must be from an allowed host:
- Production: Supabase storage URL or NEXT_PUBLIC_APP_URL
- Development: Above plus localhost and 127.0.0.1
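An allowlist check along these lines could be sketched as below. `isAllowedFileUrl` is a hypothetical name, and passing the allowed origins and the development flag as arguments is an assumption; how ThinkEx reads its Supabase URL and NEXT_PUBLIC_APP_URL is not shown in this document.

```typescript
// Returns true when fileUrl's host matches one of the allowed origins.
// `allowedOrigins` stands in for the Supabase storage URL and
// NEXT_PUBLIC_APP_URL (assumption: both are full URLs).
function isAllowedFileUrl(
  fileUrl: string,
  allowedOrigins: string[],
  isDevelopment: boolean,
): boolean {
  let target: URL;
  try {
    target = new URL(fileUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (target.protocol !== "https:" && target.protocol !== "http:") {
    return false;
  }
  if (isDevelopment && (target.hostname === "localhost" || target.hostname === "127.0.0.1")) {
    return true;
  }
  return allowedOrigins.some((origin) => {
    try {
      return new URL(origin).host === target.host;
    } catch {
      return false; // skip malformed allowlist entries
    }
  });
}
```

Comparing parsed hosts instead of checking string prefixes matters here: a URL such as `https://allowed.example@evil.example/x.pdf` starts with an allowed-looking string but parses to the host `evil.example`.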
Error Handling
Common Errors
| Status | Error | Solution |
|---|---|---|
| 401 | Unauthorized | User must be authenticated |
| 400 | fileUrl is required | Include fileUrl in request body |
| 400 | fileUrl origin is not allowed | Use allowed storage domain |
| 400 | URL does not point to a PDF | Verify file type and URL |
| 400 | PDF exceeds 100MB limit | Split or compress the PDF |
| 500 | OCR processing failed | Check Azure credentials and quota |
| 500 | Azure OCR failed (408) | PDF too complex or Azure timeout - will retry |
| 500 | Azure OCR failed (429) | Rate limit exceeded - will retry |
Logging
OCR operations are logged with detailed timing and chunk information.
Performance
- Small PDFs (1-10 pages): 2-5 seconds
- Medium PDFs (11-50 pages): 10-30 seconds
- Large PDFs (51-100 pages): 30-120 seconds
Processing time depends on:
- Page complexity (images, tables, text density)
- Azure API response times
- Network latency
See Also
- File Upload - Upload PDFs before processing
- Storage Configuration - Storage backend setup