Overview

ThinkEx processes PDFs using Azure Mistral Document AI for OCR (optical character recognition). The system supports:
  • Large PDF handling: Up to 100MB files
  • Batch processing: Automatic chunking for parallel OCR
  • Rich extraction: Text, tables, headers, footers, hyperlinks, and images
  • Markdown output: Structured markdown with preserved formatting
  • Durable workflows: Long-running OCR with status polling
  • Password protection detection: Client-side validation before upload

OCR Endpoints

Start Durable OCR Workflow

Recommended for large PDFs or when you need non-blocking processing.
POST /api/pdf/ocr/start
Content-Type: application/json

{
  "fileUrl": "https://storage.example.com/uploads/document.pdf",
  "itemId": "item_123456789"
}

Request body:
  • fileUrl (string, required): URL of the PDF file to process. Must be from an allowed host (Supabase, app domain, or localhost in development).
  • itemId (string, required): Client-provided ID for tracking this OCR operation (typically the database item ID).
Response:
  • runId (string): Workflow run ID for polling status.
  • itemId (string): Echo of the provided item ID.
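
As a minimal sketch of calling this endpoint from a client, the helpers below build the request and validate the response. Only the route and the JSON field names come from the reference above; `buildStartOcrRequest`, `parseStartOcrResponse`, and the two interfaces are hypothetical names introduced for illustration.

```typescript
// Hypothetical helpers; only the route and JSON fields come from the docs.
interface StartOcrRequest {
  fileUrl: string; // must be on an allowed host
  itemId: string; // client-provided tracking ID
}

interface StartOcrResponse {
  runId: string; // workflow run ID used for polling
  itemId: string; // echo of the provided item ID
}

// Build the URL and fetch options for POST /api/pdf/ocr/start.
function buildStartOcrRequest(body: StartOcrRequest) {
  return {
    url: "/api/pdf/ocr/start",
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Narrow an unknown JSON payload to the documented response shape.
function parseStartOcrResponse(json: unknown): StartOcrResponse {
  const data = json as Partial<StartOcrResponse> | null;
  if (!data || typeof data.runId !== "string" || typeof data.itemId !== "string") {
    throw new Error("Unexpected response from /api/pdf/ocr/start");
  }
  return { runId: data.runId, itemId: data.itemId };
}

// Usage in an async context:
//   const { url, init } = buildStartOcrRequest({ fileUrl, itemId });
//   const response = await fetch(url, init);
//   const { runId } = parseStartOcrResponse(await response.json());
```

Keeping request construction separate from the network call makes the payload easy to unit-test without a running server.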

Poll OCR Status

Check the status of a running OCR workflow.
GET /api/pdf/ocr/status?runId=wf_abc123def456

Query parameters:
  • runId (string, required): Workflow run ID returned from /api/pdf/ocr/start.
Response:
  • status (string): One of running, completed, or failed.
  • result (object): OCR results (only present when status is completed).
  • result.textContent (string): Full extracted text from all pages, joined with double newlines.
  • result.ocrPages (array): Array of page-level OCR data with rich metadata.
  • error (string): Error message (only present when status is failed).
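
A polling loop over this endpoint might look like the sketch below. The response union mirrors the fields listed above. The function name `pollOcrStatus`, the 2-second interval, and the 150-attempt cap are assumptions chosen for illustration, not documented defaults. The status fetcher and the sleep function are injected so the loop can be tested without a network.

```typescript
// Response shape taken from the field list above; helper names are hypothetical.
type OcrStatusResponse =
  | { status: "running" }
  | { status: "completed"; result: { textContent: string; ocrPages: unknown[] } }
  | { status: "failed"; error: string };

interface PollOptions {
  intervalMs?: number; // assumed delay between polls
  maxAttempts?: number; // assumed cap so polling cannot run forever
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Poll until the workflow leaves the "running" state or attempts run out.
async function pollOcrStatus(
  runId: string,
  getStatus: (runId: string) => Promise<OcrStatusResponse>,
  { intervalMs = 2000, maxAttempts = 150, sleep = defaultSleep }: PollOptions = {},
) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await getStatus(runId);
    if (response.status === "completed") return response.result;
    if (response.status === "failed") throw new Error(response.error);
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`OCR run ${runId} still running after ${maxAttempts} polls`);
}

// In the browser, getStatus could wrap the documented endpoint:
//   const getStatus = (id: string) =>
//     fetch(`/api/pdf/ocr/status?runId=${encodeURIComponent(id)}`).then((r) => r.json());
```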

Direct OCR Processing

Synchronous OCR for smaller PDFs (blocks until complete).
POST /api/pdf/ocr
Content-Type: application/json

{
  "fileUrl": "https://storage.example.com/uploads/document.pdf"
}

Request body:
  • fileUrl (string, required): URL of the PDF file to process (max 100MB).
Response:
  • textContent (string): Full extracted text from all pages.
  • ocrPages (array): Array of page objects with extracted data.

OCR Page Schema

Each page in the ocrPages array contains:
  • index (number): Zero-based page number.
  • markdown (string): Extracted text in Markdown format with preserved structure.
  • images (array): Embedded images with base64 data and unique IDs. Format: [{ id: "p0-img-0", ... }]
  • tables (array): Detected tables with structure information.
  • header (string | null): Page header text (if detected).
  • footer (string | null): Page footer text (if detected).
  • hyperlinks (array): Detected hyperlinks with URL and position data.
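
For consumers written in TypeScript, the page schema could be modelled roughly as follows. This is a sketch under assumptions: the property names `footer` and `hyperlinks` are inferred from the descriptions above, the element shapes of `images`, `tables`, and `hyperlinks` are left open, and `joinPageText` is a hypothetical helper. It assumes the full text is built from each page's markdown and uses the double-newline join described for result.textContent.

```typescript
// `footer` and `hyperlinks` are assumed property names; element shapes of
// images, tables, and hyperlinks are intentionally left loose.
interface OcrPage {
  index: number; // zero-based page number
  markdown: string;
  images: Array<{ id: string } & Record<string, unknown>>;
  tables: unknown[];
  header: string | null;
  footer: string | null;
  hyperlinks: unknown[];
}

// Join page markdown in page order, separated by double newlines.
// Pages whose markdown is empty after trimming are skipped.
function joinPageText(pages: OcrPage[]): string {
  return pages
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => a.index - b.index)
    .map((page) => page.markdown.trim())
    .filter((text) => text.length > 0)
    .join("\n\n");
}
```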

Client SDK

Upload and OCR (Blocking)

import { uploadPdfAndRunOcr } from '@/lib/uploads/pdf-upload-with-ocr';

const file = document.querySelector('input[type="file"]').files[0];

try {
  const result = await uploadPdfAndRunOcr(file);
  
  if (result.ocrStatus === 'complete') {
    console.log('Text:', result.textContent);
    console.log('Pages:', result.ocrPages.length);
  } else {
    console.error('OCR failed:', result.ocrError);
  }
} catch (error) {
  console.error('Upload failed:', error);
}

Upload and OCR (Non-blocking)

import { uploadPdfToStorage, runOcrFromUrl } from '@/lib/uploads/pdf-upload-with-ocr';

const file = document.querySelector('input[type="file"]').files[0];

// Upload first (fast)
const { url, filename, fileSize } = await uploadPdfToStorage(file);

// Add item to UI immediately with pending state
addItemToUI({ url, filename, ocrStatus: 'pending' });

// Run OCR in background
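runOcrFromUrl(url)
  .then(result => {
    if (result.ocrStatus === 'complete') {
      updateItemInUI({ ocrStatus: 'complete', textContent: result.textContent, ocrPages: result.ocrPages });
    } else {
      // OCR finished without success: surface the error instead of leaving the item pending
      updateItemInUI({ ocrStatus: 'failed', ocrError: result.ocrError });
    }
  })
  .catch(error => {
    updateItemInUI({ ocrStatus: 'failed', ocrError: error.message });
  });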

Password Protection Detection

import { isPasswordProtectedPdf, filterPasswordProtectedPdfs } from '@/lib/uploads/pdf-validation';

// Check single file
const isProtected = await isPasswordProtectedPdf(file);
if (isProtected) {
  alert('This PDF is password-protected and cannot be processed');
}

// Filter multiple files
const files = Array.from(fileInput.files);
const { valid, rejected } = await filterPasswordProtectedPdfs(files);

if (rejected.length > 0) {
  alert(`Password-protected files rejected: ${rejected.join(', ')}`);
}

// Upload only valid files
for (const file of valid) {
  await uploadPdfAndRunOcr(file);
}

OCR Processing Details

Chunking Strategy

PDFs are automatically split into chunks for parallel processing. Chunk size adapts to total page count:

| Total Pages | Pages per Chunk | Reason |
|---|---|---|
| 1-10 | 1 | Maximum parallelism |
| 11-30 | 3 | Balance speed and API calls |
| 31-100 | 6 | Reduce API overhead |
| 100+ | 12 | Minimize rate limit risk |

  • Maximum chunk size: 30MB
  • Maximum concurrent requests: 5 (prevents Azure timeouts)
  • Chunks processed in batches with detailed logging
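
The adaptive sizing and batching above can be sketched as three small functions. The function and type names are hypothetical, the 30MB chunk size cap is not modelled, and because the table lists both "31-100" and "100+", treating exactly 100 pages as the 6-page tier is an assumption.

```typescript
const MAX_CONCURRENT_REQUESTS = 5; // from the limits listed above

// Pages per chunk by total page count (exactly 100 pages -> 6 is an assumption).
function pagesPerChunk(totalPages: number): number {
  if (totalPages <= 10) return 1;
  if (totalPages <= 30) return 3;
  if (totalPages <= 100) return 6;
  return 12;
}

interface PageRange {
  start: number; // zero-based, inclusive
  end: number; // zero-based, inclusive
}

// Split pages 0..totalPages-1 into consecutive ranges of the adaptive size.
function planChunks(totalPages: number): PageRange[] {
  const size = pagesPerChunk(totalPages);
  const chunks: PageRange[] = [];
  for (let start = 0; start < totalPages; start += size) {
    chunks.push({ start, end: Math.min(start + size, totalPages) - 1 });
  }
  return chunks;
}

// Group chunks into batches so at most `batchSize` requests run at once.
function planBatches(
  chunks: PageRange[],
  batchSize: number = MAX_CONCURRENT_REQUESTS,
): PageRange[][] {
  const batches: PageRange[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    batches.push(chunks.slice(i, i + batchSize));
  }
  return batches;
}
```

Under these rules a 40-page PDF is split into 7 chunks of up to 6 pages, which run as two batches of 5 and 2 chunks.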

Image Handling

Images embedded in PDFs are:
  • Extracted with base64 encoding (unless OCR_INCLUDE_IMAGES is set to false)
  • Given globally unique IDs: p{pageIndex}-{imageId} (e.g., p0-img-0)
  • Referenced in markdown with the unique ID
  • Included in the images array of each page
The page-index prefix prevents ID collisions when chunks are merged, because Azure returns chunk-relative IDs.
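
A rough sketch of the ID rewrite follows. It assumes images are referenced in the markdown as standard image links of the form `![alt](imageId)` and that each page already carries its global page index. The type and function names are hypothetical.

```typescript
interface PageImage {
  id: string; // chunk-relative ID from the OCR service, e.g. "img-0"
  [key: string]: unknown; // base64 data and any other fields pass through
}

interface ChunkPage {
  index: number; // global zero-based page index
  markdown: string;
  images: PageImage[];
}

// Prefix every image ID with the page index and update markdown link targets.
function globalizeImageIds(page: ChunkPage): ChunkPage {
  let markdown = page.markdown;
  const images: PageImage[] = [];
  for (const image of page.images) {
    const globalId = `p${page.index}-${image.id}`;
    // Match the whole link target "](id)" so "img-1" never rewrites part of "img-10".
    markdown = markdown.split(`](${image.id})`).join(`](${globalId})`);
    images.push({ ...image, id: globalId });
  }
  return { ...page, markdown, images };
}
```

Matching the full link target rather than the bare ID keeps the rewrite safe when one ID is a prefix of another.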

Retry Logic

OCR requests automatically retry on:
  • 408 Request Timeout: Retries once after 3 seconds
  • 429 Rate Limit: Retries once after 3-6 seconds (exponential backoff)
Other errors fail immediately.
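
The retry rules can be expressed as a small policy function plus a wrapper that retries at most once. This is a sketch, not the service's implementation: the names are hypothetical, and since the text does not say how the delay within the 3-6 second window is chosen, the sketch draws it uniformly at random.

```typescript
// Delay before the single retry, or null when the status is not retryable.
function retryDelayMs(status: number, random: () => number = Math.random): number | null {
  if (status === 408) return 3000; // Request Timeout: fixed 3 seconds
  if (status === 429) return 3000 + Math.floor(random() * 3000); // 3000-5999 ms (assumed uniform)
  return null;
}

interface OcrHttpResult<T> {
  status: number;
  data?: T;
}

// Send once; on 408 or 429 wait for the policy delay and send exactly once more.
async function requestWithSingleRetry<T>(
  send: () => Promise<OcrHttpResult<T>>,
  sleep: (ms: number) => Promise<void>,
): Promise<OcrHttpResult<T>> {
  const first = await send();
  const delay = retryDelayMs(first.status);
  if (delay === null) return first; // success or non-retryable error
  await sleep(delay);
  return send(); // second and final attempt
}
```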

Environment Configuration

# Required: Azure Document AI credentials
AZURE_DOCUMENT_AI_API_KEY=your-api-key
AZURE_DOCUMENT_AI_ENDPOINT=https://your-endpoint.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-02-15-preview

# Optional: Model to use (default: mistral-document-ai-2512)
AZURE_DOCUMENT_AI_MODEL=mistral-document-ai-2512

# Optional: Include embedded images in OCR results (default: true)
OCR_INCLUDE_IMAGES=true

# Required: Storage configuration (for fileUrl validation)
NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
NEXT_PUBLIC_APP_URL=http://localhost:3000
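
One way to read these variables at startup is a small loader that fails fast on missing required values and applies the documented defaults. `loadOcrConfig` and `OcrConfig` are hypothetical names. The variable names, the default model, and the "images on unless set to false" rule come from this page.

```typescript
interface OcrConfig {
  apiKey: string;
  endpoint: string;
  model: string;
  includeImages: boolean;
  supabaseUrl: string;
  appUrl: string;
}

const DEFAULT_MODEL = "mistral-document-ai-2512";

// Read OCR settings from an environment-like map, throwing on missing required values.
function loadOcrConfig(env: Record<string, string | undefined>): OcrConfig {
  const required = (key: string): string => {
    const value = env[key];
    if (!value) throw new Error(`Missing required environment variable: ${key}`);
    return value;
  };
  return {
    apiKey: required("AZURE_DOCUMENT_AI_API_KEY"),
    endpoint: required("AZURE_DOCUMENT_AI_ENDPOINT"),
    model: env.AZURE_DOCUMENT_AI_MODEL || DEFAULT_MODEL,
    includeImages: env.OCR_INCLUDE_IMAGES !== "false", // enabled unless explicitly "false"
    supabaseUrl: required("NEXT_PUBLIC_SUPABASE_URL"),
    appUrl: required("NEXT_PUBLIC_APP_URL"),
  };
}

// Usage (Node.js): const config = loadOcrConfig(process.env);
```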

URL Validation (SSRF Prevention)

For security, fileUrl must be from an allowed host:
  • Production: Supabase storage URL or NEXT_PUBLIC_APP_URL
  • Development: Above plus localhost and 127.0.0.1
Requests to other domains are rejected with a 400 error.
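
A host allow-list check along these lines could back the rule above. This is a minimal sketch, not the service's actual validator: `isAllowedFileUrl` and `AllowedHostConfig` are hypothetical, and comparing the parsed `host` (hostname plus port) against the hosts of the two configured URLs is an assumption about how "allowed host" is defined.

```typescript
interface AllowedHostConfig {
  supabaseUrl: string; // e.g. the value of NEXT_PUBLIC_SUPABASE_URL
  appUrl: string; // e.g. the value of NEXT_PUBLIC_APP_URL
  isDevelopment: boolean;
}

// Return true only when fileUrl parses as http(s) and its host is on the allow-list.
function isAllowedFileUrl(fileUrl: string, config: AllowedHostConfig): boolean {
  let target: URL;
  try {
    target = new URL(fileUrl);
  } catch (e) {
    return false; // not an absolute URL
  }
  if (target.protocol !== "https:" && target.protocol !== "http:") return false;

  const allowedHosts = [new URL(config.supabaseUrl).host, new URL(config.appUrl).host];
  if (allowedHosts.indexOf(target.host) !== -1) return true;

  // Development only: also accept loopback hosts on any port.
  return (
    config.isDevelopment &&
    (target.hostname === "localhost" || target.hostname === "127.0.0.1")
  );
}
```

Comparing parsed hosts rather than string prefixes rejects lookalike URLs such as https://your-project.supabase.co.evil.example/file.pdf. A route handler would answer 400 whenever this check returns false.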

Error Handling

Common Errors

| Status | Error | Solution |
|---|---|---|
| 401 | Unauthorized | User must be authenticated |
| 400 | fileUrl is required | Include fileUrl in request body |
| 400 | fileUrl origin is not allowed | Use allowed storage domain |
| 400 | URL does not point to a PDF | Verify file type and URL |
| 400 | PDF exceeds 100MB limit | Split or compress the PDF |
| 500 | OCR processing failed | Check Azure credentials and quota |
| 500 | Azure OCR failed (408) | PDF too complex or Azure timeout (will retry) |
| 500 | Azure OCR failed (429) | Rate limit exceeded (will retry) |

Logging

OCR operations are logged with detailed timing and chunk information:
[PDF_OCR_AZURE] Start { pageCount: 25, totalChunks: 5, pagesPerChunk: 6 }
[PDF_OCR_AZURE] Batch start { batch: '1/1', chunks: ['pages 0-5', 'pages 6-11', ...] }
[PDF_OCR_AZURE] Chunk done { pages: '0-5', extractedPages: 6 }
[PDF_OCR_AZURE] Batch complete { batch: '1/1', pagesProcessed: 25, ms: 12453 }
[PDF_OCR_AZURE] Complete { pageCount: 25, totalMs: 12453 }

Performance

  • Small PDFs (1-10 pages): 2-5 seconds
  • Medium PDFs (11-50 pages): 10-30 seconds
  • Large PDFs (51-100 pages): 30-120 seconds
Actual times depend on:
  • Page complexity (images, tables, text density)
  • Azure API response times
  • Network latency
