PDF Processing

Overview

ThinkEx processes PDFs with Azure Mistral Document AI for OCR (optical character recognition). The system supports:

Large PDF handling: Up to 100MB files
Batch processing: Automatic chunking for parallel OCR
Rich extraction: Text, tables, headers, footers, hyperlinks, and images
Markdown output: Structured markdown with preserved formatting
Durable workflows: Long-running OCR with status polling
Password protection detection: Client-side validation before upload

OCR Endpoints

Start Durable OCR Workflow

Recommended for large PDFs or when you need non-blocking processing.

POST /api/pdf/ocr/start
Content-Type: application/json

{
  "fileUrl": "https://storage.example.com/uploads/document.pdf",
  "itemId": "item_123456789"
}

Request body:

fileUrl (string, required): URL of the PDF file to process. Must be from an allowed host (Supabase, the app domain, or localhost in development).

itemId (string, required): Client-provided ID for tracking this OCR operation (typically the database item ID).

Response:

runId (string): Workflow run ID for polling status.

itemId (string): Echo of the provided item ID.
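As a rough illustration of calling this endpoint from a client, the sketch below posts the documented body and reads back the documented fields. The helper name `startOcrWorkflow` and the `FetchLike` type are assumptions made for this example and are not part of the ThinkEx SDK; the injectable `fetchFn` parameter exists only so the helper can be tried without a network.

```typescript
// Minimal structural type so a fake can stand in for fetch (hypothetical, for illustration).
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Response fields as documented for POST /api/pdf/ocr/start.
interface StartOcrResponse {
  runId: string;
  itemId: string;
}

// Hypothetical helper: starts the durable workflow and returns the run ID to poll.
async function startOcrWorkflow(
  fileUrl: string,
  itemId: string,
  fetchFn: FetchLike = fetch,
): Promise<StartOcrResponse> {
  const res = await fetchFn("/api/pdf/ocr/start", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ fileUrl, itemId }),
  });
  if (!res.ok) {
    // 4xx/5xx responses are listed under Common Errors.
    throw new Error(`OCR start failed with HTTP ${res.status}`);
  }
  return (await res.json()) as StartOcrResponse;
}
```

The returned `runId` is the value to pass to the status endpoint described next.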

Poll OCR Status

Check the status of a running OCR workflow.

GET /api/pdf/ocr/status?runId=wf_abc123def456

Query parameters:

runId (string, required): Workflow run ID returned from /api/pdf/ocr/start.

Response:

status (string): One of running, completed, or failed.

result (object): OCR results (only present when status is completed).

result.textContent (string): Full extracted text from all pages, joined with double newlines.

result.ocrPages (array): Array of page-level OCR data with rich metadata.

error (string): Error message (only present when status is failed).
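A polling loop over this endpoint might look like the sketch below. The names `waitForOcr`, `fetchStatusOverHttp`, and `StatusFetcher`, the 2-second interval, and the attempt cap are all assumptions chosen for illustration; only the endpoint path and the response fields come from the reference above.

```typescript
interface OcrResult {
  textContent: string;
  ocrPages: unknown[];
}

interface OcrStatusResponse {
  status: "running" | "completed" | "failed";
  result?: OcrResult;
  error?: string;
}

type StatusFetcher = (runId: string) => Promise<OcrStatusResponse>;

// Default fetcher: one GET against the documented status endpoint.
const fetchStatusOverHttp: StatusFetcher = async (runId) => {
  const res = await fetch(`/api/pdf/ocr/status?runId=${encodeURIComponent(runId)}`);
  if (!res.ok) throw new Error(`Status check failed with HTTP ${res.status}`);
  return (await res.json()) as OcrStatusResponse;
};

// Polls until the workflow completes or fails, or the attempt cap is reached.
async function waitForOcr(
  runId: string,
  fetchStatus: StatusFetcher = fetchStatusOverHttp,
  intervalMs = 2000, // assumed interval, not specified by the API
  maxAttempts = 150, // assumed cap so the loop cannot run forever
): Promise<OcrResult> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const body = await fetchStatus(runId);
    if (body.status === "completed") {
      if (!body.result) throw new Error("Completed without a result");
      return body.result;
    }
    if (body.status === "failed") {
      throw new Error(body.error ?? "OCR failed");
    }
    // Still running: wait before asking again.
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`OCR still running after ${maxAttempts} polls`);
}
```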

Direct OCR Processing

Synchronous OCR for smaller PDFs (blocks until complete).

POST /api/pdf/ocr
Content-Type: application/json

{
  "fileUrl": "https://storage.example.com/uploads/document.pdf"
}

Request body:

fileUrl (string, required): URL of the PDF file to process (max 100MB).

Response:

textContent (string): Full extracted text from all pages.

ocrPages (array): Array of page objects with extracted data.

OCR Page Schema

Each page in the ocrPages array contains:

index (number): Zero-based page number.

markdown (string): Extracted text in Markdown format with preserved structure.

images (array): Embedded images with base64 data and unique IDs. Format: [{ id: "p0-img-0", ... }]

tables (array): Detected tables with structure information.

header (string | null): Page header text (if detected).

footer (string | null): Page footer text (if detected).

hyperlinks (array): Detected hyperlinks with URL and position data.
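For TypeScript consumers, the schema above could be written down roughly as follows. The interface names are assumptions, and the element shapes of `tables` and `hyperlinks` are left as `unknown` because this page does not specify them. The `joinPageMarkdown` helper is a hypothetical illustration of joining pages with double newlines; it is not confirmed to reproduce the server's `textContent` exactly.

```typescript
// Only `id` is documented for images; other fields (such as the base64 data) are not named here.
interface OcrImage {
  id: string;
  [key: string]: unknown;
}

interface OcrPage {
  index: number; // zero-based page number
  markdown: string;
  images: OcrImage[];
  tables: unknown[]; // element shape not specified in this reference
  header: string | null;
  footer: string | null;
  hyperlinks: unknown[]; // element shape not specified in this reference
}

// Hypothetical helper: rebuilds a single text body from pages in page order.
function joinPageMarkdown(pages: OcrPage[]): string {
  return [...pages]
    .sort((a, b) => a.index - b.index)
    .map((page) => page.markdown)
    .join("\n\n");
}
```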

Client SDK

Upload and OCR (Blocking)

import { uploadPdfAndRunOcr } from '@/lib/uploads/pdf-upload-with-ocr';

const file = document.querySelector('input[type="file"]').files[0];

try {
  const result = await uploadPdfAndRunOcr(file);
  
  if (result.ocrStatus === 'complete') {
    console.log('Text:', result.textContent);
    console.log('Pages:', result.ocrPages.length);
  } else {
    console.error('OCR failed:', result.ocrError);
  }
} catch (error) {
  console.error('Upload failed:', error);
}

Upload and OCR (Non-blocking)

import { uploadPdfToStorage, runOcrFromUrl } from '@/lib/uploads/pdf-upload-with-ocr';

const file = document.querySelector('input[type="file"]').files[0];

// Upload first (fast)
const { url, filename, fileSize } = await uploadPdfToStorage(file);

// Add item to UI immediately with pending state
addItemToUI({ url, filename, ocrStatus: 'pending' });

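// Run OCR in the background (intentionally not awaited)
runOcrFromUrl(url)
  .then(result => {
    if (result.ocrStatus === 'complete') {
      updateItemInUI({ ocrStatus: 'complete', textContent: result.textContent, ocrPages: result.ocrPages });
    } else {
      // Surface OCR failures instead of leaving the item pending
      updateItemInUI({ ocrStatus: 'failed', ocrError: result.ocrError });
    }
  })
  .catch(error => {
    updateItemInUI({ ocrStatus: 'failed', ocrError: error.message });
  });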

Password Protection Detection

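import { isPasswordProtectedPdf, filterPasswordProtectedPdfs } from '@/lib/uploads/pdf-validation';
import { uploadPdfAndRunOcr } from '@/lib/uploads/pdf-upload-with-ocr';

// Check a single file (`file` is a File taken from a file input)
const isProtected = await isPasswordProtectedPdf(file);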
if (isProtected) {
  alert('This PDF is password-protected and cannot be processed');
}

// Filter multiple files
const files = Array.from(fileInput.files);
const { valid, rejected } = await filterPasswordProtectedPdfs(files);

if (rejected.length > 0) {
  alert(`Password-protected files rejected: ${rejected.join(', ')}`);
}

// Upload only valid files
for (const file of valid) {
  await uploadPdfAndRunOcr(file);
}

OCR Processing Details

Chunking Strategy

PDFs are automatically split into chunks for parallel processing. Chunk size adapts to total page count:

Total Pages	Pages per Chunk	Reason
1-10	1	Maximum parallelism
11-30	3	Balance speed and API calls
31-100	6	Reduce API overhead
101+	12	Minimize rate limit risk

Maximum chunk size: 30MB
Maximum concurrent requests: 5 (prevents Azure timeouts)
Chunks processed in batches with detailed logging
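The table and limits above can be sketched as a small planning routine. The function names are hypothetical, the boundary at 100 pages is assumed to fall in the 6-page tier, and the 30MB chunk size cap is not modeled here because it depends on the actual bytes of each chunk.

```typescript
// Pages per chunk, following the tiers in the table above.
function pagesPerChunk(totalPages: number): number {
  if (totalPages <= 10) return 1;
  if (totalPages <= 30) return 3;
  if (totalPages <= 100) return 6; // assumption: exactly 100 pages uses the 6-page tier
  return 12;
}

// Inclusive, zero-based [start, end] page ranges covering the whole document.
function planChunks(totalPages: number): Array<[number, number]> {
  const size = pagesPerChunk(totalPages);
  const chunks: Array<[number, number]> = [];
  for (let start = 0; start < totalPages; start += size) {
    chunks.push([start, Math.min(start + size, totalPages) - 1]);
  }
  return chunks;
}

// Groups chunks into batches of at most `maxConcurrent` parallel requests.
function toBatches<T>(items: T[], maxConcurrent = 5): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += maxConcurrent) {
    batches.push(items.slice(i, i + maxConcurrent));
  }
  return batches;
}
```

Under these assumptions, a 40-page PDF is split into 7 chunks of up to 6 pages, which are sent as two batches of 5 and 2 requests.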

Image Handling

Images embedded in PDFs are:

Extracted with base64 encoding (if OCR_INCLUDE_IMAGES is not false)
Given globally unique IDs: p{pageIndex}-{imageId} (e.g., p0-img-0)
Referenced in markdown with the unique ID
Included in the images array of each page

This prevents ID collisions when merging chunks (Azure returns chunk-relative IDs).
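One way the renaming could work is sketched below. It assumes, beyond what this page states, that each chunk response references images in Markdown as `![id](id)` and that the caller already knows the document-wide page index (chunk start plus the chunk-relative index). The function name is hypothetical.

```typescript
// Prefixes each chunk-relative image ID with the document-wide page index
// and rewrites the matching Markdown references on that page.
function namespaceImageIds(
  pageIndex: number,
  markdown: string,
  imageIds: string[],
): { markdown: string; imageIds: string[] } {
  let rewritten = markdown;
  const uniqueIds: string[] = [];
  for (const id of imageIds) {
    const uniqueId = `p${pageIndex}-${id}`;
    // Match the delimited forms "(id)" and "[id]" so that "img-1"
    // never rewrites part of "img-10".
    rewritten = rewritten
      .split(`(${id})`).join(`(${uniqueId})`)
      .split(`[${id}]`).join(`[${uniqueId}]`);
    uniqueIds.push(uniqueId);
  }
  return { markdown: rewritten, imageIds: uniqueIds };
}
```

Applied per page while merging chunks, two chunks that both report an image called `img-0` end up with distinct IDs such as `p0-img-0` and `p6-img-0`.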

Retry Logic

OCR requests automatically retry on:

408 Request Timeout: Retries once after 3 seconds
429 Too Many Requests (rate limit): Retries once after 3-6 seconds (exponential backoff)

Other errors fail immediately.
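The retry rules above could be approximated as follows. This is a sketch under assumptions: the helper names are hypothetical, and because the exact backoff formula is not given here, the 429 delay is simply drawn at random from the documented 3 to 6 second window.

```typescript
// Delay before the single retry, or null when the status is not retryable.
function retryDelayMs(status: number, random: () => number = Math.random): number | null {
  if (status === 408) return 3000;
  if (status === 429) return 3000 + Math.floor(random() * 3000); // assumed: random point in 3-6 s
  return null;
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Sends a request and retries it at most once on 408 or 429.
// Any other status, success or error, is returned to the caller as-is.
async function requestWithRetry<T extends { status: number }>(
  send: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = defaultSleep,
  random: () => number = Math.random,
): Promise<T> {
  const first = await send();
  const delay = retryDelayMs(first.status, random);
  if (delay === null) return first;
  await sleep(delay);
  return send(); // the second response is final, whatever its status
}
```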

Environment Configuration

# Required: Azure Document AI credentials
AZURE_DOCUMENT_AI_API_KEY=your-api-key
AZURE_DOCUMENT_AI_ENDPOINT=https://your-endpoint.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-02-15-preview

# Optional: Model to use (default: mistral-document-ai-2512)
AZURE_DOCUMENT_AI_MODEL=mistral-document-ai-2512

# Optional: Include embedded images in OCR results (default: true)
OCR_INCLUDE_IMAGES=true

# Required: Storage configuration (for fileUrl validation)
NEXT_PUBLIC_SUPABASE_URL=https://your-project.supabase.co
NEXT_PUBLIC_APP_URL=http://localhost:3000
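A loader for the OCR-related variables might look like the sketch below, which applies the documented defaults and treats images as enabled unless `OCR_INCLUDE_IMAGES` is literally `false`. The `OcrConfig` shape and `loadOcrConfig` name are assumptions for illustration, not the application's actual configuration code.

```typescript
interface OcrConfig {
  apiKey: string;
  endpoint: string;
  model: string;
  includeImages: boolean;
}

// Hypothetical loader: validates required variables and fills in documented defaults.
function loadOcrConfig(env: Record<string, string | undefined>): OcrConfig {
  const apiKey = env.AZURE_DOCUMENT_AI_API_KEY;
  const endpoint = env.AZURE_DOCUMENT_AI_ENDPOINT;
  if (!apiKey || !endpoint) {
    throw new Error("AZURE_DOCUMENT_AI_API_KEY and AZURE_DOCUMENT_AI_ENDPOINT are required");
  }
  return {
    apiKey,
    endpoint,
    model: env.AZURE_DOCUMENT_AI_MODEL || "mistral-document-ai-2512",
    includeImages: env.OCR_INCLUDE_IMAGES !== "false",
  };
}
```

In a Node.js process this would typically be called once at startup as `loadOcrConfig(process.env)`.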

URL Validation (SSRF Prevention)

For security, fileUrl must be from an allowed host:

Production: Supabase storage URL or NEXT_PUBLIC_APP_URL
Development: Above plus localhost and 127.0.0.1

Requests to other domains are rejected with a 400 error.
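A host allow-list check along these lines is sketched below. It is an illustration only: the function name and options object are hypothetical, and comparing hostnames (rather than full origins) is an assumption about how the real validation works.

```typescript
interface AllowListOptions {
  supabaseUrl: string; // e.g. the value of NEXT_PUBLIC_SUPABASE_URL
  appUrl: string; // e.g. the value of NEXT_PUBLIC_APP_URL
  isDevelopment: boolean;
}

// Returns true only when fileUrl parses as an http(s) URL on an allowed host.
function isAllowedFileUrl(fileUrl: string, opts: AllowListOptions): boolean {
  let target: URL;
  try {
    target = new URL(fileUrl);
  } catch {
    return false; // not an absolute URL
  }
  if (target.protocol !== "https:" && target.protocol !== "http:") return false;

  const allowedHosts = new Set<string>([
    new URL(opts.supabaseUrl).hostname,
    new URL(opts.appUrl).hostname,
  ]);
  if (opts.isDevelopment) {
    allowedHosts.add("localhost");
    allowedHosts.add("127.0.0.1");
  }
  // Exact hostname match: a lookalike such as "<allowed-host>.evil.com" does not pass.
  return allowedHosts.has(target.hostname);
}
```

Parsing with `URL` before comparing avoids substring checks, which are easy to bypass by placing an allowed hostname in the path or query string.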

Error Handling

Common Errors

Status	Error	Solution
401	Unauthorized	User must be authenticated
400	fileUrl is required	Include fileUrl in request body
400	fileUrl origin is not allowed	Use allowed storage domain
400	URL does not point to a PDF	Verify file type and URL
400	PDF exceeds 100MB limit	Split or compress the PDF
500	OCR processing failed	Check Azure credentials and quota
500	Azure OCR failed (408)	PDF too complex or Azure timeout - will retry
500	Azure OCR failed (429)	Rate limit exceeded - will retry

Logging

OCR operations are logged with detailed timing and chunk information:

[PDF_OCR_AZURE] Start { pageCount: 25, totalChunks: 5, pagesPerChunk: 6 }
[PDF_OCR_AZURE] Batch start { batch: '1/1', chunks: ['pages 0-5', 'pages 6-11', ...] }
[PDF_OCR_AZURE] Chunk done { pages: '0-5', extractedPages: 6 }
[PDF_OCR_AZURE] Batch complete { batch: '1/1', pagesProcessed: 25, ms: 12453 }
[PDF_OCR_AZURE] Complete { pageCount: 25, totalMs: 12453 }

Performance

Small PDFs (1-10 pages): 2-5 seconds
Medium PDFs (11-50 pages): 10-30 seconds
Large PDFs (51-100 pages): 30-120 seconds

Actual times depend on:

Page complexity (images, tables, text density)
Azure API response times
Network latency
