Chunkr is designed to scale horizontally by increasing the number of worker replicas. This guide covers scaling strategies for different workload patterns.
Understanding Chunkr’s Architecture
Chunkr uses a distributed architecture with specialized workers:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐      ┌─────────────┐
│   Server    │◄────►│    Redis    │
│ (1 replica) │      │   (Queue)   │
└──────┬──────┘      └─────────────┘
       │
       │ Enqueues tasks
       │
       ▼
┌─────────────────────────────────┐
│       Task Workers (30)         │
│  - Orchestrate processing       │
│  - Call ML services             │
└────┬──────────────────────┬─────┘
     │                      │
     ▼                      ▼
┌──────────────┐      ┌──────────────┐
│ Segmentation │      │     OCR      │
│   Workers    │      │   Workers    │
│ (6 replicas) │      │ (3 replicas) │
└──────────────┘      └──────────────┘
Default Replica Configuration
The default compose.yaml is optimized for medium workloads:
| Service | Default Replicas | Purpose |
|---|---|---|
| server | 1 | API server (stateless, can scale) |
| task | 30 | Background task orchestration |
| segmentation-backend | 6 | Document layout analysis |
| ocr-backend | 3 | Text recognition |
| web | 1 | Frontend UI |
| postgres | 1 | Database (single instance) |
| redis | 1 | Queue/cache (single instance) |
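To confirm what is actually running against these defaults, the standard `docker compose ps` command lists the containers per service (the service names below simply mirror the table above):
# List all containers in the project
docker compose ps
# Show only the task workers
docker compose ps task
# Count the running task replicas
docker compose ps -q task | wc -l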
Scaling Strategies
Vertical Scaling
Increase resources for individual containers:
services:
  task:
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
Horizontal Scaling
Increase the number of worker replicas:
services:
  task:
    deploy:
      replicas: 50  # Scale up from 30
  segmentation-backend:
    deploy:
      replicas: 12  # Scale up from 6
  ocr-backend:
    deploy:
      replicas: 6  # Scale up from 3
Horizontal scaling is more effective for Chunkr as it allows processing multiple documents in parallel.
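If you prefer not to edit the compose file first, the same replica counts can be applied at runtime with `--scale` (a sketch; the service names and numbers simply mirror the example above):
# Apply the example replica counts without editing the compose file
docker compose up -d \
  --scale task=50 \
  --scale segmentation-backend=12 \
  --scale ocr-backend=6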
Scaling for Different Workloads
High Volume, Small Documents
Characteristics: Many PDF pages, mostly text
Recommended configuration:
services:
  task:
    deploy:
      replicas: 50
  segmentation-backend:
    deploy:
      replicas: 8
  ocr-backend:
    deploy:
      replicas: 12  # OCR is the bottleneck for text-heavy docs
Large Documents, Complex Layouts
Characteristics: Multi-page documents with tables, images, complex formatting
Recommended configuration:
services:
  task:
    deploy:
      replicas: 30
  segmentation-backend:
    deploy:
      replicas: 12  # Segmentation is the bottleneck
  ocr-backend:
    deploy:
      replicas: 6
Mixed Workload
Characteristics: Variety of document types and sizes
Recommended configuration:
services:
  task:
    deploy:
      replicas: 40
  segmentation-backend:
    deploy:
      replicas: 10
  ocr-backend:
    deploy:
      replicas: 8
GPU Scaling Considerations
Single GPU
With one GPU (8GB+), balance workers to avoid memory contention:
segmentation-backend:
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - MAX_BATCH_SIZE=4

ocr-backend:
  deploy:
    replicas: 2
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
Too many replicas on a single GPU can cause out-of-memory errors. Start conservatively and scale up while monitoring GPU memory.
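One simple way to watch for memory pressure while adding replicas is to poll nvidia-smi on the host (assumes the NVIDIA driver and nvidia-smi are installed):
# Refresh GPU memory usage and utilization every second while scaling up
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv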
Multiple GPUs
With multiple GPUs, scale workers proportionally:
2 GPUs:
segmentation-backend:
  deploy:
    replicas: 8  # 4 per GPU
ocr-backend:
  deploy:
    replicas: 4  # 2 per GPU
4 GPUs:
segmentation-backend:
  deploy:
    replicas: 16  # 4 per GPU
ocr-backend:
  deploy:
    replicas: 8  # 2 per GPU
GPU Pinning
For optimal performance, pin specific workers to specific GPUs:
segmentation-backend-gpu0:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]

segmentation-backend-gpu1:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]
CPU Scaling
For CPU-only deployments, scale more conservatively:
services:
  task:
    deploy:
      replicas: 10  # Reduced from the GPU default of 30
  segmentation-backend:
    deploy:
      replicas: 6
      resources: {}  # No GPU reservation
    environment:
      - MAX_BATCH_SIZE=64
      - OMP_NUM_THREADS=12
      - MKL_NUM_THREADS=12
  ocr-backend:
    deploy:
      replicas: 3
      resources: {}
CPU Threading Configuration:
- OMP_NUM_THREADS: OpenMP threads per worker
- MKL_NUM_THREADS: Intel MKL threads per worker
- NUMEXPR_NUM_THREADS: NumExpr threads per worker
Set thread counts to roughly total_cpu_cores / replicas to avoid over-subscription.
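As a rough worked example (values are illustrative): a 48-core host running 6 segmentation replicas would get 48 / 6 = 8 threads per worker. A small shell helper can compute this from the actual core count:
# Threads per worker = total cores / replicas (example: 6 replicas)
REPLICAS=6
echo "OMP_NUM_THREADS=$(( $(nproc) / REPLICAS ))"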
Load Balancing
Chunkr uses nginx for load balancing ML workers:
Segmentation Load Balancer
segmentation:
  image: nginx:latest
  ports:
    - "8001:8000"
  volumes:
    - ./nginx/segmentation.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - segmentation-backend
The nginx configuration distributes requests across all segmentation-backend replicas.
OCR Load Balancer
ocr:
  image: nginx:latest
  ports:
    - "8002:8000"
  volumes:
    - ./nginx/ocr.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - ocr-backend
The load balancers automatically detect all replicas using Docker’s DNS service discovery.
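To check that newly scaled replicas are visible behind a load balancer, list the backend containers; if new replicas do not receive traffic, reloading nginx forces it to re-resolve the backend hostname (a sketch; service names match the compose snippets above):
# List the segmentation backend replicas Docker has started
docker compose ps segmentation-backend
# Reload nginx so it re-reads its configuration and re-resolves upstreams
docker compose exec segmentation nginx -s reload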
Monitoring and Optimization
Monitor Queue Depth
Check Redis queue length to identify bottlenecks:
# Connect to Redis
docker compose exec redis redis-cli
# Check queue length
LLEN task_queue_name
# Monitor in real-time
watch -n 1 'docker compose exec redis redis-cli LLEN task_queue_name'
Interpretation:
- Queue growing: Workers can’t keep up, scale up
- Queue near zero: Workers idle, may be over-provisioned
- Queue stable: System balanced
Monitor Worker Utilization
# CPU and memory usage
docker stats
# GPU usage
watch -n 1 nvidia-smi
# Task worker logs
docker compose logs -f task | grep "Processing"
Identify Bottlenecks
Monitor queue depth
If queues are growing, workers are the bottleneck.
Check resource usage
- CPU near 100%: Scale horizontally or upgrade CPU
- GPU near 100%: Add more GPUs or increase batch size
- Memory high: Reduce replicas or batch sizes
Review processing times
Check logs for the average processing time per document (a sketch follows this list):
docker compose logs task | grep "completed in"
Scale the bottleneck
Increase replicas for the slowest component first.
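For the "Review processing times" step, if the task worker logs contain lines like "completed in 12.3" (the exact log format depends on your Chunkr version, so treat this as a sketch), a short pipeline can estimate the average:
# Extract "completed in <seconds>" and average the values
docker compose logs task \
  | grep -o 'completed in [0-9.]*' \
  | awk '{ sum += $3; n++ } END { if (n) printf "avg: %.2fs over %d tasks\n", sum / n, n }'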
Scaling Commands
Scale Specific Service
# Scale task workers to 50
docker compose up -d --scale task=50
# Scale segmentation workers to 12
docker compose up -d --scale segmentation-backend=12
# Scale multiple services
docker compose up -d --scale task=50 --scale segmentation-backend=12
Update compose.yaml
For persistent scaling, update the compose file:
services:
  task:
    deploy:
      replicas: 50
Then apply the change:
docker compose up -d
Zero-Downtime Scaling
# Scale up gradually
docker compose up -d --scale task=40 --no-recreate
sleep 10
docker compose up -d --scale task=50 --no-recreate
Database Scaling
PostgreSQL is a single instance by default. For production:
Connection Pooling
Add PgBouncer for connection pooling:
pgbouncer:
  image: pgbouncer/pgbouncer:latest
  environment:
    - DATABASES_HOST=postgres
    - DATABASES_PORT=5432
    - DATABASES_USER=postgres
    - DATABASES_PASSWORD=postgres
    - DATABASES_DBNAME=chunkr
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=1000
    - DEFAULT_POOL_SIZE=25
  ports:
    - "6432:6432"
Update connection string:
PG__URL=postgresql://postgres:postgres@pgbouncer:6432/chunkr
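Since the PgBouncer port is published on the host, a quick connectivity check through the pool looks like this (assumes psql is installed on the host and the example credentials above):
# Round-trip a trivial query through PgBouncer
psql "postgresql://postgres:postgres@localhost:6432/chunkr" -c "SELECT 1;"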
Read Replicas
For read-heavy workloads, add PostgreSQL read replicas and route read queries accordingly.
Redis Scaling
Redis Cluster
For high availability and better performance:
redis-cluster:
  image: redis:latest
  command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf
  volumes:
    - redis_cluster_data:/data
Redis Sentinel
For automatic failover:
redis-sentinel:
  image: redis:latest
  command: redis-sentinel /etc/redis/sentinel.conf
  volumes:
    - ./redis/sentinel.conf:/etc/redis/sentinel.conf
Cost Optimization
Auto-scaling Strategy
- Monitor queue depth every minute
- Scale up when queue > 100 tasks for 2 minutes
- Scale down when queue < 10 tasks for 5 minutes
- Minimum replicas: Keep baseline capacity
- Maximum replicas: Set budget limits
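Docker Compose has no built-in autoscaler, so the strategy above has to be scripted externally. Below is a minimal sketch: it assumes the placeholder queue name `task_queue_name` from the monitoring section, adjusts worker count with `--scale`, and uses illustrative thresholds; for simplicity it reacts immediately rather than waiting out the 2- and 5-minute windows.
# Naive autoscaler sketch: poll queue depth once a minute and adjust task replicas
QUEUE=task_queue_name   # placeholder queue name from the monitoring section
MIN=10; MAX=60; STEP=10
CURRENT=$MIN
while true; do
  DEPTH=$(docker compose exec -T redis redis-cli LLEN "$QUEUE")
  if [ "$DEPTH" -gt 100 ] && [ "$CURRENT" -lt "$MAX" ]; then
    CURRENT=$(( CURRENT + STEP > MAX ? MAX : CURRENT + STEP ))
    docker compose up -d --scale task="$CURRENT" --no-recreate
  elif [ "$DEPTH" -lt 10 ] && [ "$CURRENT" -gt "$MIN" ]; then
    CURRENT=$(( CURRENT - STEP < MIN ? MIN : CURRENT - STEP ))
    docker compose up -d --scale task="$CURRENT"
  fi
  sleep 60
done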
Resource Limits
Prevent runaway resource usage:
services:
  task:
    deploy:
      replicas: 30
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
Production Scaling Checklist
Baseline testing
- Test with expected document types
- Measure processing times
- Identify resource bottlenecks
Configure monitoring
- Set up metrics collection
- Configure alerts for queue depth
- Monitor GPU/CPU utilization
Scale incrementally
- Start with default configuration
- Increase replicas by 25-50% at a time
- Monitor impact before further scaling
Optimize bottlenecks
- Scale the slowest component first
- Tune batch sizes and threading
- Consider adding GPUs if needed
Set resource limits
- Prevent memory exhaustion
- Ensure predictable performance
- Enable graceful degradation
Next Steps