
Chunkr is designed to scale horizontally by increasing the number of worker replicas. This guide covers scaling strategies for different workload patterns.

Understanding Chunkr’s Architecture

Chunkr uses a distributed architecture with specialized workers:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐      ┌─────────────┐
│   Server    │◄────►│    Redis    │
│ (1 replica) │      │   (Queue)   │
└──────┬──────┘      └─────────────┘
       │
       │ Enqueues tasks
       ▼
┌─────────────────────────────────┐
│      Task Workers (30)          │
│  - Orchestrate processing       │
│  - Call ML services             │
└────┬──────────────────────┬─────┘
     │                      │
     ▼                      ▼
┌─────────────┐      ┌──────────────┐
│Segmentation │      │     OCR      │
│  Workers    │      │   Workers    │
│ (6 replicas)│      │ (3 replicas) │
└─────────────┘      └──────────────┘
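
To enumerate these services in your own deployment, you can read them straight from the compose file:
# List every service defined in compose.yaml
docker compose config --services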

Default Replica Configuration

The default compose.yaml is optimized for medium workloads:
Service                Default Replicas   Purpose
server                 1                  API server (stateless, can scale)
task                   30                 Background task orchestration
segmentation-backend   6                  Document layout analysis
ocr-backend            3                  Text recognition
web                    1                  Frontend UI
postgres               1                  Database (single instance)
redis                  1                  Queue/cache (single instance)
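
To confirm how many replicas of each service are actually running, list the containers:
# Replicas appear as numbered containers, e.g. chunkr-task-1, chunkr-task-2
docker compose ps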

Scaling Strategies

Vertical Scaling

Increase resources for individual containers:
services:
  task:
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
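
Limit changes only take effect when containers are recreated; after editing the compose file, apply and verify with:
# Recreate the task workers with the new limits, then check usage
docker compose up -d task
docker stats --no-stream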

Horizontal Scaling

Increase the number of worker replicas:
services:
  task:
    deploy:
      replicas: 50  # Scale up from 30
  
  segmentation-backend:
    deploy:
      replicas: 12  # Scale up from 6
  
  ocr-backend:
    deploy:
      replicas: 6   # Scale up from 3
Horizontal scaling is usually more effective for Chunkr, since additional replicas let it process more documents in parallel.

Scaling for Different Workloads

High Volume, Small Documents

Characteristics: Many PDF pages, mostly text
Recommended configuration:
services:
  task:
    deploy:
      replicas: 50
  
  segmentation-backend:
    deploy:
      replicas: 8
  
  ocr-backend:
    deploy:
      replicas: 12  # OCR is bottleneck for text-heavy docs

Large Documents, Complex Layouts

Characteristics: Multi-page documents with tables, images, and complex formatting
Recommended configuration:
services:
  task:
    deploy:
      replicas: 30
  
  segmentation-backend:
    deploy:
      replicas: 12  # Segmentation is bottleneck
  
  ocr-backend:
    deploy:
      replicas: 6

Mixed Workload

Characteristics: A variety of document types and sizes
Recommended configuration:
services:
  task:
    deploy:
      replicas: 40
  
  segmentation-backend:
    deploy:
      replicas: 10
  
  ocr-backend:
    deploy:
      replicas: 8

GPU Scaling Considerations

Single GPU

With one GPU (8GB+), balance workers to avoid memory contention:
segmentation-backend:
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - MAX_BATCH_SIZE=4

ocr-backend:
  deploy:
    replicas: 2
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
Too many replicas on a single GPU can cause out-of-memory errors. Start conservatively and scale up while monitoring GPU memory.
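
A simple way to watch GPU memory while you add replicas is nvidia-smi's query mode:
# Print used/total GPU memory once per second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1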

Multiple GPUs

With multiple GPUs, scale workers proportionally.
2 GPUs:
segmentation-backend:
  deploy:
    replicas: 8  # 4 per GPU

ocr-backend:
  deploy:
    replicas: 4  # 2 per GPU
4 GPUs:
segmentation-backend:
  deploy:
    replicas: 16  # 4 per GPU

ocr-backend:
  deploy:
    replicas: 8   # 2 per GPU

GPU Pinning

For optimal performance, pin specific workers to specific GPUs:
segmentation-backend-gpu0:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]

segmentation-backend-gpu1:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]
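
To verify that the pinning took effect, each worker group should see only its own device:
# Each service should list exactly one GPU (device 0 or 1 respectively)
docker compose exec segmentation-backend-gpu0 nvidia-smi -L
docker compose exec segmentation-backend-gpu1 nvidia-smi -L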

CPU Scaling

For CPU-only deployments, scale more conservatively:
services:
  task:
    deploy:
      replicas: 10  # Reduced from GPU default
  
  segmentation-backend:
    deploy:
      replicas: 6
      resources: {}  # no GPU reservations on a CPU-only host
    environment:
      - MAX_BATCH_SIZE=64
      - OMP_NUM_THREADS=12
      - MKL_NUM_THREADS=12
  
  ocr-backend:
    deploy:
      replicas: 3
      resources: {}  # no GPU reservations on a CPU-only host
CPU Threading Configuration:
  • OMP_NUM_THREADS: OpenMP threads per worker
  • MKL_NUM_THREADS: Intel MKL threads per worker
  • NUMEXPR_NUM_THREADS: NumExpr threads per worker
Set each worker's thread count to roughly total_cpu_cores / replicas to avoid over-subscription.
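
As a worked example: the values above imply roughly a 72-core host, since 72 cores / 6 segmentation replicas = 12 threads per worker (OMP_NUM_THREADS=12). To compute it for your own host:
# Threads per worker for 6 replicas on this machine
echo $(( $(nproc) / 6 ))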

Load Balancing

Chunkr uses nginx to load-balance requests across the ML workers:

Segmentation Load Balancer

segmentation:
  image: nginx:latest
  ports:
    - "8001:8000"
  volumes:
    - ./nginx/segmentation.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - segmentation-backend
The nginx configuration distributes requests across all segmentation-backend replicas.

OCR Load Balancer

ocr:
  image: nginx:latest
  ports:
    - "8002:8000"
  volumes:
    - ./nginx/ocr.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - ocr-backend
The load balancers automatically detect all replicas using Docker’s DNS service discovery.
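
One way to confirm this from inside a load balancer is to ask Docker's DNS for the backend addresses (assuming glibc's getent is available, as in the Debian-based nginx images):
# Lists the address of every running segmentation-backend replica
docker compose exec segmentation getent ahostsv4 segmentation-backend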

Monitoring and Optimization

Monitor Queue Depth

Check Redis queue length to identify bottlenecks:
# Connect to Redis
docker compose exec redis redis-cli

# Check queue length
LLEN task_queue_name

# Monitor in real-time
watch -n 1 'docker compose exec redis redis-cli LLEN task_queue_name'
Interpretation:
  • Queue growing: Workers can't keep up; scale up
  • Queue near zero: Workers are idle; may be over-provisioned
  • Queue stable: System is balanced

Monitor Worker Utilization

# CPU and memory usage
docker stats

# GPU usage
watch -n 1 nvidia-smi

# Task worker logs
docker compose logs -f task | grep "Processing"

Identify Bottlenecks

1. Monitor queue depth
If queues are growing, workers are the bottleneck.
2. Check resource usage
  • CPU near 100%: Scale horizontally or upgrade CPUs
  • GPU near 100%: Add more GPUs or increase batch sizes
  • Memory high: Reduce replicas or batch sizes
3. Review processing times
Check logs for the average processing time per document:
docker compose logs task | grep "completed in"
4. Scale the bottleneck
Increase replicas for the slowest component first.
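
If you want a rough average from those logs, a one-liner sketch, assuming log lines shaped like "completed in 12.3s" (adjust the pattern to your actual format):
# Average the seconds reported after "completed in"
docker compose logs task \
  | grep -oE 'completed in [0-9.]+' \
  | awk '{ sum += $3; n++ } END { if (n) printf "avg %.2fs over %d tasks\n", sum/n, n }'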

Scaling Commands

Scale Specific Service

# Scale task workers to 50
docker compose up -d --scale task=50

# Scale segmentation workers to 12
docker compose up -d --scale segmentation-backend=12

# Scale multiple services
docker compose up -d --scale task=50 --scale segmentation-backend=12

Update compose.yaml

For persistent scaling, update the compose file:
services:
  task:
    deploy:
      replicas: 50
Then apply:
docker compose up -d
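
Alternatively, Docker Compose automatically merges compose.override.yaml over compose.yaml, so site-specific replica counts can live outside the upstream file:
# compose.override.yaml (picked up automatically by docker compose up)
services:
  task:
    deploy:
      replicas: 50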

Zero-Downtime Scaling

# Scale up gradually
docker compose up -d --scale task=40 --no-recreate
sleep 10
docker compose up -d --scale task=50 --no-recreate

Database Scaling

PostgreSQL is a single instance by default. For production:

Connection Pooling

Add PgBouncer for connection pooling:
pgbouncer:
  image: pgbouncer/pgbouncer:latest
  environment:
    - DATABASES_HOST=postgres
    - DATABASES_PORT=5432
    - DATABASES_USER=postgres
    - DATABASES_PASSWORD=postgres
    - DATABASES_DBNAME=chunkr
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=1000
    - DEFAULT_POOL_SIZE=25
  ports:
    - "6432:6432"
Then update the connection string:
PG__URL=postgresql://postgres:postgres@pgbouncer:6432/chunkr
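
To sanity-check the pooler, you can run psql from the bundled postgres image against PgBouncer (service names as in the snippets above):
# A row back means connections flow through PgBouncer end to end
docker compose run --rm postgres \
  psql postgresql://postgres:postgres@pgbouncer:6432/chunkr -c 'SELECT 1;'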

Read Replicas

For read-heavy workloads, add PostgreSQL read replicas and route read queries accordingly.

Redis Scaling

Redis Cluster

For high availability and better performance:
redis-cluster:
  image: redis:latest
  command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf
  volumes:
    - redis_cluster_data:/data

Redis Sentinel

For automatic failover:
redis-sentinel:
  image: redis:latest
  command: redis-sentinel /etc/redis/sentinel.conf
  volumes:
    - ./redis/sentinel.conf:/etc/redis/sentinel.conf

Cost Optimization

Auto-scaling Strategy

  1. Monitor queue depth every minute
  2. Scale up when queue > 100 tasks for 2 minutes
  3. Scale down when queue < 10 tasks for 5 minutes
  4. Minimum replicas: Keep baseline capacity
  5. Maximum replicas: Set budget limits
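
Docker Compose has no built-in auto-scaler, but a small shell loop can approximate this strategy. A minimal sketch; it reacts to single readings rather than the sustained 2- and 5-minute windows above, and task_queue_name is a placeholder for your deployment's actual queue key:
#!/usr/bin/env bash
# Naive queue-depth auto-scaler for the task workers
QUEUE=task_queue_name   # placeholder: use your deployment's queue key
MIN=10                  # baseline capacity
MAX=50                  # budget limit
STEP=5
replicas=$MIN

while true; do
  depth=$(docker compose exec -T redis redis-cli LLEN "$QUEUE")
  if [ "$depth" -gt 100 ] && [ "$replicas" -lt "$MAX" ]; then
    replicas=$((replicas + STEP))
    [ "$replicas" -gt "$MAX" ] && replicas=$MAX
    docker compose up -d --scale "task=$replicas" --no-recreate
  elif [ "$depth" -lt 10 ] && [ "$replicas" -gt "$MIN" ]; then
    replicas=$((replicas - STEP))
    [ "$replicas" -lt "$MIN" ] && replicas=$MIN
    docker compose up -d --scale "task=$replicas"
  fi
  sleep 60
done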

Resource Limits

Prevent runaway resource usage:
services:
  task:
    deploy:
      replicas: 30
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G

Production Scaling Checklist

1. Baseline testing
  • Test with expected document types
  • Measure processing times
  • Identify resource bottlenecks
2. Configure monitoring
  • Set up metrics collection
  • Configure alerts for queue depth
  • Monitor GPU/CPU utilization
3. Scale incrementally
  • Start with the default configuration
  • Increase replicas by 25-50% at a time
  • Monitor impact before scaling further
4. Optimize bottlenecks
  • Scale the slowest component first
  • Tune batch sizes and threading
  • Consider adding GPUs if needed
5. Set resource limits
  • Prevent memory exhaustion
  • Ensure predictable performance
  • Enable graceful degradation
