Chunkr is designed to scale horizontally by increasing the number of worker replicas. This guide covers scaling strategies for different workload patterns.
Understanding Chunkr’s Architecture
Chunkr uses a distributed architecture with specialized workers:
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐      ┌─────────────┐
│   Server    │◄────►│    Redis    │
│ (1 replica) │      │   (Queue)   │
└──────┬──────┘      └─────────────┘
       │
       │ Enqueues tasks
       │
       ▼
┌─────────────────────────────────┐
│       Task Workers (30)         │
│  - Orchestrate processing       │
│  - Call ML services             │
└────┬──────────────────────┬─────┘
     │                      │
     ▼                      ▼
┌──────────────┐      ┌──────────────┐
│ Segmentation │      │     OCR      │
│   Workers    │      │   Workers    │
│ (6 replicas) │      │ (3 replicas) │
└──────────────┘      └──────────────┘
Default Replica Configuration
The default compose.yaml is optimized for medium workloads:
| Service | Default Replicas | Purpose |
|---|---|---|
| server | 1 | API server (stateless, can scale) |
| task | 30 | Background task orchestration |
| segmentation-backend | 6 | Document layout analysis |
| ocr-backend | 3 | Text recognition |
| web | 1 | Frontend UI |
| postgres | 1 | Database (single instance) |
| redis | 1 | Queue/cache (single instance) |
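To confirm what is actually running against these defaults, the standard `docker compose ps` command lists the containers per service (the service names below simply mirror the table above):
# List all containers in the project
docker compose ps
# Show only the task workers
docker compose ps task
# Count the running task replicas
docker compose ps -q task | wc -l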
Scaling Strategies
Vertical Scaling
Increase resources for individual containers:
services:
  task:
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
Horizontal Scaling
Increase the number of worker replicas:
services:
  task:
    deploy:
      replicas: 50  # Scale up from 30
  segmentation-backend:
    deploy:
      replicas: 12  # Scale up from 6
  ocr-backend:
    deploy:
      replicas: 6  # Scale up from 3
Horizontal scaling is more effective for Chunkr as it allows processing multiple documents in parallel.
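If you prefer not to edit the compose file first, the same replica counts can be applied at runtime with `--scale` (a sketch; the service names and numbers simply mirror the example above):
# Apply the example replica counts without editing the compose file
docker compose up -d \
  --scale task=50 \
  --scale segmentation-backend=12 \
  --scale ocr-backend=6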
Scaling for Different Workloads
High Volume, Small Documents
Characteristics: Many PDF pages, mostly text
Recommended configuration:
services:
  task:
    deploy:
      replicas: 50
  segmentation-backend:
    deploy:
      replicas: 8
  ocr-backend:
    deploy:
      replicas: 12  # OCR is the bottleneck for text-heavy docs
Large Documents, Complex Layouts
Characteristics: Multi-page documents with tables, images, complex formatting
Recommended configuration:
services:
  task:
    deploy:
      replicas: 30
  segmentation-backend:
    deploy:
      replicas: 12  # Segmentation is the bottleneck
  ocr-backend:
    deploy:
      replicas: 6
Mixed Workload
Characteristics: Variety of document types and sizes
Recommended configuration:
services:
  task:
    deploy:
      replicas: 40
  segmentation-backend:
    deploy:
      replicas: 10
  ocr-backend:
    deploy:
      replicas: 8
GPU Scaling Considerations
Single GPU
With one GPU (8GB+), balance workers to avoid memory contention:
segmentation-backend:
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  environment:
    - MAX_BATCH_SIZE=4

ocr-backend:
  deploy:
    replicas: 2
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
Too many replicas on a single GPU can cause out-of-memory errors. Start conservatively and scale up while monitoring GPU memory.
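One simple way to watch for memory pressure while adding replicas is to poll nvidia-smi on the host (assumes the NVIDIA driver and nvidia-smi are installed):
# Refresh GPU memory usage and utilization every second while scaling up
watch -n 1 nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu --format=csv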
Multiple GPUs
With multiple GPUs, scale workers proportionally:
2 GPUs:
segmentation-backend:
  deploy:
    replicas: 8  # 4 per GPU
ocr-backend:
  deploy:
    replicas: 4  # 2 per GPU
4 GPUs:
segmentation-backend:
  deploy:
    replicas: 16  # 4 per GPU
ocr-backend:
  deploy:
    replicas: 8  # 2 per GPU
GPU Pinning
For optimal performance, pin specific workers to specific GPUs:
segmentation-backend-gpu0:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]

segmentation-backend-gpu1:
  build:
    context: .
    dockerfile: docker/segmentation/Dockerfile
  deploy:
    replicas: 4
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]
CPU Scaling
For CPU-only deployments, scale more conservatively:
services:
  task:
    deploy:
      replicas: 10  # Reduced from the GPU default of 30
  segmentation-backend:
    deploy:
      replicas: 6
      resources: {}  # No GPU reservation
    environment:
      - MAX_BATCH_SIZE=64
      - OMP_NUM_THREADS=12
      - MKL_NUM_THREADS=12
  ocr-backend:
    deploy:
      replicas: 3
      resources: {}
CPU Threading Configuration:
- OMP_NUM_THREADS: OpenMP threads per worker
- MKL_NUM_THREADS: Intel MKL threads per worker
- NUMEXPR_NUM_THREADS: NumExpr threads per worker
Set thread counts to roughly total_cpu_cores / replicas to avoid over-subscription.
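As a rough worked example (values are illustrative): a 48-core host running 6 segmentation replicas would get 48 / 6 = 8 threads per worker. A small shell helper can compute this from the actual core count:
# Threads per worker = total cores / replicas (example: 6 replicas)
REPLICAS=6
echo "OMP_NUM_THREADS=$(( $(nproc) / REPLICAS ))"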
Load Balancing
Chunkr uses nginx for load balancing ML workers:
Segmentation Load Balancer
segmentation:
  image: nginx:latest
  ports:
    - "8001:8000"
  volumes:
    - ./nginx/segmentation.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - segmentation-backend
The nginx configuration distributes requests across all segmentation-backend replicas.
OCR Load Balancer
ocr:
  image: nginx:latest
  ports:
    - "8002:8000"
  volumes:
    - ./nginx/ocr.conf:/etc/nginx/nginx.conf:ro
  depends_on:
    - ocr-backend
The load balancers automatically detect all replicas using Docker’s DNS service discovery.
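To check that newly scaled replicas are visible behind a load balancer, list the backend containers; if new replicas do not receive traffic, reloading nginx forces it to re-resolve the backend hostname (a sketch; service names match the compose snippets above):
# List the segmentation backend replicas Docker has started
docker compose ps segmentation-backend
# Reload nginx so it re-reads its configuration and re-resolves upstreams
docker compose exec segmentation nginx -s reload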
Monitoring and Optimization
Monitor Queue Depth
Check Redis queue length to identify bottlenecks:
# Connect to Redis
docker compose exec redis redis-cli
# Check queue length
LLEN task_queue_name
# Monitor in real-time
watch -n 1 'docker compose exec redis redis-cli LLEN task_queue_name'
Interpretation:
- Queue growing: Workers can’t keep up, scale up
- Queue near zero: Workers idle, may be over-provisioned
- Queue stable: System balanced
Monitor Worker Utilization
# CPU and memory usage
docker stats
# GPU usage
watch -n 1 nvidia-smi
# Task worker logs
docker compose logs -f task | grep "Processing"
Identify Bottlenecks
Monitor queue depth
If queues are growing, workers are the bottleneck.
Check resource usage
- CPU near 100%: Scale horizontally or upgrade CPU
- GPU near 100%: Add more GPUs or increase batch size
- Memory high: Reduce replicas or batch sizes
Review processing times
Check logs for the average processing time per document (a sketch follows this list):
docker compose logs task | grep "completed in"
Scale the bottleneck
Increase replicas for the slowest component first.
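For the "Review processing times" step, if the task worker logs contain lines like "completed in 12.3" (the exact log format depends on your Chunkr version, so treat this as a sketch), a short pipeline can estimate the average:
# Extract "completed in <seconds>" and average the values
docker compose logs task \
  | grep -o 'completed in [0-9.]*' \
  | awk '{ sum += $3; n++ } END { if (n) printf "avg: %.2fs over %d tasks\n", sum / n, n }'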
Scaling Commands
Scale Specific Service
# Scale task workers to 50
docker compose up -d --scale task=50
# Scale segmentation workers to 12
docker compose up -d --scale segmentation-backend=12
# Scale multiple services
docker compose up -d --scale task=50 --scale segmentation-backend=12
Update compose.yaml
For persistent scaling, update the compose file:
services:
  task:
    deploy:
      replicas: 50
Then apply the change:
docker compose up -d
Zero-Downtime Scaling
# Scale up gradually
docker compose up -d --scale task=40 --no-recreate
sleep 10
docker compose up -d --scale task=50 --no-recreate
Database Scaling
PostgreSQL is a single instance by default. For production:
Connection Pooling
Add PgBouncer for connection pooling:
pgbouncer:
  image: pgbouncer/pgbouncer:latest
  environment:
    - DATABASES_HOST=postgres
    - DATABASES_PORT=5432
    - DATABASES_USER=postgres
    - DATABASES_PASSWORD=postgres
    - DATABASES_DBNAME=chunkr
    - POOL_MODE=transaction
    - MAX_CLIENT_CONN=1000
    - DEFAULT_POOL_SIZE=25
  ports:
    - "6432:6432"
Update connection string:
PG__URL=postgresql://postgres:postgres@pgbouncer:6432/chunkr
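Since the PgBouncer port is published on the host, a quick connectivity check through the pool looks like this (assumes psql is installed on the host and the example credentials above):
# Round-trip a trivial query through PgBouncer
psql "postgresql://postgres:postgres@localhost:6432/chunkr" -c "SELECT 1;"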
Read Replicas
For read-heavy workloads, add PostgreSQL read replicas and route read queries accordingly.
Redis Scaling
Redis Cluster
For high availability and better performance:
redis-cluster:
  image: redis:latest
  command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf
  volumes:
    - redis_cluster_data:/data
Redis Sentinel
For automatic failover:
redis-sentinel:
  image: redis:latest
  command: redis-sentinel /etc/redis/sentinel.conf
  volumes:
    - ./redis/sentinel.conf:/etc/redis/sentinel.conf
Cost Optimization
Auto-scaling Strategy
- Monitor queue depth every minute
- Scale up when queue > 100 tasks for 2 minutes
- Scale down when queue < 10 tasks for 5 minutes
- Minimum replicas: Keep baseline capacity
- Maximum replicas: Set budget limits
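Docker Compose has no built-in autoscaler, so the strategy above has to be scripted externally. Below is a minimal sketch: it assumes the placeholder queue name `task_queue_name` from the monitoring section, adjusts worker count with `--scale`, and uses illustrative thresholds; for simplicity it reacts immediately rather than waiting out the 2- and 5-minute windows.
# Naive autoscaler sketch: poll queue depth once a minute and adjust task replicas
QUEUE=task_queue_name   # placeholder queue name from the monitoring section
MIN=10; MAX=60; STEP=10
CURRENT=$MIN
while true; do
  DEPTH=$(docker compose exec -T redis redis-cli LLEN "$QUEUE")
  if [ "$DEPTH" -gt 100 ] && [ "$CURRENT" -lt "$MAX" ]; then
    CURRENT=$(( CURRENT + STEP > MAX ? MAX : CURRENT + STEP ))
    docker compose up -d --scale task="$CURRENT" --no-recreate
  elif [ "$DEPTH" -lt 10 ] && [ "$CURRENT" -gt "$MIN" ]; then
    CURRENT=$(( CURRENT - STEP < MIN ? MIN : CURRENT - STEP ))
    docker compose up -d --scale task="$CURRENT"
  fi
  sleep 60
done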
Resource Limits
Prevent runaway resource usage:
services:
  task:
    deploy:
      replicas: 30
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G
Production Scaling Checklist
Baseline testing
- Test with expected document types
- Measure processing times
- Identify resource bottlenecks
Configure monitoring
- Set up metrics collection
- Configure alerts for queue depth
- Monitor GPU/CPU utilization
Scale incrementally
- Start with default configuration
- Increase replicas by 25-50% at a time
- Monitor impact before further scaling
Optimize bottlenecks
- Scale the slowest component first
- Tune batch sizes and threading
- Consider adding GPUs if needed
Set resource limits
- Prevent memory exhaustion
- Ensure predictable performance
- Enable graceful degradation
Next Steps