Overview
Workers are background processes that consume messages from Kafka topics and perform asynchronous operations. They form the processing backbone of Chronoverse, handling everything from job scheduling to log persistence. All workers are horizontally scalable through Kafka consumer groups, enabling high throughput and fault tolerance.
Worker Architecture
Scheduling Worker
Purpose
Identifies workflows that are due for execution and creates corresponding job records.
Architecture
Dependencies
- PostgreSQL: Stores workflows and jobs
- Kafka: Publishes job events
- Configuration: Poll interval, batch size, fetch limit
Configuration
Workflow
Create Job Records
For each workflow, create a job entry:
- Generate unique job ID (UUID)
- Set `scheduled_at` to current time
- Set `status` to PENDING
- Set `trigger` to AUTOMATIC
- Store in PostgreSQL
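The steps above can be sketched as follows (Python for illustration; the function name and exact record shape are assumptions, only the `scheduled_at`, `status`, and `trigger` fields come from this list):

```python
import uuid
from datetime import datetime, timezone

def build_job_record(workflow_id: str) -> dict:
    """Build a PENDING job record for a due workflow (field names are illustrative)."""
    return {
        "id": str(uuid.uuid4()),                                # unique job ID (UUID)
        "workflow_id": workflow_id,
        "scheduled_at": datetime.now(timezone.utc).isoformat(), # current time
        "status": "PENDING",
        "trigger": "AUTOMATIC",
    }
```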
Publish Events
Send job events to Kafka in batches:
- Topic: `workflows` or `jobs` (depending on workflow type)
- Partition: Based on workflow ID (ensures ordering)
- Transactional writes for exactly-once semantics
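Partitioning by workflow ID works because a deterministic hash of the key always lands on the same partition, so all events for one workflow stay ordered. A simplified stand-in (Kafka's default partitioner actually uses murmur2, not MD5):

```python
import hashlib

def partition_for(workflow_id: str, num_partitions: int) -> int:
    """Map a workflow ID to a stable partition so all its events stay ordered."""
    digest = hashlib.md5(workflow_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping is a pure function of the key, every producer instance routes a given workflow's events identically.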
Scaling Considerations
- Single Instance: Typically runs as a single instance (no consumer group)
- Concurrency: Uses database transactions to prevent duplicate job creation
- Performance: Poll interval and batch size tune throughput
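How poll interval, batch size, and fetch limit interact can be sketched as a poll loop (Python for illustration; the function names, parameters, and the `once` flag are hypothetical, not Chronoverse's actual API):

```python
import time

def run_scheduler(fetch_due, create_jobs, poll_interval=10.0, fetch_limit=100, once=False):
    """Poll for due workflows, create their jobs, then sleep until the next tick."""
    while True:
        due = fetch_due(limit=fetch_limit)   # bounded fetch of due workflows
        if due:
            create_jobs(due)                 # create job records in one batch
        if once:                             # single-tick mode, useful for testing
            return len(due)
        time.sleep(poll_interval)
```

A shorter `poll_interval` reduces scheduling latency at the cost of more database polling; a larger `fetch_limit` raises per-tick throughput.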
Workflow Worker
Purpose
Prepares Docker image configurations and execution environments for CONTAINER workflows.
Architecture
Dependencies
- Redis: Caches execution templates
- ClickHouse: Stores workflow metadata
- Kafka: Consumes workflow events, publishes job events
- Docker: Validates images and configurations
- gRPC Clients: Communicates with Workflows, Jobs, and Notifications services
Configuration
Workflow
Fetch Workflow Configuration
Call Workflows Service via gRPC:
- Get workflow details (kind, payload, etc.)
- Validate build status is COMPLETED
Validate Image
- Check if image exists in Docker
- Validate image name format
- Ensure image is accessible
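A minimal sketch of the name-format check (Python for illustration; Docker's actual image-reference grammar is more permissive, this regex is a simplified assumption):

```python
import re

# Simplified [registry[:port]/]repository[:tag] check with lowercase repository parts.
IMAGE_RE = re.compile(
    r"^(?:[a-z0-9.-]+(?::\d+)?/)?"   # optional registry host[:port]/
    r"[a-z0-9][a-z0-9._/-]*"         # repository path
    r"(?::[A-Za-z0-9._-]+)?$"        # optional :tag
)

def is_valid_image_name(image: str) -> bool:
    """Cheap pre-check before asking the Docker daemon whether the image exists."""
    return bool(IMAGE_RE.match(image))
```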
Prepare Execution Template
Create execution configuration:
- Container name: `chronoverse-{job_id}`
- Network: `chronoverse`
- Environment variables
- Resource limits
- Logging configuration
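Assembling that template might look like this (Python for illustration; the container name and network come from the list above, while the resource and logging defaults are placeholder assumptions):

```python
def build_execution_template(job_id: str, image: str, env: dict) -> dict:
    """Assemble the execution config that gets cached in Redis (structure is illustrative)."""
    return {
        "container_name": f"chronoverse-{job_id}",       # per-job container name
        "network": "chronoverse",                        # shared Docker network
        "image": image,
        "env": [f"{k}={v}" for k, v in env.items()],     # environment variables
        "resources": {"memory_mb": 512, "cpu_shares": 1024},  # assumed defaults
        "logging": {"driver": "json-file"},              # assumed logging config
    }
```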
Error Handling
- Invalid Image: Update workflow build status to FAILED
- Configuration Error: Log error and fail job with details
- Service Unavailable: Retry with exponential backoff (circuit breaker)
- Timeout: Mark job as FAILED and log timeout
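The retry-with-exponential-backoff strategy can be sketched as follows (Python for illustration; the base delay, cap, and jitter scheme are assumptions, not Chronoverse's actual settings):

```python
import random

def backoff_delays(base: float = 0.5, cap: float = 30.0, retries: int = 5):
    """Yield exponentially growing retry delays with full jitter, capped at `cap` seconds."""
    for attempt in range(retries):
        # Doubling the window each attempt; jitter spreads retries across instances.
        yield random.uniform(0, min(cap, base * 2 ** attempt))
```

Full jitter prevents many workers from retrying an unavailable service in lockstep, which would otherwise amplify the outage.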
Scaling Considerations
- Consumer Group: Multiple instances share the load
- Parallelism: Each instance processes multiple workflows concurrently
- Stateless: No shared state between instances
- Kafka Partitions: Scale up to the number of topic partitions
Execution Worker
Purpose
Executes scheduled jobs (both HEARTBEAT and CONTAINER types) in isolated environments.
Architecture
Dependencies
- Redis: Fetches execution templates and streams logs
- Docker: Executes containers
- Kafka: Consumes job events, publishes log and analytics events
- gRPC Clients: Communicates with Workflows, Jobs, and Notifications services
Configuration
Workflow
Fetch Job Details
Call Jobs Service via gRPC to get:
- Job ID and workflow ID
- Workflow kind (HEARTBEAT or CONTAINER)
- Current status
Update Status to RUNNING
- Call Jobs Service to update status
- Set `started_at` timestamp
- Publish notification event
Execute Based on Kind
HEARTBEAT: run the workflow's heartbeat check directly (no container is created).
CONTAINER:
- Fetch execution config from Redis
- Create Docker container
- Start container
- Stream logs (stdout/stderr)
- Wait for completion
- Capture exit code
Process Logs
For each log line:
- Add timestamp and sequence number
- Publish to Kafka's `job_logs` topic
- Stream to Redis for real-time viewing
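The per-line enrichment step can be sketched as a generator (Python for illustration; the record fields beyond timestamp and sequence number are assumptions):

```python
from datetime import datetime, timezone
from itertools import count

def enrich_logs(job_id: str, lines):
    """Attach a timestamp and a monotonically increasing sequence number to each log line."""
    seq = count(1)
    for line in lines:
        yield {
            "job_id": job_id,
            "seq": next(seq),                                   # preserves line ordering
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "message": line.rstrip("\n"),
        }
```

The sequence number lets consumers reorder lines correctly even if timestamps collide within the same millisecond.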
Update Final Status
- Call Jobs Service with final status
- Set `completed_at` timestamp
- Update workflow consecutive failure count
- Clean up resources (container, cache)
Container Lifecycle Management
Container Creation
Log Streaming
Cleanup
After job completion:
- Stop container (if still running)
- Remove container
- Delete Redis cache entries
- Remove any temporary volumes
Error Handling
- Container Creation Failed: Log error, mark job as FAILED
- Container Timeout: Stop container, mark as FAILED
- Docker Daemon Unavailable: Retry with backoff, eventually fail
- Log Streaming Error: Continue execution, log error separately
Scaling Considerations
- Consumer Group: Multiple instances process jobs in parallel
- Docker Access: Each instance needs Docker socket access
- Resource Limits: Monitor Docker host resources
- Network: All containers must access the `chronoverse` network
Scale Execution Workers based on job volume and execution time. More workers = higher concurrent execution capacity.
JobLogs Processor
Purpose
Persists job logs from Kafka to ClickHouse and MeiliSearch for efficient storage and querying.
Architecture
Dependencies
- Redis: Temporary log storage for running jobs
- ClickHouse: Long-term log storage
- MeiliSearch: Full-text search indexing
- Kafka: Consumes log events
Configuration
Workflow
Batch Accumulation
Accumulate logs until:
- Batch size reaches limit (e.g., 1000 logs)
- Time interval expires (e.g., 5 seconds)
- Whichever comes first
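The size-or-time accumulation logic can be sketched as a small batcher (Python for illustration; the class and method names are hypothetical, while the 1000-log / 5-second defaults come from the examples above):

```python
import time

class LogBatcher:
    """Flush when the batch hits `max_size` logs or `max_age` seconds, whichever comes first."""

    def __init__(self, max_size=1000, max_age=5.0, clock=time.monotonic):
        self.max_size, self.max_age, self.clock = max_size, max_age, clock
        self.items, self.started = [], None

    def add(self, item):
        if self.started is None:
            self.started = self.clock()   # age is measured from the first buffered log
        self.items.append(item)

    def ready(self):
        if not self.items:
            return False
        return (len(self.items) >= self.max_size
                or self.clock() - self.started >= self.max_age)

    def drain(self):
        batch, self.items, self.started = self.items, [], None
        return batch
```

The injected `clock` makes the time-based trigger testable without real sleeps.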
Write to ClickHouse
Batch insert logs into ClickHouse.
ClickHouse optimizations:
- Column-oriented storage for compression
- Time-based partitioning
- Asynchronous inserts
Batching Strategy
Size-Based Batching
When batch reaches configured size:
- Reduces write operations
- Improves throughput
- May increase latency slightly
Time-Based Batching
When time interval expires:
- Ensures timely writes
- Prevents indefinite accumulation
- Handles low-volume scenarios
Error Handling
- ClickHouse Unavailable: Retry with backoff, don’t commit offset
- MeiliSearch Unavailable: Log error, continue (search is non-critical)
- Batch Write Failed: Retry entire batch, eventually move to dead-letter queue
Scaling Considerations
- Consumer Group: Multiple instances for parallel processing
- Partition Assignment: Kafka rebalancing distributes load
- Write Throughput: ClickHouse batching improves performance
- Search Indexing: MeiliSearch can handle concurrent writes
JobLogs Processor is optimized for high-volume log ingestion. Scale based on log volume and write latency requirements.
Analytics Processor
Purpose
Consumes job and workflow events to generate analytics data and metrics.
Architecture
Dependencies
- PostgreSQL: Stores aggregated analytics
- Kafka: Consumes analytics events
Configuration
Workflow
Process Event
Extract metrics:
- Job success/failure counts
- Execution duration statistics
- Workflow reliability metrics
- Resource usage patterns
Metrics Tracked
Job Metrics
- Total executions
- Success rate
- Failure rate
- Average duration
- Duration percentiles (p50, p95, p99)
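The duration percentiles can be computed with a nearest-rank method (Python for illustration; the aggregation pipeline may well compute these differently, e.g. with approximate quantile sketches):

```python
import math

def percentile(durations, p):
    """Nearest-rank percentile of a list of job durations (p in 0..100)."""
    if not durations:
        raise ValueError("no samples")
    data = sorted(durations)
    k = max(1, math.ceil(p / 100 * len(data)))  # 1-based nearest rank
    return data[k - 1]
```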
Workflow Metrics
- Active workflows count
- Terminated workflows
- Average interval
- Consecutive failures trend
System Metrics
- Total jobs processed
- Worker utilization
- Queue depth
- Processing latency
User Metrics
- Workflows per user
- Jobs per user
- Resource consumption
- Activity patterns
Scaling Considerations
- Consumer Group: Multiple instances for parallel processing
- Database Writes: Use batching and upserts for efficiency
- Aggregation: Consider time-windowed aggregations for large datasets
Worker Communication Flow
Complete Job Execution Flow
Monitoring Workers
Health Checks
All workers expose health metrics:
- Kafka Consumer Lag: How far behind the latest message
- Processing Rate: Messages processed per second
- Error Rate: Failed message processing attempts
- Resource Usage: CPU, memory, disk I/O
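Consumer lag, the key scaling signal, is simply the gap between the latest offset and the committed offset, summed over a worker's partitions (sketched in Python for illustration):

```python
def consumer_lag(latest_offsets: dict, committed_offsets: dict) -> int:
    """Total lag = sum over partitions of (latest offset - committed offset)."""
    return sum(
        latest_offsets[p] - committed_offsets.get(p, 0)  # uncommitted partitions count fully
        for p in latest_offsets
    )
```

A persistently growing total is the usual trigger for adding another instance to the consumer group.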
Observability
Workers integrate with OpenTelemetry:
- Traces: Track message processing flow
- Metrics: Consumer lag, processing time, error counts
- Logs: Structured logging with context
View worker metrics in Grafana dashboards (LGTM stack) on port 3000.
Worker Best Practices
Configuration Tuning
Consumer Group Size
- Start with 1-2 instances per worker type
- Scale based on consumer lag metrics
- Don’t exceed Kafka partition count
Parallelism Limits
- Set based on available CPU cores
- Monitor resource utilization
- Account for I/O-bound operations
Error Handling
- Transient Errors: Retry with exponential backoff
- Permanent Errors: Log and move to dead-letter queue
- Circuit Breakers: Prevent cascading failures
- Graceful Degradation: Continue processing other messages
Resource Management
- Connection Pooling: Reuse database connections
- Memory Limits: Set container memory limits
- Garbage Collection: Monitor and tune GC settings
- Graceful Shutdown: Handle SIGTERM for clean exits
Next Steps
Architecture
Learn about overall system architecture
Workflows
Understand workflow concepts
Jobs
Learn about job execution
Deployment
Deploy and configure workers