Overview
Chronoverse implements a message-driven microservices architecture designed for reliability, scalability, and fault tolerance. The system uses a dual communication approach to balance responsiveness and reliability:
- Kafka: For reliable, asynchronous processing and event-driven workflows
- gRPC: For efficient, low-latency synchronous service-to-service communication
System Components
The architecture consists of three main layers: infrastructure, services, and workers.
Infrastructure Layer
PostgreSQL
Stores transactional data including workflows, jobs, users, and notifications
ClickHouse
High-performance analytics database for job logs and metrics
Redis
In-memory cache for session management and real-time data
Kafka
Message broker for asynchronous event processing
MeiliSearch
Fast search engine for job logs and workflow queries
Docker
Container runtime for executing workflow jobs
Service Layer
The service layer exposes gRPC APIs for synchronous operations:
Server (HTTP Gateway)
HTTP API gateway that exposes RESTful endpoints to external clients. Handles:
- Authentication and authorization middleware
- Request routing to internal gRPC services
- Session management via Redis
- Port: 8080
Users Service
Manages user accounts and authentication. Handles:
- User registration and login
- JWT token generation and validation
- Notification preferences
- Port: 50051
Workflows Service
Manages workflow definitions and configuration. Handles:
- Workflow CRUD operations
- Build status tracking
- Consecutive failure counting
- Workflow termination
- Port: 50052
Jobs Service
Manages job lifecycle from scheduling through completion. Handles:
- Job scheduling and status updates
- Job log retrieval and streaming
- Job history and filtering
- Port: 50053
Notifications Service
Provides real-time alerts and status updates. Handles:
- Server-Sent Events (SSE) for real-time notifications
- Workflow and job state change notifications
- Port: 50054
Analytics Service
Provides insights into performance and trends. Handles:
- Job and workflow metrics
- Performance analytics
- Trend analysis
- Port: 50055
Worker Layer
Workers consume messages from Kafka topics and perform background processing:
Scheduling Worker
Identifies jobs due for execution based on their schedules. It:
- Polls PostgreSQL for workflows ready to execute
- Creates job entries in the database
- Publishes job events to Kafka for processing
- Supports automatic retry and error handling
Workflow Worker
Prepares workflow execution environments. It:
- Consumes workflow events from Kafka
- Builds Docker image configurations from workflow definitions
- Prepares execution templates for CONTAINER workflows
- Validates workflow payloads
- Updates workflow build status
Execution Worker
Executes scheduled jobs in isolated containers. It:
- Consumes job events from Kafka
- Executes HEARTBEAT and CONTAINER workflow types
- Manages job lifecycle (start, monitor, complete)
- Captures stdout/stderr logs
- Publishes logs to Kafka for persistence
JobLogs Processor
Persists execution logs to long-term storage. It:
- Consumes log events from Kafka
- Performs efficient batch insertion to ClickHouse
- Indexes logs in MeiliSearch for fast searching
- Optimizes storage and querying performance
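The batching step can be sketched as a size-bounded buffer that flushes when full; the flush callback below is a stand-in for the real ClickHouse batch insert:

```go
package main

import "fmt"

// LogBatcher accumulates log lines and flushes them in fixed-size
// batches, trading a little latency for far fewer database round trips.
type LogBatcher struct {
	size  int
	buf   []string
	flush func(batch []string)
}

func NewLogBatcher(size int, flush func([]string)) *LogBatcher {
	return &LogBatcher{size: size, flush: flush}
}

// Add buffers one line and flushes when the batch is full.
func (b *LogBatcher) Add(line string) {
	b.buf = append(b.buf, line)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush writes any buffered lines and resets the buffer.
func (b *LogBatcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	batches := 0
	b := NewLogBatcher(100, func(batch []string) {
		batches++ // stand-in for a single ClickHouse INSERT of the whole batch
	})
	for i := 0; i < 250; i++ {
		b.Add(fmt.Sprintf("log line %d", i))
	}
	b.Flush() // drain the 50-line remainder on shutdown
	fmt.Println("batches:", batches) // → batches: 3
}
```

A production batcher would also flush on a timer so sparse logs are not held indefinitely.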
Analytics Processor
Generates analytics data from job and workflow events. It:
- Consumes events from Kafka
- Aggregates metrics and performance data
- Stores results in PostgreSQL for querying
- Enables trend analysis and reporting
Communication Patterns
Synchronous Communication (gRPC)
Services communicate via gRPC for low-latency, request-response operations. All gRPC connections support mTLS encryption and include circuit breakers and retry logic for resilience.
Asynchronous Communication (Kafka)
Workers use Kafka topics for reliable, event-driven processing. Kafka topics use SSL/TLS encryption and support consumer groups for horizontal scaling and fault tolerance.
Security Architecture
Authentication & Authorization
- JWT Tokens: EdDSA (Ed25519) signatures for token generation and validation
- Role-Based Access: Admin and user roles with different permission levels
- Session Management: Redis-backed sessions with configurable TTL
Transport Security
- mTLS: Mutual TLS authentication between all services
- TLS 1.2/1.3: Encrypted communication for all protocols
- Certificate Management: Automated certificate generation and rotation
Network Isolation
- Docker Network: Internal bridge network isolates services
- Port Exposure: Minimal external port exposure in production
- Proxy Access: Docker socket access via security proxy
Deployment Architecture
Development Environment
Production Environment
Observability
OpenTelemetry Integration
All services and workers export telemetry data:
- Traces: Distributed tracing across service boundaries
- Metrics: Performance counters and resource utilization
- Logs: Structured logging with context propagation
LGTM Stack
Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir) provides:
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing backend
- Loki: Log aggregation and querying
- Mimir: Metrics storage and querying
OTEL data is exported via gRPC to the LGTM collector on port 4317.
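This endpoint is typically wired up through the standard OTLP exporter environment variables; a sketch (the `lgtm` host name and service name are illustrative):

```ini
OTEL_SERVICE_NAME=jobs-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://lgtm:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```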
Scalability
Horizontal Scaling
- Stateless Services: All services can scale horizontally
- Kafka Consumer Groups: Workers scale by adding more instances
- Load Balancing: gRPC supports client-side load balancing
Vertical Scaling
- Auto Memory Limit: Automatic memory management based on container limits
- Auto Max Procs: Automatic GOMAXPROCS tuning
- Connection Pooling: Optimized database connection management
Performance Optimizations
- Batch Processing: JobLogs processor uses batching for efficiency
- Parallel Execution: Workers support configurable parallelism
- Caching: Redis caching for frequently accessed data
- Indexing: MeiliSearch for fast log search queries
Fault Tolerance
Service Resilience
- Circuit Breakers: Prevent cascading failures in gRPC calls
- Retry Logic: Automatic retry with exponential backoff
- Health Checks: Container health monitoring and restart policies
Data Resilience
- PostgreSQL: ACID transactions with point-in-time recovery
- Kafka: Replication and acknowledgment guarantees
- Redis: Persistence with AOF and RDB snapshots
Workflow Resilience
- Failure Tracking: Consecutive failure count per workflow
- Auto-Termination: Workflows terminate after max failures
- Job Retry: Failed jobs can be manually re-triggered
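The failure-tracking and auto-termination rules above can be sketched as a small state machine (a simplified illustration, not the actual Workflows service code):

```go
package main

import "fmt"

// FailureTracker terminates a workflow once it fails maxFailures
// times in a row; any success resets the count.
type FailureTracker struct {
	maxFailures int
	consecutive int
	terminated  bool
}

// Record observes one job outcome and reports whether the workflow
// has been auto-terminated.
func (t *FailureTracker) Record(succeeded bool) bool {
	if succeeded {
		t.consecutive = 0
		return t.terminated
	}
	t.consecutive++
	if t.consecutive >= t.maxFailures {
		t.terminated = true
	}
	return t.terminated
}

func main() {
	t := &FailureTracker{maxFailures: 3}
	for _, ok := range []bool{false, false, true, false, false, false} {
		if t.Record(ok) {
			// → workflow terminated after 3 consecutive failures
			fmt.Println("workflow terminated after", t.consecutive, "consecutive failures")
		}
	}
}
```

Counting only consecutive failures distinguishes a persistently broken workflow from one that fails intermittently.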
Next Steps
Workflows
Learn about workflow types and lifecycle
Jobs
Understand job scheduling and execution
Workers
Deep dive into worker components
Deployment
Deploy Chronoverse to your infrastructure