Overview
Chronoverse implements a message-driven microservices architecture designed for reliability, scalability, and fault tolerance. The system uses a dual communication approach to balance responsiveness and reliability:
- Kafka: For reliable, asynchronous processing and event-driven workflows
- gRPC: For efficient, low-latency synchronous service-to-service communication
System Components
The architecture consists of three main layers: infrastructure, services, and workers.
Infrastructure Layer
PostgreSQL
Stores transactional data including workflows, jobs, users, and notifications
ClickHouse
High-performance analytics database for job logs and metrics
Redis
In-memory cache for session management and real-time data
Kafka
Message broker for asynchronous event processing
MeiliSearch
Fast search engine for job logs and workflow queries
Docker
Container runtime for executing workflow jobs
Service Layer
The service layer exposes gRPC APIs for synchronous operations:
Server (HTTP Gateway)
HTTP API gateway that exposes RESTful endpoints to external clients. Handles:
- Authentication and authorization middleware
- Request routing to internal gRPC services
- Session management via Redis
- Port: 8080
Users Service
Manages user accounts and authentication. Handles:
- User registration and login
- JWT token generation and validation
- Notification preferences
- Port: 50051
Workflows Service
Manages workflow definitions and configuration. Handles:
- Workflow CRUD operations
- Build status tracking
- Consecutive failure counting
- Workflow termination
- Port: 50052
Jobs Service
Manages job lifecycle from scheduling through completion. Handles:
- Job scheduling and status updates
- Job log retrieval and streaming
- Job history and filtering
- Port: 50053
Notifications Service
Provides real-time alerts and status updates. Handles:
- Server-Sent Events (SSE) for real-time notifications
- Workflow and job state change notifications
- Port: 50054
Analytics Service
Provides insights into performance and trends. Handles:
- Job and workflow metrics
- Performance analytics
- Trend analysis
- Port: 50055
Worker Layer
Workers consume messages from Kafka topics and perform background processing:
Scheduling Worker
Identifies jobs due for execution based on their schedules. It:
- Polls PostgreSQL for workflows ready to execute
- Creates job entries in the database
- Publishes job events to Kafka for processing
- Supports automatic retry and error handling
Workflow Worker
Prepares workflow execution environments. It:
- Consumes workflow events from Kafka
- Builds Docker image configurations from workflow definitions
- Prepares execution templates for CONTAINER workflows
- Validates workflow payloads
- Updates workflow build status
Execution Worker
Executes scheduled jobs in isolated containers. It:
- Consumes job events from Kafka
- Executes HEARTBEAT and CONTAINER workflow types
- Manages job lifecycle (start, monitor, complete)
- Captures stdout/stderr logs
- Publishes logs to Kafka for persistence
JobLogs Processor
Persists execution logs to long-term storage. It:
- Consumes log events from Kafka
- Performs efficient batch insertion to ClickHouse
- Indexes logs in MeiliSearch for fast searching
- Optimizes storage and querying performance
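The batching step can be sketched as a size-bounded buffer that flushes when full; the flush callback below is a stand-in for the real ClickHouse batch insert:

```go
package main

import "fmt"

// LogBatcher accumulates log lines and flushes them in fixed-size
// batches, trading a little latency for far fewer database round trips.
type LogBatcher struct {
	size  int
	buf   []string
	flush func(batch []string)
}

func NewLogBatcher(size int, flush func([]string)) *LogBatcher {
	return &LogBatcher{size: size, flush: flush}
}

// Add buffers one line and flushes when the batch is full.
func (b *LogBatcher) Add(line string) {
	b.buf = append(b.buf, line)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush writes any buffered lines and resets the buffer.
func (b *LogBatcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	batches := 0
	b := NewLogBatcher(100, func(batch []string) {
		batches++ // stand-in for a single ClickHouse INSERT of the whole batch
	})
	for i := 0; i < 250; i++ {
		b.Add(fmt.Sprintf("log line %d", i))
	}
	b.Flush() // drain the 50-line remainder on shutdown
	fmt.Println("batches:", batches) // → batches: 3
}
```

A production batcher would also flush on a timer so sparse logs are not held indefinitely.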
Analytics Processor
Generates analytics data from job and workflow events. It:
- Consumes events from Kafka
- Aggregates metrics and performance data
- Stores results in PostgreSQL for querying
- Enables trend analysis and reporting
Communication Patterns
Synchronous Communication (gRPC)
Services communicate via gRPC for low-latency, request-response operations. All gRPC connections support mTLS encryption and include circuit breakers and retry logic for resilience.
Asynchronous Communication (Kafka)
Workers use Kafka topics for reliable, event-driven processing. Kafka topics use SSL/TLS encryption and support consumer groups for horizontal scaling and fault tolerance.
Security Architecture
Authentication & Authorization
- JWT Tokens: EdDSA (Ed25519) signatures for token generation and validation
- Role-Based Access: Admin and user roles with different permission levels
- Session Management: Redis-backed sessions with configurable TTL
Transport Security
- mTLS: Mutual TLS authentication between all services
- TLS 1.2/1.3: Encrypted communication for all protocols
- Certificate Management: Automated certificate generation and rotation
Network Isolation
- Docker Network: Internal bridge network isolates services
- Port Exposure: Minimal external port exposure in production
- Proxy Access: Docker socket access via security proxy
Deployment Architecture
Development Environment
Production Environment
Observability
OpenTelemetry Integration
All services and workers export telemetry data:
- Traces: Distributed tracing across service boundaries
- Metrics: Performance counters and resource utilization
- Logs: Structured logging with context propagation
LGTM Stack
Grafana’s LGTM stack (Loki, Grafana, Tempo, Mimir) provides:
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing backend
- Loki: Log aggregation and querying
- Mimir: Metrics storage and querying
OTEL data is exported via gRPC to the LGTM collector on port 4317.
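This endpoint is typically wired up through the standard OTLP exporter environment variables; a sketch (the `lgtm` host name and service name are illustrative):

```ini
OTEL_SERVICE_NAME=jobs-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://lgtm:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```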
Scalability
Horizontal Scaling
- Stateless Services: All services can scale horizontally
- Kafka Consumer Groups: Workers scale by adding more instances
- Load Balancing: gRPC supports client-side load balancing
Vertical Scaling
- Auto Memory Limit: Automatic memory management based on container limits
- Auto Max Procs: Automatic GOMAXPROCS tuning
- Connection Pooling: Optimized database connection management
Performance Optimizations
- Batch Processing: JobLogs processor uses batching for efficiency
- Parallel Execution: Workers support configurable parallelism
- Caching: Redis caching for frequently accessed data
- Indexing: MeiliSearch for fast log search queries
Fault Tolerance
Service Resilience
- Circuit Breakers: Prevent cascading failures in gRPC calls
- Retry Logic: Automatic retry with exponential backoff
- Health Checks: Container health monitoring and restart policies
Data Resilience
- PostgreSQL: ACID transactions with point-in-time recovery
- Kafka: Replication and acknowledgment guarantees
- Redis: Persistence with AOF and RDB snapshots
Workflow Resilience
- Failure Tracking: Consecutive failure count per workflow
- Auto-Termination: Workflows terminate after max failures
- Job Retry: Failed jobs can be manually re-triggered
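The failure-tracking and auto-termination rules above can be sketched as a small state machine (a simplified illustration, not the actual Workflows service code):

```go
package main

import "fmt"

// FailureTracker terminates a workflow once it fails maxFailures
// times in a row; any success resets the count.
type FailureTracker struct {
	maxFailures int
	consecutive int
	terminated  bool
}

// Record observes one job outcome and reports whether the workflow
// has been auto-terminated.
func (t *FailureTracker) Record(succeeded bool) bool {
	if succeeded {
		t.consecutive = 0
		return t.terminated
	}
	t.consecutive++
	if t.consecutive >= t.maxFailures {
		t.terminated = true
	}
	return t.terminated
}

func main() {
	t := &FailureTracker{maxFailures: 3}
	for _, ok := range []bool{false, false, true, false, false, false} {
		if t.Record(ok) {
			// → workflow terminated after 3 consecutive failures
			fmt.Println("workflow terminated after", t.consecutive, "consecutive failures")
		}
	}
}
```

Counting only consecutive failures distinguishes a persistently broken workflow from one that fails intermittently.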
Next Steps
Workflows
Learn about workflow types and lifecycle
Jobs
Understand job scheduling and execution
Workers
Deep dive into worker components
Deployment
Deploy Chronoverse to your infrastructure