Overview
Better Uptime is a modern uptime monitoring platform built for performance, reliability, and scale. The system follows a distributed, event-driven architecture with a clear separation of concerns across multiple services.
Architecture Diagram
View the full, interactive architecture diagram on Excalidraw
Core Components
Client (Frontend)
The Next.js-based web application provides the user interface for:
- Managing monitored websites
- Viewing real-time uptime status
- Analyzing historical metrics and trends
- Configuring status pages
- Server-side rendering for optimal performance
- tRPC client for type-safe API communication
- Real-time updates via WebSocket connections
- Responsive UI built with React and Tailwind CSS
Server (API Layer)
The tRPC server acts as the central API gateway:
- Exposes type-safe RPC endpoints
- Handles authentication and authorization
- Manages WebSocket connections for real-time updates
- Orchestrates data access across PostgreSQL, ClickHouse, and Redis
The API surface is organized into routers:
- userRouter - User authentication and profile management
- websiteRouter - Website monitoring configuration
- statusPageRouter - Public status page management
- statusDomainRouter - Custom domain configuration
Publisher
The publisher service continuously enqueues monitoring tasks:
- Queries PostgreSQL for active websites (every 3 minutes)
- Publishes website check tasks to Redis Streams
- Ensures only active websites are monitored
- Implements in-flight protection to prevent overlapping cycles
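The in-flight protection above can be sketched as a simple guard flag: if a publishing cycle is still running when the next tick fires, the new cycle is skipped. This is an illustrative TypeScript sketch; `publishCycle` and the flag name are assumptions, not taken from the codebase.

```typescript
// Minimal sketch of in-flight protection: a cycle is skipped entirely if the
// previous one has not finished, so cycles never overlap.
let inFlight = false;

async function publishCycle(enqueue: () => Promise<void>): Promise<boolean> {
  if (inFlight) return false; // previous cycle still running: skip this tick
  inFlight = true;
  try {
    await enqueue(); // e.g. query active websites and publish them to the stream
    return true;
  } finally {
    inFlight = false;
  }
}
```

Skipping (rather than queueing) the overlapping cycle is the safer choice here, because the next 3-minute interval re-enqueues all active websites anyway.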
apps/publisher/src/index.ts:6
Worker
Worker instances perform the actual uptime checks:
- Consume website check tasks from Redis Streams
- Execute HTTP requests to monitored URLs
- Record metrics (response time, status code, availability)
- Write results to ClickHouse for long-term storage
- Implement graceful error handling and PEL (Pending Entries List) management
- Consumer group support for horizontal scaling
- Automatic reclaim of stale messages
- Self-liveness monitoring and auto-recovery
- Timeout protection on all external operations
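The timeout protection above can be sketched with `Promise.race`: the external operation races a timer, and whichever settles first wins. This is a minimal sketch; the actual helper in the codebase may differ.

```typescript
// Client-side timeout guard: rejects if the wrapped operation does not settle
// within the given window, and always clears the timer afterwards.
async function withTimeout<T>(op: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`operation timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([op, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaking the timer when the operation wins
  }
}
```

Note that this only stops the worker from waiting; for HTTP checks specifically, an `AbortController` would also cancel the underlying request.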
apps/worker/src/index.ts:123
Component Interactions
Data Flow
The system processes monitoring data through three distinct phases:
1. Task Publishing
The publisher service maintains a continuous publishing cycle:
- Query PostgreSQL for all active websites
- Bulk publish to the Redis Stream using XADD
- Trim the stream to prevent unbounded growth (~8000 messages max)
- Wait for next interval (3 minutes)
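The publish-then-trim step can be illustrated with a small in-memory analogue of a capped stream (the real service uses Redis XADD with MAXLEN trimming; the 8000 cap comes from the text above, while the function name is ours):

```typescript
// In-memory analogue of publishing to a capped stream: append a task, then
// drop the oldest entries once the stream exceeds the cap.
const MAX_STREAM_LENGTH = 8000;

function publishTask<T>(stream: T[], task: T, maxLen = MAX_STREAM_LENGTH): T[] {
  stream.push(task);
  if (stream.length > maxLen) {
    stream.splice(0, stream.length - maxLen); // trim oldest entries first
  }
  return stream;
}
```

Trimming to an approximate maximum length keeps memory bounded even if workers fall behind; tasks dropped from the tail are simply re-published on the next cycle.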
2. Task Processing
Worker instances consume and process tasks:
- Fresh messages: Read new tasks with XREADGROUP (blocking, 1 second)
- Validation: Check whether the website is still active via Prisma
- HTTP check: Execute the request with a configurable timeout (10 seconds)
- Metrics recording: Batch insert into ClickHouse
- Acknowledgment: Remove from the Redis PEL with XACK
- PEL reclaim: Automatically reclaim stale messages (idle > 5 minutes)
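Because failed checks are simply re-enqueued on the next publishing cycle, the per-message flow reduces to "run the check, then always ACK". A hedged sketch, with illustrative names:

```typescript
// Sketch of a worker's per-message flow: run the check, record the outcome,
// and ACK even on failure so the Pending Entries List (PEL) does not grow.
// A failed check is not retried here; the next publishing cycle re-enqueues it.
async function handleMessage(
  check: () => Promise<void>,
  ack: () => Promise<void>,
): Promise<"ok" | "failed"> {
  let outcome: "ok" | "failed" = "ok";
  try {
    await check();
  } catch {
    outcome = "failed";
  }
  await ack();
  return outcome;
}
```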
3. Data Retrieval
The server aggregates data from multiple sources:
- Configuration data: PostgreSQL (website settings, user data)
- Metrics data: ClickHouse (time-series uptime events)
- Real-time status: Redis (current processing state)
- WebSocket push: Live updates to connected clients
Scalability
Horizontal Scaling
- Worker instances: Multiple workers can join the same Redis consumer group
- Region support: Workers can be deployed across different geographic regions
- Independent scaling: Each component (client, server, worker, publisher) scales independently
Data Partitioning
- ClickHouse: Ordered by (website_id, region_id, checked_at) for efficient queries
- Redis Streams: Consumer groups distribute work across workers
- PostgreSQL: Indexed for fast website lookups and user queries
Reliability
Fault Tolerance
- Redis reconnection: Automatic exponential backoff with jitter
- PEL management: Stale messages are automatically reclaimed
- Worker watchdog: Self-monitoring with automatic restart on freeze
- Timeout protection: All external operations have client-side timeouts
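Exponential backoff with jitter, as used for Redis reconnection above, can be sketched as follows. The base and cap values are illustrative assumptions, not taken from the codebase:

```typescript
// "Full jitter" backoff: the ceiling grows exponentially with the attempt
// number, and the actual delay is randomized over [0, ceiling) so that many
// clients reconnecting at once do not retry in lockstep.
function backoffDelay(attempt: number, baseMs = 100, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```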
Data Consistency
- At-least-once delivery: Redis Streams with consumer groups
- Idempotent processing: Safe to process the same check multiple times
- ACK safety: Messages are ACKed even on processing failures to prevent PEL growth
- Publisher re-enqueue: Failed checks are retried on the next publishing cycle
Next Steps
Technology Stack
Explore the technologies powering Better Uptime
Monorepo Structure
Understand the codebase organization
Data Flow
Deep dive into the data pipeline