Overview
Better Uptime is a modern uptime monitoring platform built for performance, reliability, and scale. The system follows a distributed, event-driven architecture with a clear separation of concerns across multiple services.
Architecture Diagram
View the full, interactive architecture diagram on Excalidraw
Core Components
Client (Frontend)
The Next.js-based web application provides the user interface for:
- Managing monitored websites
- Viewing real-time uptime status
- Analyzing historical metrics and trends
- Configuring status pages
- Server-side rendering for optimal performance
- tRPC client for type-safe API communication
- Real-time updates via WebSocket connections
- Responsive UI built with React and Tailwind CSS
Server (API Layer)
The tRPC server acts as the central API gateway:
- Exposes type-safe RPC endpoints
- Handles authentication and authorization
- Manages WebSocket connections for real-time updates
- Orchestrates data access across PostgreSQL, ClickHouse, and Redis
The API surface is organized into routers:
- userRouter - User authentication and profile management
- websiteRouter - Website monitoring configuration
- statusPageRouter - Public status page management
- statusDomainRouter - Custom domain configuration
Publisher
The publisher service continuously enqueues monitoring tasks:
- Queries PostgreSQL for active websites (every 3 minutes)
- Publishes website check tasks to Redis Streams
- Ensures only active websites are monitored
- Implements in-flight protection to prevent overlapping cycles
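The in-flight protection above can be sketched as a simple guard flag: if a publishing cycle is still running when the next tick fires, the new cycle is skipped. This is an illustrative TypeScript sketch; `publishCycle` and the flag name are assumptions, not taken from the codebase.

```typescript
// Minimal sketch of in-flight protection: a cycle is skipped entirely if the
// previous one has not finished, so cycles never overlap.
let inFlight = false;

async function publishCycle(enqueue: () => Promise<void>): Promise<boolean> {
  if (inFlight) return false; // previous cycle still running: skip this tick
  inFlight = true;
  try {
    await enqueue(); // e.g. query active websites and publish them to the stream
    return true;
  } finally {
    inFlight = false;
  }
}
```

Skipping (rather than queueing) the overlapping cycle is the safer choice here, because the next 3-minute interval re-enqueues all active websites anyway.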
apps/publisher/src/index.ts:6
Worker
Worker instances perform the actual uptime checks:
- Consume website check tasks from Redis Streams
- Execute HTTP requests to monitored URLs
- Record metrics (response time, status code, availability)
- Write results to ClickHouse for long-term storage
- Implement graceful error handling and PEL (Pending Entries List) management
- Consumer group support for horizontal scaling
- Automatic reclaim of stale messages
- Self-liveness monitoring and auto-recovery
- Timeout protection on all external operations
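The timeout protection above can be sketched with `Promise.race`: the external operation races a timer, and whichever settles first wins. This is a minimal sketch; the actual helper in the codebase may differ.

```typescript
// Client-side timeout guard: rejects if the wrapped operation does not settle
// within the given window, and always clears the timer afterwards.
async function withTimeout<T>(op: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`operation timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([op, timeout]);
  } finally {
    clearTimeout(timer); // avoid leaking the timer when the operation wins
  }
}
```

Note that this only stops the worker from waiting; for HTTP checks specifically, an `AbortController` would also cancel the underlying request.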
apps/worker/src/index.ts:123
Component Interactions
Data Flow
The system processes monitoring data through three distinct phases:
1. Task Publishing
The publisher service maintains a continuous publishing cycle:
- Query PostgreSQL for all active websites
- Bulk publish to the Redis Stream using XADD
- Trim the stream to prevent unbounded growth (~8000 messages max)
- Wait for next interval (3 minutes)
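The publish-then-trim step can be illustrated with a small in-memory analogue of a capped stream (the real service uses Redis XADD with MAXLEN trimming; the 8000 cap comes from the text above, while the function name is ours):

```typescript
// In-memory analogue of publishing to a capped stream: append a task, then
// drop the oldest entries once the stream exceeds the cap.
const MAX_STREAM_LENGTH = 8000;

function publishTask<T>(stream: T[], task: T, maxLen = MAX_STREAM_LENGTH): T[] {
  stream.push(task);
  if (stream.length > maxLen) {
    stream.splice(0, stream.length - maxLen); // trim oldest entries first
  }
  return stream;
}
```

Trimming to an approximate maximum length keeps memory bounded even if workers fall behind; tasks dropped from the tail are simply re-published on the next cycle.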
2. Task Processing
Worker instances consume and process tasks:
- Fresh messages: Read new tasks with XREADGROUP (blocking, 1 second)
- Validation: Check whether the website is still active via Prisma
- HTTP check: Execute the request with a configurable timeout (10 seconds)
- Metrics recording: Batch insert into ClickHouse
- Acknowledgment: Remove from the Redis PEL with XACK
- PEL reclaim: Automatically reclaim stale messages (idle > 5 minutes)
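Because failed checks are simply re-enqueued on the next publishing cycle, the per-message flow reduces to "run the check, then always ACK". A hedged sketch, with illustrative names:

```typescript
// Sketch of a worker's per-message flow: run the check, record the outcome,
// and ACK even on failure so the Pending Entries List (PEL) does not grow.
// A failed check is not retried here; the next publishing cycle re-enqueues it.
async function handleMessage(
  check: () => Promise<void>,
  ack: () => Promise<void>,
): Promise<"ok" | "failed"> {
  let outcome: "ok" | "failed" = "ok";
  try {
    await check();
  } catch {
    outcome = "failed";
  }
  await ack();
  return outcome;
}
```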
3. Data Retrieval
The server aggregates data from multiple sources:
- Configuration data: PostgreSQL (website settings, user data)
- Metrics data: ClickHouse (time-series uptime events)
- Real-time status: Redis (current processing state)
- WebSocket push: Live updates to connected clients
Scalability
Horizontal Scaling
- Worker instances: Multiple workers can join the same Redis consumer group
- Region support: Workers can be deployed across different geographic regions
- Independent scaling: Each component (client, server, worker, publisher) scales independently
Data Partitioning
- ClickHouse: Ordered by (website_id, region_id, checked_at) for efficient queries
- Redis Streams: Consumer groups distribute work across workers
- PostgreSQL: Indexed for fast website lookups and user queries
Reliability
Fault Tolerance
- Redis reconnection: Automatic exponential backoff with jitter
- PEL management: Stale messages are automatically reclaimed
- Worker watchdog: Self-monitoring with automatic restart on freeze
- Timeout protection: All external operations have client-side timeouts
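Exponential backoff with jitter, as used for Redis reconnection above, can be sketched as follows. The base and cap values are illustrative assumptions, not taken from the codebase:

```typescript
// "Full jitter" backoff: the ceiling grows exponentially with the attempt
// number, and the actual delay is randomized over [0, ceiling) so that many
// clients reconnecting at once do not retry in lockstep.
function backoffDelay(attempt: number, baseMs = 100, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```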
Data Consistency
- At-least-once delivery: Redis Streams with consumer groups
- Idempotent processing: Safe to process the same check multiple times
- ACK safety: Messages are ACKed even on processing failures to prevent PEL growth
- Publisher re-enqueue: Failed checks are retried on the next publishing cycle
Next Steps
Technology Stack
Explore the technologies powering Better Uptime
Monorepo Structure
Understand the codebase organization
Data Flow
Deep dive into the data pipeline