Overview
Better Uptime implements a distributed, event-driven data pipeline for monitoring website uptime. Data flows through multiple stages: task publishing, distributed processing, metrics storage, and real-time updates.
Pipeline Stages
Stage 1: Task Publishing
Publisher Service
The publisher runs on a fixed interval (3 minutes) and enqueues monitoring tasks. Implementation: apps/publisher/src/index.ts:6
Redis Stream Publishing
Bulk add operation: packages/streams/src/index.ts:166
Key characteristics:
- Batch size: 250 messages per Redis transaction
- Stream trimming: Maintains ~8000 messages maximum
- Idempotent: Safe to run multiple times
- Fast: Completes in milliseconds for thousands of websites
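The batched publishing described above can be sketched as follows. The stream key, task shape, and helper names are assumptions, and a minimal interface stands in for the ioredis pipeline the real `packages/streams` code would use; only the batch size (250) and trim target (~8000) come from the doc.

```typescript
const STREAM_KEY = "betteruptime:checks"; // hypothetical stream name
const BATCH_SIZE = 250;                   // messages per Redis transaction
const MAX_STREAM_LEN = 8000;              // approximate trim target

interface WebsiteTask { id: string; url: string }

// Minimal stand-in for an ioredis pipeline so the sketch is self-contained.
interface StreamPipeline {
  xadd(...args: (string | number)[]): void;
  exec(): Promise<unknown>;
}
interface StreamClient { pipeline(): StreamPipeline }

// Split tasks into fixed-size batches of BATCH_SIZE.
export function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

export async function publishTasks(redis: StreamClient, tasks: WebsiteTask[]): Promise<void> {
  for (const batch of chunk(tasks, BATCH_SIZE)) {
    const pipe = redis.pipeline();
    for (const t of batch) {
      // MAXLEN ~ trims the stream approximately, keeping it near 8000 entries.
      pipe.xadd(STREAM_KEY, "MAXLEN", "~", MAX_STREAM_LEN, "*", "id", t.id, "url", t.url);
    }
    await pipe.exec(); // one round trip per batch of 250
  }
}
```

Because every cycle simply re-enqueues all active websites, rerunning the publisher is safe: duplicates are bounded by the stream trim and workers treat each message independently.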
Publishing Cycle Diagram
Stage 2: Distributed Processing
Worker Consumer Architecture
Workers use Redis consumer groups to distribute load and ensure reliability.
Consumer group setup:
Main Processing Loop
Worker implementation: apps/worker/src/index.ts:463
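The consumer-group bootstrap and read step might look like the sketch below. The group name, consumer name, and COUNT/BLOCK values are assumptions; minimal interfaces stand in for the ioredis commands the worker would issue.

```typescript
// Stand-in for the ioredis stream commands the worker would use.
interface GroupClient {
  xgroup(...args: (string | number)[]): Promise<unknown>;
  xreadgroup(...args: (string | number)[]): Promise<unknown>;
}

const GROUP = "workers";      // hypothetical consumer group name
const CONSUMER = "worker-1";  // hypothetical unique per-process consumer name

// XGROUP CREATE fails with BUSYGROUP when the group already exists;
// treating that as success makes the setup idempotent across restarts.
export function isBusyGroup(err: unknown): boolean {
  return err instanceof Error && err.message.includes("BUSYGROUP");
}

export async function ensureGroup(redis: GroupClient, stream: string): Promise<void> {
  try {
    // MKSTREAM creates the stream if it does not exist yet.
    await redis.xgroup("CREATE", stream, GROUP, "$", "MKSTREAM");
  } catch (err) {
    if (!isBusyGroup(err)) throw err;
  }
}

export async function readBatch(redis: GroupClient, stream: string): Promise<unknown> {
  // ">" requests messages never delivered to this group;
  // BLOCK 5000 waits up to 5s so the loop does not busy-spin.
  return redis.xreadgroup(
    "GROUP", GROUP, CONSUMER, "COUNT", 5, "BLOCK", 5000, "STREAMS", stream, ">",
  );
}
```

Each worker calls `ensureGroup` once at startup, then loops on `readBatch`; Redis distributes undelivered entries across consumers with no coordination between workers.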
Message Processing Pipeline
Single message flow: apps/worker/src/index.ts:313
HTTP Health Check
Check implementation: apps/worker/src/index.ts:64
Timeout strategy:
- Axios timeout: 10 seconds (via AbortController)
- Hard timeout: 12 seconds (via withTimeout wrapper)
- Fallback: Prevents indefinite hangs
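The two-layer timeout could be sketched like this. `withTimeout` is named in the doc but its body here is an assumption, and the request is shown with the built-in `fetch` rather than axios; the 10s/12s values come from the doc.

```typescript
// Race a promise against a hard deadline; rejects if the deadline wins.
export function withTimeout<T>(p: Promise<T>, ms: number, label = "operation"): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms,
    );
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Assumed usage shape: the request carries a 10s abort signal, and
// withTimeout adds a 12s hard stop in case the request itself hangs.
export async function checkOnce(url: string): Promise<Response> {
  const controller = new AbortController();
  const soft = setTimeout(() => controller.abort(), 10_000);
  try {
    return await withTimeout(fetch(url, { signal: controller.signal }), 12_000, "http check");
  } finally {
    clearTimeout(soft);
  }
}
```

The outer 12-second deadline matters because an abort signal only helps if the HTTP client honors it; the wrapper guarantees the worker's processing loop regains control either way.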
Reliability Mechanisms
Consumer Groups
Automatic load distribution:
- Horizontal scaling: Add more workers to process faster
- Fault tolerance: If a worker crashes, messages are reclaimed
- No coordination needed: Redis handles distribution
Pending Entries List (PEL)
Automatic message reclaim covers cases where:
- Worker reads a message but crashes before ACK
- Worker hangs on an HTTP request
- Network partition prevents the ACK
Reclaim behavior:
- Runs every loop iteration (maintenance, not fallback)
- Small batch size prevents starving fresh messages
- Messages reclaimed after 5 minutes idle
- Messages stuck > 1 hour are force-cleared
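The two idle thresholds above translate into a small decision function; the function and type names are assumptions, but the 5-minute reclaim and 1-hour force-clear values come from the doc (in practice the reclaim itself would be a Redis XAUTOCLAIM call).

```typescript
// Thresholds stated in the doc.
const RECLAIM_IDLE_MS = 5 * 60 * 1000;  // reclaim after 5 minutes idle
const DROP_IDLE_MS = 60 * 60 * 1000;    // force-clear after 1 hour

export type PendingAction = "wait" | "reclaim" | "drop";

// Decide what to do with a pending (unacked) entry given its idle time.
export function classifyPending(idleMs: number): PendingAction {
  if (idleMs >= DROP_IDLE_MS) return "drop";       // stuck > 1h: ACK and discard
  if (idleMs >= RECLAIM_IDLE_MS) return "reclaim"; // another worker takes it over
  return "wait";                                   // still within normal processing time
}
```

Keeping the force-clear path is what prevents a poison message (one that reliably crashes its consumer) from circulating in the PEL forever.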
Worker Watchdog
Self-liveness monitoring: apps/worker/src/index.ts:274
Recovery:
- PM2 automatically restarts crashed workers
- Exponential backoff prevents restart storms
- Messages in PEL are reclaimed by other workers
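A heartbeat-style watchdog like the one referenced above might look like this minimal sketch; the class name, threshold, and injected clock are assumptions.

```typescript
// The processing loop calls beat() each iteration; if no beat arrives
// within the threshold, stalled() reports true.
export class Watchdog {
  private lastBeat: number;

  constructor(
    private thresholdMs: number,
    private now: () => number = Date.now, // injectable clock for testing
  ) {
    this.lastBeat = this.now();
  }

  beat(): void {
    this.lastBeat = this.now();
  }

  stalled(): boolean {
    return this.now() - this.lastBeat > this.thresholdMs;
  }
}
```

In the worker this would pair with a periodic check (e.g. a `setInterval`) that calls `process.exit(1)` when `stalled()` returns true, letting PM2 restart the process and other workers reclaim its pending messages.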
Stage 3: Metrics Storage
ClickHouse Insert
Batch insert operation: packages/clickhouse/src/index.ts:154
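A minimal sketch of the batch insert, assuming JSONEachRow formatting and hypothetical table/column names; a small interface stands in for the ClickHouse client used in `packages/clickhouse`.

```typescript
// Assumed row shape for one check result.
interface CheckRow {
  website_id: string;
  checked_at: string;     // DateTime as a string
  status_code: number;
  response_time_ms: number;
  is_up: number;          // 0/1
}

// Stand-in for a ClickHouse client's insert method; the real client sends
// all rows in a single INSERT statement.
interface CHClient {
  insert(opts: { table: string; values: CheckRow[]; format: string }): Promise<unknown>;
}

// Serialize rows as newline-delimited JSON (the JSONEachRow wire format).
export function toJSONEachRow(rows: CheckRow[]): string {
  return rows.map((r) => JSON.stringify(r)).join("\n");
}

export async function insertChecks(ch: CHClient, rows: CheckRow[]): Promise<void> {
  if (rows.length === 0) return; // skip the round trip for an empty batch
  await ch.insert({ table: "website_checks", values: rows, format: "JSONEachRow" });
}
```

Batching matters for ClickHouse in particular: many small inserts create many small data parts and merge pressure, while one insert per worker batch keeps ingestion cheap.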
Schema Design
Optimized for time-series queries:
- Fast queries filtering by website_id
- Efficient range scans on checked_at
- Columnar storage compresses well
- Parallel query execution
Query Patterns
Recent events (last 90 per website): packages/clickhouse/src/index.ts:220
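The "last 90 per website" pattern maps naturally onto ClickHouse's per-group `LIMIT n BY` clause. The table and column names below are assumptions, and real code should prefer bound query parameters over string interpolation; this sketch only builds the SQL text.

```typescript
// Build the recent-events query: newest rows first, capped at `limitPer`
// rows for each website_id in a single pass.
export function recentEventsQuery(websiteIds: string[], limitPer = 90): string {
  // Naive escaping for the sketch; production code should use query params.
  const ids = websiteIds.map((id) => `'${id.replace(/'/g, "''")}'`).join(", ");
  return `
    SELECT website_id, checked_at, status_code, response_time_ms, is_up
    FROM website_checks
    WHERE website_id IN (${ids})
    ORDER BY checked_at DESC
    LIMIT ${limitPer} BY website_id
  `;
}
```

`LIMIT n BY` keeps this a single query instead of one query per website, which is what keeps dashboard latency low as the number of monitored sites grows.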
Stage 4: Real-Time Updates
WebSocket Subscriptions
The server pushes live updates to connected clients.
Data Aggregation Flow
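The subscription fan-out could be sketched as below; the hub and message shape are assumptions, with a tiny `Socket` interface standing in for a WebSocket connection.

```typescript
// Stand-in for a ws connection.
interface Socket { send(data: string): void }

export class UpdateHub {
  private subs = new Map<string, Set<Socket>>();

  subscribe(websiteId: string, sock: Socket): void {
    if (!this.subs.has(websiteId)) this.subs.set(websiteId, new Set());
    this.subs.get(websiteId)!.add(sock);
  }

  unsubscribe(websiteId: string, sock: Socket): void {
    this.subs.get(websiteId)?.delete(sock);
  }

  // Called by the ingest path after a check result is stored; returns the
  // number of clients the update was pushed to.
  publish(websiteId: string, payload: object): number {
    const socks = this.subs.get(websiteId);
    if (!socks || socks.size === 0) return 0;
    const msg = JSON.stringify({ type: "check_result", websiteId, ...payload });
    for (const s of socks) s.send(msg);
    return socks.size;
  }
}
```

Keying subscriptions by website id means the server serializes each update once per website rather than once per connected client browsing an unrelated dashboard.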
Performance Characteristics
Throughput
Publisher:
- Publishes 1000 websites in ~500ms
- Handles 10,000+ active websites easily
- Redis pipeline batching (250 per transaction)
Worker:
- Processes 5 checks concurrently
- Each check typically completes in 100-500ms
- Single worker: ~600 checks/minute (10 checks/second)
- 10 workers: ~6,000 checks/minute
ClickHouse:
- Ingests 10,000+ rows/second
- Query latency < 100ms for typical dashboards
- Handles billions of rows efficiently
Latency
End-to-end (publish to result):
- Minimum: ~1 second (Redis + HTTP + ClickHouse)
- Typical: ~2-5 seconds
- Maximum: ~30 seconds (timeout + retry)
Per-component latencies:
- Redis XREADGROUP: < 10ms (in-memory)
- HTTP check: 100-1000ms (network dependent)
- ClickHouse insert: 10-50ms (batch)
- Message ACK: < 10ms
Failure Modes & Recovery
Publisher Failure
Symptom: No new tasks in Redis Stream
Detection: Monitor stream length
Recovery:
- Publisher restarts automatically (PM2)
- Next cycle publishes all active websites
- No data loss (websites re-enqueued)
Worker Failure
Symptom: Messages stuck in PEL
Detection: PEL monitoring (every 5 minutes)
Recovery:
- Other workers reclaim stale messages (5 min idle)
- Force-clear messages stuck > 1 hour
- Worker restarts and rejoins consumer group
Redis Failure
Symptom: Connection errors
Detection: Client-side timeouts
Recovery:
- Exponential backoff reconnection
- Operations queue until reconnect
- PEL persists through Redis restart
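Exponential backoff with jitter is the usual shape for the reconnection delay mentioned above; the base and cap values here are assumptions, since the doc only states that backoff is used.

```typescript
// Capped exponential backoff with full jitter: the delay grows as
// base * 2^attempt up to a cap, then a uniform random fraction of that
// window is used so reconnecting clients do not synchronize.
export function backoffMs(
  attempt: number,
  baseMs = 250,
  capMs = 30_000,
  rand: () => number = Math.random, // injectable for testing
): number {
  const window = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * window);
}
```

With ioredis specifically, a function like this would typically be wired in via the client's `retryStrategy` option, which receives the attempt number and returns the delay in milliseconds.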
ClickHouse Failure
Symptom: Insert timeouts
Detection: Insert operation fails
Recovery:
- Messages still ACKed (prevents PEL growth)
- Failed checks retried on next publish cycle
- No permanent data loss (idempotent checks)
PostgreSQL Failure
Symptom: Website validation fails
Detection: Prisma query timeout
Recovery:
- Treat as invalid website (ACK message)
- Re-enqueued on next publish cycle
- Connection pool reconnects automatically
Next Steps
- System Architecture: Review the high-level architecture
- Technology Stack: Learn about the technologies used
- Monorepo Structure: Explore the codebase organization