High Availability

High availability (HA) ensures that UTMStack continues to operate even when individual components fail. This guide covers redundancy, failover, and disaster recovery strategies.

High Availability Overview

Availability Targets

Tier	Uptime %	Downtime/Year	Configuration
Standard	99.0%	3.65 days	Single node with backups
High	99.9%	8.76 hours	Multi-node with replication
Very High	99.99%	52.56 minutes	Full HA cluster with redundancy
Mission Critical	99.999%	5.26 minutes	Geo-redundant, active-active

Failure Scenarios

Component Failures:

Backend API server crash
Database server failure
Elasticsearch node failure
Network connectivity loss
Disk failure
Agent communication loss

Recovery Objectives:

RTO (Recovery Time Objective): How quickly service is restored
RPO (Recovery Point Objective): How much data loss is acceptable

Typical targets:

RTO: < 5 minutes
RPO: < 1 minute

High Availability Architecture

Load Balancer Configuration

HAProxy with Keepalived

HAProxy Configuration (/etc/haproxy/haproxy.cfg):

global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin
    stats timeout 30s
    user haproxy
    group haproxy
    daemon
    
    # SSL
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256
    ssl-default-bind-options ssl-min-ver TLSv1.2

defaults
    log     global
    mode    http
    option  httplog
    option  dontlognull
    option  http-server-close
    option  forwardfor except 127.0.0.0/8
    option  redispatch
    retries 3
    timeout connect 5000
    timeout client  50000
    timeout server  50000

# Stats page
listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 30s
    stats admin if TRUE

# Frontend for HTTPS
frontend https_frontend
    bind *:443 ssl crt /etc/ssl/utmstack.pem
    
    # Security headers
    http-response set-header Strict-Transport-Security "max-age=31536000; includeSubDomains"
    http-response set-header X-Frame-Options "SAMEORIGIN"
    http-response set-header X-Content-Type-Options "nosniff"
    
    # ACLs
    acl is_api path_beg /api
    acl is_websocket path_beg /websocket
    
    # Routing
    use_backend api_backend if is_api
    use_backend websocket_backend if is_websocket
    default_backend web_backend

# Backend API servers
backend api_backend
    balance leastconn
    option httpchk GET /actuator/health
    http-check expect status 200
    
    server app1 10.0.1.11:8080 check inter 5000 fall 3 rise 2
    server app2 10.0.1.12:8080 check inter 5000 fall 3 rise 2
    server app3 10.0.1.13:8080 check inter 5000 fall 3 rise 2 backup

# WebSocket backend
backend websocket_backend
    balance leastconn
    option http-server-close
    option forceclose
    
    server app1 10.0.1.11:8080 check
    server app2 10.0.1.12:8080 check

# Frontend web servers
backend web_backend
    balance roundrobin
    
    server web1 10.0.1.21:80 check
    server web2 10.0.1.22:80 check

# gRPC backend for agents
frontend grpc_frontend
    bind *:50051
    mode tcp
    default_backend grpc_backend

backend grpc_backend
    mode tcp
    balance leastconn
    option tcp-check
    
    server app1 10.0.1.11:50051 check
    server app2 10.0.1.12:50051 check
    server app3 10.0.1.13:50051 check

Keepalived Configuration (Primary - /etc/keepalived/keepalived.conf):

vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    
    authentication {
        auth_type PASS
        auth_pass ${VRRP_PASSWORD}
    }
    
    virtual_ipaddress {
        10.0.1.100/24
    }
    
    track_script {
        check_haproxy
    }
    
    notify_master "/etc/keepalived/notify_master.sh"
    notify_backup "/etc/keepalived/notify_backup.sh"
    notify_fault "/etc/keepalived/notify_fault.sh"
}

Keepalived Configuration (Backup):

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 99  # Lower than primary
    # ... rest same as primary
}

Database High Availability

PostgreSQL Streaming Replication

Primary Server Configuration (postgresql.conf):

# Replication
listen_addresses = '*'
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
wal_keep_size = 1GB
archive_mode = on
archive_command = 'rsync -a %p /archive/%f'

# Synchronous replication (optional, for zero data loss)
synchronous_commit = on
synchronous_standby_names = 'standby1,standby2'

Primary Server (pg_hba.conf):

# Replication connections
host replication replicator 10.0.1.0/24 md5

Standby Server Configuration (postgresql.conf):

hot_standby = on
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 10s
hot_standby_feedback = on

Standby Server (standby.signal file):

primary_conninfo = 'host=10.0.1.11 port=5432 user=replicator password=${REPLICATION_PASSWORD} application_name=standby1'
primary_slot_name = 'standby1'
restore_command = 'cp /archive/%f %p'

Setup Replication:

# On standby server
sudo systemctl stop postgresql
sudo rm -rf /var/lib/postgresql/14/main/*
sudo -u postgres pg_basebackup -h 10.0.1.11 -D /var/lib/postgresql/14/main -U replicator -P -v -R -X stream -C -S standby1
sudo systemctl start postgresql

Monitor Replication:

-- On primary
SELECT 
    client_addr,
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;

-- On standby
SELECT pg_is_in_recovery();
SELECT pg_last_wal_receive_lsn();
SELECT pg_last_wal_replay_lsn();

Automatic Failover with Patroni

Patroni Configuration (/etc/patroni/patroni.yml):

scope: utmstack-cluster
namespace: /service/
name: postgresql1

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008

etcd:
  hosts:
    - 10.0.1.51:2379
    - 10.0.1.52:2379
    - 10.0.1.53:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 100
        shared_buffers: 8GB
        effective_cache_size: 24GB

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432
  data_dir: /var/lib/postgresql/14/main
  pgpass: /var/lib/postgresql/.pgpass
  authentication:
    replication:
      username: replicator
      password: ${REPLICATION_PASSWORD}
    superuser:
      username: postgres
      password: ${POSTGRES_PASSWORD}
  parameters:
    unix_socket_directories: '/var/run/postgresql'

Elasticsearch High Availability

Cluster Configuration

3-Node Cluster Setup: Master Node (elasticsearch.yml):

cluster.name: utmstack
node.name: es-master
node.roles: [master, data, ingest]

network.host: 0.0.0.0
http.port: 9200

discovery.seed_hosts:
  - es-master
  - es-data1
  - es-data2

cluster.initial_master_nodes:
  - es-master

# Quorum
gateway.recover_after_nodes: 2
gateway.expected_nodes: 3
gateway.recover_after_time: 5m

# Split-brain prevention
discovery.zen.minimum_master_nodes: 2

Shard Allocation Awareness:

# Enable rack awareness
node.attr.rack: rack1  # Different for each node
cluster.routing.allocation.awareness.attributes: rack

Snapshot and Restore:

# Create repository
PUT /_snapshot/backup_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/elasticsearch",
    "compress": true,
    "max_snapshot_bytes_per_sec": "100mb",
    "max_restore_bytes_per_sec": "100mb"
  }
}

# Create snapshot policy
PUT /_slm/policy/daily_snapshots
{
  "schedule": "0 0 1 * * ?",
  "name": "<daily-snap-{now/d}>",
  "repository": "backup_repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true,
    "include_global_state": false
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 7,
    "max_count": 30
  }
}

# Restore from snapshot
POST /_snapshot/backup_repo/daily-snap-2026-03-03/_restore
{
  "indices": "logs-2026.03.03",
  "ignore_unavailable": true,
  "include_global_state": false
}

Redis High Availability

Redis Sentinel

Sentinel Configuration (/etc/redis/sentinel.conf):

port 26379
sentinel monitor utmstack-redis 10.0.1.31 6379 2
sentinel auth-pass utmstack-redis ${REDIS_PASSWORD}
sentinel down-after-milliseconds utmstack-redis 5000
sentinel parallel-syncs utmstack-redis 1
sentinel failover-timeout utmstack-redis 10000

# Notification scripts
sentinel notification-script utmstack-redis /etc/redis/notify.sh
sentinel client-reconfig-script utmstack-redis /etc/redis/reconfig.sh

Application Configuration:

spring:
  redis:
    sentinel:
      master: utmstack-redis
      nodes:
        - 10.0.1.51:26379
        - 10.0.1.52:26379
        - 10.0.1.53:26379
    password: ${REDIS_PASSWORD}
    lettuce:
      pool:
        max-active: 20
        max-idle: 10

Application Layer HA

Stateless Backend Services

Docker Swarm (alternative to manual setup):

# docker-compose.yml
version: '3.8'

services:
  backend:
    image: utmstack/backend:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
      placement:
        max_replicas_per_node: 1
    environment:
      - SPRING_DATASOURCE_URL=jdbc:postgresql://pg-vip:5432/utmstack
      - SPRING_REDIS_SENTINEL_MASTER=utmstack-redis
    networks:
      - utmstack

networks:
  utmstack:
    driver: overlay

Health Checks

Spring Boot Actuator:

@Component
public class CustomHealthIndicator implements HealthIndicator {
    @Override
    public Health health() {
        // Check critical dependencies
        if (!checkDatabase() || !checkElasticsearch() || !checkRedis()) {
            return Health.down()
                .withDetail("database", checkDatabase())
                .withDetail("elasticsearch", checkElasticsearch())
                .withDetail("redis", checkRedis())
                .build();
        }
        return Health.up().build();
    }
}

Liveness and Readiness Probes (Kubernetes):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: utmstack-backend
    image: utmstack/backend:latest
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 5
      failureThreshold: 3

Disaster Recovery

Backup Strategy

Automated Backup Script:

#!/bin/bash
# /usr/local/bin/utmstack-backup.sh

BACKUP_DIR="/backup/utmstack"
DATE=$(date +%Y%m%d_%H%M%S)

# PostgreSQL backup
pg_dump -Fc utmstack > "${BACKUP_DIR}/postgres_${DATE}.dump"

# Elasticsearch snapshot
curl -X PUT "http://localhost:9200/_snapshot/backup_repo/snapshot_${DATE}?wait_for_completion=true"

# Configuration backup
tar -czf "${BACKUP_DIR}/config_${DATE}.tar.gz" /etc/utmstack

# Redis backup
cp /var/lib/redis/dump.rdb "${BACKUP_DIR}/redis_${DATE}.rdb"

# Retention (keep last 7 days)
find ${BACKUP_DIR} -name "*.dump" -mtime +7 -delete
find ${BACKUP_DIR} -name "*.tar.gz" -mtime +7 -delete
find ${BACKUP_DIR} -name "*.rdb" -mtime +7 -delete

# Upload to remote storage
rclone sync ${BACKUP_DIR} remote:utmstack-backups

Cron Schedule:

# Daily backup at 2 AM
0 2 * * * /usr/local/bin/utmstack-backup.sh

Recovery Procedures

PostgreSQL Recovery:

# Stop application
systemctl stop utmstack-backend

# Restore database
pg_restore -d utmstack /backup/postgres_20260303_020000.dump

# Start application
systemctl start utmstack-backend

Elasticsearch Recovery:

# Close indices
curl -X POST "localhost:9200/logs-*/_close"

# Restore snapshot
curl -X POST "localhost:9200/_snapshot/backup_repo/snapshot_20260303_020000/_restore"

# Wait for completion
curl "localhost:9200/_cat/recovery?v"

Monitoring HA Status

Grafana Dashboard Alerts:

apiVersion: 1
groups:
  - name: utmstack-ha
    interval: 1m
    rules:
      - alert: DatabaseReplicationLag
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag"
          
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Elasticsearch cluster is RED"
          
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis is down"

Next Steps

Horizontal Scaling

Scale for higher capacity

Performance Tuning

Optimize for best performance

System Architecture

Review complete architecture

Data Storage

Understand storage design

Overview

Components

Scalability

High Availability

High Availability Overview

Availability Targets

Failure Scenarios

High Availability Architecture

Load Balancer Configuration

HAProxy with Keepalived

Database High Availability

PostgreSQL Streaming Replication

Automatic Failover with Patroni

Elasticsearch High Availability

Cluster Configuration

Redis High Availability

Redis Sentinel

Application Layer HA

Stateless Backend Services

Health Checks

Disaster Recovery

Backup Strategy

Recovery Procedures

Monitoring HA Status

Next Steps

Horizontal Scaling

Performance Tuning

System Architecture

Data Storage

Build docs developers (and LLMs) love

Overview

Components

Scalability

Documentation Index

​High Availability Overview

​Availability Targets

​Failure Scenarios

​High Availability Architecture

​Load Balancer Configuration

​HAProxy with Keepalived

​Database High Availability

​PostgreSQL Streaming Replication

​Automatic Failover with Patroni

​Elasticsearch High Availability

​Cluster Configuration

​Redis High Availability

​Redis Sentinel

​Application Layer HA

​Stateless Backend Services

​Health Checks

​Disaster Recovery

​Backup Strategy

​Recovery Procedures

​Monitoring HA Status

​Next Steps

Horizontal Scaling

Performance Tuning

System Architecture

Data Storage

Build docs developers (and LLMs) love

High Availability Overview

Availability Targets

Failure Scenarios

High Availability Architecture

Load Balancer Configuration

HAProxy with Keepalived

Database High Availability

PostgreSQL Streaming Replication

Automatic Failover with Patroni

Elasticsearch High Availability

Cluster Configuration

Redis High Availability

Redis Sentinel

Application Layer HA

Stateless Backend Services

Health Checks

Disaster Recovery

Backup Strategy

Recovery Procedures

Monitoring HA Status

Next Steps