Apache Druid can be configured for high availability to ensure continuous operation and minimize downtime. High availability requires proper configuration of ZooKeeper, metadata storage, Coordinators, Overlords, and Brokers.

Core Components for High Availability

ZooKeeper Cluster

3 or 5 node cluster for distributed coordination

Metadata Storage

MySQL or PostgreSQL with replication and failover

Coordinator/Overlord

Multiple servers with automatic failover

Broker Load Balancing

Multiple Brokers behind a load balancer

ZooKeeper Configuration

For a highly available ZooKeeper deployment, run an ensemble of 3 or 5 ZooKeeper nodes.
Always use an odd number of nodes: ZooKeeper requires a majority quorum, so an even-sized ensemble tolerates no more node failures than the next-smaller odd-sized one and only adds coordination overhead.
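As a minimal sketch, a three-node ensemble's zoo.cfg might look like the following (the zk1–zk3 host names and paths are assumptions for illustration):

```properties
# zoo.cfg -- identical on every node; each node additionally needs a unique
# myid file (containing 1, 2, or 3) in dataDir.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Ports 2888 and 3888 are the conventional follower-to-leader and leader-election ports.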

Deployment Options

We recommend one of the following approaches:
  1. Dedicated Hardware: Install ZooKeeper on its own dedicated hardware for best performance and isolation
  2. Co-located with Master Servers: Run 3 or 5 Master servers (where Overlords or Coordinators are running) and configure ZooKeeper on them
For detailed ZooKeeper configuration and tuning, refer to the ZooKeeper Admin Guide.

Metadata Storage Configuration

For highly-available metadata storage, use MySQL or PostgreSQL with replication and failover enabled.

Supported Databases

Configure MySQL with replication and automatic failover (for example, using MySQL Enterprise High Availability features or a managed MySQL service), or PostgreSQL with streaming replication and a failover manager.
Ensure your metadata store connection strings support the failover mechanism specific to your chosen database.
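As an illustrative sketch (the db1/db2 host names are assumptions), both supported JDBC drivers accept multi-host connection URIs that fail over between a primary and a standby:

```properties
# PostgreSQL: PgJDBC tries hosts in order; with targetServerType=primary it
# connects only to the node currently accepting writes.
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db1:5432,db2:5432/druid?targetServerType=primary

# MySQL: Connector/J supports the same multi-host failover pattern.
# druid.metadata.storage.connector.connectURI=jdbc:mysql://db1:3306,db2:3306/druid
```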

Coordinator and Overlord Failover

Druid Coordinators and Overlords support automatic failover when multiple instances are running.

Configuration Requirements

All Coordinator and Overlord instances must:
  • Use the same ZooKeeper cluster
  • Connect to the same metadata storage
  • Have identical configuration for service discovery

Failover Behavior

  1. Leader Election: Only one Coordinator and one Overlord is active (the leader) at a time, elected through ZooKeeper.
  2. Automatic Failover: If the active instance fails, another instance automatically takes over as leader.
  3. Request Redirection: Inactive instances automatically redirect HTTP requests to the current active instance.
# Example common.runtime.properties for Coordinators/Overlords
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db-host:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>
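You can observe leader election through Druid's /druid/coordinator/v1/isLeader endpoint (the Overlord equivalent is /druid/indexer/v1/isLeader). A minimal sketch of the check, assuming illustrative coordinator host names; in production you would fetch each response body with curl or an HTTP client:

```python
import json

def pick_leader(responses):
    """Given {host: isLeader-response-body}, return the leader host (or None)."""
    for host, body in responses.items():
        if json.loads(body).get("leader"):
            return host
    return None

# Sample bodies, as returned by e.g.:
#   curl http://coordinator1:8081/druid/coordinator/v1/isLeader
sample = {
    "coordinator1:8081": '{"leader": false}',
    "coordinator2:8081": '{"leader": true}',
}
print(pick_leader(sample))  # → coordinator2:8081
```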

Broker High Availability

Brokers are stateless and all running instances are active, making them easy to scale horizontally.

Load Balancer Configuration

Place all Broker instances behind a load balancer to distribute query traffic.
upstream druid_brokers {
    # Use least connections for query load balancing
    least_conn;

    server broker1:8082;
    server broker2:8082;
    server broker3:8082;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://druid_brokers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Health Checks

Configure your load balancer to perform health checks on the Broker status endpoint:
GET http://broker:8082/status/health
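Open-source nginx performs only passive health checks, so one hedged option is to mark a Broker unhealthy after repeated connection failures (the thresholds below are illustrative, not recommendations):

```nginx
upstream druid_brokers {
    least_conn;

    # Passive health checks: after 3 failed attempts, take the Broker out of
    # rotation for 30 seconds before retrying it.
    server broker1:8082 max_fails=3 fail_timeout=30s;
    server broker2:8082 max_fails=3 fail_timeout=30s;
    server broker3:8082 max_fails=3 fail_timeout=30s;
}
```

Active probing of /status/health requires NGINX Plus or an external health checker.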
Benefits of Multiple Brokers:
  • Query load distribution
  • No single point of failure
  • Rolling upgrades without downtime
  • Horizontal scaling for query capacity

Historical and MiddleManager HA

While not traditional “HA” components, Historicals and MiddleManagers contribute to cluster resilience:

Historical Redundancy

  • Configure replication rules to store segments on multiple Historicals
  • If one Historical fails, queries can still be served from replicas
  • The Coordinator automatically redistributes segments from failed nodes
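For example, a default rule that keeps two replicas of every segment on the default Historical tier can be set through the Coordinator's retention rules API (a sketch; _default_tier is Druid's default tier name):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```

Rules are posted per datasource to /druid/coordinator/v1/rules/{dataSourceName}, or to the _default datasource to apply cluster-wide defaults.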

MiddleManager/Indexer Resilience

  • Running multiple MiddleManagers provides task capacity redundancy
  • Failed tasks are automatically retried by the Overlord
  • Use multiple MiddleManagers to handle task load and provide fault tolerance
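The relevant sizing property is sketched below (the value is illustrative); spreading the same total capacity over more MiddleManagers limits how many tasks a single host failure can take down:

```properties
# Number of tasks this MiddleManager can run concurrently.
druid.worker.capacity=4
```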

Verification and Testing

ZooKeeper failover test:
  1. Stop one ZooKeeper node
  2. Verify the cluster continues operating normally
  3. Check ZooKeeper logs for leader re-election
  4. Restart the stopped node and verify it rejoins the cluster

Coordinator/Overlord failover test:
  1. Identify the current leader Coordinator
  2. Stop the leader process
  3. Verify another Coordinator becomes the leader within seconds
  4. Check that HTTP requests to the old leader redirect to the new leader

Broker failover test:
  1. Send queries through the load balancer
  2. Verify queries distribute across all Brokers
  3. Stop one Broker and verify queries continue
  4. Restart the Broker and verify it receives traffic again
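The query-distribution check can be partly automated. A purely illustrative sketch of the tallying step, assuming you can attribute each response to the Broker that served it (for example via a per-instance header added at the load balancer):

```python
from collections import Counter

def queries_well_distributed(serving_hosts, expected_brokers, min_share=0.1):
    """True if every expected Broker served at least min_share of the queries."""
    counts = Counter(serving_hosts)
    total = len(serving_hosts)
    return all(counts[b] / total >= min_share for b in expected_brokers)

observed = ["broker1", "broker2", "broker3", "broker1", "broker2", "broker3"]
print(queries_well_distributed(observed, ["broker1", "broker2", "broker3"]))  # → True
```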

Best Practices

Monitor Leader Elections

Track ZooKeeper leader elections and Coordinator/Overlord failovers to detect issues

Use Odd Number of Nodes

Always use 3 or 5 ZooKeeper nodes, never an even number

Test Failover Regularly

Perform regular chaos testing to validate HA configuration

Monitor Metadata Store

Ensure metadata store replication lag is minimal.

For production deployments, consider running Coordinators and Overlords on the same servers to reduce infrastructure costs while maintaining high availability.
