Apache Druid can be configured for high availability to ensure continuous operation and minimize downtime. High availability requires proper configuration of ZooKeeper, metadata storage, Coordinators, Overlords, and Brokers.

Core Components for High Availability

ZooKeeper Cluster

3 or 5 node cluster for distributed coordination

Metadata Storage

MySQL or PostgreSQL with replication and failover

Coordinator/Overlord

Multiple servers with automatic failover

Broker Load Balancing

Multiple Brokers behind a load balancer

ZooKeeper Configuration

For a highly available ZooKeeper deployment, run an ensemble of 3 or 5 ZooKeeper nodes.
Always use an odd number of nodes: ZooKeeper requires a majority quorum, so an even-sized ensemble tolerates no more node failures than the next-smaller odd-sized one and only adds coordination overhead.
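As a minimal sketch, a three-node ensemble's zoo.cfg might look like the following (the zk1–zk3 host names and paths are assumptions for illustration):

```properties
# zoo.cfg -- identical on every node; each node additionally needs a unique
# myid file (containing 1, 2, or 3) in dataDir.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
```

Ports 2888 and 3888 are the conventional follower-to-leader and leader-election ports.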

Deployment Options

We recommend one of the following approaches:
  1. Dedicated Hardware: Install ZooKeeper on its own dedicated hardware for best performance and isolation
  2. Co-located with Master Servers: Run 3 or 5 Master servers (where Overlords or Coordinators are running) and configure ZooKeeper on them
For detailed ZooKeeper configuration and tuning, refer to the ZooKeeper Admin Guide.

Metadata Storage Configuration

For highly-available metadata storage, use MySQL or PostgreSQL with replication and failover enabled.

Supported Databases

Configure MySQL with replication and automatic failover (for example, using MySQL Enterprise High Availability features or a managed MySQL service), or PostgreSQL with streaming replication and a failover manager.
Ensure your metadata store connection strings support the failover mechanism specific to your chosen database.
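As an illustrative sketch (the db1/db2 host names are assumptions), both supported JDBC drivers accept multi-host connection URIs that fail over between a primary and a standby:

```properties
# PostgreSQL: PgJDBC tries hosts in order; with targetServerType=primary it
# connects only to the node currently accepting writes.
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db1:5432,db2:5432/druid?targetServerType=primary

# MySQL: Connector/J supports the same multi-host failover pattern.
# druid.metadata.storage.connector.connectURI=jdbc:mysql://db1:3306,db2:3306/druid
```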

Coordinator and Overlord Failover

Druid Coordinators and Overlords support automatic failover when multiple instances are running.

Configuration Requirements

All Coordinator and Overlord instances must:
  • Use the same ZooKeeper cluster
  • Connect to the same metadata storage
  • Have identical configuration for service discovery

Failover Behavior

  1. Leader Election: Only one Coordinator and one Overlord is active (the leader) at a time, elected through ZooKeeper.
  2. Automatic Failover: If the active instance fails, another instance automatically takes over as leader.
  3. Request Redirection: Inactive instances automatically redirect HTTP requests to the current active instance.
# Example common.runtime.properties for Coordinators/Overlords
druid.zk.service.host=zk1:2181,zk2:2181,zk3:2181
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db-host:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>
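You can observe leader election through Druid's /druid/coordinator/v1/isLeader endpoint (the Overlord equivalent is /druid/indexer/v1/isLeader). A minimal sketch of the check, assuming illustrative coordinator host names; in production you would fetch each response body with curl or an HTTP client:

```python
import json

def pick_leader(responses):
    """Given {host: isLeader-response-body}, return the leader host (or None)."""
    for host, body in responses.items():
        if json.loads(body).get("leader"):
            return host
    return None

# Sample bodies, as returned by e.g.:
#   curl http://coordinator1:8081/druid/coordinator/v1/isLeader
sample = {
    "coordinator1:8081": '{"leader": false}',
    "coordinator2:8081": '{"leader": true}',
}
print(pick_leader(sample))  # → coordinator2:8081
```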

Broker High Availability

Brokers are stateless and all running instances are active, making them easy to scale horizontally.

Load Balancer Configuration

Place all Broker instances behind a load balancer to distribute query traffic.
upstream druid_brokers {
    # Use least connections for query load balancing
    least_conn;

    server broker1:8082;
    server broker2:8082;
    server broker3:8082;
}

server {
    listen 80;
    
    location / {
        proxy_pass http://druid_brokers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

Health Checks

Configure your load balancer to perform health checks on the Broker status endpoint:
GET http://broker:8082/status/health
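Open-source nginx performs only passive health checks, so one hedged option is to mark a Broker unhealthy after repeated connection failures (the thresholds below are illustrative, not recommendations):

```nginx
upstream druid_brokers {
    least_conn;

    # Passive health checks: after 3 failed attempts, take the Broker out of
    # rotation for 30 seconds before retrying it.
    server broker1:8082 max_fails=3 fail_timeout=30s;
    server broker2:8082 max_fails=3 fail_timeout=30s;
    server broker3:8082 max_fails=3 fail_timeout=30s;
}
```

Active probing of /status/health requires NGINX Plus or an external health checker.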
Benefits of Multiple Brokers:
  • Query load distribution
  • No single point of failure
  • Rolling upgrades without downtime
  • Horizontal scaling for query capacity

Historical and MiddleManager HA

While not traditional “HA” components, Historicals and MiddleManagers contribute to cluster resilience:

Historical Redundancy

  • Configure replication rules to store segments on multiple Historicals
  • If one Historical fails, queries can still be served from replicas
  • The Coordinator automatically redistributes segments from failed nodes
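For example, a default rule that keeps two replicas of every segment on the default Historical tier can be set through the Coordinator's retention rules API (a sketch; _default_tier is Druid's default tier name):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```

Rules are posted per datasource to /druid/coordinator/v1/rules/{dataSourceName}, or to the _default datasource to apply cluster-wide defaults.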

MiddleManager/Indexer Resilience

  • Running multiple MiddleManagers provides task capacity redundancy
  • Failed tasks are automatically retried by the Overlord
  • Use multiple MiddleManagers to handle task load and provide fault tolerance
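The relevant sizing property is sketched below (the value is illustrative); spreading the same total capacity over more MiddleManagers limits how many tasks a single host failure can take down:

```properties
# Number of tasks this MiddleManager can run concurrently.
druid.worker.capacity=4
```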

Verification and Testing

ZooKeeper failover test:
  1. Stop one ZooKeeper node
  2. Verify the cluster continues operating normally
  3. Check ZooKeeper logs for leader re-election
  4. Restart the stopped node and verify it rejoins the cluster

Coordinator/Overlord failover test:
  1. Identify the current leader Coordinator
  2. Stop the leader process
  3. Verify another Coordinator becomes the leader within seconds
  4. Check that HTTP requests to the old leader redirect to the new leader

Broker failover test:
  1. Send queries through the load balancer
  2. Verify queries distribute across all Brokers
  3. Stop one Broker and verify queries continue
  4. Restart the Broker and verify it receives traffic again
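The query-distribution check can be partly automated. A purely illustrative sketch of the tallying step, assuming you can attribute each response to the Broker that served it (for example via a per-instance header added at the load balancer):

```python
from collections import Counter

def queries_well_distributed(serving_hosts, expected_brokers, min_share=0.1):
    """True if every expected Broker served at least min_share of the queries."""
    counts = Counter(serving_hosts)
    total = len(serving_hosts)
    return all(counts[b] / total >= min_share for b in expected_brokers)

observed = ["broker1", "broker2", "broker3", "broker1", "broker2", "broker3"]
print(queries_well_distributed(observed, ["broker1", "broker2", "broker3"]))  # → True
```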

Best Practices

Monitor Leader Elections

Track ZooKeeper leader elections and Coordinator/Overlord failovers to detect issues

Use Odd Number of Nodes

Always use 3 or 5 ZooKeeper nodes, never an even number

Test Failover Regularly

Perform regular chaos testing to validate HA configuration

Monitor Metadata Store

Ensure metadata store replication lag is minimal.

For production deployments, consider running Coordinators and Overlords on the same servers to reduce infrastructure costs while maintaining high availability.
