Core Components for High Availability
- ZooKeeper Cluster: a 3- or 5-node cluster for distributed coordination
- Metadata Storage: MySQL or PostgreSQL with replication and failover
- Coordinator/Overlord: multiple servers with automatic failover
- Broker Load Balancing: multiple Brokers behind a load balancer
ZooKeeper Configuration
For a highly available ZooKeeper deployment, you need a cluster of 3 or 5 ZooKeeper nodes.
Deployment Options
We recommend one of the following approaches:
- Dedicated Hardware: Install ZooKeeper on its own dedicated hardware for the best performance and isolation
- Co-located with Master Servers: Run ZooKeeper on the same 3 or 5 Master servers that host the Coordinators and Overlords
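A minimal ensemble configuration for a 3-node deployment might look like the following (hostnames and paths are placeholders):

```
# zoo.cfg -- identical on all three nodes
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
```

Each node also needs a `myid` file in `dataDir` containing its own server number.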
Metadata Storage Configuration
For highly available metadata storage, use MySQL or PostgreSQL with replication and failover enabled.
Supported Databases
- MySQL
- PostgreSQL
Configure MySQL with replication and automatic failover, for example using MySQL Enterprise High Availability features; for PostgreSQL, use streaming replication with automatic failover.
Ensure your metadata store connection strings support failover mechanisms specific to your chosen database.
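For example, with MySQL, the Druid metadata connector can point at a multi-host JDBC URL so the client fails over between hosts (hostnames and credentials are placeholders; verify the failover URL syntax for your driver version):

```
# common.runtime.properties -- illustrative values
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db-primary.example.com:3306,db-replica.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=diurd
```

The PostgreSQL JDBC driver supports a similar multi-host form, e.g. `jdbc:postgresql://host1:5432,host2:5432/druid?targetServerType=primary` to prefer the writable primary.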
Coordinator and Overlord Failover
Druid Coordinators and Overlords support automatic failover when multiple instances are running.
Configuration Requirements
All Coordinator and Overlord instances must:
- Use the same ZooKeeper cluster
- Connect to the same metadata storage
- Have identical configuration for service discovery
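Concretely, the service-discovery settings should be identical on every instance, along the lines of the following sketch (hostnames are illustrative):

```
# common.runtime.properties -- identical across all Coordinator/Overlord instances
druid.zk.service.host=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
druid.zk.paths.base=/druid
druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://meta.example.com:5432/druid
```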
Failover Behavior
Leader Election
Only one Coordinator and one Overlord will be active (leader) at a time, elected through ZooKeeper.
Automatic Failover
If the active instance fails, another instance automatically takes over as leader.
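The election mechanism can be illustrated with a small pure-Python simulation (a sketch only: the class and names below are illustrative, not Druid code; real instances coordinate through ephemeral sequential ZooKeeper znodes, where the lowest sequence number holds leadership):

```python
# Minimal simulation of ZooKeeper-style leader election:
# each candidate registers an ephemeral sequential node, and the
# candidate holding the lowest sequence number is the leader.
# When the leader's node disappears (process failure), the next
# lowest sequence automatically becomes leader.

class LeaderElection:
    def __init__(self):
        self._seq = 0
        self._nodes = {}  # candidate name -> sequence number

    def register(self, name):
        """A candidate (e.g. a Coordinator instance) joins the election."""
        self._nodes[name] = self._seq
        self._seq += 1

    def fail(self, name):
        """Simulate process failure: its ephemeral node is removed."""
        self._nodes.pop(name, None)

    def leader(self):
        """Lowest sequence number wins; None if no candidates remain."""
        if not self._nodes:
            return None
        return min(self._nodes, key=self._nodes.get)

election = LeaderElection()
election.register("coordinator-1")
election.register("coordinator-2")
election.register("coordinator-3")
print(election.leader())  # coordinator-1 is the initial leader

election.fail("coordinator-1")  # the leader process dies
print(election.leader())  # coordinator-2 takes over automatically
```

No operator action is needed for the handover; surviving candidates observe the leader's node disappear and the next in line assumes leadership.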
Broker High Availability
Brokers are stateless and all running instances are active, making them easy to scale horizontally.
Load Balancer Configuration
Place all Broker instances behind a load balancer to distribute query traffic.
Health Checks
Configure your load balancer to perform health checks on the Broker status endpoint.
Benefits of Multiple Brokers:
- Query load distribution
- No single point of failure
- Rolling upgrades without downtime
- Horizontal scaling for query capacity
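With HAProxy, for example, health checks against the Broker's `/status/health` endpoint might be configured as follows (hostnames are placeholders; 8082 is the default Broker port):

```
# haproxy.cfg -- illustrative
backend druid_brokers
    balance roundrobin
    option httpchk GET /status/health
    server broker1 broker1.example.com:8082 check
    server broker2 broker2.example.com:8082 check
    server broker3 broker3.example.com:8082 check
```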
Historical and MiddleManager HA
While not traditional “HA” components, Historicals and MiddleManagers contribute to cluster resilience:
Historical Redundancy
- Configure replication rules to store segments on multiple Historicals
- If one Historical fails, queries can still be served from replicas
- The Coordinator automatically redistributes segments from failed nodes
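For example, a load rule with a replication factor of 2 keeps every segment on two Historicals (set through the Coordinator console or the rules API; the tier name shown is the default):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```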
MiddleManager/Indexer Resilience
- Running multiple MiddleManagers provides task capacity redundancy and fault tolerance
- Failed tasks are automatically retried by the Overlord
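Task slots per MiddleManager are controlled by `druid.worker.capacity`; running several smaller MiddleManagers rather than one large one limits how much task capacity a single host failure removes (the value below is illustrative):

```
# middleManager/runtime.properties
druid.worker.capacity=4
```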
Verification and Testing
Test ZooKeeper Failover
- Stop one ZooKeeper node
- Verify the cluster continues operating normally
- Check ZooKeeper logs for leader re-election
- Restart the stopped node and verify it rejoins the cluster
Test Coordinator Failover
- Identify the current leader Coordinator
- Stop the leader process
- Verify another Coordinator becomes the leader within seconds
- Check that HTTP requests to the old leader redirect to the new leader
Test Broker Load Balancing
- Send queries through the load balancer
- Verify queries distribute across all Brokers
- Stop one Broker and verify queries continue
- Restart the Broker and verify it receives traffic again
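The distribution behavior being verified can be sketched with a tiny round-robin simulation (illustrative Python, not part of Druid or any load balancer):

```python
# Round-robin across Broker instances, skipping instances whose
# health check has failed (all names here are hypothetical).
from itertools import cycle

class BrokerPool:
    def __init__(self, brokers):
        self.healthy = set(brokers)
        self._ring = cycle(brokers)
        self._size = len(brokers)

    def mark_down(self, broker):
        """Health check failed: stop routing to this Broker."""
        self.healthy.discard(broker)

    def mark_up(self, broker):
        """Health check passing again: resume routing."""
        self.healthy.add(broker)

    def next_broker(self):
        """Return the next healthy Broker, or raise if none remain."""
        for _ in range(self._size):
            b = next(self._ring)
            if b in self.healthy:
                return b
        raise RuntimeError("no healthy Brokers")

pool = BrokerPool(["broker-1", "broker-2", "broker-3"])
print([pool.next_broker() for _ in range(3)])  # round-robin across all three

pool.mark_down("broker-2")  # simulate a failed health check
print([pool.next_broker() for _ in range(3)])  # broker-2 is skipped
```

When the stopped Broker passes health checks again (`mark_up` in this sketch), it re-enters the rotation, which mirrors the restart step above.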
Best Practices
Monitor Leader Elections
Track ZooKeeper leader elections and Coordinator/Overlord failovers to detect issues
Use Odd Number of Nodes
Always use an odd number of ZooKeeper nodes (3 or 5); an even number adds no extra fault tolerance because ZooKeeper still requires a majority quorum
Test Failover Regularly
Perform regular chaos testing to validate HA configuration
Monitor Metadata Store
Ensure metadata store replication lag is minimal