Operating Modes
- Local Mode
- Remote Mode
In local mode, the Overlord is also responsible for creating Peons for executing tasks. When running the Overlord in local mode, all Middle Manager and Peon configurations must be provided as well.
Local mode is typically used for simple workflows.
Key Responsibilities
- Task Acceptance: receives and validates incoming indexing tasks
- Task Distribution: coordinates distribution of tasks to Middle Managers or Peons
- Lock Management: creates and manages locks around tasks to prevent conflicts
- Status Reporting: returns task statuses and progress to callers
- Worker Blacklisting: monitors task failures and blacklists problematic workers
- Autoscaling: automatically scales Middle Manager capacity based on task load
Configuration
For Apache Druid Overlord service configuration, see the Overlord section of the Druid configuration reference.
Running the Overlord
HTTP Endpoints
For a list of API endpoints supported by the Overlord, see the Druid API reference.
Blacklisted Workers
If a Middle Manager's task failures exceed a threshold, the Overlord blacklists that Middle Manager. This prevents problematic workers from receiving new tasks while they are experiencing issues.
Blacklisting Behavior
- Monitor task failures: the Overlord tracks task failures per Middle Manager against the configured threshold.
- Blacklist problematic workers: when a Middle Manager exceeds the failure threshold, it is added to the blacklist.
- Respect blacklist limits: no more than 20% of the Middle Managers can be blacklisted at any given time.
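The blacklisting rules above can be sketched in a few lines of Python. This is an illustrative model only, not the Overlord's implementation: the class names, the `max_failures` default, and the choice to count failures cumulatively are all assumptions made for the example. Only the "failures above a threshold" rule and the 20% cap come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class BlacklistPolicy:
    # Hypothetical names; only the 20% cap is taken from the text.
    max_failures: int = 5              # assumed threshold value
    max_blacklisted_percent: int = 20  # cap on blacklisted workers


@dataclass
class WorkerTracker:
    policy: BlacklistPolicy
    failures: dict = field(default_factory=dict)  # worker -> failure count
    blacklisted: set = field(default_factory=set)

    def register(self, worker: str) -> None:
        """Make the worker known so it counts toward the cap."""
        self.failures.setdefault(worker, 0)

    def record_failure(self, worker: str) -> bool:
        """Count one failed task; return True if the worker is blacklisted."""
        self.failures[worker] = self.failures.get(worker, 0) + 1
        if worker in self.blacklisted:
            return True
        over_threshold = self.failures[worker] > self.policy.max_failures
        # Largest number of workers that may be blacklisted at once.
        limit = len(self.failures) * self.policy.max_blacklisted_percent // 100
        if over_threshold and len(self.blacklisted) < limit:
            self.blacklisted.add(worker)
        return worker in self.blacklisted
```

With ten registered workers and the 20% cap, at most two can be blacklisted at once; a third failing worker keeps receiving tasks until a slot frees up.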
Configuration Variables
The following variables can be used to set the threshold and blacklist timeouts:
- Maximum number of task failures before a worker is blacklisted
- Time period a worker remains blacklisted before being reconsidered
- How often the Overlord checks for workers that can be removed from the blacklist
- Maximum percentage of workers that can be blacklisted (default: 20%)
Example Configuration
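A possible Overlord runtime properties fragment for the four settings described above. The property names and values here are assumptions based on the `druid.indexer.runner.*` settings in the Druid configuration reference; verify them against the reference for your Druid version before use.

```properties
# Assumed property names and example values; confirm against the Druid configuration reference.
# Task failures allowed before a worker is blacklisted
druid.indexer.runner.maxRetriesBeforeBlacklist=5
# How long a worker stays blacklisted before being reconsidered (ISO 8601 duration)
druid.indexer.runner.workerBlackListBackoffTime=PT15M
# How often the Overlord checks for workers to remove from the blacklist
druid.indexer.runner.workerBlackListCleanupPeriod=PT5M
# Maximum percentage of workers that may be blacklisted
druid.indexer.runner.maxPercentageBlacklistWorkers=20
```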
Autoscaling
The autoscaling mechanisms currently in place are tightly coupled with specific deployment infrastructure, but the framework is intended to support other implementations. The Druid community welcomes new implementations and extensions of the existing mechanisms.
- Scale Up
- Scale Down
New Middle Managers may be added when a task has been in pending state for too long.
This ensures that task queues don’t grow indefinitely during periods of high ingestion load.
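As a minimal sketch of the scale-up rule, the check below asks whether any task has been pending longer than a timeout. The names `PendingTask`, `should_scale_up`, and `pending_timeout_s`, and the 300-second default, are hypothetical; the text only says that a task pending "for too long" may trigger new Middle Managers.

```python
from dataclasses import dataclass


@dataclass
class PendingTask:
    task_id: str
    pending_since: float  # seconds, on the same clock as `now`


def should_scale_up(pending, now, pending_timeout_s=300.0):
    """Return True if any task has waited longer than the timeout."""
    return any(now - t.pending_since > pending_timeout_s for t in pending)
```

A real autoscaler would also need to bound how many workers it adds and avoid re-triggering while new workers are still starting; those details are outside what the text specifies.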
Deployment-Specific Autoscaling
In some deployments, Middle Managers run on cloud instances (for example, Amazon EC2 nodes) that are provisioned to register themselves in a deployment environment such as Galaxy.
Task Distribution Workflow
[Figure: how the Overlord distributes tasks to workers]
Architecture Integration
With Middle Managers
- Assigns tasks to available Middle Managers
- Monitors task execution and progress
- Tracks worker health and blacklists problematic workers
- Manages autoscaling of worker capacity
With Metadata Store
- Stores task metadata and status
- Maintains task lock information
- Persists task history and audit logs
With Clients
- Receives task submission requests
- Returns task IDs and status information
- Provides APIs for task management
With ZooKeeper
- Coordinates distributed locking
- Manages leader election for high availability
- Publishes worker availability information
High Availability
The Overlord supports high availability through leader election. You can run multiple Overlord instances, and they will coordinate through ZooKeeper to ensure only one is active at a time. If the active Overlord fails, another will automatically take over.
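The failover behavior can be illustrated with a small in-memory simulation. This is not how Druid or ZooKeeper implement election; it only models the common "lowest sequence number wins" pattern so the single-active-leader property is easy to see. The `ElectionSim` class and its methods are hypothetical.

```python
class ElectionSim:
    """In-memory stand-in for a coordination service (illustration only)."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # instance name -> sequence number

    def join(self, name):
        """Register an Overlord instance as a leadership candidate."""
        self._candidates[name] = self._next_seq
        self._next_seq += 1

    def fail(self, name):
        """Remove an instance, as if its session had expired."""
        self._candidates.pop(name, None)

    def leader(self):
        """Return the live candidate that joined earliest, or None."""
        if not self._candidates:
            return None
        return min(self._candidates, key=self._candidates.get)
```

If `overlord-1` joins before `overlord-2`, it is the leader; after `fail("overlord-1")`, `leader()` returns `overlord-2`, so exactly one instance is active at any time.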