Supported Streaming Sources
Apache Kafka
Ingest from Apache Kafka through the bundled Kafka indexing service extension
Amazon Kinesis
Ingest from Amazon Kinesis through the bundled Kinesis indexing service extension
Key Features
Exactly-Once Processing
Each indexing service provides real-time data ingestion with an exactly-once stream processing guarantee. Druid achieves this by:
- Using Kafka's partition and offset mechanism
- Using Kinesis's shard and sequence number mechanism
- Coordinating handoffs through the supervisor
- Managing failures with automatic recovery
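One common way offsets enable exactly-once behavior is to commit the ingested rows and the new offset together, so a retried batch is rejected instead of written twice. The sketch below is a toy model of that idea under that assumption; `MetadataStore` and `publish` are hypothetical names and this is not Druid's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Toy stand-in for a store that holds published rows and the committed offset."""
    committed_offset: int = 0
    published_rows: list = field(default_factory=list)

    def publish(self, start_offset: int, rows: list) -> bool:
        # Accept a batch only if it starts exactly where the last commit ended.
        # Rows and offset are updated together, so a retry of an
        # already-published batch is refused rather than duplicated.
        if start_offset != self.committed_offset:
            return False
        self.published_rows.extend(rows)
        self.committed_offset = start_offset + len(rows)
        return True


stream = ["a", "b", "c", "d"]  # a single partition; list index = offset
store = MetadataStore()

first = store.publish(0, stream[0:2])   # normal publish
retry = store.publish(0, stream[0:2])   # duplicate attempt after a simulated failure
second = store.publish(2, stream[2:4])  # resume from the committed offset

print(first, retry, second, store.published_rows)
# True False True ['a', 'b', 'c', 'd']
```

The same reasoning applies to Kinesis if shards and sequence numbers are substituted for partitions and offsets.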
Real-Time Queries
Query data as it arrives:
- Middle Managers and Indexers can respond to queries with arriving data
- No need to wait for segment publishing
- Low-latency access to recent events
Late Data Handling
Both streaming sources support ingesting late-arriving data:
- Configure lateMessageRejectionPeriod to control how late data can be
- Supervisor manages segment handoffs to handle out-of-order events
- Automatic backfilling for missed data
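To illustrate what a late-message rejection window means, here is a minimal sketch. It assumes the period is compared against some reference time chosen by the system; `is_too_late` and the reference time below are hypothetical, and the exact reference point Druid uses should be checked in its documentation.

```python
from datetime import datetime, timedelta, timezone


def is_too_late(event_time: datetime, reference_time: datetime,
                rejection_period: timedelta) -> bool:
    """Return True if the event is older than reference_time - rejection_period."""
    return event_time < reference_time - rejection_period


# Hypothetical reference point and a one-hour window (ISO 8601 "PT1H").
reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = timedelta(hours=1)

on_time = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
late = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)

print(is_too_late(on_time, reference, window))  # False
print(is_too_late(late, reference, window))     # True
```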
How Streaming Ingestion Works
Load Extension
Load the appropriate extension on both Overlord and Middle Manager:
- druid-kafka-indexing-service for Kafka
- druid-kinesis-indexing-service for Kinesis
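Extensions are typically enabled through the `druid.extensions.loadList` runtime property. The sketch below only renders that property line for a chosen set of extensions; `load_list_property` is a hypothetical helper, and which properties file the line belongs in depends on your deployment.

```python
import json


def load_list_property(extensions: list) -> str:
    """Render a druid.extensions.loadList line for a runtime properties file."""
    return "druid.extensions.loadList=" + json.dumps(extensions)


print(load_list_property(["druid-kafka-indexing-service"]))
# druid.extensions.loadList=["druid-kafka-indexing-service"]
```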
Create Supervisor
Submit a supervisor spec through:
- Druid web console
- Supervisor API
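As a rough sketch of the API route, the snippet below builds (but does not send) a POST request to the supervisor endpoint `/druid/indexer/v1/supervisor`. The base URL `http://localhost:8888`, the helper name `build_supervisor_request`, and the placeholder spec are assumptions for illustration.

```python
import json
import urllib.request


def build_supervisor_request(base_url: str, spec: dict) -> urllib.request.Request:
    """Build a POST request that would submit a supervisor spec as JSON."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/druid/indexer/v1/supervisor",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder spec; a real one needs dataSchema, ioConfig, and tuningConfig.
request = build_supervisor_request("http://localhost:8888", {"type": "kafka", "spec": {}})
print(request.get_method(), request.full_url)

# To actually submit against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```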
Supervisor Manages Tasks
The supervisor:
- Creates and manages indexing tasks
- Coordinates segment handoffs
- Manages failures and retries
- Ensures scalability and replication requirements
Architecture
Supervisor Responsibilities
The supervisor oversees the streaming ingestion process across four areas: task management, partition assignment, offset management, and failure recovery. Among other things, it:
- Creates indexing tasks based on configuration
- Monitors task health and status
- Restarts failed tasks
- Scales tasks up/down based on load
Configuration Overview
A streaming supervisor spec consists of three main sections: the data schema, the I/O config, and the tuning config.
Comparison: Kafka vs Kinesis
Similarities
Both streaming sources:
✅ Support exactly-once processing
✅ Handle late-arriving data
✅ Provide real-time query capabilities
✅ Are managed by supervisors
✅ Support schema discovery
✅ Support all input formats
Getting Started
Create Supervisor Spec
Define your supervisor specification with:
- Data schema (what to ingest)
- I/O config (how to connect)
- Tuning config (performance settings)
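The three parts listed above might fit together as in the following sketch of a Kafka supervisor spec, built as a Python dictionary and printed as JSON. The datasource name, topic, broker address, and all values are placeholders chosen for illustration; check the field names against the Kafka ingestion reference for your Druid version.

```python
import json

supervisor_spec = {
    "type": "kafka",
    "spec": {
        # Data schema: what to ingest
        "dataSchema": {
            "dataSource": "example_events",  # placeholder name
            "timestampSpec": {"column": "timestamp", "format": "iso"},
            "dimensionsSpec": {"useSchemaDiscovery": True},
            "granularitySpec": {"segmentGranularity": "hour"},
        },
        # I/O config: how to connect
        "ioConfig": {
            "topic": "example_topic",  # placeholder topic
            "inputFormat": {"type": "json"},
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
            "taskCount": 1,
            "replicas": 1,
            "taskDuration": "PT1H",
        },
        # Tuning config: performance settings
        "tuningConfig": {"type": "kafka", "maxRowsPerSegment": 5000000},
    },
}

print(json.dumps(supervisor_spec, indent=2))
```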
Performance Considerations
Task Count and Replicas
Configure parallelism with taskCount and replicas:
Total tasks = taskCount * replicas. For example, a taskCount of 3 with 2 replicas yields 6 tasks.
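The multiplication can be checked with a one-line helper. The function name `total_tasks` and the input values are illustrative; only the formula comes from the text.

```python
def total_tasks(task_count: int, replicas: int) -> int:
    """Total tasks the supervisor runs: one set of reading tasks per replica."""
    return task_count * replicas


print(total_tasks(task_count=3, replicas=2))  # 6
```

When sizing a cluster, this total is the number of task slots the supervisor would need available at once for reading.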
Task Duration
Balance between real-time queries and segment size:
- Shorter duration: More frequent handoffs, smaller segments
- Longer duration: Fewer handoffs, larger segments
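To make the trade-off concrete, the sketch below estimates how many task cycles fit in a day for a few durations. It assumes each cycle ends with exactly one handoff and ignores any other triggers; `handoffs_per_day` is a hypothetical helper, and the labels are ISO 8601 periods of the kind a taskDuration setting would take.

```python
from datetime import timedelta


def handoffs_per_day(task_duration: timedelta) -> float:
    """Approximate number of task cycles (and thus handoffs) in 24 hours."""
    return timedelta(days=1) / task_duration


for label, duration in [
    ("PT30M", timedelta(minutes=30)),
    ("PT1H", timedelta(hours=1)),
    ("PT4H", timedelta(hours=4)),
]:
    print(label, handoffs_per_day(duration))
# PT30M 48.0
# PT1H 24.0
# PT4H 6.0
```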
Memory Settings
Control memory usage:
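A tuning config fragment for memory might look like the sketch below. It assumes the `maxRowsInMemory` and `maxBytesInMemory` tuning properties; the values are placeholders, not recommendations, and should be sized against the heap available to each task.

```python
import json

# Illustrative values only; tune against the task JVM heap.
tuning_config = {
    "type": "kafka",
    "maxRowsInMemory": 150000,       # rows buffered before persisting to disk
    "maxBytesInMemory": 100000000,   # byte budget for the in-memory buffer
}

print(json.dumps(tuning_config, indent=2))
```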
Autoscaling
Enable task autoscaling based on lag:
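A lag-based autoscaler configuration could be sketched as follows, assuming the `autoScalerConfig` block inside the I/O config. The thresholds are placeholder values, and `scaling_hint` is a hypothetical, simplified decision rule: the real scaler evaluates lag samples over a window rather than a single reading.

```python
auto_scaler_config = {
    "enableTaskAutoScaler": True,
    "taskCountMin": 1,
    "taskCountMax": 4,
    "autoScalerStrategy": "lagBased",
    "scaleOutThreshold": 6000000,  # placeholder lag above which to add tasks
    "scaleInThreshold": 1000000,   # placeholder lag below which to remove tasks
}

io_config_fragment = {"autoScalerConfig": auto_scaler_config}


def scaling_hint(total_lag: int, config: dict) -> str:
    """Simplified rule: compare one lag reading against the two thresholds."""
    if total_lag > config["scaleOutThreshold"]:
        return "scale out"
    if total_lag < config["scaleInThreshold"]:
        return "scale in"
    return "hold"


print(scaling_hint(7000000, auto_scaler_config))  # scale out
print(scaling_hint(3000000, auto_scaler_config))  # hold
print(scaling_hint(500000, auto_scaler_config))   # scale in
```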
Common Use Cases
Real-Time Analytics
Process and query events as they arrive for real-time dashboards and alerting
Log Aggregation
Centralize and analyze logs from distributed systems in real-time
IoT Data
Ingest sensor data and device telemetry for monitoring and analysis
User Activity
Track user events and behavior for analytics and personalization
Troubleshooting
High Lag
If your ingestion is falling behind:
- Increase taskCount (up to partition count)
- Enable autoscaling
- Increase maxRowsPerSegment
- Add more Middle Manager capacity
- Check for slow queries blocking ingestion
Task Failures
If tasks are failing:
- Check task logs for errors
- Verify stream connectivity
- Check data format matches spec
- Review parse exceptions
- Ensure sufficient memory
Parse Errors
If data is not parsing correctly:
- Verify input format configuration
- Check sample data against spec
- Review parseExceptionHandler settings
- Monitor maxParseExceptions threshold
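As a hedged sketch of how a parse-exception threshold behaves, the fragment below pairs a tuning config with a simple check. It assumes the `logParseExceptions`, `maxParseExceptions`, and `maxSavedParseExceptions` tuning properties; the values are placeholders, and `exceeds_parse_limit` is a hypothetical helper that models a task failing once the error count passes the limit.

```python
tuning_config = {
    "type": "kafka",
    "logParseExceptions": True,     # write details of each parse failure to the task log
    "maxParseExceptions": 100,      # placeholder limit before the task fails
    "maxSavedParseExceptions": 10,  # placeholder count of recent errors kept for reporting
}


def exceeds_parse_limit(parse_errors_seen: int, config: dict) -> bool:
    """Return True once the number of parse errors passes the configured limit."""
    return parse_errors_seen > config["maxParseExceptions"]


print(exceeds_parse_limit(100, tuning_config))  # False
print(exceeds_parse_limit(101, tuning_config))  # True
```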
Next Steps
Kafka Ingestion
Learn about Kafka-specific configuration
Kinesis Ingestion
Learn about Kinesis-specific configuration
Supervisor
Deep dive into supervisor management
Data Formats
Explore supported input formats