Apache Druid can consume data streams from external streaming sources for real-time data ingestion with exactly-once processing guarantees.

Supported Streaming Sources

Apache Kafka

Ingest from Apache Kafka through the bundled Kafka indexing service extension

Amazon Kinesis

Ingest from Amazon Kinesis through the bundled Kinesis indexing service extension

Key Features

Each indexing service provides real-time data ingestion with an exactly-once stream processing guarantee. Druid achieves this by:
  • Using Kafka partition and offset mechanism
  • Using Kinesis shard and sequence number mechanism
  • Coordinating handoffs through the supervisor
  • Managing failures with automatic recovery
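As a sketch of how the offset mechanism surfaces in configuration: the supervisor's ioConfig controls where reading starts, and Druid commits consumed offsets atomically with published segments, which is what makes the exactly-once guarantee possible. The values below are illustrative, not defaults:

```json
{
  "ioConfig": {
    "topic": "my-topic",
    "useEarliestOffset": false,   // on first start, begin from the latest offset
    "completionTimeout": "PT30M"  // time allowed for publishing before a task is failed
  }
}
```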
Query data as it arrives:
  • Middle Managers and Indexers can answer queries using data that has arrived but not yet been published
  • No need to wait for segment publishing
  • Low-latency access to recent events
Both streaming sources support ingesting late-arriving data:
  • Configure lateMessageRejectionPeriod to control how late data can be
  • Supervisor manages segment handoffs to handle out-of-order events
  • Automatic backfilling for missed data
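For example, a supervisor that rejects events arriving more than one day after task creation might set (illustrative value):

```json
{
  "ioConfig": {
    "lateMessageRejectionPeriod": "P1D"  // drop events with timestamps more than 1 day before task creation time
  }
}
```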

How Streaming Ingestion Works

1. Load Extension

Load the appropriate extension on both Overlord and Middle Manager:
  • druid-kafka-indexing-service for Kafka
  • druid-kinesis-indexing-service for Kinesis
2. Create Supervisor

Submit a supervisor spec through:
  • Druid web console
  • Supervisor API
The supervisor is a continuously-running process that manages indexing tasks.
3. Supervisor Manages Tasks

The supervisor:
  • Creates and manages indexing tasks
  • Coordinates segment handoffs
  • Manages failures and retries
  • Ensures scalability and replication requirements
4. Tasks Ingest Data

Indexing tasks:
  • Read from stream partitions/shards
  • Process and transform data
  • Create segments
  • Publish segments to deep storage
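Segment creation is driven by the granularitySpec in the data schema: each segment covers one time chunk of segmentGranularity. An illustrative sketch:

```json
{
  "granularitySpec": {
    "segmentGranularity": "HOUR",   // each segment covers one hour of event time
    "queryGranularity": "MINUTE",   // truncate timestamps to the minute at ingestion
    "rollup": false
  }
}
```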

Architecture

                ┌─────────────────┐
                │    Overlord     │
                │  ┌───────────┐  │
                │  │ Supervisor│  │
                │  └───────────┘  │
                └────────┬────────┘
                         │
            ┌────────────┴────────────┐
            │                         │
  ┌─────────┴────────┐      ┌─────────┴────────┐
  │  Middle Manager  │      │  Middle Manager  │
  │  ┌────────────┐  │      │  ┌────────────┐  │
  │  │   Task 1   │  │      │  │   Task 2   │  │
  │  │ (reading)  │  │      │  │ (reading)  │  │
  │  └────────────┘  │      │  └────────────┘  │
  └─────────┬────────┘      └─────────┬────────┘
            │                         │
  ┌─────────┴─────────────────────────┴────────┐
  │           Kafka/Kinesis Stream             │
  │      Partition 1     |     Partition 2     │
  └────────────────────────────────────────────┘

Supervisor Responsibilities

The supervisor oversees the streaming ingestion process:
  • Creates indexing tasks based on configuration
  • Monitors task health and status
  • Restarts failed tasks
  • Scales tasks up/down based on load

Configuration Overview

A streaming supervisor spec consists of three main sections:
{
  "type": "kafka",  // or "kinesis"
  "spec": {
    "dataSchema": {
      // What data to ingest and how to process it
      "dataSource": "my-datasource",
      "timestampSpec": { ... },
      "dimensionsSpec": { ... },
      "metricsSpec": [ ... ],
      "granularitySpec": { ... }
    },
    "ioConfig": {
      // Connection and I/O settings
      "topic": "my-topic",  // or "stream" for Kinesis
      "consumerProperties": { ... },
      "inputFormat": { ... },
      "taskCount": 1,
      "replicas": 1,
      "taskDuration": "PT1H"
    },
    "tuningConfig": {
      // Performance tuning
      "type": "kafka",  // or "kinesis"
      "maxRowsInMemory": 100000,
      "maxRowsPerSegment": 5000000
    }
  }
}
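As one way the elided sections might be filled in, here is a minimal Kafka supervisor spec for a JSON topic (the datasource, topic, broker address, and column names are illustrative):

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user", "page", "country"] },
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "NONE" }
    },
    "ioConfig": {
      "topic": "my-topic",
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" },
      "inputFormat": { "type": "json" },
      "taskCount": 1,
      "replicas": 1,
      "taskDuration": "PT1H"
    },
    "tuningConfig": { "type": "kafka", "maxRowsInMemory": 100000 }
  }
}
```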

Comparison: Kafka vs Kinesis

Both streaming sources:
✅ Support exactly-once processing
✅ Handle late-arriving data
✅ Provide real-time query capabilities
✅ Are managed by supervisors
✅ Support schema discovery
✅ Support all input formats
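Schema discovery, for example, is enabled the same way for both sources through the dimensionsSpec (sketch):

```json
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true  // let Druid infer dimension names and types from incoming events
  }
}
```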

Getting Started

1. Choose Your Stream

Decide between Kafka and Kinesis based on your infrastructure.
2. Load Extension

Add the appropriate extension to druid.extensions.loadList in your Druid configuration:
druid.extensions.loadList=["druid-kafka-indexing-service"]
or
druid.extensions.loadList=["druid-kinesis-indexing-service"]
3. Create Supervisor Spec

Define your supervisor specification with:
  • Data schema (what to ingest)
  • I/O config (how to connect)
  • Tuning config (performance settings)
4. Submit Supervisor

Submit via web console or API:
curl -X POST -H 'Content-Type: application/json' \
  -d @supervisor-spec.json \
  http://overlord:8090/druid/indexer/v1/supervisor
5. Monitor Ingestion

Monitor your supervisor and tasks:
  • Check supervisor status
  • View task progress
  • Monitor lag metrics
  • Review error reports

Performance Considerations

Configure parallelism with taskCount and replicas:
{
  "ioConfig": {
    "taskCount": 3,    // 3 reading tasks per replica set
    "replicas": 2      // 2 replica sets
  }
}
Total tasks = taskCount * replicas = 6 tasks
Each task consumes one partition/shard, so taskCount should not exceed the number of partitions/shards.
Balance between real-time queries and segment size:
  • Shorter duration: More frequent handoffs, smaller segments
  • Longer duration: Fewer handoffs, larger segments
{
  "ioConfig": {
    "taskDuration": "PT1H"  // 1 hour task duration
  }
}
Control memory usage:
{
  "tuningConfig": {
    "maxRowsInMemory": 100000,
    "maxBytesInMemory": 134217728,  // 128MB
    "maxRowsPerSegment": 5000000
  }
}
Enable task autoscaling based on lag:
{
  "ioConfig": {
    "autoScalerConfig": {
      "enableTaskAutoScaler": true,
      "taskCountMax": 10,
      "taskCountMin": 2,
      "autoScalerStrategy": "lagBased",
      "lagCollectionIntervalMillis": 30000
    }
  }
}

Common Use Cases

Real-Time Analytics

Process and query events as they arrive for real-time dashboards and alerting

Log Aggregation

Centralize and analyze logs from distributed systems in real-time

IoT Data

Ingest sensor data and device telemetry for monitoring and analysis

User Activity

Track user events and behavior for analytics and personalization

Troubleshooting

If your ingestion is falling behind:
  1. Increase taskCount (up to partition count)
  2. Enable autoscaling
  3. Increase maxRowsPerSegment
  4. Add more Middle Manager capacity
  5. Check for slow queries blocking ingestion
If tasks are failing:
  1. Check task logs for errors
  2. Verify stream connectivity
  3. Check data format matches spec
  4. Review parse exceptions
  5. Ensure sufficient memory
If data is not parsing correctly:
  1. Verify input format configuration
  2. Check sample data against spec
  3. Review parseExceptionHandler settings
  4. Monitor maxParseExceptions threshold
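These parse-exception settings live in the tuningConfig; a sketch with illustrative thresholds:

```json
{
  "tuningConfig": {
    "logParseExceptions": true,     // log each unparseable row
    "maxParseExceptions": 100,      // fail the task after 100 parse exceptions
    "maxSavedParseExceptions": 10   // keep the 10 most recent exceptions in the task report
  }
}
```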

Next Steps

Kafka Ingestion

Learn about Kafka-specific configuration

Kinesis Ingestion

Learn about Kinesis-specific configuration

Supervisor

Deep dive into supervisor management

Data Formats

Explore supported input formats
