Loading data in Druid is called ingestion or indexing. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called segments. In general, segment files contain a few million rows each.

How Ingestion Works

For most ingestion methods, Middle Manager or Indexer processes load your source data. During ingestion, Druid creates segments and stores them in deep storage. Historical processes then load the segments to serve queries. For streaming ingestion, Middle Managers and Indexers can also answer queries against arriving data in real time.

Core Concepts

Before you start ingesting data, familiarize yourself with these key concepts:

Schema Model

Learn about datasources, primary timestamp, dimensions, and metrics

Data Rollup

Understand rollup and how to maximize its benefits

Partitioning

Learn about time chunk and secondary partitioning

Schema Design

Best practices for designing your Druid schema

Ingestion Methods

Streaming Ingestion

Streaming ingestion is controlled by a continuously running supervisor that manages indexing tasks.
Apache Kafka Integration
  • Supervisor type: kafka
  • Reads directly from Apache Kafka
  • Can ingest late data: Yes
  • Exactly-once guarantees: Yes
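A minimal Kafka supervisor spec sketching the pieces above might look like the following. The datasource name, topic, column names, and broker address are placeholders for your own values:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "user"] }
    },
    "ioConfig": {
      "topic": "my-topic",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" }
    }
  }
}
```

Submitting this spec to the Supervisors API starts a supervisor that launches and manages the Kafka indexing tasks for you.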

Kafka Ingestion Guide

Learn how to ingest from Apache Kafka

Batch Ingestion

Batch ingestion jobs are associated with a controller task that runs for the duration of the job.
JSON-based Batch Ingestion
  • Controller task type: index_parallel
  • Submit via Tasks API
  • No external dependencies
  • Supports any inputSource and inputFormat
  • Dynamic, hash-based, and range-based partitioning
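A minimal `index_parallel` task spec, assuming newline-delimited JSON files under a local directory (paths, column names, and the subtask count are placeholders):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["channel", "user"] }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "partitionsSpec": { "type": "dynamic" }
    }
  }
}
```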

Native Batch Guide

Learn about JSON-based batch ingestion
Multi-Stage Query Engine
  • Controller task type: query_controller
  • Submit INSERT or REPLACE statements
  • No external dependencies
  • Range partitioning with CLUSTERED BY
  • Always perfect rollup
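With the multi-stage query engine you express the same kind of job as SQL instead of a JSON spec. A sketch of a `REPLACE` statement reading local JSON files (datasource, paths, and columns are placeholders):

```sql
REPLACE INTO "my-datasource" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "user"
FROM TABLE(
  EXTERN(
    '{"type": "local", "baseDir": "/data", "filter": "*.json"}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "user" VARCHAR)
PARTITIONED BY DAY
CLUSTERED BY "channel"
```

`PARTITIONED BY` sets the time chunking, and `CLUSTERED BY` provides the range partitioning noted above.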
Hadoop-based Ingestion
  • Controller task type: index_hadoop
  • Requires Hadoop cluster
  • Hash-based or range-based partitioning
  • Always perfect rollup

Feature Comparison

Parallelism
  • Native batch: Using subtasks if maxNumConcurrentSubTasks > 1
  • SQL: Using query_worker subtasks
  • Hadoop: Using YARN
Fault Tolerance
  • Native batch: Workers automatically relaunched upon failure
  • SQL: Controller or worker task failure leads to job failure
  • Hadoop: YARN containers automatically relaunched upon failure
Operations
  • Native batch: Can append and overwrite
  • SQL: Can INSERT (append) and REPLACE (overwrite)
  • Hadoop: Can overwrite only

Getting Started

1. Choose Your Ingestion Method: Select streaming for real-time data or batch for historical data.
2. Design Your Schema: Define your datasource, timestamp, dimensions, and metrics.
3. Configure Your Spec: Create an ingestion spec with dataSchema, ioConfig, and tuningConfig.
4. Submit and Monitor: Submit your spec and monitor the ingestion process.
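The spec in step 3 always has the same three top-level sections, regardless of ingestion method. A skeleton (the values here are descriptions, not real fields):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { "...": "datasource name, timestamp, dimensions, metrics, rollup" },
    "ioConfig":   { "...": "where and how to read the input data" },
    "tuningConfig": { "...": "optional performance settings such as partitioning" }
  }
}
```

For step 4, batch task specs are submitted through the Tasks API and streaming supervisor specs through the Supervisors API; the web console can also submit either and shows task status for monitoring.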

Next Steps

Data Formats

Learn about supported input formats like JSON, CSV, Parquet, and Avro

Supervisor

Understand supervisors for streaming ingestion

Input Sources

Configure data sources like S3, HDFS, and local files

Ingestion Spec

Complete reference for ingestion specifications
