How Ingestion Works
For most ingestion methods, the Druid Middle Manager processes or the Indexer processes load your source data. During ingestion, Druid creates segments and stores them in deep storage. Historical nodes load the segments into memory to respond to queries. For streaming ingestion, the Middle Managers and Indexers can also respond to queries in real time using arriving data.
Core Concepts
Before you start ingesting data, familiarize yourself with these key concepts:
Schema Model
Learn about datasources, primary timestamp, dimensions, and metrics
Data Rollup
Understand rollup and how to maximize its benefits
Partitioning
Learn about time chunk and secondary partitioning
Schema Design
Best practices for designing your Druid schema
Ingestion Methods
Streaming Ingestion
Streaming ingestion is controlled by a continuously running supervisor that manages indexing tasks.
- Kafka
- Kinesis
Apache Kafka Integration
- Supervisor type: kafka
- Reads directly from Apache Kafka
- Can ingest late data: Yes
- Exactly-once guarantees: Yes
Kafka Ingestion Guide
Learn how to ingest from Apache Kafka
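For orientation, here is a minimal sketch of a Kafka supervisor spec. The broker address, topic, datasource name, and column names are placeholders rather than values from this guide; a spec like this is submitted to the Supervisor API (POST /druid/indexer/v1/supervisor) to start ingestion.

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "topic": "metrics",
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    },
    "tuningConfig": { "type": "kafka" },
    "dataSchema": {
      "dataSource": "metrics-kafka",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["service", "host"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "none",
        "rollup": false
      }
    }
  }
}
```

The supervisor runs continuously, launching and relaunching Kafka indexing tasks as needed; a production spec would typically expand the dataSchema and tune tuningConfig for throughput.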
Batch Ingestion
Batch ingestion jobs are associated with a controller task that runs for the duration of the job.
Native Batch
JSON-based Batch Ingestion
- Controller task type: index_parallel
- Submit via Tasks API
- No external dependencies
- Supports any inputSource and inputFormat
- Dynamic, hash-based, and range-based partitioning
Native Batch Guide
Learn about JSON-based batch ingestion
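As a sketch of the shape of a JSON-based batch spec, the following index_parallel task reads newline-delimited JSON from a local directory; the directory, datasource name, and dimension list are assumptions for illustration. The spec is submitted to the Tasks API (POST /druid/indexer/v1/task).

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "/tmp/wikipedia",
        "filter": "*.json"
      },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": { "type": "dynamic" },
      "maxNumConcurrentSubTasks": 2
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["page", "user", "channel"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    }
  }
}
```

Setting maxNumConcurrentSubTasks above 1 is what enables the parallel subtask execution mentioned in the feature comparison below.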
SQL-based
Multi-Stage Query Engine
- Controller task type: query_controller
- Submit INSERT or REPLACE statements
- No external dependencies
- Range partitioning with CLUSTERED BY
- Always perfect rollup
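A hedged sketch of how SQL-based ingestion is submitted: the INSERT statement below, wrapped in a JSON payload for the SQL task API (POST /druid/v2/sql/task), reads from an existing datasource and writes day-granularity segments clustered on a column. The datasource and column names are placeholders.

```json
{
  "query": "INSERT INTO \"metrics_rollup\" SELECT __time, service, SUM(\"count\") AS \"count\" FROM \"metrics\" GROUP BY 1, 2 PARTITIONED BY DAY CLUSTERED BY service",
  "context": {
    "maxNumTasks": 3
  }
}
```

The query_controller task coordinates the query_worker tasks, up to the total set by maxNumTasks (which includes the controller itself).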
Hadoop (Deprecated)
Hadoop-based Ingestion
- Controller task type: index_hadoop
- Requires a Hadoop cluster
- Hash-based or range-based partitioning
- Always perfect rollup
Feature Comparison
Parallelism
- Native batch: Using subtasks if maxNumConcurrentSubTasks > 1
- SQL: Using query_worker subtasks
- Hadoop: Using YARN
Fault Tolerance
- Native batch: Workers automatically relaunched upon failure
- SQL: Controller or worker task failure leads to job failure
- Hadoop: YARN containers automatically relaunched upon failure
Operations
- Native batch: Can append and overwrite
- SQL: Can INSERT (append) and REPLACE (overwrite)
- Hadoop: Can overwrite only
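To make the append/overwrite distinction concrete, here is a minimal, illustrative ioConfig fragment for a native batch task that appends to an existing datasource rather than overwriting it (append mode requires dynamic partitioning; the input source values are placeholders):

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "appendToExisting": true,
    "dropExisting": false,
    "inputSource": { "type": "local", "baseDir": "/tmp/new-data", "filter": "*.json" },
    "inputFormat": { "type": "json" }
  }
}
```

In SQL-based ingestion the same choice is expressed in the statement itself: INSERT appends, while REPLACE overwrites the targeted intervals.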
Getting Started
Next Steps
Data Formats
Learn about supported input formats like JSON, CSV, Parquet, and Avro
Supervisor
Understand supervisors for streaming ingestion
Input Sources
Configure data sources like S3, HDFS, and local files
Ingestion Spec
Complete reference for ingestion specifications