Loading data in Druid is called ingestion or indexing. When you ingest data into Druid, Druid reads the data from your source system and stores it in data files called segments. In general, segment files contain a few million rows each.

How Ingestion Works

For most ingestion methods, Middle Manager or Indexer processes load your source data. During ingestion, Druid creates segments and stores them in deep storage. Historical processes then load the segments to serve queries. For streaming ingestion, Middle Managers and Indexers can also answer queries against arriving data in real time.

Core Concepts

Before you start ingesting data, familiarize yourself with these key concepts:

Schema Model

Learn about datasources, primary timestamp, dimensions, and metrics

Data Rollup

Understand rollup and how to maximize its benefits

Partitioning

Learn about time chunk and secondary partitioning

Schema Design

Best practices for designing your Druid schema

Ingestion Methods

Streaming Ingestion

Streaming ingestion is controlled by a continuously running supervisor that manages indexing tasks.
Apache Kafka Integration
  • Supervisor type: kafka
  • Reads directly from Apache Kafka
  • Can ingest late data: Yes
  • Exactly-once guarantees: Yes
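A minimal Kafka supervisor spec sketching the pieces above might look like the following. The datasource name, topic, column names, and broker address are placeholders for your own values:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "user"] }
    },
    "ioConfig": {
      "topic": "my-topic",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" }
    }
  }
}
```

Submitting this spec to the Supervisors API starts a supervisor that launches and manages the Kafka indexing tasks for you.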

Kafka Ingestion Guide

Learn how to ingest from Apache Kafka

Batch Ingestion

Batch ingestion jobs are associated with a controller task that runs for the duration of the job.
JSON-based Batch Ingestion
  • Controller task type: index_parallel
  • Submit via Tasks API
  • No external dependencies
  • Supports any inputSource and inputFormat
  • Dynamic, hash-based, and range-based partitioning
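A minimal `index_parallel` task spec, assuming newline-delimited JSON files under a local directory (paths, column names, and the subtask count are placeholders):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "my-datasource",
      "timestampSpec": { "column": "timestamp", "format": "auto" },
      "dimensionsSpec": { "dimensions": ["channel", "user"] }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data", "filter": "*.json" },
      "inputFormat": { "type": "json" }
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4,
      "partitionsSpec": { "type": "dynamic" }
    }
  }
}
```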

Native Batch Guide

Learn about JSON-based batch ingestion
Multi-Stage Query Engine
  • Controller task type: query_controller
  • Submit INSERT or REPLACE statements
  • No external dependencies
  • Range partitioning with CLUSTERED BY
  • Always perfect rollup
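With the multi-stage query engine you express the same kind of job as SQL instead of a JSON spec. A sketch of a `REPLACE` statement reading local JSON files (datasource, paths, and columns are placeholders):

```sql
REPLACE INTO "my-datasource" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "user"
FROM TABLE(
  EXTERN(
    '{"type": "local", "baseDir": "/data", "filter": "*.json"}',
    '{"type": "json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "user" VARCHAR)
PARTITIONED BY DAY
CLUSTERED BY "channel"
```

`PARTITIONED BY` sets the time chunking, and `CLUSTERED BY` provides the range partitioning noted above.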
Hadoop-based Ingestion
  • Controller task type: index_hadoop
  • Requires Hadoop cluster
  • Hash-based or range-based partitioning
  • Always perfect rollup

Feature Comparison

Parallelism
  • Native batch: Using subtasks if maxNumConcurrentSubTasks > 1
  • SQL: Using query_worker subtasks
  • Hadoop: Using YARN
Fault Tolerance
  • Native batch: Workers automatically relaunched upon failure
  • SQL: Controller or worker task failure leads to job failure
  • Hadoop: YARN containers automatically relaunched upon failure
Operations
  • Native batch: Can append and overwrite
  • SQL: Can INSERT (append) and REPLACE (overwrite)
  • Hadoop: Can overwrite only

Getting Started

1. Choose Your Ingestion Method: Select streaming for real-time data or batch for historical data.
2. Design Your Schema: Define your datasource, timestamp, dimensions, and metrics.
3. Configure Your Spec: Create an ingestion spec with dataSchema, ioConfig, and tuningConfig.
4. Submit and Monitor: Submit your spec and monitor the ingestion process.
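The spec in step 3 always has the same three top-level sections, regardless of ingestion method. A skeleton (the values here are descriptions, not real fields):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { "...": "datasource name, timestamp, dimensions, metrics, rollup" },
    "ioConfig":   { "...": "where and how to read the input data" },
    "tuningConfig": { "...": "optional performance settings such as partitioning" }
  }
}
```

For step 4, batch task specs are submitted through the Tasks API and streaming supervisor specs through the Supervisors API; the web console can also submit either and shows task status for monitoring.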

Next Steps

Data Formats

Learn about supported input formats like JSON, CSV, Parquet, and Avro

Supervisor

Understand supervisors for streaming ingestion

Input Sources

Configure data sources like S3, HDFS, and local files

Ingestion Spec

Complete reference for ingestion specifications
