Druid stores data in datasources, which are similar to tables in a traditional RDBMS. Each datasource is partitioned by time and, optionally, further partitioned by other attributes. Each time range is called a chunk (for example, a single day, if your datasource is partitioned by day). Within a chunk, data is partitioned into one or more segments.

Understanding Segments

Each segment is a single file, typically comprising up to a few million rows of data. Since segments are organized into time chunks, it’s helpful to think of segments as living on a timeline.
A datasource may have anywhere from just a few segments up to hundreds of thousands or even millions of segments.

Segment Building Process

Each segment is created by a Middle Manager as mutable and uncommitted. Data is queryable as soon as it is added to an uncommitted segment. The segment building process accelerates later queries by producing a data file that is compact and indexed:
  1. Conversion to columnar format: data is restructured into a column-oriented format for efficient scanning.
  2. Indexing with bitmap indexes: bitmap indexes are created for fast filtering operations.
  3. Compression: multiple compression techniques are applied:
    • Dictionary encoding with id storage minimization for String columns
    • Bitmap compression for bitmap indexes
    • Type-aware compression for all columns
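The first two techniques can be illustrated with a small sketch. This is not Druid's actual storage code (Druid uses compressed bitmap formats such as Roaring); it only shows why dictionary encoding plus per-value bitmaps make filters cheap:

```python
# Illustrative sketch of dictionary encoding for a String column and a
# bitmap index per dictionary value. Plain Python ints stand in for the
# compressed bitmaps Druid actually uses.

def dictionary_encode(values):
    """Map each distinct string to a small integer id; store ids, not strings."""
    dictionary = {}           # value -> id
    ids = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        ids.append(dictionary[v])
    return dictionary, ids

def build_bitmaps(ids, cardinality):
    """One bitmap per dictionary id; bit i is set when row i holds that id."""
    bitmaps = [0] * cardinality
    for row, value_id in enumerate(ids):
        bitmaps[value_id] |= 1 << row
    return bitmaps

column = ["us", "eu", "us", "apac", "eu", "us"]
dictionary, ids = dictionary_encode(column)
bitmaps = build_bitmaps(ids, len(dictionary))

# Filtering "WHERE country = 'us'" becomes a bitmap lookup instead of a scan:
us_rows = bitmaps[dictionary["us"]]
print(bin(us_rows))  # 0b100101 -> rows 0, 2, and 5 match
```

Combining filters then reduces to bitwise AND/OR over bitmaps, which is why these indexes accelerate queries so effectively.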
Periodically, segments are committed and published to deep storage, become immutable, and move from Middle Managers to the Historical services. An entry about the segment is also written to the metadata store. This entry is a self-describing bit of metadata about the segment, including things like the schema of the segment, its size, and its location on deep storage. These entries tell the Coordinator what data is available on the cluster.

Indexing and Handoff

Indexing is the mechanism by which new segments are created, and handoff is the mechanism by which they are published and served by Historical services.

On the Indexing Side

  1. Task starts and determines segment identifier: an indexing task starts running and building a new segment. It must determine the identifier before it starts building:
    • Appending tasks (Kafka task, index task in append mode): call an "allocate" API on the Overlord
    • Overwriting tasks (Hadoop task, index task not in append mode): lock an interval and create a new version number
  2. Real-time data becomes queryable: if the indexing task is a realtime task (like a Kafka task), the segment is immediately queryable at this point. It is available, but unpublished.
  3. Push to deep storage and publish: when the indexing task has finished reading data for the segment, it pushes the segment to deep storage and then publishes it by writing a record into the metadata store.
  4. Wait for handoff (realtime tasks only): if the indexing task is a realtime task, it waits for a Historical service to load the segment, to ensure data remains continuously available for queries. Non-realtime tasks exit immediately.

On the Coordinator / Historical Side

  1. Coordinator polls metadata store: the Coordinator polls the metadata store periodically (by default, every 1 minute) for newly published segments.
  2. Choose Historical for loading: when the Coordinator finds a segment that is published and used, but unavailable, it chooses a Historical service to load that segment and instructs that Historical to do so.
  3. Historical loads segment: the Historical loads the segment and begins serving it.
  4. Handoff completes: at this point, if the indexing task was waiting for handoff, it exits.
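The Coordinator's poll-and-assign step can be sketched as a reconciliation loop. The data structures and the least-loaded placement rule below are simplifications (real Coordinators balance by segment size and load rules), not Druid's API:

```python
# Hypothetical sketch of the Coordinator's loop: find segments that are
# published and used but not yet served by any Historical, and pick a
# Historical to load each one.

def reconcile(metadata_records, historicals):
    """metadata_records: list of dicts with 'id' and 'used'.
    historicals: dict mapping historical name -> set of served segment ids."""
    served = set().union(*historicals.values()) if historicals else set()
    assignments = {}
    for record in metadata_records:
        if record["used"] and record["id"] not in served:
            # Simplification: assign to the least-loaded Historical.
            target = min(historicals, key=lambda h: len(historicals[h]))
            assignments[record["id"]] = target
            historicals[target].add(record["id"])
    return assignments

metadata = [{"id": "seg-1", "used": True}, {"id": "seg-2", "used": True}]
historicals = {"historical-a": {"seg-1"}, "historical-b": set()}
print(reconcile(metadata, historicals))  # {'seg-2': 'historical-b'}
```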

Segment Identifiers

Segments all have a four-part identifier with the following components:
  • Datasource name (string, required): the name of the datasource
  • Time interval (string, required): the time chunk containing the segment; corresponds to the segmentGranularity specified at ingestion time
  • Version number (string, required): generally an ISO8601 timestamp corresponding to when the segment set was first started
  • Partition number (integer, required): an integer unique within a datasource+interval+version; values are not necessarily contiguous

Example Segment Identifiers

For a segment in datasource clarity-cloud0, time chunk 2018-05-21T16:00:00.000Z/2018-05-21T17:00:00.000Z, version 2018-05-21T15:56:09.909Z, and partition number 1:
clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z_1
Segments with partition number 0 (the first partition in a chunk) omit the partition number from the identifier.
clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z
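Based on the format shown above, composing an identifier looks like the following. The helper name is illustrative, not part of Druid; note also that splitting an identifier back into parts naively on `_` is ambiguous when the datasource name itself contains underscores:

```python
# Sketch of composing a four-part segment identifier. Components are joined
# with '_', and partition number 0 omits the trailing number.

def make_segment_id(datasource, interval_start, interval_end, version, partition):
    parts = [datasource, interval_start, interval_end, version]
    if partition != 0:          # the first partition in a chunk omits its number
        parts.append(str(partition))
    return "_".join(parts)

print(make_segment_id(
    "clarity-cloud0",
    "2018-05-21T16:00:00.000Z", "2018-05-21T17:00:00.000Z",
    "2018-05-21T15:56:09.909Z", 1))
# clarity-cloud0_2018-05-21T16:00:00.000Z_2018-05-21T17:00:00.000Z_2018-05-21T15:56:09.909Z_1
```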

Segment Versioning

The version number provides a form of multi-version concurrency control (MVCC) to support batch-mode overwriting.
If all you ever do is append data, then there will be just a single version for each time chunk.
When batch-mode overwriting creates a new version for a time chunk, the switch appears to happen instantaneously to a user, because Druid handles this by:
  1. First loading the new data (but not allowing it to be queried)
  2. As soon as the new data is all loaded, switching all new queries to use those new segments
  3. Then dropping the old segments a few minutes later

Segment Lifecycle

Each segment has a lifecycle that involves three major areas:

Metadata Store

Segment metadata (a small JSON payload) is stored once a segment is done being constructed. The act of inserting a record is called publishing. Records have a boolean flag named used that controls whether the segment is intended to be queryable.

Deep Storage

Segment data files are pushed to deep storage once a segment is done being constructed. This happens immediately before publishing metadata to the metadata store.

Availability for Querying

Segments are available for querying on some Druid data server: a realtime task, a Historical service, or directly from deep storage.

Segment State Flags

You can inspect the state of currently active segments using the Druid SQL sys.segments table. It includes the following flags:
  • is_published: True if segment metadata has been published to the metadata store and used is true
  • is_available: True if the segment is currently available for querying, either on a realtime task or Historical service
  • is_realtime: True if the segment is only available on realtime tasks
  • is_overshadowed: True if the segment is published and is fully overshadowed by some other published segments
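These flags can be inspected with a Druid SQL query against the Broker's HTTP API. The `/druid/v2/sql` endpoint and the `sys.segments` columns are standard Druid; the host, port, and datasource name below are placeholders:

```python
# Sketch of checking segment state flags via the Druid SQL HTTP endpoint.
import json
from urllib import request

SQL = """
SELECT segment_id, is_published, is_available, is_realtime, is_overshadowed
FROM sys.segments
WHERE datasource = 'clarity-cloud0'
"""

def sys_segments_request(broker_url="http://localhost:8888"):
    """Build a POST to the Broker's SQL endpoint; the caller executes it."""
    payload = json.dumps({"query": SQL, "resultFormat": "object"}).encode()
    return request.Request(
        broker_url + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = sys_segments_request()
# rows = json.load(request.urlopen(req))   # run this against a live cluster
print(req.full_url)  # http://localhost:8888/druid/v2/sql
```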

Availability and Consistency

Druid has an architectural separation between ingestion and querying. To understand Druid's availability and consistency properties, we must look at each function separately.

Ingestion Side: Transactional Guarantees

Druid’s primary ingestion methods are all pull-based and offer transactional guarantees. Ingestion using these methods will publish in an all-or-nothing manner:
For streaming methods like Kafka and Kinesis, Druid commits stream offsets to its metadata store alongside segment metadata, in the same transaction. This ensures exactly-once publishing behavior.
Ingestion of data that has not yet been published can be rolled back if ingestion tasks fail. Partially-ingested data is discarded, and Druid resumes ingestion from the last committed set of stream offsets.
For batch ingestion, each task publishes all segment metadata in a single transaction:
  • Parallel mode: the supervisor task publishes all segment metadata in a single transaction after the subtasks finish
  • Simple mode: the single task publishes all segment metadata in a single transaction after it completes
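The all-or-nothing publish can be sketched with an ordinary database transaction. SQLite stands in for Druid's metadata store here; the schema and function are illustrative only:

```python
# Sketch: stream offsets and segment metadata written in one transaction,
# so either both commit or neither does.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE segments (id TEXT PRIMARY KEY, payload TEXT);
CREATE TABLE offsets  (partition_id INTEGER PRIMARY KEY, offset INTEGER);
""")

def publish(segments, offsets):
    """Commit segment metadata and the stream offsets atomically."""
    with db:  # one transaction: everything rolls back if any statement fails
        db.executemany("INSERT INTO segments VALUES (?, ?)", segments)
        db.executemany("INSERT OR REPLACE INTO offsets VALUES (?, ?)", offsets)

publish([("seg-1", "{...}")], [(0, 42)])
try:
    # Re-publishing the same segment violates the primary key, so the new
    # offsets are rolled back too: no partial state, exactly-once publishing.
    publish([("seg-1", "{...}")], [(0, 99)])
except sqlite3.IntegrityError:
    pass
print(db.execute("SELECT offset FROM offsets").fetchone())  # (42,)
```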

Idempotency Guarantee

Some ingestion methods offer an idempotency guarantee. This means that repeated executions of the same ingestion will not cause duplicate data to be ingested:
  • ✅ Supervised seekable-stream ingestion (Kafka, Kinesis) - idempotent
  • ⚠️ Hadoop-based batch ingestion - idempotent unless input source is the same Druid datasource
  • ⚠️ Native batch ingestion - idempotent unless appendToExisting is true or input source is the same Druid datasource

Query Side: Atomic Replacement

The Druid Broker is responsible for ensuring that a consistent set of segments is involved in a given query. It selects the appropriate set of segment versions to use when the query starts based on what is currently available.
Atomic replacement ensures that from a user’s perspective, queries flip instantaneously from an older version of data to a newer set of data, with no consistency or performance impact. This is used for Hadoop-based batch ingestion, native batch ingestion when appendToExisting is false, and compaction.
Atomic replacement happens for each time chunk individually. If a batch ingestion task or compaction involves multiple time chunks, then each time chunk will undergo atomic replacement soon after the task finishes, but the replacements will not all happen simultaneously.
Atomic replacement in Druid is based on a core set concept that works in conjunction with segment versions:
  1. When a time chunk is overwritten, a new core set of segments is created with a higher version number
  2. The core set must all be available before the Broker will use them instead of the older set
  3. There can be only one core set per version per time chunk
  4. Druid uses only a single version at a time per time chunk
Together, these properties provide Druid’s atomic replacement guarantees.
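The version-selection rule can be sketched as follows. The data structures are hypothetical, not Druid's internals; the point is that a newer version is ignored until its entire core set is available, so queries never see a partial overwrite:

```python
# Sketch of per-time-chunk version selection under the core set rule.

def choose_segments(chunk_segments):
    """chunk_segments: version -> {'core': set of ids, 'available': set of ids}.
    Returns the segment ids to query for this time chunk."""
    for version in sorted(chunk_segments, reverse=True):  # newest version first
        state = chunk_segments[version]
        if state["core"] <= state["available"]:  # whole core set loaded?
            return state["available"]            # single version per chunk
    return set()

chunk = {
    "2018-05-21T15:56:09.909Z": {"core": {"a0", "a1"}, "available": {"a0", "a1"}},
    "2018-05-22T09:00:00.000Z": {"core": {"b0", "b1"}, "available": {"b0"}},
}
# The newer version's core set is incomplete, so the older version is used:
print(choose_segments(chunk))
```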

Handling Unavailable Segments

If segments become unavailable due to multiple Historicals going offline simultaneously (beyond your replication factor), then Druid queries will include only the segments that are still available. In the background, Druid will reload these unavailable segments on other Historicals as quickly as possible, at which point they will be included in queries again.
