Apache Druid stores data partitioned by time chunk in immutable files called segments. This section covers the essential data management operations for working with segments in Druid.

Overview

Druid’s segment-based architecture requires specific approaches for managing data after ingestion. Since segments are immutable, operations like updates and deletions work by creating new segments that replace existing ones.

Core Data Management Operations

Segment Management

Druid stores data in segments - immutable files partitioned by time chunk. Understanding segment lifecycle and optimization is key to maintaining query performance.

Compaction

Optimize segment size and improve query performance by reindexing existing data

Updates

Overwrite or reindex existing data with new values

Deletion

Remove data by time range or specific records

Retention Rules

Automatically manage data lifecycle with load and drop rules

Data Lifecycle

Data management in Druid involves several stages:
  1. Ingestion: Data is ingested and written into segments partitioned by time
  2. Optimization: Segments are compacted to optimal sizes for query performance
  3. Retention: Retention rules determine which segments remain available for queries
  4. Deletion: Old or unused data is marked unused and optionally removed from deep storage

Key Concepts

Segment Lifecycle

Segments transition through different states:
  • Used: Active segments available for querying
  • Unused: Segments marked for deletion but still in deep storage
  • Deleted: Permanently removed from deep storage

Atomic Updates

Druid’s atomic update mechanism ensures queries seamlessly transition from old data to new data on a time-chunk-by-time-chunk basis. This means:
  • No partial updates visible to queries
  • No downtime during data replacement
  • Time-based locking prevents conflicts
Druid does not support single-record updates by primary key. To update specific records, you must reindex the entire time interval containing those records.

Common Operations

Updates and Overwrites

To update existing data:
  • Use native batch ingestion with appendToExisting: false
  • Use SQL REPLACE statements for overwriting data
  • Reindex using the Druid input source to modify existing segments
See Data Updates for detailed examples.
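As a hedged sketch, an overwrite via native batch ingestion might look like the spec below, which reindexes one day of a hypothetical "wikipedia" datasource using the Druid input source with appendToExisting set to false (the datasource name and interval are illustrative, not from the original text):

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2023-01-01/2023-01-02"
      },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "__time", "format": "millis" },
      "dimensionsSpec": { "useSchemaDiscovery": true },
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "intervals": ["2023-01-01/2023-01-02"]
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

A transformSpec could be added to dataSchema to rewrite column values during the reindex; because appendToExisting is false, the new segments atomically replace the old ones for the listed interval.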

Compaction Strategies

Compaction improves performance by:
  • Combining many small segments into optimally-sized ones
  • Changing segment or query granularity
  • Reordering dimensions for better compression
  • Removing unused columns or applying rollup
See Compaction for use cases and configuration.
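A minimal manual compaction task, submitted to the Overlord, might look like the following sketch (the datasource name, interval, and row target are assumptions for illustration):

```json
{
  "type": "compact",
  "dataSource": "wikipedia",
  "ioConfig": {
    "type": "compact",
    "inputSpec": { "type": "interval", "interval": "2023-01-01/2023-02-01" }
  },
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": { "type": "dynamic", "maxRowsPerSegment": 5000000 }
  }
}
```

This rewrites all segments in the given interval into fewer, more uniformly sized segments without changing the data itself.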

Retention Management

Control data retention with:
  • Load rules: Define which segments to load on Historical servers, and how many replicas to keep per tier
  • Drop rules: Mark segments as unused based on time periods or intervals
  • Broadcast rules: Replicate segments of a datasource to every server in the cluster
See Rule Configuration for detailed rule types.
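Rules are evaluated top to bottom, so a common pattern is a period-based load rule followed by a catch-all drop rule. A hedged example (the period and replica counts are illustrative):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P3M",
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```

With this rule chain, segments covering the most recent three months stay loaded with two replicas on the default tier, and anything older is marked unused.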

Best Practices

Automatic Compaction: Enable auto-compaction for all datasources to maintain optimal segment sizes without manual intervention.
Retention Rules: Set up retention rules early to automatically manage data lifecycle and prevent unlimited storage growth.
Kill Tasks: After marking segments unused via drop rules, use kill tasks to permanently delete data from deep storage.
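A kill task for permanently deleting unused segments from deep storage might be sketched as follows (datasource name and interval are hypothetical):

```json
{
  "type": "kill",
  "dataSource": "wikipedia",
  "interval": "2020-01-01/2021-01-01"
}
```

The task deletes only segments already marked unused within the given interval; segments still marked used are unaffected.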

Learn More

Segment Optimization

Guidelines for optimal segment sizing

Schema Changes

Modify datasource schemas for new and existing data

Storage Design

Deep dive into Druid’s storage architecture

Manual Compaction

Submit one-time compaction tasks for specific intervals
