Overview
Druid’s segment-based architecture requires specific approaches to managing data after ingestion. Since segments are immutable, operations such as updates and deletions work by creating new segments that replace existing ones.
Core Data Management Operations
Segment Management
Druid stores data in segments: immutable files partitioned by time chunk. Understanding the segment lifecycle and segment optimization is key to maintaining query performance.
Compaction
Optimize segment size and improve query performance by reindexing existing data
Updates
Overwrite or reindex existing data with new values
Deletion
Remove data by time range or specific records
Retention Rules
Automatically manage data lifecycle with load and drop rules
Data Lifecycle
Data management in Druid involves several stages over the segment lifecycle.
Key Concepts
Segment Lifecycle
Segments transition through different states:
- Used: Active segments available for querying
- Unused: Segments marked for deletion but still in deep storage
- Deleted: Permanently removed from deep storage
Atomic Updates
Druid’s atomic update mechanism ensures that queries transition seamlessly from old data to new data on a time-chunk-by-time-chunk basis. This means:
- No partial updates visible to queries
- No downtime during data replacement
- Time-based locking prevents conflicts
Common Operations
Updates and Overwrites
To update existing data:
- Use native batch ingestion with `appendToExisting: false`
- Use SQL `REPLACE` statements for overwriting data
- Reindex using the Druid input source to modify existing segments
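As an illustrative sketch of the SQL `REPLACE` path, the query below rewrites one day of data via Druid's SQL-based ingestion task endpoint. The datasource name (`wikipedia`), the interval, and the request shape are assumptions for the example, not values from this page.

```python
import json

# Hypothetical example: rewrite one day of a "wikipedia" datasource in place.
# OVERWRITE WHERE limits the rewrite to the matching time chunks;
# OVERWRITE ALL would replace the entire datasource.
sql = """
REPLACE INTO wikipedia
OVERWRITE WHERE __time >= TIMESTAMP '2016-06-27' AND __time < TIMESTAMP '2016-06-28'
SELECT *
FROM wikipedia
WHERE __time >= TIMESTAMP '2016-06-27' AND __time < TIMESTAMP '2016-06-28'
PARTITIONED BY DAY
""".strip()

# Request body for POST /druid/v2/sql/task on the Router or Broker.
payload = {"query": sql}
print(json.dumps(payload, indent=2))
```

Because replacement is atomic per time chunk, queries see either the old day or the new day, never a mix of the two.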
Compaction Strategies
Compaction improves performance by:
- Combining many small segments into optimally sized ones
- Changing segment or query granularity
- Reordering dimensions for better compression
- Removing unused columns or applying rollup
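A minimal sketch of a one-off compaction task spec, assuming a datasource named `wikipedia` and an illustrative interval; the tuning values are placeholders, not recommendations:

```python
import json

# Hypothetical manual compaction task spec (submitted to the Overlord at
# POST /druid/indexer/v1/task). It rewrites the segments in the given
# interval as day-granularity segments of roughly 5 million rows each.
compaction_task = {
    "type": "compact",
    "dataSource": "wikipedia",  # assumed datasource name
    "ioConfig": {
        "type": "compact",
        "inputSpec": {"type": "interval", "interval": "2016-06-27/2016-06-28"},
    },
    "granularitySpec": {"segmentGranularity": "day"},
    "tuningConfig": {
        "type": "index_parallel",
        "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000},
    },
}
print(json.dumps(compaction_task, indent=2))
```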
Retention Management
Control data retention with:
- Load rules: Define which segments to keep on Historical servers
- Drop rules: Mark segments as unused based on time periods or intervals
- Broadcast rules: Load copies of segments onto every server in the cluster
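A sketch of a simple rule chain that keeps the most recent 30 days loaded and drops everything older. The tier name and replica count below are assumptions; rules are evaluated top to bottom and the first match wins, so the catch-all drop rule goes last.

```python
import json

# Hypothetical retention rule chain for one datasource, as it would be
# POSTed to the Coordinator at /druid/coordinator/v1/rules/<datasource>.
rules = [
    {
        "type": "loadByPeriod",
        "period": "P30D",                          # keep the most recent 30 days
        "includeFuture": True,
        "tieredReplicants": {"_default_tier": 2},  # assumed tier and replica count
    },
    {"type": "dropForever"},                       # everything else becomes unused
]
print(json.dumps(rules, indent=2))
```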
Best Practices
Automatic Compaction: Enable auto-compaction for all datasources to maintain optimal segment sizes without manual intervention.
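Auto-compaction is configured per datasource on the Coordinator. The sketch below shows the shape of such a config; the `skipOffsetFromLatest` and row-count values are illustrative, not recommendations.

```python
import json

# Hypothetical auto-compaction config, as submitted to the Coordinator at
# POST /druid/coordinator/v1/config/compaction.
auto_compaction = {
    "dataSource": "wikipedia",      # assumed datasource name
    "skipOffsetFromLatest": "P1D",  # skip the newest day, which may still receive data
    "tuningConfig": {
        "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000}
    },
}
print(json.dumps(auto_compaction, indent=2))
```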
Retention Rules: Set up retention rules early to automatically manage data lifecycle and prevent unlimited storage growth.
Kill Tasks: After marking segments unused via drop rules, use kill tasks to permanently delete data from deep storage.
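As a sketch, a kill task spec looks like the following; the datasource name and interval are placeholders. Kill tasks only remove segments that are already marked unused, and deletion from deep storage is irreversible.

```python
import json

# Hypothetical kill task spec (submitted to the Overlord at
# POST /druid/indexer/v1/task). Permanently deletes unused segments in the
# interval from deep storage and from the metadata store.
kill_task = {
    "type": "kill",
    "dataSource": "wikipedia",            # assumed datasource name
    "interval": "2016-01-01/2016-06-01",  # placeholder interval
}
print(json.dumps(kill_task, indent=2))
```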
Learn More
Segment Optimization
Guidelines for optimal segment sizing
Schema Changes
Modify datasource schemas for new and existing data
Storage Design
Deep dive into Druid’s storage architecture
Manual Compaction
Submit one-time compaction tasks for specific intervals