Apache Druid stores data in immutable segments partitioned by time chunk. This architecture requires specific approaches for updating existing data.
Update Methods
Druid does not support single-record updates by primary key. To update data, you must overwrite or reindex entire time intervals.
Overwrite Existing Data
Overwriting replaces existing data for a specific time range while leaving data outside that range untouched.
How Overwrites Work
Druid’s atomic update mechanism ensures:
Queries seamlessly flip from old to new data
Updates occur on a time-chunk-by-time-chunk basis
No partial updates are visible to queries
No downtime during data replacement
While an overwrite job is ongoing for a time range, new ingestions for that same time range are queued. Ingestions for other time ranges proceed normally. Read-only queries continue using the existing data.
Native Batch Overwrite
Use native batch ingestion with specific settings:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { ... },
      "dimensionsSpec": { ... },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "intervals": ["2023-01-01/2023-01-02"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { ... },
      "appendToExisting": false
    },
    "tuningConfig": { ... }
  }
}
Key settings:
appendToExisting: false - Overwrites existing data
intervals - Specifies which time range to replace
See Native Batch Ingestion for full configuration.
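The spec above can also be assembled and submitted programmatically. A minimal Python sketch, assuming a default local deployment where the Router on localhost:8888 proxies the Overlord task endpoint; the timestampSpec and dimensionsSpec values here are illustrative placeholders, not part of the example above:

```python
import json
from urllib import request

def build_overwrite_spec(datasource, interval, input_source):
    """Build a native batch spec that overwrites one time interval.

    appendToExisting=false plus an explicit interval is what makes the
    job replace just that range and leave other time chunks untouched.
    """
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Placeholder schema: adjust to the actual input data.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"useSchemaDiscovery": True},
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "hour",
                    "intervals": [interval],
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": input_source,
                "appendToExisting": False,
            },
        },
    }

spec = build_overwrite_spec(
    "wikipedia",
    "2023-01-01/2023-01-02",
    {"type": "http", "uris": ["https://example.com/data.json"]},
)

# Submit to the Overlord task endpoint (URL assumes a local Router).
req = request.Request(
    "http://localhost:8888/druid/indexer/v1/task",
    data=json.dumps(spec).encode(),
    headers={"Content-Type": "application/json"},
)
# request.urlopen(req)  # uncomment to submit against a live cluster
```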
SQL REPLACE Overwrite
Use SQL REPLACE statements to overwrite data:
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS __time,
  channel,
  "user",
  added,
  deleted
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/data.json"]}',
    '{"type":"json"}'
  )
)
PARTITIONED BY DAY
Overwrite specific time range:
REPLACE INTO wikipedia
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-02-01'
SELECT
  TIME_PARSE("timestamp") AS __time,
  channel,
  updated_field
FROM TABLE(...)
PARTITIONED BY DAY
SQL REPLACE provides more flexible data transformations during overwrites compared to native batch.
See SQL REPLACE Reference for complete syntax.
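Because the statement is plain text, the bounded-interval form is easy to template. A hypothetical Python helper (the datasource, interval, and SELECT body are placeholders; the payload shape targets Druid's SQL task API at /druid/v2/sql/task):

```python
def build_replace_sql(datasource, start, end, select_body):
    """Assemble a REPLACE ... OVERWRITE WHERE statement.

    The TIMESTAMP literals bound exactly which interval is replaced;
    data outside that range is left untouched.
    """
    return (
        f"REPLACE INTO {datasource}\n"
        f"OVERWRITE WHERE __time >= TIMESTAMP '{start}' "
        f"AND __time < TIMESTAMP '{end}'\n"
        f"{select_body}\n"
        "PARTITIONED BY DAY"
    )

sql = build_replace_sql(
    "wikipedia", "2023-01-01", "2023-02-01",
    "SELECT __time, channel, added FROM wikipedia",  # hypothetical SELECT
)
payload = {"query": sql}  # POST this JSON to /druid/v2/sql/task
```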
Reindex Data
Reindexing is an overwrite where the source of new data is the existing data itself. Use reindexing to:
Perform schema changes
Repartition data
Filter out unwanted data
Enrich existing data
Apply new aggregations or rollup
Native Batch Reindex
Use the Druid input source to read existing segments:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "transformSpec": {
        "filter": {
          "type": "selector",
          "dimension": "country",
          "value": "United States"
        },
        "transforms": [
          {
            "type": "expression",
            "name": "normalized_channel",
            "expression": "lower(channel)"
          }
        ]
      },
      "dimensionsSpec": {
        "dimensions": [
          "channel",
          "normalized_channel",
          "user",
          "country"
        ]
      },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "intervals": ["2023-01-01/2023-02-01"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2023-01-01/2023-02-01"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
Key components:
inputSource.type: "druid" - Reads from existing Druid datasource
transformSpec - Filters or modifies data during reindex
dimensionsSpec - Can add or remove dimensions
See Druid Input Source for configuration.
SQL Reindex
Use REPLACE with SELECT from the same datasource:
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT
  __time,
  channel,
  LOWER(channel) AS normalized_channel,
  "user",
  country,
  added,
  deleted
FROM wikipedia
WHERE country = 'United States'
PARTITIONED BY DAY
Change granularity during reindex:
REPLACE INTO metrics_hourly
OVERWRITE ALL
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,
  server,
  SUM(requests) AS total_requests,
  AVG(latency) AS avg_latency
FROM metrics_minutely
GROUP BY 1, 2
PARTITIONED BY DAY
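The TIME_FLOOR-and-aggregate step can be emulated outside Druid to sanity-check expected results. A small Python sketch over made-up minutely rows:

```python
from collections import defaultdict
from datetime import datetime

# Invented minutely rows: (timestamp, server, requests, latency_ms)
minutely = [
    (datetime(2023, 1, 1, 0, 5), "web-1", 10, 20.0),
    (datetime(2023, 1, 1, 0, 45), "web-1", 30, 40.0),
    (datetime(2023, 1, 1, 1, 10), "web-1", 5, 10.0),
]

# Emulate TIME_FLOOR(__time, 'PT1H') + GROUP BY: bucket each row to the
# top of its hour, then SUM(requests) and AVG(latency) per (hour, server).
buckets = defaultdict(lambda: [0, 0.0, 0])  # [req_sum, lat_sum, row_count]
for ts, server, requests, latency in minutely:
    key = (ts.replace(minute=0, second=0, microsecond=0), server)
    b = buckets[key]
    b[0] += requests
    b[1] += latency
    b[2] += 1

hourly = {
    key: {"total_requests": req_sum, "avg_latency": lat_sum / n}
    for key, (req_sum, lat_sum, n) in buckets.items()
}
# hourly now holds one row per (hour, server), like the REPLACE output.
```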
Druid does not have UPDATE or ALTER TABLE statements. Use REPLACE with SELECT to modify existing data.
Update Rolled-Up Datasources
For datasources with rollup enabled, you can effectively update data through appends:
Append New Rows
Ingest rows with identical dimensions to existing rows.
Automatic Query-Time Aggregation
Queries automatically combine rows with matching dimensions using aggregation operators.
Optional Physical Compaction
Use compaction to physically combine matching rows and reduce storage.
Example
Existing data:
__time | country | page_views
2023-01-01 00:00| US | 100
Append new data:
__time | country | page_views
2023-01-01 00:00| US | 50
Query result (automatic aggregation):
__time | country | page_views
2023-01-01 00:00| US | 150
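The arithmetic behind this example can be sketched in a few lines of Python (the tuples are stand-ins for segment rows; SUM is the metric's aggregator):

```python
from collections import defaultdict

existing = [("2023-01-01 00:00", "US", 100)]
appended = [("2023-01-01 00:00", "US", 50)]

# Query-time rollup: rows with identical (__time, country) values
# combine under the metric's aggregation operator (SUM here).
page_views = defaultdict(int)
for t, country, views in existing + appended:
    page_views[(t, country)] += views

print(page_views[("2023-01-01 00:00", "US")])  # → 150
```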
Update with Lookups
For dimensions that change frequently, use lookups instead of reindexing:
When to Use Lookups
ID to name mappings (e.g., product_id → product_name)
Category assignments that change
Status codes to descriptions
Any dimension with values that update regularly
Example Lookup Configuration
{
  "product_names": {
    "type": "map",
    "map": {
      "P001": "Widget Pro",
      "P002": "Gadget Plus",
      "P003": "Tool Max"
    }
  }
}
Query with lookup:
SELECT
  __time,
  product_id,
  LOOKUP(product_id, 'product_names') AS product_name,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY 1, 2
Benefits:
Update mappings without reindexing data
Lower storage requirements (store IDs, not strings)
Centralized dimension value management
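At query time a map lookup behaves like a per-row dictionary get: unmatched keys resolve to NULL, and editing the map changes future query results without touching stored segments. A Python sketch of these semantics, reusing the mapping above (the revenue rows are invented):

```python
# The lookup map from the configuration above.
product_names = {
    "P001": "Widget Pro",
    "P002": "Gadget Plus",
    "P003": "Tool Max",
}

# Invented fact rows: (product_id, revenue). Segments store only the ID.
rows = [("P001", 10.0), ("P003", 5.0), ("P999", 1.0)]

# LOOKUP(product_id, 'product_names'): per-row map get, NULL on miss.
resolved = [(pid, product_names.get(pid), rev) for pid, rev in rows]

# Updating the mapping changes future query results with no reindex.
product_names["P001"] = "Widget Pro v2"
```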
See Lookups for configuration and management.
Concurrent Updates and Ingestion
Manage conflicts between updates and ongoing ingestion:
Enable Concurrent Append and Replace
Allow appends during replace operations:
{
  "type": "index_parallel",
  "context": {
    "useConcurrentLocks": true
  },
  "spec": { ... }
}
For SQL, set useConcurrentLocks: true in the query context when you submit the statement (for example, in the context object of the SQL task API request, or in the web console's query context dialog):
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT * FROM ...
PARTITIONED BY DAY
See Concurrent Append and Replace for details.
Lock Priority
By default:
Ingestion tasks have higher priority
Updates/compaction yield to new ingestion
To change (advanced users only):
{
  "context": {
    "priority": 50
  }
}
Increasing update task priority can cause ingestion to fail or lag. Only change it if you understand the implications.
Common Update Patterns
Add New Dimension
Reindex to add a computed dimension:
REPLACE INTO events
OVERWRITE ALL
SELECT
  __time,
  user_id,
  event_type,
  CASE
    WHEN event_type IN ('purchase', 'add_to_cart') THEN 'conversion'
    ELSE 'engagement'
  END AS event_category
FROM events
PARTITIONED BY DAY
Remove Dimension
Reindex without the unwanted dimension:
REPLACE INTO logs
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01'
SELECT
  __time,
  server,
  endpoint,
  status_code
  -- Omit 'deprecated_field'
FROM logs
WHERE __time >= TIMESTAMP '2023-01-01'
PARTITIONED BY DAY
Change Metric Aggregation
Reindex with different aggregation:
{
  "metricsSpec": [
    {
      "type": "longSum",
      "name": "sum_value",
      "fieldName": "value"
    },
    {
      "type": "longMax",
      "name": "max_value",
      "fieldName": "value"
    }
  ]
}
Filter Out Bad Data
Reindex with filter to remove erroneous records:
{
  "transformSpec": {
    "filter": {
      "type": "and",
      "fields": [
        {
          "type": "not",
          "field": {
            "type": "selector",
            "dimension": "user_id",
            "value": "bot"
          }
        },
        {
          "type": "bound",
          "dimension": "value",
          "lower": "0",
          "lowerStrict": false
        }
      ]
    }
  }
}
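The same predicate is easy to prototype before running the reindex. A Python sketch (row values are invented; lowerStrict=false makes the bound an inclusive >= 0):

```python
def keep(row):
    """Mirror the AND filter: NOT(user_id == 'bot') AND value >= 0.

    lowerStrict=false in the bound filter means the lower bound is
    inclusive, so value == 0 survives.
    """
    return row["user_id"] != "bot" and row["value"] >= 0

rows = [
    {"user_id": "alice", "value": 3},
    {"user_id": "bot", "value": 7},    # dropped: bot traffic
    {"user_id": "carol", "value": -1}, # dropped: negative value
    {"user_id": "dave", "value": 0},   # kept: bound is inclusive
]
cleaned = [r for r in rows if keep(r)]
```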
Best Practices
Use Time Ranges: Update only the necessary time intervals to minimize data processing and lock duration.
Test Transformations: Validate transform logic on a small interval before applying it to the full dataset.
Consider Lookups: For frequently changing dimensions, use lookups instead of repeated reindexing.
Monitor Locks: Watch for lock conflicts between updates and ingestion in high-throughput scenarios.
Learn More
Compaction: Optimize segments after updates
Native Batch: Native batch ingestion reference
SQL REPLACE: SQL-based data replacement
Transform Spec: Filter and transform data during ingestion