Apache Druid stores data in immutable segments partitioned by time chunk. This architecture requires specific approaches for updating existing data.

Update Methods

Druid does not support single-record updates by primary key. To update data, you must overwrite or reindex entire time intervals.

Overwrite Existing Data

Overwriting replaces existing data for a specific time range while leaving data outside that range untouched.

How Overwrites Work

Druid’s atomic update mechanism ensures:
  • Queries seamlessly flip from old to new data
  • Updates occur on a time-chunk-by-time-chunk basis
  • No partial updates are visible to queries
  • No downtime during data replacement
While an overwrite job is ongoing for a time range, new ingestions for that same time range are queued. Ingestions for other time ranges proceed normally. Read-only queries continue using the existing data.
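The flip can be pictured with a toy versioned timeline. The class below is illustrative, not a Druid internal: each time chunk maps versions to segment rows, and queries read only the highest committed version, so a replacement becomes visible atomically per chunk.

```python
# Toy model of Druid's versioned timeline; the class name and version strings
# are illustrative, not Druid internals. Each time chunk maps versions to
# segment rows, and queries read only the highest committed version, so an
# overwrite becomes visible atomically per time chunk.
class SegmentTimeline:
    def __init__(self):
        self.chunks = {}  # interval -> {version: rows}

    def publish(self, interval, version, rows):
        # Commit a full replacement set of rows for one time chunk.
        self.chunks.setdefault(interval, {})[version] = rows

    def query(self, interval):
        # Readers never see a partially written version.
        versions = self.chunks.get(interval, {})
        return versions[max(versions)] if versions else []

timeline = SegmentTimeline()
timeline.publish("2023-01-01/2023-01-02", "v1", [{"channel": "en", "added": 10}])
# An overwrite commits a higher version for the same chunk; queries flip to it:
timeline.publish("2023-01-01/2023-01-02", "v2", [{"channel": "en", "added": 25}])
```

Other time chunks keep serving their existing versions untouched, which is why overwrites proceed chunk by chunk.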

Native Batch Overwrite

Use native batch ingestion with specific settings:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {...},
      "dimensionsSpec": {...},
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "intervals": ["2023-01-01/2023-01-02"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {...},
      "appendToExisting": false
    },
    "tuningConfig": {...}
  }
}
Key settings:
  • appendToExisting: false - Overwrites existing data
  • intervals - Specifies which time range to replace
See Native Batch Ingestion for full configuration.
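As a sketch, such a spec can also be assembled programmatically and POSTed to the Overlord's task API (`/druid/indexer/v1/task`). The helper name and host below are placeholders, and the spec is abbreviated:

```python
import json

# Hypothetical helper (ours, not part of any Druid client) that builds an
# overwrite spec for one time range; timestampSpec, dimensionsSpec, and
# tuningConfig are omitted for brevity.
def build_overwrite_spec(datasource, interval, input_source):
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "granularitySpec": {
                    "segmentGranularity": "day",
                    "queryGranularity": "hour",
                    "intervals": [interval],  # time range to replace
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                "inputSource": input_source,
                "appendToExisting": False,  # overwrite, don't append
            },
        },
    }

spec = build_overwrite_spec(
    "wikipedia",
    "2023-01-01/2023-01-02",
    {"type": "http", "uris": ["https://example.com/data.json"]},
)
body = json.dumps(spec).encode()

# Submit to the Overlord's task API (host is a placeholder; not executed here):
# urllib.request.urlopen(urllib.request.Request(
#     "http://overlord:8081/druid/indexer/v1/task", data=body,
#     headers={"Content-Type": "application/json"}))
```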

SQL REPLACE Overwrite

Use SQL REPLACE statements to overwrite data:
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT 
  TIME_PARSE("timestamp") AS __time,
  channel,
  "user",
  added,
  deleted
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/data.json"]}',
    '{"type":"json"}'
  )
)
PARTITIONED BY DAY
Overwrite specific time range:
REPLACE INTO wikipedia
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-02-01'
SELECT 
  TIME_PARSE("timestamp") AS __time,
  channel,
  updated_field
FROM TABLE(...)
PARTITIONED BY DAY
SQL REPLACE provides more flexible data transformations during overwrites compared to native batch.
See SQL REPLACE Reference for complete syntax.
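To run such a statement programmatically, submit it as an ingestion task: the request body is JSON with the SQL in a `query` field, POSTed to `/druid/v2/sql/task`. A minimal sketch (the Router host below is a placeholder):

```python
import json

# The REPLACE statement runs as an ingestion task; the request body is JSON
# with the SQL in the "query" field.
replace_sql = """
REPLACE INTO wikipedia
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01' AND __time < TIMESTAMP '2023-02-01'
SELECT TIME_PARSE("timestamp") AS __time, channel, "user", added, deleted
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/data.json"]}',
    '{"type":"json"}'
  )
)
PARTITIONED BY DAY
"""

payload = json.dumps({"query": replace_sql})
# urllib.request.Request("http://router:8888/druid/v2/sql/task",
#                        data=payload.encode(),
#                        headers={"Content-Type": "application/json"})
```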

Reindex Data

Reindexing is an overwrite where the source of new data is the existing data itself. Use reindexing to:
  • Perform schema changes
  • Repartition data
  • Filter out unwanted data
  • Enrich existing data
  • Apply new aggregations or rollup

Native Batch Reindex

Use the Druid input source to read existing segments:
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "transformSpec": {
        "filter": {
          "type": "selector",
          "dimension": "country",
          "value": "United States"
        },
        "transforms": [
          {
            "type": "expression",
            "name": "normalized_channel",
            "expression": "lower(channel)"
          }
        ]
      },
      "dimensionsSpec": {
        "dimensions": [
          "channel",
          "normalized_channel",
          "user",
          "country"
        ]
      },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "hour",
        "intervals": ["2023-01-01/2023-02-01"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "druid",
        "dataSource": "wikipedia",
        "interval": "2023-01-01/2023-02-01"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "maxRowsPerSegment": 5000000
    }
  }
}
Key components:
  • inputSource.type: "druid" - Reads from existing Druid datasource
  • transformSpec - Filters or modifies data during reindex
  • dimensionsSpec - Can add or remove dimensions
See Druid Input Source for configuration.

SQL Reindex

Use REPLACE with SELECT from the same datasource:
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT
  __time,
  channel,
  LOWER(channel) AS normalized_channel,
  "user",
  country,
  added,
  deleted
FROM wikipedia
WHERE country = 'United States'
PARTITIONED BY DAY
Change granularity while rolling minutely data up into a coarser datasource:
REPLACE INTO metrics_hourly
OVERWRITE ALL  
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,
  server,
  SUM(requests) AS total_requests,
  AVG(latency) AS avg_latency
FROM metrics_minutely
GROUP BY 1, 2
PARTITIONED BY DAY
Druid does not have UPDATE or ALTER TABLE statements. Use REPLACE with SELECT to modify existing data.
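To sanity-check the rollup logic before running it against Druid, the TIME_FLOOR plus SUM/AVG computation above can be mimicked in a few lines of Python (sample rows are invented for illustration):

```python
from collections import defaultdict

# Mimics the REPLACE above: floor each timestamp to the hour, then compute
# SUM(requests) and AVG(latency) per (hour, server). Rows are invented.
minutely = [
    ("2023-01-01T00:01", "web1", 10, 120.0),
    ("2023-01-01T00:30", "web1", 30, 80.0),
    ("2023-01-01T01:05", "web1", 5, 100.0),
]

groups = defaultdict(list)
for ts, server, requests, latency in minutely:
    hour = ts[:13] + ":00"  # TIME_FLOOR(__time, 'PT1H') on ISO-like strings
    groups[(hour, server)].append((requests, latency))

hourly = {
    key: (sum(r for r, _ in vals), sum(l for _, l in vals) / len(vals))
    for key, vals in groups.items()
}
# hourly[("2023-01-01T00:00", "web1")] -> (40, 100.0)
```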

Update Rolled-Up Datasources

For datasources with rollup enabled, you can effectively update data through appends:
1. Append new rows: ingest rows with identical dimensions to existing rows.
2. Automatic query-time aggregation: queries automatically combine rows with matching dimensions using the datasource's aggregators.
3. Optional physical compaction: use compaction to physically combine matching rows and reduce storage.

Example

Existing data:
__time          | country | page_views
2023-01-01 00:00| US      | 100
Append new data:
__time          | country | page_views  
2023-01-01 00:00| US      | 50
Query result (automatic aggregation):
__time          | country | page_views
2023-01-01 00:00| US      | 150
Use Compaction or Automatic Compaction to physically combine matching rows and improve storage efficiency.
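The query-time behavior above can be illustrated in miniature, with Python standing in for the SUM aggregator:

```python
from collections import defaultdict

# Two physical rows share identical dimensions; SUM at query time merges
# them, producing the result compaction would later materialize physically.
rows = [
    ("2023-01-01T00:00", "US", 100),  # existing row
    ("2023-01-01T00:00", "US", 50),   # appended row, same dimensions
]

page_views = defaultdict(int)
for t, country, views in rows:
    page_views[(t, country)] += views
# page_views[("2023-01-01T00:00", "US")] -> 150
```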

Update with Lookups

For dimensions that change frequently, use lookups instead of reindexing:

When to Use Lookups

  • ID to name mappings (e.g., product_id → product_name)
  • Category assignments that change
  • Status codes to descriptions
  • Any dimension with values that update regularly

Example Lookup Configuration

{
  "product_names": {
    "type": "map",
    "map": {
      "P001": "Widget Pro",
      "P002": "Gadget Plus",
      "P003": "Tool Max"
    }
  }
}
Query with lookup:
SELECT
  __time,
  product_id,
  LOOKUP(product_id, 'product_names') AS product_name,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY 1, 2, 3
Benefits:
  • Update mappings without reindexing data
  • Lower storage requirements (store IDs, not strings)
  • Centralized dimension value management
See Lookups for configuration and management.
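The decoupling is easy to see in a small Python sketch (data invented): the stored rows keep only IDs, and changing the map changes query results immediately, with no reindex:

```python
# Fact rows store only the ID; the name map lives outside the data.
sales = [("P001", 120.0), ("P002", 80.0), ("P001", 40.0)]
product_names = {"P001": "Widget Pro", "P002": "Gadget Plus"}

def revenue_by_name(lookup):
    # LOOKUP(product_id, 'product_names') applied at query time; Druid
    # returns NULL on a miss, here we fall back to the raw ID for display.
    totals = {}
    for pid, revenue in sales:
        name = lookup.get(pid, pid)
        totals[name] = totals.get(name, 0.0) + revenue
    return totals

before = revenue_by_name(product_names)   # {"Widget Pro": 160.0, "Gadget Plus": 80.0}
product_names["P001"] = "Widget Pro v2"   # rename without touching stored rows
after = revenue_by_name(product_names)    # {"Widget Pro v2": 160.0, "Gadget Plus": 80.0}
```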

Concurrent Updates and Ingestion

Manage conflicts between updates and ongoing ingestion:

Enable Concurrent Append and Replace

Allow appends during replace operations:
{
  "type": "index_parallel",
  "context": {
    "useConcurrentLocks": true
  },
  "spec": {...}
}
For SQL-based ingestion, set useConcurrentLocks in the query context when you submit the statement:
REPLACE INTO wikipedia
OVERWRITE ALL
SELECT * FROM ...
PARTITIONED BY DAY
See Concurrent Append and Replace for details.
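If you build task specs programmatically, the flag is just a key in the top-level context object. A minimal helper (the function name is ours, not a Druid API):

```python
# Hypothetical helper (ours, not a Druid API): returns a copy of a native
# batch task spec with the concurrent-locks flag set in its context.
def with_concurrent_locks(task_spec):
    spec = dict(task_spec)
    context = dict(spec.get("context", {}))
    context["useConcurrentLocks"] = True
    spec["context"] = context
    return spec

task = with_concurrent_locks({"type": "index_parallel", "spec": {}})
# task["context"] -> {"useConcurrentLocks": True}
```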

Lock Priority

By default:
  • Ingestion tasks have higher priority
  • Updates/compaction yield to new ingestion
To change (advanced users only):
{
  "context": {
    "priority": 50
  }
}
Increasing update task priority can cause ingestion to fail or lag. Only change if you understand the implications.

Common Update Patterns

Add New Dimension

Reindex to add a computed dimension:
REPLACE INTO events  
OVERWRITE ALL
SELECT
  __time,
  user_id,
  event_type,
  CASE 
    WHEN event_type IN ('purchase', 'add_to_cart') THEN 'conversion'
    ELSE 'engagement'
  END AS event_category
FROM events
PARTITIONED BY DAY
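The CASE logic is simple enough to spot-check outside Druid before committing to a full reindex; a direct Python translation:

```python
# Direct translation of the CASE expression above, handy for validating the
# categorization before reindexing. Event names come from the example.
CONVERSION_EVENTS = {"purchase", "add_to_cart"}

def event_category(event_type):
    return "conversion" if event_type in CONVERSION_EVENTS else "engagement"

# event_category("purchase")  -> "conversion"
# event_category("page_view") -> "engagement"
```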

Remove Dimension

Reindex without the unwanted dimension:
REPLACE INTO logs
OVERWRITE WHERE __time >= TIMESTAMP '2023-01-01'
SELECT
  __time,
  server,
  endpoint,
  status_code
  -- Omit 'deprecated_field'
FROM logs  
WHERE __time >= TIMESTAMP '2023-01-01'
PARTITIONED BY DAY

Change Metric Aggregation

Reindex with different aggregation:
{
  "metricsSpec": [
    {
      "type": "longSum",
      "name": "sum_value",
      "fieldName": "value"
    },
    {
      "type": "longMax",  
      "name": "max_value",
      "fieldName": "value"
    }
  ]
}

Filter Out Bad Data

Reindex with filter to remove erroneous records:
{
  "transformSpec": {
    "filter": {
      "type": "and",
      "fields": [
        {
          "type": "not",
          "field": {"type": "selector", "dimension": "user_id", "value": "bot"}
        },
        {
          "type": "bound",
          "dimension": "value",
          "lower": "0",
          "lowerStrict": false,
          "ordering": "numeric"
        }
      ]
    }
  }
}
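A Python rendering of the same predicate clarifies which rows survive: keep rows whose user_id is not "bot" and whose value is non-negative (lowerStrict false makes the bound inclusive; we compare numerically here). Sample rows are invented:

```python
# Same predicate as the JSON filter: NOT(user_id == "bot") AND value >= 0.
def keep(row):
    return row["user_id"] != "bot" and float(row["value"]) >= 0

rows = [
    {"user_id": "alice", "value": "10"},
    {"user_id": "bot", "value": "5"},     # dropped: selector matches
    {"user_id": "carol", "value": "-3"},  # dropped: below the lower bound
]
clean = [r for r in rows if keep(r)]
# clean -> [{"user_id": "alice", "value": "10"}]
```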

Best Practices

Use Time Ranges

Update only the necessary time intervals to minimize data processing and lock duration.

Test Transformations

Validate transform logic on a small interval before applying to the full dataset.

Consider Lookups

For frequently changing dimensions, use lookups instead of repeated reindexing.

Monitor Locks

Watch for lock conflicts between updates and ingestion in high-throughput scenarios.

Learn More

Compaction

Optimize segments after updates

Native Batch

Native batch ingestion reference

SQL REPLACE

SQL-based data replacement

Transform Spec

Filter and transform data during ingestion
