This guide provides tips for users coming from other systems and general best practices for schema design in Druid.

Druid’s Data Model Overview

  • Datasources: similar to tables in a traditional RDBMS
  • Rollup: optional partial aggregation during ingestion
  • Time-based: every row must have a timestamp
  • OLAP: columns are either dimensions or metrics

For general information, see the Schema Model documentation.

Coming from Other Systems

Relational Databases (RDBMS)

Denormalization is recommended. Unlike relational databases, where normalization is best practice, Druid benefits from denormalized data.
-- Sales table
product_id | quantity | price
-----------|----------|------
123        | 5        | 99.99

-- Products table (separate)
product_id | name      | category
-----------|-----------|----------
123        | Widget    | Electronics
This normalized layout requires a JOIN at query time.
Storage efficiency: Druid uses dictionary encoding to store string columns efficiently. Each unique value is stored once, and rows contain integer references.
Query performance:
  • No JOIN operations needed at query time
  • Direct access to all data
  • Operates on compressed dictionary-encoded data
Use lookups (similar to dimension tables) when:
  • You need to update dimension values frequently
  • Changes should reflect immediately in already-ingested data
  • Memory footprint is acceptable (full copy on each server)
Lookups are not suitable for large tables; pre-join large datasets before ingestion instead.
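As a sketch, a registered lookup can be applied at query time with Druid SQL's LOOKUP function (the `product_names` lookup and `sales` datasource here are hypothetical):
-- Map product_id to a display name via a registered lookup,
-- instead of joining a products table at query time.
SELECT
  LOOKUP(product_id, 'product_names') AS product_name,
  SUM(quantity) AS total_quantity
FROM sales
GROUP BY 1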

Tips for Relational Data

1. Remove primary keys: Druid datasources do not have primary or unique key constraints.
2. Denormalize when possible: flatten dimension tables into your main table before ingestion.
3. Pre-join large tables: join two large distributed tables before loading into Druid (Druid doesn’t support query-time joins of large datasources).
4. Consider rollup: decide if you want perfect rollup (pre-aggregation) or to load data as-is.
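For example, the normalized sales and products tables shown earlier can be pre-joined in the source system before ingestion (the timestamp column here is a hypothetical addition):
-- Flatten the dimension table into the fact table before loading into Druid.
SELECT
  s."timestamp",
  s.product_id,
  p.name     AS product_name,
  p.category AS product_category,
  s.quantity,
  s.price
FROM sales s
JOIN products p ON s.product_id = p.product_id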

Time Series Databases

Data point vs. series model: Druid treats each point separately rather than as part of a “time series”.

Tips for Time Series Data

Create a dimension for the series name:
{
  "dimensionsSpec": {
    "dimensions": [
      "metric_name",  // Place first for best performance
      "host",
      "datacenter",
      "environment"
    ]
  }
}
Place the metric name first in the dimensions list for better locality and performance.
What time series databases call “tags” become dimensions in Druid:
{
  "dimensions": [
    "metric_name",
    "host",
    "region",
    "availability_zone"
  ]
}
Define aggregations for the types of queries you’ll run:
{
  "metricsSpec": [
    {"type": "doubleSum", "name": "sum", "fieldName": "value"},
    {"type": "doubleMin", "name": "min", "fieldName": "value"},
    {"type": "doubleMax", "name": "max", "fieldName": "value"},
    {"type": "approxHistogram", "name": "histogram", "fieldName": "value"}
  ]
}
Enable rollup to combine data points at different time granularities or to merge timeseries and non-timeseries data.

Log Aggregation Systems

Explicit schema vs. schema-less: unlike Elasticsearch or Splunk, Druid requires a more explicit schema definition upfront.

Tips for Log Data

If you don’t know the columns ahead of time, use automatic schema discovery:
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["timestamp", "message"]
  }
}
For nested JSON logs, you have two options: use Druid’s native nested column support, or flatten fields into top-level columns at ingestion time. To store a field as a nested column:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "json",
        "name": "request_headers"
      }
    ]
  }
}
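Alternatively, nested fields can be flattened into top-level string columns at ingestion time with a flattenSpec inside the input format (the field names here are illustrative):
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "user_agent",
        "expr": "$.request_headers.user_agent"
      }
    ]
  }
}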
Enable rollup if you have analytical use cases:
{
  "granularitySpec": {
    "rollup": true,
    "queryGranularity": "minute"
  }
}
With rollup enabled, you lose the ability to retrieve individual log events.

General Best Practices

Partitioning and Sorting

Optimize Partitioning

Proper partitioning and sorting can substantially reduce storage footprint and improve query performance.
Choose appropriate segment granularity:
  • High-volume data: HOUR
  • Medium-volume: DAY
  • Low-volume: WEEK or MONTH
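For a high-volume datasource, the resulting granularitySpec might look like this (values are illustrative):
{
  "granularitySpec": {
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",
    "rollup": true
  }
}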

High Cardinality Columns

For high cardinality columns like user IDs, use sketches for approximate analysis:
For count-distinct queries:
{
  "metricsSpec": [
    {
      "type": "hyperUnique",
      "name": "unique_users",
      "fieldName": "user_id"
    }
  ]
}
Benefits:
  • Improved rollup ratios (collapse multiple distinct values)
  • Reduced memory footprint at query time
  • Faster aggregation of approximate results
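At query time, the pre-built sketch can be aggregated with APPROX_COUNT_DISTINCT in Druid SQL (the datasource name is hypothetical):
-- Approximate distinct user counts per day from the ingested sketch.
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT(unique_users) AS users
FROM datasource
GROUP BY 1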

String vs Numeric Dimensions

Numeric dimensions (long, double, float):
  ✅ Faster to group on
  ❌ Slower to filter on (no indexes)
  ✅ Less memory usage

String dimensions:
  ✅ Faster to filter on (with indexes)
  ❌ Slower to group on
  ❌ More memory usage
Configure columns as numeric dimensions if you primarily group by them. Use string dimensions if you primarily filter on them.

Secondary Timestamps

If you have multiple timestamps, store additional ones as long-typed dimensions:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "long",
        "name": "created_time"
      },
      {
        "type": "long",
        "name": "updated_time"
      }
    ]
  }
}
Use transformSpec to convert timestamps to milliseconds:
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "created_time",
        "expression": "timestamp_parse(created_at, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  }
}
Use SQL time functions at query time:
SELECT 
  MILLIS_TO_TIMESTAMP(created_time) as created_at,
  TIME_FLOOR(MILLIS_TO_TIMESTAMP(updated_time), 'PT1H') as hour
FROM datasource
WHERE created_time > TIMESTAMP_TO_MILLIS(CURRENT_TIMESTAMP - INTERVAL '7' DAY)

Nested Dimensions

Use COMPLEX<json> for nested data:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "json",
        "name": "user_metadata"
      }
    ]
  }
}
Query with JSON functions:
SELECT 
  JSON_VALUE(user_metadata, '$.preferences.theme'),
  COUNT(*)
FROM datasource
GROUP BY 1

Counting Ingested Events

With rollup enabled, a count aggregator at query time tells you the number of Druid rows, not the number of ingested events.
Use a count metric during ingestion:
{
  "metricsSpec": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
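At query time, sum this metric to recover the original event count; COUNT(*) returns only the number of Druid rows after rollup (the datasource name is hypothetical):
SELECT
  SUM("count") AS ingested_events,  -- original event count
  COUNT(*)     AS druid_rows        -- rows after rollup
FROM datasource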

Schema Auto-Discovery

Schema auto-discovery is recommended for most use cases:
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["timestamp"]
  }
}
Druid infers:
  • STRING, LONG, DOUBLE
  • ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE>
  • COMPLEX<json> for nested data
Type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.

Same Column as Dimension and Metric

To filter on an ID while also computing unique counts:
{
  "dimensionsSpec": {
    "dimensions": [
      "user_id"  // For filtering
    ]
  },
  "metricsSpec": [
    {
      "type": "hyperUnique",
      "name": "unique_users",
      "fieldName": "user_id"  // For unique counts
    }
  ]
}
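With this spec, a single query can filter on the raw user_id dimension while aggregating the sketch (the datasource name and filter value are illustrative):
-- Count distinct users, excluding one ID, using the pre-built sketch.
SELECT
  APPROX_COUNT_DISTINCT(unique_users) AS users
FROM datasource
WHERE user_id <> '12345'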

Complete Example

{
  "dataSchema": {
    "dataSource": "ecommerce_events",
    "timestampSpec": {
      "column": "timestamp",
      "format": "iso"
    },
    "dimensionsSpec": {
      "dimensions": [
        "event_type",
        "product_id",
        "product_name",
        "product_category",
        "country",
        "city",
        {
          "type": "long",
          "name": "user_id"
        },
        {
          "type": "double",
          "name": "price"
        }
      ]
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "revenue",
        "fieldName": "price"
      },
      {
        "type": "hyperUnique",
        "name": "unique_users",
        "fieldName": "user_id"
      },
      {
        "type": "hyperUnique",
        "name": "unique_products",
        "fieldName": "product_id"
      }
    ],
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE",
      "rollup": true
    }
  }
}

Next Steps

  • Schema Model: learn about the core schema concepts
  • Partitioning: optimize partitioning for your use case
  • Data Rollup: understand rollup in detail
  • Nested Columns: work with nested JSON data
