This guide provides tips for users coming from other systems and general best practices for schema design in Druid.

Druid’s Data Model Overview

  • Datasources: similar to tables in a traditional RDBMS
  • Rollup: optional partial aggregation during ingestion
  • Time-based: every row must have a timestamp
  • OLAP: columns are either dimensions or metrics

For general information, see the Schema Model documentation.

Coming from Other Systems

Relational Databases (RDBMS)

Denormalization is recommended. Unlike relational databases, where normalization is best practice, Druid benefits from denormalized data.
-- Sales table
product_id | quantity | price
-----------|----------|------
123        | 5        | 99.99

-- Products table (separate)
product_id | name      | category
-----------|-----------|----------
123        | Widget    | Electronics
This normalized layout requires a JOIN at query time.
Storage efficiency: Druid uses dictionary encoding to store string columns efficiently. Each unique value is stored once, and rows contain integer references.
Query performance:
  • No JOIN operations needed at query time
  • Direct access to all data
  • Operates on compressed dictionary-encoded data
Use lookups (similar to dimension tables) when:
  • You need to update dimension values frequently
  • Changes should reflect immediately in already-ingested data
  • Memory footprint is acceptable (full copy on each server)
Lookups are not suitable for large tables; pre-join large datasets before ingestion instead.
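As a sketch, a registered lookup can be applied at query time with Druid SQL's LOOKUP function (the `product_names` lookup and `sales` datasource here are hypothetical):
-- Map product_id to a display name via a registered lookup,
-- instead of joining a products table at query time.
SELECT
  LOOKUP(product_id, 'product_names') AS product_name,
  SUM(quantity) AS total_quantity
FROM sales
GROUP BY 1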

Tips for Relational Data

1. Remove primary keys: Druid datasources do not have primary or unique key constraints.
2. Denormalize when possible: flatten dimension tables into your main table before ingestion.
3. Pre-join large tables: join two large distributed tables before loading into Druid (Druid doesn’t support query-time joins of large datasources).
4. Consider rollup: decide if you want perfect rollup (pre-aggregation) or to load data as-is.
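For example, the normalized sales and products tables shown earlier can be pre-joined in the source system before ingestion (the timestamp column here is a hypothetical addition):
-- Flatten the dimension table into the fact table before loading into Druid.
SELECT
  s."timestamp",
  s.product_id,
  p.name     AS product_name,
  p.category AS product_category,
  s.quantity,
  s.price
FROM sales s
JOIN products p ON s.product_id = p.product_id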

Time Series Databases

Data point vs. series model: Druid treats each point separately rather than as part of a “time series”.

Tips for Time Series Data

Create a dimension for the series name:
{
  "dimensionsSpec": {
    "dimensions": [
      "metric_name",  // Place first for best performance
      "host",
      "datacenter",
      "environment"
    ]
  }
}
Place the metric name first in the dimensions list for better locality and performance.
What time series databases call “tags” become dimensions in Druid:
{
  "dimensions": [
    "metric_name",
    "host",
    "region",
    "availability_zone"
  ]
}
Define aggregations for the types of queries you’ll run:
{
  "metricsSpec": [
    {"type": "doubleSum", "name": "sum", "fieldName": "value"},
    {"type": "doubleMin", "name": "min", "fieldName": "value"},
    {"type": "doubleMax", "name": "max", "fieldName": "value"},
    {"type": "approxHistogram", "name": "histogram", "fieldName": "value"}
  ]
}
Enable rollup to combine data points at different time granularities or to merge timeseries and non-timeseries data.

Log Aggregation Systems

Explicit schema vs. schema-less: unlike Elasticsearch or Splunk, Druid requires a more explicit schema definition upfront.

Tips for Log Data

If you don’t know the columns ahead of time, use automatic schema discovery:
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["timestamp", "message"]
  }
}
For nested JSON logs, you have two options: use Druid’s native nested column support, or flatten fields into top-level columns at ingestion time. To store a field as a nested column:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "json",
        "name": "request_headers"
      }
    ]
  }
}
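Alternatively, nested fields can be flattened into top-level string columns at ingestion time with a flattenSpec inside the input format (the field names here are illustrative):
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      {
        "type": "path",
        "name": "user_agent",
        "expr": "$.request_headers.user_agent"
      }
    ]
  }
}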
Enable rollup if you have analytical use cases:
{
  "granularitySpec": {
    "rollup": true,
    "queryGranularity": "minute"
  }
}
With rollup enabled, you lose the ability to retrieve individual log events.

General Best Practices

Partitioning and Sorting

Optimize Partitioning

Proper partitioning and sorting can substantially reduce storage footprint and improve query performance.
Choose appropriate segment granularity:
  • High-volume data: HOUR
  • Medium-volume: DAY
  • Low-volume: WEEK or MONTH
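For a high-volume datasource, the resulting granularitySpec might look like this (values are illustrative):
{
  "granularitySpec": {
    "segmentGranularity": "HOUR",
    "queryGranularity": "MINUTE",
    "rollup": true
  }
}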

High Cardinality Columns

For high cardinality columns like user IDs, use sketches for approximate analysis:
For count-distinct queries:
{
  "metricsSpec": [
    {
      "type": "hyperUnique",
      "name": "unique_users",
      "fieldName": "user_id"
    }
  ]
}
Benefits:
  • Improved rollup ratios (collapse multiple distinct values)
  • Reduced memory footprint at query time
  • Faster aggregation of approximate results
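At query time, the pre-built sketch can be aggregated with APPROX_COUNT_DISTINCT in Druid SQL (the datasource name is hypothetical):
-- Approximate distinct user counts per day from the ingested sketch.
SELECT
  TIME_FLOOR(__time, 'P1D') AS "day",
  APPROX_COUNT_DISTINCT(unique_users) AS users
FROM datasource
GROUP BY 1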

String vs Numeric Dimensions

Numeric dimensions (long, double, float):
  ✅ Faster to group on
  ❌ Slower to filter on (no indexes)
  ✅ Less memory usage

String dimensions:
  ✅ Faster to filter on (with indexes)
  ❌ Slower to group on
  ❌ More memory usage
Configure columns as numeric dimensions if you primarily group by them. Use string dimensions if you primarily filter on them.

Secondary Timestamps

If you have multiple timestamps, store additional ones as long-typed dimensions:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "long",
        "name": "created_time"
      },
      {
        "type": "long",
        "name": "updated_time"
      }
    ]
  }
}
Use transformSpec to convert timestamps to milliseconds:
{
  "transformSpec": {
    "transforms": [
      {
        "type": "expression",
        "name": "created_time",
        "expression": "timestamp_parse(created_at, 'yyyy-MM-dd HH:mm:ss')"
      }
    ]
  }
}
Use SQL time functions at query time:
SELECT 
  MILLIS_TO_TIMESTAMP(created_time) as created_at,
  TIME_FLOOR(MILLIS_TO_TIMESTAMP(updated_time), 'PT1H') as hour
FROM datasource
WHERE created_time > TIMESTAMP_TO_MILLIS(CURRENT_TIMESTAMP - INTERVAL '7' DAY)

Nested Dimensions

Use COMPLEX<json> for nested data:
{
  "dimensionsSpec": {
    "dimensions": [
      {
        "type": "json",
        "name": "user_metadata"
      }
    ]
  }
}
Query with JSON functions:
SELECT 
  JSON_VALUE(user_metadata, '$.preferences.theme'),
  COUNT(*)
FROM datasource
GROUP BY 1

Counting Ingested Events

With rollup enabled, a count aggregator at query time tells you the number of Druid rows, not the number of ingested events.
Use a count metric during ingestion:
{
  "metricsSpec": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}
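At query time, sum this metric to recover the original event count; COUNT(*) returns only the number of Druid rows after rollup (the datasource name is hypothetical):
SELECT
  SUM("count") AS ingested_events,  -- original event count
  COUNT(*)     AS druid_rows        -- rows after rollup
FROM datasource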

Schema Auto-Discovery

Schema auto-discovery is recommended for most use cases:
{
  "dimensionsSpec": {
    "useSchemaDiscovery": true,
    "dimensionExclusions": ["timestamp"]
  }
}
Druid infers:
  • STRING, LONG, DOUBLE
  • ARRAY<STRING>, ARRAY<LONG>, ARRAY<DOUBLE>
  • COMPLEX<json> for nested data
Type-aware schema discovery can impact downstream BI tools depending on how they handle ARRAY typed columns.

Same Column as Dimension and Metric

To filter on an ID while also computing unique counts:
{
  "dimensionsSpec": {
    "dimensions": [
      "user_id"  // For filtering
    ]
  },
  "metricsSpec": [
    {
      "type": "hyperUnique",
      "name": "unique_users",
      "fieldName": "user_id"  // For unique counts
    }
  ]
}
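With this spec, a single query can filter on the raw user_id dimension while aggregating the sketch (the datasource name and filter value are illustrative):
-- Count distinct users, excluding one ID, using the pre-built sketch.
SELECT
  APPROX_COUNT_DISTINCT(unique_users) AS users
FROM datasource
WHERE user_id <> '12345'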

Complete Example

{
  "dataSchema": {
    "dataSource": "ecommerce_events",
    "timestampSpec": {
      "column": "timestamp",
      "format": "iso"
    },
    "dimensionsSpec": {
      "dimensions": [
        "event_type",
        "product_id",
        "product_name",
        "product_category",
        "country",
        "city",
        {
          "type": "long",
          "name": "user_id"
        },
        {
          "type": "double",
          "name": "price"
        }
      ]
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "doubleSum",
        "name": "revenue",
        "fieldName": "price"
      },
      {
        "type": "hyperUnique",
        "name": "unique_users",
        "fieldName": "user_id"
      },
      {
        "type": "hyperUnique",
        "name": "unique_products",
        "fieldName": "product_id"
      }
    ],
    "granularitySpec": {
      "segmentGranularity": "HOUR",
      "queryGranularity": "MINUTE",
      "rollup": true
    }
  }
}

Next Steps

  • Schema Model: learn about the core schema concepts
  • Partitioning: optimize partitioning for your use case
  • Data Rollup: understand rollup in detail
  • Nested Columns: work with nested JSON data
