Apache Druid can ingest denormalized data in JSON, CSV, a delimited format such as TSV, or any custom format. While most examples in the documentation use data in JSON format, configuring Druid to ingest other delimited data is straightforward.

Supported Formats

  • JSON: most common format for ingesting data
  • CSV: comma-separated values format
  • TSV: tab-separated values format
  • Parquet: columnar storage format
  • ORC: Optimized Row Columnar format
  • Avro: binary serialization format

Data Format Examples

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
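The same two records can be expressed as delimited data. In CSV form (with the column order given to the inputFormat, either via a columns list or a header row), they would look like this:

```csv
2013-08-31T01:02:33Z,Gypsy Danger,en,nuclear,true,true,false,false,article,North America,United States,Bay Area,San Francisco,57,200,-143
2013-08-31T03:32:45Z,Striker Eureka,en,speed,false,true,true,false,wikipedia,Australia,Australia,Cantebury,Syndey,459,129,330
```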

Input Format Configuration

JSON Format

Configure the JSON inputFormat to load JSON data:
  • type (string, required): Set the value to json.
  • flattenSpec (object): Specifies flattening configuration for nested JSON data.
  • featureSpec (object): JSON parser features supported by Jackson. Example: {"ALLOW_SINGLE_QUOTES": true, "ALLOW_UNQUOTED_FIELD_NAMES": true}
  • assumeNewlineDelimited (boolean, default: false): If true, enables more flexible parsing-exception handling for newline-delimited JSON.
  • useJsonNodeReader (boolean, default: false): When ingesting multi-line JSON events, enables retention of valid JSON events encountered before a parsing exception.
{
  "ioConfig": {
    "inputFormat": {
      "type": "json"
    }
  }
}
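A fuller sketch combining the optional properties described above; the flattenSpec field name and path here are illustrative, not part of any real dataset:

```json
{
  "ioConfig": {
    "inputFormat": {
      "type": "json",
      "featureSpec": {
        "ALLOW_SINGLE_QUOTES": true,
        "ALLOW_UNQUOTED_FIELD_NAMES": true
      },
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          { "type": "path", "name": "nested", "expr": "$.path.to.nested" }
        ]
      }
    }
  }
}
```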

CSV Format

Configure the CSV inputFormat to load CSV data:
  • type (string, required): Set the value to csv.
  • columns (array): Specifies the columns of the data. Required if findColumnsFromHeader is false or missing.
  • findColumnsFromHeader (boolean): If true, extracts column names from the header row.
  • skipHeaderRows (integer, default: 0): Number of rows to skip at the beginning.
  • listDelimiter (string, default: the Ctrl+A character): Custom delimiter for multi-value dimensions.
  • tryParseNumbers (boolean, default: false): If true, attempts to parse numeric strings into long or double.
{
  "ioConfig": {
    "inputFormat": {
      "type": "csv",
      "columns": [
        "timestamp",
        "page",
        "language",
        "user",
        "unpatrolled",
        "newPage",
        "robot",
        "anonymous",
        "namespace",
        "continent",
        "country",
        "region",
        "city",
        "added",
        "deleted",
        "delta"
      ]
    }
  }
}
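If the CSV files carry a header row, a sketch of the alternative configuration lets Druid discover the column names instead of listing them explicitly:

```json
{
  "ioConfig": {
    "inputFormat": {
      "type": "csv",
      "findColumnsFromHeader": true
    }
  }
}
```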

TSV Format

Configure the TSV inputFormat to load TSV data:
  • type (string, required): Set the value to tsv.
  • delimiter (string, default: \t): Custom delimiter for data values.
  • columns (array): Specifies the columns of the data. Required if findColumnsFromHeader is false or missing.
  • findColumnsFromHeader (boolean): If true, extracts column names from the header row.
{
  "ioConfig": {
    "inputFormat": {
      "type": "tsv",
      "columns": ["timestamp","page","language","user"],
      "delimiter": "|"
    }
  }
}

Binary Formats

Parquet Format

Load the druid-parquet-extensions extension to use the Parquet format.

  • type (string, required): Set the value to parquet.
  • flattenSpec (object): Defines a flattenSpec to extract nested values. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary Parquet columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "parquet",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    }
  }
}

ORC Format

Load the druid-orc-extensions extension to use the ORC format.

  • type (string, required): Set the value to orc.
  • flattenSpec (object): Specifies flattening configuration for nested ORC data. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary ORC columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "orc",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    }
  }
}

Avro Stream Format

Load the druid-avro-extensions extension to use the Avro format.

  • type (string, required): Set the value to avro_stream.
  • avroBytesDecoder (object, required): Specifies how to decode bytes into an Avro record.
  • flattenSpec (object): Defines a flattenSpec to extract nested values. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary Avro columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_inline",
        "schema": {
          "namespace": "org.apache.druid.data",
          "name": "User",
          "type": "record",
          "fields": [
            { "name": "FullName", "type": "string" },
            { "name": "Country", "type": "string" }
          ]
        }
      },
      "flattenSpec": {
        "useFieldDiscovery": true
      },
      "binaryAsString": false
    }
  }
}

Kafka Input Format

The kafka input format lets you parse Kafka metadata fields in addition to the payload value contents.
  • type (string, required): Set the value to kafka.
  • valueFormat (object, required): The input format to parse the Kafka value payload.
  • timestampColumnName (string, default: kafka.timestamp): The name of the column for the Kafka timestamp.
  • topicColumnName (string, default: kafka.topic): The name of the column for the Kafka topic.
  • headerFormat (object): Specifies how to parse Kafka headers. Supports string types with various encodings (UTF-8, ISO-8859-1, etc.).
  • headerColumnPrefix (string, default: kafka.header.): Prefix for all header columns.
  • keyFormat (object): The input format to parse the Kafka key.
  • keyColumnName (string, default: kafka.key): The name of the column for the Kafka key.
{
  "ioConfig": {
    "inputFormat": {
      "type": "kafka",
      "valueFormat": {
        "type": "json"
      },
      "timestampColumnName": "kafka.timestamp",
      "topicColumnName": "kafka.topic",
      "headerFormat": {
        "type": "string",
        "encoding": "UTF-8"
      },
      "headerColumnPrefix": "kafka.header.",
      "keyFormat": {
        "type": "tsv",
        "findColumnsFromHeader": false,
        "columns": ["x"]
      },
      "keyColumnName": "kafka.key"
    }
  }
}

Kafka Metadata Example

Given this Kafka message:
  • Kafka timestamp: 1680795276351
  • Kafka topic: wiki-edits
  • Kafka headers: env=development, zone=z1
  • Kafka key: wiki-edit
  • Kafka payload: {"channel":"#sv.wikipedia","timestamp":"2016-06-27T00:00:11.080Z","page":"Salo Toraut","delta":31}
The Kafka input format parses it as:
{
  "channel": "#sv.wikipedia",
  "timestamp": "2016-06-27T00:00:11.080Z",
  "page": "Salo Toraut",
  "delta": 31,
  "kafka.timestamp": 1680795276351,
  "kafka.topic": "wiki-edits",
  "kafka.header.env": "development",
  "kafka.header.zone": "z1",
  "kafka.key": "wiki-edit"
}

Kinesis Input Format

The kinesis input format lets you parse Kinesis metadata fields in addition to the payload value contents.
  • type (string, required): Set the value to kinesis.
  • valueFormat (object, required): The input format to parse the Kinesis value payload.
  • partitionKeyColumnName (string, default: kinesis.partitionKey): The name of the column for the Kinesis partition key.
  • timestampColumnName (string, default: kinesis.timestamp): The name of the column for the Kinesis timestamp.
{
  "ioConfig": {
    "inputFormat": {
      "type": "kinesis",
      "valueFormat": {
        "type": "json"
      },
      "timestampColumnName": "kinesis.timestamp",
      "partitionKeyColumnName": "kinesis.partitionKey"
    }
  }
}

FlattenSpec

You can use the flattenSpec object to flatten nested data as an alternative to nested columns.
  • useFieldDiscovery (boolean, default: true): If true, interpret all root-level fields as available for use.
  • fields (array, default: []): Specifies the fields of interest and how they are accessed.
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "name": "baz", "type": "root" },
      { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
      { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
    ]
  }
}

Field Types

  • root: Refers to a field at the root level of the record. Only useful if useFieldDiscovery is false.
  • path: Refers to a field using JsonPath notation. Supported by most formats, including avro, json, orc, and parquet. Example: {"type": "path", "name": "nested", "expr": "$.path.to.nested"}
  • jq: Refers to a field using jackson-jq notation. Only supported for the json format. Example: {"type": "jq", "name": "first_food", "expr": ".thing.food[1]"}
  • tree: Refers to a nested field from the root level. More efficient than path or jq for simple hierarchical fetches. Only supported for the json format. Example: {"type": "tree", "name": "foo_other_bar", "nodes": ["foo", "other", "bar"]}
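Putting a tree field into a full flattenSpec might look like the following sketch; the field name foo_other_bar and the node list are illustrative only:

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "tree", "name": "foo_other_bar", "nodes": ["foo", "other", "bar"] }
    ]
  }
}
```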

Compression Formats

Druid supports the following compression formats:

  • gzip: .gz files
  • bzip2: .bz2 files
  • xz: .xz files
  • zip: .zip files
  • Snappy: .sz files
  • ZSTD: .zst files

Next Steps

  • Input Sources: configure where to read your data from
  • Ingestion Spec: complete ingestion specification reference
  • Schema Design: best practices for schema design
  • Native Batch: learn about batch ingestion
