Apache Druid can ingest denormalized data in JSON, CSV, a delimited format such as TSV, or any custom format. While most examples in the documentation use data in JSON format, configuring Druid to ingest other delimited data is straightforward.

Supported Formats

  • JSON: most common format for ingesting data
  • CSV: comma-separated values format
  • TSV: tab-separated values format
  • Parquet: columnar storage format
  • ORC: Optimized Row Columnar format
  • Avro: binary serialization format

Data Format Examples

{"timestamp": "2013-08-31T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T03:32:45Z", "page": "Striker Eureka", "language" : "en", "user" : "speed", "unpatrolled" : "false", "newPage" : "true", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Australia", "country":"Australia", "region":"Cantebury", "city":"Syndey", "added": 459, "deleted": 129, "delta": 330}
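The same two records can be expressed as delimited data. In CSV form (with the column order given to the inputFormat, either via a columns list or a header row), they would look like this:

```csv
2013-08-31T01:02:33Z,Gypsy Danger,en,nuclear,true,true,false,false,article,North America,United States,Bay Area,San Francisco,57,200,-143
2013-08-31T03:32:45Z,Striker Eureka,en,speed,false,true,true,false,wikipedia,Australia,Australia,Cantebury,Syndey,459,129,330
```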

Input Format Configuration

JSON Format

Configure the JSON inputFormat to load JSON data:
  • type (string, required): Set the value to json.
  • flattenSpec (object): Specifies flattening configuration for nested JSON data.
  • featureSpec (object): JSON parser features supported by Jackson. Example: {"ALLOW_SINGLE_QUOTES": true, "ALLOW_UNQUOTED_FIELD_NAMES": true}
  • assumeNewlineDelimited (boolean, default: false): If true, enables more flexible parsing-exception handling for newline-delimited JSON.
  • useJsonNodeReader (boolean, default: false): When ingesting multi-line JSON events, enables retention of valid JSON events encountered before a parsing exception.
{
  "ioConfig": {
    "inputFormat": {
      "type": "json"
    }
  }
}
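A fuller sketch combining the optional properties described above; the flattenSpec field name and path here are illustrative, not part of any real dataset:

```json
{
  "ioConfig": {
    "inputFormat": {
      "type": "json",
      "featureSpec": {
        "ALLOW_SINGLE_QUOTES": true,
        "ALLOW_UNQUOTED_FIELD_NAMES": true
      },
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          { "type": "path", "name": "nested", "expr": "$.path.to.nested" }
        ]
      }
    }
  }
}
```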

CSV Format

Configure the CSV inputFormat to load CSV data:
  • type (string, required): Set the value to csv.
  • columns (array): Specifies the columns of the data. Required if findColumnsFromHeader is false or missing.
  • findColumnsFromHeader (boolean): If true, extracts column names from the header row.
  • skipHeaderRows (integer, default: 0): Number of rows to skip at the beginning.
  • listDelimiter (string, default: the Ctrl+A character): Custom delimiter for multi-value dimensions.
  • tryParseNumbers (boolean, default: false): If true, attempts to parse numeric strings into long or double.
{
  "ioConfig": {
    "inputFormat": {
      "type": "csv",
      "columns": [
        "timestamp",
        "page",
        "language",
        "user",
        "unpatrolled",
        "newPage",
        "robot",
        "anonymous",
        "namespace",
        "continent",
        "country",
        "region",
        "city",
        "added",
        "deleted",
        "delta"
      ]
    }
  }
}
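If the CSV files carry a header row, a sketch of the alternative configuration lets Druid discover the column names instead of listing them explicitly:

```json
{
  "ioConfig": {
    "inputFormat": {
      "type": "csv",
      "findColumnsFromHeader": true
    }
  }
}
```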

TSV Format

Configure the TSV inputFormat to load TSV data:
  • type (string, required): Set the value to tsv.
  • delimiter (string, default: \t): Custom delimiter for data values.
  • columns (array): Specifies the columns of the data. Required if findColumnsFromHeader is false or missing.
  • findColumnsFromHeader (boolean): If true, extracts column names from the header row.
{
  "ioConfig": {
    "inputFormat": {
      "type": "tsv",
      "columns": ["timestamp","page","language","user"],
      "delimiter": "|"
    }
  }
}

Binary Formats

Parquet Format

Load the druid-parquet-extensions extension to use the Parquet format.

  • type (string, required): Set the value to parquet.
  • flattenSpec (object): Defines a flattenSpec to extract nested values. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary Parquet columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "parquet",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    }
  }
}

ORC Format

Load the druid-orc-extensions extension to use the ORC format.

  • type (string, required): Set the value to orc.
  • flattenSpec (object): Specifies flattening configuration for nested ORC data. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary ORC columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "orc",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {
            "type": "path",
            "name": "nested",
            "expr": "$.path.to.nested"
          }
        ]
      },
      "binaryAsString": false
    }
  }
}

Avro Stream Format

Load the druid-avro-extensions extension to use the Avro format.

  • type (string, required): Set the value to avro_stream.
  • avroBytesDecoder (object, required): Specifies how to decode bytes into an Avro record.
  • flattenSpec (object): Defines a flattenSpec to extract nested values. Only path expressions are supported.
  • binaryAsString (boolean, default: false): Treat binary Avro columns as UTF-8 encoded strings.
{
  "ioConfig": {
    "inputFormat": {
      "type": "avro_stream",
      "avroBytesDecoder": {
        "type": "schema_inline",
        "schema": {
          "namespace": "org.apache.druid.data",
          "name": "User",
          "type": "record",
          "fields": [
            { "name": "FullName", "type": "string" },
            { "name": "Country", "type": "string" }
          ]
        }
      },
      "flattenSpec": {
        "useFieldDiscovery": true
      },
      "binaryAsString": false
    }
  }
}

Kafka Input Format

The kafka input format lets you parse Kafka metadata fields in addition to the payload value contents.
  • type (string, required): Set the value to kafka.
  • valueFormat (object, required): The input format to parse the Kafka value payload.
  • timestampColumnName (string, default: kafka.timestamp): The name of the column for the Kafka timestamp.
  • topicColumnName (string, default: kafka.topic): The name of the column for the Kafka topic.
  • headerFormat (object): Specifies how to parse Kafka headers. Supports string types with various encodings (UTF-8, ISO-8859-1, etc.).
  • headerColumnPrefix (string, default: kafka.header.): Prefix for all header columns.
  • keyFormat (object): The input format to parse the Kafka key.
  • keyColumnName (string, default: kafka.key): The name of the column for the Kafka key.
{
  "ioConfig": {
    "inputFormat": {
      "type": "kafka",
      "valueFormat": {
        "type": "json"
      },
      "timestampColumnName": "kafka.timestamp",
      "topicColumnName": "kafka.topic",
      "headerFormat": {
        "type": "string",
        "encoding": "UTF-8"
      },
      "headerColumnPrefix": "kafka.header.",
      "keyFormat": {
        "type": "tsv",
        "findColumnsFromHeader": false,
        "columns": ["x"]
      },
      "keyColumnName": "kafka.key"
    }
  }
}

Kafka Metadata Example

Given this Kafka message:
  • Kafka timestamp: 1680795276351
  • Kafka topic: wiki-edits
  • Kafka headers: env=development, zone=z1
  • Kafka key: wiki-edit
  • Kafka payload: {"channel":"#sv.wikipedia","timestamp":"2016-06-27T00:00:11.080Z","page":"Salo Toraut","delta":31}
The Kafka input format parses it as:
{
  "channel": "#sv.wikipedia",
  "timestamp": "2016-06-27T00:00:11.080Z",
  "page": "Salo Toraut",
  "delta": 31,
  "kafka.timestamp": 1680795276351,
  "kafka.topic": "wiki-edits",
  "kafka.header.env": "development",
  "kafka.header.zone": "z1",
  "kafka.key": "wiki-edit"
}

Kinesis Input Format

The kinesis input format lets you parse Kinesis metadata fields in addition to the payload value contents.
  • type (string, required): Set the value to kinesis.
  • valueFormat (object, required): The input format to parse the Kinesis value payload.
  • partitionKeyColumnName (string, default: kinesis.partitionKey): The name of the column for the Kinesis partition key.
  • timestampColumnName (string, default: kinesis.timestamp): The name of the column for the Kinesis timestamp.
{
  "ioConfig": {
    "inputFormat": {
      "type": "kinesis",
      "valueFormat": {
        "type": "json"
      },
      "timestampColumnName": "kinesis.timestamp",
      "partitionKeyColumnName": "kinesis.partitionKey"
    }
  }
}

FlattenSpec

You can use the flattenSpec object to flatten nested data as an alternative to nested columns.
  • useFieldDiscovery (boolean, default: true): If true, interpret all root-level fields as available for use.
  • fields (array, default: []): Specifies the fields of interest and how they are accessed.
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "name": "baz", "type": "root" },
      { "name": "foo_bar", "type": "path", "expr": "$.foo.bar" },
      { "name": "first_food", "type": "jq", "expr": ".thing.food[1]" }
    ]
  }
}

Field Types

  • root: Refers to a field at the root level of the record. Only useful if useFieldDiscovery is false.
  • path: Refers to a field using JsonPath notation. Supported by most formats, including avro, json, orc, and parquet. Example: {"type": "path", "name": "nested", "expr": "$.path.to.nested"}
  • jq: Refers to a field using jackson-jq notation. Only supported for the json format. Example: {"type": "jq", "name": "first_food", "expr": ".thing.food[1]"}
  • tree: Refers to a nested field from the root level. More efficient than path or jq for simple hierarchical fetches. Only supported for the json format. Example: {"type": "tree", "name": "foo_other_bar", "nodes": ["foo", "other", "bar"]}
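Putting a tree field into a full flattenSpec might look like the following sketch; the field name foo_other_bar and the node list are illustrative only:

```json
{
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "tree", "name": "foo_other_bar", "nodes": ["foo", "other", "bar"] }
    ]
  }
}
```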

Compression Formats

Druid supports the following compression formats:

  • gzip: .gz files
  • bzip2: .bz2 files
  • xz: .xz files
  • zip: .zip files
  • Snappy: .sz files
  • ZSTD: .zst files

Next Steps

  • Input Sources: configure where to read your data from
  • Ingestion Spec: complete ingestion specification reference
  • Schema Design: best practices for schema design
  • Native Batch: learn about batch ingestion
