For SQL-based batch ingestion using the druid-multi-stage-query extension, see SQL-based ingestion.
For a comparison of batch ingestion methods and to determine which is right for you, see the ingestion methods table.
Overview
Apache Druid supports the following types of JSON-based batch indexing tasks:

- Parallel task indexing (index_parallel) - Runs multiple indexing tasks concurrently. Recommended for production ingestion tasks.
- Simple task indexing (index) - Runs a single indexing task at a time. Suitable for development and test environments.
This topic focuses on index_parallel ingestion specs.
Submit an Indexing Task
You can submit a JSON-based batch indexing task in two ways:

- Use the Load Data UI in the web console to define and submit an ingestion spec.
- POST the ingestion spec to the Tasks API endpoint: /druid/indexer/v1/task. The Druid distribution also includes the bin/post-index-task script, which submits an ingestion spec for you.
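As a hedged sketch of the API route, an ingestion spec saved as spec.json (a hypothetical filename) could be POSTed with curl. The host name is an assumption; 8888 is the default Router port, but the address and any authentication depend on your deployment:

```shell
# Submit the ingestion spec in spec.json to the Tasks API.
# ROUTER_HOST is a placeholder for your Router address.
curl -X POST \
  -H 'Content-Type: application/json' \
  -d @spec.json \
  "http://ROUTER_HOST:8888/druid/indexer/v1/task"
```

On success, Druid responds with a JSON object containing the ID of the submitted task.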
Parallel Task Indexing
The index_parallel task type enables multi-threaded batch indexing using only Druid resources - no external systems like Hadoop are required.
How It Works
The index_parallel task operates as a supervisor that orchestrates the entire indexing process:
- The supervisor splits input data into portions
- Worker tasks are created to process individual data portions
- The Overlord schedules and runs workers on Middle Managers or Indexers
- Workers report resulting segment lists back to the supervisor
- The supervisor monitors worker status and retries failed tasks
- When all workers succeed, segments are published atomically
Behavior varies depending on the partitionsSpec configuration. See PartitionsSpec for details.

Requirements
Parallel tasks require:

- A splittable inputSource in the ioConfig
- maxNumConcurrentSubTasks > 1 in the tuningConfig

If maxNumConcurrentSubTasks is 1 or less, tasks run sequentially.
Supported Compression Formats
JSON-based batch ingestion supports the following compression formats:

- bz2
- gz
- xz
- zip
- sz (Snappy)
- zst (ZSTD)
Configuration Example
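As a minimal sketch of an index_parallel spec - the data source name, file paths, interval, and column names below are illustrative assumptions, not defaults:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["page", "language"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2015-09-12/2015-09-13"]
      }
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index_parallel",
      "partitionsSpec": { "type": "dynamic" },
      "maxNumConcurrentSubTasks": 2
    }
  }
}
```

The sections below describe each part of the spec in turn.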
Configuration Reference
Top-Level Properties
| Property | Description |
|---|---|
| type | The task type. Set to index_parallel for parallel task indexing. |
| id | The task ID. If omitted, Druid generates the task ID using the task type, data source name, interval, and date-time stamp. |
| spec | The ingestion spec that defines the data schema, IO config, and tuning config. |
| context | Context to specify various task configuration parameters. See Task context parameters. |
dataSchema
Defines how Druid stores your data: the primary timestamp column, dimensions, metrics, and transformations. See Ingestion Spec DataSchema for complete details.

ioConfig

| Property | Description |
|---|---|
| type | Set to index_parallel. |
| inputFormat | Specifies how to parse input data. See input formats. |
| appendToExisting | Creates segments as additional shards of the latest version. You must use the dynamic partitioning type for appended segments. |
| dropExisting | If true, appendToExisting is false, and granularitySpec contains an interval, the ingestion task replaces all existing segments fully contained by the specified interval. |

tuningConfig
| Property | Description |
|---|---|
| type | Set to index_parallel. |
| maxRowsInMemory | Determines when Druid should perform intermediate persists to disk. Normally you don't need to set this. |
| maxBytesInMemory | Number of bytes to aggregate in heap memory before persisting. Maximum heap memory usage for indexing is maxBytesInMemory * (2 + maxPendingPersists). |
| partitionsSpec | Defines how to partition data in each time chunk. See PartitionsSpec. |
| maxNumConcurrentSubTasks | Maximum number of worker tasks that can run in parallel. If set to 1, the supervisor processes data ingestion on its own. |
| maxRetry | Maximum number of retries on task failures. |
| taskStatusCheckPeriodMs | Polling period in milliseconds to check running task statuses. |
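To make the heap bound for indexing concrete, here is a small sketch of the maxBytesInMemory formula; the byte values used are illustrative, not Druid defaults:

```python
# Worst-case heap usage for indexing, per the formula
# maxBytesInMemory * (2 + maxPendingPersists).
def max_indexing_heap_bytes(max_bytes_in_memory: int, max_pending_persists: int) -> int:
    return max_bytes_in_memory * (2 + max_pending_persists)

# Illustrative values (not Druid defaults):
print(max_indexing_heap_bytes(100_000_000, 0))  # 200000000
print(max_indexing_heap_bytes(50_000_000, 2))   # 200000000
```

Raising maxPendingPersists trades higher heap usage for fewer ingestion stalls while persists run.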
PartitionsSpec
The primary partition for Druid is time. You can define a secondary partitioning method in the partitions spec.

Partitioning Comparison
| PartitionsSpec | Speed | Method | Rollup Mode | Query Pruning |
|---|---|---|---|---|
| dynamic | Fastest | Row count-based | Best-effort | N/A |
| hashed | Moderate | Hash-based | Perfect | Yes |
| single_dim | Slower | Single dimension range | Perfect | Yes |
| range | Slowest | Multi-dimension range | Perfect | Yes |
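As hedged illustrations of the shapes described below - the row counts and dimension names are examples, not defaults - a partitionsSpec for each type looks like one of the following, placed under tuningConfig.partitionsSpec:

```json
{ "type": "dynamic", "maxRowsPerSegment": 5000000, "maxTotalRows": 20000000 }

{ "type": "hashed", "targetRowsPerSegment": 5000000, "partitionDimensions": ["host"] }

{ "type": "single_dim", "partitionDimension": "host", "targetRowsPerSegment": 5000000 }

{ "type": "range", "partitionDimensions": ["country", "city"], "targetRowsPerSegment": 5000000 }
```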
Dynamic Partitioning
Best for best-effort rollup with the fastest ingestion speed.

| Property | Description |
|---|---|
| type | Set to dynamic. |
| maxRowsPerSegment | Determines how many rows are in each segment. |
| maxTotalRows | Total number of rows across all segments waiting to be pushed. |
Hash-Based Partitioning
For perfect rollup with hash-based partitioning on specified dimensions.

| Property | Description |
|---|---|
| type | Set to hashed. |
| numShards | Directly specify the number of shards to create. Cannot be used with targetRowsPerSegment. |
| targetRowsPerSegment | Target row count for each partition. Druid determines partition count automatically. |
| partitionDimensions | The dimensions to partition on. Leave blank to select all dimensions. |
| partitionFunction | Hash function to compute hash of partition dimensions. Currently only murmur3_32_abs is supported. |

Single-Dimension Range Partitioning
For perfect rollup with range partitioning on a single dimension.

| Property | Description |
|---|---|
| type | Set to single_dim. |
| partitionDimension | The dimension to partition on. Only rows with a single dimension value are allowed. |
| targetRowsPerSegment | Target number of rows to include in a partition. Targets segments of 500MB-1GB. Either this or maxRowsPerSegment is required. |
| assumeGrouped | Assume input data is already grouped on time and dimensions. Ingestion runs faster but may choose sub-optimal partitions if violated. |
Multi-Dimension Range Partitioning
Improves over single-dimension range partitioning by distributing segment sizes more evenly.

| Property | Description |
|---|---|
| type | Set to range. |
| partitionDimensions | Array of dimensions to partition on. Order from most to least frequently queried. Limit to 3-5 dimensions for best results. |
| targetRowsPerSegment | Target number of rows per partition, targeting 500MB-1GB segments. |
Splittable Input Sources
Parallel tasks require splittable input sources:

- S3 input source
- Google Cloud Storage input source
- Azure input source
- HDFS input source
- HTTP input source
- Local input source
- Druid input source
- SQL input source
Capacity Planning
The supervisor task can create up to maxNumConcurrentSubTasks worker tasks regardless of available task slots.
Total concurrent tasks: maxNumConcurrentSubTasks + 1 (including supervisor)
To limit batch ingestion to b task slots when t parallel index tasks run at the same time, set maxNumConcurrentSubTasks to b / t - 1 for each task.
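The slot arithmetic can be sketched as follows; the b / t - 1 sizing is derived from the supervisor-plus-workers slot count above, and the 12-slot budget is an illustrative example:

```python
# Task slots used by one parallel index task: the supervisor plus its workers.
def slots_per_task(max_num_concurrent_sub_tasks: int) -> int:
    return max_num_concurrent_sub_tasks + 1

# To cap batch ingestion at b slots across t concurrent parallel index tasks,
# size each task's maxNumConcurrentSubTasks as b / t - 1.
def max_sub_tasks_for_budget(b: int, t: int) -> int:
    return b // t - 1

# Example: a 12-slot budget shared by 3 concurrent parallel index tasks.
n = max_sub_tasks_for_budget(12, 3)   # 3 workers per task
print(slots_per_task(n) * 3)          # 12 slots used in total
```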
HTTP Status Endpoints
The supervisor task provides HTTP endpoints to monitor running status.

Mode Endpoint

Returns parallel if running in parallel mode, otherwise sequential.
Phase Endpoint
Progress Endpoint
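As a hedged sketch of how these endpoints are reached through the supervisor task's chat handler - the host, port, and exact paths are assumptions to verify against your Druid version:

```shell
# SUPERVISOR_TASK_HOST, PORT, and SUPERVISOR_TASK_ID are placeholders.
curl "http://SUPERVISOR_TASK_HOST:PORT/druid/worker/v1/chat/SUPERVISOR_TASK_ID/mode"      # "parallel" or "sequential"
curl "http://SUPERVISOR_TASK_HOST:PORT/druid/worker/v1/chat/SUPERVISOR_TASK_ID/phase"     # current phase, in parallel mode
curl "http://SUPERVISOR_TASK_HOST:PORT/druid/worker/v1/chat/SUPERVISOR_TASK_ID/progress"  # progress of the current phase
```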
Related Resources
- Batch Ingestion Tutorial - Step-by-step guide to loading a file
- Input Sources - Available input sources for batch ingestion
- Data Formats - Supported input formats
- Tasks API - API reference for task management