This quickstart helps you install Apache Druid and introduces you to Druid ingestion and query features. You’ll learn how to start Druid services, load data, and query it using SQL.
For this tutorial, you need a machine with at least 6 GiB of RAM.

What you’ll learn

In this quickstart, you’ll:
  • Install Druid
  • Start up Druid services
  • Use SQL to ingest and query data
Druid supports a variety of ingestion options. Once you’re done with this tutorial, refer to the Ingestion documentation to determine which ingestion method is right for you.

Prerequisites

You can follow these steps on a relatively modest machine, such as a workstation or virtual server with 6 GiB of RAM. The software requirements for the installation machine are:
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • Java 17
  • Python 3
  • Perl 5
Java must be available: either include it on your PATH, or set the JAVA_HOME or DRUID_JAVA_HOME environment variable to point to your Java installation.
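For example, if your JDK lives outside the default search path, you can point Druid at it directly (the path below is illustrative; substitute your own installation):

```shell
# Point Druid at a specific JDK instead of the one on your PATH.
# Example path only; adjust to match your system.
export DRUID_JAVA_HOME=/usr/lib/jvm/java-17-openjdk
```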
Before installing a production Druid instance, be sure to review the security overview. In general, avoid running Druid as root user. Consider creating a dedicated user account for running Druid.

Install Druid

1. Download Druid

Download the latest Druid release from the Apache Druid downloads page.
2. Extract and navigate to the directory

In your terminal, extract the file and change directories to the distribution directory:
tar -xzf apache-druid-{{DRUIDVERSION}}-bin.tar.gz
cd apache-druid-{{DRUIDVERSION}}
The distribution directory contains LICENSE and NOTICE files and subdirectories for executable files, configuration files, sample data, and more.
3. Verify Java requirements

You can run the following command to verify Java requirements for your environment:
./bin/verify-java

Start Druid services

Start up Druid services using the automatic single-machine configuration. This configuration includes default settings that are appropriate for this tutorial, such as loading the druid-multi-stage-query extension by default so that you can use the MSQ task engine.
You can view the default settings in the configuration files located in conf/druid/auto.
From the apache-druid-{{DRUIDVERSION}} package root, run the following command:
./bin/start-druid
This launches instances of ZooKeeper and the Druid services. For example:
$ ./bin/start-druid
[Tue Nov 29 16:31:06 2022] Starting Apache Druid.
[Tue Nov 29 16:31:06 2022] Open http://localhost:8888/ in your browser to access the web console.
[Tue Nov 29 16:31:06 2022] Or, if you have enabled TLS, use https on port 9088.
[Tue Nov 29 16:31:06 2022] Starting services with log directory [/apache-druid-{{DRUIDVERSION}}/log].
[Tue Nov 29 16:31:06 2022] Running command[zk]: bin/run-zk conf
[Tue Nov 29 16:31:06 2022] Running command[broker]: bin/run-druid broker ...
[Tue Nov 29 16:31:06 2022] Running command[router]: bin/run-druid router ...
[Tue Nov 29 16:31:06 2022] Running command[coordinator-overlord]: bin/run-druid coordinator-overlord ...
[Tue Nov 29 16:31:06 2022] Running command[historical]: bin/run-druid historical ...
[Tue Nov 29 16:31:06 2022] Running command[middleManager]: bin/run-druid middleManager ...
Druid may use up to 80% of the total available system memory. To explicitly set the total memory available to Druid, pass a value for the memory parameter. For example, ./bin/start-druid -m 16g.

Understanding data storage

Druid stores all persistent state data, such as the cluster metadata store and data segments, in apache-druid-{{DRUIDVERSION}}/var. Each service writes to a log file under apache-druid-{{DRUIDVERSION}}/log.
At any time, you can revert Druid to its original, post-installation state by deleting the entire var directory. You may want to do this, for example, between Druid tutorials or after experimentation, to start with a fresh instance.
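For example, from the distribution root, after stopping Druid:

```shell
# Remove all persistent state (metadata store, segments, task data).
# Druid recreates the var directory on the next start.
rm -rf var

# Optionally clear the per-service log files as well.
rm -rf log
```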

Stopping Druid

To stop Druid at any time, use CTRL+C in the terminal. This exits the bin/start-druid script and terminates all Druid processes.

Open the web console

After starting the Druid services, open the web console at http://localhost:8888.
It may take a few seconds for all Druid services to finish starting, including the Druid Router, which serves the console. If you attempt to open the web console before startup is complete, you may see errors in the browser. Wait a few moments and try again.
In this quickstart, you use the web console to perform ingestion. The MSQ task engine specifically uses the Query view to edit and run SQL queries.

Load data

The Druid distribution bundles the wikiticker-2015-09-12-sampled.json.gz sample dataset that you can use for testing. The sample dataset is located in the quickstart/tutorial/ folder and represents Wikipedia page edits for a given day.
1. Connect to external data

In the Query view, click Connect external data.
2. Select the data source

Select the Local disk tile and enter the following values:
  • Base directory: quickstart/tutorial/
  • File filter: wikiticker-2015-09-12-sampled.json.gz
Entering the base directory and wildcard file filter separately, as the UI allows, lets you specify multiple files for ingestion at once.
3. Connect data

Click Connect data.
4. Parse the data (optional)

On the Parse page, you can examine the raw data and perform the following optional actions before loading data into Druid:
  • Expand a row to see the corresponding source data
  • Customize how the data is handled by selecting from the Input format options
  • Adjust the primary timestamp column for the data
Druid requires data to have a primary timestamp column (internally stored in a column called __time). If your dataset doesn’t have a timestamp, Druid uses the default value of 1970-01-01 00:00:00.
5. Generate the query

Click Done. You're returned to the Query view, which displays the newly generated query. The query inserts the sample data into a table named wikiticker-2015-09-12-sampled.
6. Edit the destination datasource (optional)

Edit the first line of the query and change the default destination datasource name from wikiticker-2015-09-12-sampled to wikipedia.
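For reference, the generated statement is a SQL-based ingestion query roughly of this shape. This is an abridged sketch with only a few of the dataset's columns shown; the console generates the full column list and input schema for you:

```sql
REPLACE INTO "wikipedia" OVERWRITE ALL
SELECT
  TIME_PARSE("time") AS "__time",
  "channel",
  "page",
  "user"
FROM TABLE(
  EXTERN(
    '{"type":"local","baseDir":"quickstart/tutorial/","filter":"wikiticker-2015-09-12-sampled.json.gz"}',
    '{"type":"json"}'
  )
) EXTEND ("time" VARCHAR, "channel" VARCHAR, "page" VARCHAR, "user" VARCHAR)
PARTITIONED BY DAY
```

The EXTERN table function describes the input source and format, while PARTITIONED BY controls the time granularity of the segments Druid creates.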
7. Run the query

Optionally, click Preview to see the general shape of the data before you ingest it. Click Run to execute the query. The task may take a minute or two to complete. When done, the task displays its duration and the number of rows inserted into the table.
The view is set to automatically refresh, so you don’t need to refresh the browser to see the status change.
A successful task means that Druid data servers have picked up one or more segments.

Query data

Once the ingestion job is complete, you can query the data. In the Query view, run the following query to produce a list of top channels:
SELECT
  channel,
  COUNT(*)
FROM "wikipedia"
GROUP BY channel
ORDER BY COUNT(*) DESC
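If you prefer the command line, you can also send the same query to Druid's SQL HTTP API through the Router. A sketch with curl, assuming Druid is running locally on the default port:

```shell
# POST a SQL query to the Router's SQL endpoint; results return as JSON.
curl -X POST http://localhost:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT channel, COUNT(*) AS \"edits\" FROM \"wikipedia\" GROUP BY channel ORDER BY COUNT(*) DESC"}'
```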
Congratulations! You’ve gone from downloading Druid to querying data with the MSQ task engine in just one quickstart.

Next steps

Druid SQL overview

Learn about how to query the data you just ingested

Ingestion overview

Explore options for ingesting more data

Load files using SQL

Learn how to generate SQL queries that load external data

Load streaming data from Kafka

Load streaming data from a Kafka topic
Remember that after stopping Druid services, you can start clean next time by deleting the var directory from the Druid root directory and running the bin/start-druid script again. You may want to do this before using other data ingestion tutorials, since they use the same Wikipedia datasource.
