- Install Druid
- Start up Druid services
- Use SQL to ingest and query data
Druid supports a variety of ingestion options. Once you’re done with this tutorial, refer to the Ingestion page to determine which ingestion method is right for you.
Prerequisites
You can follow these steps on a relatively modest machine, such as a workstation or virtual server with 6 GiB of RAM. The software requirements for the installation machine are:
- Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
- Java 17
- Python 3
- Perl 5
Install Druid
Download Druid
Download the release from Apache Druid.
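After downloading, extract the archive and change into the package root. A sketch of the commands, assuming the standard binary tarball naming, with {{DRUIDVERSION}} standing in for the release version you downloaded:

```shell
# Extract the binary distribution (archive name assumes the standard release naming).
tar -xzf apache-druid-{{DRUIDVERSION}}-bin.tar.gz

# Change into the package root; subsequent commands in this tutorial run from here.
cd apache-druid-{{DRUIDVERSION}}
```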
Start up Druid services
Start up Druid services using the automatic single-machine configuration. This configuration includes default settings that are appropriate for this tutorial, such as loading the druid-multi-stage-query extension by default so that you can use the MSQ task engine.
You can view the default settings in the configuration files located in conf/druid/auto.
Start Druid
From the apache-druid-{{DRUIDVERSION}} package root, run the bin/start-druid script. This launches instances of ZooKeeper and the Druid services.
To stop Druid at any time, use CTRL+C in the terminal. This exits the bin/start-druid script and terminates all Druid processes.
Understanding Druid storage
Druid stores all persistent state data, such as the cluster metadata store and data segments, in apache-druid-{{DRUIDVERSION}}/var. Each service writes to a log file under apache-druid-{{DRUIDVERSION}}/log.
At any time, you can revert Druid to its original, post-installation state by deleting the entire var directory. You may want to do this, for example, between Druid tutorials or after experimentation, to start with a fresh instance.
Open the web console
After starting the Druid services, open the web console at http://localhost:8888.
It may take a few seconds for all Druid services to finish starting, including the Druid router, which serves the console. If you attempt to open the web console before startup is complete, you may see errors in the browser. Wait a few moments and try again.
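If you prefer to check readiness from the terminal, you can poll the router's health endpoint instead of reloading the browser. This sketch assumes the default router port of 8888 and the standard /status/health endpoint:

```shell
# Prints "true" once the router is up and serving the console.
# Assumes the default router port (8888); adjust if you changed it.
curl http://localhost:8888/status/health
```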
Load data
The Druid distribution bundles the wikiticker-2015-09-12-sampled.json.gz sample dataset that you can use for testing. The sample dataset is located in the quickstart/tutorial/ folder, accessible from the Druid root directory, and represents Wikipedia page edits for a given day.
Connect to external data
In the Query view, click Connect external data. Select the Local disk tile and enter the following values:
- Base directory: quickstart/tutorial/
- File filter: wikiticker-2015-09-12-sampled.json.gz
Parse the data
On the Parse page, you can examine the raw data and perform the following optional actions before loading data into Druid:
- Expand a row to see the corresponding source data.
- Customize how the data is handled by selecting from the Input format options.
- Adjust the primary timestamp column for the data. Druid requires data to have a primary timestamp column (internally stored in a column called __time). If your dataset doesn't have a timestamp, Druid uses the default value of 1970-01-01 00:00:00.
Review the query
The query inserts the sample data into the table named wikiticker-2015-09-12-sampled. Optionally, click Preview to see the general shape of the data before you ingest it.
Edit the destination datasource name
Edit the first line of the query and change the default destination datasource name from wikiticker-2015-09-12-sampled to wikipedia.
Run the query
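After the edit, the query takes roughly the following shape. This is an abridged sketch of Druid's SQL-based ingestion syntax: the console generates the full column list and input schema for you, so the columns shown here are illustrative.

```sql
-- Abridged sketch of the generated ingestion query (column list shortened).
REPLACE INTO "wikipedia" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "channel",
  "page",
  "user"
FROM TABLE(
  EXTERN(
    '{"type":"local","baseDir":"quickstart/tutorial/","filter":"wikiticker-2015-09-12-sampled.json.gz"}',
    '{"type":"json"}'
  )
) EXTEND ("timestamp" VARCHAR, "channel" VARCHAR, "page" VARCHAR, "user" VARCHAR)
PARTITIONED BY DAY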
Click Run to execute the query. The task may take a minute or two to complete. When done, the task displays its duration and the number of rows inserted into the table. The view is set to automatically refresh, so you don’t need to refresh the browser to see the status change.
A successful task means that Druid data servers have picked up one or more segments.
Query data
Once the ingestion job is complete, you can query the data. In the Query view, run the following query to produce a list of top channels:
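A query along the following lines returns the most active channels. This is a sketch: the column name channel comes from the sample dataset, and the LIMIT is an illustrative choice.

```sql
-- Count edits per channel and list the busiest channels first.
SELECT
  channel,
  COUNT(*) AS "count"
FROM "wikipedia"
GROUP BY channel
ORDER BY COUNT(*) DESC
LIMIT 10
```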
Congratulations! You’ve gone from downloading Druid to querying data with the MSQ task engine in just one quickstart.
Next steps
See the following topics for more information:
- Druid SQL overview or the Query tutorial to learn about how to query the data you just ingested.
- Ingestion overview to explore options for ingesting more data.
- Tutorial: Load files using SQL to learn how to generate a SQL query that loads external data into a Druid datasource.
- Tutorial: Load data with native batch ingestion to load and query data with Druid’s native batch ingestion feature.
- Tutorial: Load stream data from Apache Kafka to load streaming data from a Kafka topic.
- Extensions for details on Druid extensions.