
Apache Spark

Lightning-fast unified analytics engine for large-scale data processing. Process massive datasets with SQL, streaming, machine learning, and graph processing.


Quick Start

Get Apache Spark up and running in minutes

1. Download Apache Spark

Download Spark from the Apache Spark downloads page. Choose a pre-built package for your Hadoop version, or download the source and build it yourself.
# Extract the downloaded archive
tar xvf spark-*.tgz
cd spark-*
2. Start the Interactive Shell

Launch the Spark shell to start working with your data interactively. Spark provides shells for Scala, Python, R, and SQL.
# Python shell
./bin/pyspark

# Scala shell
./bin/spark-shell

# R shell
./bin/sparkR
Make sure you have Java 17 or later installed. Set JAVA_HOME to point to your Java installation.
3. Run Your First Query

Try running a simple example to verify your installation:
# Create a DataFrame from a range
df = spark.range(1000 * 1000 * 1000)

# Count the records
df.count()
# Output: 1000000000

The same query in the Scala shell:

scala> spark.range(1000 * 1000 * 1000).count()
res0: Long = 1000000000
4. Run an Example Application

Spark includes several sample programs. Run the SparkPi example to compute an approximation of Pi:
./bin/run-example SparkPi 10
You can also submit your own applications using spark-submit. Learn more in the Submitting Applications guide.
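As a sketch, a minimal spark-submit invocation might look like the following; the master URL and the examples jar path are placeholders that depend on your download and cluster:

```shell
# Submit the bundled SparkPi example to a local 4-core master;
# replace the master URL and jar path for your installation
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[4]" \
  examples/jars/spark-examples_*.jar \
  100
```

The trailing argument (100) is the number of partitions SparkPi samples over; more partitions spread the work across more tasks.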

Explore by Component

Apache Spark provides a rich set of libraries for different data processing needs

Spark SQL

Work with structured data using SQL queries and DataFrames. Connect to data sources like Parquet, JSON, Hive, and JDBC.
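As a quick sketch of the Spark SQL workflow (the file path, view name, and column names here are hypothetical), you can load a DataFrame, register it as a temporary view, and query it with SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Load structured data (the path is a placeholder)
df = spark.read.parquet("events.parquet")

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("events")
top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    GROUP BY user_id
    ORDER BY n DESC
    LIMIT 10
""")
top_users.show()
```

The same query can be expressed with the DataFrame API (`groupBy`, `count`, `orderBy`); SQL and DataFrame operations share the same optimizer.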

Structured Streaming

Build scalable and fault-tolerant streaming applications. Process real-time data from Kafka, Kinesis, and more.
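A minimal Structured Streaming sketch, assuming the Kafka connector package is on the classpath and a broker is running at the placeholder address below, might count events per one-minute window:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Read a stream from Kafka (broker address and topic are placeholders)
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Count events per 1-minute window, using the timestamp the Kafka source attaches
counts = (events
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

# Print running counts to the console; awaitTermination blocks until stopped
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```

The same code works for batch DataFrames with `read`/`write` in place of `readStream`/`writeStream`, which is the point of the unified API.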

MLlib

Scale machine learning with distributed algorithms for classification, regression, clustering, and collaborative filtering.
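As a small illustration of the DataFrame-based MLlib API, here is a logistic regression fit on a tiny toy dataset (the data and parameter values are made up for the example):

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-example").getOrCreate()

# A tiny toy dataset: (label, features)
train = spark.createDataFrame([
    (0.0, Vectors.dense(0.0, 1.1)),
    (1.0, Vectors.dense(2.0, 1.0)),
    (0.0, Vectors.dense(0.1, 1.2)),
    (1.0, Vectors.dense(2.2, 0.9)),
], ["label", "features"])

# Fit a logistic regression model and inspect its coefficients
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(train)
print(model.coefficients)
```

Estimators like `LogisticRegression` also compose into `Pipeline` stages alongside feature transformers, so the same fit/transform pattern scales to full workflows.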

GraphX

Analyze graph-structured data with Spark’s graph computation framework and built-in graph algorithms.

Spark Connect

Connect to Spark clusters remotely using the decoupled client-server architecture introduced in Spark 3.4.
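A minimal Spark Connect client sketch, assuming a Spark Connect server is reachable at the placeholder URL below (15002 is the conventional default port):

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server (the sc:// URL is a placeholder)
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are built locally, then sent to the server for execution
spark.range(10).selectExpr("id * 2 AS doubled").show()
```

Because the client only speaks the Connect protocol, thin applications can use Spark without bundling the full Spark runtime.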

Core API

Understand RDDs and the fundamental distributed computing primitives that power all Spark components.

Deploy Anywhere

Run Spark on your preferred cluster manager

Standalone Mode

Deploy Spark on a private cluster with the built-in standalone cluster manager. Simple setup with minimal dependencies.
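Bringing up a standalone cluster is a matter of starting a master and pointing workers and applications at it; the host name below is a placeholder:

```shell
# Start a master on this machine, then attach a worker to it
./sbin/start-master.sh
./sbin/start-worker.sh spark://<master-host>:7077

# Point an application at the standalone master
./bin/spark-shell --master spark://<master-host>:7077
```

The master's web UI (port 8080 by default) shows registered workers and running applications.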

Kubernetes

Run Spark natively on Kubernetes with container orchestration and resource isolation. Ideal for cloud-native deployments.
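A sketch of a Kubernetes submission; the API server URL, container image, and jar path are placeholders for your cluster:

```shell
# Submit in cluster mode against the Kubernetes API server
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_*.jar
```

The `local://` scheme tells Spark the jar is already inside the container image rather than on the submitting machine.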

Apache YARN

Integrate with Hadoop YARN for resource management in Hadoop clusters. Leverage existing Hadoop infrastructure.
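Submitting to YARN looks much the same; Spark discovers the cluster from the Hadoop configuration directory (the path below is a typical placeholder):

```shell
# HADOOP_CONF_DIR must point at the cluster's Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_*.jar
```

In cluster mode the driver runs inside a YARN container; use `--deploy-mode client` to keep the driver on the submitting machine.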

Cluster Overview

Understand Spark’s cluster architecture, deployment modes, and how applications are executed across a cluster.

Ready to process big data at scale?

Start building distributed data processing applications with Apache Spark. From batch processing to real-time analytics, Spark powers some of the world’s largest data workloads.