
System Requirements

Before installing Spark, ensure your system meets these requirements:
  • Java: Java 17 or 21
  • Python: Python 3.10+ (for PySpark)
  • Scala: 2.13 (bundled with the Spark distribution)
  • R: R 3.5+ (for SparkR, which is deprecated)
  • Operating System: Linux, macOS, or Windows
  • Architecture: x86_64 or ARM64
Spark runs on both Windows and UNIX-like systems. Make sure Java is in your PATH or JAVA_HOME is set.
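
The Java and Python checks above can be scripted. A minimal sketch (the helper name is ours, not part of Spark):

```python
import os
import shutil
import sys

def check_spark_prereqs():
    """Report missing prerequisites from the list above (hypothetical helper)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required for PySpark")
    # Java must be reachable either via JAVA_HOME or on PATH
    if os.environ.get("JAVA_HOME") is None and shutil.which("java") is None:
        problems.append("Java not found: set JAVA_HOME or put java on PATH")
    return problems

print(check_spark_prereqs() or "All prerequisites satisfied")
```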

Pre-Built Packages

The easiest way to get started is downloading a pre-built package.

Step 1: Download Spark

Visit the Spark downloads page and download a pre-built package. Choose a package type:
  • Pre-built for Hadoop: Includes Hadoop libraries (recommended for most users)
  • Pre-built without Hadoop: Smaller package; you provide Hadoop separately

Step 2: Extract the archive

tar -xzf spark-4.0.0-bin-hadoop3.tgz
cd spark-4.0.0-bin-hadoop3

Step 3: Set up environment (optional)

Add Spark to your PATH:
# Add to ~/.bashrc or ~/.zshrc
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

Step 4: Verify installation

Test your installation:
./bin/spark-shell --version

Installation by Platform

Ubuntu/Debian

Install Java first:
# Install Java 17
sudo apt update
sudo apt install openjdk-17-jdk

# Verify installation
java -version
Download and extract Spark:
# Download Spark
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz

# Extract
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark

# Set environment variables
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

RHEL/CentOS/Fedora

# Install Java 17
sudo dnf install java-17-openjdk-devel

# Download and install Spark (same as above)
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark

Installing PySpark

Install PySpark via pip for Python development.

Using pip

# Install PySpark
pip install pyspark

# Install specific version
pip install pyspark==4.0.0

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
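
The same check can be done from inside Python without importing the whole package, using the standard library's importlib.metadata (a small sketch; the function name is ours):

```python
from importlib import metadata

def installed_pyspark_version():
    """Return the installed PySpark version string, or None if not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None

print(installed_pyspark_version())
```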

Using Conda

# Create a new environment
conda create -n spark python=3.11
conda activate spark

# Install PySpark
conda install -c conda-forge pyspark

In Your Project

Add to requirements.txt:
pyspark==4.0.0
pandas>=2.0.0
numpy>=1.24.0
Or setup.py:
setup(
    name="my-spark-app",
    install_requires=[
        'pyspark==4.0.0',
    ],
)
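
With PySpark installed by any of these methods, a short end-to-end smoke test confirms that Python, Java, and Spark can talk to each other. This is a sketch that degrades gracefully when pyspark or Java is absent:

```python
import shutil

def pyspark_smoke_test():
    """Start a local SparkSession and run one trivial job (illustrative sketch)."""
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        return "pyspark not installed"
    if shutil.which("java") is None:
        return "java not found on PATH"
    try:
        spark = (SparkSession.builder
                 .master("local[1]")
                 .appName("smoke-test")
                 .getOrCreate())
    except Exception as exc:
        return f"failed to start Spark: {exc}"
    try:
        # A trivial job: count a 5-row range; a healthy install returns 5
        return spark.range(5).count()
    finally:
        spark.stop()

print(pyspark_smoke_test())
```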

Building from Source

Build Spark from source for the latest features or custom configurations.

Prerequisites

  • Maven: 3.9.12 or later
  • Java: 17 or 21
  • Git: To clone the repository

Step 1: Clone the repository

git clone https://github.com/apache/spark.git
cd spark

Step 2: Configure Maven memory

export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m"

Step 3: Build Spark

Basic build:
./build/mvn -DskipTests clean package
This creates a runnable distribution in assembly/target/scala-2.13/.

Build with Specific Features

./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package

Build a Distribution

Create a complete distribution like the official releases:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz \
  -Psparkr -Phive -Phive-thriftserver -Pyarn -Pkubernetes
This creates a .tgz file in the project root with:
  • Compiled binaries
  • Python pip package
  • R package
  • All selected features

Building with SBT

SBT provides faster iterative compilation for development:
# Basic build
./build/sbt package

# Interactive mode (faster for multiple builds)
./build/sbt
sbt> package
sbt> test
Configure SBT memory in .jvmopts:
-Xmx4g
-XX:ReservedCodeCacheSize=1g

Hadoop Version Compatibility

Spark must be built against the same Hadoop version as your cluster. Protocol changes between Hadoop versions can cause compatibility issues.

Specifying Hadoop Version

# Hadoop 3.3.x
./build/mvn -Pyarn -Dhadoop.version=3.3.6 -DskipTests clean package

# Hadoop 3.4.x
./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package

Hadoop-Free Build

For YARN deployments, build without bundled Hadoop to avoid classpath conflicts:
./build/mvn -Phadoop-provided -Pyarn -DskipTests clean package
Provide Hadoop libraries through yarn.application.classpath.

Running Tests

Verify your installation by running tests.

Test with Maven

# Run all tests
./build/mvn test

# Run specific module tests
./build/mvn -pl :spark-sql_2.13 test

# Run a specific Scala test suite
./build/mvn -pl :spark-sql_2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DataFrameSuite test

Test with SBT

# Run all tests
./build/sbt test

# Run specific test
./build/sbt "testOnly org.apache.spark.sql.DataFrameSuite"

PySpark Tests

# Build with Hive first
./build/mvn -DskipTests clean package -Phive

# Run Python tests
./python/run-tests

# Run specific module
./python/run-tests --modules=pyspark-sql
Tests require significant time and resources. Use -DskipTests during development builds.

Configuration

Essential Environment Variables

# Java installation
export JAVA_HOME=/path/to/java17

# Spark installation
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

# Optional: Default master URL
export MASTER=spark://host:7077

# Optional: Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
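
From Python, the same variables can be inspected at runtime; a small sketch (the helper name is ours):

```python
import os

def spark_env_summary():
    """Return the Spark-related environment variables described above."""
    keys = ("JAVA_HOME", "SPARK_HOME", "MASTER", "HADOOP_CONF_DIR")
    return {k: os.environ.get(k) for k in keys}

for name, value in spark_env_summary().items():
    print(f"{name}={value or '(unset)'}")
```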

Spark Configuration Files

Copy and customize configuration templates:
cd $SPARK_HOME/conf

# Copy templates
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
cp log4j2.properties.template log4j2.properties
Edit spark-defaults.conf for default settings:
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/spark-logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              4g
spark.executor.memory            4g
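
Each line of spark-defaults.conf is a key and a value separated by whitespace; blank lines and # comments are ignored. As a rough illustration of that format (our sketch, not Spark's actual loader):

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text (illustrative sketch only)."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(" ")
        conf[key.strip()] = value.strip()
    return conf

example = """\
# example fragment
spark.driver.memory              4g
spark.eventLog.enabled           true
"""
print(parse_spark_defaults(example))
```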

Troubleshooting

Java Version Issues

# Check Java version
java -version

# If wrong version, update JAVA_HOME
export JAVA_HOME=/path/to/java17

Memory Errors During Build

# Increase Maven memory
export MAVEN_OPTS="-Xmx6g -XX:ReservedCodeCacheSize=256m"

# Or use build/mvn (automatically sets memory)
./build/mvn clean package

Permission Errors

# Don't run as root
# Fix ownership if needed
sudo chown -R $USER:$USER /path/to/spark

Filename Too Long (Encrypted Filesystems)

If you encounter “Filename too long” errors on encrypted filesystems:
  1. Edit pom.xml and add to scala-maven-plugin:
    <arg>-Xmax-classfile-name</arg>
    <arg>128</arg>
    
  2. Edit project/SparkBuild.scala and add:
    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128")
    

Next Steps

Now that you have Spark installed:

  • Quick Start Tutorial: try the interactive tutorial to learn Spark basics
  • Configuration Guide: customize Spark for your needs
  • Cluster Deployment: deploy Spark on a cluster
  • Programming Guides: deep dive into Spark's APIs
