
System Requirements

Before installing Spark, ensure your system meets these requirements:
  • Java: Java 17 or 21
  • Python: Python 3.10+ (for PySpark)
  • Scala: 2.13 (bundled with the Spark distribution)
  • R: R 3.5+ (for SparkR, which is deprecated)
  • Operating System: Linux, macOS, or Windows
  • Architecture: x86_64 or ARM64
Spark runs on both Windows and UNIX-like systems. Make sure Java is in your PATH or JAVA_HOME is set.
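
The Java and Python checks above can be scripted. A minimal sketch (the helper name is ours, not part of Spark):

```python
import os
import shutil
import sys

def check_spark_prereqs():
    """Report missing prerequisites from the list above (hypothetical helper)."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ is required for PySpark")
    # Java must be reachable either via JAVA_HOME or on PATH
    if os.environ.get("JAVA_HOME") is None and shutil.which("java") is None:
        problems.append("Java not found: set JAVA_HOME or put java on PATH")
    return problems

print(check_spark_prereqs() or "All prerequisites satisfied")
```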

Pre-Built Packages

The easiest way to get started is downloading a pre-built package.

Step 1: Download Spark

Visit the Spark downloads page and download a pre-built package. Choose a package type:
  • Pre-built for Hadoop: Includes Hadoop libraries (recommended for most users)
  • Pre-built without Hadoop: Smaller package; you provide Hadoop separately

Step 2: Extract the archive

tar -xzf spark-4.0.0-bin-hadoop3.tgz
cd spark-4.0.0-bin-hadoop3

Step 3: Set up environment (optional)

Add Spark to your PATH:
# Add to ~/.bashrc or ~/.zshrc
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

Step 4: Verify installation

Test your installation:
./bin/spark-shell --version

Installation by Platform

Ubuntu/Debian

Install Java first:
# Install Java 17
sudo apt update
sudo apt install openjdk-17-jdk

# Verify installation
java -version
Download and extract Spark:
# Download Spark
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz

# Extract
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark

# Set environment variables
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

RHEL/CentOS/Fedora

# Install Java 17
sudo dnf install java-17-openjdk-devel

# Download and install Spark (same as above)
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark

Installing PySpark

Install PySpark via pip for Python development.

Using pip

# Install PySpark
pip install pyspark

# Install specific version
pip install pyspark==4.0.0

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
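
The same check can be done from inside Python without importing the whole package, using the standard library's importlib.metadata (a small sketch; the function name is ours):

```python
from importlib import metadata

def installed_pyspark_version():
    """Return the installed PySpark version string, or None if not installed."""
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        return None

print(installed_pyspark_version())
```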

Using Conda

# Create a new environment
conda create -n spark python=3.11
conda activate spark

# Install PySpark
conda install -c conda-forge pyspark

In Your Project

Add to requirements.txt:
pyspark==4.0.0
pandas>=2.0.0
numpy>=1.24.0
Or setup.py:
setup(
    name="my-spark-app",
    install_requires=[
        'pyspark==4.0.0',
    ],
)
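
With PySpark installed by any of these methods, a short end-to-end smoke test confirms that Python, Java, and Spark can talk to each other. This is a sketch that degrades gracefully when pyspark or Java is absent:

```python
import shutil

def pyspark_smoke_test():
    """Start a local SparkSession and run one trivial job (illustrative sketch)."""
    try:
        from pyspark.sql import SparkSession
    except ImportError:
        return "pyspark not installed"
    if shutil.which("java") is None:
        return "java not found on PATH"
    try:
        spark = (SparkSession.builder
                 .master("local[1]")
                 .appName("smoke-test")
                 .getOrCreate())
    except Exception as exc:
        return f"failed to start Spark: {exc}"
    try:
        # A trivial job: count a 5-row range; a healthy install returns 5
        return spark.range(5).count()
    finally:
        spark.stop()

print(pyspark_smoke_test())
```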

Building from Source

Build Spark from source for the latest features or custom configurations.

Prerequisites

  • Maven: 3.9.12 or later
  • Java: 17 or 21
  • Git: To clone the repository

Step 1: Clone the repository

git clone https://github.com/apache/spark.git
cd spark

Step 2: Configure Maven memory

export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m"

Step 3: Build Spark

Basic build:
./build/mvn -DskipTests clean package
This creates a runnable distribution in assembly/target/scala-2.13/.

Build with Specific Features

./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package

Build a Distribution

Create a complete distribution like the official releases:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz \
  -Psparkr -Phive -Phive-thriftserver -Pyarn -Pkubernetes
This creates a .tgz file in the project root with:
  • Compiled binaries
  • Python pip package
  • R package
  • All selected features

Building with SBT

SBT provides faster iterative compilation for development:
# Basic build
./build/sbt package

# Interactive mode (faster for multiple builds)
./build/sbt
sbt> package
sbt> test
Configure SBT memory in .jvmopts:
-Xmx4g
-XX:ReservedCodeCacheSize=1g

Hadoop Version Compatibility

Spark must be built against the same Hadoop version as your cluster. Protocol changes between Hadoop versions can cause compatibility issues.

Specifying Hadoop Version

# Hadoop 3.3.x
./build/mvn -Pyarn -Dhadoop.version=3.3.6 -DskipTests clean package

# Hadoop 3.4.x
./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package

Hadoop-Free Build

For YARN deployments, build without bundled Hadoop to avoid classpath conflicts:
./build/mvn -Phadoop-provided -Pyarn -DskipTests clean package
Provide Hadoop libraries through yarn.application.classpath.

Running Tests

Verify your installation by running tests.

Test with Maven

# Run all tests
./build/mvn test

# Run specific module tests
./build/mvn -pl :spark-sql_2.13 test

# Run a specific Scala test suite
./build/mvn -pl :spark-sql_2.13 -Dtest=none -DwildcardSuites=org.apache.spark.sql.DataFrameSuite test

Test with SBT

# Run all tests
./build/sbt test

# Run specific test
./build/sbt "testOnly org.apache.spark.sql.DataFrameSuite"

PySpark Tests

# Build with Hive first
./build/mvn -DskipTests clean package -Phive

# Run Python tests
./python/run-tests

# Run specific module
./python/run-tests --modules=pyspark-sql
Tests require significant time and resources. Use -DskipTests during development builds.

Configuration

Essential Environment Variables

# Java installation
export JAVA_HOME=/path/to/java17

# Spark installation
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH

# Optional: Default master URL
export MASTER=spark://host:7077

# Optional: Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
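
From Python, the same variables can be inspected at runtime; a small sketch (the helper name is ours):

```python
import os

def spark_env_summary():
    """Return the Spark-related environment variables described above."""
    keys = ("JAVA_HOME", "SPARK_HOME", "MASTER", "HADOOP_CONF_DIR")
    return {k: os.environ.get(k) for k in keys}

for name, value in spark_env_summary().items():
    print(f"{name}={value or '(unset)'}")
```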

Spark Configuration Files

Copy and customize configuration templates:
cd $SPARK_HOME/conf

# Copy templates
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
cp log4j2.properties.template log4j2.properties
Edit spark-defaults.conf for default settings:
spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://namenode:8021/spark-logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              4g
spark.executor.memory            4g
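
Each line of spark-defaults.conf is a key and a value separated by whitespace; blank lines and # comments are ignored. As a rough illustration of that format (our sketch, not Spark's actual loader):

```python
def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text (illustrative sketch only)."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, _, value = line.partition(" ")
        conf[key.strip()] = value.strip()
    return conf

example = """\
# example fragment
spark.driver.memory              4g
spark.eventLog.enabled           true
"""
print(parse_spark_defaults(example))
```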

Troubleshooting

Java Version Issues

# Check Java version
java -version

# If wrong version, update JAVA_HOME
export JAVA_HOME=/path/to/java17

Memory Errors During Build

# Increase Maven memory
export MAVEN_OPTS="-Xmx6g -XX:ReservedCodeCacheSize=256m"

# Or use build/mvn (automatically sets memory)
./build/mvn clean package

Permission Errors

# Don't run as root
# Fix ownership if needed
sudo chown -R $USER:$USER /path/to/spark

Filename Too Long (Encrypted Filesystems)

If you encounter “Filename too long” errors on encrypted filesystems:
  1. Edit pom.xml and add to scala-maven-plugin:
    <arg>-Xmax-classfile-name</arg>
    <arg>128</arg>
    
  2. Edit project/SparkBuild.scala and add:
    scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128")
    

Next Steps

Now that you have Spark installed:

  • Quick Start Tutorial: try the interactive tutorial to learn Spark basics
  • Configuration Guide: customize Spark for your needs
  • Cluster Deployment: deploy Spark on a cluster
  • Programming Guides: deep dive into Spark's APIs
