System Requirements
Before installing Spark, ensure your system meets these requirements:
Java : 17 or 21
Python : Python 3.10+ (for PySpark)
Scala : 2.13 (automatically included in Spark distribution)
R : 3.5+ (for SparkR, which is deprecated)
Operating System : Linux, macOS, or Windows
Architecture : x86_64 or ARM64
Spark runs on both Windows and UNIX-like systems. Make sure Java is in your PATH or JAVA_HOME is set.
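Spark's launch scripts prefer `$JAVA_HOME/bin/java` when JAVA_HOME is set and otherwise fall back to the first `java` on PATH. A minimal sketch of that resolution order (the `resolve_java` helper name is hypothetical, not part of Spark):

```shell
# Sketch: mirror how Spark's launch scripts pick a Java binary -- they use
# $JAVA_HOME/bin/java when JAVA_HOME is set, and fall back to `java` on PATH.
resolve_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "$JAVA_HOME/bin/java"
  elif command -v java >/dev/null 2>&1; then
    command -v java
  else
    echo "no Java runtime found; install JDK 17 or 21" >&2
    return 1
  fi
}
```

Running it in your shell tells you which Java a Spark process would end up using, which is handy when several JDKs are installed.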
Pre-Built Packages
The easiest way to get started is downloading a pre-built package.
Download Spark
Visit the Spark downloads page and download a pre-built package. Choose a package type:
Pre-built for Hadoop : Includes Hadoop libraries (recommended for most users)
Pre-built without Hadoop : Smaller package; you provide the Hadoop libraries separately
Extract the archive
tar -xzf spark-4.0.0-bin-hadoop3.tgz
cd spark-4.0.0-bin-hadoop3
Set up environment (optional)
Add Spark to your PATH:
# Add to ~/.bashrc or ~/.zshrc
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH
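If your profile gets sourced more than once, a plain prepend duplicates the entry each time. A small guard avoids that; the `add_spark_to_path` function name is an illustration, not a Spark convention:

```shell
# Sketch: prepend Spark's bin directory to PATH only once, so re-sourcing
# your shell profile doesn't grow PATH. Assumes SPARK_HOME is already set.
add_spark_to_path() {
  case ":$PATH:" in
    *":$SPARK_HOME/bin:"*) ;;             # already present, do nothing
    *) PATH="$SPARK_HOME/bin:$PATH" ;;    # prepend exactly once
  esac
}
```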
Verify installation
Test your installation:
./bin/spark-shell --version
The sections below cover installation on Linux, macOS, and Windows, and running Spark in Docker.
Linux
Ubuntu/Debian
Install Java first:
# Install Java 17
sudo apt update
sudo apt install openjdk-17-jdk
# Verify installation
java -version
Download and extract Spark:
# Download Spark
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
# Extract
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark
# Set environment variables
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
RHEL/CentOS/Fedora
# Install Java 17
sudo dnf install java-17-openjdk-devel
# Download and install Spark (same as above)
wget https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /opt/spark
macOS
Using Homebrew (Recommended)
# Install Java
brew install openjdk@17
# Install Spark
brew install apache-spark
Manual Installation
# Install Java from Homebrew or download from Adoptium
brew install openjdk@17
# Download Spark
curl -O https://downloads.apache.org/spark/spark-4.0.0/spark-4.0.0-bin-hadoop3.tgz
# Extract
tar -xzf spark-4.0.0-bin-hadoop3.tgz
sudo mv spark-4.0.0-bin-hadoop3 /usr/local/spark
# Add to PATH
echo 'export SPARK_HOME=/usr/local/spark' >> ~/.zshrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.zshrc
source ~/.zshrc
Windows
Prerequisites
Install Java 17 or 21:
Download from Adoptium
Install and set JAVA_HOME environment variable
Download Spark from the downloads page
Installation Steps
Extract the archive using 7-Zip or similar tool
Move to C:\spark
Set environment variables:
SPARK_HOME = C:\spark
Add %SPARK_HOME%\bin to PATH
Install Hadoop winutils (required for Windows):
# Download winutils.exe
# Place in C:\hadoop\bin
# Set HADOOP_HOME = C:\hadoop
Run Spark
# Test installation
spark-shell --version
# Launch PySpark (if Python is installed)
pyspark
Windows users need winutils.exe for Hadoop compatibility. Download it from the winutils repository.
Docker
Run Spark in a Docker container:
# Pull the official Spark image
docker pull apache/spark:latest
# Run PySpark
docker run -it apache/spark:latest /opt/spark/bin/pyspark
# Run Spark shell
docker run -it apache/spark:latest /opt/spark/bin/spark-shell
# Run a Spark application
docker run -v /path/to/app:/app apache/spark:latest \
/opt/spark/bin/spark-submit /app/my_app.py
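The `spark-submit` invocation above can be wrapped in a small shell function so you don't retype the mount and image arguments. Everything here follows the examples in this guide; the `docker_submit` name is hypothetical and the image tag may differ in your setup:

```shell
# Sketch: run the containerized spark-submit from this guide, mounting the
# current directory at /app. First argument is the application file name,
# remaining arguments are passed through to spark-submit.
docker_submit() {
  app="$1"; shift
  docker run --rm -v "$PWD":/app apache/spark:latest \
    /opt/spark/bin/spark-submit "/app/$app" "$@"
}
```

Usage: `docker_submit my_app.py --verbose` runs `my_app.py` from the current directory inside the container.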
Custom Dockerfile
FROM apache/spark:latest
# Install additional Python packages
RUN pip install pandas numpy matplotlib
# Copy your application
COPY my_app.py /app/my_app.py
# Set working directory
WORKDIR /app
Installing PySpark
Install PySpark via pip for Python development.
Using pip
# Install PySpark
pip install pyspark
# Install specific version
pip install pyspark==4.0.0
# Verify installation
python -c "import pyspark; print(pyspark.__version__)"
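When scripts depend on a minimum PySpark (or Java) version, a portable dotted-version comparison is handy for gating on the version string printed above. A sketch, assuming GNU `sort` with `-V` support; `version_ge` is a hypothetical helper name:

```shell
# Sketch: compare dotted version strings, e.g. to gate a setup script on a
# minimum PySpark version. Relies on `sort -V` (GNU version sort).
version_ge() {
  # succeeds when $1 >= $2
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}
```

For example, `version_ge "$(python -c 'import pyspark; print(pyspark.__version__)')" 4.0.0` succeeds only when the installed PySpark is at least 4.0.0.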
Using Conda
# Create a new environment
conda create -n spark python=3.11
conda activate spark
# Install PySpark
conda install -c conda-forge pyspark
In Your Project
Add to requirements.txt:
pyspark==4.0.0
pandas>=2.0.0
numpy>=1.24.0
Or setup.py:
setup(
    name="my-spark-app",
    install_requires=[
        'pyspark==4.0.0',
    ],
)
Building from Source
Build Spark from source for the latest features or custom configurations.
Prerequisites
Maven : 3.9.12 or later
Java : 17 or 21
Git : To clone the repository
Clone the repository
git clone https://github.com/apache/spark.git
cd spark
Configure Maven memory
export MAVEN_OPTS="-Xss64m -Xmx4g -Xms4g -XX:ReservedCodeCacheSize=128m"
Build Spark
Basic build:
./build/mvn -DskipTests clean package
This creates a runnable distribution in assembly/target/scala-2.13/.
Build with Specific Features
YARN Support
./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package
Hive Support
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
Kubernetes Support
./build/mvn -Pkubernetes -DskipTests clean package
All Features
./build/mvn -Pyarn -Phive -Phive-thriftserver -Pkubernetes \
  -Dhadoop.version=3.4.3 -DskipTests clean package
Build a Distribution
Create a complete distribution like the official releases:
./dev/make-distribution.sh --name custom-spark --pip --r --tgz \
-Psparkr -Phive -Phive-thriftserver -Pyarn -Pkubernetes
This creates a .tgz file in the project root with:
Compiled binaries
Python pip package
R package
All selected features
Building with SBT
SBT provides faster iterative compilation for development:
# Basic build
./build/sbt package
# Interactive mode (faster for multiple builds)
./build/sbt
sbt> package
sbt> test
Configure SBT memory in .jvmopts:
-Xmx4g
-XX:ReservedCodeCacheSize=1g
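The `.jvmopts` fragment above can be created with a heredoc at the repository root, assuming `build/sbt` picks the file up on startup as described:

```shell
# Sketch: write .jvmopts at the repo root so the SBT launcher reads these
# JVM flags on startup, with no shell exports needed.
cat > .jvmopts <<'EOF'
-Xmx4g
-XX:ReservedCodeCacheSize=1g
EOF
```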
Hadoop Version Compatibility
Spark must be built against the same Hadoop version as your cluster. Protocol changes between Hadoop versions can cause compatibility issues.
Specifying Hadoop Version
# Hadoop 3.3.x
./build/mvn -Pyarn -Dhadoop.version=3.3.6 -DskipTests clean package
# Hadoop 3.4.x
./build/mvn -Pyarn -Dhadoop.version=3.4.3 -DskipTests clean package
Hadoop-Free Build
For YARN deployments, build without bundled Hadoop to avoid classpath conflicts:
./build/mvn -Phadoop-provided -Pyarn -DskipTests clean package
Provide Hadoop libraries through yarn.application.classpath.
Running Tests
Verify your installation by running tests.
Test with Maven
# Run all tests
./build/mvn test
# Run specific module tests
./build/mvn -pl :spark-sql_2.13 test
# Run specific test suite
./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.DataFrameSuite test
Test with SBT
# Run all tests
./build/sbt test
# Run specific test
./build/sbt "testOnly org.apache.spark.sql.DataFrameSuite"
PySpark Tests
# Build with Hive first
./build/mvn -DskipTests clean package -Phive
# Run Python tests
./python/run-tests
# Run specific module
./python/run-tests --modules=pyspark-sql
Tests require significant time and resources. Use -DskipTests during development builds.
Configuration
Essential Environment Variables
# Java installation
export JAVA_HOME=/path/to/java17
# Spark installation
export SPARK_HOME=/path/to/spark
export PATH=$SPARK_HOME/bin:$PATH
# Optional: Default master URL
export MASTER=spark://host:7077
# Optional: Hadoop configuration
export HADOOP_CONF_DIR=/etc/hadoop/conf
Spark Configuration Files
Copy and customize configuration templates:
cd $SPARK_HOME/conf
# Copy templates
cp spark-defaults.conf.template spark-defaults.conf
cp spark-env.sh.template spark-env.sh
cp log4j2.properties.template log4j2.properties
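Rather than copying each template by hand, a loop can activate every template in `conf/` that doesn't already have a live counterpart. A sketch, assuming `SPARK_HOME` is set:

```shell
# Sketch: copy each config template in conf/ to its live name, skipping any
# file that already has a non-template copy.
for t in "$SPARK_HOME"/conf/*.template; do
  [ -e "$t" ] || continue          # no templates matched the glob
  target="${t%.template}"
  [ -e "$target" ] || cp "$t" "$target"
done
```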
Edit spark-defaults.conf for default settings:
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/spark-logs
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 4g
spark.executor.memory 4g
Troubleshooting
Java Version Issues
# Check Java version
java -version
# If wrong version, update JAVA_HOME
export JAVA_HOME=/path/to/java17
Memory Errors During Build
# Increase Maven memory
export MAVEN_OPTS="-Xmx6g -XX:ReservedCodeCacheSize=256m"
# Or use build/mvn (automatically sets memory)
./build/mvn clean package
Permission Errors
# Don't run as root
# Fix ownership if needed
sudo chown -R $USER:$USER /path/to/spark
Filename Too Long (Encrypted Filesystems)
If you encounter “Filename too long” errors on encrypted filesystems:
Edit pom.xml and add to scala-maven-plugin:
<arg>-Xmax-classfile-name</arg>
<arg>128</arg>
Edit project/SparkBuild.scala and add:
scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128")
Next Steps
Now that you have Spark installed:
Quick Start Tutorial : Try the interactive tutorial to learn Spark basics
Configuration Guide : Customize Spark for your needs
Cluster Deployment : Deploy Spark on a cluster
Programming Guides : Deep dive into Spark’s APIs