
Overview

This quick start guide provides a hands-on introduction to Apache Spark. You’ll learn how to use Spark’s interactive shell and write your first application.
Before starting, make sure you have Java 17 or 21 installed on your system. Check with java -version.

Download Spark

First, download a packaged release of Spark from the downloads page.
Step 1: Download the package

Choose a package pre-built for a recent Hadoop version (or the “Hadoop free” binary) and extract the archive:
tar -xzf spark-*.tgz
cd spark-*
Step 2: Verify Java installation

Ensure Java is available on your PATH or set JAVA_HOME:
java -version
# Should show Java 17 or 21
Step 3: Test the installation

Run a simple example to verify everything works:
./bin/run-example SparkPi 10
You should see output containing a computed approximation of Pi.

Interactive Shell: Python

The easiest way to start learning Spark is through the interactive Python shell.

Launch PySpark

Start the Python shell:
./bin/pyspark
Or if you installed PySpark with pip:
pyspark

Basic Operations

Let’s create a DataFrame and perform some basic operations:
# Read the README file
textFile = spark.read.text("README.md")

# Count the number of rows
textFile.count()
# Output: 126

# Get the first row
textFile.first()
# Output: Row(value='# Apache Spark')

Filtering Data

Transform the DataFrame by filtering rows:
# Find lines containing "Spark"
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# Count filtered lines
linesWithSpark.count()
# Output: 15

# Chain operations together
textFile.filter(textFile.value.contains("Spark")).count()
# Output: 15

Advanced Transformations

Perform more complex operations:
from pyspark.sql import functions as sf

# Find the word count of the line with the most words
textFile.select(
    sf.size(sf.split(textFile.value, r"\s+")).name("numWords")
).agg(sf.max(sf.col("numWords"))).collect()
# Output: [Row(max(numWords)=15)]

Word Count Example

Implement the classic MapReduce word count:
# Count words in the file
wordCounts = textFile.select(
    sf.explode(sf.split(textFile.value, r"\s+")).alias("word")
).groupBy("word").count()

# Collect and display results
wordCounts.show()

Caching Data

Cache frequently accessed data in memory:
# Cache the dataset
linesWithSpark.cache()

# First count - loads data into memory
linesWithSpark.count()
# Output: 15

# Second count - uses cached data (much faster)
linesWithSpark.count()
# Output: 15
Caching is extremely useful when you query the same dataset multiple times or run iterative algorithms.

Interactive Shell: Scala

If you prefer Scala, you can use the Scala shell instead.

Launch Spark Shell

./bin/spark-shell

Basic Operations

// Read the README file
val textFile = spark.read.textFile("README.md")

// Count rows
textFile.count()
// Output: res0: Long = 126

// Get first row
textFile.first()
// Output: res1: String = # Apache Spark

Filtering and Transformations

// Filter lines containing "Spark"
val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// Count matches
textFile.filter(line => line.contains("Spark")).count()
// Output: res3: Long = 15

Advanced Operations

// Find the word count of the line with the most words
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
// Output: res4: Int = 15

// Word count using flatMap
val wordCounts = textFile.flatMap(line => line.split(" "))
  .groupByKey(identity)
  .count()

wordCounts.collect()

Write Your First Application

Now let’s create a standalone Spark application.
Create a file named simple_app.py:
"""simple_app.py"""
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Read data
logFile = "README.md"
logData = spark.read.text(logFile).cache()

# Count lines containing 'a' and 'b'
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print(f"Lines with a: {numAs}, lines with b: {numBs}")

spark.stop()
Run your application:
# Using spark-submit
./bin/spark-submit --master "local[4]" simple_app.py

# Or using Python directly (if PySpark is pip installed)
python simple_app.py
Expected output:
Lines with a: 46, lines with b: 23

Run Example Programs

Spark includes several example programs you can run immediately.

Python Examples

# Calculate Pi
./bin/spark-submit examples/src/main/python/pi.py 10

# Word count
./bin/spark-submit examples/src/main/python/wordcount.py README.md

Scala/Java Examples

# Calculate Pi
./bin/run-example SparkPi 10

# PageRank algorithm
./bin/run-example SparkPageRank data/mllib/pagerank_data.txt 10

Master URLs

You can run Spark in different modes by setting the master URL:
Run on your local machine:
# Single thread
--master "local"

# Multiple threads (use all cores)
--master "local[*]"

# Specific number of threads
--master "local[4]"

Understanding Datasets and DataFrames

Spark 2.0 and later use Datasets and DataFrames as the primary API. RDDs are still supported, but Datasets generally offer better performance because Spark can optimize their query plans.
Key concepts:
  • Dataset: Strongly-typed collection of objects (Scala/Java)
  • DataFrame: Dataset with named columns (like a database table)
  • Transformations: Operations that create new Datasets (lazy evaluation)
  • Actions: Operations that trigger computation and return results

Common Transformations

# Transformations (lazy - don't execute immediately)
filtered = df.filter(df.age > 21)
selected = df.select("name", "age")
grouped = df.groupBy("department").count()

Common Actions

# Actions (trigger computation)
count = df.count()          # Count rows
first = df.first()          # Get first row
rows = df.collect()         # Get all rows
df.show()                   # Display data

Next Steps

Congratulations on running your first Spark application!

Programming Guides

Deep dive into Spark SQL, DataFrames, and Datasets

Deploy to Cluster

Learn how to run Spark on a cluster

MLlib Guide

Build machine learning pipelines

Structured Streaming

Process real-time data streams
When running on a cluster, don’t call collect() on large datasets as it pulls all data to the driver. Use take() or write to distributed storage instead.
