
Overview

This quick start guide provides a hands-on introduction to Apache Spark. You’ll learn how to use Spark’s interactive shell and write your first application.
Before starting, make sure you have Java 17 or 21 installed on your system. Check with java -version.

Download Spark

First, download a packaged release of Spark from the downloads page.
Step 1: Download the package

Choose a package pre-built for a recent Hadoop version (or the “Hadoop free” binary) and extract the archive:
tar -xzf spark-*.tgz
cd spark-*
Step 2: Verify Java installation

Ensure Java is available on your PATH or set JAVA_HOME:
java -version
# Should show Java 17 or 21
Step 3: Test the installation

Run a simple example to verify everything works:
./bin/run-example SparkPi 10
You should see output containing a computed approximation of Pi.

Interactive Shell: Python

The easiest way to start learning Spark is through the interactive Python shell.

Launch PySpark

Start the Python shell:
./bin/pyspark
Or if you installed PySpark with pip:
pyspark

Basic Operations

Let’s create a DataFrame and perform some basic operations:
# Read the README file
textFile = spark.read.text("README.md")

# Count the number of rows
textFile.count()
# Output: 126

# Get the first row
textFile.first()
# Output: Row(value='# Apache Spark')

Filtering Data

Transform the DataFrame by filtering rows:
# Find lines containing "Spark"
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))

# Count filtered lines
linesWithSpark.count()
# Output: 15

# Chain operations together
textFile.filter(textFile.value.contains("Spark")).count()
# Output: 15

Advanced Transformations

Perform more complex operations:
from pyspark.sql import functions as sf

# Find the word count of the line with the most words
textFile.select(
    sf.size(sf.split(textFile.value, r"\s+")).name("numWords")
).agg(sf.max(sf.col("numWords"))).collect()
# Output: [Row(max(numWords)=15)]

Word Count Example

Implement the classic MapReduce word count:
# Count words in the file
wordCounts = textFile.select(
    sf.explode(sf.split(textFile.value, r"\s+")).alias("word")
).groupBy("word").count()

# Collect and display results
wordCounts.show()

Caching Data

Cache frequently accessed data in memory:
# Cache the dataset
linesWithSpark.cache()

# First count - loads data into memory
linesWithSpark.count()
# Output: 15

# Second count - uses cached data (much faster)
linesWithSpark.count()
# Output: 15
Caching is extremely useful when you query the same dataset multiple times or run iterative algorithms.

Interactive Shell: Scala

If you prefer Scala, you can use the Scala shell instead.

Launch Spark Shell

./bin/spark-shell

Basic Operations

// Read the README file
val textFile = spark.read.textFile("README.md")

// Count rows
textFile.count()
// Output: res0: Long = 126

// Get first row
textFile.first()
// Output: res1: String = # Apache Spark

Filtering and Transformations

// Filter lines containing "Spark"
val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// Count matches
textFile.filter(line => line.contains("Spark")).count()
// Output: res3: Long = 15

Advanced Operations

// Find the word count of the line with the most words
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
// Output: res4: Int = 15

// Word count using flatMap
val wordCounts = textFile.flatMap(line => line.split(" "))
  .groupByKey(identity)
  .count()

wordCounts.collect()

Write Your First Application

Now let’s create a standalone Spark application.
Create a file named simple_app.py:
"""simple_app.py"""
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

# Read data
logFile = "README.md"
logData = spark.read.text(logFile).cache()

# Count lines containing 'a' and 'b'
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print(f"Lines with a: {numAs}, lines with b: {numBs}")

spark.stop()
Run your application:
# Using spark-submit
./bin/spark-submit --master "local[4]" simple_app.py

# Or using Python directly (if PySpark is pip installed)
python simple_app.py
Expected output:
Lines with a: 46, lines with b: 23

Run Example Programs

Spark includes several example programs you can run immediately.

Python Examples

# Calculate Pi
./bin/spark-submit examples/src/main/python/pi.py 10

# Word count
./bin/spark-submit examples/src/main/python/wordcount.py README.md

Scala/Java Examples

# Calculate Pi
./bin/run-example SparkPi 10

# PageRank algorithm
./bin/run-example SparkPageRank data/mllib/pagerank_data.txt 10

Master URLs

You can run Spark in different modes by setting the master URL:
Run on your local machine:
# Single thread
--master "local"

# Multiple threads (use all cores)
--master "local[*]"

# Specific number of threads
--master "local[4]"

Understanding Datasets and DataFrames

Spark 2.0 and later use Datasets and DataFrames as the primary API. RDDs are still supported, but Datasets generally offer better performance because Spark can optimize their query plans.
Key concepts:
  • Dataset: Strongly-typed collection of objects (Scala/Java)
  • DataFrame: Dataset with named columns (like a database table)
  • Transformations: Operations that create new Datasets (lazy evaluation)
  • Actions: Operations that trigger computation and return results

Common Transformations

# Transformations (lazy - don't execute immediately)
filtered = df.filter(df.age > 21)
selected = df.select("name", "age")
grouped = df.groupBy("department").count()

Common Actions

# Actions (trigger computation)
count = df.count()          # Count rows
first = df.first()          # Get first row
rows = df.collect()         # Get all rows
df.show()                   # Display data

Next Steps

Congratulations on running your first Spark application!

Programming Guides

Deep dive into Spark SQL, DataFrames, and Datasets

Deploy to Cluster

Learn how to run Spark on a cluster

MLlib Guide

Build machine learning pipelines

Structured Streaming

Process real-time data streams
When running on a cluster, don’t call collect() on large datasets as it pulls all data to the driver. Use take() or write to distributed storage instead.
