Overview
This quick start guide provides a hands-on introduction to Apache Spark. You’ll learn how to use Spark’s interactive shell and write your first application.
Before starting, make sure you have Java 17 or 21 installed on your system. Check with java -version.
Download Spark
First, download a packaged release of Spark from the downloads page.
Download the package
Choose a pre-built package for any Hadoop version (or the "Hadoop free" binary). Extract the archive: tar -xzf spark-*.tgz
cd spark-*
Verify Java installation
Ensure Java is available on your PATH or set JAVA_HOME: java -version
# Should show Java 17 or 21
Test the installation
Run a simple example to verify everything works: ./bin/run-example SparkPi 10
You should see output that includes an approximation of Pi.
Interactive Shell: Python
The easiest way to start learning Spark is through the interactive Python shell.
Launch PySpark
Start the Python shell: ./bin/pyspark
Or if you installed PySpark with pip, simply run: pyspark
Basic Operations
Let’s create a DataFrame and perform some basic operations:
# Read the README file
textFile = spark.read.text("README.md")
# Count the number of rows
textFile.count()
# Output: 126
# Get the first row
textFile.first()
# Output: Row(value='# Apache Spark')
Filtering Data
Transform the DataFrame by filtering rows:
# Find lines containing "Spark"
linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
# Count filtered lines
linesWithSpark.count()
# Output: 15
# Chain operations together
textFile.filter(textFile.value.contains("Spark")).count()
# Output: 15
Perform more complex operations:
from pyspark.sql import functions as sf
# Find the line with the most words
textFile.select(
    sf.size(sf.split(textFile.value, r"\s+")).name("numWords")
).agg(sf.max(sf.col("numWords"))).collect()
# Output: [Row(max(numWords)=15)]
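For intuition, the same "split on whitespace, count words, take the max" logic can be sketched in plain Python over a small hypothetical sample (an analogy, not Spark code):

```python
import re

# Hypothetical sample lines standing in for README.md
lines = [
    "# Apache Spark",
    "Spark is a unified analytics engine for large-scale data",
]

# Split each line on runs of whitespace and count the resulting words,
# then take the maximum over all lines
num_words = [len([w for w in re.split(r"\s+", line) if w]) for line in lines]
print(max(num_words))  # → 9
```

This mirrors the select/agg pipeline above: the list comprehension plays the role of the `numWords` column, and `max()` plays the role of the aggregation.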
Word Count Example
Implement the classic MapReduce word count:
# Count words in the file
wordCounts = textFile.select(
    sf.explode(sf.split(textFile.value, r"\s+")).alias("word")
).groupBy("word").count()
# Collect and display results
wordCounts.show()
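For intuition, the same split-explode-group-count pipeline can be sketched in plain Python (an analogy, not Spark code; the sample lines are hypothetical):

```python
import re
from collections import Counter

# Hypothetical sample lines standing in for README.md
lines = ["# Apache Spark", "Spark is a unified analytics engine", "Spark runs everywhere"]

# "explode": flatten every line into individual words,
# then group identical words and count them
words = [w for line in lines for w in re.split(r"\s+", line) if w]
word_counts = Counter(words)

print(word_counts["Spark"])  # → 3 (appears once per line)
```

`Counter` here plays the role of `groupBy("word").count()`; in Spark the grouping happens in parallel across partitions.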
Caching Data
Cache frequently accessed data in memory:
# Cache the dataset
linesWithSpark.cache()
# First count - loads data into memory
linesWithSpark.count()
# Output: 15
# Second count - uses cached data (much faster)
linesWithSpark.count()
# Output: 15
Caching is extremely useful when you query the same dataset multiple times or run iterative algorithms.
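The effect of cache() can be mimicked in plain Python with memoization — a rough analogy for why the second count above is much faster (illustrative only, not how Spark implements caching):

```python
from functools import lru_cache

calls = {"n": 0}  # track how many times the expensive work actually runs

@lru_cache(maxsize=None)
def load_and_filter():
    # Stands in for re-reading the file and re-applying the filter
    calls["n"] += 1
    sample = ["# Apache Spark", "hello world", "Spark rocks"]  # hypothetical data
    return tuple(line for line in sample if "Spark" in line)

first = len(load_and_filter())   # computes and caches the result
second = len(load_and_filter())  # served from cache; no recomputation
print(first, second, calls["n"])  # → 2 2 1
```

Without the cache, every count would redo the load-and-filter work, which is exactly what happens to an uncached DataFrame.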
Interactive Shell: Scala
If you prefer Scala, you can use the Scala shell instead.
Launch Spark Shell
Start the Scala shell: ./bin/spark-shell
Basic Operations
// Read the README file
val textFile = spark.read.textFile("README.md")
// Count rows
textFile.count()
// Output: res0: Long = 126
// Get first row
textFile.first()
// Output: res1: String = # Apache Spark
// Filter lines containing "Spark"
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
// Count matches
textFile.filter(line => line.contains("Spark")).count()
// Output: res3: Long = 15
Advanced Operations
// Find the longest line
textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
// Output: res4: Int = 15
// Word count using flatMap
val wordCounts = textFile.flatMap(line => line.split(" "))
  .groupByKey(identity)
  .count()
wordCounts.collect()
Write Your First Application
Now let’s create a standalone Spark application.
Create a file named simple_app.py: """simple_app.py"""
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
# Read data
logFile = "README.md"
logData = spark.read.text(logFile).cache()
# Count lines containing 'a' and 'b'
numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print(f"Lines with a: {numAs}, lines with b: {numBs}")
spark.stop()
Run your application: # Using spark-submit
./bin/spark-submit --master "local[4]" simple_app.py
# Or using Python directly (if PySpark is pip installed)
python simple_app.py
Expected output: Lines with a: 46, lines with b: 23
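The core of the application is ordinary filter-and-count logic. Over a small in-memory sample (a hypothetical stand-in for README.md) the same computation in plain Python looks like:

```python
# Hypothetical sample lines standing in for README.md
sample = ["# Apache Spark", "Spark is fast", "big data"]

# Count lines containing each letter, as simple_app.py does per-row in Spark
num_as = sum(1 for line in sample if "a" in line)
num_bs = sum(1 for line in sample if "b" in line)

print(f"Lines with a: {num_as}, lines with b: {num_bs}")  # → Lines with a: 3, lines with b: 1
```

Spark parallelizes exactly this per-line predicate across partitions; the caching in the application avoids reading the file twice for the two counts.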
Create SimpleApp.scala: /* SimpleApp.scala */
import org.apache.spark.sql.SparkSession
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Simple Application")
      .getOrCreate()
    val logFile = "README.md"
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
Create build.sbt: name := "Simple Project"
version := "1.0"
scalaVersion := "2.13.0"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"
Build and run: # Package the application
sbt package
# Run with spark-submit
./bin/spark-submit \
--class "SimpleApp" \
--master "local[4]" \
target/scala-2.13/simple-project_2.13-1.0.jar
Create SimpleApp.java: /* SimpleApp.java */
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;
public class SimpleApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Simple Application")
        .getOrCreate();
    String logFile = "README.md";
    Dataset<String> logData = spark.read().textFile(logFile).cache();
    // The FilterFunction cast disambiguates the overloaded filter() for lambdas
    long numAs = logData.filter((FilterFunction<String>) s -> s.contains("a")).count();
    long numBs = logData.filter((FilterFunction<String>) s -> s.contains("b")).count();
    System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    spark.stop();
  }
}
Create pom.xml: <project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>simple-project</artifactId>
  <version>1.0</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.13</artifactId>
      <version>4.0.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
Build and run: # Package with Maven
mvn package
# Run with spark-submit
./bin/spark-submit \
--class "SimpleApp" \
--master "local[4]" \
target/simple-project-1.0.jar
Run Example Programs
Spark includes several example programs you can run immediately.
Python Examples
# Calculate Pi
./bin/spark-submit examples/src/main/python/pi.py 10
# Word count
./bin/spark-submit examples/src/main/python/wordcount.py README.md
Scala/Java Examples
# Calculate Pi
./bin/run-example SparkPi 10
# PageRank algorithm
./bin/run-example SparkPageRank data/mllib/pagerank_data.txt 10
Master URLs
You can run Spark in different modes by setting the master URL:
Run on your local machine: # Single thread
--master "local"
# Multiple threads (use all cores)
--master "local[*]"
# Specific number of threads
--master "local[4]"
Connect to a Spark cluster: # Standalone cluster
--master "spark://host:7077"
# YARN cluster
--master "yarn"
# Kubernetes cluster
--master "k8s://https://cluster-url"
Understanding Datasets and DataFrames
Since Spark 2.0, Datasets and DataFrames are the primary API. RDDs are still supported, but Datasets offer better performance thanks to Spark's query optimizer.
Key concepts:
Dataset: a strongly-typed collection of objects (Scala/Java)
DataFrame: a Dataset organized into named columns (like a database table)
Transformations: operations that define new Datasets (lazy evaluation)
Actions: operations that trigger computation and return results
Common Transformations
# Transformations (lazy - don't execute immediately)
filtered = df.filter(df.age > 21)
selected = df.select("name", "age")
grouped = df.groupBy("department").count()
Common Actions
# Actions (trigger computation)
count = df.count() # Count rows
first = df.first() # Get first row
rows = df.collect() # Get all rows
df.show() # Display data
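The transformation/action split can be mimicked with Python generators: building a generator expression (like a transformation) does no work until something consumes it (like an action). This is only an analogy for the laziness, not how Spark works internally:

```python
produced = {"n": 0}  # counts how many rows have actually been produced

def rows():
    # Stands in for reading rows from a data source
    for age in [18, 25, 30]:
        produced["n"] += 1
        yield age

# "Transformation": defines the pipeline but produces nothing yet
adults = (a for a in rows() if a > 21)
print(produced["n"])  # → 0

# "Action": consuming the generator triggers the actual work
result = list(adults)
print(result, produced["n"])  # → [25, 30] 3
```

In Spark, this deferral lets the optimizer see the whole pipeline (filter, select, groupBy) before deciding how to execute it.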
Next Steps
Congratulations on running your first Spark application!
Programming Guides: deep dive into Spark SQL, DataFrames, and Datasets
Deploy to Cluster: learn how to run Spark on a cluster
MLlib Guide: build machine learning pipelines
Structured Streaming: process real-time data streams
When running on a cluster, don’t call collect() on large datasets as it pulls all data to the driver. Use take() or write to distributed storage instead.
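The difference between take() and collect() can be seen with a plain-Python analogy using itertools.islice: take-style consumption stops after the requested number of elements, while collect-style consumption pulls everything (an analogy only; Spark additionally moves data across the network to the driver):

```python
from itertools import islice

produced = {"n": 0}  # counts elements actually materialized

def big_dataset():
    # Stands in for a large distributed dataset
    for i in range(1_000_000):
        produced["n"] += 1
        yield i

# take(5)-style: only the first 5 elements are ever materialized
first_five = list(islice(big_dataset(), 5))
print(first_five, produced["n"])  # → [0, 1, 2, 3, 4] 5
```

A collect()-style `list(big_dataset())` would materialize all one million elements, which on a cluster means shipping them all to the driver.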