Performance Bottlenecks
Spark applications commonly face these bottlenecks:

- Network bandwidth: most common when data fits in memory
- Memory usage: critical for large datasets and caching
- CPU: important for compute-intensive operations
If your data fits in memory, the bottleneck is usually network bandwidth. Focus on data serialization and reducing shuffle operations.
Data Serialization
Serialization plays a crucial role in the performance of any distributed application. Formats that are slow to serialize, or that consume many bytes, will greatly slow down computation.

Serialization Options
Spark provides two serialization libraries:

- Java serialization (default)
- Kryo serialization (recommended)

By default, Spark uses Java's ObjectOutputStream framework.

Advantages:
- Works with any class implementing java.io.Serializable
- No configuration required
- Flexible

Disadvantages:
- Often quite slow
- Produces large serialized formats
- Not recommended for production
Enabling Kryo Serialization
Switch to Kryo by setting the spark.serializer property on your SparkConf to org.apache.spark.serializer.KryoSerializer.

Registering Classes with Kryo

For best performance, register your custom classes with Kryo. Spark automatically includes Kryo serializers for many commonly used core Scala classes from the Twitter chill library.
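A minimal Java sketch of both steps, assuming spark-core is on the classpath (MyClass1 and MyClass2 are placeholders for your own types):

```java
import org.apache.spark.SparkConf;

public class KryoSetup {
    // Placeholder domain classes standing in for your own types.
    public static class MyClass1 implements java.io.Serializable {}
    public static class MyClass2 implements java.io.Serializable {}

    public static SparkConf build() {
        SparkConf conf = new SparkConf()
            // Switch from the default Java serializer to Kryo.
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // Register custom classes so Kryo can write compact class IDs
        // instead of full class names.
        conf.registerKryoClasses(new Class<?>[]{MyClass1.class, MyClass2.class});
        return conf;
    }
}
```

Unregistered classes still work with Kryo, but their full class name is written with every serialized object.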
Kryo Configuration
If your objects are large, you may also need to increase the spark.kryoserializer.buffer config so the buffer is large enough to hold the largest object you serialize.

Memory Tuning
There are three considerations in tuning memory usage: the amount of memory used by your objects, the cost of accessing those objects, and the overhead of garbage collection.

Memory Overhead in Java Objects
Java objects can consume 2-5x more space than the raw data due to:

Object Headers

Each Java object has about 16 bytes of overhead containing information such as a pointer to its class.

String Overhead

Java Strings have about 40 bytes of overhead over the raw string data and store each character as two bytes, so a 10-character string can consume 60 bytes.

Collection Overhead

Collections like HashMap and LinkedList use wrapper objects for each entry, adding headers and pointers (typically 8 bytes each).
Memory Management Overview
Memory usage in Spark falls under two categories:

- Execution memory: used for computation in shuffles, joins, sorts, and aggregations
- Storage memory: used for caching and propagating internal data across the cluster
This design ensures applications that don’t use caching can use the entire space for execution, avoiding unnecessary disk spills.
Key Memory Configuration
| Property | Default | Description |
|---|---|---|
| spark.memory.fraction | 0.6 | Size of M as a fraction of (JVM heap space - 300MiB). The rest (40%) is reserved for user data structures and safeguarding against OOM errors. |
| spark.memory.storageFraction | 0.5 | Size of R as a fraction of M. R is the storage space within M where cached blocks are immune to being evicted by execution. |
The typical user should not need to adjust these values, as the defaults work well for most workloads.
Determining Memory Consumption
The best way to size memory consumption for a dataset:

- Create an RDD and put it into the cache
- Look at the "Storage" page in the web UI at http://<driver>:4040
- The page shows how much memory the RDD occupies

To estimate the memory consumption of a particular object instead, use Spark's SizeEstimator.estimate() method.
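As a sketch (Java, assuming spark-core on the classpath), SizeEstimator can be called directly on any object:

```java
import org.apache.spark.util.SizeEstimator;

import java.util.ArrayList;

public class Estimate {
    public static void main(String[] args) {
        ArrayList<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1000; i++) boxed.add(i);
        // Approximate in-memory size of the whole object graph, in bytes.
        System.out.println(SizeEstimator.estimate(boxed));
    }
}
```

This is also useful for experimenting with different data layouts to trim memory usage.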
Tuning Data Structures
Reduce memory consumption by avoiding Java features that add overhead.

Use Arrays Instead of Collections
Design your data structures to prefer arrays of objects and primitive types over standard Java or Scala collection classes such as HashMap. Consider the fastutil library for primitive-type collections compatible with the Java standard library.
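To illustrate the trade-off with plain JDK types (a hypothetical dense-integer-key example, not a Spark API):

```java
import java.util.HashMap;
import java.util.Map;

public class DenseLookup {
    // Boxed version: each entry costs a HashMap node object
    // plus two boxed Integers, each with its own header.
    public static Map<Integer, Integer> boxedSquares(int n) {
        Map<Integer, Integer> m = new HashMap<>();
        for (int i = 0; i < n; i++) m.put(i, i * i);
        return m;
    }

    // Primitive version: one flat int[] with no per-entry objects,
    // usable whenever keys are dense integers.
    public static int[] primitiveSquares(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i * i;
        return a;
    }
}
```

Both store the same mapping; the array version holds it in a single object.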
Avoid Nested Structures
Avoid nested structures with many small objects and pointers when possible. Each object adds overhead.
Use Numeric IDs
Consider using numeric IDs or enumeration objects instead of strings for keys. Strings have significant overhead.
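A hypothetical sketch of interning string keys as compact int IDs (pure JDK, not a Spark API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StringInterner {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Map each distinct string to a small int; downstream records then
    // carry 4-byte IDs instead of full String objects.
    public int idOf(String s) {
        Integer id = ids.get(s);
        if (id == null) {
            id = names.size();
            ids.put(s, id);
            names.add(s);
        }
        return id;
    }

    // Recover the original string when needed, e.g. for output.
    public String nameOf(int id) {
        return names.get(id);
    }
}
```

Each distinct string is stored once; every record that refers to it pays only for an int.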
Use Compressed Oops
If you have less than 32 GiB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.

Serialized RDD Storage

When objects are still too large despite this tuning, store them in serialized form using a serialized storage level such as MEMORY_ONLY_SER. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization.
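For instance, a Java sketch (the input path and a running context are assumed):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SerializedCache {
    public static JavaRDD<String> cacheSerialized(JavaSparkContext sc, String path) {
        JavaRDD<String> lines = sc.textFile(path);
        // Store each partition as one large serialized byte array rather
        // than as deserialized Java objects: slower to access, far smaller.
        lines.persist(StorageLevel.MEMORY_ONLY_SER());
        return lines;
    }
}
```

The trade-off is extra CPU to deserialize objects on each access.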
Garbage Collection Tuning
JVM garbage collection can be a problem when you have large "churn" in terms of the RDDs stored by your program.

When GC Becomes an Issue
GC is usually not a problem in programs that:

- Read an RDD once
- Run many operations on it

It tends to become an issue in programs:

- Storing many RDDs
- Creating many short-lived objects
- Having high turnover of cached data
The cost of garbage collection is proportional to the number of Java objects. Using data structures with fewer objects (e.g., an array of Ints instead of a LinkedList) greatly lowers this cost.

Measuring GC Impact
To collect GC statistics, add GC logging flags to the Java options, for example -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps (or -Xlog:gc* on JDK 9 and later). These logs appear in the stdout files in the workers' work directories, not in your driver program.

Understanding JVM Memory Management
Generational Regions
Java heap space is divided into Young and Old generations. Young holds short-lived objects, Old holds objects with longer lifetimes.
GC Tuning Strategy
The goal is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation is sufficiently sized for short-lived objects.

Check for Too Many Collections

If a full GC is invoked multiple times before a task completes, there isn't enough memory available for executing tasks. Solution: increase executor memory or reduce the amount of cached data.
Allocate More Memory for Eden

If there are many minor collections but not many major GCs, allocating more memory for Eden would help. If the Eden size is estimated to be E, set the Young generation size with -Xmn=4/3*E. The 4/3 scaling accounts for space used by survivor regions.
Reduce Cache Size

If the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction. It's better to cache fewer objects than to slow down task execution.
Adjust G1GC Settings

Since Spark 4.0.0 uses JDK 17 by default, G1GC is the default garbage collector. With large executor heap sizes, increase the G1 region size with -XX:G1HeapRegionSize.
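As a sketch of passing GC options to executors (the 16m region size is an illustrative placeholder, not a recommendation; tune it to your heap size):

```java
import org.apache.spark.SparkConf;

public class G1Settings {
    public static SparkConf build() {
        // G1 region size must be a power of two between 1m and 32m.
        return new SparkConf().set(
            "spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -XX:G1HeapRegionSize=16m");
    }
}
```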
Practical GC Tuning Example
If your task reads data from HDFS, its memory use can be estimated from the size of the data block read:

- A decompressed block is often 2-3x the size of the block
- For 3-4 tasks' worth of working space with 128 MiB HDFS blocks, estimate Eden as 4*3*128 MiB
- Set the Young generation size with -Xmn accordingly
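The arithmetic works out as follows (the 4 tasks and 3x decompression factor come from the estimate above):

```java
public class EdenSizing {
    // Estimated Eden need in MiB:
    // tasksInFlight * decompressionFactor * blockSizeMiB.
    public static long edenMiB(int tasks, int decompressionFactor, int blockMiB) {
        return (long) tasks * decompressionFactor * blockMiB;
    }

    // Young generation = 4/3 * Eden, leaving room for survivor spaces.
    public static long youngGenMiB(long edenMiB) {
        return edenMiB * 4 / 3;
    }
}
```

With 4 tasks, a 3x factor, and 128 MiB blocks, E = 1536 MiB, so the Young generation would be set to about 2048 MiB (-Xmn=2048m).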
Other Performance Considerations
Level of Parallelism
Clusters will not be fully utilized unless you set the level of parallelism high enough. You can pass the level of parallelism as a second argument to shuffle operations, or set the spark.default.parallelism config property to change the default. We recommend 2-3 tasks per CPU core in your cluster.
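A minimal sketch of setting the default (the totalCores parameter and the 3x multiplier are assumptions following the 2-3 tasks-per-core guideline):

```java
import org.apache.spark.SparkConf;

public class ParallelismConfig {
    // Hypothetical helper: aim for roughly 3 tasks per core cluster-wide.
    public static SparkConf withParallelism(SparkConf conf, int totalCores) {
        return conf.set("spark.default.parallelism",
                        String.valueOf(totalCores * 3));
    }
}
```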
Parallel Listing on Input Paths
For jobs with large numbers of input directories, increase directory listing parallelism. For RDD jobs, set spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads; for Spark SQL file-based sources, tune spark.sql.sources.parallelPartitionDiscovery.threshold and spark.sql.sources.parallelPartitionDiscovery.parallelism.

Memory Usage of Reduce Tasks

If you get an OutOfMemoryError in reduce tasks, the working set of a single task (such as one of the reduce tasks in groupByKey) may be too large even though your RDDs fit in memory. Increase the level of parallelism so that each task's input set is smaller.

Broadcasting Large Variables

Use broadcast variables to reduce the size of each serialized task. If your tasks use any large object from the driver program (e.g., a static lookup table), consider broadcasting it. Tasks larger than about 20 KiB are probably worth optimizing.
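A Java sketch of broadcasting a lookup table (the table contents and enrichment logic are illustrative):

```java
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class Lookup {
    public static JavaRDD<String> enrich(JavaSparkContext sc,
                                         JavaRDD<String> ids,
                                         Map<String, String> table) {
        // Ship the table to each executor once, instead of
        // serializing it into every task closure.
        Broadcast<Map<String, String>> bc = sc.broadcast(table);
        return ids.map(id -> bc.value().getOrDefault(id, "unknown"));
    }
}
```

Without the broadcast, the table would be captured in the closure and resent with every task.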
Data Locality
Data locality can have a major impact on performance. If data and code are together, computation tends to be fast.

Locality levels, from closest to farthest:

- PROCESS_LOCAL: data is in the same JVM as the running code (best)
- NODE_LOCAL: data is on the same node (e.g., in HDFS or another executor)
- NO_PREF: data is accessed equally quickly from anywhere
- RACK_LOCAL: data is on the same rack of servers
- ANY: data is elsewhere on the network (worst)
Increase the spark.locality.wait settings if your tasks are long and see poor locality, but the defaults usually work well.
Performance Tuning Checklist
Persist Data Appropriately
Use serialized storage levels (e.g., MEMORY_ONLY_SER) for large datasets that don't fit in memory in deserialized form.
Next Steps
Hardware Provisioning
Learn how to provision hardware for optimal Spark performance
Spark Configuration
Explore detailed configuration options
