Choosing the right hardware configuration is crucial for running Spark efficiently. While the optimal hardware depends on your specific workload, this guide provides general recommendations that work well for most Spark deployments.

Overview

Spark’s performance depends on five key hardware components:

  • Storage Systems: proximity to data sources like HDFS
  • Local Disks: for intermediate data and spills
  • Memory: for in-memory computation and caching
  • Network: for shuffle and data transfer
  • CPU Cores: for parallel task execution

Storage Systems

Since most Spark jobs read input data from external storage systems, placing Spark as close to this system as possible is critical for performance.

Co-location with HDFS

Run Spark on the same nodes as HDFS. The simplest approach is to set up a Spark standalone mode cluster on the same nodes and configure resource usage to avoid interference.

Hadoop configuration:
  • mapred.child.java.opts - Control per-task memory
  • mapreduce.tasktracker.map.tasks.maximum - Limit map tasks
  • mapreduce.tasktracker.reduce.tasks.maximum - Limit reduce tasks

Alternatively, run Hadoop and Spark on a common cluster manager like YARN.
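On the Spark side, a standalone worker's footprint on shared HDFS nodes can be capped in conf/spark-env.sh. The values below are illustrative, not recommendations:

```sh
# conf/spark-env.sh -- cap this node's resources given to Spark
SPARK_WORKER_CORES=8       # cores the worker may hand out to executors
SPARK_WORKER_MEMORY=24g    # total memory executors on this node may use
```

Setting these below the machine's full capacity leaves headroom for HDFS DataNode and MapReduce processes running on the same host.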
Data locality is crucial for Spark performance. Always aim to minimize the distance between computation and storage.

Local Disks

While Spark performs much of its computation in memory, it still uses local disks for data that doesn’t fit in RAM and for preserving intermediate output between stages.

Disk Configuration

1. Number of disks: We recommend 4-8 disks per node for optimal I/O throughput.

2. RAID configuration: Configure disks without RAID (just as separate mount points). This provides better I/O parallelism.

3. Mount options: On Linux, mount disks with the noatime option to reduce unnecessary writes:

   mount -o noatime /dev/sdb1 /mnt/disk1

4. Spark configuration: Set spark.local.dir to a comma-separated list of the local disks:

   spark.local.dir=/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4
If you’re running HDFS, it’s fine to use the same disks as HDFS. Spark will write to different directories.
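To illustrate why listing every disk matters: Spark spreads its shuffle and spill files across all directories in spark.local.dir, so each entry adds independent I/O bandwidth. A simplified sketch (the mount points are hypothetical, and real Spark uses hashing rather than this round-robin stand-in):

```python
# spark.local.dir entries, one per physical disk (hypothetical mounts).
spark_local_dir = "/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4"
dirs = spark_local_dir.split(",")

def dir_for_block(block_id: int) -> str:
    """Round-robin placement: a simplified stand-in for Spark's
    hash-based spreading of temp files across local directories."""
    return dirs[block_id % len(dirs)]

# Consecutive blocks land on different disks, so writes proceed in parallel.
print([dir_for_block(i) for i in range(6)])
```

With one directory per disk, concurrent tasks rarely queue behind the same spindle.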

Disk Recommendations

| Configuration   | Recommendation  | Notes                                        |
|-----------------|-----------------|----------------------------------------------|
| Number of disks | 4-8 per node    | More disks = more I/O parallelism            |
| RAID            | No RAID         | Better performance without RAID              |
| Disk type       | SSD or fast HDD | SSDs recommended for shuffle-heavy workloads |
| Mount option    | noatime         | Reduces write overhead                       |

Memory

Memory is one of the most important resources for Spark applications. Proper memory allocation ensures efficient in-memory computation while leaving enough for the operating system.

Memory Sizing Guidelines

Spark can run well with anywhere from 8 GiB to hundreds of gigabytes of memory per machine.
We recommend allocating at most 75% of the memory for Spark. Leave the rest for the operating system and buffer cache.

Memory Allocation Formula

Total Machine Memory: 256 GB
├─ OS and Buffer Cache: 64 GB (25%)
└─ Spark Available: 192 GB (75%)
   ├─ Executor 1: 64 GB
   ├─ Executor 2: 64 GB
   └─ Executor 3: 64 GB
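The 75/25 split above can be expressed as a small helper. This is a sketch of the sizing arithmetic only (the function name and defaults are illustrative, not Spark API):

```python
def spark_memory_plan(total_gb: int, spark_fraction: float = 0.75,
                      executors_per_node: int = 3):
    """Split one node's memory per the 75% guideline.
    Returns (spark_gb, os_gb, per_executor_gb)."""
    spark_gb = int(total_gb * spark_fraction)
    os_gb = total_gb - spark_gb
    per_executor_gb = spark_gb // executors_per_node
    return spark_gb, os_gb, per_executor_gb

print(spark_memory_plan(256))  # (192, 64, 64) -- matches the diagram above
```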

Memory Configuration Examples

Small nodes:

spark.executor.memory=6g
spark.driver.memory=4g

Leaves 2 GB for the OS on executor nodes.

Large nodes, single executor:

spark.executor.memory=48g
spark.executor.cores=8
spark.driver.memory=16g

Single executor per node with 16 GB reserved for the OS.

Large nodes, multiple executors:

spark.executor.memory=56g
spark.executor.cores=8
spark.executor.instances=3
spark.driver.memory=32g

Three executors per node, each with 56 GB. Total: 168 GB for Spark, 88 GB for the OS.

Network

In our experience, when data is in memory, many Spark applications become network-bound. A fast network is essential for shuffle-heavy workloads.

Network Recommendations

  • Network speed: Use a 10 Gigabit or higher network.
  • Network topology: Ensure low latency between nodes in the same rack.

This is especially true for “distributed reduce” applications such as group-bys, reduce-bys, and SQL joins.

Monitoring Network Usage

You can see how much data Spark shuffles across the network in the application’s monitoring UI at http://<driver-node>:4040. Look for:
  • Shuffle read/write metrics
  • Network I/O time
  • Data locality levels
If you see high shuffle volumes or poor data locality, consider:
  • Increasing network bandwidth
  • Optimizing your application to reduce shuffles
  • Improving data co-location

Network Configuration

| Metric    | Recommendation      | Use Case                       |
|-----------|---------------------|--------------------------------|
| Bandwidth | 10 Gbps minimum     | Standard workloads             |
| Bandwidth | 25-100 Gbps         | Large-scale shuffle operations |
| Latency   | < 1 ms within rack  | Low-latency requirements       |
| Latency   | < 5 ms across racks | Acceptable for most workloads  |
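A back-of-envelope calculation shows why bandwidth dominates shuffle-heavy jobs. The estimate below ignores protocol overhead, serialization, and disk time, so real transfers run slower; it is illustrative only:

```python
def transfer_seconds(data_gb: float, link_gbps: float) -> float:
    """Ideal wire time to move data_gb over a link of link_gbps.
    seconds = gigabytes * 8 bits/byte / gigabits-per-second."""
    return data_gb * 8 / link_gbps

print(transfer_seconds(1000, 1))   # 1 TB over  1 Gbps: 8000 s (~2.2 hours)
print(transfer_seconds(1000, 10))  # 1 TB over 10 Gbps:  800 s (~13 minutes)
```

A tenfold bandwidth increase cuts ideal shuffle time tenfold, which is why 10 Gbps is the recommended floor.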

CPU Cores

Spark scales well to tens of CPU cores per machine because it performs minimal sharing between threads. More cores allow more tasks to run in parallel.

CPU Recommendations

1. Minimum cores: Provision at least 8-16 cores per machine for good parallelism.

2. Scaling considerations: Once data is in memory, most applications are either CPU-bound or network-bound. Depending on your workload’s CPU cost, you may need more cores.

3. Hyperthreading: Hyperthreading can provide some benefit, but physical cores are more valuable than logical cores.

Core Configuration

# Allocate cores per executor
spark.executor.cores=4

# Total number of executors
spark.executor.instances=10

# Total cores used: 4 * 10 = 40 cores
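The total-cores arithmetic from the settings above, as a quick check (the dict is an illustrative stand-in for a parsed configuration, not a Spark API):

```python
# Illustrative parsed config; total parallelism = cores * instances.
conf = {"spark.executor.cores": 4, "spark.executor.instances": 10}
total_cores = conf["spark.executor.cores"] * conf["spark.executor.instances"]
print(total_cores)  # 40 tasks can run concurrently
```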
Spark involves minimal thread contention, so it scales efficiently with more cores. Don’t be afraid to use all available cores.

CPU Configuration Examples

For compute-intensive operations like machine learning:
spark.executor.cores=16
spark.executor.memory=32g
More cores per executor for better CPU utilization.
For typical ETL and analytics:
spark.executor.cores=4
spark.executor.memory=16g
Balanced core-to-memory ratio.
For caching-heavy applications:
spark.executor.cores=2
spark.executor.memory=32g
Fewer cores with more memory per executor.
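The three profiles above differ mainly in memory per core. A small sketch comparing the ratios (the profile names are labels for this illustration, not Spark settings):

```python
# (executor cores, executor memory in GB) for each profile above.
profiles = {
    "compute-intensive (ML)": (16, 32),
    "ETL / analytics":        (4, 16),
    "caching-heavy":          (2, 32),
}

for name, (cores, mem_gb) in profiles.items():
    print(f"{name}: {mem_gb / cores:g} GB per core")
```

Compute-heavy work runs at 2 GB/core, balanced ETL at 4 GB/core, and caching-heavy applications at 16 GB/core, so picking a profile is really picking a memory-per-core ratio.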

Hardware Provisioning Checklist

1. Assess your workload: Determine if your workload is CPU-bound, memory-bound, or network-bound.

2. Plan storage: Co-locate Spark with your data source when possible.

3. Configure disks: Set up 4-8 local disks per node without RAID.

4. Allocate memory: Use 75% of system memory for Spark, leaving 25% for the OS.

5. Provision network: Ensure at least 10 Gbps network connectivity.

6. Configure CPUs: Provision 8-16 cores per machine minimum.

7. Test and tune: Run representative workloads and adjust based on monitoring data.

Example Hardware Configurations

Suitable for development and small datasets:

| Component   | Specification    |
|-------------|------------------|
| Nodes       | 5-10 nodes       |
| CPU         | 8 cores per node |
| Memory      | 32 GB per node   |
| Local disks | 2-4 x 1 TB HDD   |
| Network     | 1 Gbps           |
| Storage     | Co-located HDFS  |

Total cluster: 40-80 cores, 160-320 GB RAM
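The cluster totals follow directly from the per-node figures. A quick check of the arithmetic (function name is illustrative):

```python
def cluster_totals(nodes: int, cores_per_node: int = 8,
                   mem_gb_per_node: int = 32):
    """Total (cores, memory in GB) for a uniform cluster."""
    return nodes * cores_per_node, nodes * mem_gb_per_node

print(cluster_totals(5))   # (40, 160) -- low end of the range
print(cluster_totals(10))  # (80, 320) -- high end of the range
```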

Cloud Deployment Recommendations

When deploying Spark on cloud platforms:

AWS

Recommended instance types:
  • r5d.4xlarge (128 GB RAM, 16 vCPUs)
  • r5d.8xlarge (256 GB RAM, 32 vCPUs)
  • i3.8xlarge (244 GB RAM, 32 vCPUs, NVMe SSDs)

Azure

Recommended instance types:
  • Standard_D16s_v3 (64 GB RAM, 16 vCPUs)
  • Standard_E32s_v3 (256 GB RAM, 32 vCPUs)
  • Standard_L16s_v2 (128 GB RAM, 16 vCPUs, NVMe SSDs)

GCP

Recommended instance types:
  • n1-highmem-16 (104 GB RAM, 16 vCPUs)
  • n1-highmem-32 (208 GB RAM, 32 vCPUs)
  • n2-standard-32 (128 GB RAM, 32 vCPUs)
Cloud instance costs vary by region and commitment level. Use reserved instances or committed use discounts for production clusters.
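The earlier memory and core guidelines combine naturally when sizing executors on a cloud instance. A hypothetical sketch (the function and its defaults are illustrative, not a Spark or cloud API), using figures matching the 256 GB / 32 vCPU rows above:

```python
def size_executors(mem_gb: int, vcpus: int, cores_per_executor: int = 8):
    """Apply the 75% memory rule, then split the node into executors
    of cores_per_executor each. Returns (n_executors, gb_per_executor)."""
    usable_gb = int(mem_gb * 0.75)       # leave 25% for OS and buffer cache
    n_executors = vcpus // cores_per_executor
    return n_executors, usable_gb // n_executors

print(size_executors(256, 32))  # (4, 48): four 8-core executors of 48 GB each
```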

Next Steps

Performance Tuning

Optimize your Spark applications for the hardware you’ve provisioned

Spark Configuration

Configure Spark properties to match your hardware setup
