Configuration Methods
Spark provides three locations to configure the system:

- Spark Properties: control application parameters using SparkConf or Java system properties
- Environment Variables: set per-machine settings through conf/spark-env.sh on each node
- Logging: configure logging behavior through log4j2.properties

Spark Properties
Spark properties control most application settings and are configured separately for each application. You can set these properties directly on a SparkConf object passed to your SparkContext.

Setting Properties in Code

To initialize an application with basic configuration, build a SparkConf (for example by calling setMaster("local[2]") and setAppName(...) on it) and pass it to your SparkContext. Running with local[2] means two threads, which represents “minimal” parallelism. This helps detect bugs that only exist in distributed contexts.

Property Value Formats
Spark accepts specific formats for time and size values.

Time Duration

Properties that specify a time duration should include a unit:

- 25ms (milliseconds)
- 5s (seconds)
- 10m or 10min (minutes)
- 3h (hours)
- 5d (days)
- 1y (years)

Byte Size

Properties that specify a byte size should likewise include a unit, such as 1b (bytes), 1k or 1kb (kibibytes), 1m or 1mb (mebibytes), 1g or 1gb (gibibytes), 1t or 1tb (tebibytes), or 1p or 1pb (pebibytes).
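To make the duration format concrete, here is a hypothetical Python helper that converts such strings to milliseconds. It is an illustration only, not Spark's actual parser, which runs on the JVM.

```python
import re

# Milliseconds per unit, mirroring the duration suffixes listed above.
# Illustrative helper only -- not Spark's actual implementation.
UNITS_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "y": 31_536_000_000,  # 365 days
}

def duration_to_ms(value: str) -> int:
    """Parse a Spark-style duration such as '25ms', '5s', or '10min'."""
    match = re.fullmatch(r"(\d+)(ms|s|min|m|h|d|y)", value.strip())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    return int(match.group(1)) * UNITS_MS[match.group(2)]
```

For example, `duration_to_ms("3h")` returns 10800000.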
Dynamically Loading Properties
You can avoid hard-coding configuration by creating an empty SparkConf and supplying values at runtime through spark-submit or spark-shell.

Configuration File
The spark-submit tool reads configuration options from conf/spark-defaults.conf, where each line consists of a key and a value separated by whitespace:
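For example (the values below are illustrative):

```
spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
```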
You can pass a custom properties file using --properties-file. When set, Spark will not load conf/spark-defaults.conf unless you also provide --load-spark-defaults.

Configuration Precedence
Configuration values take precedence in the following order:

1. Properties set directly on SparkConf (highest precedence)
2. Flags passed to spark-submit or spark-shell via --conf
3. Options in the spark-defaults.conf file (lowest precedence)
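This resolution order can be sketched as follows (illustrative Python, not Spark's actual implementation):

```python
# Resolve a configuration key against the three sources in precedence order:
# SparkConf first, then --conf flags, then spark-defaults.conf entries.
def effective_value(key, spark_conf, submit_flags, defaults_file):
    for source in (spark_conf, submit_flags, defaults_file):
        if key in source:
            return source[key]
    return None
```

A value set on SparkConf wins even when the same key appears in the other sources.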
Viewing Properties
You can view your application’s properties in the web UI at http://<driver>:4040 under the “Environment” tab. This helps verify that your properties are set correctly.
Application Properties
These properties control core application behavior:

| Property | Default | Description | Since |
|---|---|---|---|
spark.app.name | (none) | The name of your application. This appears in the UI and log data. | 0.9.0 |
spark.master | (none) | The cluster manager to connect to (e.g., local, yarn, spark://host:port). | 0.9.0 |
spark.submit.deployMode | client | Deploy mode: client or cluster. Client launches driver locally, cluster launches it on one of the cluster nodes. | 1.5.0 |
Driver Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. | 1.3.0 |
spark.driver.memory | 1g | Amount of memory to use for the driver process (e.g., 512m, 2g). | 1.1.1 |
spark.driver.maxResultSize | 1g | Limit of total size of serialized results of all partitions for each action (e.g., collect). Should be at least 1M, or 0 for unlimited. | 1.2.0 |
spark.driver.memoryOverhead | driverMemory * 0.10, min 384m | Amount of non-heap memory to be allocated per driver in cluster mode. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
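For instance, driver and executor memory are commonly set at submit time rather than in code (an illustrative invocation; the application file name is hypothetical):

```
./bin/spark-submit --driver-memory 4g --executor-memory 8g my_app.py
```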
In client mode, spark.driver.memory must be set through the --driver-memory command-line option or in your properties file, not through SparkConf, because the driver JVM has already started.

Executor Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.executor.memory | 1g | Amount of memory to use per executor process (e.g., 512m, 2g). Minimum value is 450m. | 0.7.0 |
spark.executor.cores | 1 in YARN mode, all available in standalone | Number of cores to use for each executor. | 1.0.0 |
spark.executor.memoryOverhead | executorMemory * 0.10, min 384m | Amount of additional memory to be allocated per executor. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
spark.executor.pyspark.memory | Not set | Amount of memory allocated to PySpark in each executor, in MiB. If not set, Python’s memory use is not limited. | 2.4.0 |
Resource Allocation
For GPU and custom resource allocation:

| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use on the driver. Must also specify discoveryScript. | 3.0.0 |
spark.executor.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use per executor. Must also specify discoveryScript. | 3.0.0 |
spark.executor.resource.{resourceName}.vendor | (none) | Vendor of resources for executors (e.g., nvidia.com for GPUs on Kubernetes). | 3.0.0 |
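For example, requesting one GPU each for the driver and executors in spark-defaults.conf (the discovery-script path is illustrative; Spark ships a sample getGpusResources.sh among its example scripts):

```
spark.executor.resource.gpu.amount           1
spark.executor.resource.gpu.discoveryScript  /opt/spark/examples/src/main/scripts/getGpusResources.sh
spark.driver.resource.gpu.amount             1
spark.driver.resource.gpu.discoveryScript    /opt/spark/examples/src/main/scripts/getGpusResources.sh
```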
Runtime Environment
Configure the execution environment for your application:

| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.extraClassPath | (none) | Extra classpath entries to prepend to the driver’s classpath. | 1.0.0 |
spark.executor.extraClassPath | (none) | Extra classpath entries to prepend to executors’ classpath. | 1.0.0 |
spark.driver.extraJavaOptions | (none) | Extra JVM options to pass to the driver. Cannot set max heap size (-Xmx) with this. | 1.0.0 |
spark.executor.extraJavaOptions | (none) | Extra JVM options to pass to executors. Cannot set Spark properties or max heap size (-Xmx) with this. | 1.0.0 |
spark.executorEnv.[EnvironmentVariableName] | (none) | Add environment variable to executor process. You can specify multiple variables. | 0.9.0 |
Python Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.pyspark.python | (none) | Python binary executable for PySpark in both driver and executors. | 2.1.0 |
spark.pyspark.driver.python | spark.pyspark.python | Python binary executable for PySpark in driver. | 2.1.0 |
spark.python.worker.memory | 512m | Amount of memory to use per Python worker during aggregation. | 1.1.0 |
spark.python.worker.reuse | true | Reuse Python workers. Useful when there is a large broadcast. | 1.2.0 |
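For example, pinning the driver and executors to the same interpreter (the paths are illustrative):

```
spark.pyspark.python         /usr/bin/python3
spark.pyspark.driver.python  /usr/bin/python3
```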
Shuffle Behavior
Configure shuffle operations for optimal performance:

| Property | Default | Description | Since |
|---|---|---|---|
spark.reducer.maxSizeInFlight | 48m | Maximum size of map outputs to fetch simultaneously from each reduce task. Represents fixed memory overhead per reduce task. | 1.4.0 |
spark.shuffle.compress | true | Whether to compress map output files. Uses spark.io.compression.codec. | 0.6.0 |
spark.shuffle.file.buffer | 32k | Size of the in-memory buffer for each shuffle file output stream. These buffers reduce disk seeks and system calls. | 1.4.0 |
spark.shuffle.spill.diskWriteBufferSize | 1024 * 1024 | Buffer size in bytes when writing sorted records to disk. | 2.3.0 |
spark.shuffle.io.maxRetries | 3 | Fetches that fail due to IO exceptions are automatically retried. Helps stabilize large shuffles. | 1.2.0 |
spark.shuffle.io.retryWait | 5s | How long to wait between retries of fetches. Maximum delay is maxRetries * retryWait. | 1.2.1 |
External Shuffle Service
| Property | Default | Description | Since |
|---|---|---|---|
spark.shuffle.service.enabled | false | Enables external shuffle service. This preserves shuffle files written by executors so executors can be safely removed. | 1.2.0 |
spark.shuffle.service.port | 7337 | Port on which the external shuffle service will run. | 1.2.0 |
spark.shuffle.service.name | spark_shuffle | Name of the Spark shuffle service the client should communicate with. Must match YARN NodeManager configuration. | 3.2.0 |
Storage Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.local.dir | /tmp | Directory for “scratch” space in Spark, including map output files and RDDs stored on disk. Should be on fast, local disk. Can be comma-separated list. | 0.5.0 |
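For example, spreading scratch space across two local SSDs as a comma-separated list (paths are illustrative):

```
spark.local.dir  /mnt/ssd1/spark,/mnt/ssd2/spark
```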
Logging Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.logConf | false | Logs the effective SparkConf as INFO when SparkContext starts. | 0.9.0 |
spark.log.level | (none) | When set, overrides any user-defined log settings. Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. | 3.5.0 |
spark.driver.log.localDir | (none) | Specifies a local directory to write driver logs and enable Driver Log UI Tab. | 4.0.0 |
spark.driver.log.dfsDir | (none) | Base directory where Spark driver logs are synced when persistToDfs is enabled. | 3.0.0 |
spark.driver.log.persistToDfs.enabled | false | If true, Spark application in client mode will write driver logs to persistent storage. | 3.0.0 |
Best Practices

Separate Deploy and Runtime Properties
Deploy-related properties like spark.driver.memory and spark.executor.instances may not take effect when set programmatically through SparkConf at runtime; set these through configuration files or spark-submit command-line options. Runtime control properties like spark.task.maxFailures can be set either way.

Use Appropriate Memory Settings
Set spark.driver.memory and spark.executor.memory based on your workload:

- For memory-intensive operations, allocate more memory
- Leave at least 25% of system memory for the OS and buffer cache
- Use spark.driver.maxResultSize to prevent driver OOM errors
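A hypothetical starting point for a memory-heavy job following these guidelines (values are illustrative, not recommendations; tune for your workload):

```
spark.driver.memory         4g
spark.driver.maxResultSize  2g
spark.executor.memory       8g
```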
Configure Shuffle for Your Workload
For shuffle-heavy workloads:
- Increase spark.reducer.maxSizeInFlight if you have plenty of memory
- Enable spark.shuffle.service.enabled for dynamic allocation
- Tune spark.shuffle.io.maxRetries for unstable networks
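An illustrative combination for a shuffle-heavy job on an unreliable network (values are starting points, not recommendations):

```
spark.reducer.maxSizeInFlight  96m
spark.shuffle.service.enabled  true
spark.shuffle.io.maxRetries    10
spark.shuffle.io.retryWait     10s
```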
Environment-Specific Configuration
Use --properties-file to maintain different configurations for development, staging, and production environments.

Next Steps
- Performance Tuning: optimize your Spark applications for better performance
- Security Configuration: secure your Spark cluster with authentication and encryption
