You can configure Spark through three main locations: Spark properties for application parameters, environment variables for per-machine settings, and logging configuration. This guide covers the main configuration options and related best practices.

Configuration Methods

Spark provides three locations to configure the system:

Spark Properties

Control application parameters using SparkConf or Java system properties

Environment Variables

Set per-machine settings through conf/spark-env.sh on each node

Logging

Configure logging behavior through log4j2.properties

Spark Properties

Spark properties control most application settings and are configured separately for each application. You can set these properties directly on a SparkConf object passed to your SparkContext.

Setting Properties in Code

Here’s how to initialize an application with basic configuration:
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)
We run with local[2], meaning two threads, which represents “minimal” parallelism and helps detect bugs that only manifest in a distributed context.

Property Value Formats

Spark accepts specific formats for time and size values.
Properties that specify a time duration should include a unit:
  • 25ms (milliseconds)
  • 5s (seconds)
  • 10m or 10min (minutes)
  • 3h (hours)
  • 5d (days)
  • 1y (years)
Properties that specify a byte size should include a unit:
  • 1b (bytes)
  • 1k or 1kb (kibibytes = 1024 bytes)
  • 1m or 1mb (mebibytes)
  • 1g or 1gb (gibibytes)
  • 1t or 1tb (tebibytes)
  • 1p or 1pb (pebibytes)
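As an illustration, the time formats above can be parsed like this. This is a minimal sketch in Python, not Spark's implementation; the function name and unit table are ours, and Spark's own parser is authoritative:

```python
import re

# Illustrative parser for Spark-style duration strings ("25ms", "5s",
# "10m"/"10min", "3h", "5d", "1y"), assuming the unit suffixes listed above.
_UNITS_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "y": 31_536_000_000,  # 365 days
}

def duration_to_ms(value: str) -> int:
    """Convert a Spark-style duration string to milliseconds."""
    match = re.fullmatch(r"(\d+)\s*([a-z]+)", value.strip())
    if not match:
        raise ValueError(f"Invalid duration: {value!r}")
    number, unit = match.groups()
    if unit not in _UNITS_MS:
        raise ValueError(f"Unknown time unit: {unit!r}")
    return int(number) * _UNITS_MS[unit]
```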

Dynamically Loading Properties

You can avoid hard-coding configurations by creating an empty conf and supplying values at runtime:
val sc = new SparkContext(new SparkConf())
Then supply configuration values when submitting:
./bin/spark-submit \
  --name "My app" \
  --master "local[4]" \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar

Configuration File

The spark-submit tool reads configuration options from conf/spark-defaults.conf, where each line consists of a key and value separated by whitespace:
spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
You can pass a custom properties file using --properties-file. When set, Spark will not load conf/spark-defaults.conf unless you also provide --load-spark-defaults.
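For example, a custom properties file for production (the file name conf/prod.conf and the values are hypothetical) might read:

```
spark.master            yarn
spark.executor.memory   8g
spark.eventLog.enabled  true
```

You would then submit with --properties-file conf/prod.conf, adding --load-spark-defaults if values from conf/spark-defaults.conf should still apply.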

Configuration Precedence

Configuration values take precedence in the following order:
  1. Properties set directly on SparkConf (highest precedence)
  2. Flags passed to spark-submit or spark-shell via --conf
  3. Options in the spark-defaults.conf file (lowest precedence)

Viewing Properties

You can view your application’s properties in the web UI at http://<driver>:4040 under the “Environment” tab. This helps verify that your properties are set correctly.
Only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other properties, the default value is used.

Application Properties

These properties control core application behavior:
| Property | Default | Description | Since |
|---|---|---|---|
| spark.app.name | (none) | The name of your application. This appears in the UI and in log data. | 0.9.0 |
| spark.master | (none) | The cluster manager to connect to (e.g., local, yarn, spark://host:port). | 0.9.0 |
| spark.submit.deployMode | client | Deploy mode: client or cluster. Client launches the driver locally; cluster launches it on one of the cluster nodes. | 1.5.0 |

Driver Configuration

| Property | Default | Description | Since |
|---|---|---|---|
| spark.driver.cores | 1 | Number of cores to use for the driver process, in cluster mode only. | 1.3.0 |
| spark.driver.memory | 1g | Amount of memory to use for the driver process (e.g., 512m, 2g). | 1.1.1 |
| spark.driver.maxResultSize | 1g | Limit on the total size of serialized results of all partitions for each action (e.g., collect). Should be at least 1M, or 0 for unlimited. | 1.2.0 |
| spark.driver.memoryOverhead | driverMemory * 0.10, min 384m | Amount of non-heap memory allocated per driver in cluster mode. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
In client mode, spark.driver.memory must be set through the --driver-memory command line option or in your properties file, not through SparkConf, because the driver JVM has already started.
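The default overhead rule in the table above (memory * 0.10, with a 384m floor) can be sketched as follows. This is an illustration of the formula only; the function name is ours, not Spark's:

```python
# Sketch of the default memoryOverhead rule: max(memory * 0.10, 384 MiB).
# The same rule applies to spark.executor.memoryOverhead.
def default_memory_overhead_mib(memory_mib: int, factor: float = 0.10,
                                minimum_mib: int = 384) -> int:
    return max(int(memory_mib * factor), minimum_mib)
```

For example, a 4g driver gets roughly 409m of overhead by default, while a 1g driver is lifted to the 384m minimum.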

Executor Configuration

| Property | Default | Description | Since |
|---|---|---|---|
| spark.executor.memory | 1g | Amount of memory to use per executor process (e.g., 512m, 2g). Minimum value is 450m. | 0.7.0 |
| spark.executor.cores | 1 in YARN mode; all available cores in standalone mode | Number of cores to use for each executor. | 1.0.0 |
| spark.executor.memoryOverhead | executorMemory * 0.10, min 384m | Amount of additional memory allocated per executor. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
| spark.executor.pyspark.memory | Not set | Amount of memory allocated to PySpark in each executor, in MiB. If not set, Python’s memory use is not limited. | 2.4.0 |
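As an illustration (the numbers are hypothetical, not recommendations), spark-defaults.conf entries sizing executors might read:

```
spark.executor.memory          4g
spark.executor.cores           4
spark.executor.memoryOverhead  512m
```

Setting spark.executor.memoryOverhead explicitly overrides the default rule (memory * 0.10, min 384m).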

Resource Allocation

For GPU and custom resource allocation:
| Property | Default | Description | Since |
|---|---|---|---|
| spark.driver.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use on the driver. Must also specify discoveryScript. | 3.0.0 |
| spark.executor.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use per executor. Must also specify discoveryScript. | 3.0.0 |
| spark.executor.resource.{resourceName}.vendor | None | Vendor of resources for executors (e.g., nvidia.com for GPUs on Kubernetes). | 3.0.0 |
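Putting these together, a hypothetical GPU configuration on Kubernetes might look like the following (the discovery script path is an assumption, not a Spark default):

```
spark.executor.resource.gpu.amount           1
spark.executor.resource.gpu.discoveryScript  /opt/spark/getGpus.sh
spark.executor.resource.gpu.vendor           nvidia.com
```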

Runtime Environment

Configure the execution environment for your application:
| Property | Default | Description | Since |
|---|---|---|---|
| spark.driver.extraClassPath | (none) | Extra classpath entries to prepend to the driver’s classpath. | 1.0.0 |
| spark.executor.extraClassPath | (none) | Extra classpath entries to prepend to executors’ classpath. | 1.0.0 |
| spark.driver.extraJavaOptions | (none) | Extra JVM options to pass to the driver. Cannot set max heap size (-Xmx) with this. | 1.0.0 |
| spark.executor.extraJavaOptions | (none) | Extra JVM options to pass to executors. Cannot set Spark properties or max heap size (-Xmx) with this. | 1.0.0 |
| spark.executorEnv.[EnvironmentVariableName] | (none) | Add an environment variable to the executor process. You can specify multiple variables. | 0.9.0 |

Python Configuration

| Property | Default | Description | Since |
|---|---|---|---|
| spark.pyspark.python | (none) | Python binary executable for PySpark in both driver and executors. | 2.1.0 |
| spark.pyspark.driver.python | spark.pyspark.python | Python binary executable for PySpark in the driver. | 2.1.0 |
| spark.python.worker.memory | 512m | Amount of memory to use per Python worker during aggregation. | 1.1.0 |
| spark.python.worker.reuse | true | Reuse Python workers. Useful when there is a large broadcast. | 1.2.0 |

Shuffle Behavior

Configure shuffle operations for optimal performance:
| Property | Default | Description | Since |
|---|---|---|---|
| spark.reducer.maxSizeInFlight | 48m | Maximum size of map outputs to fetch simultaneously from each reduce task. Represents a fixed memory overhead per reduce task. | 1.4.0 |
| spark.shuffle.compress | true | Whether to compress map output files. Uses spark.io.compression.codec. | 0.6.0 |
| spark.shuffle.file.buffer | 32k | Size of the in-memory buffer for each shuffle file output stream. These buffers reduce disk seeks and system calls. | 1.4.0 |
| spark.shuffle.spill.diskWriteBufferSize | 1024 * 1024 | Buffer size in bytes when writing sorted records to disk. | 2.3.0 |
| spark.shuffle.io.maxRetries | 3 | Fetches that fail due to IO exceptions are automatically retried. Helps stabilize large shuffles. | 1.2.0 |
| spark.shuffle.io.retryWait | 5s | How long to wait between retries of fetches. Maximum delay is maxRetries * retryWait. | 1.2.1 |
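For a shuffle-heavy job on an unreliable network, one might raise these values along these lines (the numbers are illustrative, not recommendations):

```
spark.reducer.maxSizeInFlight  96m
spark.shuffle.file.buffer      1m
spark.shuffle.io.maxRetries    6
spark.shuffle.io.retryWait     10s
```

With these settings the worst-case fetch retry delay grows from the default 3 * 5s = 15s to 6 * 10s = 60s.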

External Shuffle Service

| Property | Default | Description | Since |
|---|---|---|---|
| spark.shuffle.service.enabled | false | Enables the external shuffle service. This preserves shuffle files written by executors so executors can be safely removed. | 1.2.0 |
| spark.shuffle.service.port | 7337 | Port on which the external shuffle service will run. | 1.2.0 |
| spark.shuffle.service.name | spark_shuffle | Name of the Spark shuffle service the client should communicate with. Must match the YARN NodeManager configuration. | 3.2.0 |
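A common pairing, sketched below, enables the external shuffle service together with dynamic allocation so that executors can be released without losing their shuffle output:

```
spark.shuffle.service.enabled    true
spark.dynamicAllocation.enabled  true
```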

Storage Configuration

| Property | Default | Description | Since |
|---|---|---|---|
| spark.local.dir | /tmp | Directory for “scratch” space in Spark, including map output files and RDDs stored on disk. Should be on a fast, local disk. Can be a comma-separated list. | 0.5.0 |
spark.local.dir will be overridden by SPARK_LOCAL_DIRS (Standalone) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

Logging Configuration

| Property | Default | Description | Since |
|---|---|---|---|
| spark.logConf | false | Logs the effective SparkConf as INFO when the SparkContext starts. | 0.9.0 |
| spark.log.level | (none) | When set, overrides any user-defined log settings. Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. | 3.5.0 |
| spark.driver.log.localDir | (none) | Specifies a local directory to write driver logs and enable the Driver Log UI tab. | 4.0.0 |
| spark.driver.log.dfsDir | (none) | Base directory where Spark driver logs are synced when persistToDfs is enabled. | 3.0.0 |
| spark.driver.log.persistToDfs.enabled | false | If true, a Spark application in client mode will write driver logs to persistent storage. | 3.0.0 |
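As noted at the top of this guide, logging behavior itself is configured through log4j2.properties. A minimal fragment in Log4j 2 properties syntax, modeled loosely on the template Spark ships, might look like:

```
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```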

Best Practices

Deploy-related properties like spark.driver.memory and spark.executor.instances may not take effect when set programmatically through SparkConf at runtime; set these through configuration files or spark-submit command-line options. Runtime-control properties like spark.task.maxFailures can be set either way.
Set spark.driver.memory and spark.executor.memory based on your workload:
  • For memory-intensive operations, allocate more memory
  • Leave at least 25% of system memory for the OS and buffer cache
  • Use spark.driver.maxResultSize to prevent driver OOM errors
For shuffle-heavy workloads:
  • Increase spark.reducer.maxSizeInFlight if you have plenty of memory
  • Enable spark.shuffle.service.enabled for dynamic allocation
  • Tune spark.shuffle.io.maxRetries for unstable networks
Use --properties-file to maintain different configurations for development, staging, and production environments.

Next Steps

Performance Tuning

Optimize your Spark applications for better performance

Security Configuration

Secure your Spark cluster with authentication and encryption
