Configuration Methods
Spark provides three locations to configure the system:

- Spark Properties: control application parameters using SparkConf or Java system properties
- Environment Variables: set per-machine settings through conf/spark-env.sh on each node
- Logging: configure logging behavior through log4j2.properties

Spark Properties
Spark properties control most application settings and are configured separately for each application. You can set these properties directly on a SparkConf object passed to your SparkContext.

Setting Properties in Code

To initialize an application with basic configuration, build a SparkConf (for example by calling setMaster("local[2]") and setAppName(...) on it) and pass it to your SparkContext. Running with local[2] means two threads, which represents “minimal” parallelism. This helps detect bugs that only exist in distributed contexts.

Property Value Formats
Spark accepts specific formats for time and size values.

Time Duration

Properties that specify a time duration should include a unit:

- 25ms (milliseconds)
- 5s (seconds)
- 10m or 10min (minutes)
- 3h (hours)
- 5d (days)
- 1y (years)

Byte Size

Properties that specify a byte size should likewise include a unit, such as 1b (bytes), 1k or 1kb (kibibytes), 1m or 1mb (mebibytes), 1g or 1gb (gibibytes), 1t or 1tb (tebibytes), or 1p or 1pb (pebibytes).
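To make the duration format concrete, here is a hypothetical Python helper that converts such strings to milliseconds. It is an illustration only, not Spark's actual parser, which runs on the JVM.

```python
import re

# Milliseconds per unit, mirroring the duration suffixes listed above.
# Illustrative helper only -- not Spark's actual implementation.
UNITS_MS = {
    "ms": 1,
    "s": 1_000,
    "m": 60_000,
    "min": 60_000,
    "h": 3_600_000,
    "d": 86_400_000,
    "y": 31_536_000_000,  # 365 days
}

def duration_to_ms(value: str) -> int:
    """Parse a Spark-style duration such as '25ms', '5s', or '10min'."""
    match = re.fullmatch(r"(\d+)(ms|s|min|m|h|d|y)", value.strip())
    if match is None:
        raise ValueError(f"invalid duration: {value!r}")
    return int(match.group(1)) * UNITS_MS[match.group(2)]
```

For example, `duration_to_ms("3h")` returns 10800000.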
Dynamically Loading Properties
You can avoid hard-coding configuration by creating an empty SparkConf and supplying values at runtime through spark-submit or spark-shell.

Configuration File
The spark-submit tool reads configuration options from conf/spark-defaults.conf, where each line consists of a key and a value separated by whitespace:
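For example (the values below are illustrative):

```
spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
```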
You can pass a custom properties file using --properties-file. When set, Spark will not load conf/spark-defaults.conf unless you also provide --load-spark-defaults.

Configuration Precedence
Configuration values take precedence in the following order:

1. Properties set directly on SparkConf (highest precedence)
2. Flags passed to spark-submit or spark-shell via --conf
3. Options in the spark-defaults.conf file (lowest precedence)
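This resolution order can be sketched as follows (illustrative Python, not Spark's actual implementation):

```python
# Resolve a configuration key against the three sources in precedence order:
# SparkConf first, then --conf flags, then spark-defaults.conf entries.
def effective_value(key, spark_conf, submit_flags, defaults_file):
    for source in (spark_conf, submit_flags, defaults_file):
        if key in source:
            return source[key]
    return None
```

A value set on SparkConf wins even when the same key appears in the other sources.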
Viewing Properties
You can view your application’s properties in the web UI at http://<driver>:4040 under the “Environment” tab. This helps verify that your properties are set correctly.
Application Properties
These properties control core application behavior:

| Property | Default | Description | Since |
|---|---|---|---|
spark.app.name | (none) | The name of your application. This appears in the UI and log data. | 0.9.0 |
spark.master | (none) | The cluster manager to connect to (e.g., local, yarn, spark://host:port). | 0.9.0 |
spark.submit.deployMode | client | Deploy mode: client or cluster. Client launches driver locally, cluster launches it on one of the cluster nodes. | 1.5.0 |
Driver Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.cores | 1 | Number of cores to use for the driver process, only in cluster mode. | 1.3.0 |
spark.driver.memory | 1g | Amount of memory to use for the driver process (e.g., 512m, 2g). | 1.1.1 |
spark.driver.maxResultSize | 1g | Limit of total size of serialized results of all partitions for each action (e.g., collect). Should be at least 1M, or 0 for unlimited. | 1.2.0 |
spark.driver.memoryOverhead | driverMemory * 0.10, min 384m | Amount of non-heap memory to be allocated per driver in cluster mode. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
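For instance, driver and executor memory are commonly set at submit time rather than in code (an illustrative invocation; the application file name is hypothetical):

```
./bin/spark-submit --driver-memory 4g --executor-memory 8g my_app.py
```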
In client mode, spark.driver.memory must be set through the --driver-memory command-line option or in your properties file, not through SparkConf, because the driver JVM has already started.

Executor Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.executor.memory | 1g | Amount of memory to use per executor process (e.g., 512m, 2g). Minimum value is 450m. | 0.7.0 |
spark.executor.cores | 1 in YARN mode, all available in standalone | Number of cores to use for each executor. | 1.0.0 |
spark.executor.memoryOverhead | executorMemory * 0.10, min 384m | Amount of additional memory to be allocated per executor. Accounts for VM overheads, interned strings, and other native overheads. | 2.3.0 |
spark.executor.pyspark.memory | Not set | Amount of memory allocated to PySpark in each executor, in MiB. If not set, Python’s memory use is not limited. | 2.4.0 |
Resource Allocation
For GPU and custom resource allocation:

| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use on the driver. Must also specify discoveryScript. | 3.0.0 |
spark.executor.resource.{resourceName}.amount | 0 | Amount of a particular resource type to use per executor. Must also specify discoveryScript. | 3.0.0 |
spark.executor.resource.{resourceName}.vendor | (none) | Vendor of resources for executors (e.g., nvidia.com for GPUs on Kubernetes). | 3.0.0 |
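For example, requesting one GPU each for the driver and executors in spark-defaults.conf (the discovery-script path is illustrative; Spark ships a sample getGpusResources.sh among its example scripts):

```
spark.executor.resource.gpu.amount           1
spark.executor.resource.gpu.discoveryScript  /opt/spark/examples/src/main/scripts/getGpusResources.sh
spark.driver.resource.gpu.amount             1
spark.driver.resource.gpu.discoveryScript    /opt/spark/examples/src/main/scripts/getGpusResources.sh
```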
Runtime Environment
Configure the execution environment for your application:

| Property | Default | Description | Since |
|---|---|---|---|
spark.driver.extraClassPath | (none) | Extra classpath entries to prepend to the driver’s classpath. | 1.0.0 |
spark.executor.extraClassPath | (none) | Extra classpath entries to prepend to executors’ classpath. | 1.0.0 |
spark.driver.extraJavaOptions | (none) | Extra JVM options to pass to the driver. Cannot set max heap size (-Xmx) with this. | 1.0.0 |
spark.executor.extraJavaOptions | (none) | Extra JVM options to pass to executors. Cannot set Spark properties or max heap size (-Xmx) with this. | 1.0.0 |
spark.executorEnv.[EnvironmentVariableName] | (none) | Add environment variable to executor process. You can specify multiple variables. | 0.9.0 |
Python Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.pyspark.python | (none) | Python binary executable for PySpark in both driver and executors. | 2.1.0 |
spark.pyspark.driver.python | spark.pyspark.python | Python binary executable for PySpark in driver. | 2.1.0 |
spark.python.worker.memory | 512m | Amount of memory to use per Python worker during aggregation. | 1.1.0 |
spark.python.worker.reuse | true | Reuse Python workers. Useful when there is a large broadcast. | 1.2.0 |
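For example, pinning the driver and executors to the same interpreter (the paths are illustrative):

```
spark.pyspark.python         /usr/bin/python3
spark.pyspark.driver.python  /usr/bin/python3
```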
Shuffle Behavior
Configure shuffle operations for optimal performance:

| Property | Default | Description | Since |
|---|---|---|---|
spark.reducer.maxSizeInFlight | 48m | Maximum size of map outputs to fetch simultaneously from each reduce task. Represents fixed memory overhead per reduce task. | 1.4.0 |
spark.shuffle.compress | true | Whether to compress map output files. Uses spark.io.compression.codec. | 0.6.0 |
spark.shuffle.file.buffer | 32k | Size of the in-memory buffer for each shuffle file output stream. These buffers reduce disk seeks and system calls. | 1.4.0 |
spark.shuffle.spill.diskWriteBufferSize | 1024 * 1024 | Buffer size in bytes when writing sorted records to disk. | 2.3.0 |
spark.shuffle.io.maxRetries | 3 | Fetches that fail due to IO exceptions are automatically retried. Helps stabilize large shuffles. | 1.2.0 |
spark.shuffle.io.retryWait | 5s | How long to wait between retries of fetches. Maximum delay is maxRetries * retryWait. | 1.2.1 |
External Shuffle Service
| Property | Default | Description | Since |
|---|---|---|---|
spark.shuffle.service.enabled | false | Enables external shuffle service. This preserves shuffle files written by executors so executors can be safely removed. | 1.2.0 |
spark.shuffle.service.port | 7337 | Port on which the external shuffle service will run. | 1.2.0 |
spark.shuffle.service.name | spark_shuffle | Name of the Spark shuffle service the client should communicate with. Must match YARN NodeManager configuration. | 3.2.0 |
Storage Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.local.dir | /tmp | Directory for “scratch” space in Spark, including map output files and RDDs stored on disk. Should be on fast, local disk. Can be comma-separated list. | 0.5.0 |
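For example, spreading scratch space across two local SSDs as a comma-separated list (paths are illustrative):

```
spark.local.dir  /mnt/ssd1/spark,/mnt/ssd2/spark
```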
Logging Configuration
| Property | Default | Description | Since |
|---|---|---|---|
spark.logConf | false | Logs the effective SparkConf as INFO when SparkContext starts. | 0.9.0 |
spark.log.level | (none) | When set, overrides any user-defined log settings. Valid levels: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN. | 3.5.0 |
spark.driver.log.localDir | (none) | Specifies a local directory to write driver logs and enable Driver Log UI Tab. | 4.0.0 |
spark.driver.log.dfsDir | (none) | Base directory where Spark driver logs are synced when persistToDfs is enabled. | 3.0.0 |
spark.driver.log.persistToDfs.enabled | false | If true, Spark application in client mode will write driver logs to persistent storage. | 3.0.0 |
Best Practices

Separate Deploy and Runtime Properties
Deploy-related properties like spark.driver.memory and spark.executor.instances may not take effect when set programmatically through SparkConf at runtime; set these through configuration files or spark-submit command-line options. Runtime control properties like spark.task.maxFailures can be set either way.

Use Appropriate Memory Settings
Set spark.driver.memory and spark.executor.memory based on your workload:

- For memory-intensive operations, allocate more memory
- Leave at least 25% of system memory for the OS and buffer cache
- Use spark.driver.maxResultSize to prevent driver OOM errors
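A hypothetical starting point for a memory-heavy job following these guidelines (values are illustrative, not recommendations; tune for your workload):

```
spark.driver.memory         4g
spark.driver.maxResultSize  2g
spark.executor.memory       8g
```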
Configure Shuffle for Your Workload
For shuffle-heavy workloads:
- Increase spark.reducer.maxSizeInFlight if you have plenty of memory
- Enable spark.shuffle.service.enabled for dynamic allocation
- Tune spark.shuffle.io.maxRetries for unstable networks
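An illustrative combination for a shuffle-heavy job on an unreliable network (values are starting points, not recommendations):

```
spark.reducer.maxSizeInFlight  96m
spark.shuffle.service.enabled  true
spark.shuffle.io.maxRetries    10
spark.shuffle.io.retryWait     10s
```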
Environment-Specific Configuration
Use --properties-file to maintain different configurations for development, staging, and production environments.

Next Steps
- Performance Tuning: optimize your Spark applications for better performance
- Security Configuration: secure your Spark cluster with authentication and encryption
