Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.
Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet or an untrusted network, it’s important to secure access to the cluster to prevent unauthorized applications from running on the cluster. See Spark Security before running Spark.

Launching Spark on YARN

Apache Hadoop does not support Java 17 as of 3.4.1, while Apache Spark requires at least Java 17 since 4.0.0, so a separate, newer JDK should be configured for Spark application processes, for example by setting JAVA_HOME through spark.yarn.appMasterEnv.JAVA_HOME and spark.executorEnv.JAVA_HOME.
Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager.
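A minimal sketch of pointing Spark at the Hadoop client configs before submitting; the /etc/hadoop/conf path is an assumption and varies by installation:

```shell
# Export the Hadoop client config location so spark-submit can find
# core-site.xml, yarn-site.xml, etc. (/etc/hadoop/conf is an assumed path).
export HADOOP_CONF_DIR=/etc/hadoop/conf
echo "HADOOP_CONF_DIR=$HADOOP_CONF_DIR"
```

Set this in the shell (or in conf/spark-env.sh) before every submission.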

Deploy Modes

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
To launch a Spark application in cluster mode:
$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10
The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of the Application Master. The client will periodically poll the Application Master for status updates and display them in the console.
In YARN mode, the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.
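Launching in client mode follows the same pattern; a sketch mirroring the cluster-mode example above, with the deploy mode swapped (requires a running YARN cluster, so run this against your own environment):

```shell
# Same SparkPi example, but the driver runs locally in the client process.
./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode client \
    examples/jars/spark-examples*.jar \
    10
```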

Adding Other JARs

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command:
$ ./bin/spark-submit \
    --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Preparations

Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website.

Spark Distribution Variants

There are two variants of Spark binary distributions:
The with-Hadoop distribution is pre-built with a certain version of Apache Hadoop and contains a built-in Hadoop runtime. By default, when a job is submitted to a Hadoop YARN cluster, this distribution will not populate YARN’s classpath into Spark, to prevent jar conflicts. To override this behavior, set:
spark.yarn.populateHadoopClasspath=true
The with-user-provided-Hadoop distribution does not contain a built-in Hadoop runtime, so it’s smaller, but users have to provide a Hadoop installation separately. Spark will populate YARN’s classpath by default in order to get the Hadoop runtime.
To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. If neither is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.
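A hedged sketch of pre-staging the Spark jars once so later submissions skip the upload; the local paths and HDFS location are assumptions, and the zero-compression `jar cv0f` packaging follows the pattern the Spark docs recommend (requires HDFS access, so run in your own environment):

```shell
# Bundle all Spark runtime jars into a single uncompressed archive.
jar cv0f spark-libs.jar -C "$SPARK_HOME/jars/" .
# Stage it in HDFS (the /spark/jars location is an assumed choice).
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put spark-libs.jar /spark/jars/
# Then submissions can reference the cached archive:
# --conf spark.yarn.archive=hdfs:///spark/jars/spark-libs.jar
```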

Debugging Your Application

In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an application has completed.

Log Aggregation

If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command:
yarn logs -applicationId <app ID>
This will print out the contents of all log files from all containers from the given application.
The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly.
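When only one container is of interest, recent Hadoop releases let you narrow the aggregated output (the -containerId flag is a Hadoop CLI option, not a Spark one; IDs are placeholders as above):

```shell
# Fetch logs for a single container instead of the whole application.
yarn logs -applicationId <app ID> -containerId <container ID>
```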

Without Log Aggregation

When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID.

Spark Properties

Here are some key YARN-specific Spark properties:

Application Master Configuration

  • spark.yarn.am.memory (default: 512m): Amount of memory to use for the YARN Application Master in client mode. In cluster mode, use spark.driver.memory instead.
  • spark.yarn.am.cores (default: 1): Number of cores to use for the YARN Application Master in client mode. In cluster mode, use spark.driver.cores instead.
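These properties can be passed at submit time with --conf; a sketch raising the client-mode Application Master's resources (values are illustrative, and this requires a running cluster):

```shell
# Give the client-mode AM more memory and cores than the defaults.
./bin/spark-submit \
    --master yarn \
    --deploy-mode client \
    --conf spark.yarn.am.memory=1g \
    --conf spark.yarn.am.cores=2 \
    --class org.apache.spark.examples.SparkPi \
    examples/jars/spark-examples*.jar \
    10
```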

Executor Configuration

  • spark.executor.instances (default: 2): The number of executors for static allocation. With spark.dynamicAllocation.enabled, the initial set of executors will be at least this large.
  • spark.yarn.executor.memoryOverhead (default: executor memory * 0.10, with minimum of 384): Amount of additional memory to be allocated per executor process, in MiB.

File Distribution

  • spark.yarn.archive (default: none): An archive containing needed Spark jars for distribution to the YARN cache. If set, this configuration replaces spark.yarn.jars.
  • spark.yarn.jars (default: none): List of libraries containing Spark code to distribute to YARN containers.
  • spark.yarn.dist.files (default: none): Comma-separated list of files to be placed in the working directory of each executor.
  • spark.yarn.dist.archives (default: none): Comma-separated list of archives to be extracted into the working directory of each executor.
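A sketch of shipping a data file and an extracted archive to every executor's working directory; the file names and HDFS paths here are purely illustrative:

```shell
# lookup.txt lands in each executor's working directory; model.tar.gz is
# extracted there under the alias "model" (the # fragment renames it).
./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --conf spark.yarn.dist.files=hdfs:///data/lookup.txt \
    --conf spark.yarn.dist.archives=hdfs:///data/model.tar.gz#model \
    --class my.main.Class \
    my-main-jar.jar
```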

Submission Configuration

  • spark.yarn.submit.file.replication (default: the default HDFS replication, usually 3): HDFS replication level for the files uploaded into HDFS for the application.
  • spark.yarn.stagingDir (default: current user’s home directory in the filesystem): Staging directory used while submitting applications.
  • spark.yarn.queue (default: default): The name of the YARN queue to which the application is submitted.

Configuring the External Shuffle Service

To start the Spark Shuffle Service on each NodeManager in your YARN cluster, follow these instructions:

1. Build Spark with YARN profile

Skip this step if you are using a pre-packaged distribution.

2. Locate the shuffle JAR

Locate the spark-<version>-yarn-shuffle.jar. This should be under $SPARK_HOME/common/network-yarn/target/scala-<version> if you are building Spark yourself, and under yarn if you are using a distribution.

3. Add JAR to NodeManager classpath

Add this jar to the classpath of all NodeManagers in your cluster.

4. Configure yarn-site.xml

In the yarn-site.xml on each node:
  • Add spark_shuffle to yarn.nodemanager.aux-services
  • Set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService
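The two bullets above correspond to entries like the following in yarn-site.xml (a sketch; merge the aux-services value with any services already listed, such as mapreduce_shuffle):

```xml
<!-- Register the Spark shuffle service as a NodeManager aux-service. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```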

5. Increase NodeManager heap size

Increase NodeManager's heap size by setting YARN_HEAPSIZE (1000 by default) in etc/hadoop/yarn-env.sh to avoid garbage collection issues during shuffle.

6. Restart NodeManagers

Restart all NodeManagers in your cluster.

Shuffle Service Properties

  • spark.yarn.shuffle.stopOnFailure (default: false): Whether to stop the NodeManager when there’s a failure in the Spark Shuffle Service’s initialization.
  • spark.shuffle.service.db.backend (default: ROCKSDB): When work-preserving restart is enabled in YARN, specifies the disk-based store used for the shuffle service state store.

Kerberos

Standard Kerberos support in Spark is covered in the Security page. In YARN mode, when accessing Hadoop file systems, aside from the default file system in the Hadoop configuration, Spark will also automatically obtain delegation tokens for the service hosting the staging directory of the Spark application.

YARN-specific Kerberos Configuration

  • spark.kerberos.keytab (default: none): The full path to the file that contains the keytab for the principal. This keytab will be copied to the node running the YARN Application Master via the YARN Distributed Cache.
  • spark.kerberos.principal (default: none): Principal to be used to login to KDC, while running on secure clusters.
  • spark.yarn.kerberos.relogin.period (default: 1m): How often to check whether the Kerberos TGT should be renewed.
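These properties also have spark-submit flag equivalents; a sketch of a Kerberized submission, where the principal and keytab path are placeholders for your environment (requires a secured cluster):

```shell
# Supply a keytab so a long-running application can renew its credentials.
./bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --principal user@EXAMPLE.COM \
    --keytab /path/to/user.keytab \
    --class my.main.Class \
    my-main-jar.jar
```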

Important Notes

  • Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
  • In cluster mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs).
  • In client mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in spark.local.dir.
  • The --files and --archives options support specifying file names with the # syntax, similar to Hadoop. For example, --files localtest.txt#appSees.txt uploads the file locally named localtest.txt, but the application sees it under the name appSees.txt.
