Cluster Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Connect to Cluster Manager
The SparkContext connects to a cluster manager (Standalone, YARN, or Kubernetes) which allocates resources across applications.
Acquire Executors
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
Distribute Code
Spark sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
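In practice, the code that gets shipped to executors is whatever you hand to Spark at submission time, for example via the --jars or --py-files flags of spark-submit. A minimal sketch (the file names and master are hypothetical placeholders):

```shell
# Ship a Python application plus a zip of its dependencies to the executors.
# "deps.zip" and "main.py" are illustrative names, not part of Spark itself.
./bin/spark-submit \
  --master yarn \
  --py-files deps.zip \
  main.py
```

For JVM applications, the equivalent is passing the application jar plus any extra jars via --jars.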
Cluster Manager Types
Spark currently supports several cluster managers:

Standalone
A simple cluster manager included with Spark that makes it easy to set up a cluster.
Hadoop YARN
The resource manager in Hadoop 3. Well-suited for shared clusters with multiple frameworks.
Kubernetes
An open-source system for automating deployment, scaling, and management of containerized applications.
Architecture Considerations
Isolation: Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side and executor side.
Cluster Manager Agnostic: Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications.
Network Requirements
You can configure the driver port via the spark.driver.port property in the network configuration.

Submitting Applications
Applications can be submitted to a cluster of any type using the spark-submit script.
Basic Submission
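A minimal submission passes the entry point, the master URL, and the application artifact to spark-submit. In this sketch the class name, host, and jar path are hypothetical placeholders:

```shell
# Submit a pre-built application jar to a standalone cluster master.
# org.example.MyApp, master-host, and the jar path are placeholders.
./bin/spark-submit \
  --class org.example.MyApp \
  --master spark://master-host:7077 \
  /path/to/my-app.jar \
  arg1 arg2
```

Any arguments after the jar path are forwarded to the application's main method.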
Master URLs
The --master parameter specifies the cluster manager to use:
| Master URL | Description |
|---|---|
| local | Run Spark locally with one worker thread (no parallelism). |
| local[K] | Run Spark locally with K worker threads (set K to the number of cores on your machine). |
| local[*] | Run Spark locally with as many worker threads as logical cores on your machine. |
| spark://HOST:PORT | Connect to a Spark standalone cluster master. The port is 7077 by default. |
| yarn | Connect to a YARN cluster in client or cluster mode. |
| k8s://HOST:PORT | Connect to a Kubernetes cluster in cluster mode. |
Deploy Modes
Spark supports two deploy modes:

- Client Mode
- Cluster Mode
- Cluster Mode
In client mode, the driver runs on the machine where you submit the application; in cluster mode, the framework launches the driver on a node inside the cluster.

Use client mode when:
- You need to see output immediately
- Running interactive shells
- Developing and testing
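The deploy mode is chosen with the --deploy-mode flag of spark-submit. A sketch of both variants (master and jar path are placeholders):

```shell
# Client mode (the default): the driver runs on the submitting machine.
./bin/spark-submit --master yarn --deploy-mode client /path/to/my-app.jar

# Cluster mode: the driver is launched on a node inside the cluster.
./bin/spark-submit --master yarn --deploy-mode cluster /path/to/my-app.jar
```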
Resource Allocation
Standalone Mode
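In standalone mode, resources are typically bounded with --executor-memory and --total-executor-cores, which caps the total cores the application may use across the cluster. The values and host below are illustrative:

```shell
# Cap the application at 16 cores cluster-wide, with 4 GiB per executor.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 4G \
  --total-executor-cores 16 \
  /path/to/my-app.jar
```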
YARN Mode
- --num-executors: Number of executors to allocate
- --executor-memory: Memory per executor
- --executor-cores: Cores per executor
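Putting the YARN flags together, a submission might look like this (the sizing values and jar path are illustrative, not recommendations):

```shell
# 10 executors, each with 4 GiB of memory and 4 cores, driver inside the cluster.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4G \
  --executor-cores 4 \
  /path/to/my-app.jar
```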
Kubernetes Mode
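On Kubernetes, the executor count and container image are usually set through --conf properties such as spark.executor.instances and spark.kubernetes.container.image. In this sketch the API server address, image name, class, and jar location are hypothetical placeholders:

```shell
# Run 5 executors from a custom Spark image; the jar lives inside the image.
./bin/spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=my-registry/spark:latest \
  --class org.example.MyApp \
  local:///opt/spark/jars/my-app.jar
```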
Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about:

- Running tasks - See active and completed tasks
- Executors - Monitor executor status and resources
- Storage usage - View cached RDDs and memory usage
- Environment - Check Spark properties and configuration
If multiple applications are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
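If that port range conflicts with other services, the starting port can be changed with the spark.ui.port property; for example:

```shell
# Start the application's web UI on port 4050 instead of 4040 (jar path is a placeholder).
./bin/spark-submit --conf spark.ui.port=4050 /path/to/my-app.jar
```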
Terminology
| Term | Definition |
|---|---|
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing your Spark application. Should never include Hadoop or Spark libraries; these are added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g., standalone manager, YARN, Kubernetes). |
| Deploy mode | Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside the cluster. In “client” mode, the submitter launches the driver outside the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks spawned in response to a Spark action (e.g., save, collect). |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other. |
Best Practices
Application Isolation
Each application gets its own executor processes. This provides isolation but means data cannot be shared across applications without external storage.
Network Locality
Run the driver close to worker nodes on the same local area network to minimize latency and maximize throughput.
Resource Configuration
Configure executor memory and cores based on your workload. Start with 2-4 cores per executor and adjust based on monitoring.
Deploy Mode Selection
Use client mode for development and interactive work. Use cluster mode for production applications that need to run reliably.
Next Steps
Job Scheduling
Learn about resource allocation and scheduling within applications
Application Submission
Detailed guide on submitting Spark applications
