This guide provides an overview of how Spark runs on clusters and the components involved in distributed execution.

Cluster Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

1. Connect to Cluster Manager

The SparkContext connects to a cluster manager (Standalone, YARN, or Kubernetes) which allocates resources across applications.

2. Acquire Executors

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.

3. Distribute Code

Spark sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.

4. Execute Tasks

Finally, SparkContext sends tasks to the executors to run.

Cluster Manager Types

Spark currently supports several cluster managers:

Standalone

A simple cluster manager included with Spark that makes it easy to set up a cluster.

Hadoop YARN

The resource manager in Hadoop 3. Well-suited for shared clusters with multiple frameworks.

Kubernetes

An open-source system for automating deployment, scaling, and management of containerized applications.

Architecture Considerations

Isolation: Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side and executor side.
However, data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Cluster Manager Agnostic: Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications.

Network Requirements

The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes.
You can pin the driver's port in the network configuration (by default Spark picks a random ephemeral port; avoid 7077, which is the standalone master's default):
spark.driver.port=7078
Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
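
The driver's port and advertised hostname can both be pinned at submit time so firewalls on the worker side can allow them; a sketch (driver-host.internal and myapp.jar are placeholders):

```shell
./bin/spark-submit \
  --conf spark.driver.port=7078 \
  --conf spark.driver.host=driver-host.internal \
  myapp.jar
```

spark.driver.host sets the hostname the driver advertises to executors, which matters on multi-homed machines.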

Submitting Applications

Applications can be submitted to a cluster of any type using the spark-submit script.

Basic Submission

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 2G \
  --total-executor-cores 10 \
  /path/to/examples.jar \
  1000

Master URLs

The --master parameter specifies the cluster manager to use:
  • local: Run Spark locally with one worker thread (no parallelism).
  • local[K]: Run Spark locally with K worker threads (ideally, set K to the number of cores on your machine).
  • local[*]: Run Spark locally with as many worker threads as logical cores on your machine.
  • spark://HOST:PORT: Connect to the given Spark standalone cluster master. The port is 7077 by default.
  • yarn: Connect to a YARN cluster in client or cluster mode.
  • k8s://HOST:PORT: Connect to a Kubernetes cluster in cluster mode.
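
The URL forms above are distinguishable by prefix alone; a rough sketch of the dispatch (parse_master is a hypothetical helper for illustration, not a Spark API):

```python
def parse_master(url: str) -> str:
    """Classify a Spark master URL into its cluster-manager type (illustrative only)."""
    if url == "local" or (url.startswith("local[") and url.endswith("]")):
        return "local"
    if url.startswith("spark://"):
        return "standalone"
    if url == "yarn":
        return "yarn"
    if url.startswith("k8s://"):
        return "kubernetes"
    raise ValueError(f"unrecognized master URL: {url}")

print(parse_master("local[*]"))                      # local
print(parse_master("spark://207.184.161.138:7077"))  # standalone
print(parse_master("k8s://https://kubernetes.example.com:443"))  # kubernetes
```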

Deploy Modes

Spark supports two deploy modes, distinguished by where the driver process runs: client mode and cluster mode.
In client mode, the driver runs on the machine where you submit the application.
./bin/spark-submit \
  --deploy-mode client \
  --master yarn \
  myapp.jar
Use When:
  • You need to see output immediately
  • Running interactive shells
  • Developing and testing
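
In cluster mode, the framework launches the driver inside the cluster, so the application keeps running even if the submitting machine disconnects. A minimal submission, assuming a YARN cluster and an application jar named myapp.jar:

```shell
./bin/spark-submit \
  --deploy-mode cluster \
  --master yarn \
  myapp.jar
```

Use When:
  • Running production applications that must keep running after the submitter disconnects
  • Submitting from a machine far from the worker nodes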

Resource Allocation

Standalone Mode

# Limit cores per application
./bin/spark-submit \
  --master spark://host:7077 \
  --conf spark.cores.max=20 \
  myapp.jar
By default, applications submitted to a standalone-mode cluster run in FIFO order, and each application will try to use all available nodes.
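
To apply the cap by default rather than per submission, the same property can be set in conf/spark-defaults.conf (path relative to the Spark installation; the value is illustrative):

```properties
spark.cores.max    20
```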

YARN Mode

# Configure executors and resources
./bin/spark-submit \
  --master yarn \
  --num-executors 50 \
  --executor-memory 4G \
  --executor-cores 2 \
  myapp.jar
Configuration options:
  • --num-executors - Number of executors to allocate
  • --executor-memory - Memory per executor
  • --executor-cores - Cores per executor
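
These three flags multiply into the application's total request; a quick sanity check of the example above (plain arithmetic, yarn_footprint is a hypothetical helper; note that YARN adds per-executor memory overhead on top, and the driver's resources are extra):

```python
def yarn_footprint(num_executors: int, executor_cores: int, executor_mem_gb: int):
    """Total cores and memory requested for executor containers alone."""
    return num_executors * executor_cores, num_executors * executor_mem_gb

cores, mem_gb = yarn_footprint(num_executors=50, executor_cores=2, executor_mem_gb=4)
print(cores, mem_gb)  # 100 200
```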

Kubernetes Mode

# Submit to Kubernetes
./bin/spark-submit \
  --master k8s://https://kubernetes.example.com:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=spark:latest \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2 \
  myapp.jar

Monitoring

Each driver program has a web UI, typically on port 4040, that displays information about:
  • Running tasks - See active and completed tasks
  • Executors - Monitor executor status and resources
  • Storage usage - View cached RDDs and memory usage
  • Environment - Check Spark properties and configuration
# Access the Spark UI
http://<driver-node>:4040
If multiple applications are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
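
Beyond the browser UI, the same information is exposed as JSON through Spark's monitoring REST API on the UI port; for example, listing the applications known to this driver:

```shell
# Fetch application metadata as JSON from the driver's UI port
curl http://<driver-node>:4040/api/v1/applications
```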

Terminology

  • Application: User program built on Spark. Consists of a driver program and executors on the cluster.
  • Application jar: A jar containing your Spark application. Should never include Hadoop or Spark libraries.
  • Driver program: The process running the main() function of the application and creating the SparkContext.
  • Cluster manager: An external service for acquiring resources on the cluster (e.g., standalone manager, YARN, Kubernetes).
  • Deploy mode: Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside the cluster. In “client” mode, the submitter launches the driver outside the cluster.
  • Worker node: Any node that can run application code in the cluster.
  • Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage.
  • Task: A unit of work that will be sent to one executor.
  • Job: A parallel computation consisting of multiple tasks spawned in response to a Spark action (e.g., save, collect).
  • Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other.

Best Practices

Each application gets its own executor processes. This provides isolation but means data cannot be shared across applications without external storage.
Run the driver close to worker nodes on the same local area network to minimize latency and maximize throughput.
Configure executor memory and cores based on your workload. Start with 2-4 cores per executor and adjust based on monitoring.
Use client mode for development and interactive work. Use cluster mode for production applications that need to run reliably.

Next Steps

Job Scheduling

Learn about resource allocation and scheduling within applications

Application Submission

Detailed guide on submitting Spark applications
