Cluster Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Connect to Cluster Manager
The SparkContext connects to a cluster manager (Standalone, YARN, or Kubernetes) which allocates resources across applications.
Acquire Executors
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application.
Distribute Code
Spark sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.
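In practice, the code that gets shipped to executors is whatever you hand to Spark at submission time, for example via the --jars or --py-files flags of spark-submit. A minimal sketch (the file names and master are hypothetical placeholders):

```shell
# Ship a Python application plus a zip of its dependencies to the executors.
# "deps.zip" and "main.py" are illustrative names, not part of Spark itself.
./bin/spark-submit \
  --master yarn \
  --py-files deps.zip \
  main.py
```

For JVM applications, the equivalent is passing the application jar plus any extra jars via --jars.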
Cluster Manager Types
Spark currently supports several cluster managers:

Standalone
A simple cluster manager included with Spark that makes it easy to set up a cluster.
Hadoop YARN
The resource manager in Hadoop 3. Well-suited for shared clusters with multiple frameworks.
Kubernetes
An open-source system for automating deployment, scaling, and management of containerized applications.
Architecture Considerations
Isolation: Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side and executor side.
Cluster Manager Agnostic: Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications.
Network Requirements
You can configure the driver port via the spark.driver.port property in the network configuration.

Submitting Applications
Applications can be submitted to a cluster of any type using the spark-submit script.
Basic Submission
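A minimal submission passes the entry point, the master URL, and the application artifact to spark-submit. In this sketch the class name, host, and jar path are hypothetical placeholders:

```shell
# Submit a pre-built application jar to a standalone cluster master.
# org.example.MyApp, master-host, and the jar path are placeholders.
./bin/spark-submit \
  --class org.example.MyApp \
  --master spark://master-host:7077 \
  /path/to/my-app.jar \
  arg1 arg2
```

Any arguments after the jar path are forwarded to the application's main method.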
Master URLs
The --master parameter specifies the cluster manager to use:
| Master URL | Description |
|---|---|
| local | Run Spark locally with one worker thread (no parallelism). |
| local[K] | Run Spark locally with K worker threads (set K to the number of cores on your machine). |
| local[*] | Run Spark locally with as many worker threads as logical cores on your machine. |
| spark://HOST:PORT | Connect to a Spark standalone cluster master. The port is 7077 by default. |
| yarn | Connect to a YARN cluster in client or cluster mode. |
| k8s://HOST:PORT | Connect to a Kubernetes cluster in cluster mode. |
Deploy Modes
Spark supports two deploy modes:

- Client Mode
- Cluster Mode
- Cluster Mode
In client mode, the driver runs on the machine where you submit the application; in cluster mode, the framework launches the driver on a node inside the cluster.

Use client mode when:
- You need to see output immediately
- Running interactive shells
- Developing and testing
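The deploy mode is chosen with the --deploy-mode flag of spark-submit. A sketch of both variants (master and jar path are placeholders):

```shell
# Client mode (the default): the driver runs on the submitting machine.
./bin/spark-submit --master yarn --deploy-mode client /path/to/my-app.jar

# Cluster mode: the driver is launched on a node inside the cluster.
./bin/spark-submit --master yarn --deploy-mode cluster /path/to/my-app.jar
```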
Resource Allocation
Standalone Mode
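In standalone mode, resources are typically bounded with --executor-memory and --total-executor-cores, which caps the total cores the application may use across the cluster. The values and host below are illustrative:

```shell
# Cap the application at 16 cores cluster-wide, with 4 GiB per executor.
./bin/spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 4G \
  --total-executor-cores 16 \
  /path/to/my-app.jar
```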
YARN Mode
- --num-executors: Number of executors to allocate
- --executor-memory: Memory per executor
- --executor-cores: Cores per executor
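Putting the YARN flags together, a submission might look like this (the sizing values and jar path are illustrative, not recommendations):

```shell
# 10 executors, each with 4 GiB of memory and 4 cores, driver inside the cluster.
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 10 \
  --executor-memory 4G \
  --executor-cores 4 \
  /path/to/my-app.jar
```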
Kubernetes Mode
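On Kubernetes, the executor count and container image are usually set through --conf properties such as spark.executor.instances and spark.kubernetes.container.image. In this sketch the API server address, image name, class, and jar location are hypothetical placeholders:

```shell
# Run 5 executors from a custom Spark image; the jar lives inside the image.
./bin/spark-submit \
  --master k8s://https://k8s-apiserver:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=my-registry/spark:latest \
  --class org.example.MyApp \
  local:///opt/spark/jars/my-app.jar
```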
Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about:

- Running tasks - See active and completed tasks
- Executors - Monitor executor status and resources
- Storage usage - View cached RDDs and memory usage
- Environment - Check Spark properties and configuration
If multiple applications are running on the same host, they will bind to successive ports beginning with 4040 (4041, 4042, etc.).
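If that port range conflicts with other services, the starting port can be changed with the spark.ui.port property; for example:

```shell
# Start the application's web UI on port 4050 instead of 4040 (jar path is a placeholder).
./bin/spark-submit --conf spark.ui.port=4050 /path/to/my-app.jar
```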
Terminology
| Term | Definition |
|---|---|
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing your Spark application. Should never include Hadoop or Spark libraries; these are added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g., standalone manager, YARN, Kubernetes). |
| Deploy mode | Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside the cluster. In “client” mode, the submitter launches the driver outside the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks spawned in response to a Spark action (e.g., save, collect). |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other. |
Best Practices
Application Isolation
Each application gets its own executor processes. This provides isolation but means data cannot be shared across applications without external storage.
Network Locality
Run the driver close to worker nodes on the same local area network to minimize latency and maximize throughput.
Resource Configuration
Configure executor memory and cores based on your workload. Start with 2-4 cores per executor and adjust based on monitoring.
Deploy Mode Selection
Use client mode for development and interactive work. Use cluster mode for production applications that need to run reliably.
Next Steps
Job Scheduling
Learn about resource allocation and scheduling within applications
Application Submission
Detailed guide on submitting Spark applications
