Core Components
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Driver Program
The driver program runs the main() function of your application and creates the SparkContext. It is responsible for:
- Converting your program into tasks
- Scheduling tasks on executors
- Coordinating task execution across the cluster
Cluster Manager
To run on a cluster, the SparkContext connects to one of several types of cluster managers, which allocate resources across applications:
- Standalone - A simple cluster manager included with Spark that makes it easy to set up a cluster
- Hadoop YARN - The resource manager in Hadoop 3
- Kubernetes - An open-source system for automating deployment, scaling, and management of containerized applications
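The cluster manager is selected with the `--master` flag of spark-submit. Hedged examples for each manager type (host names, ports, and `app.py` are placeholders):

```shell
# Standalone cluster manager
spark-submit --master spark://master-host:7077 app.py

# Hadoop YARN (the master is resolved from the Hadoop configuration)
spark-submit --master yarn app.py

# Kubernetes (points at the API server)
spark-submit --master k8s://https://k8s-apiserver:6443 app.py
```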
Executors
Once connected, Spark acquires executors on nodes in the cluster. Executors are processes that:
- Run computations for your application
- Store data in memory or disk storage
- Stay up for the duration of the whole application
- Run tasks in multiple threads
Tasks
The SparkContext sends tasks to the executors to run. A task is a unit of work sent to one executor.
Architecture Principles
There are several important aspects of this architecture:
Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs).
Network Considerations
The driver program must listen for and accept incoming connections from its executors throughout its lifetime. This means:
- The driver must be network addressable from the worker nodes
- The driver should be run close to the worker nodes, preferably on the same local area network
The relevant settings, such as spark.driver.port, are described in the network configuration section.
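For example, pinning the driver to a fixed port (useful when firewall rules must be opened for it) can be done at submission time; the port number below is only an illustration:

```shell
spark-submit --conf spark.driver.port=51000 app.py
```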
Execution Model
When you run a Spark application:
- Application Submission - Submit your application using spark-submit
- Resource Allocation - The cluster manager allocates resources and launches executors
- Code Distribution - Spark sends your application code (JAR or Python files) to the executors
- Task Execution - SparkContext sends tasks to executors to run
- Result Collection - Results are returned to the driver program
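The submission step above can be sketched with spark-submit (the master URL, deploy mode, and file names are placeholders):

```shell
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode client \
  --py-files helpers.py \
  my_app.py
```

Here `--py-files` ships the listed Python files to the executors, covering the code-distribution step; for a JAR-based application you would pass the application jar instead of a Python file.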
Jobs, Stages, and Tasks
Spark organizes computation into a hierarchy:
- Job - A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect)
- Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce)
- Task - A unit of work sent to one executor
Monitoring
Each driver program has a web UI, typically on port 4040, that displays information about:
- Running tasks
- Executors
- Storage usage
You can access the UI by opening http://<driver-node>:4040 in a web browser.
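If port 4040 is already in use (Spark will otherwise probe successive ports for additional applications on the same host), the UI port can be set explicitly; the value below is only an illustration:

```shell
spark-submit --conf spark.ui.port=4050 my_app.py
```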
Terminology Reference
| Term | Meaning |
|---|---|
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. The jar should never include Hadoop or Spark libraries; these are added at runtime. |
| Driver program | The process running the main() function of the application and creating the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, YARN, Kubernetes). |
| Deploy mode | Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster. |
| Worker node | Any node that can run application code in the cluster. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other. |
Next Steps
RDD Programming
Learn how to work with Resilient Distributed Datasets
Job Scheduling
Understand how Spark schedules and allocates resources
