At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
What are RDDs?
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing collection in the driver program, and transforming it. You can:
- Persist an RDD in memory, allowing it to be reused efficiently across parallel operations
- Automatically recover from node failures

This automatic recovery is what makes RDDs resilient to failures in the cluster.
Initializing Spark
The first thing a Spark program must do is create a SparkContext object, which tells Spark how to access a cluster.
The appName parameter is a name for your application to show on the cluster UI. The master is a Spark or YARN cluster URL, or a special “local” string to run in local mode.
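As a minimal PySpark sketch (the application name MyApp and the local[2] master string are placeholder values; in practice the master is often supplied by spark-submit rather than hard-coded):

```python
from pyspark import SparkConf, SparkContext

# Placeholder app name and master URL
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")
sc = SparkContext(conf=conf)
```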
Creating RDDs
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system.

Parallelized Collections
Parallelized collections are created by calling SparkContext's parallelize method on an existing collection in your driver program.
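For example, assuming sc is the SparkContext created above:

```python
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # distData can now be operated on in parallel
```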
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster.
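The partition count can be passed as an optional second argument to parallelize (10 here is an arbitrary illustration):

```python
distData = sc.parallelize(data, 10)  # cut the dataset into 10 partitions
```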
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, and Amazon S3.

File Reading Notes
- All file-based input methods support running on directories, compressed files, and wildcards (e.g., textFile("/my/directory/*.txt"))
- The textFile method takes an optional second argument for controlling the number of partitions (see the sketch after this list)
- By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS)
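A short sketch of reading text files (the paths are placeholders):

```python
# One RDD element per line of the file
distFile = sc.textFile("data.txt")

# Wildcards and a minimum partition count also work
logs = sc.textFile("/my/directory/*.txt", 8)
```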
RDD Operations
RDDs support two types of operations:
- Transformations - Create a new dataset from an existing one (lazy evaluation)
- Actions - Return a value to the driver program after running a computation
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
Basic Example
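A minimal sketch of the classic line-length example, which also illustrates lazy evaluation (data.txt is a placeholder path):

```python
lines = sc.textFile("data.txt")            # nothing is read from disk yet
lineLengths = lines.map(lambda s: len(s))  # still lazy: only the lineage is recorded
totalLength = lineLengths.reduce(lambda a, b: a + b)  # reduce is an action, so Spark runs the job now
```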
Common Transformations
| Transformation | Description |
|---|---|
| map(func) | Return a new distributed dataset formed by passing each element through a function. |
| filter(func) | Return a new dataset formed by selecting those elements on which func returns true. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| groupByKey() | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. |
| reduceByKey(func) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where values for each key are aggregated. |
| sortByKey() | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset sorted by keys. |
| join(otherDataset) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs. |
| distinct() | Return a new dataset that contains the distinct elements of the source dataset. |
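To illustrate a few of these (a sketch, assuming an existing SparkContext sc):

```python
nums = sc.parallelize([1, 2, 2, 3, 4])

squares = nums.map(lambda x: x * x)        # transformation: lazy
evens = nums.filter(lambda x: x % 2 == 0)  # transformation: lazy
uniques = nums.distinct()                  # transformation: lazy

evens.collect()  # collect is an action, triggering computation: [2, 2, 4]
```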
Common Actions
| Action | Description |
|---|---|
| reduce(func) | Aggregate the elements of the dataset using a function (which takes two arguments and returns one). |
| collect() | Return all elements of the dataset as an array at the driver program. |
| count() | Return the number of elements in the dataset. |
| first() | Return the first element of the dataset. |
| take(n) | Return an array with the first n elements of the dataset. |
| saveAsTextFile(path) | Write the elements of the dataset as a text file in a given directory. |
| foreach(func) | Run a function on each element of the dataset. |
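A few actions in use (same sketch assumptions as above):

```python
rdd = sc.parallelize([5, 3, 1, 4, 2])

rdd.reduce(lambda a, b: a + b)  # 15
rdd.count()                     # 5
rdd.first()                     # 5
rdd.take(3)                     # [5, 3, 1]
```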
Working with Key-Value Pairs
Many Spark operations work on RDDs of key-value pairs.
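For example, the word-count pattern combines flatMap, map, and reduceByKey (data.txt is a placeholder):

```python
lines = sc.textFile("data.txt")
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # aggregate the counts per word
counts.collect()  # e.g. [("spark", 3), ("rdd", 7), ...]
```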
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations.

Storage Levels
| Storage Level | Description |
|---|---|
| MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached. This is the default level. |
| MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk. |
| MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). More space-efficient but more CPU-intensive to read. |
| DISK_ONLY | Store the RDD partitions only on disk. |
| MEMORY_ONLY_2 | Same as MEMORY_ONLY but replicate each partition on two cluster nodes. |
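A minimal persistence sketch, assuming an existing SparkContext sc:

```python
from pyspark import StorageLevel

lines = sc.textFile("data.txt")
lines.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions that don't fit in memory to disk
# lines.cache() is shorthand for persisting at the default MEMORY_ONLY level
lines.count()  # the first action computes the RDD and populates the cache
```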
Shared Variables
Spark provides two types of shared variables: broadcast variables and accumulators.

Broadcast Variables
Broadcast variables allow you to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
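A broadcast sketch:

```python
broadcastVar = sc.broadcast([1, 2, 3])  # ship the value to each machine once
broadcastVar.value                      # [1, 2, 3], readable inside tasks
```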
Accumulators
Accumulators are variables that are only “added” to through an associative and commutative operation. They can be used to implement counters or sums.
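A counter implemented with an accumulator (a sketch; only the driver can read the value):

```python
accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value  # 10, visible on the driver only
```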
Next Steps
- Cluster Overview: understand Spark’s cluster deployment modes
- Job Scheduling: learn about resource allocation and scheduling
