Spark Streaming is the previous generation of Spark’s streaming engine. It no longer receives updates and is a legacy project. Spark has a newer, easier-to-use streaming engine called Structured Streaming, which you should use for your streaming applications and pipelines. See the Structured Streaming Programming Guide.
Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
Finally, processed data can be pushed out to filesystems, databases, and live dashboards. You can apply Spark’s machine learning and graph processing algorithms on data streams.
How It Works
Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called a discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Quick Example
Let’s look at a simple example of counting words in text data received from a TCP socket.
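A minimal sketch of the word count in the Python DStream API, assuming a local Spark installation and a text source on localhost port 9999 (for example, one started with `nc -lk 9999`); it needs a running Spark environment, so it is illustrative rather than standalone:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads
# and a batch interval of 1 second.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, then count each word in each batch.
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
word_counts = pairs.reduceByKey(lambda a, b: a + b)

# Print the first ten elements of each RDD generated in this DStream.
word_counts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # wait for the computation to terminate
```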
When these lines are executed, Spark Streaming only sets up the computation it will perform once started; no real processing has begun yet. To start the processing after all the transformations have been set up, you call start().
Initializing StreamingContext
To initialize a Spark Streaming program, a StreamingContext object must be created, which is the main entry point of all Spark Streaming functionality.
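A sketch of the initialization in Python, assuming pyspark is installed; the master can be a Spark, Mesos, or YARN cluster URL, or "local[n]" to run locally with n threads:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# The appName is shown on the cluster UI; "local[2]" runs locally
# with two threads (one to receive data, one to process it).
sc = SparkContext(master="local[2]", appName="MyStreamingApp")

# The batch interval (here 1 second) should be chosen based on the
# latency requirements of the application and available resources.
ssc = StreamingContext(sc, batchDuration=1)
```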
Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
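This per-RDD translation can be sketched without Spark at all. Below, each micro-batch is a plain Python list standing in for an RDD, and the flatMap step turns each batch of lines into a batch of words; this is a conceptual simulation of the semantics, not the Spark API:

```python
# Conceptual sketch: a DStream as a sequence of micro-batches,
# where each batch (a plain list here) stands in for one RDD.
line_batches = [
    ["hello world"],                     # batch at time t1
    ["hello spark", "spark streaming"],  # batch at time t2
]

def flat_map(batch, func):
    """Apply func to each element and concatenate the results,
    mirroring what flatMap does on each RDD in a DStream."""
    return [item for element in batch for item in func(element)]

# Applying flatMap on each batch of lines yields batches of words.
word_batches = [flat_map(batch, lambda line: line.split(" "))
                for batch in line_batches]

print(word_batches)
# [['hello', 'world'], ['hello', 'spark', 'spark', 'streaming']]
```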
Transformations on DStreams
Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs:
- map(func): Return a new DStream by passing each element of the source DStream through the function func.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
- filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
- reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
- window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
- countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
- reduceByKeyAndWindow(func, windowLength, slideInterval): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function over batches in a sliding window.
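To make the per-batch semantics concrete, here is a plain-Python simulation of reduceByKey on a single micro-batch, with a list of (K, V) tuples standing in for an RDD; this illustrates the aggregation, not the Spark API itself:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Aggregate the values for each key with func, mirroring
    what reduceByKey does on one micro-batch of (K, V) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# One micro-batch of (word, 1) pairs, as in the word count example.
batch = [("spark", 1), ("hello", 1), ("spark", 1)]
counts = reduce_by_key(batch, lambda a, b: a + b)
print(counts)  # {'spark': 2, 'hello': 1}
```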
Window Operations
Spark Streaming provides windowed computations, which allow you to apply transformations over a sliding window of data. Every window operation takes two parameters:
- window length: The duration of the window (e.g., 30 seconds)
- sliding interval: The interval at which the window operation is performed (e.g., 10 seconds)
Both parameters must be multiples of the batch interval of the source DStream.
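The arithmetic behind that rule can be sketched in plain Python: with a 10-second batch interval, a 30-second window and a 10-second slide mean each window covers the 3 most recent batches and advances 1 batch at a time (a conceptual simulation, not the Spark API):

```python
BATCH_INTERVAL = 10   # seconds
WINDOW_LENGTH = 30    # must be a multiple of the batch interval
SLIDE_INTERVAL = 10   # must be a multiple of the batch interval

batches_per_window = WINDOW_LENGTH // BATCH_INTERVAL  # 3 batches
batches_per_slide = SLIDE_INTERVAL // BATCH_INTERVAL  # 1 batch

def windowed(batches, per_window, per_slide):
    """Yield the union of the batches covered by each sliding window."""
    windows = []
    for end in range(per_window, len(batches) + 1, per_slide):
        window = [x for batch in batches[end - per_window:end] for x in batch]
        windows.append(window)
    return windows

# One list per batch interval; each window unions 3 consecutive batches.
batches = [["a"], ["b", "c"], ["d"], ["e", "f"]]
print(windowed(batches, batches_per_window, batches_per_slide))
# [['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'e', 'f']]
```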
Input Sources
Spark Streaming provides two categories of built-in streaming sources:
Basic Sources
- File streams: For reading data from files on any HDFS-compatible filesystem
- Socket streams: For reading data from a TCP socket connection
- Queue streams: For testing with queues of RDDs
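Queue streams are handy for local tests; a sketch assuming pyspark is installed and a local master, where each RDD pushed onto the queue is consumed as one batch of the stream:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "QueueStreamExample")
ssc = StreamingContext(sc, 1)

# Pre-build a queue of RDDs; each one becomes a single batch.
rdd_queue = [sc.parallelize(range(1, 11)) for _ in range(3)]
input_stream = ssc.queueStream(rdd_queue)
input_stream.pprint()

ssc.start()
ssc.awaitTermination()
```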
Advanced Sources
- Kafka: See the Kafka Integration Guide
- Kinesis: See the Kinesis Integration Guide
Output Operations
Output operations allow DStream’s data to be pushed out to external systems like databases or filesystems. Common output operations include:
- print(): Prints the first ten elements of every batch of data
- saveAsTextFiles(prefix, [suffix]): Saves the DStream’s contents as text files
- saveAsObjectFiles(prefix, [suffix]): Saves the DStream’s contents as serialized Java objects
- foreachRDD(func): Applies a function to each RDD generated from the stream
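foreachRDD is the most general of these. A sketch in Python, given some DStream such as the word_counts from the earlier word count; the send_to_sink helper is hypothetical, standing in for whatever external client you use:

```python
def send_to_sink(records):
    """Hypothetical sink: replace with your database or API client.
    Opening the connection here keeps it per-partition, not per-record."""
    for record in records:
        pass  # e.g. connection.send(record)

# Apply an arbitrary function to each RDD generated from the stream.
word_counts.foreachRDD(
    lambda rdd: rdd.foreachPartition(send_to_sink)
)
```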
Checkpointing
Checkpointing is essential for fault-tolerant stream processing. You can enable checkpointing by setting the checkpoint directory.
The checkpoint directory must be a path in an HDFS-compatible file system that is accessible by both the driver and executors.
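A sketch of enabling checkpointing together with the recoverable-context pattern, assuming pyspark is installed; the HDFS path and application name here are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs://namenode:8020/checkpoints/my-app"  # placeholder path

def create_context():
    # Build the context and the full DStream graph, then set the
    # checkpoint directory before returning.
    sc = SparkContext("local[2]", "CheckpointedApp")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint(CHECKPOINT_DIR)
    # ... define DStream transformations here ...
    return ssc

# Recreate the context from checkpoint data if it exists;
# otherwise build it fresh with create_context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
```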
Running the Application
After defining all the transformations, you start the streaming computation by calling ssc.start(), and wait for the processing to stop (manually or due to an error) with ssc.awaitTermination().
Linking
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifacts:
- Kafka: spark-streaming-kafka-0-10_2.12
- Kinesis: spark-streaming-kinesis-asl_2.12
Migration to Structured Streaming
For new applications, we strongly recommend using Structured Streaming instead of DStreams. Structured Streaming provides a more modern, easier-to-use API with better performance and more features.
