Spark Streaming is the previous generation of Spark’s streaming engine. Spark Streaming no longer receives updates and is a legacy project. There is a newer and easier-to-use streaming engine in Spark called Structured Streaming. You should use Structured Streaming for your streaming applications and pipelines. See the Structured Streaming Programming Guide.

Overview

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems, databases, and live dashboards. You can apply Spark’s machine learning and graph processing algorithms on data streams.

How It Works

Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
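The micro-batch idea can be illustrated without Spark at all. The following pure-Python sketch (not the Spark API; `discretize` is a hypothetical helper) groups timestamped records into fixed-interval batches, which is conceptually what Spark Streaming does before handing each batch to the Spark engine:

```python
# Conceptual sketch of discretization -- NOT the Spark API.
# Records are (timestamp_seconds, value) pairs; each batch covers one interval.

def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive fixed-size batches."""
    batches = {}
    for ts, value in records:
        index = int(ts // batch_interval)
        batches.setdefault(index, []).append(value)
    if not batches:
        return []
    return [batches.get(i, []) for i in range(max(batches) + 1)]

# A few seconds of events, discretized into 1-second batches
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (3.0, "d")]
print(discretize(events, 1.0))  # [['a', 'b'], ['c'], [], ['d']]
```

Each inner list here plays the role of one batch; in real Spark Streaming each batch is an RDD processed by the full Spark engine.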

Quick Example

Let’s look at a simple example of counting words in text data received from a TCP socket:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to hostname:port, like localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))

# Count each word in each batch
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate
When these lines are executed, Spark Streaming only sets up the computation it will perform once started; no real processing has begun yet. To start the processing after all the transformations have been set up, call start(). To give the example something to consume, first run a small data server such as Netcat (nc -lk 9999) in a separate terminal.

Initializing StreamingContext

To initialize a Spark Streaming program, you must create a StreamingContext object, which is the main entry point of all Spark Streaming functionality.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# master is a cluster URL, or "local[n]" to run locally with n threads
sc = SparkContext(master, appName)
ssc = StreamingContext(sc, 1)  # batch interval of 1 second
The batch interval must be set based on the latency requirements of your application and available cluster resources.

Key Points

  • Once a context has been started, no new streaming computations can be set up or added to it
  • Once a context has been stopped, it cannot be restarted
  • Only one StreamingContext can be active in a JVM at the same time
  • stop() on StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter stopSparkContext to false

Discretized Streams (DStreams)

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
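Because a DStream is just a sequence of RDDs, a DStream transformation is the same RDD transformation applied to every batch. The following pure-Python sketch (lists stand in for RDDs; `flat_map` is a hypothetical helper, not Spark code) mirrors how flatMap on the lines DStream yields the words DStream:

```python
# Pure-Python sketch: a list stands in for an RDD, a list of lists for a DStream.

def flat_map(func, batch):
    """Apply func to each element and flatten the results, like RDD.flatMap."""
    return [item for element in batch for item in func(element)]

# Each inner list is one batch of the "lines" DStream
lines_dstream = [["hello world"], ["hello spark", "streaming"]]

# Applying the transformation batch-by-batch yields the "words" DStream
words_dstream = [flat_map(lambda line: line.split(" "), batch)
                 for batch in lines_dstream]
print(words_dstream)  # [['hello', 'world'], ['hello', 'spark', 'streaming']]
```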

Transformations on DStreams

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs:
  • map(func): Return a new DStream by passing each element of the source DStream through the function func.
  • flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
  • filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
  • reduceByKey(func): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
  • window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
  • countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
  • reduceByKeyAndWindow(func, windowLength, slideInterval): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window.
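The per-batch semantics of reduceByKey can be sketched in pure Python (a dict stands in for the resulting RDD of (K, V) pairs; this is not Spark code):

```python
# Pure-Python sketch of per-batch reduceByKey semantics (not Spark code).

def reduce_by_key(func, pairs):
    """Aggregate values per key with func, like RDD.reduceByKey on one batch."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

# One batch of (word, 1) pairs from the word-count example
batch = [("hello", 1), ("world", 1), ("hello", 1)]
print(reduce_by_key(lambda x, y: x + y, batch))  # {'hello': 2, 'world': 1}
```

In Spark Streaming this aggregation runs independently on each batch of the DStream, in parallel across the cluster.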

Window Operations

Spark Streaming provides windowed computations, which allow you to apply transformations over a sliding window of data:
# Reduce last 30 seconds of data, every 10 seconds.
# Using an inverse reduce function ("subtract the data leaving the window")
# requires checkpointing to be enabled on the StreamingContext.
windowedWordCounts = pairs.reduceByKeyAndWindow(
    lambda x, y: x + y,  # reduce function: add values entering the window
    lambda x, y: x - y,  # inverse reduce function: remove values leaving it
    30,                  # window length in seconds
    10                   # sliding interval in seconds
)
Window operations need to specify two parameters:
  • window length: The duration of the window (30 seconds in the example)
  • sliding interval: The interval at which the window operation is performed (10 seconds in the example)
Both parameters must be multiples of the batch interval of the source DStream.
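The reason reduceByKeyAndWindow can take both a reduce function and an inverse reduce function is incremental computation: rather than re-aggregating the whole window on every slide, Spark adds the batches entering the window and subtracts the batches leaving it. That idea can be sketched in pure Python (dicts of per-key counts stand in for batches; sliding interval equals the batch interval here; not Spark code):

```python
# Pure-Python sketch of incremental windowed aggregation (not Spark code).
# Each batch is a dict of per-key counts; the window spans `window_len` batches.

def merge(counts, batch, func):
    """Fold one batch into the running window totals with func."""
    out = dict(counts)
    for key, value in batch.items():
        out[key] = func(out[key], value) if key in out else value
    return out

def windowed_counts(batches, window_len):
    """Yield the window total after each batch, maintained incrementally."""
    add, subtract = (lambda x, y: x + y), (lambda x, y: x - y)
    totals = {}
    for i, batch in enumerate(batches):
        totals = merge(totals, batch, add)           # batch entering the window
        if i >= window_len:                          # batch leaving the window
            totals = merge(totals, batches[i - window_len], subtract)
        yield {k: v for k, v in totals.items() if v != 0}

batches = [{"a": 2}, {"a": 1, "b": 1}, {"b": 3}]
print(list(windowed_counts(batches, window_len=2)))
# [{'a': 2}, {'a': 3, 'b': 1}, {'a': 1, 'b': 4}]
```

This is also why the inverse-function variant requires checkpointing in Spark: the running totals are state that must survive failures.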

Input Sources

Spark Streaming provides two categories of built-in streaming sources:

Basic Sources

  • File streams: For reading data from files on any HDFS-compatible filesystem
  • Socket streams: For reading data from a TCP socket connection
  • Queue streams: For testing with queues of RDDs

Advanced Sources

Advanced sources such as Kafka and Kinesis interface with external non-Spark libraries, so they are not included in the core Spark distribution; you must link your application against the corresponding extra artifacts listed in the Linking section.

When running locally, do not use “local” or “local[1]” as the master URL. Either would mean that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Always use at least “local[2]” when running locally.

Output Operations

Output operations allow DStream’s data to be pushed out to external systems like databases or filesystems. Common output operations include:
  • pprint(): Prints the first ten elements of every batch of data (this operation is named print() in the Scala and Java APIs)
  • saveAsTextFiles(prefix, [suffix]): Saves the DStream’s contents as text files
  • saveAsObjectFiles(prefix, [suffix]): Saves the DStream’s contents as serialized Java objects
  • foreachRDD(func): Applies a function to each RDD generated from the stream
# Print to console
wordCounts.pprint()

# Save to text files
wordCounts.saveAsTextFiles("output")

# Apply a custom function to each RDD; collect() pulls the whole batch
# to the driver, so this pattern suits only small per-batch result sets
def sendToDatabase(rdd):
    for record in rdd.collect():
        # Send to database
        pass

wordCounts.foreachRDD(sendToDatabase)
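A common pitfall inside foreachRDD is opening one database connection per record; the usual remedy is to group records and reuse a connection per chunk (the same motivation as Spark's foreachPartition pattern). A pure-Python sketch of the batching idea, with a hypothetical `send_batch` callable standing in for the real database call:

```python
# Pure-Python sketch of batched writes (send_batch is hypothetical; not Spark code).

def send_in_batches(records, batch_size, send_batch):
    """Send records in fixed-size chunks instead of one call per record."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)  # flush the final partial batch

calls = []
send_in_batches(range(5), 2, calls.append)
print(calls)  # [[0, 1], [2, 3], [4]]
```

Fewer round trips per batch generally matters far more for throughput than the per-record logic itself.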

Checkpointing

Checkpointing is essential for fault-tolerant stream processing. You can enable checkpointing by setting the checkpoint directory:
ssc.checkpoint("/path/to/checkpoint/directory")
The checkpoint directory must be a path in an HDFS-compatible file system that is accessible by both the driver and executors.
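Spark exposes recovery from checkpoints through StreamingContext.getOrCreate(checkpointDirectory, createFn), which rebuilds the context from checkpoint data if it exists and otherwise calls the factory function. The control flow can be sketched in pure Python (hypothetical names; not Spark's implementation):

```python
import os
import tempfile

# Pure-Python sketch of the getOrCreate control flow (not Spark's implementation).

def get_or_create(checkpoint_dir, create_fn, restore_fn):
    """Restore state from the checkpoint directory if it exists,
    otherwise create it fresh and record the checkpoint location."""
    if os.path.isdir(checkpoint_dir):
        return restore_fn(checkpoint_dir)  # driver restarted: recover state
    state = create_fn()                    # first run: build from scratch
    os.makedirs(checkpoint_dir)
    return state

# First run creates fresh state; a "restart" finds the checkpoint and restores.
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint")
print(get_or_create(ckpt, lambda: "fresh", lambda path: "restored"))  # fresh
print(get_or_create(ckpt, lambda: "fresh", lambda path: "restored"))  # restored
```

In real Spark Streaming the factory function must both create the StreamingContext and set up all DStreams, since the entire computation graph is what gets checkpointed.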

Running the Application

After defining all the transformations, you need to start the streaming computation:
ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

Linking

For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>3.5.0</version>
</dependency>
For ingesting data from sources like Kafka and Kinesis, you will need to add the corresponding artifact:
  • Kafka: spark-streaming-kafka-0-10_2.12
  • Kinesis: spark-streaming-kinesis-asl_2.12

Migration to Structured Streaming

For new applications, we strongly recommend using Structured Streaming instead of DStreams. Structured Streaming provides a more modern, easier-to-use API with better performance and more features.
