Spark Streaming is the previous generation of Spark’s streaming engine. It no longer receives updates and is a legacy project. Spark has a newer, easier-to-use streaming engine called Structured Streaming, which you should use for your streaming applications and pipelines. See the Structured Streaming Programming Guide.
Overview
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join, and window.
Finally, processed data can be pushed out to filesystems, databases, and live dashboards. You can apply Spark’s machine learning and graph processing algorithms on data streams.
How It Works
Internally, Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called a discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
Quick Example
Let’s look at a simple example of counting words in text data received from a TCP socket.
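A minimal sketch of the word count in the Python DStream API, assuming a local Spark installation and a text source on localhost port 9999 (for example, one started with `nc -lk 9999`); it needs a running Spark environment, so it is illustrative rather than standalone:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads
# and a batch interval of 1 second.
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that will connect to localhost:9999.
lines = ssc.socketTextStream("localhost", 9999)

# Split each line into words, then count each word in each batch.
words = lines.flatMap(lambda line: line.split(" "))
pairs = words.map(lambda word: (word, 1))
word_counts = pairs.reduceByKey(lambda a, b: a + b)

# Print the first ten elements of each RDD generated in this DStream.
word_counts.pprint()

ssc.start()             # start the computation
ssc.awaitTermination()  # wait for the computation to terminate
```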
When these lines are executed, Spark Streaming only sets up the computation it will perform once started; no real processing has begun yet. To start the processing after all the transformations have been set up, you call start().
Initializing StreamingContext
To initialize a Spark Streaming program, a StreamingContext object must be created, which is the main entry point of all Spark Streaming functionality.
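A sketch of the initialization in Python, assuming pyspark is installed; the master can be a Spark, Mesos, or YARN cluster URL, or "local[n]" to run locally with n threads:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# The appName is shown on the cluster UI; "local[2]" runs locally
# with two threads (one to receive data, one to process it).
sc = SparkContext(master="local[2]", appName="MyStreamingApp")

# The batch interval (here 1 second) should be chosen based on the
# latency requirements of the application and available resources.
ssc = StreamingContext(sc, batchDuration=1)
```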
Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs. For example, in the earlier example of converting a stream of lines to words, the flatMap operation is applied on each RDD in the lines DStream to generate the RDDs of the words DStream.
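This per-RDD translation can be sketched without Spark at all. Below, each micro-batch is a plain Python list standing in for an RDD, and the flatMap step turns each batch of lines into a batch of words; this is a conceptual simulation of the semantics, not the Spark API:

```python
# Conceptual sketch: a DStream as a sequence of micro-batches,
# where each batch (a plain list here) stands in for one RDD.
line_batches = [
    ["hello world"],                     # batch at time t1
    ["hello spark", "spark streaming"],  # batch at time t2
]

def flat_map(batch, func):
    """Apply func to each element and concatenate the results,
    mirroring what flatMap does on each RDD in a DStream."""
    return [item for element in batch for item in func(element)]

# Applying flatMap on each batch of lines yields batches of words.
word_batches = [flat_map(batch, lambda line: line.split(" "))
                for batch in line_batches]

print(word_batches)
# [['hello', 'world'], ['hello', 'spark', 'spark', 'streaming']]
```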
Transformations on DStreams
Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs:
- map(func): Return a new DStream by passing each element of the source DStream through the function func.
- flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
- filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
- reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function.
- window(windowLength, slideInterval): Return a new DStream which is computed based on windowed batches of the source DStream.
- countByWindow(windowLength, slideInterval): Return a sliding window count of elements in the stream.
- reduceByKeyAndWindow(func, windowLength, slideInterval): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function over batches in a sliding window.
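To make the per-batch semantics concrete, here is a plain-Python simulation of reduceByKey on a single micro-batch, with a list of (K, V) tuples standing in for an RDD; this illustrates the aggregation, not the Spark API itself:

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Aggregate the values for each key with func, mirroring
    what reduceByKey does on one micro-batch of (K, V) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# One micro-batch of (word, 1) pairs, as in the word count example.
batch = [("spark", 1), ("hello", 1), ("spark", 1)]
counts = reduce_by_key(batch, lambda a, b: a + b)
print(counts)  # {'spark': 2, 'hello': 1}
```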
Window Operations
Spark Streaming provides windowed computations, which allow you to apply transformations over a sliding window of data. Every window operation takes two parameters:
- window length: The duration of the window (e.g., 30 seconds)
- sliding interval: The interval at which the window operation is performed (e.g., 10 seconds)
Both parameters must be multiples of the batch interval of the source DStream.
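The arithmetic behind that rule can be sketched in plain Python: with a 10-second batch interval, a 30-second window and a 10-second slide mean each window covers the 3 most recent batches and advances 1 batch at a time (a conceptual simulation, not the Spark API):

```python
BATCH_INTERVAL = 10   # seconds
WINDOW_LENGTH = 30    # must be a multiple of the batch interval
SLIDE_INTERVAL = 10   # must be a multiple of the batch interval

batches_per_window = WINDOW_LENGTH // BATCH_INTERVAL  # 3 batches
batches_per_slide = SLIDE_INTERVAL // BATCH_INTERVAL  # 1 batch

def windowed(batches, per_window, per_slide):
    """Yield the union of the batches covered by each sliding window."""
    windows = []
    for end in range(per_window, len(batches) + 1, per_slide):
        window = [x for batch in batches[end - per_window:end] for x in batch]
        windows.append(window)
    return windows

# One list per batch interval; each window unions 3 consecutive batches.
batches = [["a"], ["b", "c"], ["d"], ["e", "f"]]
print(windowed(batches, batches_per_window, batches_per_slide))
# [['a', 'b', 'c', 'd'], ['b', 'c', 'd', 'e', 'f']]
```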
Input Sources
Spark Streaming provides two categories of built-in streaming sources:
Basic Sources
- File streams: For reading data from files on any HDFS-compatible filesystem
- Socket streams: For reading data from a TCP socket connection
- Queue streams: For testing with queues of RDDs
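Queue streams are handy for local tests; a sketch assuming pyspark is installed and a local master, where each RDD pushed onto the queue is consumed as one batch of the stream:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "QueueStreamExample")
ssc = StreamingContext(sc, 1)

# Pre-build a queue of RDDs; each one becomes a single batch.
rdd_queue = [sc.parallelize(range(1, 11)) for _ in range(3)]
input_stream = ssc.queueStream(rdd_queue)
input_stream.pprint()

ssc.start()
ssc.awaitTermination()
```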
Advanced Sources
- Kafka: See the Kafka Integration Guide
- Kinesis: See the Kinesis Integration Guide
Output Operations
Output operations allow DStream’s data to be pushed out to external systems like databases or filesystems. Common output operations include:
- print(): Prints the first ten elements of every batch of data
- saveAsTextFiles(prefix, [suffix]): Saves the DStream’s contents as text files
- saveAsObjectFiles(prefix, [suffix]): Saves the DStream’s contents as serialized Java objects
- foreachRDD(func): Applies a function to each RDD generated from the stream
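foreachRDD is the most general of these. A sketch in Python, given some DStream such as the word_counts from the earlier word count; the send_to_sink helper is hypothetical, standing in for whatever external client you use:

```python
def send_to_sink(records):
    """Hypothetical sink: replace with your database or API client.
    Opening the connection here keeps it per-partition, not per-record."""
    for record in records:
        pass  # e.g. connection.send(record)

# Apply an arbitrary function to each RDD generated from the stream.
word_counts.foreachRDD(
    lambda rdd: rdd.foreachPartition(send_to_sink)
)
```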
Checkpointing
Checkpointing is essential for fault-tolerant stream processing. You can enable checkpointing by setting the checkpoint directory.
The checkpoint directory must be a path in an HDFS-compatible file system that is accessible by both the driver and executors.
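A sketch of enabling checkpointing together with the recoverable-context pattern, assuming pyspark is installed; the HDFS path and application name here are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs://namenode:8020/checkpoints/my-app"  # placeholder path

def create_context():
    # Build the context and the full DStream graph, then set the
    # checkpoint directory before returning.
    sc = SparkContext("local[2]", "CheckpointedApp")
    ssc = StreamingContext(sc, 1)
    ssc.checkpoint(CHECKPOINT_DIR)
    # ... define DStream transformations here ...
    return ssc

# Recreate the context from checkpoint data if it exists;
# otherwise build it fresh with create_context.
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
```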
Running the Application
After defining all the transformations, you start the streaming computation by calling ssc.start(), and wait for the processing to stop (manually or due to an error) with ssc.awaitTermination().
Linking
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifacts:
- Kafka: spark-streaming-kafka-0-10_2.12
- Kinesis: spark-streaming-kinesis-asl_2.12
Migration to Structured Streaming
For new applications, we strongly recommend using Structured Streaming instead of DStreams. Structured Streaming provides a more modern, easier-to-use API with better performance and more features.
