In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from anywhere: it can be embedded in modern data applications, IDEs, notebooks, and programming languages.
To get started quickly, see the Quickstart: Spark Connect guide.

How Spark Connect Works

The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark’s DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
[Figure: Spark Connect API architecture]

Architecture

The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework. The Spark Connect endpoint embedded on the Spark Server receives and translates unresolved logical plans into Spark’s logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark’s optimizations and enhancements. Results are streamed back to the client through gRPC as Apache Arrow-encoded row batches.
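The round trip described above can be illustrated with a toy model. Plain dicts and JSON stand in for the real protocol buffer messages, a dict stands in for the catalog, and a Python list stands in for the Arrow-encoded result batches; none of this is Spark's actual schema, it only sketches the client-encodes / server-resolves split:

```python
import json

# Client side: build an unresolved logical plan as a plain tree of dicts.
# Column and table names are recorded as-is; nothing is validated yet.
def read(table):
    return {"op": "read", "table": table}

def project(child, *cols):
    return {"op": "project", "cols": list(cols), "child": child}

plan = project(read("people"), "name")
wire = json.dumps(plan).encode()  # stand-in for the protobuf encoding sent over gRPC

# Server side: decode the plan, resolve names against a catalog, and
# execute; real Spark instead builds logical plan operators and runs
# its standard optimizer and execution pipeline.
CATALOG = {"people": [{"id": 1, "name": "Sarah"}, {"id": 2, "name": "Maria"}]}

def execute(node):
    if node["op"] == "read":
        return CATALOG[node["table"]]  # resolution fails if the table is unknown
    if node["op"] == "project":
        return [{c: row[c] for c in node["cols"]} for row in execute(node["child"])]
    raise ValueError(f"unknown operator: {node['op']}")

rows = execute(json.loads(wire))
# Real results stream back to the client as Arrow-encoded row batches.
```

The key point the sketch preserves: the client never sees the catalog or the execution engine; it only ships a description of the computation.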
[Figure: Spark Connect communication flow]

Key Differences from Classic Spark

One of the main design goals of Spark Connect is to enable full separation and isolation of the client from the server. As a consequence, there are some changes you need to be aware of:
Important Differences:
  1. No Direct Driver Access - The client does not run in the same process as the Spark driver. In PySpark, the client does not use Py4J, so you cannot access private fields like df._jdf.
  2. No RDD Support - Spark Connect uses logical plans as the abstraction and does not support RDD operations.
  3. Session-Based Client - The client does not have access to cluster-wide properties. You cannot access the static Spark configuration or SparkContext.
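Code that must run under both classic Spark and Spark Connect sometimes needs to branch on the session type. One unofficial way to do that (a sketch, not a PySpark API: it relies on the fact that Connect sessions live under the pyspark.sql.connect package, as the type(spark) check later in this page shows) is to inspect the session class's module:

```python
def is_connect_session(spark) -> bool:
    """Return True when `spark` appears to be a Spark Connect session.

    Connect sessions are defined in pyspark.sql.connect.session, classic
    ones in pyspark.sql.session; checking the class's module path is a
    simple, unofficial way to tell them apart.
    """
    return type(spark).__module__.startswith("pyspark.sql.connect")
```

A guard like this lets shared libraries avoid Connect-incompatible paths (Py4J internals, RDDs, SparkContext) when a Connect session is detected.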

Operational Benefits

Spark Connect provides several operational advantages for multi-tenant environments:

Stability

Applications that use too much memory now impact only their own environment, since each can run in its own process. You can define your own dependencies on the client without worrying about conflicts with the Spark driver.

Upgradability

The Spark driver can now seamlessly be upgraded independently of applications, for example to benefit from performance improvements and security fixes. Applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.

Debuggability and Observability

Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, you can monitor applications using your application framework's native metrics and logging libraries.

Getting Started

Starting the Spark Connect Server

First, download and extract Spark from the Download Apache Spark page. Start the Spark Connect server:
./sbin/start-connect-server.sh

Connecting from Client Applications

Option 1: Using SPARK_REMOTE environment variable
export SPARK_REMOTE="sc://localhost"
./bin/pyspark
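Connection strings follow the form sc://host:port/;key=value;key=value, and the server's default gRPC port is 15002 (so sc://localhost resolves to sc://localhost:15002). A small helper (hypothetical, not a PySpark API) makes the format concrete:

```python
def remote_url(host, port=15002, **params):
    """Build a Spark Connect connection string (illustrative helper).

    Format: sc://host:port/;key=value;key=value, where 15002 is the
    server's default gRPC port and optional parameters (for example
    use_ssl) are appended after the path.
    """
    url = f"sc://{host}:{port}/"
    if params:
        url += ";" + ";".join(f"{k}={v}" for k, v in params.items())
    return url
```

For example, remote_url("localhost") yields "sc://localhost:15002/", and extra keyword arguments become the semicolon-separated parameters.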
Option 2: Specifying remote in code
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .remote("sc://localhost") \
    .getOrCreate()

# Verify connection type
type(spark)
# <class 'pyspark.sql.connect.session.SparkSession'>

# Run operations
columns = ["id", "name"]
data = [(1, "Sarah"), (2, "Maria")]
df = spark.createDataFrame(data).toDF(*columns)
df.show()
Installing the client:
pip install pyspark-client

Standalone Applications

Python Example

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"
spark = SparkSession.builder \
    .remote("sc://localhost") \
    .appName("SimpleApp") \
    .getOrCreate()

logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print(f"Lines with a: {numAs}, lines with b: {numBs}")

spark.stop()
Run with:
python SimpleApp.py
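SimpleApp.py hard-codes the remote address. Since the client also honors the SPARK_REMOTE environment variable, a small helper (the function name is illustrative, not a PySpark API) keeps scripts portable between local development and other environments:

```python
import os

def spark_remote(default="sc://localhost"):
    """Return the Spark Connect address from SPARK_REMOTE, else a default.

    Lets the same script run unchanged against a local server or a
    remote one configured via the environment.
    """
    return os.environ.get("SPARK_REMOTE", default)
```

The builder call then becomes .remote(spark_remote()) instead of a hard-coded literal.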

Scala Example

For Scala applications with UDFs or custom code:
Operations that reference user-defined code (UDFs, filter, map, etc.) require a ClassFinder to be registered so that class files can be uploaded to the server. JAR dependencies must also be uploaded using SparkSession.addArtifact.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connect.client.REPLClassDirMonitor

val spark = SparkSession.builder()
  .remote("sc://localhost")
  .getOrCreate()

// Register ClassFinder for UDFs
val classFinder = new REPLClassDirMonitor("<PATH_TO_BUILD_OUTPUT>")
spark.registerClassFinder(classFinder)

// Upload JAR dependencies
spark.addArtifact("<PATH_TO_JAR>")

Authentication

While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without implementing authentication logic in Spark directly.
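In such a setup the client can carry credentials in the connection string itself: the token parameter attaches a bearer token to the gRPC channel and use_ssl enables TLS, while the proxy in front of the server does the actual validation. A sketch with placeholder host and token values:

```python
# Hypothetical gateway address and token; the authenticating proxy in
# front of the Spark Connect server validates the bearer token, Spark
# itself does not.
remote = "sc://gateway.example.com:443/;use_ssl=true;token=MY_TOKEN"

# The session would then be created as usual (requires a reachable server):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.remote(remote).getOrCreate()
```

Because the token travels as gRPC metadata over HTTP/2, any proxy that can inspect Authorization headers can enforce access control without changes to Spark.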

API Support

PySpark

Since Spark 3.4, Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, SparkContext and RDD are not supported.
Check the API reference for APIs labeled “Supports Spark Connect” to verify compatibility before migrating code.

Scala

Since Spark 3.5, Spark Connect supports most Scala APIs, including Dataset, functions, Column, Catalog, and KeyValueGroupedDataset.
User-Defined Functions (UDFs) are supported by default in the shell; standalone applications require additional setup. The majority of the Streaming API is supported, including DataStreamReader, DataStreamWriter, StreamingQuery, and StreamingQueryListener. SparkContext and RDD are unsupported in Spark Connect.

Additional Resources

Support for more APIs is planned for upcoming Spark releases.
