How Spark Connect Works
The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark’s DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.
Architecture
The Spark Connect client translates DataFrame operations into unresolved logical query plans, which are encoded using protocol buffers and sent to the server over the gRPC framework. The Spark Connect endpoint embedded in the Spark server receives the unresolved logical plans and translates them into Spark's logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark's optimizations and enhancements. Results are streamed back to the client through gRPC as Apache Arrow-encoded row batches.
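The round trip above can be illustrated with a deliberately simplified sketch. Plain dictionaries and JSON stand in for the real protobuf messages and gRPC transport, and every name below is illustrative rather than the actual Spark Connect protocol:

```python
import json

# A client-side operation such as df.filter(df.age > 21).select("name") is
# represented as a tree of *unresolved* operators: column and table names are
# not yet checked against any catalog or schema. (Stand-in for the proto plan.)
unresolved_plan = {
    "project": {
        "columns": ["name"],
        "input": {
            "filter": {
                "condition": {">": ["age", 21]},
                "input": {"read": {"table": "users"}},
            }
        },
    }
}

# The client serializes the plan (really: protocol buffers over gRPC) and
# ships the bytes to the server.
wire_bytes = json.dumps(unresolved_plan).encode("utf-8")

# Server side: decode the plan, then hand it to Spark's analyzer, optimizer,
# and executor, exactly as if it had come from parsing a SQL query.
received = json.loads(wire_bytes.decode("utf-8"))
```

Because the protocol is just a serialized plan, any language that can emit these messages can act as a Spark Connect client.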
Key Differences from Classic Spark
One of the main design goals of Spark Connect is to enable full separation and isolation of the client from the server. As a consequence, there are some changes you need to be aware of.

Operational Benefits
Spark Connect provides several operational advantages for multi-tenant environments:

Stability
Applications that use too much memory now impact only their own environment, since they run in their own processes. You can also define your own dependencies on the client without worrying about conflicts with the Spark driver.

Upgradability
The Spark driver can now seamlessly be upgraded independently of applications, for example to benefit from performance improvements and security fixes. Applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.

Debuggability and Observability
Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, you can monitor applications using your application framework's native metrics and logging libraries.

Getting Started
Starting the Spark Connect Server
First, download and extract Spark from the Download Apache Spark page. Then start the Spark Connect server with the start-connect-server.sh script shipped in the distribution's sbin directory.

Connecting from Client Applications
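Starting the server from the extracted Spark directory can look like the following sketch, assuming a Spark 3.5.x build (in Spark 3.x the Connect server plugin is pulled in via --packages; the version number here is illustrative):

```shell
# From the extracted Spark directory; the server listens on port 15002 by default.
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
```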
There are two ways to point a client at the server. Option 1: set the SPARK_REMOTE environment variable before launching the application. Option 2: specify the remote address directly in code when building the SparkSession. In either case, the client library must be installed first.
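Both options can be sketched as follows, assuming a server at sc://localhost:15002. The helper name connect is mine, and the connection is wrapped in a function so nothing is contacted at import time:

```python
# Option 1 (shell): point any client at the server via the environment:
#   export SPARK_REMOTE="sc://localhost:15002"
# A plain SparkSession.builder.getOrCreate() (or the pyspark shell) then
# picks up the remote address automatically.

# Option 2: specify the remote explicitly when building the session.
def connect(url: str = "sc://localhost:15002"):
    # Imported lazily so this sketch can be loaded without pyspark installed.
    from pyspark.sql import SparkSession
    return SparkSession.builder.remote(url).getOrCreate()

# Installing the Python client (shell); the connect extra pulls in the
# dependencies the Spark Connect client needs:
#   pip install "pyspark[connect]"
```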
Standalone Applications
Python Example
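A minimal standalone application might look like the following sketch, assuming a server at sc://localhost:15002; the logic is wrapped in main() so the file can be imported without connecting:

```python
def main() -> None:
    # Requires the client library: pip install "pyspark[connect]"
    from pyspark.sql import SparkSession

    # All computation below runs on the remote Spark server, not in this process.
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 36)], schema=["name", "age"])
    # The filter is shipped as an unresolved plan; rows return as Arrow batches.
    df.filter(df.age > 35).show()
    spark.stop()
```

After adding a call to main(), this runs as an ordinary Python script; no spark-submit is involved.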
Scala Example
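A Scala counterpart, as a sketch assuming Spark 3.5 with Scala 2.12, a server at sc://localhost:15002, and illustrative version numbers; the thin JVM client lives in the separate spark-connect-client-jvm artifact:

```scala
// build.sbt (separate file; version is illustrative):
//   libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.1"

import org.apache.spark.sql.SparkSession

object ConnectApp {
  def main(args: Array[String]): Unit = {
    // remote() points the thin JVM client at the Spark Connect server.
    val spark = SparkSession.builder().remote("sc://localhost:15002").getOrCreate()
    spark.range(10).filter("id > 5").show()
    spark.stop()
  }
}
```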
For Scala applications with UDFs or custom code, the compiled classes must also be made available to the server, for example by adding the application jar as an artifact with SparkSession.addArtifact, so the server can deserialize and execute them.

Authentication
While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without implementing authentication logic in Spark directly.

API Support
PySpark
Since Spark 3.4, Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, SparkContext and RDD are not supported.
Scala
Since Spark 3.5, Spark Connect supports most Scala APIs, including Dataset, functions, Column, Catalog, and KeyValueGroupedDataset.
Support for more APIs is planned for upcoming Spark releases.

Additional Resources
- Application Development with Spark Connect
- Spark Connect Gotchas
- Client Connection String Reference
- Python API Reference
- Scala API Reference
