The comparison highlights key differences between Spark Connect and Spark Classic in terms of execution and analysis behavior. While both utilize lazy execution for transformations, Spark Connect also defers analysis, introducing unique considerations like temporary view handling and UDF evaluation.
These differences are particularly important when migrating existing code from Spark Classic to Spark Connect, or when writing code that needs to work with both modes.
For an overview of Spark Connect, see Spark Connect Overview.
Query Execution: Both Lazy
Spark Classic
In traditional Spark, DataFrame transformations (e.g., filter, limit) are lazy. They are not executed immediately but are encoded in a logical plan. The actual computation is triggered only when an action (e.g., show(), collect()) is called.
Spark Connect
Spark Connect follows a similar lazy evaluation model. Transformations are constructed on the client side and sent as unresolved plans to the server. The server then performs the necessary analysis and execution when an action is called.
Comparison
Both Spark Classic and Spark Connect follow the same lazy execution model for query execution.
| Aspect | Spark Classic & Spark Connect |
|---|---|
| Transformations: df.filter(...), df.select(...), df.limit(...), etc. | Lazy execution |
| SQL queries: spark.sql("select …") | Lazy execution |
| Actions: df.collect(), df.show(), etc. | Eager execution |
| SQL commands: spark.sql("insert …"), spark.sql("create …"), etc. | Eager execution |
Schema Analysis: Eager vs. Lazy
Spark Classic
Traditionally, Spark Classic performs analysis eagerly during logical plan construction. This analysis phase converts the unresolved plan into a fully resolved logical plan and verifies that the operation can be executed by Spark. One of the key benefits of performing this work eagerly is that you receive immediate feedback when a mistake is made.
For example, executing spark.sql("select 1 as a, 2 as b").filter("c > 1") will throw an error eagerly, indicating the column c cannot be found.
Spark Connect
Spark Connect differs from Spark Classic in that the client constructs unresolved plans during transformations and defers their analysis. Any operation that requires a resolved plan, such as accessing the schema, explaining the plan, persisting a DataFrame, or executing an action, causes the client to send the unresolved plan to the server over RPC. The server then performs full analysis to produce a resolved logical plan and carries out the requested operation.
For example, spark.sql("select 1 as a, 2 as b").filter("c > 1") does not throw an error, because the unresolved plan stays on the client. The error surfaces only when an operation such as df.columns or df.show() sends the plan to the server for analysis.
Comparison
Unlike query execution, Spark Classic and Spark Connect differ in when schema analysis occurs.
| Aspect | Spark Classic | Spark Connect |
|---|---|---|
| Transformations: df.filter(...), df.select(...), df.limit(...), etc. | Eager | Lazy |
| Schema access: df.columns, df.schema, df.isStreaming, etc. | Eager | Eager (triggers an analysis RPC request) |
| Actions: df.collect(), df.show(), etc. | Eager | Eager |
| Dependent session state of DataFrames: UDFs, temporary views, configs, etc. | Eager | Lazy (evaluated during plan execution) |
| Dependent session state of temporary views: UDFs, temporary views, configs, etc. | Eager | Eager (analysis triggered when creating the view) |
Common Gotchas (with Mitigations)
If you are not careful about the difference between lazy and eager analysis, there are four key gotchas to be aware of:
- Overwriting temporary view names
- Capturing external variables in UDFs
- Delayed error detection
- Excessive schema access on new DataFrames
1. Reusing Temporary View Names
In Spark Connect, DataFrames store only a reference to temporary views by name. If you replace a temp view, all DataFrames referencing it will see the new data at execution time.
The Problem
```python
def create_temp_view_and_create_dataframe(x):
    spark.range(x).createOrReplaceTempView("temp_view")
    return spark.table("temp_view")

df10 = create_temp_view_and_create_dataframe(10)
assert len(df10.collect()) == 10

df100 = create_temp_view_and_create_dataframe(100)
assert len(df10.collect()) == 10  # FAILS! df10 now references the new view
assert len(df100.collect()) == 100
```
In Spark Connect, the DataFrame stores only a reference to the temporary view by name. As a result, if the temp view is later replaced, the data in the DataFrame will also change because it looks up the view by name at execution time.
This behavior differs from Spark Classic, where eager analysis embeds the logical plan of the temp view into the DataFrame at creation time.
Mitigation
Create unique temporary view names using UUIDs:
```python
import uuid

def create_temp_view_and_create_dataframe(x):
    temp_view_name = f"`temp_view_{uuid.uuid4()}`"
    spark.range(x).createOrReplaceTempView(temp_view_name)
    return spark.table(temp_view_name)

df10 = create_temp_view_and_create_dataframe(10)
assert len(df10.collect()) == 10

df100 = create_temp_view_and_create_dataframe(100)
assert len(df10.collect()) == 10  # Works correctly now!
assert len(df100.collect()) == 100
```
2. UDFs with Mutable External Variables
In Spark Connect, Python UDFs are lazy—their serialization and registration are deferred until execution time. This means UDFs capture variable values at execution time, not at definition time.
The Problem
```python
from pyspark.sql.functions import udf

x = 123

@udf("INT")
def foo():
    return x

df = spark.range(1).select(foo())
x = 456
df.show()  # Prints 456, not 123!
```
This happens because Python closures capture variables by reference, not by value, and UDF serialization is deferred until execution.
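The capture-by-reference behavior is plain Python and has nothing to do with Spark itself; a minimal sketch without any Spark session:

```python
x = 123

def foo():
    return x  # x is looked up when foo is CALLED, not when it is defined

x = 456
assert foo() == 456  # the later value wins
```

Spark Connect simply makes this visible more often, because the UDF body is serialized at execution time, after any reassignments have happened.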
Another Example: UDFs in Loops
```python
import json
from pyspark.sql.functions import udf, col

df = spark.createDataFrame([{"values": '{"column_1": 1, "column_2": 2}'}], ["values"])

for j in ['column_1', 'column_2']:
    def extract_value(raw):
        return json.loads(raw).get(j)
    extract_value_udf = udf(extract_value)
    df = df.withColumn(j, extract_value_udf(col('values')))

df.show()  # Shows 2 for both columns!
```
Both UDFs end up using the last value of j ('column_2').
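The same late binding can be reproduced without Spark: every closure created in the loop shares the single loop variable j, so all of them see its final value:

```python
funcs = []
for j in ['column_1', 'column_2']:
    funcs.append(lambda: j)  # every lambda references the SAME variable j

# After the loop, j holds 'column_2', so every closure returns it
assert [f() for f in funcs] == ['column_2', 'column_2']
```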
Mitigation
Use a function factory (closure with early binding) to capture variable values:
```python
from pyspark.sql.functions import udf

def make_udf(value):
    def foo():
        return value
    return udf(foo)

x = 123
foo_udf = make_udf(x)
x = 456

df = spark.range(1).select(foo_udf())
df.show()  # Prints 123 as expected
```
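For the loop example above, a default argument gives the same early binding without a separate factory function. A sketch in plain Python (the same pattern works when the function is then wrapped with udf):

```python
funcs = []
for j in ['column_1', 'column_2']:
    def extract_key(key=j):  # default value is evaluated NOW, binding j's current value
        return key
    funcs.append(extract_key)

# Each function keeps the value j had at its definition
assert [f() for f in funcs] == ['column_1', 'column_2']
```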
3. Delayed Error Detection
In Spark Connect, errors in transformations are not detected until an action or schema access is performed.
The Problem
```python
from pyspark.sql.functions import col, when

df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

try:
    df = df.select("name", "age")
    df = df.withColumn(
        "age_group",
        when(col("age") < 18, "minor").otherwise("adult"))
    df = df.filter(col("age_with_typo") > 6)  # Typo! But no error yet
except Exception as e:
    print(f"Error: {repr(e)}")
```
In Spark Classic, this error handling works because analysis is eager: the typo raises an exception inside the try block. In Spark Connect, no error is thrown until an action or schema access is performed, so the failure surfaces later, outside the handler.
Mitigation
Trigger eager analysis explicitly when you need to catch errors:
```python
try:
    df = df.filter(col("age_with_typo") > 6)
    df.columns  # Trigger eager analysis
except Exception as e:
    print(f"Error: {repr(e)}")
```
4. Excessive Schema Access on New DataFrames
Accessing the schema of many new DataFrames (via df.columns, df.schema, etc.) results in many analysis RPC requests, which can severely impact performance.
Problem 1: Schema Access in Loops
```python
import pyspark.sql.functions as F

df = spark.range(10)
for i in range(200):
    new_column_name = str(i)
    if new_column_name not in df.columns:  # Each access triggers analysis!
        df = df.withColumn(new_column_name, F.col("id") + i)
df.show()
```
Each iteration triggers a new analysis request to the server.
Mitigation 1: Track Columns Locally
```python
import pyspark.sql.functions as F

df = spark.range(10)
columns = set(df.columns)  # Get columns once

for i in range(200):
    new_column_name = str(i)
    if new_column_name not in columns:  # Check the local set instead
        df = df.withColumn(new_column_name, F.col("id") + i)
        columns.add(new_column_name)
df.show()
```
Problem 2: Creating DataFrames Just to Inspect Struct Fields
```python
from pyspark.sql.types import StructType

df = ...  # DataFrame with many struct columns

# Anti-pattern: creates and analyzes a new DataFrame per struct column
struct_column_fields = {
    column_schema.name: df.select(column_schema.name + ".*").columns
    for column_schema in df.schema
    if isinstance(column_schema.dataType, StructType)
}
```
Mitigation 2: Use Schema Directly
```python
from pyspark.sql.types import StructType

df = ...  # DataFrame with many struct columns

# Better: extract field names from the schema without creating DataFrames
struct_column_fields = {
    column_schema.name: [f.name for f in column_schema.dataType.fields]
    for column_schema in df.schema
    if isinstance(column_schema.dataType, StructType)
}
print(struct_column_fields)
```
This approach is significantly faster when dealing with many columns because it avoids creating and analyzing numerous DataFrames.
Summary
| Aspect | Spark Classic | Spark Connect |
|---|---|---|
| Query execution | Lazy | Lazy |
| Command execution | Eager | Eager |
| Schema analysis | Eager | Lazy |
| Schema access | Local | Triggers RPC, caches on first access |
| Temporary views | Plan embedded | Name lookup |
| UDF serialization | At creation | At execution |
The key difference is that Spark Connect defers analysis and name resolution to execution time.
Additional Resources
When migrating code to Spark Connect, pay special attention to temporary views, UDFs with closures, and code that frequently accesses DataFrame schemas.