The comparison highlights key differences between Spark Connect and Spark Classic in terms of execution and analysis behavior. While both utilize lazy execution for transformations, Spark Connect also defers analysis, introducing unique considerations like temporary view handling and UDF evaluation.
These differences are particularly important when migrating existing code from Spark Classic to Spark Connect, or when writing code that needs to work with both modes.
For an overview of Spark Connect, see Spark Connect Overview.
Query Execution: Both Lazy
Spark Classic
In traditional Spark, DataFrame transformations (e.g., filter, limit) are lazy. They are not executed immediately but are encoded in a logical plan. The actual computation is triggered only when an action (e.g., show(), collect()) is called.
Spark Connect
Spark Connect follows a similar lazy evaluation model. Transformations are constructed on the client side and sent as unresolved plans to the server. The server then performs the necessary analysis and execution when an action is called.
Comparison
Both Spark Classic and Spark Connect follow the same lazy execution model for query execution.
| Aspect | Spark Classic & Spark Connect |
|---|---|
| Transformations: df.filter(...), df.select(...), df.limit(...), etc. | Lazy execution |
| SQL queries: spark.sql("select …") | Lazy execution |
| Actions: df.collect(), df.show(), etc. | Eager execution |
| SQL commands: spark.sql("insert …"), spark.sql("create …"), etc. | Eager execution |
Schema Analysis: Eager vs. Lazy
Spark Classic
Traditionally, Spark Classic performs analysis eagerly during logical plan construction. This analysis phase converts the unresolved plan into a fully resolved logical plan and verifies that the operation can be executed by Spark. One of the key benefits of performing this work eagerly is that you receive immediate feedback when a mistake is made.
For example, executing spark.sql("select 1 as a, 2 as b").filter("c > 1") will throw an error eagerly, indicating the column c cannot be found.
Spark Connect
Spark Connect differs from Spark Classic in that the client constructs unresolved plans during transformations and defers their analysis. Any operation that requires a resolved plan, such as accessing the schema, explaining the plan, persisting a DataFrame, or executing an action, causes the client to send the unresolved plan to the server over RPC. The server then performs full analysis to produce a resolved logical plan and carries out the requested operation.
For example, spark.sql("select 1 as a, 2 as b").filter("c > 1") does not throw an error, because the unresolved plan stays on the client. The error surfaces only when an operation such as df.columns or df.show() sends the plan to the server for analysis.
Comparison
Unlike query execution, Spark Classic and Spark Connect differ in when schema analysis occurs.
| Aspect | Spark Classic | Spark Connect |
|---|---|---|
| Transformations: df.filter(...), df.select(...), df.limit(...), etc. | Eager | Lazy |
| Schema access: df.columns, df.schema, df.isStreaming, etc. | Eager | Eager (triggers an analysis RPC request) |
| Actions: df.collect(), df.show(), etc. | Eager | Eager |
| Dependent session state of DataFrames: UDFs, temporary views, configs, etc. | Eager | Lazy (evaluated during plan execution) |
| Dependent session state of temporary views: UDFs, temporary views, configs, etc. | Eager | Eager (analysis triggered when creating the view) |
Common Gotchas (with Mitigations)
If you are not careful about the difference between lazy and eager analysis, there are four key gotchas to be aware of:
- Overwriting temporary view names
- Capturing external variables in UDFs
- Delayed error detection
- Excessive schema access on new DataFrames
1. Reusing Temporary View Names
In Spark Connect, DataFrames store only a reference to temporary views by name. If you replace a temp view, all DataFrames referencing it will see the new data at execution time.
The Problem
```python
def create_temp_view_and_create_dataframe(x):
    spark.range(x).createOrReplaceTempView("temp_view")
    return spark.table("temp_view")

df10 = create_temp_view_and_create_dataframe(10)
assert len(df10.collect()) == 10

df100 = create_temp_view_and_create_dataframe(100)
assert len(df10.collect()) == 10  # FAILS! df10 now references the new view
assert len(df100.collect()) == 100
```
In Spark Connect, the DataFrame stores only a reference to the temporary view by name. As a result, if the temp view is later replaced, the data in the DataFrame will also change because it looks up the view by name at execution time.
This behavior differs from Spark Classic, where eager analysis embeds the logical plan of the temp view into the DataFrame at creation time.
Mitigation
Create unique temporary view names using UUIDs:
```python
import uuid

def create_temp_view_and_create_dataframe(x):
    temp_view_name = f"`temp_view_{uuid.uuid4()}`"
    spark.range(x).createOrReplaceTempView(temp_view_name)
    return spark.table(temp_view_name)

df10 = create_temp_view_and_create_dataframe(10)
assert len(df10.collect()) == 10

df100 = create_temp_view_and_create_dataframe(100)
assert len(df10.collect()) == 10  # Works correctly now!
assert len(df100.collect()) == 100
```
2. UDFs with Mutable External Variables
In Spark Connect, Python UDFs are lazy—their serialization and registration are deferred until execution time. This means UDFs capture variable values at execution time, not at definition time.
The Problem
```python
from pyspark.sql.functions import udf

x = 123

@udf("INT")
def foo():
    return x

df = spark.range(1).select(foo())
x = 456
df.show()  # Prints 456, not 123!
```
This happens because Python closures capture variables by reference, not by value, and UDF serialization is deferred until execution.
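The capture-by-reference behavior is plain Python and has nothing to do with Spark itself; a minimal sketch without any Spark session:

```python
x = 123

def foo():
    return x  # x is looked up when foo is CALLED, not when it is defined

x = 456
assert foo() == 456  # the later value wins
```

Spark Connect simply makes this visible more often, because the UDF body is serialized at execution time, after any reassignments have happened.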
Another Example: UDFs in Loops
```python
import json
from pyspark.sql.functions import udf, col

df = spark.createDataFrame([{"values": '{"column_1": 1, "column_2": 2}'}], ["values"])

for j in ['column_1', 'column_2']:
    def extract_value(raw):
        return json.loads(raw).get(j)
    extract_value_udf = udf(extract_value)
    df = df.withColumn(j, extract_value_udf(col('values')))

df.show()  # Shows 2 for both columns!
```
Both UDFs end up using the last value of j ('column_2').
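The same late binding can be reproduced without Spark: every closure created in the loop shares the single loop variable j, so all of them see its final value:

```python
funcs = []
for j in ['column_1', 'column_2']:
    funcs.append(lambda: j)  # every lambda references the SAME variable j

# After the loop, j holds 'column_2', so every closure returns it
assert [f() for f in funcs] == ['column_2', 'column_2']
```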
Mitigation
Use a function factory (closure with early binding) to capture variable values:
```python
from pyspark.sql.functions import udf

def make_udf(value):
    def foo():
        return value
    return udf(foo)

x = 123
foo_udf = make_udf(x)
x = 456

df = spark.range(1).select(foo_udf())
df.show()  # Prints 123 as expected
```
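For the loop example above, a default argument gives the same early binding without a separate factory function. A sketch in plain Python (the same pattern works when the function is then wrapped with udf):

```python
funcs = []
for j in ['column_1', 'column_2']:
    def extract_key(key=j):  # default value is evaluated NOW, binding j's current value
        return key
    funcs.append(extract_key)

# Each function keeps the value j had at its definition
assert [f() for f in funcs] == ['column_1', 'column_2']
```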
3. Delayed Error Detection
In Spark Connect, errors in transformations are not detected until an action or schema access is performed.
The Problem
```python
from pyspark.sql.functions import col, when

df = spark.createDataFrame([("Alice", 25), ("Bob", 30)], ["name", "age"])

try:
    df = df.select("name", "age")
    df = df.withColumn(
        "age_group",
        when(col("age") < 18, "minor").otherwise("adult"))
    df = df.filter(col("age_with_typo") > 6)  # Typo! But no error yet
except Exception as e:
    print(f"Error: {repr(e)}")
```
In Spark Classic, this error handling works because analysis is eager: the typo raises an exception inside the try block. In Spark Connect, no error is thrown until an action or schema access is performed, so the failure surfaces later, outside the handler.
Mitigation
Trigger eager analysis explicitly when you need to catch errors:
```python
try:
    df = df.filter(col("age_with_typo") > 6)
    df.columns  # Trigger eager analysis
except Exception as e:
    print(f"Error: {repr(e)}")
```
4. Excessive Schema Access on New DataFrames
Accessing the schema of many new DataFrames (via df.columns, df.schema, etc.) results in many analysis RPC requests, which can severely impact performance.
Problem 1: Schema Access in Loops
```python
import pyspark.sql.functions as F

df = spark.range(10)
for i in range(200):
    new_column_name = str(i)
    if new_column_name not in df.columns:  # Each access triggers analysis!
        df = df.withColumn(new_column_name, F.col("id") + i)
df.show()
```
Each iteration triggers a new analysis request to the server.
Mitigation 1: Track Columns Locally
```python
import pyspark.sql.functions as F

df = spark.range(10)
columns = set(df.columns)  # Get columns once

for i in range(200):
    new_column_name = str(i)
    if new_column_name not in columns:  # Check the local set instead
        df = df.withColumn(new_column_name, F.col("id") + i)
        columns.add(new_column_name)
df.show()
```
Problem 2: Creating DataFrames Just to Inspect Struct Fields
```python
from pyspark.sql.types import StructType

df = ...  # DataFrame with many struct columns

# Anti-pattern: creates and analyzes a new DataFrame per struct column
struct_column_fields = {
    column_schema.name: df.select(column_schema.name + ".*").columns
    for column_schema in df.schema
    if isinstance(column_schema.dataType, StructType)
}
```
Mitigation 2: Use Schema Directly
```python
from pyspark.sql.types import StructType

df = ...  # DataFrame with many struct columns

# Better: extract field names from the schema without creating DataFrames
struct_column_fields = {
    column_schema.name: [f.name for f in column_schema.dataType.fields]
    for column_schema in df.schema
    if isinstance(column_schema.dataType, StructType)
}
print(struct_column_fields)
```
This approach is significantly faster when dealing with many columns because it avoids creating and analyzing numerous DataFrames.
Summary
| Aspect | Spark Classic | Spark Connect |
|---|---|---|
| Query execution | Lazy | Lazy |
| Command execution | Eager | Eager |
| Schema analysis | Eager | Lazy |
| Schema access | Local | Triggers RPC, caches on first access |
| Temporary views | Plan embedded | Name lookup |
| UDF serialization | At creation | At execution |
The key difference is that Spark Connect defers analysis and name resolution to execution time.
Additional Resources
When migrating code to Spark Connect, pay special attention to temporary views, UDFs with closures, and code that frequently accesses DataFrame schemas.