This guide covers migration for Spark SQL, including DataFrames, Datasets, SQL queries, and data sources. Review the sections relevant to your upgrade path.

Upgrading from Spark SQL 4.1 to 4.2

Shuffle Checksum Validation

Spark 4.2 enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries.
If a checksum mismatch is detected, Spark rolls back and re-executes all dependent stages. If rollback isn’t possible, the job fails.
Migration:
# Restore previous behavior if needed
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "false")
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch", "false")
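The order-independence property can be illustrated with a toy checksum. This is a conceptual sketch only, not Spark's implementation: combining per-record hashes with a commutative operation such as XOR yields the same value for any record ordering, which is what lets the checksum compare a re-executed, reordered shuffle output against the original.

```python
import zlib

def order_independent_checksum(records):
    # Combine per-record CRCs with XOR, a commutative and associative
    # operation, so the result is identical for any ordering of the records.
    acc = 0
    for rec in records:
        acc ^= zlib.crc32(rec)
    return acc

# Reordered output produces the same checksum...
assert order_independent_checksum([b"a", b"b", b"c"]) == \
       order_independent_checksum([b"c", b"a", b"b"])
# ...while changed data does not; a mismatch triggers the stage rollback.
assert order_independent_checksum([b"a", b"b"]) != order_independent_checksum([b"a", b"x"])
```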

Derby JDBC Deprecation

Support for the Derby JDBC data source is deprecated and will be removed in a future release.

Upgrading from Spark SQL 4.0 to 4.1

Parquet Struct Nullness Detection

The Parquet reader no longer assumes all struct values are null when requested fields are missing. Before (Spark 4.0 and earlier):
# If all requested struct fields are missing, returns null struct
df = spark.read.parquet("data.parquet").select("struct_col.missing_field")
# Result: struct_col = null
After (Spark 4.1):
# Reads an additional present field to determine nullness
df = spark.read.parquet("data.parquet").select("struct_col.missing_field")
# Result: struct_col = non-null with missing_field = null
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing", "true")
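The reader's decision can be sketched in plain Python. The `project_struct` helper below is hypothetical, not the actual Parquet reader code path:

```python
def project_struct(struct_value, requested, file_schema, legacy=False):
    # struct_value: dict of the struct's fields as stored in the file, or None.
    # When no requested field exists in the file, the legacy reader assumed a
    # null struct; Spark 4.1 consults actual data to preserve struct nullness.
    if not any(f in file_schema for f in requested):
        if legacy or struct_value is None:
            return None                        # Spark <= 4.0 behavior
        return {f: None for f in requested}    # Spark 4.1: non-null struct
    return {f: struct_value.get(f) for f in requested}

project_struct({"present": 1}, ["missing_field"], ["present"])
# Spark 4.1 view: {'missing_field': None}
project_struct({"present": 1}, ["missing_field"], ["present"], legacy=True)
# Legacy view: None
```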

Thrift Server Column Ordinal Position

The Spark Thrift Server now returns 1-based ORDINAL_POSITION in GetColumns operation results, instead of incorrectly returning 0-based positions.
# Restore previous 0-based behavior if needed
spark.conf.set("spark.sql.legacy.hive.thriftServer.useZeroBasedColumnOrdinalPosition", "true")
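Client code that indexed columns assuming the old 0-based values needs adjusting; the new numbering matches the 1-based contract of JDBC's DatabaseMetaData.getColumns. A minimal illustration, with hypothetical column names:

```python
columns = ["id", "name", "email"]

# Spark 4.1+: ORDINAL_POSITION is 1-based, matching the JDBC specification
ordinals = {name: pos for pos, name in enumerate(columns, start=1)}
# e.g. {'id': 1, 'name': 2, 'email': 3}
```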

Upgrading from Spark SQL 3.5 to 4.0

ANSI Mode Enabled by Default

Spark 4.0 enables ANSI SQL mode by default, providing stricter SQL semantics and better SQL standard compliance.
ANSI mode changes error-handling behavior: operations that previously returned NULL or silently wrapped on errors now throw exceptions.
Before (Spark 3.5):
-- Division by zero returns NULL
SELECT 1 / 0;  -- Result: NULL

-- Overflow wraps around silently
SELECT CAST(1000 AS BYTE);  -- Result: -24
After (Spark 4.0):
-- Division by zero throws exception
SELECT 1 / 0;  -- Error: DIVIDE_BY_ZERO

-- Overflow throws exception
SELECT CAST(1000 AS BYTE);  -- Error: CAST_OVERFLOW
Migration:
# Disable ANSI mode to restore previous behavior
spark.conf.set("spark.sql.ansi.enabled", "false")
# Or use environment variable
# SPARK_ANSI_SQL_MODE=false

# Better: Handle errors explicitly with try_* functions
spark.sql("""
  SELECT 
    try_divide(1, 0) as safe_division,
    try_cast('1000' as byte) as safe_cast
""")
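For intuition, the try_* semantics can be mimicked in plain Python. These are hypothetical helpers illustrating the NULL-on-error contract, not the pyspark.sql.functions API:

```python
def try_divide(a, b):
    # Like SQL try_divide: None (NULL) instead of a DIVIDE_BY_ZERO error
    return None if b == 0 else a / b

def try_cast_int(value):
    # Like try_cast(... AS INT): None (NULL) for values that cannot be cast
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

try_divide(1, 0)        # None (ANSI division would raise)
try_cast_int("1000")    # 1000
try_cast_int("oops")    # None
```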

CREATE TABLE Default Provider

CREATE TABLE syntax without USING or STORED AS now uses spark.sql.sources.default instead of Hive. Before (Spark 3.5):
-- Creates Hive table
CREATE TABLE users (id INT, name STRING);
After (Spark 4.0):
-- Creates table using spark.sql.sources.default (typically Parquet)
CREATE TABLE users (id INT, name STRING);

-- Explicitly specify Hive if needed
CREATE TABLE users (id INT, name STRING) STORED AS PARQUET;
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
# Or set via environment variable
# SPARK_SQL_LEGACY_CREATE_HIVE_TABLE=true

Map Key Normalization

Map creation functions now normalize keys by converting -0.0 to 0.0. Before (Spark 3.5):
from pyspark.sql.functions import create_map, lit

# Keys -0.0 and 0.0 are treated as different
df = spark.sql("SELECT map(-0.0, 'a', 0.0, 'b') as m")
df.show()
# Result: {-0.0 -> a, 0.0 -> b}
After (Spark 4.0):
# Keys are normalized, -0.0 becomes 0.0
df = spark.sql("SELECT map(-0.0, 'a', 0.0, 'b') as m")
df.show()
# Result: {0.0 -> b}  (last value wins)
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.disableMapKeyNormalization", "true")
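The normalization itself is simple and can be sketched in plain Python. This is conceptual only, and assumes a last-wins duplicate-key policy to match the example above:

```python
import math

def normalize_key(key):
    # Spark 4.0 map creation folds -0.0 into 0.0 before inserting the key
    if isinstance(key, float) and key == 0.0:
        return 0.0
    return key

def build_map(pairs):
    m = {}
    for k, v in pairs:
        m[normalize_key(k)] = v   # duplicate keys: last value wins
    return m

m = build_map([(-0.0, "a"), (0.0, "b")])
# Both keys normalize to +0.0, so a single entry remains: {0.0: 'b'}
assert math.copysign(1.0, next(iter(m))) == 1.0   # key stored as +0.0
```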

File Reading Configuration Changes

SQL-specific configs now take precedence for SQL table reads. Before (Spark 3.5):
# Uses core config
spark.conf.set("spark.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("table")  # Uses core config
After (Spark 4.0):
# Must use SQL-specific config
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("table")  # Uses SQL config
AccessControlException and BlockMissingException now fail tasks even when ignoreCorruptFiles is enabled.

JDBC Type Mapping Changes

Multiple JDBC datasources have updated type mappings for better compatibility:

PostgreSQL

# Before (Spark 3.5): TIMESTAMP WITH TIME ZONE -> TimestampNTZ (if preferTimestampNTZ=true)
# After (Spark 4.0): TIMESTAMP WITH TIME ZONE -> Timestamp (regardless of preferTimestampNTZ)

df = spark.read.jdbc(
    url="jdbc:postgresql://host/db",
    table="events",
    properties={"preferTimestampNTZ": "true"}
)

# Restore previous behavior
spark.conf.set("spark.sql.legacy.postgres.datetimeMapping.enabled", "true")

MySQL

Type mapping changes:
MySQL Type       Spark 3.5      Spark 4.0
SMALLINT         IntegerType    ShortType
FLOAT            DoubleType     FloatType
BIT(n > 1)       LongType       BinaryType
TINYINT(n > 1)   IntegerType    ByteType
# Restore legacy MySQL type mappings
spark.conf.set("spark.sql.legacy.mysql.timestampNTZMapping.enabled", "true")
spark.conf.set("spark.sql.legacy.mysql.bitArrayMapping.enabled", "true")

Encode/Decode Function Charset Restrictions

The encode() and decode() functions now support only standard charsets. Supported charsets:
  • US-ASCII
  • ISO-8859-1
  • UTF-8
  • UTF-16BE, UTF-16LE, UTF-16
  • UTF-32
# Before (Spark 3.5): Supports any JDK charset
df = spark.sql("SELECT encode('text', 'windows-1252')")

# After (Spark 4.0): Limited to standard charsets
df = spark.sql("SELECT encode('text', 'UTF-8')")

# Restore previous behavior (allows all JDK charsets)
spark.conf.set("spark.sql.legacy.javaCharsets", "true")
Error handling change:
# Spark 4.0: Raises error for unmappable characters
spark.sql("SELECT decode(X'E97374', 'UTF-8')")  # Error: MALFORMED_CHARACTER_CODING

# Restore previous behavior (returns mojibake)
spark.conf.set("spark.sql.legacy.codingErrorAction", "true")
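Plain Python's codecs exhibit the same two modes. In the bytes from the example above, 0xE9 begins a multi-byte UTF-8 sequence but is not followed by a continuation byte, so strict decoding fails while lenient decoding substitutes a replacement character (mojibake):

```python
data = bytes.fromhex("e97374")  # 0xE9 starts a 3-byte UTF-8 sequence; "st" cannot continue it

# Strict decoding (analogous to the Spark 4.0 default): malformed input raises
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Lenient decoding (analogous to legacy codingErrorAction): mojibake instead
assert data.decode("utf-8", errors="replace") == "\ufffdst"
```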

CTE Precedence Policy

Inner CTE definitions now take precedence over outer definitions by default. Before (Spark 3.5):
-- Throws EXCEPTION by default
WITH t AS (SELECT 1 as a),
     t2 AS (
       WITH t AS (SELECT 2 as a)
       SELECT * FROM t
     )
SELECT * FROM t2;  -- Error: Ambiguous CTE reference
After (Spark 4.0):
-- Inner CTE takes precedence
WITH t AS (SELECT 1 as a),
     t2 AS (
       WITH t AS (SELECT 2 as a)
       SELECT * FROM t
     )
SELECT * FROM t2;  -- Result: 2
# Restore previous behavior
spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", "EXCEPTION")

View Schema Compensation

Views now tolerate column type changes with automatic cast insertion.
-- Create view
CREATE VIEW v AS SELECT CAST(1 AS INT) as col;

-- Spark 3.5: Breaks if underlying query type changes
-- Spark 4.0: Automatically compensates with casts

Migration:
# Disable auto-compensation
spark.conf.set("spark.sql.legacy.viewSchemaCompensation", "false")

Upgrading from Spark SQL 3.4 to 3.5

JDBC Pushdown Options

JDBC pushdown options are now enabled by default.
# These are now true by default:
# - pushDownAggregate
# - pushDownLimit  
# - pushDownOffset
# - pushDownTableSample

# Restore previous behavior
spark.conf.set("spark.sql.catalog.mycatalog.pushDownAggregate", "false")

Array Insert Function

The array_insert function now uses 1-based indexing for negative indexes. Before (Spark 3.4):
from pyspark.sql.functions import array_insert, array, lit

# Index -1 refers to position before first element
df = spark.sql("SELECT array_insert(array(1,2,3), -1, 0)")
# Result: [0, 1, 2, 3]
After (Spark 3.5):
# Index -1 refers to last position (appends)
df = spark.sql("SELECT array_insert(array(1,2,3), -1, 0)")
# Result: [1, 2, 3, 0]
# Restore previous behavior
spark.conf.set("spark.sql.legacy.negativeIndexInArrayInsert", "true")
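The new negative-index rule can be modeled in plain Python. This is a sketch of the Spark 3.5 semantics only, not Spark's implementation, and it omits Spark's error handling for position 0:

```python
def array_insert(arr, pos, item):
    # Spark 3.5 semantics: positions are 1-based; negative positions count
    # 1-based from the end, so pos == -1 appends after the last element.
    if pos >= 1:
        idx = pos - 1
    else:
        idx = len(arr) + pos + 1
    return arr[:idx] + [item] + arr[idx:]

array_insert([1, 2, 3], -1, 0)   # [1, 2, 3, 0] (appends)
array_insert([1, 2, 3], 2, 0)    # [1, 0, 2, 3]
```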

Best Practices

Testing Strategy

  1. Create a test environment with the target Spark version
  2. Run your test suite and capture any failures
  3. Review configuration changes needed for compatibility
  4. Update code to use new APIs instead of legacy flags when possible

Gradual Migration

# Example: Gradual ANSI mode adoption
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.ansi.enabled", "true") \
    .config("spark.sql.storeAssignmentPolicy", "ANSI") \
    .getOrCreate()

# Use try_* functions for graceful degradation
df = spark.sql("""
    SELECT 
        try_cast(input_col AS INT) as safe_value,
        CASE 
            WHEN try_cast(input_col AS INT) IS NULL 
            THEN 'INVALID'
            ELSE 'VALID'
        END as status
    FROM source_table
""")

Logging Configuration Changes

Track which legacy configurations your application uses:
import logging

logger = logging.getLogger(__name__)

legacy_configs = [
    "spark.sql.ansi.enabled",
    "spark.sql.legacy.timeParserPolicy",
    "spark.sql.legacy.createHiveTableByDefault"
]

for config in legacy_configs:
    # assumes an active SparkSession bound to `spark`
    value = spark.conf.get(config, None)
    if value is not None:
        logger.warning(f"Using legacy config: {config}={value}")
