This guide covers migration for Spark SQL, including DataFrames, Datasets, SQL queries, and data sources. Review the sections relevant to your upgrade path.

Upgrading from Spark SQL 4.1 to 4.2

Shuffle Checksum Validation

Spark 4.2 enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries.
If a checksum mismatch is detected, Spark rolls back and re-executes all dependent stages. If rollback isn’t possible, the job fails.
Migration:
# Restore previous behavior if needed
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "false")
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch", "false")
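The order-independence property can be illustrated with a toy checksum. This is a conceptual sketch only, not Spark's implementation: combining per-record hashes with a commutative operation such as XOR yields the same value for any record ordering, which is what lets the checksum compare a re-executed, reordered shuffle output against the original.

```python
import zlib

def order_independent_checksum(records):
    # Combine per-record CRCs with XOR, a commutative and associative
    # operation, so the result is identical for any ordering of the records.
    acc = 0
    for rec in records:
        acc ^= zlib.crc32(rec)
    return acc

# Reordered output produces the same checksum...
assert order_independent_checksum([b"a", b"b", b"c"]) == \
       order_independent_checksum([b"c", b"a", b"b"])
# ...while changed data does not; a mismatch triggers the stage rollback.
assert order_independent_checksum([b"a", b"b"]) != order_independent_checksum([b"a", b"x"])
```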

Derby JDBC Deprecation

Support for the Derby JDBC data source is deprecated and will be removed in a future release.

Upgrading from Spark SQL 4.0 to 4.1

Parquet Struct Nullness Detection

The Parquet reader no longer assumes all struct values are null when requested fields are missing. Before (Spark 4.0 and earlier):
# If all requested struct fields are missing, returns null struct
df = spark.read.parquet("data.parquet").select("struct_col.missing_field")
# Result: struct_col = null
After (Spark 4.1):
# Reads an additional present field to determine nullness
df = spark.read.parquet("data.parquet").select("struct_col.missing_field")
# Result: struct_col = non-null with missing_field = null
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.parquet.returnNullStructIfAllFieldsMissing", "true")
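The reader's decision can be sketched in plain Python. The `project_struct` helper below is hypothetical, not the actual Parquet reader code path:

```python
def project_struct(struct_value, requested, file_schema, legacy=False):
    # struct_value: dict of the struct's fields as stored in the file, or None.
    # When no requested field exists in the file, the legacy reader assumed a
    # null struct; Spark 4.1 consults actual data to preserve struct nullness.
    if not any(f in file_schema for f in requested):
        if legacy or struct_value is None:
            return None                        # Spark <= 4.0 behavior
        return {f: None for f in requested}    # Spark 4.1: non-null struct
    return {f: struct_value.get(f) for f in requested}

project_struct({"present": 1}, ["missing_field"], ["present"])
# Spark 4.1 view: {'missing_field': None}
project_struct({"present": 1}, ["missing_field"], ["present"], legacy=True)
# Legacy view: None
```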

Thrift Server Column Ordinal Position

The Spark Thrift Server now returns 1-based ORDINAL_POSITION in GetColumns operation results, instead of incorrectly returning 0-based positions.
# Restore previous 0-based behavior if needed
spark.conf.set("spark.sql.legacy.hive.thriftServer.useZeroBasedColumnOrdinalPosition", "true")
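Client code that indexed columns assuming the old 0-based values needs adjusting; the new numbering matches the 1-based contract of JDBC's DatabaseMetaData.getColumns. A minimal illustration, with hypothetical column names:

```python
columns = ["id", "name", "email"]

# Spark 4.1+: ORDINAL_POSITION is 1-based, matching the JDBC specification
ordinals = {name: pos for pos, name in enumerate(columns, start=1)}
# e.g. {'id': 1, 'name': 2, 'email': 3}
```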

Upgrading from Spark SQL 3.5 to 4.0

ANSI Mode Enabled by Default

Spark 4.0 enables ANSI SQL mode by default, providing stricter SQL semantics and better SQL standard compliance.
ANSI mode changes error-handling behavior: operations that previously returned NULL or silently wrapped on errors now throw exceptions.
Before (Spark 3.5):
-- Division by zero returns NULL
SELECT 1 / 0;  -- Result: NULL

-- Overflow wraps around silently
SELECT CAST(1000 AS BYTE);  -- Result: -24
After (Spark 4.0):
-- Division by zero throws exception
SELECT 1 / 0;  -- Error: DIVIDE_BY_ZERO

-- Overflow throws exception
SELECT CAST(1000 AS BYTE);  -- Error: CAST_OVERFLOW
Migration:
# Disable ANSI mode to restore previous behavior
spark.conf.set("spark.sql.ansi.enabled", "false")
# Or use environment variable
# SPARK_ANSI_SQL_MODE=false

# Better: Handle errors explicitly with try_* functions
spark.sql("""
  SELECT 
    try_divide(1, 0) as safe_division,
    try_cast('1000' as byte) as safe_cast
""")
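For intuition, the try_* semantics can be mimicked in plain Python. These are hypothetical helpers illustrating the NULL-on-error contract, not the pyspark.sql.functions API:

```python
def try_divide(a, b):
    # Like SQL try_divide: None (NULL) instead of a DIVIDE_BY_ZERO error
    return None if b == 0 else a / b

def try_cast_int(value):
    # Like try_cast(... AS INT): None (NULL) for values that cannot be cast
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

try_divide(1, 0)        # None (ANSI division would raise)
try_cast_int("1000")    # 1000
try_cast_int("oops")    # None
```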

CREATE TABLE Default Provider

CREATE TABLE syntax without USING or STORED AS now uses spark.sql.sources.default instead of Hive. Before (Spark 3.5):
-- Creates Hive table
CREATE TABLE users (id INT, name STRING);
After (Spark 4.0):
-- Creates table using spark.sql.sources.default (typically Parquet)
CREATE TABLE users (id INT, name STRING);

-- Explicitly specify Hive if needed
CREATE TABLE users (id INT, name STRING) STORED AS PARQUET;
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
# Or set via environment variable
# SPARK_SQL_LEGACY_CREATE_HIVE_TABLE=true

Map Key Normalization

Map creation functions now normalize keys by converting -0.0 to 0.0. Before (Spark 3.5):
from pyspark.sql.functions import create_map, lit

# Keys -0.0 and 0.0 are treated as different
df = spark.sql("SELECT map(-0.0, 'a', 0.0, 'b') as m")
df.show()
# Result: {-0.0 -> a, 0.0 -> b}
After (Spark 4.0):
# Keys are normalized, -0.0 becomes 0.0
df = spark.sql("SELECT map(-0.0, 'a', 0.0, 'b') as m")
df.show()
# Result: {0.0 -> b}  (last value wins)
Migration:
# Restore previous behavior
spark.conf.set("spark.sql.legacy.disableMapKeyNormalization", "true")
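The normalization itself is simple and can be sketched in plain Python. This is conceptual only, and assumes a last-wins duplicate-key policy to match the example above:

```python
import math

def normalize_key(key):
    # Spark 4.0 map creation folds -0.0 into 0.0 before inserting the key
    if isinstance(key, float) and key == 0.0:
        return 0.0
    return key

def build_map(pairs):
    m = {}
    for k, v in pairs:
        m[normalize_key(k)] = v   # duplicate keys: last value wins
    return m

m = build_map([(-0.0, "a"), (0.0, "b")])
# Both keys normalize to +0.0, so a single entry remains: {0.0: 'b'}
assert math.copysign(1.0, next(iter(m))) == 1.0   # key stored as +0.0
```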

File Reading Configuration Changes

SQL-specific configs now take precedence for SQL table reads. Before (Spark 3.5):
# Uses core config
spark.conf.set("spark.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("table")  # Uses core config
After (Spark 4.0):
# Must use SQL-specific config
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("table")  # Uses SQL config
AccessControlException and BlockMissingException now fail tasks even when ignoreCorruptFiles is enabled.

JDBC Type Mapping Changes

Multiple JDBC datasources have updated type mappings for better compatibility:

PostgreSQL

# Before (Spark 3.5): TIMESTAMP WITH TIME ZONE -> TimestampNTZ (if preferTimestampNTZ=true)
# After (Spark 4.0): TIMESTAMP WITH TIME ZONE -> Timestamp (regardless of preferTimestampNTZ)

df = spark.read.jdbc(
    url="jdbc:postgresql://host/db",
    table="events",
    properties={"preferTimestampNTZ": "true"}
)

# Restore previous behavior
spark.conf.set("spark.sql.legacy.postgres.datetimeMapping.enabled", "true")

MySQL

Type mapping changes:
MySQL Type       Spark 3.5      Spark 4.0
SMALLINT         IntegerType    ShortType
FLOAT            DoubleType     FloatType
BIT(n > 1)       LongType       BinaryType
TINYINT(n > 1)   IntegerType    ByteType
# Restore legacy MySQL type mappings
spark.conf.set("spark.sql.legacy.mysql.timestampNTZMapping.enabled", "true")
spark.conf.set("spark.sql.legacy.mysql.bitArrayMapping.enabled", "true")

Encode/Decode Function Charset Restrictions

The encode() and decode() functions now support only standard charsets. Supported charsets:
  • US-ASCII
  • ISO-8859-1
  • UTF-8
  • UTF-16BE, UTF-16LE, UTF-16
  • UTF-32
# Before (Spark 3.5): Supports any JDK charset
df = spark.sql("SELECT encode('text', 'windows-1252')")

# After (Spark 4.0): Limited to standard charsets
df = spark.sql("SELECT encode('text', 'UTF-8')")

# Restore previous behavior (allows all JDK charsets)
spark.conf.set("spark.sql.legacy.javaCharsets", "true")
Error handling change:
# Spark 4.0: Raises error for unmappable characters
spark.sql("SELECT decode(X'E97374', 'UTF-8')")  # Error: MALFORMED_CHARACTER_CODING

# Restore previous behavior (returns mojibake)
spark.conf.set("spark.sql.legacy.codingErrorAction", "true")
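Plain Python's codecs exhibit the same two modes. In the bytes from the example above, 0xE9 begins a multi-byte UTF-8 sequence but is not followed by a continuation byte, so strict decoding fails while lenient decoding substitutes a replacement character (mojibake):

```python
data = bytes.fromhex("e97374")  # 0xE9 starts a 3-byte UTF-8 sequence; "st" cannot continue it

# Strict decoding (analogous to the Spark 4.0 default): malformed input raises
try:
    data.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True
assert raised

# Lenient decoding (analogous to legacy codingErrorAction): mojibake instead
assert data.decode("utf-8", errors="replace") == "\ufffdst"
```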

CTE Precedence Policy

Inner CTE definitions now take precedence over outer definitions by default. Before (Spark 3.5):
-- Throws EXCEPTION by default
WITH t AS (SELECT 1 as a),
     t2 AS (
       WITH t AS (SELECT 2 as a)
       SELECT * FROM t
     )
SELECT * FROM t2;  -- Error: Ambiguous CTE reference
After (Spark 4.0):
-- Inner CTE takes precedence
WITH t AS (SELECT 1 as a),
     t2 AS (
       WITH t AS (SELECT 2 as a)
       SELECT * FROM t
     )
SELECT * FROM t2;  -- Result: 2
# Restore previous behavior
spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", "EXCEPTION")

View Schema Compensation

Views now tolerate column type changes with automatic cast insertion.
-- Create view
CREATE VIEW v AS SELECT CAST(1 AS INT) as col;

-- Spark 3.5: Breaks if underlying query type changes
-- Spark 4.0: Automatically compensates with casts

Migration:
# Disable auto-compensation
spark.conf.set("spark.sql.legacy.viewSchemaCompensation", "false")

Upgrading from Spark SQL 3.4 to 3.5

JDBC Pushdown Options

JDBC pushdown options are now enabled by default.
# These are now true by default:
# - pushDownAggregate
# - pushDownLimit  
# - pushDownOffset
# - pushDownTableSample

# Restore previous behavior
spark.conf.set("spark.sql.catalog.mycatalog.pushDownAggregate", "false")

Array Insert Function

The array_insert function now uses 1-based indexing for negative indexes. Before (Spark 3.4):
from pyspark.sql.functions import array_insert, array, lit

# Index -1 refers to position before first element
df = spark.sql("SELECT array_insert(array(1,2,3), -1, 0)")
# Result: [0, 1, 2, 3]
After (Spark 3.5):
# Index -1 refers to last position (appends)
df = spark.sql("SELECT array_insert(array(1,2,3), -1, 0)")
# Result: [1, 2, 3, 0]
# Restore previous behavior
spark.conf.set("spark.sql.legacy.negativeIndexInArrayInsert", "true")
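The new negative-index rule can be modeled in plain Python. This is a sketch of the Spark 3.5 semantics only, not Spark's implementation, and it omits Spark's error handling for position 0:

```python
def array_insert(arr, pos, item):
    # Spark 3.5 semantics: positions are 1-based; negative positions count
    # 1-based from the end, so pos == -1 appends after the last element.
    if pos >= 1:
        idx = pos - 1
    else:
        idx = len(arr) + pos + 1
    return arr[:idx] + [item] + arr[idx:]

array_insert([1, 2, 3], -1, 0)   # [1, 2, 3, 0] (appends)
array_insert([1, 2, 3], 2, 0)    # [1, 0, 2, 3]
```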

Best Practices

Testing Strategy

  1. Create a test environment with the target Spark version
  2. Run your test suite and capture any failures
  3. Review configuration changes needed for compatibility
  4. Update code to use new APIs instead of legacy flags when possible

Gradual Migration

# Example: Gradual ANSI mode adoption
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.ansi.enabled", "true") \
    .config("spark.sql.storeAssignmentPolicy", "ANSI") \
    .getOrCreate()

# Use try_* functions for graceful degradation
df = spark.sql("""
    SELECT 
        try_cast(input_col AS INT) as safe_value,
        CASE 
            WHEN try_cast(input_col AS INT) IS NULL 
            THEN 'INVALID'
            ELSE 'VALID'
        END as status
    FROM source_table
""")

Logging Configuration Changes

Track which legacy configurations your application uses:
import logging

logger = logging.getLogger(__name__)

legacy_configs = [
    "spark.sql.ansi.enabled",
    "spark.sql.legacy.timeParserPolicy",
    "spark.sql.legacy.createHiveTableByDefault"
]

for config in legacy_configs:
    # assumes an active SparkSession bound to `spark`
    value = spark.conf.get(config, None)
    if value is not None:
        logger.warning(f"Using legacy config: {config}={value}")
