This section provides comprehensive migration guides for each Apache Spark component to help you effectively migrate your applications across versions. Each guide covers breaking changes, deprecations, and behavior modifications.

Available Migration Guides

Spark’s migration guides are organized by component:

Core Components

Spark Core

RDD APIs, scheduling, storage, and runtime behavior changes

Spark SQL

SQL, DataFrames, and Dataset API modifications

Machine Learning

MLlib

Machine learning algorithms, pipelines, and model changes

Language-Specific APIs

PySpark

Python API changes and behavior updates

SparkR

R API changes and behavior updates

Migration Strategy

When upgrading Spark versions, follow this recommended approach:

1. Review Breaking Changes

Start by reviewing the breaking changes for your target version. These are incompatible modifications: you must update your application code before upgrading, or your application may fail at runtime.
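As a concrete illustration of the kind of break to look for: ANSI SQL mode (on by default in Spark 4.0) turns invalid casts from silent NULLs into runtime errors. The snippet below is a plain-Python sketch of that behavioral difference, not actual Spark code:

```python
def legacy_cast_int(value: str):
    """Pre-ANSI Spark behavior: an invalid cast yields NULL (None here)."""
    try:
        return int(value)
    except ValueError:
        return None

def ansi_cast_int(value: str) -> int:
    """ANSI mode: an invalid cast raises an error instead of returning NULL."""
    return int(value)  # raises ValueError for non-numeric input

print(legacy_cast_int("abc"))  # None -- the query keeps running
# ansi_cast_int("abc")         # would raise, analogous to an ANSI runtime error
```

Code that depended on invalid casts silently producing NULL is exactly the kind of code a breaking-changes review should flag.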

2. Check Deprecations

Identify deprecated APIs and plan to replace them with recommended alternatives. While deprecated features still work in the current version, they will be removed in future releases.
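One practical way to plan replacements is a simple scan of your codebase for deprecated names. The mapping below is illustrative, not exhaustive (`registerTempTable` and `unionAll` are real DataFrame deprecations with the listed replacements); consult the component guides for the full list for your target version:

```python
import re

# Illustrative deprecated DataFrame APIs and their replacements
# (not exhaustive -- see the component-specific migration guides).
DEPRECATED_APIS = {
    "registerTempTable": "createOrReplaceTempView",
    "unionAll": "union",
}

def find_deprecated_calls(source: str):
    """Return (deprecated, replacement) pairs that appear in source code."""
    return [(old, new)
            for old, new in DEPRECATED_APIS.items()
            if re.search(rf"\b{old}\b", source)]

print(find_deprecated_calls("df.registerTempTable('people')"))
```

Running a scan like this over your project gives you a concrete replacement checklist before you start the upgrade.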

3. Test Behavior Changes

Some changes modify existing behavior without breaking APIs. Test your application thoroughly to ensure results remain consistent or adjust your code accordingly.
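A lightweight way to test a behavior change is to run the same query under the old and new configuration and compare the collected rows. The helper below is a sketch; `rows_a` and `rows_b` stand in for the lists returned by `DataFrame.collect()`:

```python
def results_match(rows_a, rows_b) -> bool:
    """Compare two result sets, ignoring row order (rows given as tuples)."""
    return sorted(map(tuple, rows_a)) == sorted(map(tuple, rows_b))

# Intended use with Spark (not executed here):
#   spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
#   legacy_rows = df.collect()
#   spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
#   new_rows = df.collect()
#   assert results_match(legacy_rows, new_rows)
print(results_match([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```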

4. Update Dependencies

Ensure all external libraries and connectors are compatible with your target Spark version.

Version-Specific Considerations

Upgrading to Spark 4.0

Spark 4.0 includes several major changes:
  • ANSI SQL mode is enabled by default - Set spark.sql.ansi.enabled=false to restore previous behavior
  • Default table provider changed - CREATE TABLE without a USING clause now creates a data source table based on spark.sql.sources.default instead of a Hive SerDe table
  • Java 17 is required - JDK 8 and 11 are no longer supported
  • Hadoop 3.3.6+ is required - Earlier Hadoop versions are not supported
Spark 4.0 removes support for Apache Mesos as a resource manager. If you’re using Mesos, plan to migrate to YARN, Kubernetes, or Standalone mode.
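If you need breathing room during the upgrade, legacy flags such as the ANSI setting above can also be applied cluster-wide in spark-defaults.conf rather than per-session. A minimal fragment, using only the flag discussed above:

```
# spark-defaults.conf -- temporarily restore pre-4.0 behavior while migrating
spark.sql.ansi.enabled    false
```

Treat this as a stopgap: plan to remove the flag once your code handles ANSI semantics.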

Upgrading to Spark 3.0

Spark 3.0 was a major release with significant changes:
  • Adaptive Query Execution (AQE) is enabled by default in 3.2+
  • Proleptic Gregorian calendar replaced the hybrid calendar for date/timestamp operations
  • Built-in Hive upgraded from 1.2 to 2.3
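Python's datetime module also uses the proleptic Gregorian calendar, so it can illustrate the calendar change: under the old hybrid (Julian/Gregorian) calendar, the dates 1582-10-05 through 1582-10-14 do not exist, while the proleptic calendar used by Spark 3.0+ accepts them. A small illustration:

```python
from datetime import date

# The proleptic Gregorian calendar extends Gregorian rules backwards in
# time, so a date inside the 1582 "gap" of the hybrid calendar is valid.
d = date(1582, 10, 10)
print(d.isoformat())  # 1582-10-10
```

This is why dates and timestamps before the Gregorian switchover can read back differently after upgrading to Spark 3.0+.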

Configuration Changes

Many behavior changes can be reverted using legacy configuration flags. However, we recommend adapting to new behaviors rather than relying on legacy modes, as these flags may be removed in future versions.

Example: Restoring Legacy Behavior

# Spark 4.0: Disable ANSI mode if needed
spark.conf.set("spark.sql.ansi.enabled", "false")

# Spark 3.0: Use legacy datetime parsing
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

Cross-Version Compatibility

If you need to maintain code that works across multiple Spark versions:
  1. Check Spark version at runtime
val sparkVersion = spark.version
if (sparkVersion.startsWith("3.")) {
  // Use Spark 3.x APIs
} else if (sparkVersion.startsWith("4.")) {
  // Use Spark 4.x APIs
}
  2. Use configuration flags judiciously - Set legacy flags only when necessary
  3. Test thoroughly - Run your test suite against all supported Spark versions
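The runtime check shown above in Scala can be factored into a small helper; the Python equivalent below parses version strings like the one returned by spark.version, so the parsing itself is testable without a Spark session:

```python
def major_version(spark_version: str) -> int:
    """Extract the major version from a string like '3.5.1' or '4.0.0'."""
    return int(spark_version.split(".")[0])

# In application code (sketch):
#   if major_version(spark.version) >= 4:
#       ...  # use Spark 4.x APIs
#   else:
#       ...  # use Spark 3.x APIs
print(major_version("3.5.1"))  # 3
```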

Additional Resources

Release Notes

Detailed release notes for each Spark version

API Documentation

Complete API reference for all Spark APIs

Getting Help

If you encounter migration issues, start with the component-specific guide for the affected API, then search the Spark issue tracker and user mailing list for your error message.
