Available Migration Guides
Spark’s migration guides are organized by component:
Core Components
Spark Core
RDD APIs, scheduling, storage, and runtime behavior changes
Spark SQL
SQL, DataFrames, and Dataset API modifications
Machine Learning
MLlib
Machine learning algorithms, pipelines, and model changes
Language-Specific APIs
PySpark
Python API changes and behavior updates
SparkR
R API changes and behavior updates
Migration Strategy
When upgrading Spark versions, follow this recommended approach:
1. Review Breaking Changes
Start by reviewing the breaking changes section for your target version. Breaking changes are incompatible modifications that require you to update your application code before upgrading; if they are not addressed, your application may fail on the new version.
2. Check Deprecations
Identify deprecated APIs and plan to replace them with recommended alternatives. While deprecated features still work in the current version, they will be removed in future releases.
3. Test Behavior Changes
Some changes modify existing behavior without breaking APIs. Test your application thoroughly to ensure results remain consistent, or adjust your code accordingly.
4. Update Dependencies
Ensure all external libraries and connectors are compatible with your target Spark version.
Version-Specific Considerations
Upgrading to Spark 4.0
Spark 4.0 includes several major changes:
- ANSI SQL mode is enabled by default - set spark.sql.ansi.enabled=false to restore the previous behavior
- Default table provider changed - CREATE TABLE without USING now defaults to the value of spark.sql.sources.default instead of Hive
- Java 17 is required - JDK 8 and 11 are no longer supported
- Hadoop 3.3.6+ is required - earlier Hadoop versions are not supported
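Because of the default-provider change, one defensive pattern is to state the provider explicitly rather than rely on spark.sql.sources.default. A minimal SQL sketch (the table and column names here are hypothetical):

```sql
-- Pin the provider explicitly so the statement behaves the same
-- before and after the Spark 4.0 default-provider change.
CREATE TABLE events (
  id BIGINT,
  ts TIMESTAMP
) USING parquet;
```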
Upgrading to Spark 3.0
Spark 3.0 was a major release with significant changes:
- Adaptive Query Execution (AQE) is enabled by default in 3.2+
- Proleptic Gregorian calendar replaced the hybrid calendar for date/timestamp operations
- Built-in Hive upgraded from 1.2 to 2.3
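For the calendar change in particular, Spark 3.x exposes legacy rebase settings. A sketch of a spark-defaults.conf fragment; the flag names below exist in Spark 3.x, but verify them against your target version before relying on them:

```properties
# Read and write old Parquet datetime values as the hybrid
# (Julian + Gregorian) calendar wrote them, instead of failing.
spark.sql.legacy.parquet.datetimeRebaseModeInRead   LEGACY
spark.sql.legacy.parquet.datetimeRebaseModeInWrite  LEGACY
# Restore pre-3.0 date/time string parsing behavior
spark.sql.legacy.timeParserPolicy                   LEGACY
```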
Configuration Changes
Many behavior changes can be reverted using legacy configuration flags. However, we recommend adapting to new behaviors rather than relying on legacy modes, as these flags may be removed in future versions.
Example: Restoring Legacy Behavior
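As one sketch of the pattern, the ANSI-mode default from Spark 4.0 can be reverted cluster-wide via configuration (spark.sql.ansi.enabled is a real Spark SQL config; treat the fragment as illustrative):

```properties
# spark-defaults.conf - revert the ANSI SQL mode default from Spark 4.0
spark.sql.ansi.enabled  false
```

The same flag can also be set per session, for example with spark.conf.set("spark.sql.ansi.enabled", "false") from PySpark, or SET spark.sql.ansi.enabled=false in SQL.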
Cross-Version Compatibility
If you need to maintain code that works across multiple Spark versions:
- Check the Spark version at runtime - Gate version-specific code paths on the detected version
- Use configuration flags judiciously - Set legacy flags only when necessary
- Test thoroughly - Run your test suite against all supported Spark versions
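A runtime version check can be as simple as comparing a parsed version tuple. A minimal Python sketch; in a real application the version string would come from spark.version or pyspark.__version__ rather than the literal used here:

```python
import re

def version_tuple(version: str) -> tuple:
    """Parse a Spark version string such as '3.5.1' into a comparable
    tuple of integers, ignoring suffixes like '-preview1'."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])

# Illustrative literal; normally taken from the running session.
spark_version = "3.5.1"

# Gate version-specific behavior on the running version.
if version_tuple(spark_version) >= (3, 0, 0):
    calendar = "proleptic-gregorian"  # Spark 3.0+ semantics
else:
    calendar = "hybrid"               # pre-3.0 Julian + Gregorian calendar

print(calendar)
```

Tuple comparison handles ordering correctly across major and minor releases (for example, (3, 10, 0) sorts after (3, 9, 0)), which naive string comparison does not.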
Additional Resources
Release Notes
Detailed release notes for each Spark version
API Documentation
Complete API reference for all Spark APIs
Getting Help
If you encounter migration issues:
- Check the Spark JIRA for known issues
- Ask questions on the Spark user mailing list
- Review Stack Overflow for community solutions
