The PySpark migration guide documents Python-specific changes for Apache Spark. For general Spark SQL and DataFrame changes that also affect PySpark, review the SQL Migration Guide.
The comprehensive PySpark migration guide is now maintained in the PySpark API documentation.

Complete Migration Guide

You can access the full PySpark migration guide, which includes detailed information about all Python API changes, behavior modifications, and deprecated features:

PySpark Migration Guide

Complete guide covering PySpark migrations from version 1.0 to the latest release

Key Migration Topics

The full migration guide covers these important areas:

PySpark SQL and DataFrames

  • DataFrame API changes
  • SQL function updates
  • Type handling modifications
  • Pandas integration changes (see the sketch below)
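
Many of the Pandas integration changes, for instance, come down to Arrow-backed conversion between Spark and pandas. A minimal, hedged sketch of enabling it (the config name below is the one used in recent releases):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Arrow-accelerated conversion between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
pdf = df.toPandas()                 # Spark -> pandas (Arrow-backed when enabled)
df2 = spark.createDataFrame(pdf)    # pandas -> Spark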

PySpark Streaming

  • Structured Streaming API changes
  • DStream API deprecations
  • Windowing and watermark updates (see the sketch below)
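
The windowing and watermark updates mostly concern the Structured Streaming API; a rough sketch of the current pattern (this assumes a streaming DataFrame `events` with `event_time` and `user_id` columns, which are not part of the original guide):
from pyspark.sql.functions import window

# Drop events that arrive more than 10 minutes late, then count per 5-minute window
windowed_counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "user_id")
    .count()
)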

PySpark MLlib

  • Machine learning algorithm changes
  • Pipeline API updates
  • Model persistence changes (see the sketch below)
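
Most MLlib migration work lands on the DataFrame-based Pipeline API; a small illustrative sketch of fitting and persisting a pipeline (the stages, columns, path, and `train_df` are assumptions, not part of the original guide):
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)                      # train_df is an existing training DataFrame

model.write().overwrite().save("/tmp/lr_pipeline")  # persist the fitted pipeline
reloaded = PipelineModel.load("/tmp/lr_pipeline")   # load it back later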

PySpark Core

  • RDD API modifications
  • SparkContext changes
  • Configuration updates (see the sketch below)
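
For core-level changes, the common thread is going through SparkSession rather than constructing a SparkContext directly; a minimal sketch (the config value is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("core-migration") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

sc = spark.sparkContext          # the SparkContext is still reachable when needed
print(spark.conf.get("spark.sql.shuffle.partitions"))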

Quick Start Examples

Here are some common migration scenarios:

Upgrading to PySpark 4.0

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, try_divide

# ANSI mode is enabled by default in Spark 4.0 (setting it explicitly is optional)
spark = SparkSession.builder \
    .appName("PySpark 4.0 App") \
    .config("spark.sql.ansi.enabled", "true") \
    .getOrCreate()

# Use try_* functions (and SQL try_cast) instead of relying on silent NULLs
df = spark.createDataFrame([(1, 0), (2, 2)], ["a", "b"])
result = df.select(
    try_divide("a", "b").alias("safe_division"),
    expr("try_cast(a AS BYTE)").alias("safe_cast")
)

Type Hints and Modern Python

PySpark now leverages Python type hints for better IDE support:
from typing import List

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def filter_data(df: DataFrame, columns: List[str]) -> DataFrame:
    """Project a DataFrame down to the given columns, with type safety"""
    return df.select([col(c) for c in columns])

# Type hints provide better autocomplete and error checking
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
result: DataFrame = filter_data(df, ["id", "name"])

Pandas API on Spark

Use the Pandas API for familiar syntax:
import pyspark.pandas as ps

# Read data with Pandas-like API
psdf = ps.read_csv("data.csv")

# Use familiar Pandas operations
result = psdf.groupby("category").agg({
    "value": ["mean", "sum"],
    "count": "count"
})

# Convert to a Spark DataFrame when needed (keep the group key as a column)
spark_df = result.to_spark(index_col="category")

Working with Pandas UDFs

from pyspark.sql.functions import pandas_udf
import pandas as pd

# Define a Pandas UDF for vectorized operations
@pandas_udf("double")
def multiply_by_two(values: pd.Series) -> pd.Series:
    return values * 2.0  # return floats to match the declared "double" type

# Use the UDF (columns can be passed by name)
df = spark.range(100)
result = df.select(multiply_by_two("id").alias("doubled"))

Version-Specific Highlights

PySpark 4.0

  • Enhanced type hints throughout the API
  • Improved error messages with better stack traces (see the sketch after this list)
  • Better integration with Python 3.9+ features
  • Performance improvements for Pandas UDFs
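
As an illustration of the improved error reporting, here is a small hedged sketch that catches an analysis error through the pyspark.errors classes (the misspelled column is deliberate):
from pyspark.sql import SparkSession
from pyspark.errors import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

try:
    df.select("no_such_column").show()
except AnalysisException as e:
    # Recent releases attach clearer messages and error classes to exceptions
    print(f"Analysis error: {e}")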

PySpark 3.x

  • Pandas API on Spark (formerly Koalas), added in 3.2
  • Python 2 support deprecated in 3.0 and dropped in 3.1
  • Redesigned, type-hint-based Pandas UDFs (3.0)
  • Better support for complex types

PySpark 2.3 / 2.4

  • Apache Arrow integration for Pandas conversion
  • Scalar and grouped Pandas UDFs
  • Improved Python worker reuse

Common Migration Patterns

Handling ANSI Mode Changes

# Before (ANSI mode off): division by zero silently returned NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1 / 0").show()  # Result: NULL

# After (PySpark 4.0, ANSI mode on by default): the query fails when executed
from pyspark.errors import ArithmeticException

spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT 1 / 0").show()
except ArithmeticException as e:
    print(f"Error: {e}")
    # Handle the error appropriately

# Better: use try_* functions when NULL-on-error is the desired behavior
spark.sql("SELECT try_divide(1, 0)").show()  # Result: NULL (no exception)

Migrating from RDD to DataFrame APIs

# Old approach (RDD); `sc` is the SparkContext, e.g. spark.sparkContext
rdd = sc.textFile("data.txt")
processed = rdd.map(lambda x: x.split(",")) \
               .filter(lambda x: int(x[1]) > 10) \
               .map(lambda x: (x[0], int(x[1])))

# Modern approach (DataFrame); assumes the file has a header row with name,value columns
from pyspark.sql.functions import col

df = spark.read.csv("data.txt", header=True, inferSchema=True)
processed = df.filter(col("value") > 10) \
              .select("name", "value")

Updated String Handling

from pyspark.sql.functions import col, upper, trim

# PySpark 3.0+: Better string function support
df = spark.createDataFrame([(" hello ",), (" world ",)], ["text"])

# Chain string operations
result = df.select(
    trim(col("text")).alias("trimmed"),
    upper(trim(col("text"))).alias("upper_trimmed")
)

Testing Across Python Versions

import sys
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

@pytest.fixture(scope="session")
def spark():
    """Create Spark session for testing"""
    return SparkSession.builder \
        .master("local[2]") \
        .appName("test") \
        .getOrCreate()

def test_dataframe_operation(spark):
    """Test that works across Python versions"""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    result = df.filter(col("id") > 0).count()
    assert result == 2

if __name__ == "__main__":
    # Check Python version
    if sys.version_info < (3, 9):
        print("Warning: PySpark 4.0+ requires Python 3.9 or newer")

Environment Setup

Python Version Requirements

  • PySpark 4.0+: Python 3.9 or higher
  • PySpark 3.0+: Python 3.6 or higher
  • PySpark 2.x: Python 2.7 or 3.4+

Dependency Management

# requirements.txt for PySpark 4.0 (check the release notes for the exact minimum versions)
pyspark>=4.0.0
pandas>=1.5.0
pyarrow>=10.0.0
numpy>=1.23.0

# Install dependencies
pip install -r requirements.txt

Virtual Environment Setup

# Create virtual environment
python -m venv pyspark_env
source pyspark_env/bin/activate  # On Windows: pyspark_env\Scripts\activate

# Install PySpark
pip install pyspark==4.0.0

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"

Performance Optimization

Use Pandas UDFs for Better Performance

from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType
import pandas as pd

# Pandas UDF (vectorized: operates on whole pd.Series batches at a time)
@pandas_udf("double")
def pandas_compute(values: pd.Series) -> pd.Series:
    return values * values + 2.0 * values + 1.0

# Regular UDF (row-by-row, slower)
@udf(DoubleType())
def regular_compute(value):
    return float(value ** 2 + 2 * value + 1)

# The vectorized Pandas UDF is usually substantially faster for numeric work like this
df = spark.range(1000000)
pandas_result = df.select(pandas_compute("id"))  # Fast
regular_result = df.select(regular_compute("id"))  # Slower

Efficient Data Collection

# Avoid collecting large DataFrames
# Bad: Collects all data to driver
all_data = large_df.collect()  # Can cause OOM

# Good: Use sampling or limits
sample_data = large_df.sample(0.01).collect()  # 1% sample
limited_data = large_df.limit(1000).collect()  # First 1000 rows

# toPandas() also brings data to the driver, but uses Arrow for the transfer,
# so it is usually more efficient than collect() for moderate-sized subsets
subset_df = large_df.filter(col("category") == "A")
pandas_df = subset_df.toPandas()
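
If even toPandas() is too large for the driver, one option is to stream rows back with toLocalIterator(); a brief sketch (the `process` function is a placeholder, not part of the original guide):
# Iterate over rows without materializing the whole DataFrame on the driver
for row in large_df.toLocalIterator():
    process(row)  # replace `process` with your own row handler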

Additional Resources

PySpark API Reference

Complete PySpark API documentation

PySpark SQL Guide

SQL and DataFrame programming guide

Pandas API on Spark

Guide for Pandas-like operations in PySpark

PySpark Examples

Example PySpark applications

Getting Help

For PySpark-specific questions:
