The PySpark migration guide documents Python-specific changes for Apache Spark. For general Spark SQL and DataFrame changes that also affect PySpark, review the SQL Migration Guide.
The comprehensive PySpark migration guide is now maintained in the PySpark API documentation.

Complete Migration Guide

You can access the full PySpark migration guide, which includes detailed information about all Python API changes, behavior modifications, and deprecated features:

PySpark Migration Guide

Complete guide covering PySpark migrations from version 1.0 to the latest release

Key Migration Topics

The full migration guide covers these important areas:

PySpark SQL and DataFrames

  • DataFrame API changes
  • SQL function updates
  • Type handling modifications
  • Pandas integration changes (see the sketch below)
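
Many of the Pandas integration changes, for instance, come down to Arrow-backed conversion between Spark and pandas. A minimal, hedged sketch of enabling it (the config name below is the one used in recent releases):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Arrow-accelerated conversion between Spark and pandas DataFrames
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
pdf = df.toPandas()                 # Spark -> pandas (Arrow-backed when enabled)
df2 = spark.createDataFrame(pdf)    # pandas -> Spark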

PySpark Streaming

  • Structured Streaming API changes
  • DStream API deprecations
  • Windowing and watermark updates (see the sketch below)
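
The windowing and watermark updates mostly concern the Structured Streaming API; a rough sketch of the current pattern (this assumes a streaming DataFrame `events` with `event_time` and `user_id` columns, which are not part of the original guide):
from pyspark.sql.functions import window

# Drop events that arrive more than 10 minutes late, then count per 5-minute window
windowed_counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "user_id")
    .count()
)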

PySpark MLlib

  • Machine learning algorithm changes
  • Pipeline API updates
  • Model persistence changes (see the sketch below)
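
Most MLlib migration work lands on the DataFrame-based Pipeline API; a small illustrative sketch of fitting and persisting a pipeline (the stages, columns, path, and `train_df` are assumptions, not part of the original guide):
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])
model = pipeline.fit(train_df)                      # train_df is an existing training DataFrame

model.write().overwrite().save("/tmp/lr_pipeline")  # persist the fitted pipeline
reloaded = PipelineModel.load("/tmp/lr_pipeline")   # load it back later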

PySpark Core

  • RDD API modifications
  • SparkContext changes
  • Configuration updates (see the sketch below)
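
For core-level changes, the common thread is going through SparkSession rather than constructing a SparkContext directly; a minimal sketch (the config value is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("core-migration") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

sc = spark.sparkContext          # the SparkContext is still reachable when needed
print(spark.conf.get("spark.sql.shuffle.partitions"))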

Quick Start Examples

Here are some common migration scenarios:

Upgrading to PySpark 4.0

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, try_divide

# ANSI mode is enabled by default in Spark 4.0 (setting it explicitly is optional)
spark = SparkSession.builder \
    .appName("PySpark 4.0 App") \
    .config("spark.sql.ansi.enabled", "true") \
    .getOrCreate()

# Use try_* functions (and SQL try_cast) instead of relying on silent NULLs
df = spark.createDataFrame([(1, 0), (2, 2)], ["a", "b"])
result = df.select(
    try_divide("a", "b").alias("safe_division"),
    expr("try_cast(a AS BYTE)").alias("safe_cast")
)

Type Hints and Modern Python

PySpark now leverages Python type hints for better IDE support:
from typing import List

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col

def filter_data(df: DataFrame, columns: List[str]) -> DataFrame:
    """Project a DataFrame down to the given columns, with type safety"""
    return df.select([col(c) for c in columns])

# Type hints provide better autocomplete and error checking
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
result: DataFrame = filter_data(df, ["id", "name"])

Pandas API on Spark

Use the Pandas API for familiar syntax:
import pyspark.pandas as ps

# Read data with Pandas-like API
psdf = ps.read_csv("data.csv")

# Use familiar Pandas operations
result = psdf.groupby("category").agg({
    "value": ["mean", "sum"],
    "count": "count"
})

# Convert to a Spark DataFrame when needed (keep the group key as a column)
spark_df = result.to_spark(index_col="category")

Working with Pandas UDFs

from pyspark.sql.functions import pandas_udf
import pandas as pd

# Define a Pandas UDF for vectorized operations
@pandas_udf("double")
def multiply_by_two(values: pd.Series) -> pd.Series:
    return values * 2.0  # return floats to match the declared "double" type

# Use the UDF (columns can be passed by name)
df = spark.range(100)
result = df.select(multiply_by_two("id").alias("doubled"))

Version-Specific Highlights

PySpark 4.0

  • Enhanced type hints throughout the API
  • Improved error messages with better stack traces (see the sketch after this list)
  • Better integration with Python 3.9+ features
  • Performance improvements for Pandas UDFs
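
As an illustration of the improved error reporting, here is a small hedged sketch that catches an analysis error through the pyspark.errors classes (the misspelled column is deliberate):
from pyspark.sql import SparkSession
from pyspark.errors import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

try:
    df.select("no_such_column").show()
except AnalysisException as e:
    # Recent releases attach clearer messages and error classes to exceptions
    print(f"Analysis error: {e}")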

PySpark 3.x

  • Pandas API on Spark (formerly Koalas), added in 3.2
  • Python 2 support deprecated in 3.0 and dropped in 3.1
  • Redesigned, type-hint-based Pandas UDFs (3.0)
  • Better support for complex types

PySpark 2.3 / 2.4

  • Apache Arrow integration for Pandas conversion
  • Scalar and grouped Pandas UDFs
  • Improved Python worker reuse

Common Migration Patterns

Handling ANSI Mode Changes

# Before (ANSI mode off): division by zero silently returned NULL
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT 1 / 0").show()  # Result: NULL

# After (PySpark 4.0, ANSI mode on by default): the query fails when executed
from pyspark.errors import ArithmeticException

spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT 1 / 0").show()
except ArithmeticException as e:
    print(f"Error: {e}")
    # Handle the error appropriately

# Better: use try_* functions when NULL-on-error is the desired behavior
spark.sql("SELECT try_divide(1, 0)").show()  # Result: NULL (no exception)

Migrating from RDD to DataFrame APIs

# Old approach (RDD); `sc` is the SparkContext, e.g. spark.sparkContext
rdd = sc.textFile("data.txt")
processed = rdd.map(lambda x: x.split(",")) \
               .filter(lambda x: int(x[1]) > 10) \
               .map(lambda x: (x[0], int(x[1])))

# Modern approach (DataFrame); assumes the file has a header row with name,value columns
from pyspark.sql.functions import col

df = spark.read.csv("data.txt", header=True, inferSchema=True)
processed = df.filter(col("value") > 10) \
              .select("name", "value")

Updated String Handling

from pyspark.sql.functions import col, upper, trim

# PySpark 3.0+: Better string function support
df = spark.createDataFrame([(" hello ",), (" world ",)], ["text"])

# Chain string operations
result = df.select(
    trim(col("text")).alias("trimmed"),
    upper(trim(col("text"))).alias("upper_trimmed")
)

Testing Across Python Versions

import sys
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

@pytest.fixture(scope="session")
def spark():
    """Create Spark session for testing"""
    return SparkSession.builder \
        .master("local[2]") \
        .appName("test") \
        .getOrCreate()

def test_dataframe_operation(spark):
    """Test that works across Python versions"""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    result = df.filter(col("id") > 0).count()
    assert result == 2

if __name__ == "__main__":
    # Check Python version
    if sys.version_info < (3, 9):
        print("Warning: PySpark 4.0+ requires Python 3.9 or newer")

Environment Setup

Python Version Requirements

  • PySpark 4.0+: Python 3.9 or higher
  • PySpark 3.0+: Python 3.6 or higher
  • PySpark 2.x: Python 2.7 or 3.4+

Dependency Management

# requirements.txt for PySpark 4.0 (check the release notes for the exact minimum versions)
pyspark>=4.0.0
pandas>=1.5.0
pyarrow>=10.0.0
numpy>=1.23.0

# Install dependencies
pip install -r requirements.txt

Virtual Environment Setup

# Create virtual environment
python -m venv pyspark_env
source pyspark_env/bin/activate  # On Windows: pyspark_env\Scripts\activate

# Install PySpark
pip install pyspark==4.0.0

# Verify installation
python -c "import pyspark; print(pyspark.__version__)"

Performance Optimization

Use Pandas UDFs for Better Performance

from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType
import pandas as pd

# Pandas UDF (vectorized: operates on whole pd.Series batches at a time)
@pandas_udf("double")
def pandas_compute(values: pd.Series) -> pd.Series:
    return values * values + 2.0 * values + 1.0

# Regular UDF (row-by-row, slower)
@udf(DoubleType())
def regular_compute(value):
    return float(value ** 2 + 2 * value + 1)

# The vectorized Pandas UDF is usually substantially faster for numeric work like this
df = spark.range(1000000)
pandas_result = df.select(pandas_compute("id"))  # Fast
regular_result = df.select(regular_compute("id"))  # Slower

Efficient Data Collection

# Avoid collecting large DataFrames
# Bad: Collects all data to driver
all_data = large_df.collect()  # Can cause OOM

# Good: Use sampling or limits
sample_data = large_df.sample(0.01).collect()  # 1% sample
limited_data = large_df.limit(1000).collect()  # First 1000 rows

# toPandas() also brings data to the driver, but uses Arrow for the transfer,
# so it is usually more efficient than collect() for moderate-sized subsets
subset_df = large_df.filter(col("category") == "A")
pandas_df = subset_df.toPandas()
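
If even toPandas() is too large for the driver, one option is to stream rows back with toLocalIterator(); a brief sketch (the `process` function is a placeholder, not part of the original guide):
# Iterate over rows without materializing the whole DataFrame on the driver
for row in large_df.toLocalIterator():
    process(row)  # replace `process` with your own row handler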

Additional Resources

PySpark API Reference

Complete PySpark API documentation

PySpark SQL Guide

SQL and DataFrame programming guide

Pandas API on Spark

Guide for Pandas-like operations in PySpark

PySpark Examples

Example PySpark applications

Getting Help

For PySpark-specific questions:
