The PySpark migration guide documents Python-specific changes for Apache Spark. For general Spark SQL and DataFrame changes that also affect PySpark, review the SQL Migration Guide.
The full PySpark migration guide, with detailed information about all Python API changes, behavior changes, and deprecated features, is now maintained in the PySpark API documentation:
PySpark Migration Guide
Complete guide covering PySpark migrations from version 1.0 to the latest release
PySpark now leverages Python type hints for better IDE support:
```python
from typing import List

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col


def filter_data(df: DataFrame, columns: List[str]) -> DataFrame:
    """Filter a DataFrame to specific columns with type safety."""
    return df.select([col(c) for c in columns])


# Type hints provide better autocomplete and error checking
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
result: DataFrame = filter_data(df, ["id", "name"])
```
The pandas API on Spark (`pyspark.pandas`, added in Spark 3.2) lets you run familiar Pandas operations on distributed data:

```python
import pyspark.pandas as ps

# Read data with a Pandas-like API
psdf = ps.read_csv("data.csv")

# Use familiar Pandas operations
result = psdf.groupby("category").agg({
    "value": ["mean", "sum"],
    "count": "count",
})

# Convert to a Spark DataFrame when needed
spark_df = result.to_spark()
```
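Because `pyspark.pandas` mirrors the pandas API, the same groupby/agg shape can be sanity-checked in plain pandas without a Spark cluster. A minimal sketch, using made-up sample data:

```python
import pandas as pd

# Plain-pandas equivalent of the pyspark.pandas groupby/agg above;
# the pandas-on-Spark API intentionally mirrors this behavior.
pdf = pd.DataFrame({
    "category": ["A", "A", "B"],
    "value": [1.0, 3.0, 10.0],
    "count": [1, 1, 1],
})

# Mixing a list of functions and a single function in the dict
# produces a MultiIndex on the result columns, e.g. ("value", "mean")
result = pdf.groupby("category").agg({
    "value": ["mean", "sum"],
    "count": "count",
})
```

The result is indexed by `category`, so individual aggregates can be read with `result.loc["A", ("value", "mean")]`.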
Pandas UDFs use Apache Arrow to apply a Python function to batches of rows, avoiding per-row serialization overhead:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()


# Define a Pandas UDF for vectorized operations
@pandas_udf("double")
def multiply_by_two(values: pd.Series) -> pd.Series:
    return values * 2


# Use the UDF
df = spark.range(100)
result = df.select(multiply_by_two("id").alias("doubled"))
```
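A practical consequence of the Pandas UDF model is that the function body is ordinary pandas code operating on a `pd.Series`, so its logic can be unit-tested without a Spark session. A sketch:

```python
import pandas as pd


# The same element-wise logic the Pandas UDF applies to each batch
def multiply_by_two(values: pd.Series) -> pd.Series:
    return values * 2


# Exercise the function directly on an in-memory Series
doubled = multiply_by_two(pd.Series([1.0, 2.5, 4.0]))
```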
When pulling results back to the driver, prefer sampling, limits, or Arrow-backed conversion over a full `collect()`:

```python
from pyspark.sql.functions import col

# Avoid collecting large DataFrames.
# Bad: collects all data to the driver
all_data = large_df.collect()  # Can cause OOM

# Good: use sampling or limits
sample_data = large_df.sample(0.01).collect()   # 1% sample
limited_data = large_df.limit(1000).collect()   # First 1000 rows

# Better: use Pandas conversion for larger subsets
subset_df = large_df.filter(col("category") == "A")
pandas_df = subset_df.toPandas()  # More efficient than collect() when Arrow is enabled
```
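The efficiency of `toPandas()` depends on Arrow-based columnar transfer being enabled. A minimal sketch of the relevant session configuration, using the documented Spark 3.x keys:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow-based columnar data transfers for
# toPandas() and createDataFrame(pandas_df)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Optional: fall back to the non-Arrow path instead of failing
# when a column type is not supported by the Arrow conversion
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
```

Without these settings, `toPandas()` goes through the same row-by-row serialization as `collect()` and loses most of its advantage.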