The Scala API is Spark’s primary API, providing strongly-typed Dataset and DataFrame operations with functional programming capabilities.

API Documentation

Access the complete Scala API documentation (Scaladoc) at: Spark Scala API (Scaladoc)

Core Packages

org.apache.spark.sql

The main package for working with structured data. You’ll use this package for most DataFrame and Dataset operations. Key Classes:
  • SparkSession - Entry point for Spark functionality. Use this to create DataFrames, read data, and configure Spark.
  • Dataset[T] - Strongly-typed distributed collection of data. Provides type-safe operations.
  • DataFrame - Type alias for Dataset[Row]. Use when the schema is only known at runtime rather than at compile time.
  • Column - Represents a column in a DataFrame.
  • Row - Represents a row of data.
  • functions - Built-in functions for DataFrame operations.

org.apache.spark.sql.types

Data types for defining Spark SQL schemas. Key Classes:
  • StructType - Schema of a DataFrame; an ordered collection of StructField entries
  • StructField - A single named, typed field within a StructType
  • DataType - Base class of all Spark SQL data types (StringType, IntegerType, TimestampType, and so on)
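As a sketch, an explicit schema for records like the (name, age, department) tuples used later on this page could be built from these classes:

```scala
import org.apache.spark.sql.types._

// Hypothetical explicit schema for (name, age, department) records;
// such a schema can be passed to spark.read.schema(...) when loading data
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("department", StringType, nullable = true)
))
```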

org.apache.spark.sql.streaming

Structured Streaming API for processing real-time data streams. Key Classes:
  • DataStreamReader - Reads streaming data (returned by spark.readStream)
  • DataStreamWriter - Configures streaming output (returned by dataset.writeStream)
  • StreamingQuery - Handle to a running streaming query
  • Trigger - Controls how often a streaming query processes data
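A minimal sketch tying these classes together, using the built-in "rate" test source (app name chosen for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("Streaming Sketch")
  .master("local[*]")
  .getOrCreate()

// The "rate" source emits (timestamp, value) rows, useful for testing
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// Print each micro-batch to the console, triggering every 5 seconds
val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

// query.awaitTermination()  // block the main thread until the query stops
```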

org.apache.spark.sql.catalog

Manage metadata for databases, tables, functions, and views. Key Classes:
  • Catalog - Interface for catalog operations
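A short sketch of catalog operations through a SparkSession (the view name "nums" is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Catalog Sketch")
  .master("local[*]")
  .getOrCreate()

// Register a temporary view so the catalog has something to list
spark.range(3).createOrReplaceTempView("nums")

// List tables and views in the current database
spark.catalog.listTables().show()

// Check for a specific table or view by name
val exists = spark.catalog.tableExists("nums")
```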

org.apache.spark

Core Spark functionality (Note: SparkContext and RDD are not supported in Spark Connect). Key Classes:
  • SparkContext - Main entry point for Spark Classic (not available in Spark Connect)
  • SparkConf - Configuration for Spark applications
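A minimal sketch of building a SparkConf for Spark Classic (the app name and configuration value are illustrative):

```scala
import org.apache.spark.SparkConf

// Configuration for a Spark Classic application (not available in Spark Connect)
val conf = new SparkConf()
  .setAppName("Conf Sketch")
  .setMaster("local[*]")
  .set("spark.sql.shuffle.partitions", "4")
```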

Quick Start Example

Here’s a simple example to get you started with the Scala API:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create SparkSession
val spark = SparkSession.builder()
  .appName("Scala API Example")
  .master("local[*]")
  .getOrCreate()

// Create a DataFrame
val data = Seq(
  ("Alice", 25, "Engineering"),
  ("Bob", 30, "Sales"),
  ("Charlie", 35, "Engineering")
)

val df = spark.createDataFrame(data)
  .toDF("name", "age", "department")

// Perform transformations
val result = df
  .filter(col("age") > 25)
  .groupBy("department")
  .agg(avg("age").as("avg_age"))
  .orderBy(desc("avg_age"))

// Show results
result.show()

// Stop the session
spark.stop()

Working with Datasets

Datasets provide type-safe operations with compile-time checking:
case class Person(name: String, age: Int, department: String)

// Encoders for case classes come from the session's implicits
import spark.implicits._

// Create a strongly-typed Dataset
val ds = spark.createDataset(Seq(
  Person("Alice", 25, "Engineering"),
  Person("Bob", 30, "Sales")
))

// Type-safe transformations
val adults = ds.filter(_.age >= 18)
val names = ds.map(_.name)
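Typed aggregations are also available; a minimal sketch using groupByKey, assuming the Person dataset above and that import spark.implicits._ is in scope:

```scala
// Count people per department using a typed key (yields Dataset[(String, Long)])
val counts = ds.groupByKey(_.department).count()
```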

User-Defined Functions (UDFs)

Create custom functions for your transformations:
import org.apache.spark.sql.functions.udf

// Define a UDF (note: null inputs reach the function, so guard
// against them if the column is nullable)
val upperCase = udf((s: String) => s.toUpperCase)

// Use the UDF
val result = df.withColumn("name_upper", upperCase(col("name")))
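The same function can also be registered for use from SQL; a sketch, assuming the SparkSession and DataFrame from the Quick Start example (the names "upper_name" and "people" are illustrative):

```scala
// Register the function under a name visible to SQL queries
spark.udf.register("upper_name", (s: String) => s.toUpperCase)

// Expose the DataFrame as a temporary view and call the UDF in SQL
df.createOrReplaceTempView("people")
val sqlResult = spark.sql("SELECT upper_name(name) AS name_upper FROM people")
```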

Spark Connect Support

Since Spark 3.5, most Scala APIs are supported in Spark Connect, including Dataset, functions, Column, Catalog, and streaming APIs. However, SparkContext and RDD are not supported.
When using Spark Connect, create your session with the remote parameter:
val spark = SparkSession.builder()
  .remote("sc://localhost")
  .getOrCreate()

Additional Resources

For the most up-to-date API documentation, always refer to the official Scaladoc linked at the top of this page.
