The Scala API is Spark’s primary API, providing strongly-typed Dataset and DataFrame operations with functional programming capabilities.

API Documentation

Access the complete Scala API documentation (Scaladoc) at: Spark Scala API (Scaladoc)

Core Packages

org.apache.spark.sql

The main package for working with structured data. You’ll use this package for most DataFrame and Dataset operations. Key Classes:
  • SparkSession - Entry point for Spark functionality. Use this to create DataFrames, read data, and configure Spark.
  • Dataset[T] - Strongly-typed distributed collection of data. Provides type-safe operations.
  • DataFrame - Type alias for Dataset[Row]. Use when the schema is only known at runtime rather than at compile time.
  • Column - Represents a column in a DataFrame.
  • Row - Represents a row of data.
  • functions - Built-in functions for DataFrame operations.

org.apache.spark.sql.types

Data types for defining Spark SQL schemas. Key Classes:
  • StructType - Schema of a DataFrame; an ordered collection of StructField entries
  • StructField - A single named, typed field within a StructType
  • DataType - Base class of all Spark SQL data types (StringType, IntegerType, TimestampType, and so on)
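As a sketch, an explicit schema for records like the (name, age, department) tuples used later on this page could be built from these classes:

```scala
import org.apache.spark.sql.types._

// Hypothetical explicit schema for (name, age, department) records;
// such a schema can be passed to spark.read.schema(...) when loading data
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false),
  StructField("department", StringType, nullable = true)
))
```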

org.apache.spark.sql.streaming

Structured Streaming API for processing real-time data streams. Key Classes:
  • DataStreamReader - Reads streaming data (returned by spark.readStream)
  • DataStreamWriter - Configures streaming output (returned by dataset.writeStream)
  • StreamingQuery - Handle to a running streaming query
  • Trigger - Controls how often a streaming query processes data
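A minimal sketch tying these classes together, using the built-in "rate" test source (app name chosen for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("Streaming Sketch")
  .master("local[*]")
  .getOrCreate()

// The "rate" source emits (timestamp, value) rows, useful for testing
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

// Print each micro-batch to the console, triggering every 5 seconds
val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()

// query.awaitTermination()  // block the main thread until the query stops
```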

org.apache.spark.sql.catalog

Manage metadata for databases, tables, functions, and views. Key Classes:
  • Catalog - Interface for catalog operations
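A short sketch of catalog operations through a SparkSession (the view name "nums" is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Catalog Sketch")
  .master("local[*]")
  .getOrCreate()

// Register a temporary view so the catalog has something to list
spark.range(3).createOrReplaceTempView("nums")

// List tables and views in the current database
spark.catalog.listTables().show()

// Check for a specific table or view by name
val exists = spark.catalog.tableExists("nums")
```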

org.apache.spark

Core Spark functionality (Note: SparkContext and RDD are not supported in Spark Connect). Key Classes:
  • SparkContext - Main entry point for Spark Classic (not available in Spark Connect)
  • SparkConf - Configuration for Spark applications
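A minimal sketch of building a SparkConf for Spark Classic (the app name and configuration value are illustrative):

```scala
import org.apache.spark.SparkConf

// Configuration for a Spark Classic application (not available in Spark Connect)
val conf = new SparkConf()
  .setAppName("Conf Sketch")
  .setMaster("local[*]")
  .set("spark.sql.shuffle.partitions", "4")
```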

Quick Start Example

Here’s a simple example to get you started with the Scala API:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Create SparkSession
val spark = SparkSession.builder()
  .appName("Scala API Example")
  .master("local[*]")
  .getOrCreate()

// Create a DataFrame
val data = Seq(
  ("Alice", 25, "Engineering"),
  ("Bob", 30, "Sales"),
  ("Charlie", 35, "Engineering")
)

val df = spark.createDataFrame(data)
  .toDF("name", "age", "department")

// Perform transformations
val result = df
  .filter(col("age") > 25)
  .groupBy("department")
  .agg(avg("age").as("avg_age"))
  .orderBy(desc("avg_age"))

// Show results
result.show()

// Stop the session
spark.stop()

Working with Datasets

Datasets provide type-safe operations with compile-time checking:
case class Person(name: String, age: Int, department: String)

// Encoders for case classes come from the session's implicits
import spark.implicits._

// Create a strongly-typed Dataset
val ds = spark.createDataset(Seq(
  Person("Alice", 25, "Engineering"),
  Person("Bob", 30, "Sales")
))

// Type-safe transformations
val adults = ds.filter(_.age >= 18)
val names = ds.map(_.name)
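Typed aggregations are also available; a minimal sketch using groupByKey, assuming the Person dataset above and that import spark.implicits._ is in scope:

```scala
// Count people per department using a typed key (yields Dataset[(String, Long)])
val counts = ds.groupByKey(_.department).count()
```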

User-Defined Functions (UDFs)

Create custom functions for your transformations:
import org.apache.spark.sql.functions.udf

// Define a UDF (note: null inputs reach the function, so guard
// against them if the column is nullable)
val upperCase = udf((s: String) => s.toUpperCase)

// Use the UDF
val result = df.withColumn("name_upper", upperCase(col("name")))
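The same function can also be registered for use from SQL; a sketch, assuming the SparkSession and DataFrame from the Quick Start example (the names "upper_name" and "people" are illustrative):

```scala
// Register the function under a name visible to SQL queries
spark.udf.register("upper_name", (s: String) => s.toUpperCase)

// Expose the DataFrame as a temporary view and call the UDF in SQL
df.createOrReplaceTempView("people")
val sqlResult = spark.sql("SELECT upper_name(name) AS name_upper FROM people")
```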

Spark Connect Support

Since Spark 3.5, most Scala APIs are supported in Spark Connect, including Dataset, functions, Column, Catalog, and streaming APIs. However, SparkContext and RDD are not supported.
When using Spark Connect, create your session with the remote parameter:
val spark = SparkSession.builder()
  .remote("sc://localhost")
  .getOrCreate()

Additional Resources

For the most up-to-date API documentation, always refer to the official Scaladoc linked at the top of this page.
