The Java API provides access to Spark’s powerful data processing capabilities through Java-friendly interfaces and method signatures.

API Documentation

The complete Java API reference is available at: Spark Java API (Javadoc)

Core Packages

org.apache.spark.sql

The main package for structured data processing in Java. Key Classes:
  • SparkSession - Entry point for Spark functionality
  • Dataset<Row> - Distributed collection of data (Java uses Dataset<Row> for DataFrames)
  • Row - Represents a row of data
  • Column - Column expression
  • functions - Built-in SQL functions

org.apache.spark.sql.types

Data type definitions for schemas. Key Classes:
  • DataTypes - Factory methods and singletons for the built-in data types
  • StructType - Schema definition for structured data
  • StructField - A single named field within a StructType
  • Metadata - Optional metadata attached to a field

org.apache.spark.sql.streaming

Streaming API for real-time data processing. Key Classes:
  • DataStreamReader - Interface for loading a streaming Dataset
  • DataStreamWriter - Interface for writing out a streaming query
  • StreamingQuery - Handle to an active streaming query
  • Trigger - Policies controlling micro-batch timing
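As a minimal sketch of these classes working together (assuming an active SparkSession named spark, as in the Quick Start below, and using the built-in rate source, which generates test rows at a fixed rate):
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

// Read from the built-in "rate" test source
Dataset<Row> stream = spark.readStream()
    .format("rate")
    .option("rowsPerSecond", "5")
    .load();

// Write each micro-batch to the console every 10 seconds
StreamingQuery query = stream.writeStream()
    .format("console")
    .outputMode("append")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start();

// Blocks until the query stops; start() and awaitTermination() throw checked exceptions
query.awaitTermination();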

org.apache.spark.sql.catalog

Catalog management for metadata operations. Key Classes:
  • Catalog - Metadata operations interface
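For example, the Catalog can be used to inspect the session's metadata (a short sketch, assuming an active SparkSession named spark):
// List databases and tables known to the session's catalog
spark.catalog().listDatabases().show();
spark.catalog().listTables().show();

// Check whether a specific table or view exists
boolean exists = spark.catalog().tableExists("people");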

Quick Start Example

Here’s a basic example using the Java API:
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

public class SparkJavaExample {
    public static void main(String[] args) {
        // Create SparkSession
        SparkSession spark = SparkSession.builder()
            .appName("Java API Example")
            .master("local[*]")
            .getOrCreate();
        
        // Read data
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("data/people.csv");
        
        // Perform transformations
        Dataset<Row> result = df
            .filter(col("age").gt(25))
            .groupBy("department")
            .agg(avg("age").as("avg_age"))
            .orderBy(desc("avg_age"));
        
        // Show results
        result.show();
        
        // Stop the session
        spark.stop();
    }
}

Working with Encoders

For strongly-typed Datasets in Java, use Encoders:
import org.apache.spark.sql.Encoders;
import java.io.Serializable;

public class Person implements Serializable {
    private String name;
    private int age;
    private String department;
    
    // Constructors, getters, and setters
    public Person() {}
    
    public Person(String name, int age, String department) {
        this.name = name;
        this.age = age;
        this.department = department;
    }
    
    // Getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public String getDepartment() { return department; }
    public void setDepartment(String department) { this.department = department; }
}

import java.util.Arrays;

// Create a typed Dataset
Dataset<Person> people = spark.createDataset(
    Arrays.asList(
        new Person("Alice", 25, "Engineering"),
        new Person("Bob", 30, "Sales")
    ),
    Encoders.bean(Person.class)
);

import org.apache.spark.api.java.function.FilterFunction;

// Type-safe operations. The explicit FilterFunction cast is required because
// filter is also overloaded for Scala functions, making a bare lambda ambiguous.
Dataset<Person> adults = people.filter(
    (FilterFunction<Person>) p -> p.getAge() >= 18);
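Typed transformations that produce a new element type also need an explicit Encoder for the result. A minimal sketch, reusing the people Dataset from above:
import org.apache.spark.api.java.function.MapFunction;

// Map each Person to their name; the result encoder is passed explicitly
Dataset<String> names = people.map(
    (MapFunction<Person, String>) Person::getName,
    Encoders.STRING());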

User-Defined Functions (UDFs)

Register and use custom functions:
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.*;

// Register a UDF
spark.udf().register(
    "upperCase",
    (UDF1<String, String>) String::toUpperCase,
    DataTypes.StringType
);

// Use the UDF (call_udf is the current name; callUDF is deprecated)
Dataset<Row> result = df.withColumn(
    "name_upper",
    call_udf("upperCase", col("name"))
);
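A UDF registered this way can also be called by name from SQL:
df.createOrReplaceTempView("people");
Dataset<Row> upper = spark.sql("SELECT upperCase(name) AS name_upper FROM people");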

Creating Schemas

Define schemas explicitly for better control:
import org.apache.spark.sql.types.*;

StructType schema = new StructType(new StructField[] {
    new StructField("name", DataTypes.StringType, false, Metadata.empty()),
    new StructField("age", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("department", DataTypes.StringType, true, Metadata.empty())
});

Dataset<Row> df = spark.read()
    .schema(schema)
    .csv("data/people.csv");
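Equivalently, the same schema can be built incrementally with StructType.add:
StructType schema = new StructType()
    .add("name", DataTypes.StringType, false)
    .add("age", DataTypes.IntegerType, false)
    .add("department", DataTypes.StringType, true);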

SQL Queries

Execute SQL queries directly:
// Register DataFrame as a temporary view
df.createOrReplaceTempView("people");

// Run SQL query
Dataset<Row> sqlResult = spark.sql(
    "SELECT department, AVG(age) as avg_age " +
    "FROM people " +
    "WHERE age > 25 " +
    "GROUP BY department " +
    "ORDER BY avg_age DESC"
);

sqlResult.show();

Maven Dependencies

Add Spark dependencies to your pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.13</artifactId>
        <version>4.0.0</version>
    </dependency>
</dependencies>
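When the application is submitted with spark-submit, the cluster already provides the Spark jars, so the dependency is commonly marked with provided scope instead:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.13</artifactId>
    <version>4.0.0</version>
    <scope>provided</scope>
</dependency>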
For Spark Connect:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect-client-jvm_2.13</artifactId>
    <version>4.0.0</version>
</dependency>

Spark Connect Support

Most Java APIs are supported in Spark Connect. However, SparkContext and RDD-based operations are not available.
Connect to a remote Spark cluster:
SparkSession spark = SparkSession.builder()
    .remote("sc://localhost")
    .getOrCreate();
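Once connected, the same Dataset API shown above runs against the remote cluster, for example:
Dataset<Row> df = spark.sql("SELECT 1 AS id");
df.show();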

Note

Java uses Dataset<Row> to represent DataFrames, whereas Scala defines DataFrame as a type alias for Dataset[Row].
