The Java API provides access to Spark’s powerful data processing capabilities through Java-friendly interfaces and method signatures.
API Documentation
Access the complete Java API documentation (Javadoc) at:
Spark Java API (Javadoc)
Core Packages
org.apache.spark.sql
The main package for structured data processing in Java.
Key Classes:
- SparkSession - Entry point for Spark functionality
- Dataset<Row> - Distributed collection of data (Java uses
Dataset<Row> for DataFrames)
- Row - Represents a row of data
- Column - Column expression
- functions - Built-in SQL functions
org.apache.spark.sql.types
Data type definitions for schemas.
Key Classes:
- DataTypes - Factory methods for standard data types
- StructType - Schema definition for a row
- StructField - A single named field within a StructType
org.apache.spark.sql.streaming
Streaming API for real-time data processing.
Key Classes:
- DataStreamReader - Reads streaming data into a Dataset
- DataStreamWriter - Writes a streaming Dataset to an external sink
- StreamingQuery - Handle to an active streaming query
org.apache.spark.sql.catalog
Catalog management for metadata operations.
Key Classes:
- Catalog - Metadata operations interface
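For example, the catalog can be used to inspect session metadata (a minimal sketch; it assumes an existing SparkSession `spark`, and the table name is illustrative):

```java
// List databases and tables registered in the session's catalog
spark.catalog().listDatabases().show();
spark.catalog().listTables().show();

// Check whether a specific table or view exists
boolean exists = spark.catalog().tableExists("people");
```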
Quick Start Example
Here’s a basic example using the Java API:
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;
public class SparkJavaExample {
    public static void main(String[] args) {
        // Create SparkSession
        SparkSession spark = SparkSession.builder()
            .appName("Java API Example")
            .master("local[*]")
            .getOrCreate();

        // Read data
        Dataset<Row> df = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("data/people.csv");

        // Perform transformations
        Dataset<Row> result = df
            .filter(col("age").gt(25))
            .groupBy("department")
            .agg(avg("age").as("avg_age"))
            .orderBy(desc("avg_age"));

        // Show results
        result.show();

        // Stop the session
        spark.stop();
    }
}
Working with Encoders
For strongly-typed Datasets in Java, use Encoders:
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Encoders;
import java.io.Serializable;
import java.util.Arrays;

public class Person implements Serializable {
    private String name;
    private int age;
    private String department;

    // Constructors, getters, and setters
    // (a public no-argument constructor is required by Encoders.bean)
    public Person() {}

    public Person(String name, int age, String department) {
        this.name = name;
        this.age = age;
        this.department = department;
    }

    // Getters and setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
    public String getDepartment() { return department; }
    public void setDepartment(String department) { this.department = department; }
}

// Create a typed Dataset
Dataset<Person> people = spark.createDataset(
    Arrays.asList(
        new Person("Alice", 25, "Engineering"),
        new Person("Bob", 30, "Sales")
    ),
    Encoders.bean(Person.class)
);

// Type-safe operations; the cast selects the lambda overload of filter,
// which is otherwise ambiguous in Java
Dataset<Person> adults = people.filter(
    (FilterFunction<Person>) p -> p.getAge() >= 18);
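Typed Datasets can also be transformed with map, supplying an explicit encoder for the result type. A short sketch continuing the Person example above:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;

// Project each Person to a String; the encoder describes the result type
Dataset<String> names = people.map(
    (MapFunction<Person, String>) Person::getName,
    Encoders.STRING());
```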
User-Defined Functions (UDFs)
Register and use custom functions:
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// Register a UDF
spark.udf().register(
    "upperCase",
    (UDF1<String, String>) String::toUpperCase,
    DataTypes.StringType
);

// Use the UDF (call_udf supersedes the deprecated callUDF)
Dataset<Row> result = df.withColumn(
    "name_upper",
    call_udf("upperCase", col("name"))
);
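UDFs with more arguments follow the same pattern using UDF2, UDF3, and so on. A sketch of a two-argument UDF (the name and logic are illustrative):

```java
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// A two-argument UDF: format a name with its department
spark.udf().register(
    "nameDept",
    (UDF2<String, String, String>) (name, dept) -> name + " (" + dept + ")",
    DataTypes.StringType
);
```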
Creating Schemas
Define schemas explicitly for better control:
import org.apache.spark.sql.types.*;
StructType schema = new StructType(new StructField[] {
    new StructField("name", DataTypes.StringType, false, Metadata.empty()),
    new StructField("age", DataTypes.IntegerType, false, Metadata.empty()),
    new StructField("department", DataTypes.StringType, true, Metadata.empty())
});

Dataset<Row> df = spark.read()
    .schema(schema)
    .csv("data/people.csv");
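The same schema can also be built fluently with StructType.add, which is equivalent to the array form above and avoids the Metadata boilerplate:

```java
import org.apache.spark.sql.types.*;

// Each add() appends a field: name, type, nullable
StructType schema = new StructType()
    .add("name", DataTypes.StringType, false)
    .add("age", DataTypes.IntegerType, false)
    .add("department", DataTypes.StringType, true);
```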
SQL Queries
Execute SQL queries directly:
// Register DataFrame as a temporary view
df.createOrReplaceTempView("people");
// Run SQL query
Dataset<Row> sqlResult = spark.sql(
    "SELECT department, AVG(age) AS avg_age " +
    "FROM people " +
    "WHERE age > 25 " +
    "GROUP BY department " +
    "ORDER BY avg_age DESC"
);
sqlResult.show();
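Recent Spark versions (3.4 and later) also accept named parameters in spark.sql, which avoids string concatenation of values. A sketch under that assumption:

```java
import java.util.Map;

// :minAge is bound from the args map rather than spliced into the SQL text
Dataset<Row> adults = spark.sql(
    "SELECT name, age FROM people WHERE age > :minAge",
    Map.of("minAge", 25));
```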
Maven Dependencies
Add Spark dependencies to your pom.xml:
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.13</artifactId>
        <version>4.0.0</version>
    </dependency>
</dependencies>
For Spark Connect:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect-client-jvm_2.13</artifactId>
    <version>4.0.0</version>
</dependency>
Spark Connect Support
Most Java APIs are supported in Spark Connect. However, SparkContext and RDD-based operations are not available.
Connect to a remote Spark cluster:
SparkSession spark = SparkSession.builder()
    .remote("sc://localhost")
    .getOrCreate();
Additional Resources
Java uses Dataset<Row> to represent DataFrames, while Scala uses DataFrame as a type alias for Dataset[Row].