Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, Spark SQL provides more information about the structure of both the data and the computation being performed, enabling powerful optimizations.

What is Spark SQL?

Spark SQL offers several ways to interact with structured data:
  • SQL queries - Execute SQL queries directly
  • DataFrame API - Work with distributed collections of data organized into named columns
  • Dataset API - Combine the benefits of RDDs with Spark SQL’s optimized execution engine (Scala/Java only)
When computing a result, the same execution engine is used regardless of which API or language you choose. This unification means you can easily switch between different APIs based on which provides the most natural way to express a given transformation.

Key Features

Unified Data Access

Spark SQL provides a common way to access a variety of data sources:
  • Parquet, ORC, JSON, CSV files
  • Hive tables
  • JDBC databases
  • Avro, Protobuf data
  • Custom data sources

Powerful Optimizations

Spark SQL uses extra information about data structure to perform optimizations:
  • Catalyst Optimizer - Automatically optimizes query plans
  • Tungsten Execution - Generates optimized bytecode at runtime
  • Columnar Storage - Efficient in-memory caching
  • Predicate Pushdown - Pushes filters down to data sources

DataFrames and Datasets

A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database.
A Dataset is a distributed collection of data that provides:
  • Strong typing (Scala/Java)
  • The benefits of RDDs (lambda functions)
  • Spark SQL’s optimized execution engine
In Scala and Java, a DataFrame is simply a Dataset[Row]. Because Python and R are dynamically typed, the strongly typed Dataset API is unnecessary there, and both languages expose the DataFrame API instead.

SQL Queries

You can execute SQL queries in multiple ways:

Programmatic SQL

Run SQL queries from your application and get results as DataFrames:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL Example").getOrCreate()

# Example data to query
df = spark.createDataFrame([("Ann", 34), ("Bob", 19)], ["name", "age"])

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run a SQL query; the result is itself a DataFrame
results = spark.sql("SELECT * FROM people WHERE age > 21")
results.show()

Command-Line Interface

Use the Spark SQL CLI to run queries interactively:
./bin/spark-sql

JDBC/ODBC Server

Connect to Spark SQL via JDBC/ODBC for BI tool integration:
./sbin/start-thriftserver.sh

Hive Compatibility

Spark SQL provides built-in support for Hive features:
  • Write queries using HiveQL
  • Access Hive UDFs
  • Read data from Hive tables
You don’t need an existing Hive setup to use these features. Spark will create a local Hive metastore for you.

Functions

Spark SQL supports two types of functions:

Scalar Functions

Functions that return a single value per row:
  • String manipulation (upper, lower, concat)
  • Date arithmetic (date_add, date_sub, datediff)
  • Math operations (abs, round, sqrt)
  • User-defined scalar functions (UDFs)

Aggregate Functions

Functions that return a single value for a group of rows:
  • Common aggregations (count, sum, avg, max, min)
  • Statistical functions (stddev, variance)
  • User-defined aggregate functions (UDAFs)

Next Steps

Getting Started

Create your first Spark SQL application

Data Sources

Learn to work with different data formats

Performance Tuning

Optimize your SQL queries

Distributed SQL Engine

Use Spark as a SQL query engine
