Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, Spark SQL provides more information about the structure of both the data and the computation being performed, enabling powerful optimizations.

What is Spark SQL?

Spark SQL offers several ways to interact with structured data:
  • SQL queries - Execute SQL queries directly
  • DataFrame API - Work with distributed collections of data organized into named columns
  • Dataset API - Combine the benefits of RDDs with Spark SQL’s optimized execution engine (Scala/Java only)
When computing a result, the same execution engine is used regardless of which API or language you choose. This unification means you can easily switch between different APIs based on which provides the most natural way to express a given transformation.

Key Features

Unified Data Access

Spark SQL provides a common way to access a variety of data sources:
  • Parquet, ORC, JSON, CSV files
  • Hive tables
  • JDBC databases
  • Avro, Protobuf data
  • Custom data sources

Powerful Optimizations

Spark SQL uses extra information about data structure to perform optimizations:
  • Catalyst Optimizer - Automatically optimizes query plans
  • Tungsten Execution - Generates optimized bytecode at runtime
  • Columnar Storage - Efficient in-memory caching
  • Predicate Pushdown - Pushes filters down to data sources

DataFrames and Datasets

A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database.
A Dataset is a distributed collection of data that provides:
  • Strong typing (Scala/Java)
  • The benefits of RDDs (lambda functions)
  • Spark SQL’s optimized execution engine
In Scala and Java, a DataFrame is simply a Dataset[Row]. Because Python and R are dynamically typed, the strongly typed Dataset API is unnecessary there, and both languages expose the DataFrame API instead.

SQL Queries

You can execute SQL queries in multiple ways:

Programmatic SQL

Run SQL queries from your application and get results as DataFrames:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SQL Example").getOrCreate()

# Example data to query
df = spark.createDataFrame([("Ann", 34), ("Bob", 19)], ["name", "age"])

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run a SQL query; the result is itself a DataFrame
results = spark.sql("SELECT * FROM people WHERE age > 21")
results.show()

Command-Line Interface

Use the Spark SQL CLI to run queries interactively:
./bin/spark-sql

JDBC/ODBC Server

Connect to Spark SQL via JDBC/ODBC for BI tool integration:
./sbin/start-thriftserver.sh

Hive Compatibility

Spark SQL provides built-in support for Hive features:
  • Write queries using HiveQL
  • Access Hive UDFs
  • Read data from Hive tables
You don’t need an existing Hive setup to use these features. Spark will create a local Hive metastore for you.

Functions

Spark SQL supports two types of functions:

Scalar Functions

Functions that return a single value per row:
  • String manipulation (upper, lower, concat)
  • Date arithmetic (date_add, date_sub, datediff)
  • Math operations (abs, round, sqrt)
  • User-defined scalar functions (UDFs)

Aggregate Functions

Functions that return a single value for a group of rows:
  • Common aggregations (count, sum, avg, max, min)
  • Statistical functions (stddev, variance)
  • User-defined aggregate functions (UDAFs)

Next Steps

Getting Started

Create your first Spark SQL application

Data Sources

Learn to work with different data formats

Performance Tuning

Optimize your SQL queries

Distributed SQL Engine

Use Spark as a SQL query engine
