Overview
The distributed SQL engine allows you to:
- Run SQL queries without writing Spark applications
- Connect BI tools via JDBC/ODBC
- Use interactive SQL CLI for data exploration
- Leverage Spark’s distributed computing for large-scale queries
The Thrift JDBC/ODBC server corresponds to HiveServer2 in Hive, providing compatibility with existing Hive-based tools.
Running the Thrift JDBC/ODBC Server
The Thrift JDBC/ODBC server allows remote clients to execute SQL queries on Spark and retrieve results.
Starting the Server
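In a standard Spark distribution, the server is started with the start-thriftserver.sh script (a sketch; paths and cluster options depend on your deployment):

```shell
# From the Spark installation directory:
./sbin/start-thriftserver.sh

# The script accepts the same command-line options as spark-submit, e.g.:
./sbin/start-thriftserver.sh --master yarn --executor-memory 4g
```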
By default, the server listens on localhost:10000. You can override this using environment variables:
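For example (HIVE_SERVER2_THRIFT_PORT and HIVE_SERVER2_THRIFT_BIND_HOST are the variables the server reads; the values here are illustrative):

```shell
export HIVE_SERVER2_THRIFT_PORT=10001
export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
./sbin/start-thriftserver.sh --master local[4]
```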
Connecting with Beeline
Test the JDBC/ODBC server using beeline:
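A minimal beeline session, assuming the server is running on the default localhost:10000 (beeline prompts for a username and password):

```shell
./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
```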
HTTP Transport Mode
The Thrift JDBC server supports HTTP transport for better firewall compatibility:
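A sketch of enabling HTTP mode via --hive-conf (the same properties can be set in hive-site.xml; exact property names can vary slightly between Hive versions):

```shell
./sbin/start-thriftserver.sh \
  --hive-conf hive.server2.transport.mode=http \
  --hive-conf hive.server2.thrift.http.port=10001 \
  --hive-conf hive.server2.thrift.http.path=cliservice

# Connect with beeline over HTTP transport:
beeline> !connect jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice
```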
Running the Spark SQL CLI
The Spark SQL CLI provides an interactive command-line interface for executing SQL queries.
Starting the CLI
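From the Spark installation directory:

```shell
./bin/spark-sql
```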
Basic Usage
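For example (the table name is a placeholder; -e runs a single query non-interactively):

```shell
spark-sql> SHOW TABLES;
spark-sql> SELECT count(*) FROM sales;

./bin/spark-sql -e "SELECT 1"
```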
Configuration
You can set Spark configurations when starting the CLI:
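For example (the property value is illustrative):

```shell
./bin/spark-sql --master local[4] --conf spark.sql.shuffle.partitions=64
```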
Place your hive-site.xml, core-site.xml, and hdfs-site.xml files in the conf/ directory to configure Hive integration.
JDBC Connection from Applications
Connect to the Spark SQL Thrift server from your application:
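A minimal sketch using the standard JDBC API with the Hive JDBC driver (org.apache.hive:hive-jdbc must be on the classpath; the host, port, and credentials are assumptions for a default deployment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftClient {
    public static void main(String[] args) throws Exception {
        // Default Thrift server address; adjust for your deployment.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```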
Connecting BI Tools
Many BI tools support JDBC connections and can connect to Spark SQL:
- Tableau
- Power BI
Best Practices
Security
For production deployments:
- Enable Kerberos authentication
- Use SSL/TLS for encrypted connections
- Implement proper access controls
- Follow the beeline documentation for secure mode setup
Performance
- Cache frequently accessed tables
- Use partitioned tables for large datasets
- Adjust spark.sql.shuffle.partitions based on workload
- Monitor queries using the Spark UI
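For example, from the SQL CLI or beeline (the table name is a placeholder):

```shell
spark-sql> CACHE TABLE sales;
spark-sql> SET spark.sql.shuffle.partitions=64;
```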
Resource Management
- Set appropriate memory limits for executors
- Configure concurrent query limits
- Use dynamic allocation for varying workloads
- Monitor resource usage
High Availability
- Use ZooKeeper for automatic failover
- Deploy multiple Thrift servers behind a load balancer
- Configure session persistence
Troubleshooting
Connection Refused
Check that:
- Thrift server is running: jps | grep SparkSubmit
- Correct host and port are specified
- Firewall allows connections on the port
- Network connectivity exists between client and server
Out of Memory Errors
Increase executor memory:
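For example, when starting the Thrift server (the values are illustrative; these are standard spark-submit options):

```shell
./sbin/start-thriftserver.sh \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g
```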
Slow Queries
- Check query execution plan with EXPLAIN
- Collect table statistics with ANALYZE TABLE
- Enable adaptive query execution
- Review partition count and size
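For example (the table name is a placeholder):

```shell
spark-sql> EXPLAIN EXTENDED SELECT * FROM sales WHERE year = 2024;
spark-sql> ANALYZE TABLE sales COMPUTE STATISTICS;
spark-sql> SET spark.sql.adaptive.enabled=true;
```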
CTAS Issues After Session Close
Set fs.%s.impl.disable.cache (where %s is the filesystem scheme, such as hdfs) to true in hive-site.xml. See SPARK-21067 for details.
Configuration Reference
Key configuration properties for the distributed SQL engine (property names follow the standard HiveServer2 conventions used by the Thrift server):
- hive.server2.thrift.port: Port number for the Thrift server to listen on
- hive.server2.thrift.bind.host: Host address for the Thrift server to bind to
- hive.server2.transport.mode: Transport mode (binary or http)
- hive.server2.thrift.http.port: Port number for HTTP transport mode
- hive.server2.thrift.min.worker.threads: Minimum number of worker threads
- hive.server2.thrift.max.worker.threads: Maximum number of worker threads
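A hive-site.xml fragment illustrating two of these properties (the values are examples only):

```xml
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
</configuration>
```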
Next Steps
- Performance Tuning: Optimize your distributed SQL queries
- Data Sources: Learn about supported data sources
