Spark SQL can act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries without writing any code.

Overview

The distributed SQL engine allows you to:
  • Run SQL queries without writing Spark applications
  • Connect BI tools via JDBC/ODBC
  • Use interactive SQL CLI for data exploration
  • Leverage Spark’s distributed computing for large-scale queries
The Thrift JDBC/ODBC server corresponds to HiveServer2 in Hive, providing compatibility with existing Hive-based tools.

Running the Thrift JDBC/ODBC Server

The Thrift JDBC/ODBC server allows remote clients to execute SQL queries on Spark and retrieve results:

Starting the Server

./sbin/start-thriftserver.sh
By default, the server listens on localhost:10000. You can override this using environment variables:
export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  --conf spark.executor.memory=4g
Or using system properties:
./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=<listening-port> \
  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
  --master <master-uri> \
  --conf spark.executor.memory=4g
The script accepts all bin/spark-submit command line options. Use ./sbin/start-thriftserver.sh --help for a complete list.
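Because the script forwards all spark-submit options, a launch command can be assembled programmatically. A minimal sketch (the helper itself is hypothetical; the option names mirror the examples above):

```python
def thriftserver_command(port, bind_host, master, conf=None):
    """Build an argument list for start-thriftserver.sh.

    Mirrors the --hiveconf/--conf style shown above; purely illustrative.
    """
    cmd = [
        "./sbin/start-thriftserver.sh",
        "--hiveconf", f"hive.server2.thrift.port={port}",
        "--hiveconf", f"hive.server2.thrift.bind.host={bind_host}",
        "--master", master,
    ]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    return cmd

cmd = thriftserver_command(10000, "0.0.0.0", "local[4]",
                           {"spark.executor.memory": "4g"})
print(" ".join(cmd))
```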

Connecting with Beeline

Test the JDBC/ODBC server using beeline:
# Start beeline
./bin/beeline

# Connect to the server
beeline> !connect jdbc:hive2://localhost:10000
Beeline will ask for a username and password:

1. Enter username: in non-secure mode, enter your machine username.
2. Enter password: in non-secure mode, leave the password blank.
3. Run queries: execute SQL queries normally.
-- Create a table
CREATE TABLE employees (id INT, name STRING, salary DOUBLE);

-- Load data
LOAD DATA LOCAL INPATH 'examples/employees.csv' INTO TABLE employees;

-- Query the table
SELECT name, salary FROM employees WHERE salary > 50000;

HTTP Transport Mode

The Thrift JDBC server supports HTTP transport for better firewall compatibility:
1. Configure HTTP mode: set these properties in hive-site.xml in conf/:
<property>
  <name>hive.server2.transport.mode</name>
  <value>http</value>
</property>
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>10001</value>
</property>
<property>
  <name>hive.server2.http.endpoint</name>
  <value>cliservice</value>
</property>
2. Connect via HTTP:
beeline> !connect jdbc:hive2://<host>:<port>/<database>;transportMode=http;httpPath=<http_endpoint>
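The HTTP connection string can be assembled from its parts. A small helper (hypothetical; it simply follows the jdbc:hive2 URL format shown above):

```python
def hive2_http_url(host, port, database="default", http_path="cliservice"):
    """Build a JDBC URL for HTTP transport, per the format shown above."""
    return (f"jdbc:hive2://{host}:{port}/{database}"
            f";transportMode=http;httpPath={http_path}")

print(hive2_http_url("localhost", 10001))
```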

Running the Spark SQL CLI

The Spark SQL CLI provides an interactive command-line interface for executing SQL queries:

Starting the CLI

./bin/spark-sql
The CLI connects to a local Spark instance and provides an interactive SQL prompt.

Basic Usage

-- Show databases
SHOW DATABASES;

-- Use a database
USE default;

-- Show tables
SHOW TABLES;

-- Create table from Parquet file
CREATE TABLE users
USING parquet
OPTIONS (path 'examples/users.parquet');

-- Query the table
SELECT * FROM users WHERE age > 21;

-- Aggregate queries
SELECT department, AVG(salary) as avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
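To make the aggregation semantics concrete, the GROUP BY / AVG / ORDER BY query above corresponds to this plain-Python computation (the sample rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical sample rows: (department, salary)
employees = [
    ("eng", 120000.0), ("eng", 100000.0),
    ("sales", 80000.0), ("sales", 90000.0),
]

# GROUP BY department, AVG(salary)
grouped = defaultdict(list)
for dept, salary in employees:
    grouped[dept].append(salary)
avg_salary = {dept: sum(s) / len(s) for dept, s in grouped.items()}

# ORDER BY avg_salary DESC
result = sorted(avg_salary.items(), key=lambda kv: kv[1], reverse=True)
print(result)  # [('eng', 110000.0), ('sales', 85000.0)]
```

In the distributed engine the same logic runs as a shuffle followed by a per-partition aggregation, but the result is identical.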

Configuration

You can set Spark configurations when starting the CLI:
./bin/spark-sql \
  --master local[4] \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.executor.memory=4g
Place your hive-site.xml, core-site.xml, and hdfs-site.xml files in the conf/ directory to configure Hive integration.

JDBC Connection from Applications

Connect to the Spark SQL Thrift server from your application; this example uses the Python jaydebeapi package:
import jaydebeapi

# Connect to Spark SQL
conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://localhost:10000/default",
    ["username", ""],
    "/path/to/hive-jdbc-standalone.jar"
)

# Execute query
cursor = conn.cursor()
cursor.execute("SELECT * FROM employees WHERE salary > 50000")
results = cursor.fetchall()

for row in results:
    print(row)

cursor.close()
conn.close()
Make sure to include the Hive JDBC driver in your application’s classpath.

Connecting BI Tools

Many BI tools support JDBC connections and can connect to Spark SQL:

Tableau

1. Select connector: choose the “Spark SQL” or “Apache Hive” connector.
2. Enter connection details: Server localhost, Port 10000, Type SparkThriftServer.
3. Authenticate: enter your username (password optional in non-secure mode).
4. Query data: start building visualizations.

Power BI

1. Get data: select “Get Data” > “More” > “Spark”.
2. Configure connection: enter the server URL and protocol (HTTP or Binary).
3. Load tables: select tables and load the data.

Best Practices

For production deployments:

Security:
  • Enable Kerberos authentication
  • Use SSL/TLS for encrypted connections
  • Implement proper access controls
  • Follow the beeline documentation for secure mode setup

Performance:
  • Cache frequently accessed tables
  • Use partitioned tables for large datasets
  • Adjust spark.sql.shuffle.partitions based on workload
  • Monitor queries using the Spark UI

Resource management:
  • Set appropriate memory limits for executors
  • Configure concurrent query limits
  • Use dynamic allocation for varying workloads
  • Monitor resource usage

High availability:
  • Use ZooKeeper for automatic failover
  • Deploy multiple Thrift servers behind a load balancer
  • Configure session persistence

Troubleshooting

If clients cannot connect, check that:
  • Thrift server is running: jps | grep SparkSubmit
  • Correct host and port are specified
  • Firewall allows connections on the port
  • Network connectivity exists between client and server
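The host and port checks above can be scripted. A minimal reachability probe using only the standard library (host and port are whatever your server uses):

```python
import socket

def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("localhost", 10000) for the default Thrift port
```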
If queries fail with out-of-memory errors, increase executor and driver memory:
./sbin/start-thriftserver.sh \
  --conf spark.executor.memory=8g \
  --conf spark.driver.memory=4g
For slow queries:
  • Check the query execution plan with EXPLAIN
  • Collect table statistics with ANALYZE TABLE
  • Enable adaptive query execution (spark.sql.adaptive.enabled)
  • Review partition count and size
If the server hits filesystem cache issues, set fs.%s.impl.disable.cache (where %s is the filesystem scheme) to true in hive-site.xml. See SPARK-21067 for details.
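For example, disabling the Hadoop filesystem cache for the hdfs scheme (fs.hdfs.impl.disable.cache instantiates the fs.%s pattern; adjust the scheme for your filesystem) looks like this in hive-site.xml:

```xml
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```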

Configuration Reference

Key configuration properties for the distributed SQL engine:
  • hive.server2.thrift.port (int, default 10000): port number for the Thrift server to listen on
  • hive.server2.thrift.bind.host (string, default localhost): host address for the Thrift server to bind to
  • hive.server2.transport.mode (string, default binary): transport mode, binary or http
  • hive.server2.thrift.http.port (int, default 10001): port number for HTTP transport mode
  • hive.server2.thrift.min.worker.threads (int, default 5): minimum number of worker threads
  • hive.server2.thrift.max.worker.threads (int, default 500): maximum number of worker threads

Next Steps

Performance Tuning

Optimize your distributed SQL queries

Data Sources

Learn about supported data sources
