Overview
The distributed SQL engine allows you to:
- Run SQL queries without writing Spark applications
- Connect BI tools via JDBC/ODBC
- Use interactive SQL CLI for data exploration
- Leverage Spark’s distributed computing for large-scale queries
The Thrift JDBC/ODBC server corresponds to HiveServer2 in Hive, providing compatibility with existing Hive-based tools.
Running the Thrift JDBC/ODBC Server
The Thrift JDBC/ODBC server allows remote clients to execute SQL queries on Spark and retrieve results.
Starting the Server
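In a standard Spark distribution, the server is started with the start-thriftserver.sh script (a sketch; paths and cluster options depend on your deployment):

```shell
# From the Spark installation directory:
./sbin/start-thriftserver.sh

# The script accepts the same command-line options as spark-submit, e.g.:
./sbin/start-thriftserver.sh --master yarn --executor-memory 4g
```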
By default, the server listens on localhost:10000. You can override this using environment variables:
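For example (HIVE_SERVER2_THRIFT_PORT and HIVE_SERVER2_THRIFT_BIND_HOST are the variables the server reads; the values here are illustrative):

```shell
export HIVE_SERVER2_THRIFT_PORT=10001
export HIVE_SERVER2_THRIFT_BIND_HOST=0.0.0.0
./sbin/start-thriftserver.sh --master local[4]
```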
Connecting with Beeline
Test the JDBC/ODBC server using beeline:
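A minimal beeline session, assuming the server is running on the default localhost:10000 (beeline prompts for a username and password):

```shell
./bin/beeline
beeline> !connect jdbc:hive2://localhost:10000
```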
HTTP Transport Mode
The Thrift JDBC server supports HTTP transport for better firewall compatibility:
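A sketch of enabling HTTP mode via --hive-conf (the same properties can be set in hive-site.xml; exact property names can vary slightly between Hive versions):

```shell
./sbin/start-thriftserver.sh \
  --hive-conf hive.server2.transport.mode=http \
  --hive-conf hive.server2.thrift.http.port=10001 \
  --hive-conf hive.server2.thrift.http.path=cliservice

# Connect with beeline over HTTP transport:
beeline> !connect jdbc:hive2://localhost:10001/default?hive.server2.transport.mode=http;hive.server2.thrift.http.path=cliservice
```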
Running the Spark SQL CLI
The Spark SQL CLI provides an interactive command-line interface for executing SQL queries.
Starting the CLI
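From the Spark installation directory:

```shell
./bin/spark-sql
```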
Basic Usage
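For example (the table name is a placeholder; -e runs a single query non-interactively):

```shell
spark-sql> SHOW TABLES;
spark-sql> SELECT count(*) FROM sales;

./bin/spark-sql -e "SELECT 1"
```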
Configuration
You can set Spark configurations when starting the CLI:
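For example (the property value is illustrative):

```shell
./bin/spark-sql --master local[4] --conf spark.sql.shuffle.partitions=64
```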
Place your hive-site.xml, core-site.xml, and hdfs-site.xml files in the conf/ directory to configure Hive integration.
JDBC Connection from Applications
Connect to the Spark SQL Thrift server from your application:
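A minimal sketch using the standard JDBC API with the Hive JDBC driver (org.apache.hive:hive-jdbc must be on the classpath; the host, port, and credentials are assumptions for a default deployment):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkThriftClient {
    public static void main(String[] args) throws Exception {
        // Default Thrift server address; adjust for your deployment.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```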
Connecting BI Tools
Many BI tools support JDBC connections and can connect to Spark SQL:
- Tableau
- Power BI
Best Practices
Security
For production deployments:
- Enable Kerberos authentication
- Use SSL/TLS for encrypted connections
- Implement proper access controls
- Follow the beeline documentation for secure mode setup
Performance
- Cache frequently accessed tables
- Use partitioned tables for large datasets
- Adjust spark.sql.shuffle.partitions based on workload
- Monitor queries using the Spark UI
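For example, from the SQL CLI or beeline (the table name is a placeholder):

```shell
spark-sql> CACHE TABLE sales;
spark-sql> SET spark.sql.shuffle.partitions=64;
```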
Resource Management
- Set appropriate memory limits for executors
- Configure concurrent query limits
- Use dynamic allocation for varying workloads
- Monitor resource usage
High Availability
- Use ZooKeeper for automatic failover
- Deploy multiple Thrift servers behind a load balancer
- Configure session persistence
Troubleshooting
Connection Refused
Check that:
- Thrift server is running: jps | grep SparkSubmit
- Correct host and port are specified
- Firewall allows connections on the port
- Network connectivity exists between client and server
Out of Memory Errors
Increase executor memory:
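For example, when starting the Thrift server (the values are illustrative; these are standard spark-submit options):

```shell
./sbin/start-thriftserver.sh \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=1g
```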
Slow Queries
- Check query execution plan with EXPLAIN
- Collect table statistics with ANALYZE TABLE
- Enable adaptive query execution
- Review partition count and size
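For example (the table name is a placeholder):

```shell
spark-sql> EXPLAIN EXTENDED SELECT * FROM sales WHERE year = 2024;
spark-sql> ANALYZE TABLE sales COMPUTE STATISTICS;
spark-sql> SET spark.sql.adaptive.enabled=true;
```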
CTAS Issues After Session Close
Set fs.%s.impl.disable.cache (where %s is the filesystem scheme, such as hdfs) to true in hive-site.xml. See SPARK-21067 for details.
Configuration Reference
Key configuration properties for the distributed SQL engine (property names follow the standard HiveServer2 conventions used by the Thrift server):
- hive.server2.thrift.port: Port number for the Thrift server to listen on
- hive.server2.thrift.bind.host: Host address for the Thrift server to bind to
- hive.server2.transport.mode: Transport mode (binary or http)
- hive.server2.thrift.http.port: Port number for HTTP transport mode
- hive.server2.thrift.min.worker.threads: Minimum number of worker threads
- hive.server2.thrift.max.worker.threads: Maximum number of worker threads
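A hive-site.xml fragment illustrating two of these properties (the values are examples only):

```xml
<configuration>
  <property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
  </property>
  <property>
    <name>hive.server2.transport.mode</name>
    <value>binary</value>
  </property>
</configuration>
```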
Next Steps
- Performance Tuning: Optimize your distributed SQL queries
- Data Sources: Learn about supported data sources
