Generic Load/Save Functions
The default data source is Parquet unless configured otherwise via spark.sql.sources.default.
Manually Specifying Options
You can specify the data source format and pass additional, format-specific options.
Built-in data sources have short names: json, parquet, jdbc, orc, csv, text. You can also use fully qualified names such as org.apache.spark.sql.parquet.
Run SQL on Files Directly
You can query files directly with SQL without loading them into a DataFrame.
Save Modes
Save operations can specify how to handle existing data:
errorifexists (default)
When saving a DataFrame, if data already exists, an exception is thrown.
append
When saving a DataFrame, if data already exists, contents are appended to the existing data.
overwrite
When saving a DataFrame, if data already exists, the existing data is overwritten.
ignore
When saving a DataFrame, if data already exists, the save operation does not save the contents and does not change the existing data.
Parquet Files
Parquet is a columnar format supported by many data processing systems. Spark SQL provides support for reading and writing Parquet files that automatically preserves the schema.
Loading Parquet Files
Partition Discovery
Table partitioning is a common optimization. Spark SQL automatically discovers and infers partitioning information from the directory layout: data is stored in separate directories, with partition column values encoded in each directory's path. Given a table stored at path/to/table with directories such as path/to/table/gender=male/country=US, Spark SQL automatically extracts gender and country as partitioning columns.
Partition column data types are automatically inferred. You can disable automatic inference by setting spark.sql.sources.partitionColumnTypeInference.enabled to false.
Schema Merging
Parquet supports schema evolution. Spark SQL can automatically detect and merge schemas from multiple Parquet files.
JDBC Databases
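A JDBC read is configured through url, dbtable, user, and password options. The sketch below uses placeholder connection details and requires the matching JDBC driver jar on Spark's classpath, so it only runs against a reachable database:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All connection details below are placeholders, not working credentials.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://dbserver:5432/mydb")
           .option("dbtable", "schema.tablename")
           .option("user", "username")
           .option("password", "password")
           .load())
```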
Spark SQL can load data from JDBC databases.
Bucketing, Sorting and Partitioning
For file-based data sources, you can optimize output with bucketing, sorting, and partitioning.
Partitioning
Bucketing
Bucketing is applicable only to persistent tables.
Combined Partitioning and Bucketing
Partitioning creates a directory structure and is limited to low-cardinality columns. Bucketing distributes data across a fixed number of buckets and works well with high-cardinality columns.
Supported Data Formats
Spark SQL supports these built-in data sources:
Parquet
Columnar format with automatic schema preservation
ORC
Optimized Row Columnar format
JSON
Line-delimited JSON files
CSV
Comma-separated values with header support
Text
Plain text files
Avro
Binary format with schema evolution
Protobuf
Protocol buffer format
JDBC
Relational databases via JDBC
Next Steps
Performance Tuning
Learn to optimize your data source operations
Distributed SQL Engine
Use Spark as a distributed query engine
