SparkSession provides built-in support for Hive features including HiveQL, Hive UDFs, and reading from Hive tables. You don’t need an existing Hive setup to use these features.
In Python, access DataFrame columns by attribute (df.age) or by indexing (df['age']). The attribute form is convenient for interactive exploration, but the indexing form is future-proof: it works even for column names that collide with attributes already defined on the DataFrame class.
You can run SQL queries programmatically by registering DataFrames as temporary views:
```python
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run a SQL query
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
# +----+-------+
# | age|   name|
# +----+-------+
# |null|Michael|
# |  30|   Andy|
# |  19| Justin|
# +----+-------+

# A more complex SQL query
results = spark.sql("""
    SELECT name, age FROM people
    WHERE age > 21
    ORDER BY age DESC
""")
results.show()
```
Temporary views are session-scoped and disappear when the session terminates. For views shared across all sessions, create a global temporary view:
```python
# Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

# Global temporary views live in the global_temp database
spark.sql("SELECT * FROM global_temp.people").show()

# Global temporary views are cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
```
Global temporary views are tied to the system-preserved database global_temp and kept alive until the Spark application terminates.
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the schema from the Row fields:
```python
from pyspark.sql import Row

sc = spark.sparkContext

# Load a text file and convert each line to a Row object
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema and register the DataFrame as a temporary view
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# Run SQL over the registered view
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
```