How to Use union in PySpark - Combining DataFrames
Combining DataFrames is a common operation in data processing. In PySpark, the Python API for Apache Spark, you can use the `union` function to append the rows of one DataFrame to another DataFrame with the same schema. This tutorial walks through the function with a practical example and explanations, using the two sample DataFrames shown here.
First DataFrame (df1):

Category | Item | Quantity | Price |
---|---|---|---|
Fruit | Apple | 10 | 1.5 |
Fruit | Banana | 20 | 0.5 |
Vegetable | Carrot | 15 | 0.7 |

Second DataFrame (df2):

Category | Item | Quantity | Price |
---|---|---|---|
Fruit | Orange | 30 | 0.8 |
Fruit | Pear | 10 | 1.0 |
Vegetable | Potato | 25 | 0.3 |
Before we can use `union`, we need to import the necessary libraries:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType
Once the two DataFrames exist (they are created as df1 and df2 in the complete example further down), we can combine them using the `union` function:

    combined_df = df1.union(df2)
The complete example, from creating the SparkSession to displaying the result:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, StructType

    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("Use union in PySpark") \
        .master("local") \
        .getOrCreate()

    # Define the schema
    schema = StructType([
        StructField("category", StringType(), True),
        StructField("item", StringType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("price", DoubleType(), True)
    ])

    # Create the data for the first DataFrame
    data1 = [
        Row("Fruit", "Apple", 10, 1.5),
        Row("Fruit", "Banana", 20, 0.5),
        Row("Vegetable", "Carrot", 15, 0.7)
    ]

    # Create the data for the second DataFrame
    data2 = [
        Row("Fruit", "Orange", 30, 0.8),
        Row("Fruit", "Pear", 10, 1.0),
        Row("Vegetable", "Potato", 25, 0.3)
    ]

    # Create the DataFrames
    rdd1 = spark.sparkContext.parallelize(data1)
    df1 = spark.createDataFrame(rdd1, schema)
    rdd2 = spark.sparkContext.parallelize(data2)
    df2 = spark.createDataFrame(rdd2, schema)

    # Perform the union operation
    combined_df = df1.union(df2)

    # Show the result
    combined_df.show()
In this tutorial, we demonstrated how to use the `union` function in PySpark to combine two DataFrames with the same schema. It is a useful tool for data integration and processing tasks.