# How to Use union in Spark Scala - Combining DataFrames
Combining DataFrames is a common operation in data processing. In Apache Spark, you can use the `union` function to merge two DataFrames with the same schema using Scala. This tutorial walks through the process of using this function with practical examples and explanations.
Suppose we want to combine the following two DataFrames. The first:

| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 |
| Fruit | Banana | 20 | 0.5 |
| Vegetable | Carrot | 15 | 0.7 |
And the second:

| Category | Item | Quantity | Price |
|---|---|---|---|
| Fruit | Orange | 30 | 0.8 |
| Fruit | Pear | 10 | 1.0 |
| Vegetable | Potato | 25 | 0.3 |
Before we can use `union`, we need to import the necessary classes:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
```

Next, we define the shared schema and create the two DataFrames:

```scala
val schema = StructType(Array(
  StructField("category", StringType, true),
  StructField("item", StringType, true),
  StructField("quantity", IntegerType, true),
  StructField("price", DoubleType, true)
))

// Create the data for the first DataFrame
val data1 = Seq(
  Row("Fruit", "Apple", 10, 1.5),
  Row("Fruit", "Banana", 20, 0.5),
  Row("Vegetable", "Carrot", 15, 0.7)
)

// Create the data for the second DataFrame
val data2 = Seq(
  Row("Fruit", "Orange", 30, 0.8),
  Row("Fruit", "Pear", 10, 1.0),
  Row("Vegetable", "Potato", 25, 0.3)
)

// Create the DataFrames
val rdd1 = sparkSession.sparkContext.parallelize(data1)
val df1 = sparkSession.createDataFrame(rdd1, schema)
val rdd2 = sparkSession.sparkContext.parallelize(data2)
val df2 = sparkSession.createDataFrame(rdd2, schema)
```
Now that we have our DataFrames, we can combine them using the `union` function:
```scala
val combinedDF = df1.union(df2)
combinedDF.show()
```
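One caveat worth knowing: `union` matches columns by position, not by name. If the two DataFrames list the same columns in a different order, `unionByName` (available since Spark 2.3) is the safer choice. Below is a minimal sketch, assuming a local SparkSession; the object name `UnionByNameExample` and the helper `combineByName` are illustrative, not part of the tutorial's code:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionByNameExample {
  // Combine two DataFrames matching columns by name rather than by position
  def combineByName(a: DataFrame, b: DataFrame): DataFrame = a.unionByName(b)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unionByName example").master("local").getOrCreate()
    import spark.implicits._

    // Same columns, but declared in a different order
    val df1 = Seq(("Apple", 10)).toDF("item", "quantity")
    val df2 = Seq((20, "Banana")).toDF("quantity", "item")

    // A plain union would mix up the columns here; unionByName aligns them correctly
    combineByName(df1, df2).show()

    spark.stop()
  }
}
```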
For reference, here is the complete program:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object UnionInSpark {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("Use union in scala spark")
      .master("local")
      .getOrCreate()

    val schema = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("quantity", IntegerType, true),
      StructField("price", DoubleType, true)
    ))

    // Create the data for the first DataFrame
    val data1 = Seq(
      Row("Fruit", "Apple", 10, 1.5),
      Row("Fruit", "Banana", 20, 0.5),
      Row("Vegetable", "Carrot", 15, 0.7)
    )

    // Create the data for the second DataFrame
    val data2 = Seq(
      Row("Fruit", "Orange", 30, 0.8),
      Row("Fruit", "Pear", 10, 1.0),
      Row("Vegetable", "Potato", 25, 0.3)
    )

    // Create the DataFrames
    val rdd1 = sparkSession.sparkContext.parallelize(data1)
    val df1 = sparkSession.createDataFrame(rdd1, schema)
    val rdd2 = sparkSession.sparkContext.parallelize(data2)
    val df2 = sparkSession.createDataFrame(rdd2, schema)

    // Combine the DataFrames and show the result
    val combinedDF = df1.union(df2)
    combinedDF.show()
  }
}
```
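Also note that `union` behaves like SQL's UNION ALL: duplicate rows are kept. If you need set-style semantics, chain `distinct()` after the union (or `dropDuplicates()` to deduplicate on specific columns). A minimal sketch under the same local-session assumption; the object name `UnionDistinctExample` and the helper `unionDistinct` are illustrative:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionDistinctExample {
  // union keeps duplicates (like SQL UNION ALL); distinct() removes them
  def unionDistinct(a: DataFrame, b: DataFrame): DataFrame = a.union(b).distinct()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("union distinct example").master("local").getOrCreate()
    import spark.implicits._

    val df1 = Seq(("Fruit", "Apple")).toDF("category", "item")
    val df2 = Seq(("Fruit", "Apple"), ("Fruit", "Pear")).toDF("category", "item")

    // The duplicate ("Fruit", "Apple") row appears only once after distinct()
    unionDistinct(df1, df2).show()

    spark.stop()
  }
}
```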
In this tutorial, we have demonstrated how to use the `union` function in Spark with Scala to combine two DataFrames with the same schema. This is a powerful tool for data integration and processing tasks.