How to Use union in Spark Scala - Combining DataFrames

Combining DataFrames is a common operation in data processing. In Apache Spark, the union function merges two DataFrames that have the same schema. Note that union matches columns by position, not by name, and does not remove duplicate rows. This tutorial walks through using the function in Scala with practical examples and explanations.

1. Creating Sample DataFrames

Let's start by creating two sample DataFrames. We'll use the following data:
First DataFrame:

  Category   Item    Quantity  Price
  Fruit      Apple   10        1.5
  Fruit      Banana  20        0.5
  Vegetable  Carrot  15        0.7

Second DataFrame:

  Category   Item    Quantity  Price
  Fruit      Orange  30        0.8
  Fruit      Pear    10        1.0
  Vegetable  Potato  25        0.3
First, define the schema that both DataFrames will share:

  // Define the schema shared by both DataFrames
  val schema = StructType(Array(
    StructField("category", StringType, true),
    StructField("item", StringType, true),
    StructField("quantity", IntegerType, true),
    StructField("price", DoubleType, true)
  ))

  // Create the data for first DataFrame
  val data1 = Seq(
    Row("Fruit", "Apple", 10, 1.5),
    Row("Fruit", "Banana", 20, 0.5),
    Row("Vegetable", "Carrot", 15, 0.7)
  )

  // Create the data for second DataFrame
  val data2 = Seq(
    Row("Fruit", "Orange", 30, 0.8),
    Row("Fruit", "Pear", 10, 1.0),
    Row("Vegetable", "Potato", 25, 0.3)
  )

  // Create the DataFrames
  val rdd1 = sparkSession.sparkContext.parallelize(data1)
  val df1 = sparkSession.createDataFrame(rdd1, schema)

  val rdd2 = sparkSession.sparkContext.parallelize(data2)
  val df2 = sparkSession.createDataFrame(rdd2, schema)

2. Importing Necessary Libraries

Before we can build the DataFrames and call union, we need the following imports for Row and the schema types:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

3. Performing the union Operation

Now that we have our DataFrames, we can combine them using the union function:

val combinedDF = df1.union(df2)
combinedDF.show()
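One detail worth knowing: unlike SQL's UNION, Spark's union does not deduplicate rows; it behaves like UNION ALL. If you want set semantics, chain .distinct() after the union. A minimal, self-contained sketch (the tiny DataFrames here are illustrative, not the ones from the tutorial):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("union keeps duplicates")
  .master("local")
  .getOrCreate()
import spark.implicits._

val a = Seq(("Fruit", "Apple")).toDF("category", "item")
val b = Seq(("Fruit", "Apple"), ("Fruit", "Pear")).toDF("category", "item")

// union behaves like SQL's UNION ALL: the duplicate Apple row is kept
val all = a.union(b)
all.show()

// distinct() after union gives SQL's UNION semantics (duplicates removed)
val dedup = all.distinct()
dedup.show()

val allCount = all.count()       // 3 rows, including the duplicate
val dedupCount = dedup.count()   // 2 rows after deduplication
spark.stop()
```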

Complete Code

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
  
  object UnionInSpark {
    def main(args: Array[String]): Unit = {
      val sparkSession = SparkSession
        .builder()
        .appName("Use union in scala spark")
        .master("local")
        .getOrCreate()
  
      val schema = StructType(Array(
        StructField("category", StringType, true),
        StructField("item", StringType, true),
        StructField("quantity", IntegerType, true),
        StructField("price", DoubleType, true)
      ))
  
      // Create the data for first DataFrame
      val data1 = Seq(
        Row("Fruit", "Apple", 10, 1.5),
        Row("Fruit", "Banana", 20, 0.5),
        Row("Vegetable", "Carrot", 15, 0.7)
      )
  
      // Create the data for second DataFrame
      val data2 = Seq(
        Row("Fruit", "Orange", 30, 0.8),
        Row("Fruit", "Pear", 10, 1.0),
        Row("Vegetable", "Potato", 25, 0.3)
      )
  
      // Create the DataFrames
      val rdd1 = sparkSession.sparkContext.parallelize(data1)
      val df1 = sparkSession.createDataFrame(rdd1, schema)
  
      val rdd2 = sparkSession.sparkContext.parallelize(data2)
      val df2 = sparkSession.createDataFrame(rdd2, schema)
  
      val combinedDF = df1.union(df2)
      combinedDF.show()

      sparkSession.stop()
    }
  }
  

4. Output

Running the program prints the combined DataFrame, with the rows of df1 followed by the rows of df2:

  +---------+------+--------+-----+
  | category|  item|quantity|price|
  +---------+------+--------+-----+
  |    Fruit| Apple|      10|  1.5|
  |    Fruit|Banana|      20|  0.5|
  |Vegetable|Carrot|      15|  0.7|
  |    Fruit|Orange|      30|  0.8|
  |    Fruit|  Pear|      10|  1.0|
  |Vegetable|Potato|      25|  0.3|
  +---------+------+--------+-----+

Conclusion

In this tutorial, we demonstrated how to use the union function in Spark with Scala to combine two DataFrames with the same schema. This is a powerful tool for data integration and processing tasks.

Note: Because union matches columns by position, make sure the two DataFrames have the same schema (same column order and types) before calling it. If the column counts differ, Spark raises an AnalysisException; if only the names differ, the rows are silently combined by position.
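When two DataFrames have the same columns but in a different order, union will silently pair columns by position and scramble the data; unionByName (available since Spark 2.3) matches columns by name instead. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unionByName example")
  .master("local")
  .getOrCreate()
import spark.implicits._

// Same columns, but in a different order
val a = Seq(("Fruit", "Apple")).toDF("category", "item")
val b = Seq(("Pear", "Fruit")).toDF("item", "category")

// a.union(b) would put "Pear" into the category column;
// unionByName lines the columns up by name instead
val combined = a.unionByName(b)
combined.show()

val categories = combined.select("category").collect().map(_.getString(0)).toSeq
spark.stop()
```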