How to Use union in Spark Scala - Combining DataFrames

Combining DataFrames is a common operation in data processing. In Apache Spark, the union function merges two DataFrames that have the same schema. Note that union matches columns by position, not by name, and does not remove duplicate rows. This tutorial walks through using the function in Scala with practical examples and explanations.

1. Creating Sample DataFrames

Let's start by creating two sample DataFrames. We'll use the following data:
First DataFrame:

  Category   Item    Quantity  Price
  Fruit      Apple   10        1.5
  Fruit      Banana  20        0.5
  Vegetable  Carrot  15        0.7

Second DataFrame:

  Category   Item    Quantity  Price
  Fruit      Orange  30        0.8
  Fruit      Pear    10        1.0
  Vegetable  Potato  25        0.3
First, define the schema that both DataFrames will share:

  // Define the schema shared by both DataFrames
  val schema = StructType(Array(
    StructField("category", StringType, true),
    StructField("item", StringType, true),
    StructField("quantity", IntegerType, true),
    StructField("price", DoubleType, true)
  ))

  // Create the data for first DataFrame
  val data1 = Seq(
    Row("Fruit", "Apple", 10, 1.5),
    Row("Fruit", "Banana", 20, 0.5),
    Row("Vegetable", "Carrot", 15, 0.7)
  )

  // Create the data for second DataFrame
  val data2 = Seq(
    Row("Fruit", "Orange", 30, 0.8),
    Row("Fruit", "Pear", 10, 1.0),
    Row("Vegetable", "Potato", 25, 0.3)
  )

  // Create the DataFrames
  val rdd1 = sparkSession.sparkContext.parallelize(data1)
  val df1 = sparkSession.createDataFrame(rdd1, schema)

  val rdd2 = sparkSession.sparkContext.parallelize(data2)
  val df2 = sparkSession.createDataFrame(rdd2, schema)

2. Importing Necessary Libraries

Before we can build the DataFrames and call union, we need the following imports for Row and the schema types:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

3. Performing the union Operation

Now that we have our DataFrames, we can combine them using the union function:

val combinedDF = df1.union(df2)
combinedDF.show()
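One detail worth knowing: unlike SQL's UNION, Spark's union does not deduplicate rows; it behaves like UNION ALL. If you want set semantics, chain .distinct() after the union. A minimal, self-contained sketch (the tiny DataFrames here are illustrative, not the ones from the tutorial):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("union keeps duplicates")
  .master("local")
  .getOrCreate()
import spark.implicits._

val a = Seq(("Fruit", "Apple")).toDF("category", "item")
val b = Seq(("Fruit", "Apple"), ("Fruit", "Pear")).toDF("category", "item")

// union behaves like SQL's UNION ALL: the duplicate Apple row is kept
val all = a.union(b)
all.show()

// distinct() after union gives SQL's UNION semantics (duplicates removed)
val dedup = all.distinct()
dedup.show()

val allCount = all.count()       // 3 rows, including the duplicate
val dedupCount = dedup.count()   // 2 rows after deduplication
spark.stop()
```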

Complete Code

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
  
  object UnionInSpark {
    def main(args: Array[String]): Unit = {
      val sparkSession = SparkSession
        .builder()
        .appName("Use union in scala spark")
        .master("local")
        .getOrCreate()
  
      val schema = StructType(Array(
        StructField("category", StringType, true),
        StructField("item", StringType, true),
        StructField("quantity", IntegerType, true),
        StructField("price", DoubleType, true)
      ))
  
      // Create the data for first DataFrame
      val data1 = Seq(
        Row("Fruit", "Apple", 10, 1.5),
        Row("Fruit", "Banana", 20, 0.5),
        Row("Vegetable", "Carrot", 15, 0.7)
      )
  
      // Create the data for second DataFrame
      val data2 = Seq(
        Row("Fruit", "Orange", 30, 0.8),
        Row("Fruit", "Pear", 10, 1.0),
        Row("Vegetable", "Potato", 25, 0.3)
      )
  
      // Create the DataFrames
      val rdd1 = sparkSession.sparkContext.parallelize(data1)
      val df1 = sparkSession.createDataFrame(rdd1, schema)
  
      val rdd2 = sparkSession.sparkContext.parallelize(data2)
      val df2 = sparkSession.createDataFrame(rdd2, schema)
  
      val combinedDF = df1.union(df2)
      combinedDF.show()

      sparkSession.stop()
    }
  }
  

4. Output

Running the program prints the combined DataFrame, with the rows of df1 followed by the rows of df2:

  +---------+------+--------+-----+
  | category|  item|quantity|price|
  +---------+------+--------+-----+
  |    Fruit| Apple|      10|  1.5|
  |    Fruit|Banana|      20|  0.5|
  |Vegetable|Carrot|      15|  0.7|
  |    Fruit|Orange|      30|  0.8|
  |    Fruit|  Pear|      10|  1.0|
  |Vegetable|Potato|      25|  0.3|
  +---------+------+--------+-----+

Conclusion

In this tutorial, we demonstrated how to use the union function in Spark with Scala to combine two DataFrames with the same schema. This is a powerful tool for data integration and processing tasks.

Note: Because union matches columns by position, make sure the two DataFrames have the same schema (same column order and types) before calling it. If the column counts differ, Spark raises an AnalysisException; if only the names differ, the rows are silently combined by position.
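When two DataFrames have the same columns but in a different order, union will silently pair columns by position and scramble the data; unionByName (available since Spark 2.3) matches columns by name instead. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("unionByName example")
  .master("local")
  .getOrCreate()
import spark.implicits._

// Same columns, but in a different order
val a = Seq(("Fruit", "Apple")).toDF("category", "item")
val b = Seq(("Pear", "Fruit")).toDF("item", "category")

// a.union(b) would put "Pear" into the category column;
// unionByName lines the columns up by name instead
val combined = a.unionByName(b)
combined.show()

val categories = combined.select("category").collect().map(_.getString(0)).toSeq
spark.stop()
```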