How to Use dropDuplicates in Spark Scala - Removing Duplicates
Removing duplicate rows is a common operation in data processing. In Apache Spark, you can use the dropDuplicates function to eliminate duplicate rows from a DataFrame using Scala. This tutorial walks through the function with practical examples and explanations. Throughout, we will work with the following sample data, which contains one duplicated row:
| Category | Item | Quantity | Price | 
|---|---|---|---|
| Fruit | Apple | 10 | 1.5 | 
| Fruit | Apple | 10 | 1.5 | 
| Vegetable | Carrot | 15 | 0.7 | 
| Vegetable | Potato | 25 | 0.3 | 
Before we can use dropDuplicates, we need to import the necessary libraries:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
```
Next, create a SparkSession and use it to build the DataFrame:

```scala
// Create a SparkSession, the entry point for DataFrame operations
val sparkSession = SparkSession
  .builder()
  .appName("remove duplicates from a Spark DataFrame")
  .master("local")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("category", StringType, true),
  StructField("item", StringType, true),
  StructField("quantity", IntegerType, true),
  StructField("price", DoubleType, true)
))

// Create the data
val data = Seq(
  Row("Fruit", "Apple", 10, 1.5),
  Row("Fruit", "Apple", 10, 1.5),
  Row("Vegetable", "Carrot", 15, 0.7),
  Row("Vegetable", "Potato", 25, 0.3)
)

// Create the DataFrame
val rdd = sparkSession.sparkContext.parallelize(data)
val df = sparkSession.createDataFrame(rdd, schema)
```
To remove duplicate rows from a DataFrame, use the dropDuplicates function. Called without arguments, it removes rows that are identical across every column:

```scala
val uniqueDF = df.dropDuplicates()
uniqueDF.show()
```
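As a side note, DataFrame.distinct() is an alias for dropDuplicates() with no arguments, so the following produces the same result:

```scala
// distinct() is equivalent to dropDuplicates() without arguments
val viaDistinct = df.distinct()
viaDistinct.show()
```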
You can also specify columns to consider for identifying duplicates. For example, to remove rows that have the same "item" and "quantity" values:
```scala
val uniqueSpecificDF = df.dropDuplicates("item", "quantity")
uniqueSpecificDF.show()
```
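Be aware that when several rows share the same "item" and "quantity" values, dropDuplicates keeps an arbitrary one of them. If you need a deterministic choice, such as keeping the lowest-priced row in each group, one common pattern is a window function with row_number. The sketch below assumes the df built above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each (item, quantity) group by ascending price,
// then keep only the first-ranked row of each group.
val w = Window.partitionBy("item", "quantity").orderBy(col("price").asc)

val cheapestPerGroup = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

cheapestPerGroup.show()
```

This costs a shuffle per window, like dropDuplicates itself, but gives you explicit control over which duplicate survives.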
You can combine dropDuplicates with other transformations for more complex data processing. Here is a complete, runnable program that puts the pieces above together:
```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object RemoveDuplicates {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("remove duplicates from a Spark DataFrame")
      .master("local")
      .getOrCreate()

    // Define the schema
    val schema = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("quantity", IntegerType, true),
      StructField("price", DoubleType, true)
    ))

    // Create the data
    val data = Seq(
      Row("Fruit", "Apple", 10, 1.5),
      Row("Fruit", "Apple", 10, 1.5),
      Row("Vegetable", "Carrot", 15, 0.7),
      Row("Vegetable", "Potato", 25, 0.3)
    )

    // Create the DataFrame
    val rdd = sparkSession.sparkContext.parallelize(data)
    val rawDF = sparkSession.createDataFrame(rdd, schema)

    // Drop fully duplicate rows
    val uniqueDF = rawDF.dropDuplicates()
    uniqueDF.show()

    sparkSession.stop()
  }
}
```
In this tutorial, we have seen how to use the dropDuplicates function in Spark with Scala to remove duplicate rows from a DataFrame. This function is very useful for data cleaning and preparation tasks. For more advanced operations, consider combining it with other DataFrame transformations.
