How to Use dropDuplicates in Spark Scala - Removing Duplicates
Removing duplicate rows is a common operation in data processing. In Apache Spark, the dropDuplicates function eliminates duplicate rows from a DataFrame. This tutorial walks through using it from Scala with practical examples and explanations. We will work with the following sample data, which contains one fully duplicated row:
Category | Item | Quantity | Price
---|---|---|---
Fruit | Apple | 10 | 1.5
Fruit | Apple | 10 | 1.5
Vegetable | Carrot | 15 | 0.7
Vegetable | Potato | 25 | 0.3
Before we can use dropDuplicates, we need to import the necessary classes:
```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
```
Next, create a SparkSession and build the sample DataFrame:

```scala
// Create a local SparkSession
val sparkSession = SparkSession
  .builder()
  .appName("remove duplicates from a Spark DataFrame")
  .master("local")
  .getOrCreate()

// Define the schema
val schema = StructType(Array(
  StructField("category", StringType, true),
  StructField("item", StringType, true),
  StructField("quantity", IntegerType, true),
  StructField("price", DoubleType, true)
))

// Create the data
val data = Seq(
  Row("Fruit", "Apple", 10, 1.5),
  Row("Fruit", "Apple", 10, 1.5),
  Row("Vegetable", "Carrot", 15, 0.7),
  Row("Vegetable", "Potato", 25, 0.3)
)

// Create the DataFrame
val rdd = sparkSession.sparkContext.parallelize(data)
val df = sparkSession.createDataFrame(rdd, schema)
```
To remove rows that are duplicated across every column, call the dropDuplicates function with no arguments:
```scala
val uniqueDF = df.dropDuplicates()
uniqueDF.show()
```
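To see what the no-argument form does conceptually, here is a plain Scala sketch (no Spark required) that removes fully identical rows from a local collection. The case class and values mirror the sample data above; only the semantics carry over, not the distributed execution:

```scala
// Rows modeled as a case class; equality is structural,
// just as whole-row equality is for DataFrame rows
case class Record(category: String, item: String, quantity: Int, price: Double)

val data = Seq(
  Record("Fruit", "Apple", 10, 1.5),
  Record("Fruit", "Apple", 10, 1.5),   // exact duplicate of the row above
  Record("Vegetable", "Carrot", 15, 0.7),
  Record("Vegetable", "Potato", 25, 0.3)
)

// distinct keeps one copy of each fully identical row
val unique = data.distinct
println(unique.size) // 3
```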
You can also specify columns to consider for identifying duplicates. For example, to remove rows that have the same "item" and "quantity" values:
```scala
val uniqueSpecificDF = df.dropDuplicates("item", "quantity")
uniqueSpecificDF.show()
```
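When duplicates are identified by a subset of columns, Spark keeps one row per key but makes no guarantee about which duplicate survives. A plain Scala sketch of the same idea, using the collections method distinctBy (Scala 2.13+), which deterministically keeps the first row per key; the differing prices here are illustrative values, not part of the tutorial's dataset:

```scala
case class Record(category: String, item: String, quantity: Int, price: Double)

val data = Seq(
  Record("Fruit", "Apple", 10, 1.5),
  Record("Fruit", "Apple", 10, 1.8),   // same item/quantity, different price
  Record("Vegetable", "Carrot", 15, 0.7)
)

// Keep one row per (item, quantity) key; the first occurrence survives here,
// whereas Spark's dropDuplicates may retain any of the duplicates
val deduped = data.distinctBy(r => (r.item, r.quantity))
println(deduped.size) // 2
```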
You can combine dropDuplicates with other transformations for more complex data processing. Here is a complete, runnable example:
```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object RemoveDuplicates {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("remove duplicates from a Spark DataFrame")
      .master("local")
      .getOrCreate()

    // Define the schema
    val schema = StructType(Array(
      StructField("category", StringType, true),
      StructField("item", StringType, true),
      StructField("quantity", IntegerType, true),
      StructField("price", DoubleType, true)
    ))

    // Create the data
    val data = Seq(
      Row("Fruit", "Apple", 10, 1.5),
      Row("Fruit", "Apple", 10, 1.5),
      Row("Vegetable", "Carrot", 15, 0.7),
      Row("Vegetable", "Potato", 25, 0.3)
    )

    // Create the DataFrame
    val rdd = sparkSession.sparkContext.parallelize(data)
    val rawdf = sparkSession.createDataFrame(rdd, schema)

    // Drop fully duplicated rows and show the result
    val uniqueDF = rawdf.dropDuplicates()
    uniqueDF.show()
  }
}
```
In this tutorial, we have seen how to use the dropDuplicates function in Spark with Scala to remove duplicate rows from a DataFrame. It is a staple of data cleaning and preparation; for more advanced pipelines, combine it with other DataFrame transformations such as filters and aggregations.