Guide to Apache Spark DataFrame distinct() Method

Select distinct rows in Spark DataFrame - Scala

The distinct() method in Apache Spark DataFrame is used to return a new DataFrame containing only the unique rows, where uniqueness is determined across all columns. Here are five key points about distinct():

1. Two rows count as duplicates only if they match in every column.
2. It is a transformation, so it is evaluated lazily; nothing runs until an action such as show() or count() is called.
3. It is equivalent to dropDuplicates() with no arguments, while dropDuplicates with column names restricts the comparison to those columns (see the sketch below).
4. It involves a shuffle, so the row order of the result is not guaranteed.
5. It never modifies the original DataFrame; like every DataFrame operation, it returns a new one.
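
As a quick illustration of the third point, here is a minimal sketch contrasting the two methods; df stands for any DataFrame with the columns used in this guide:

// Unique rows, compared across all columns
val uniqueRows = df.distinct()

// Unique rows, compared only on first_name and last_name
val uniqueNames = df.dropDuplicates("first_name", "last_name")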

For example, this guide uses the sample data below; note that the last row is an exact duplicate of the third.

Sample Data

Roll  First Name  Age  Last Name
1     rahul       30   yadav
2     sanjay      20   gupta
3     ranjan      67   kumar
3     ranjan      67   kumar

Step 1: Import Required Libraries

First, you need to import the necessary libraries:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

Step 2: Create Sample DataFrame

For demonstration purposes, let's create a sample DataFrame. The snippet below assumes an active SparkSession named sparkSession; the complete code at the end shows how to create one:

val schema = StructType(Array(
  StructField("roll", IntegerType, true),
  StructField("first_name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("last_name", StringType, true)
))
// Sample rows; the last row duplicates the third
val data = Seq(
  Row(1, "rahul", 30, "yadav"),
  Row(2, "sanjay", 20, "gupta"),
  Row(3, "ranjan", 67, "kumar"),
  Row(3, "ranjan", 67, "kumar")
)
val rdd = sparkSession.sparkContext.parallelize(data)
val testDF = sparkSession.createDataFrame(rdd, schema)
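
Alternatively, if you do not need explicit control over nullability, you can build the same DataFrame from a Seq of tuples and let Spark infer the schema. A sketch, again assuming an active SparkSession named sparkSession:

import sparkSession.implicits._

// Same sample data built from tuples; column names are set by toDF,
// column types are inferred
val testDF2 = Seq(
  (1, "rahul", 30, "yadav"),
  (2, "sanjay", 20, "gupta"),
  (3, "ranjan", 67, "kumar"),
  (3, "ranjan", 67, "kumar")
).toDF("roll", "first_name", "age", "last_name")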
        

Step 3: Use the distinct() Method

Call distinct() on the DataFrame to drop the duplicate rows:

val transformedDF = testDF.distinct()
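
To verify that the duplicate row was dropped, compare the row counts before and after; count() is an action, so it triggers the actual computation (a quick sanity check using the sample data above):

println(testDF.count())        // prints 4: all rows, including the duplicate
println(transformedDF.count()) // prints 3: unique rows only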

Complete Code

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

  object DistinctInSpark {
    def main(args: Array[String]): Unit = {
      // Local SparkSession for the demo
      val sparkSession = SparkSession
        .builder()
        .appName("select distinct rows in spark scala")
        .master("local")
        .getOrCreate()
      // Schema for the sample data
      val schema = StructType(Array(
        StructField("roll", IntegerType, true),
        StructField("first_name", StringType, true),
        StructField("age", IntegerType, true),
        StructField("last_name", StringType, true)
      ))
      // Sample rows; the last row duplicates the third
      val data = Seq(
        Row(1, "rahul", 30, "yadav"),
        Row(2, "sanjay", 20, "gupta"),
        Row(3, "ranjan", 67, "kumar"),
        Row(3, "ranjan", 67, "kumar")
      )
      val rdd = sparkSession.sparkContext.parallelize(data)
      val testDF = sparkSession.createDataFrame(rdd, schema)
      // Keep only the unique rows (all columns considered)
      val transformedDF = testDF.distinct()
      transformedDF.show()
      sparkSession.stop()
    }
  }

That's it! You've successfully selected the distinct rows of a DataFrame in Spark using Scala.

Output

Running the program prints the three unique rows (the exact row order may vary, since distinct() involves a shuffle):

+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   1|     rahul| 30|    yadav|
|   2|    sanjay| 20|    gupta|
|   3|    ranjan| 67|    kumar|
+----+----------+---+---------+
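
distinct() also combines naturally with select() when you only need the unique values of particular columns. A small sketch using the same testDF:

// Unique first names only
testDF.select("first_name").distinct().show()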