Guide to Apache Spark DataFrame distinct() Method
The distinct() method in Apache Spark DataFrame returns a new DataFrame containing only unique rows, where uniqueness is evaluated across all columns. Consider the following sample data, which contains one duplicate row:
| Roll | First Name | Age | Last Name |
|---|---|---|---|
| 1 | rahul | 30 | yadav |
| 2 | sanjay | 20 | gupta |
| 3 | ranjan | 67 | kumar |
| 3 | ranjan | 67 | kumar |
First, you need to import the necessary libraries:
```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}
```
For demonstration purposes, let's create a SparkSession and a sample DataFrame:

```scala
val sparkSession = SparkSession
  .builder()
  .appName("select distinct rows in spark scala")
  .master("local")
  .getOrCreate()

val schema = StructType(Array(
  StructField("roll", IntegerType, true),
  StructField("first_name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("last_name", StringType, true)
))

val data = Seq(
  Row(1, "rahul", 30, "yadav"),
  Row(2, "sanjay", 20, "gupta"),
  Row(3, "ranjan", 67, "kumar"),
  Row(3, "ranjan", 67, "kumar")
)

val rdd = sparkSession.sparkContext.parallelize(data)
val testDF = sparkSession.createDataFrame(rdd, schema)
```
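To see what distinct() will operate on, you can print the DataFrame. The output below is a sketch of what show() produces for the sample data above; note the duplicate row for roll 3:

```scala
testDF.show()
// +----+----------+---+---------+
// |roll|first_name|age|last_name|
// +----+----------+---+---------+
// |   1|     rahul| 30|    yadav|
// |   2|    sanjay| 20|    gupta|
// |   3|    ranjan| 67|    kumar|
// |   3|    ranjan| 67|    kumar|
// +----+----------+---+---------+
```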
Next, apply distinct() to the DataFrame to keep only the unique rows:

```scala
val transformedDF = testDF.distinct()
```
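Note that distinct() is a transformation, so nothing is computed until an action runs. A quick way to confirm the duplicate is removed is to compare row counts; this is a minimal check based on the sample data above:

```scala
// distinct() compares entire rows, so only exact duplicates are dropped.
println(testDF.count())        // 4 rows, including the duplicate
println(transformedDF.count()) // 3 unique rows
```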
Putting it all together, here is the complete program:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object DistinctInSpark {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("select distinct rows in spark scala")
      .master("local")
      .getOrCreate()

    val schema = StructType(Array(
      StructField("roll", IntegerType, true),
      StructField("first_name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("last_name", StringType, true)
    ))

    val data = Seq(
      Row(1, "rahul", 30, "yadav"),
      Row(2, "sanjay", 20, "gupta"),
      Row(3, "ranjan", 67, "kumar"),
      Row(3, "ranjan", 67, "kumar")
    )

    val rdd = sparkSession.sparkContext.parallelize(data)
    val testDF = sparkSession.createDataFrame(rdd, schema)

    val transformedDF = testDF.distinct()
    transformedDF.show()

    sparkSession.stop()
  }
}
```
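Running the program prints the deduplicated DataFrame. The exact row order may vary because distinct() involves a shuffle, but the output should look roughly like this:

```
+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   1|     rahul| 30|    yadav|
|   2|    sanjay| 20|    gupta|
|   3|    ranjan| 67|    kumar|
+----+----------+---+---------+
```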
That's it! You've successfully removed duplicate rows from a DataFrame in Spark using the distinct() method in Scala.