How to Use orderBy in Spark Scala - Sorting DataFrames

Using orderBy or sort in Spark Scala to Sort DataFrames

Sorting data is a crucial step in data processing and analysis. In Apache Spark, you can use the orderBy function to sort DataFrames in Scala. This tutorial will guide you through the process of using orderBy with practical examples and explanations.

The examples in this tutorial use the following sample data.

Sample Data

Roll	First Name	Age	Last Name
1	rahul	30	yadav
2	sanjay	20	gupta
3	ranjan	67	kumar

Step 1: Import Required Libraries

First, you need to import the necessary libraries:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

Step 2: Create Sample DataFrame

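For demonstration purposes, let's create a sample DataFrame. The snippet assumes that a SparkSession named sparkSession already exists; it is created in the complete code at the end of this tutorial: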

val schema = StructType(Array(
  StructField("roll", IntegerType, true),
  StructField("first_name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("last_name", StringType, true)
))
val data = Seq(
  Row(1, "rahul", 30, "yadav"),
  Row(2, "sanjay", 20, "gupta"),
  Row(3, "ranjan", 67, "kumar")
)
val rdd = sparkSession.sparkContext.parallelize(data)
val testDF = sparkSession.createDataFrame(rdd, schema)

Step 3: Use the orderBy Method to Sort by Single or Multiple Columns

Sort by a single column (ascending order is the default):

val transformedDF = testDF.orderBy("age")
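
The title of this tutorial also mentions sort. On a DataFrame, orderBy is an alias for sort, so the two calls in the sketch below should return the rows in the same order. This is only a minimal illustration: the object name SortAliasSketch, the helpers sampleDF and agesInOrder, and the use of toDF to build the sample rows are assumptions made for this example and are not part of the original code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this illustration.
object SortAliasSketch {

  // Builds the tutorial's three sample rows (toDF is used here for brevity).
  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "rahul", 30, "yadav"),
      (2, "sanjay", 20, "gupta"),
      (3, "ranjan", 67, "kumar")
    ).toDF("roll", "first_name", "age", "last_name")
  }

  // Collects the "age" values in the order the DataFrame returns them.
  def agesInOrder(df: DataFrame): List[Int] =
    df.collect().map(_.getAs[Int]("age")).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort alias sketch")
      .master("local[1]")
      .getOrCreate()

    val df = sampleDF(spark)
    val viaOrderBy = agesInOrder(df.orderBy("age")) // ascending by default
    val viaSort = agesInOrder(df.sort("age"))       // same ordering through sort

    println(s"orderBy: $viaOrderBy")
    println(s"sort:    $viaSort")

    spark.stop()
  }
}
```

Both lists should contain the ages 20, 30, 67 in that order, so which method you use is a matter of preference.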

Sort by a single column in descending order:

val transformedDF = testDF.orderBy(col("age").desc)
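
A descending sort can be spelled in more than one way. The sketch below compares col("age").desc with the desc function from org.apache.spark.sql.functions and with the $"age" column syntax from spark.implicits. The object name DescendingSortSketch and its helper methods are hypothetical, and the sample rows are rebuilt with toDF so that the example is self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

// Hypothetical object name, used only for this illustration.
object DescendingSortSketch {

  // Builds the tutorial's three sample rows.
  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "rahul", 30, "yadav"),
      (2, "sanjay", 20, "gupta"),
      (3, "ranjan", 67, "kumar")
    ).toDF("roll", "first_name", "age", "last_name")
  }

  // Collects the "age" values in the order the DataFrame returns them.
  def agesInOrder(df: DataFrame): List[Int] =
    df.collect().map(_.getAs[Int]("age")).toList

  // Runs the same descending sort written three different ways.
  def descendingVariants(spark: SparkSession): Seq[List[Int]] = {
    import spark.implicits._ // enables the $"age" column syntax
    val df = sampleDF(spark)
    Seq(
      df.orderBy(col("age").desc), // Column method, as used in this tutorial
      df.orderBy(desc("age")),     // function from org.apache.spark.sql.functions
      df.orderBy($"age".desc)      // string interpolator from spark.implicits
    ).map(agesInOrder)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("descending sort sketch")
      .master("local[1]")
      .getOrCreate()

    descendingVariants(spark).foreach(println)

    spark.stop()
  }
}
```

All three variants should yield the ages 67, 30, 20 in that order.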

Sort by multiple columns

You can also sort by multiple columns. For example, to sort by "age" in descending order and then by "roll" in ascending order to break ties, use:

val transformedDF = testDF.orderBy(col("age").desc, col("roll"))
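
Every age in the sample data is unique, so the second sort column never comes into play there. The sketch below adds two hypothetical rows (rolls 4 and 5, with invented names) whose ages tie with existing rows, to show how "roll" breaks ties. The object name MultiColumnSortSketch and the method rollsInOrder are also assumptions made for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical object name, used only for this illustration.
object MultiColumnSortSketch {

  // Sorts by age (descending), then roll (ascending), and returns the rolls in order.
  def rollsInOrder(spark: SparkSession): List[Int] = {
    import spark.implicits._
    val df = Seq(
      (1, "rahul", 30, "yadav"),
      (2, "sanjay", 20, "gupta"),
      (3, "ranjan", 67, "kumar"),
      (4, "asha", 30, "singh"),  // hypothetical row: same age as roll 1
      (5, "vikram", 20, "mehta") // hypothetical row: same age as roll 2
    ).toDF("roll", "first_name", "age", "last_name")

    df.orderBy(col("age").desc, col("roll"))
      .collect()
      .map(_.getAs[Int]("roll"))
      .toList
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi column sort sketch")
      .master("local[1]")
      .getOrCreate()

    println(rollsInOrder(spark))

    spark.stop()
  }
}
```

With these rows the rolls come back as 3, 1, 4, 2, 5: the oldest person first, and within each age group the lower roll number first.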

Complete Code

  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
  
  object OrderByColumnSpark {
    def main(args: Array[String]): Unit = {
      val sparkSession = SparkSession
        .builder()
        .appName("order by or sort column in a spark dataframe")
        .master("local")
        .getOrCreate()
      val schema = StructType(Array(
        StructField("roll", IntegerType, true),
        StructField("first_name", StringType, true),
        StructField("age", IntegerType, true),
        StructField("last_name", StringType, true)
      ))
      val data = Seq(
        Row(1, "rahul", 30, "yadav"),
        Row(2, "sanjay", 20, "gupta"),
        Row(3, "ranjan", 67, "kumar")
      )
      val rdd = sparkSession.sparkContext.parallelize(data)
      val testDF = sparkSession.createDataFrame(rdd, schema)
      // Sort by age in ascending order (the default)
      val transformedDF = testDF.orderBy("age")
      transformedDF.show()
      sparkSession.stop()
  
    }
  
  }

That's it! You've successfully applied orderBy to a DataFrame in Spark using Scala.

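Output

+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   2|    sanjay| 20|    gupta|
|   1|     rahul| 30|    yadav|
|   3|    ranjan| 67|    kumar|
+----+----------+---+---------+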