How to Use orderBy in Spark Scala - Sorting DataFrames
Sorting data is a crucial step in data processing and analysis. In Apache Spark, you can use the orderBy
function to sort DataFrames in Scala. This tutorial will guide you through the process of using orderBy
with practical examples and explanations.
The examples in this tutorial use the following sample data:
roll | first_name | age | last_name |
---|---|---|---|
1 | rahul | 30 | yadav |
2 | sanjay | 20 | gupta |
3 | ranjan | 67 | kumar |
First, you need to import the necessary libraries:
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
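For demonstration purposes, let's create a sample DataFrame. The snippet assumes an active SparkSession named sparkSession, created as in the complete program at the end of this tutorial:
    // Schema: four nullable columns matching the sample data
    val schema = StructType(Array(
      StructField("roll", IntegerType, true),
      StructField("first_name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("last_name", StringType, true)
    ))
    val data = Seq(
      Row(1, "rahul", 30, "yadav"),
      Row(2, "sanjay", 20, "gupta"),
      Row(3, "ranjan", 67, "kumar")
    )
    // Turn the rows into an RDD, then apply the schema to get a DataFrame
    val rdd = sparkSession.sparkContext.parallelize(data)
    val testDF = sparkSession.createDataFrame(rdd, schema)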
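To sort the DataFrame by "age" in ascending order, which is the default, pass the column name to orderBy:
    val transformedDF = testDF.orderBy("age")
To sort in descending order instead, pass a Column and call desc on it:
    val transformedDF = testDF.orderBy(col("age").desc)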
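The two sort directions can be checked side by side with a small standalone sketch. It rebuilds the same three rows with toDF for brevity rather than the Row and schema approach used in this tutorial; the object name OrderByDirectionSketch and the helpers sample and ages are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original tutorial.
object OrderByDirectionSketch {
  // Rebuilds the tutorial's three sample rows; toDF is used only for brevity.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "rahul", 30, "yadav"),
      (2, "sanjay", 20, "gupta"),
      (3, "ranjan", 67, "kumar")
    ).toDF("roll", "first_name", "age", "last_name")
  }

  // Collects the "age" column to the driver, keeping the sorted order.
  def ages(df: DataFrame): List[Int] =
    df.select("age").collect().map(_.getInt(0)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orderBy direction sketch")
      .master("local[1]")
      .getOrCreate()

    val df = sample(spark)
    println(ages(df.orderBy("age")))           // List(20, 30, 67)
    println(ages(df.orderBy(col("age").desc))) // List(67, 30, 20)

    spark.stop()
  }
}
```

Passing a plain column name always sorts ascending; a direction other than the default requires a Column expression such as col("age").desc.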
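You can also sort by multiple columns. Each column carries its own direction, and later columns only break ties among rows that are equal on the earlier ones. For example, to sort by "age" in descending order and then by "roll" in ascending order, use:
    val transformedDF = testDF.orderBy(col("age").desc, col("roll"))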
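Because every age in the sample data is distinct, the second sort column never comes into play there. The sketch below uses a hypothetical four-row DataFrame with repeated ages to make the tie-breaking visible; the data, the object name OrderByTieBreakSketch, and the helpers sample and rolls are assumptions for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original tutorial.
object OrderByTieBreakSketch {
  // Made-up rows in which two pairs of rows share the same age.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, 30), (2, 20), (3, 30), (4, 20)).toDF("roll", "age")
  }

  // Collects the "roll" column to the driver, keeping the sorted order.
  def rolls(df: DataFrame): List[Int] =
    df.select("roll").collect().map(_.getInt(0)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orderBy tie-break sketch")
      .master("local[1]")
      .getOrCreate()

    val df = sample(spark)
    // age descending first; roll ascending decides the order within each age.
    println(rolls(df.orderBy(col("age").desc, col("roll"))))      // List(1, 3, 2, 4)
    // Same primary key, but ties are now broken by roll descending.
    println(rolls(df.orderBy(col("age").desc, col("roll").desc))) // List(3, 1, 4, 2)

    spark.stop()
  }
}
```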
Here is the complete program, which sorts the sample DataFrame by "age" in ascending order and prints the result:
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    object OrderByColumnSpark {
      def main(args: Array[String]): Unit = {
        // Local session for running the example on a single machine
        val sparkSession = SparkSession
          .builder()
          .appName("order by or sort column in a spark dataframe")
          .master("local")
          .getOrCreate()

        val schema = StructType(Array(
          StructField("roll", IntegerType, true),
          StructField("first_name", StringType, true),
          StructField("age", IntegerType, true),
          StructField("last_name", StringType, true)
        ))
        val data = Seq(
          Row(1, "rahul", 30, "yadav"),
          Row(2, "sanjay", 20, "gupta"),
          Row(3, "ranjan", 67, "kumar")
        )
        val rdd = sparkSession.sparkContext.parallelize(data)
        val testDF = sparkSession.createDataFrame(rdd, schema)

        // Sort by age, ascending by default, and print the rows
        val transformedDF = testDF.orderBy("age")
        transformedDF.show()

        sparkSession.stop()
      }
    }
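The schema above declares every column as nullable, so it can help to know where null values end up after a sort. As a rough sketch: Spark places nulls first in an ascending sort and last in a descending one, and the Column methods asc_nulls_last and desc_nulls_first (available since Spark 2.1) override that. The data, the object name OrderByNullsSketch, and the helpers sample and rolls below are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original tutorial.
object OrderByNullsSketch {
  // Made-up rows; roll 2 has no age, which becomes a null in the DataFrame.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, Option[Int])]((1, Some(30)), (2, None), (3, Some(20)))
      .toDF("roll", "age")
  }

  // Collects the "roll" column to the driver, keeping the sorted order.
  def rolls(df: DataFrame): List[Int] =
    df.select("roll").collect().map(_.getInt(0)).toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("orderBy nulls sketch")
      .master("local[1]")
      .getOrCreate()

    val df = sample(spark)
    println(rolls(df.orderBy(col("age"))))                // List(2, 3, 1): null age first
    println(rolls(df.orderBy(col("age").asc_nulls_last))) // List(3, 1, 2): null age last
    println(rolls(df.orderBy(col("age").desc)))           // List(1, 3, 2): null age last

    spark.stop()
  }
}
```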
That's it! You've successfully applied orderBy to a DataFrame in Spark using Scala.