How to Use orderBy in PySpark - Sorting DataFrames
Sorting data is a crucial step in data processing and analysis. In Apache Spark, you can use the orderBy function to sort DataFrames in PySpark. This tutorial will guide you through the process of using orderBy with practical examples and explanations. Consider the following sample data:
| Roll | First Name | Age | Last Name |
|------|------------|-----|-----------|
| 1    | Ankit      | 25  | Sharma    |
| 2    | Vijay      | 35  | Singh     |
| 3    | Rohit      | 29  | Mehta     |
Here’s how you can sort the above DataFrame using PySpark:
```python
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Order by or sort column in a Spark DataFrame") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True)
])

# Create data
data = [
    Row(1, "Ankit", 25, "Sharma"),
    Row(2, "Vijay", 35, "Singh"),
    Row(3, "Rohit", 29, "Mehta")
]

# Parallelize the data
rdd = spark.sparkContext.parallelize(data)

# Create DataFrame
testDF = spark.createDataFrame(rdd, schema)

# Order DataFrame by age
transformedDF = testDF.orderBy(col("age"))

# Show results
transformedDF.show()
```
This will sort the DataFrame by the age column in ascending order, which is the default. To sort in descending order instead, call desc() on the column:
```python
# Order DataFrame by age in descending order
transformedDF = testDF.orderBy(col("age").desc())

# Show results
transformedDF.show()
```
By using the orderBy function, you can easily sort your PySpark DataFrames according to the specific requirements of your analysis.