How to use groupBy with multiple columns in Spark Scala
Grouping and aggregating data is a fundamental part of data analysis. In Apache Spark, the groupBy function groups the rows of a DataFrame by one or more columns so that aggregations such as sums and averages can be computed per group. This tutorial walks through using groupBy in Scala with practical examples, based on the following sample data:
| Roll | First Name | Age | Last Name | Subject | Marks |
|------|------------|-----|-----------|---------|-------|
| 1 | Rahul | 18 | Yadav | PHYSICS | 80 |
| 1 | Rahul | 18 | Yadav | CHEMISTRY | 77 |
| 1 | Rahul | 18 | Yadav | BIOLOGY | 70 |
| 2 | Vinay | 17 | kumar | PHYSICS | 80 |
| 2 | Vinay | 17 | kumar | CHEMISTRY | 77 |
| 2 | Vinay | 17 | kumar | BIOLOGY | 66 |
First, you need to import the necessary libraries:
```scala
import org.apache.spark.sql.{Row, SparkSession, functions}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
```
For demonstration purposes, let's create a sample DataFrame:
```scala
// Create a SparkSession first -- it is the entry point to Spark SQL
val sparkSession = SparkSession
  .builder()
  .appName("group by in scala spark dataframe")
  .master("local")
  .getOrCreate()

val schema = StructType(Array(
  StructField("roll", IntegerType, true),
  StructField("first_name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("last_name", StringType, true),
  StructField("subject", StringType, true),
  StructField("Marks", IntegerType, true)
))

val data = Seq(
  Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
  Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
  Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
  Row(2, "Vinay", 17, "kumar", "PHYSICS", 80),
  Row(2, "Vinay", 17, "kumar", "CHEMISTRY", 77),
  Row(2, "Vinay", 17, "kumar", "BIOLOGY", 66)
)

val rdd = sparkSession.sparkContext.parallelize(data)
val testDF = sparkSession.createDataFrame(rdd, schema)
```
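Before grouping, it can be worth a quick sanity check that the DataFrame was built as intended. This optional snippet assumes the testDF created above:

```scala
// Inspect the schema and the raw rows before aggregating
testDF.printSchema()
testDF.show()
```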
To group by a single column, pass its name to groupBy and then specify the aggregation in agg. Here we total the marks per roll number:

```scala
val groupedDF = testDF.groupBy("roll").agg(
  functions.sum("Marks").as("total_marks")
)
```
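Calling groupedDF.show() should print output along these lines. Row order is not guaranteed after a groupBy, but the totals follow directly from the sample data: 80 + 77 + 70 = 227 for roll 1 and 80 + 77 + 66 = 223 for roll 2.

```
+----+-----------+
|roll|total_marks|
+----+-----------+
|   1|        227|
|   2|        223|
+----+-----------+
```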
To group by multiple columns, pass all of the column names to groupBy. Including first_name and last_name alongside roll keeps the student names in the result:

```scala
val groupedDF = testDF
  .groupBy("roll", "first_name", "last_name")
  .agg(
    functions.sum("Marks").as("total_marks")
  )
```
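The agg call is not limited to a single aggregate. As a sketch, several aggregate expressions from the same functions object can be combined in one pass (the statsDF name and aliases here are just for illustration):

```scala
val statsDF = testDF
  .groupBy("roll", "first_name", "last_name")
  .agg(
    functions.sum("Marks").as("total_marks"),      // total across subjects
    functions.avg("Marks").as("average_marks"),    // mean per student
    functions.max("Marks").as("best_marks"),       // highest single score
    functions.count("subject").as("subjects_taken") // number of subjects
  )
statsDF.show()
```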
Putting it all together, here is the complete runnable program:

```scala
import org.apache.spark.sql.{Row, SparkSession, functions}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object GroupByInScalaSpark {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession
      .builder()
      .appName("group by in scala spark dataframe")
      .master("local")
      .getOrCreate()

    val schema = StructType(Array(
      StructField("roll", IntegerType, true),
      StructField("first_name", StringType, true),
      StructField("age", IntegerType, true),
      StructField("last_name", StringType, true),
      StructField("subject", StringType, true),
      StructField("Marks", IntegerType, true)
    ))

    val data = Seq(
      Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
      Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
      Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
      Row(2, "Vinay", 17, "kumar", "PHYSICS", 80),
      Row(2, "Vinay", 17, "kumar", "CHEMISTRY", 77),
      Row(2, "Vinay", 17, "kumar", "BIOLOGY", 66)
    )

    val rdd = sparkSession.sparkContext.parallelize(data)
    val testDF = sparkSession.createDataFrame(rdd, schema)

    val groupedDF = testDF.groupBy("roll")
      .agg(
        functions.sum("Marks").as("total_marks")
      )

    groupedDF.show()
    sparkSession.stop()
  }
}
```
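If you prefer SQL, the same multi-column grouping can be expressed through a temporary view. This is a sketch of an equivalent query, not part of the program above; the students view name is arbitrary:

```scala
// Register the DataFrame as a temporary view and run the same
// multi-column grouping with Spark SQL
testDF.createOrReplaceTempView("students")
val sqlGroupedDF = sparkSession.sql(
  """SELECT roll, first_name, last_name, SUM(Marks) AS total_marks
    |FROM students
    |GROUP BY roll, first_name, last_name""".stripMargin
)
sqlGroupedDF.show()
```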
That's it! You've successfully used groupBy with one and with multiple columns to aggregate a DataFrame in Spark using Scala.