How to use groupBy with multiple columns in PySpark
Grouping and aggregating data is essential in data analysis. In PySpark, the Python API for Apache Spark, the `groupBy` function lets you efficiently group rows of a DataFrame, and `agg` lets you compute summaries over each group. This tutorial walks you through using `groupBy` with one column and with multiple columns, with practical examples and detailed explanations to help you master this fundamental technique.
Throughout this tutorial, we'll work with the following sample dataset of student marks:

Roll | First Name | Age | Last Name | Subject | Marks |
---|---|---|---|---|---|
1 | Rahul | 18 | Yadav | PHYSICS | 80 |
1 | Rahul | 18 | Yadav | CHEMISTRY | 77 |
1 | Rahul | 18 | Yadav | BIOLOGY | 70 |
2 | Vinay | 17 | Kumar | PHYSICS | 80 |
2 | Vinay | 17 | Kumar | CHEMISTRY | 77 |
2 | Vinay | 17 | Kumar | BIOLOGY | 66 |
First, you need to import the necessary libraries:
```python
from pyspark.sql import SparkSession, Row, functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
```
For demonstration purposes, let's create a sample DataFrame:
```python
# Initialize the SparkSession (needed before creating any DataFrame)
spark = SparkSession.builder \
    .appName("Group by in PySpark DataFrame") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("Marks", IntegerType(), True)
])

# Create the data
data = [
    Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
    Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
    Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
    Row(2, "Vinay", 17, "Kumar", "PHYSICS", 80),
    Row(2, "Vinay", 17, "Kumar", "CHEMISTRY", 77),
    Row(2, "Vinay", 17, "Kumar", "BIOLOGY", 66)
]

# Create the DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
```
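Before aggregating, it's worth a quick sanity check that the DataFrame matches the table above:

```python
# Inspect the schema and the rows we just created
df.printSchema()
df.show()
```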
grouped_df = df.groupBy("roll").agg(F.sum("Marks").alias("total_marks"))
grouped_df = df.groupBy("roll","first_name","last_name").agg(F.sum("Marks").alias("total_marks"))
Putting it all together, here is the complete script:

```python
from pyspark.sql import SparkSession, Row, functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Initialize the SparkSession
spark = SparkSession.builder \
    .appName("Group by in PySpark DataFrame") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("Marks", IntegerType(), True)
])

# Create the data
data = [
    Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
    Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
    Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
    Row(2, "Vinay", 17, "Kumar", "PHYSICS", 80),
    Row(2, "Vinay", 17, "Kumar", "CHEMISTRY", 77),
    Row(2, "Vinay", 17, "Kumar", "BIOLOGY", 66)
]

# Create the DataFrame
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)

# Group by multiple columns and aggregate
grouped_df = df.groupBy("roll", "first_name", "last_name") \
    .agg(F.sum("Marks").alias("total_marks"))

# Show the result
grouped_df.show()

# Stop the SparkSession
spark.stop()
```
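If you prefer SQL, the same grouping can be expressed by registering the DataFrame as a temporary view and running a `GROUP BY` query. This sketch uses `students` as the view name, which is just an illustrative choice:

```python
# Equivalent aggregation expressed in Spark SQL
df.createOrReplaceTempView("students")  # "students" is an arbitrary view name
spark.sql("""
    SELECT roll, first_name, last_name, SUM(Marks) AS total_marks
    FROM students
    GROUP BY roll, first_name, last_name
""").show()
```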
That's it! You've successfully grouped a PySpark DataFrame by one or more columns using `groupBy` and `agg`.