
How to Use groupBy in PySpark: A Guide to Grouping and Aggregating Data

Grouping and aggregating data is essential in data analysis. In PySpark, the Python API for Apache Spark, the `groupBy` function lets you group the rows of a DataFrame by one or more columns and apply aggregations to each group. This tutorial walks through how to use `groupBy`, with practical examples and explanations of each step.

Throughout this tutorial, we'll work with the sample data below.

Sample Data


roll  first_name  age  last_name  subject    marks
1     Rahul       18   Yadav      PHYSICS    80
1     Rahul       18   Yadav      CHEMISTRY  77
1     Rahul       18   Yadav      BIOLOGY    70
2     Vinay       17   Kumar      PHYSICS    80
2     Vinay       17   Kumar      CHEMISTRY  77
2     Vinay       17   Kumar      BIOLOGY    66

Step 1: Import Required Libraries

First, you need to import the necessary libraries:

from pyspark.sql import Row, SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
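
Step 2: Initialize the SparkSession

Every snippet below assumes an active SparkSession named spark, created once here (it also appears in the complete code at the end):

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Group by in PySpark DataFrame") \
    .master("local") \
    .getOrCreate()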

Step 3: Create a Sample DataFrame

For demonstration purposes, let's create a sample DataFrame:

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("marks", IntegerType(), True)
])

# Create the data
data = [
    Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
    Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
    Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
    Row(2, "Vinay", 17, "Kumar", "PHYSICS", 80),
    Row(2, "Vinay", 17, "Kumar", "CHEMISTRY", 77),
    Row(2, "Vinay", 17, "Kumar", "BIOLOGY", 66)
]

# Create the DataFrame directly from the rows and the schema
df = spark.createDataFrame(data, schema)
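
You can confirm the DataFrame's contents with df.show(); for the data above it prints:

+----+----------+---+---------+---------+-----+
|roll|first_name|age|last_name|  subject|marks|
+----+----------+---+---------+---------+-----+
|   1|     Rahul| 18|    Yadav|  PHYSICS|   80|
|   1|     Rahul| 18|    Yadav|CHEMISTRY|   77|
|   1|     Rahul| 18|    Yadav|  BIOLOGY|   70|
|   2|     Vinay| 17|    Kumar|  PHYSICS|   80|
|   2|     Vinay| 17|    Kumar|CHEMISTRY|   77|
|   2|     Vinay| 17|    Kumar|  BIOLOGY|   66|
+----+----------+---+---------+---------+-----+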

Step 4: Use groupBy on a Single Column

To compute each student's total marks, group by roll and sum the marks column:

grouped_df = df.groupBy("roll").agg(F.sum("marks").alias("total_marks"))
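
For this sample data, grouped_df.show() returns 227 for roll 1 (80 + 77 + 70) and 223 for roll 2 (80 + 77 + 66); row order may vary:

+----+-----------+
|roll|total_marks|
+----+-----------+
|   1|        227|
|   2|        223|
+----+-----------+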

Step 5: Use groupBy on Multiple Columns

To keep each student's name alongside the total, pass several columns to groupBy; rows are then grouped by the combination of all of them:

grouped_df = df.groupBy("roll", "first_name", "last_name").agg(F.sum("marks").alias("total_marks"))
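
Because roll already identifies a student in this data, adding first_name and last_name doesn't change the groups; it simply carries those columns into the result:

+----+----------+---------+-----------+
|roll|first_name|last_name|total_marks|
+----+----------+---------+-----------+
|   1|     Rahul|    Yadav|        227|
|   2|     Vinay|    Kumar|        223|
+----+----------+---------+-----------+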

Complete Code

from pyspark.sql import Row, SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Group by in PySpark DataFrame") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True),
    StructField("subject", StringType(), True),
    StructField("marks", IntegerType(), True)
])

# Create the data
data = [
    Row(1, "Rahul", 18, "Yadav", "PHYSICS", 80),
    Row(1, "Rahul", 18, "Yadav", "CHEMISTRY", 77),
    Row(1, "Rahul", 18, "Yadav", "BIOLOGY", 70),
    Row(2, "Vinay", 17, "Kumar", "PHYSICS", 80),
    Row(2, "Vinay", 17, "Kumar", "CHEMISTRY", 77),
    Row(2, "Vinay", 17, "Kumar", "BIOLOGY", 66)
]

# Create the DataFrame directly from the rows and the schema
df = spark.createDataFrame(data, schema)

# Group by a single column and aggregate
grouped_df = df.groupBy("roll").agg(F.sum("marks").alias("total_marks"))
grouped_df.show()

# Group by multiple columns and aggregate
grouped_multi_df = df.groupBy("roll", "first_name", "last_name").agg(
    F.sum("marks").alias("total_marks")
)
grouped_multi_df.show()

# Stop the SparkSession
spark.stop()
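
Going further: agg() accepts multiple aggregations in a single call. Here's a minimal sketch combining the built-in F.avg, F.max, and F.count (the aliases avg_marks, best_subject_marks, and num_subjects are illustrative names chosen here; run this before spark.stop() if you append it to the complete code):

# Compute several aggregations in one groupBy pass
summary_df = df.groupBy("roll").agg(
    F.sum("marks").alias("total_marks"),
    F.avg("marks").alias("avg_marks"),
    F.max("marks").alias("best_subject_marks"),
    F.count("subject").alias("num_subjects")
)
summary_df.show()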
 

That's it! You've successfully grouped and aggregated a DataFrame in PySpark using groupBy, on both a single column and multiple columns.

Output

Running the complete code prints both grouped results (row order may vary):

+----+-----------+
|roll|total_marks|
+----+-----------+
|   1|        227|
|   2|        223|
+----+-----------+

+----+----------+---------+-----------+
|roll|first_name|last_name|total_marks|
+----+----------+---------+-----------+
|   1|     Rahul|    Yadav|        227|
|   2|     Vinay|    Kumar|        223|
+----+----------+---------+-----------+