Guide to Apache PySpark DataFrame distinct() Method

Select distinct rows in PySpark DataFrame

The distinct() method in Apache PySpark DataFrame is used to generate a new DataFrame containing only unique rows based on all columns. Here are five key points about distinct():

Eliminates Duplicate Rows: The distinct() method removes all duplicate rows from the DataFrame, ensuring each row is unique.

Operates Across All Columns: It considers all columns in the DataFrame when determining uniqueness, making it a comprehensive tool for de-duplication. To de-duplicate on a subset of columns instead, PySpark provides dropDuplicates() (see the sketch after this list).

Performance Consideration: Applying distinct() can be computationally expensive, especially on large datasets, as it requires a full shuffle of data across the cluster.

No In-Place Modification: The distinct() method returns a new DataFrame, leaving the original DataFrame unaltered.

Commonly Used in Data Cleaning: This method is frequently used in data cleaning processes to ensure the dataset is free of duplicates before further analysis.
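The difference between full-row de-duplication and column-subset de-duplication is easiest to see side by side. Below is a minimal sketch, assuming a SparkSession named spark already exists (one is created in Step 2 below), contrasting distinct() with the built-in dropDuplicates() method, which accepts an optional list of column names:

# distinct() compares entire rows; dropDuplicates() can compare a subset of columns
df = spark.createDataFrame(
    [(1, "Vikram"), (2, "Vikram"), (2, "Vikram")],
    ["roll", "first_name"],
)
df.distinct().count()                      # 2 -- only fully identical rows collapse
df.dropDuplicates(["first_name"]).count()  # 1 -- keeps one row per first_name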

For example, consider the following sample data:

Sample Data

Roll  First Name  Age  Last Name
1     Arjun       30   Singh
2     Vikram      67   Sharma
3     Amit        67   Verma
2     Vikram      67   Sharma

Note that the second and fourth rows are identical across all columns, so distinct() should collapse them into one.

Step 1: Import Required Libraries

First, you need to import the necessary libraries:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

Step 2: Create Sample DataFrame

For demonstration purposes, let's create a sample DataFrame:

# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("distinct in pyspark").getOrCreate()

# Define the schema
schema = StructType([
StructField("roll", IntegerType(), True),
StructField("first_name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("last_name", StringType(), True)
])

# Create the data
data = [
Row(1, "arjun", 30, "singh"),
Row(3, "Vikram", 67, "Sharma"),
Row(4, "Amit", 67, "Verma"),
Row(3, "Vikram", 67, "Sharma"),
]

# Parallelize the data and create DataFrame
rdd = spark.sparkContext.parallelize(data)
testDF = spark.createDataFrame(rdd, schema)
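As a side note, the intermediate RDD is not required here; createDataFrame() also accepts the list of Rows directly. The following one-liner is an equivalent, slightly simpler alternative for this example:

# createDataFrame can consume the list of Rows directly, no parallelize needed
testDF = spark.createDataFrame(data, schema)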

Step 3: Apply the distinct() Method

Apply distinct() to drop the duplicate row. The call is lazy and returns a new DataFrame:

transformedDF = testDF.distinct()
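Because distinct() returns a new DataFrame rather than modifying testDF, a quick way to confirm the duplicate was removed is to compare row counts (note that count() is an action and triggers a job, so avoid it in hot paths on large data):

print(testDF.count())         # 4 -- original rows, including the duplicate
print(transformedDF.count())  # 3 -- the duplicate Vikram row is gone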

Complete Code

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("distinct in pyspark").getOrCreate()

# Define the schema
schema = StructType([
StructField("roll", IntegerType(), True),
StructField("first_name", StringType(), True),
StructField("age", IntegerType(), True),
StructField("last_name", StringType(), True)
])

# Create the data
data = [
Row(1, "arjun", 30, "singh"),
Row(3, "Vikram", 67, "Sharma"),
Row(4, "Amit", 67, "Verma"),
Row(3, "Vikram", 67, "Sharma"),
]

# Parallelize the data and create DataFrame
rdd = spark.sparkContext.parallelize(data)
testDF = spark.createDataFrame(rdd, schema)
# Remove duplicate rows
transformedDF = testDF.distinct()
# Show the DataFrame
transformedDF.show()
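A common variation is to take distinct values of one or two columns rather than entire rows. Narrowing with select() before calling distinct() handles that; here is a short sketch against the same testDF:

# Unique first names only -- select() narrows the columns before de-duplication
testDF.select("first_name").distinct().show()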

That's it! You've successfully removed duplicate rows from a PySpark DataFrame using distinct().

Output

+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   1|     Arjun| 30|    Singh|
|   2|    Vikram| 67|   Sharma|
|   3|      Amit| 67|    Verma|
+----+----------+---+---------+

(Row order may vary from run to run, since distinct() shuffles the data.)