Guide to Apache PySpark DataFrame distinct() Method
The distinct() method in Apache PySpark returns a new DataFrame containing only the unique rows of the original DataFrame, comparing values across all columns. Consider the following sample data, in which the second and fourth rows are identical:
Roll | First Name | Age | Last Name |
---|---|---|---|
1 | Arjun | 30 | Singh |
3 | Vikram | 67 | Sharma |
4 | Amit | 67 | Verma |
3 | Vikram | 67 | Sharma |
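Applying distinct() to this data removes the duplicate fourth row, leaving only the unique rows:

Roll | First Name | Age | Last Name |
---|---|---|---|
1 | Arjun | 30 | Singh |
3 | Vikram | 67 | Sharma |
4 | Amit | 67 | Verma |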
First, you need to import the necessary libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
```
For demonstration purposes, let's create a sample DataFrame:
```python
# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("distinct in pyspark").getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True)
])

# Create the data (note the duplicate row with roll 3)
data = [
    Row(1, "Arjun", 30, "Singh"),
    Row(3, "Vikram", 67, "Sharma"),
    Row(4, "Amit", 67, "Verma"),
    Row(3, "Vikram", 67, "Sharma"),
]

# Parallelize the data and create the DataFrame
rdd = spark.sparkContext.parallelize(data)
testDF = spark.createDataFrame(rdd, schema)
```
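As a side note, the explicit parallelize step is optional: createDataFrame also accepts a Python list directly, so the last two lines above could be replaced with a single call.

```python
# Equivalent construction without creating an RDD first
testDF = spark.createDataFrame(data, schema)
```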
Now call distinct() to drop the duplicate row:

```python
transformedDF = testDF.distinct()
```
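Note that distinct() is a transformation, so it is evaluated lazily; nothing is computed until an action such as show() or count() runs. For example, to get just the number of unique rows:

```python
# count() is an action and triggers the actual computation
unique_count = testDF.distinct().count()  # 3 for the sample data above
```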
Putting it all together, the complete program looks like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Initialize SparkSession
spark = SparkSession.builder.master("local").appName("distinct in pyspark").getOrCreate()

# Define the schema
schema = StructType([
    StructField("roll", IntegerType(), True),
    StructField("first_name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("last_name", StringType(), True)
])

# Create the data (note the duplicate row with roll 3)
data = [
    Row(1, "Arjun", 30, "Singh"),
    Row(3, "Vikram", 67, "Sharma"),
    Row(4, "Amit", 67, "Verma"),
    Row(3, "Vikram", 67, "Sharma"),
]

# Parallelize the data and create the DataFrame
rdd = spark.sparkContext.parallelize(data)
testDF = spark.createDataFrame(rdd, schema)

# Remove duplicate rows
transformedDF = testDF.distinct()

# Show the result
transformedDF.show()
```
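Running the program prints the deduplicated DataFrame. Spark does not guarantee row order, so the rows may appear in a different sequence, but the output should contain exactly these three rows:

```
+----+----------+---+---------+
|roll|first_name|age|last_name|
+----+----------+---+---------+
|   1|     Arjun| 30|    Singh|
|   3|    Vikram| 67|   Sharma|
|   4|      Amit| 67|    Verma|
+----+----------+---+---------+
```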
That's it! You've successfully removed duplicate rows from a PySpark DataFrame using distinct().