Filter and Where Conditions in Spark DataFrames with PySpark
Learn how to use filter and where conditions when working with Spark DataFrames in PySpark. This tutorial walks through applying conditional logic to your data so you can retrieve specific subsets of rows based on given criteria. We'll work with the following sample data:
ID | First Name | Age | Last Name |
---|---|---|---|
101 | Alice | 28 | Smith |
102 | Bob | 34 | Johnson |
103 | Charlie | 45 | Williams |
104 | David | 23 | Brown |
First, you need to import the necessary libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import IntegerType, StringType, StructField, StructType
```
For demonstration purposes, let's create a sample DataFrame:
```python
# Start a local Spark session
spark = SparkSession.builder \
    .appName("Filter and Where Conditions") \
    .master("local") \
    .getOrCreate()

# Define an explicit schema for the sample data
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("First_Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Last_Name", StringType(), True)
])

data = [
    (101, "Alice", 28, "Smith"),
    (102, "Bob", 34, "Johnson"),
    (103, "Charlie", 45, "Williams"),
    (104, "David", 23, "Brown")
]

df = spark.createDataFrame(data, schema)
df.show()
```
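If everything is set up correctly, `df.show()` should print something like this:

```
+---+----------+---+---------+
| ID|First_Name|Age|Last_Name|
+---+----------+---+---------+
|101|     Alice| 28|    Smith|
|102|       Bob| 34|  Johnson|
|103|   Charlie| 45| Williams|
|104|     David| 23|    Brown|
+---+----------+---+---------+
```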
Now, let's apply filter and where conditions to the DataFrame:
```python
# filter(): keep only the rows where Age is greater than 30
filtered_df = df.filter(col("Age") > lit(30))
filtered_df.show()
```
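`filter()` also accepts a SQL expression string, which can be more readable for simple predicates. A minimal sketch, reusing the `df` defined above (the variable name `filtered_expr_df` is just illustrative):

```python
# Equivalent filter using a SQL expression string instead of a Column
filtered_expr_df = df.filter("Age > 30")
filtered_expr_df.show()
```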
```python
# where() is an alias for filter() and produces the same result
where_df = df.where(col("Age") > lit(30))
where_df.show()
```
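In practice you'll often combine several predicates. A short sketch, again assuming the `df` from above (the cutoff values and column choices here are purely illustrative):

```python
# Combine predicates with & (and), | (or), ~ (not).
# Each clause needs its own parentheses because Python's
# bitwise operators bind more tightly than comparisons.
combined_df = df.where((col("Age") > 25) & (col("Last_Name") != "Brown"))
combined_df.show()

# isin() tests membership against a list of values
ids_df = df.where(col("ID").isin(101, 104))
ids_df.show()
```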
Putting it all together, here is the complete script:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import IntegerType, StringType, StructField, StructType


def main():
    # Start a local Spark session
    spark = SparkSession.builder \
        .appName("Filter and Where Conditions") \
        .master("local") \
        .getOrCreate()

    # Build the sample DataFrame with an explicit schema
    schema = StructType([
        StructField("ID", IntegerType(), True),
        StructField("First_Name", StringType(), True),
        StructField("Age", IntegerType(), True),
        StructField("Last_Name", StringType(), True)
    ])

    data = [
        (101, "Alice", 28, "Smith"),
        (102, "Bob", 34, "Johnson"),
        (103, "Charlie", 45, "Williams"),
        (104, "David", 23, "Brown")
    ]

    df = spark.createDataFrame(data, schema)

    # filter() and where() are interchangeable
    filtered_df = df.filter(col("Age") > lit(30))
    filtered_df.show()

    where_df = df.where(col("Age") > lit(30))
    where_df.show()

    spark.stop()


if __name__ == "__main__":
    main()
```
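Running the script should print the filtered result twice, once for each `show()` call, since only Bob and Charlie are older than 30:

```
+---+----------+---+---------+
| ID|First_Name|Age|Last_Name|
+---+----------+---+---------+
|102|       Bob| 34|  Johnson|
|103|   Charlie| 45| Williams|
+---+----------+---+---------+
```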
That's it! You've successfully applied filter and where conditions to a Spark DataFrame using PySpark. Since `where()` is simply an alias for `filter()`, you can use whichever reads better in your code.