Deleting a Column in PySpark

How to Delete a Column in a PySpark DataFrame

In PySpark, removing a column from a DataFrame is quite simple. This tutorial will show you how to do it. We’ll provide clear, step-by-step examples to make the process easy to follow.

The examples below use the following sample data. The code leaves out the Roll column and builds the DataFrame from the name and age fields only.

Sample Data


Roll  First Name  Age  Last Name
1     Ali         30   Khan
2     Sanjay      20   Kumar
3     Rahul       67   Kumar

Use drop Method to Delete a Column

You can delete a column from a PySpark DataFrame using the drop method. Here's an example:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("Delete Column Example").getOrCreate()

# Sample DataFrame
data = [("Ali", "Khan", 30),
        ("Sanjay", "Kumar", 20),
        ("Rahul", "Kumar", 67)]

columns = ["FirstName", "LastName", "Age"]

df = spark.createDataFrame(data, schema=columns)

# Delete the 'Age' column
df = df.drop("Age")
df.show()

Delete Multiple Columns

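To delete several columns in one call, pass each column name as a separate argument to drop:

# Delete the 'Age' and 'LastName' columns
df = df.drop("Age", "LastName")
df.show()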
 

Complete Code:

  from pyspark.sql import SparkSession

  # Create a Spark session
  spark = SparkSession.builder.appName("Delete Column Example").getOrCreate()
  
  # Sample DataFrame
  data = [("Ali", "Khan", 30),
          ("Sanjay", "Kumar", 20),
          ("Rahul", "Kumar", 67)]
  
  columns = ["FirstName", "LastName", "Age"]
  
  df = spark.createDataFrame(data, schema=columns)
  
  # Delete the 'Age' column
  df = df.drop("Age")
  # To delete multiple columns instead, uncomment the next line:
  # df = df.drop("Age", "LastName")
  df.show()