Adding and Renaming Columns in PySpark

How to Add and Rename Columns in a PySpark DataFrame

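Adding and renaming columns are common operations on PySpark DataFrames. Because DataFrames are immutable, the withColumn and withColumnRenamed methods return a new DataFrame rather than modifying the original, which is why the examples in this tutorial assign the result back to df.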

1. Use the withColumn Method to Add a New Column

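The withColumn method takes the name of the new column and a Column expression that supplies its values. To give every row the same constant value, wrap the value in lit. If the name already exists, withColumn replaces that column instead of adding one. Here's an example: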

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()

# Sample DataFrame
data = [("James", "Smith", "M", 30),
        ("Anna", "Rose", "F", 41),
        ("Robert", "Williams", "M", 62)]

columns = ["FirstName", "LastName", "Gender", "Age"]

df = spark.createDataFrame(data, schema=columns)

# Add a new column 'Country'
df = df.withColumn("Country", lit("USA"))
df.show()
      

2. Use the withColumnRenamed Method to Rename a Column

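The withColumnRenamed method takes the existing column name followed by the new name. If the existing name is not in the schema, the call is a no-op and returns the DataFrame unchanged, so a typo in the old name does not raise an error. Here's how, continuing with the df from the previous step: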

# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")
df.show()
      

3. Complete Code for Adding and Renaming Columns in PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()

# Sample DataFrame
data = [("James", "Smith", "M", 30),
        ("Anna", "Rose", "F", 41),
        ("Robert", "Williams", "M", 62)]

columns = ["FirstName", "LastName", "Gender", "Age"]

df = spark.createDataFrame(data, schema=columns)

# Add a new column 'Country'
df = df.withColumn("Country", lit("USA"))

# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")

# Show the final DataFrame
df.show()
      