Adding and Renaming Columns in PySpark
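Adding new columns and renaming existing ones are two of the most common DataFrame operations in PySpark. This tutorial shows how to do both, using the withColumn and withColumnRenamed methods.
To add a new column, use the DataFrame's withColumn method. It returns a new DataFrame with the column added, or replaced if a column with that name already exists. The example below uses lit to fill the new column with a constant value: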
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()
# Sample DataFrame
data = [("James", "Smith", "M", 30),
        ("Anna", "Rose", "F", 41),
        ("Robert", "Williams", "M", 62)]
columns = ["FirstName", "LastName", "Gender", "Age"]
df = spark.createDataFrame(data, schema=columns)
# Add a new column 'Country'
df = df.withColumn("Country", lit("USA"))
df.show()
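To rename a column, use the withColumnRenamed method, passing the existing name first and the new name second. Like withColumn, it returns a new DataFrame; if the existing name is not found, the DataFrame is returned unchanged: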
# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")
df.show()
Putting both steps together, the complete example looks like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()
# Sample DataFrame
data = [("James", "Smith", "M", 30),
        ("Anna", "Rose", "F", 41),
        ("Robert", "Williams", "M", 62)]
columns = ["FirstName", "LastName", "Gender", "Age"]
df = spark.createDataFrame(data, schema=columns)
# Add a new column 'Country'
df = df.withColumn("Country", lit("USA"))
# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")
# Show the final DataFrame
df.show()