Adding and Renaming Columns in PySpark
In PySpark, adding new columns and renaming existing columns are common operations. This tutorial will guide you through these processes with clear examples.
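To add a new column in PySpark, you can use the withColumn method. It takes the name of the new column and a Column expression, and returns a new DataFrame; lit wraps a constant value so that it can be used as a Column. Here's an example: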
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()

# Sample DataFrame
data = [
    ("James", "Smith", "M", 30),
    ("Anna", "Rose", "F", 41),
    ("Robert", "Williams", "M", 62),
]
columns = ["FirstName", "LastName", "Gender", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Add a new column 'Country' holding the same constant value in every row
df = df.withColumn("Country", lit("USA"))
df.show()
To rename a column in a PySpark DataFrame, you can use the withColumnRenamed method. Here's how:
# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")
df.show()
The complete example, combining both steps:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a Spark session
spark = SparkSession.builder.appName("Add and Rename Columns Example").getOrCreate()

# Sample DataFrame
data = [
    ("James", "Smith", "M", 30),
    ("Anna", "Rose", "F", 41),
    ("Robert", "Williams", "M", 62),
]
columns = ["FirstName", "LastName", "Gender", "Age"]
df = spark.createDataFrame(data, schema=columns)

# Add a new column 'Country'
df = df.withColumn("Country", lit("USA"))

# Rename the 'FirstName' column to 'GivenName'
df = df.withColumnRenamed("FirstName", "GivenName")

# Show the final DataFrame
df.show()