How to Use Left Join in PySpark
The join function in Apache PySpark is a vital tool for merging two DataFrames based on shared columns or keys. A left join returns every row from the left DataFrame along with matching rows from the right DataFrame; where no match exists, the right-side columns are filled with nulls. In this tutorial, we'll focus on how to perform a left join in PySpark, providing practical examples to illustrate the process.
Employee DataFrame:

| id | name  | age |
|----|-------|-----|
| 1  | John  | 28  |
| 2  | Alice | 34  |
| 4  | Bob   | 23  |

Salary DataFrame:

| id | salary |
|----|--------|
| 1  | 50000  |
| 2  | 60000  |
```python
# Create employee data
employee_data = [
    (1, "John", 28),
    (2, "Alice", 34),
    (4, "Bob", 23)
]

# Create salary data
salary_data = [
    (1, 50000),
    (2, 60000)
]

# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])
```
Before we can use a left join, we need to import the necessary libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
```
Now that we have our DataFrames, we can combine them using the join function:
```python
# Style 1: Left join using the common column name
left_join_df1 = employee_df.join(salary_df, ["id"], "left")
left_join_df1.show()
```
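To make the semantics concrete, here is a plain-Python sketch (no Spark required) of what the left join computes over the sample data from the tables above. The variable names `employee_rows` and `salary_rows` are illustrative, not part of the PySpark API:

```python
# Sample data mirroring the employee and salary tables above
employee_rows = [(1, "John", 28), (2, "Alice", 34), (4, "Bob", 23)]
salary_rows = {1: 50000, 2: 60000}  # id -> salary lookup

# Left join: keep every employee row; fill None where no salary matches
joined = [
    (emp_id, name, age, salary_rows.get(emp_id))
    for emp_id, name, age in employee_rows
]

print(joined)
# [(1, 'John', 28, 50000), (2, 'Alice', 34, 60000), (4, 'Bob', 23, None)]
```

Note that employee 4 (Bob) has no salary row, so the joined result carries `None` for his salary, just as Spark fills the `salary` column with null.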
Passing the join key as a list, as above, keeps a single `id` column in the result. When the two DataFrames share other column names as well, aliases give you finer control over which columns appear in the output:
```python
# Style 2: Left join using aliases (useful when column names are the same
# except the joining column)
left_join_df2 = employee_df.alias("emp").join(
    salary_df.alias("sal"),
    col("emp.id") == col("sal.id"),
    "left"
).select("emp.id", "emp.name", "emp.age", "sal.salary")
left_join_df2.show()
```
Here is the complete working example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Use left join in PySpark") \
    .master("local") \
    .getOrCreate()

# Create employee data
employee_data = [
    (1, "John", 28),
    (2, "Alice", 34),
    (4, "Bob", 23)
]

# Create salary data
salary_data = [
    (1, 50000),
    (2, 60000)
]

# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])

# Style 1: Left join using the common column name
left_join_df1 = employee_df.join(salary_df, ["id"], "left")
left_join_df1.show()

# Style 2: Left join using aliases (useful when column names are the same
# except the joining column)
left_join_df2 = employee_df.alias("emp").join(
    salary_df.alias("sal"),
    col("emp.id") == col("sal.id"),
    "left"
).select("emp.id", "emp.name", "emp.age", "sal.salary")
left_join_df2.show()

# Stop the SparkSession
spark.stop()
```