How to Use a Right Join in PySpark
The join function in PySpark is a fundamental tool for combining two DataFrames based on common columns or keys, an operation that is vital for data integration and analysis. In this tutorial, we'll explore how to perform a right join in PySpark, which keeps every row from the right DataFrame and fills in nulls where the left DataFrame has no match, complete with practical examples to guide you through the process.
Suppose we have the following employee DataFrame:

id | name | age
---|---|---
1 | rahul | 28
2 | anil | 34
4 | Raju | 23
and a salary DataFrame:

id | salary
---|---
1 | 50000
2 | 60000
```python
# Create employee data
employee_data = [
    (1, "rahul", 28),
    (2, "anil", 34),
    (4, "Raju", 23)
]

# Create salary data
salary_data = [
    (1, 50000),
    (2, 60000)
]

# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])
```
Before we can use a right join, we need to import the necessary libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
```
Now that we have our DataFrames, we can combine them using the join function:
```python
# Style 1: Right join using the common column name
right_join_df1 = employee_df.join(salary_df, ["id"], "right")
right_join_df1.show()
```
Passing the join key as a list keeps a single id column in the result. The second style uses aliases instead, which is particularly useful for controlling column selection when the two DataFrames share column names beyond the joining column:
```python
# Style 2: Right join using aliases (useful when column names are the
# same except for the joining column)
right_join_df2 = employee_df.alias("emp").join(
    salary_df.alias("sal"),
    col("emp.id") == col("sal.id"),
    "right"
).select("emp.id", "emp.name", "emp.age", "sal.salary")
right_join_df2.show()
```
Here is the complete script:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Use right join in PySpark") \
    .master("local") \
    .getOrCreate()

# Create employee data
employee_data = [
    (1, "rahul", 28),
    (2, "anil", 34),
    (4, "Raju", 23)
]

# Create salary data
salary_data = [
    (1, 50000),
    (2, 60000)
]

# Create DataFrames
employee_df = spark.createDataFrame(employee_data, ["id", "name", "age"])
salary_df = spark.createDataFrame(salary_data, ["id", "salary"])

# Style 1: Right join using the common column name
right_join_df1 = employee_df.join(salary_df, ["id"], "right")
right_join_df1.show()

# Style 2: Right join using aliases (useful when column names are the
# same except for the joining column)
right_join_df2 = employee_df.alias("emp").join(
    salary_df.alias("sal"),
    col("emp.id") == col("sal.id"),
    "right"
).select("emp.id", "emp.name", "emp.age", "sal.salary")
right_join_df2.show()

# Stop the SparkSession
spark.stop()
```