Save column value into string variable - PySpark
The `collect()` function in Apache PySpark retrieves all rows of a DataFrame and returns them to the driver as a list of `Row` objects. This is useful when you need to bring data back to the driver node for further processing in local memory, such as saving a single column value into a Python string variable. In this tutorial, we cover how to use `collect()` in PySpark with practical examples, using the following sample student data:
| Roll | First Name | Age | Last Name |
|------|------------|-----|-----------|
| 1    | Rahul      | 30  | Yadav     |
| 2    | Sanjay     | 20  | Gupta     |
| 3    | Ranjan     | 67  | Kumar     |
We need to import the necessary PySpark libraries:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql import Row
```
Next, we create a SparkSession and build the sample DataFrame:

```python
# Initialize the SparkSession (required before creating DataFrames)
spark = SparkSession.builder \
    .appName("Collect in PySpark") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = "roll INT, first_name STRING, age INT, last_name STRING"

# Create the data
data = [
    Row(1, "rahul", 30, "yadav"),
    Row(2, "sanjay", 20, "gupta"),
    Row(3, "ranjan", 67, "kumar"),
]

# Create the DataFrame
test_df = spark.createDataFrame(data, schema=schema)
```
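Before collecting anything to the driver, it can be worth a quick sanity check that the DataFrame matches the table above; here is a minimal sketch using the standard `printSchema()` and `show()` methods on the `test_df` defined above:

```python
# Print the schema and display the rows on the console
test_df.printSchema()
test_df.show()
```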
The `collect()` function returns every row of the DataFrame to the driver as a list of `Row` objects, so individual values can be extracted by index. Here, we pull out a single column value and save it into a string variable:
```python
# Get the first name of the student whose roll number is 2;
# [0] selects the first matching Row, [1] the first_name column
first_name = test_df.filter(col("roll") == lit(2)).collect()[0][1]
print(first_name)  # Output: sanjay
```
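Note that the positional index `[1]` relies on the column order in the schema. A `Row` also supports access by field name, which is less brittle if the schema changes; the sketch below shows this alternative, assuming the same `test_df`:

```python
# first() returns the first Row of the filtered DataFrame,
# or None if no row matches the filter
row = test_df.filter(col("roll") == lit(2)).first()
if row is not None:
    first_name = row["first_name"]  # equivalent: row.first_name
    print(first_name)               # Output: sanjay
```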
Here is the complete script, putting all the steps together:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql import Row

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Collect in PySpark") \
    .master("local") \
    .getOrCreate()

# Define the schema
schema = "roll INT, first_name STRING, age INT, last_name STRING"

# Create the data
data = [
    Row(1, "rahul", 30, "yadav"),
    Row(2, "sanjay", 20, "gupta"),
    Row(3, "ranjan", 67, "kumar"),
]

# Create the DataFrame
test_df = spark.createDataFrame(data, schema=schema)

# Collect all rows from the DataFrame
collected_rows = test_df.collect()
for row in collected_rows:
    print(row)

# Get the first name of the student whose roll number is 2
first_name = test_df.filter(col("roll") == lit(2)).collect()[0][1]
print(first_name)

# Stop the SparkSession
spark.stop()
```
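Finally, keep in mind that `collect()` materializes the entire DataFrame in the driver's memory, which can exhaust memory on large datasets. When only a sample is needed, `take(n)` (or `limit(n).collect()`) bounds how much data reaches the driver; a minimal sketch, again assuming the same `test_df`:

```python
# take(n) retrieves only the first n rows to the driver,
# avoiding pulling the whole DataFrame into local memory
first_two = test_df.take(2)
for row in first_two:
    print(row)
```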