Save PySpark DataFrame in Parquet Format
The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark and PySpark.
| Roll | Name | Age |
|---|---|---|
| 1 | Rahul | 30 |
| 2 | Sanjay | 67 |
| 3 | Ranjan | 55 |
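To make "columnar" concrete, here is a minimal pure-Python sketch (not Parquet's actual on-disk encoding) contrasting a row-oriented and a column-oriented layout of the sample table:

```python
# Minimal illustration of row vs. column layout (not Parquet's real encoding).
rows = [(1, "Rahul", 30), (2, "Sanjay", 67), (3, "Ranjan", 55)]

# Row-oriented storage keeps each record together.
row_layout = list(rows)

# Column-oriented storage keeps each column together, so a query that only
# needs Age can skip Roll and Name entirely.
roll, name, age = (list(col) for col in zip(*rows))
column_layout = {"Roll": roll, "Name": name, "Age": age}

print(column_layout["Age"])  # [30, 67, 55]
```

Because analytics queries typically touch a few columns of a wide table, this column-wise grouping is what lets Parquet readers skip irrelevant data and compress each column effectively.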
First, import the necessary libraries:
```python
from pyspark.sql import SparkSession
```
Create a sample DataFrame matching the table above:

```python
spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()

data = [(1, "Rahul", 30), (2, "Sanjay", 67), (3, "Ranjan", 55)]
columns = ["Roll", "Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
```
Save the DataFrame as a Parquet file. By default, Spark raises an error if the output path already exists; use `df.write.mode("overwrite").parquet(...)` to replace existing output instead:

```python
df.write.parquet("data/parquet/output/")
```
Complete code:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()

data = [(1, "Rahul", 30), (2, "Sanjay", 67), (3, "Ranjan", 55)]
columns = ["Roll", "Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

# Write the DataFrame to the output directory in Parquet format.
df.write.parquet("data/parquet/output/")

spark.stop()
```