Save PySpark DataFrame in Parquet Format
Parquet is a highly efficient columnar storage format widely used in big data and analytics applications, particularly with Apache Spark and PySpark. Its columnar layout and built-in compression make it faster to scan and smaller on disk than row-oriented formats such as CSV. Suppose we want to save the following data:
| Roll | Name | Age |
|------|--------|-----|
| 1    | Rahul  | 30  |
| 2    | Sanjay | 67  |
| 3    | Ranjan | 55  |
First, import the `SparkSession` class:

```python
from pyspark.sql import SparkSession
```
Create a `SparkSession` and a sample DataFrame:

```python
spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()

data = [("Rahul", 30), ("Sanjay", 67), ("Ranjan", 55)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
```
Save the DataFrame as a Parquet file:

```python
df.write.parquet("data/parquet/output/")
```
Putting it all together:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()

data = [("Rahul", 30), ("Sanjay", 67), ("Ranjan", 55)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()

df.write.parquet("data/parquet/output/")
spark.stop()
```