How to Save PySpark DataFrame in Parquet Format

The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark and PySpark.

Why Is the Parquet Format Important in Spark?

  • Efficient Data Compression: Parquet is optimized for storage efficiency. Because values of the same type are stored together, it can apply column-level compression and encodings (such as dictionary and run-length encoding) that shrink data on disk and reduce I/O.
  • Columnar Storage: Unlike row-based formats, Parquet stores data by column. This makes it particularly effective for analytical queries, which typically touch a few columns rather than entire rows, and it improves performance in read-heavy workloads.
  • Schema Evolution: Parquet supports schema evolution, meaning columns can be added or removed without rewriting existing data. This flexibility is valuable for large datasets that change over time. A short sketch illustrating these three points follows this list.
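
The hedged sketch below illustrates these three points: it writes a small DataFrame with an explicit compression codec, reads back a single column, and merges schemas after appending a batch with an extra column. The paths, the appName, and the extra City column are illustrative only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Parquet Features").getOrCreate()
df = spark.createDataFrame([("Rahul", 30), ("Sanjay", 67)], ["Name", "Age"])

# Compression: pick a codec explicitly (snappy is Spark's default for Parquet).
df.write.mode("overwrite").option("compression", "snappy").parquet("data/parquet/demo/")

# Columnar storage: selecting one column reads only that column's data from disk.
spark.read.parquet("data/parquet/demo/").select("Age").show()

# Schema evolution: append a batch with an extra column, then merge schemas on read.
df2 = spark.createDataFrame([("Ranjan", 55, "Delhi")], ["Name", "Age", "City"])
df2.write.mode("append").parquet("data/parquet/demo/")
spark.read.option("mergeSchema", "true").parquet("data/parquet/demo/").show()

spark.stop()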

Sample Data

Roll  Name    Age
1     Rahul   30
2     Sanjay  67
3     Ranjan  55

Step 1: Import Required Libraries

First, import the SparkSession class, which is the entry point for working with DataFrames:

from pyspark.sql import SparkSession
      

Step 2: Create a PySpark DataFrame

Create a SparkSession and a sample DataFrame from the data above:

spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()
# Sample rows matching the table above: Roll, Name, Age
data = [(1, "Rahul", 30), (2, "Sanjay", 67), (3, "Ranjan", 55)]
columns = ["Roll", "Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
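
The column types above are inferred from the Python values. To pin them explicitly, createDataFrame also accepts a StructType in place of the plain column-name list. A minimal sketch reusing the spark session and data defined above (the df_typed name is only illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Explicit schema matching the sample data: Roll, Name, Age
schema = StructType([
    StructField("Roll", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()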
      

Step 3: Save DataFrame in Parquet Format

Save the DataFrame as Parquet files in the output directory:

df.write.parquet("data/parquet/output/")
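
By default, this write fails if the output directory already exists. A common variation is to set a save mode and, optionally, partition the output by a column. A short sketch; the partitioned path here is only illustrative:

# Overwrite any existing output instead of failing when the directory exists.
df.write.mode("overwrite").parquet("data/parquet/output/")

# Optionally partition by a column; Spark writes one subdirectory per distinct
# value (for example Age=30/), which speeds up filters on that column.
df.write.mode("overwrite").partitionBy("Age").parquet("data/parquet/partitioned/")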
      

Complete Code

from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession.
spark = SparkSession.builder.appName("Save DataFrame").getOrCreate()

# Sample rows matching the table above: Roll, Name, Age
data = [(1, "Rahul", 30), (2, "Sanjay", 67), (3, "Ranjan", 55)]
columns = ["Roll", "Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()

# Write the DataFrame as Parquet files under the output directory.
df.write.parquet("data/parquet/output/")
spark.stop()
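
To confirm the round trip, the saved Parquet files can be read back with spark.read.parquet. A minimal sketch using the same output path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read DataFrame").getOrCreate()

# Load the Parquet files written above back into a DataFrame.
df_loaded = spark.read.parquet("data/parquet/output/")
df_loaded.show()
df_loaded.printSchema()

spark.stop()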