How to Save Spark DataFrame in Parquet Format Using Spark Scala

The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark.

Why is the Parquet format important in Spark?

  • Efficient Data Compression: Parquet files are optimized for storage efficiency. They use advanced compression techniques to reduce the size of data, which saves disk space and improves I/O efficiency (a sketch of choosing the compression codec follows this list).
  • Columnar Storage: Unlike row-based formats, Parquet stores data in columns. This makes it particularly effective for analytical queries that typically involve operations on specific columns rather than entire rows. Columnar storage allows for better performance in read-heavy operations.
  • Schema Evolution: Parquet supports schema evolution, meaning you can add or remove columns without breaking existing data. This flexibility is crucial for managing large datasets that change over time (see the mergeSchema sketch after this list).
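
Parquet's codec is configurable at write time. A minimal sketch of picking a codec explicitly, using the testDF DataFrame from the example below (the output path data/parquet/compressed/ is just an illustration):

  // Write Parquet with an explicit compression codec; snappy is Spark's default
  testDF.write
    .option("compression", "snappy") // other values include "gzip", "zstd", "none"
    .parquet("data/parquet/compressed/") // hypothetical output path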

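To illustrate schema evolution, Spark's mergeSchema read option reconciles Parquet files written with different column sets into a single schema. A hedged sketch, assuming files with and without an Age column live under the same directory:

  // mergeSchema unifies differing file schemas; rows from files that
  // lack a column get null for that column
  val mergedDF = sparkSession.read
    .option("mergeSchema", "true")
    .parquet("data/parquet/data/")
  mergedDF.printSchema()
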
Sample Data

Roll  Name    Age
1     Rahul   30
2     Sanjay  67
3     Ranjan  67
  import org.apache.spark.sql.SparkSession

  object SaveDataFrame {
    def main(args: Array[String]): Unit = {
      // Create a local SparkSession
      val sparkSession = SparkSession
        .builder()
        .appName("Save Spark DataFrame in Parquet Format")
        .master("local")
        .getOrCreate()

      // Read the sample CSV file, treating the first row as the header
      val testDF = sparkSession.read.option("header", "true").csv("data/csv/test.csv")
      testDF.show()

      // Save the DataFrame as Parquet files under data/parquet/data/
      testDF.write.parquet("data/parquet/data/")

      sparkSession.stop()
    }
  }
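
Note that write.parquet fails if the target directory already exists (the default save mode is ErrorIfExists). A short sketch of the standard save modes and of partitioning the output; the data/parquet/by_age/ path is hypothetical:

  // Overwrite any existing output instead of failing
  testDF.write.mode("overwrite").parquet("data/parquet/data/")

  // Partition the files by a column; this creates Age=30/, Age=67/, ... subdirectories
  testDF.write.mode("overwrite").partitionBy("Age").parquet("data/parquet/by_age/")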

Output:

  +----+------+---+
  |Roll|  Name|Age|
  +----+------+---+
  |   1| Rahul| 30|
  |   2|Sanjay| 67|
  |   3|Ranjan| 67|
  +----+------+---+
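
To verify the write, the Parquet directory can be read back into a DataFrame; a minimal sketch:

  // Parquet stores the schema in the files themselves, so no header option is needed
  val parquetDF = sparkSession.read.parquet("data/parquet/data/")
  parquetDF.show()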