Read a Parquet File Using Scala Spark

How to Read a Parquet File Using Spark Scala with Example

The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark.

Why is the Parquet format important in Spark?

  • Efficient Data Compression: Parquet files are optimized for storage efficiency. They use advanced compression techniques to reduce the size of data, which saves disk space and improves I/O efficiency.
  • Columnar Storage: Unlike row-based formats, Parquet stores data in columns. This makes it particularly effective for analytical queries that typically involve operations on specific columns rather than entire rows. Columnar storage allows for better performance in read-heavy operations.
  • Schema Evolution: Parquet supports schema evolution, meaning you can add or remove columns without affecting existing data. This flexibility is crucial for managing large datasets that change over time (see the sketch after this list).
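Schema merging can be requested explicitly at read time. The snippet below is a minimal sketch of this, assuming a hypothetical directory "data/parquet/evolving" that holds Parquet part files written with differing column sets; "mergeSchema" is a standard Parquet read option in Spark.

import org.apache.spark.sql.SparkSession

object SchemaEvolutionExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Parquet schema merging example")
      .master("local")
      .getOrCreate()

    // "mergeSchema" reconciles part files whose column sets differ, so rows
    // written before a column was added load alongside rows written after it.
    val mergedDF = spark.read
      .option("mergeSchema", "true")
      .parquet("data/parquet/evolving") // hypothetical path with mixed-schema files

    mergedDF.printSchema()
    spark.stop()
  }

}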

Sample Data

Roll  Name    Age
1     Rahul   30
2     Sanjay  67
3     Ranjan  67
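To make the read example below reproducible, the sample data first has to exist as Parquet on disk. The following is a minimal sketch that writes the three rows above to the path used later in this article; the object name WriteSampleParquet is just an illustrative choice.

import org.apache.spark.sql.SparkSession

object WriteSampleParquet {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Writing sample parquet data")
      .master("local")
      .getOrCreate()

    import spark.implicits._

    // Build a small DataFrame matching the sample data above
    val sampleDF = Seq(
      (1, "Rahul", 30),
      (2, "Sanjay", 67),
      (3, "Ranjan", 67)
    ).toDF("Roll", "Name", "Age")

    // Write it as Parquet to the path read later in this article
    sampleDF.write.mode("overwrite").parquet("data/parquet/data")

    spark.stop()
  }

}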
  
import org.apache.spark.sql.SparkSession

object ReadParquet {

  def main(args: Array[String]): Unit = {
    // Create a local SparkSession for this example
    val sparkSession = SparkSession
      .builder()
      .appName("Reading parquet file using scala spark")
      .master("local")
      .getOrCreate()

    // Read the Parquet files into a DataFrame; the schema comes from the Parquet metadata
    val parquetDF = sparkSession.read.parquet("data/parquet/data")

    // Print the DataFrame contents to the console
    parquetDF.show()

    sparkSession.stop()
  }

}

Output:

+----+------+---+
|Roll|  Name|Age|
+----+------+---+
|   1| Rahul| 30|
|   2|Sanjay| 67|
|   3|Ranjan| 67|
+----+------+---+

The list below covers some of the most commonly used options when reading a Parquet file, illustrated in the sketch that follows.
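As a minimal sketch, the snippet below combines a few such options in a single read, reusing the sparkSession from the example above; the path and option values are illustrative. mergeSchema, pathGlobFilter, and recursiveFileLookup are standard read options in Spark 3.x.

// Combine a few commonly used options when reading Parquet
val tunedDF = sparkSession.read
  .option("mergeSchema", "true")          // merge differing schemas across part files
  .option("pathGlobFilter", "*.parquet")  // only pick up files matching this pattern
  .option("recursiveFileLookup", "true")  // also scan nested sub-directories
  .parquet("data/parquet/data")
tunedDF.show()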