Read a Parquet File Using Scala Spark
The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark.
Why is the Parquet format important in Spark?
- Efficient Data Compression: Parquet files are optimized for storage efficiency. They use advanced compression techniques to reduce the size of data, which saves disk space and improves I/O efficiency.
- Columnar Storage: Unlike row-based formats, Parquet stores data in columns. This makes it particularly effective for analytical queries that typically involve operations on specific columns rather than entire rows. Columnar storage allows for better performance in read-heavy operations.
- Schema Evolution: Parquet supports schema evolution, meaning you can add or remove columns without affecting existing data. This flexibility is crucial for managing large datasets that change over time (a short sketch follows this list).
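As a rough illustration of schema evolution, the sketch below (the paths and object name are hypothetical) writes two batches of Parquet data with slightly different schemas and then reads them back with the mergeSchema option, so the older batch simply shows null for the column it lacks:

import org.apache.spark.sql.SparkSession

object SchemaEvolutionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Parquet schema evolution sketch")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    // First batch: two columns only.
    Seq((1, "Rahul")).toDF("Roll", "Name")
      .write.mode("overwrite").parquet("data/parquet/evolution/batch=1")

    // Second batch: an extra Age column has been added.
    Seq((2, "Sanjay", 67)).toDF("Roll", "Name", "Age")
      .write.mode("overwrite").parquet("data/parquet/evolution/batch=2")

    // mergeSchema reconciles the two schemas; Age is null for the first batch.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("data/parquet/evolution")
    merged.printSchema()
    merged.show()

    spark.stop()
  }
}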
Sample Data
| Roll | Name   | Age |
|------|--------|-----|
| 1    | Rahul  | 30  |
| 2    | Sanjay | 67  |
| 3    | Ranjan | 67  |
import org.apache.spark.sql.SparkSession

object ReadParquet {
  def main(args: Array[String]): Unit = {
    // Create (or reuse) a local SparkSession.
    val sparkSession = SparkSession
      .builder()
      .appName("Reading parquet file using scala spark")
      .master("local")
      .getOrCreate()

    // Read the Parquet data into a DataFrame and display it.
    val parquetDF = sparkSession.read.parquet("data/parquet/data")
    parquetDF.show()

    sparkSession.stop()
  }
}
Output:
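Assuming the files under data/parquet/data hold the three sample rows shown above, the show() call prints something along these lines:

+----+------+---+
|Roll|  Name|Age|
+----+------+---+
|   1| Rahul| 30|
|   2|Sanjay| 67|
|   3|Ranjan| 67|
+----+------+---+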
The list below covers some of the most commonly used options when working with Parquet files (a short usage sketch follows the list):
- mergeSchema: Sets whether schemas collected from all Parquet part-files should be merged when reading.
- compression: Compression codec to use when saving to file (a write-side option). This can be one of the known case-insensitive shortened names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd).
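As a minimal sketch (the input and output paths are assumptions), mergeSchema is passed as a read option, while compression is passed as a write option:

import org.apache.spark.sql.SparkSession

object ParquetOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Parquet options sketch")
      .master("local")
      .getOrCreate()

    // mergeSchema: reconcile the schemas of all Parquet part-files while reading.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet("data/parquet/data")

    // compression: pick the codec for the files being written.
    df.write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet("data/parquet/output") // hypothetical output path

    spark.stop()
  }
}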