How to Save Spark DataFrame in Parquet Format Using Spark Scala
The Parquet format is a highly efficient columnar storage format that is widely used in big data and analytics applications, particularly with Apache Spark.
Why is the Parquet format important in Spark?
- Efficient Data Compression: Parquet files are optimized for storage efficiency. They use advanced compression techniques to reduce the size of data, which helps save disk space and improve I/O efficiency (see the sketch after this list).
- Columnar Storage: Unlike row-based formats, Parquet stores data in columns. This makes it particularly effective for analytical queries that typically involve operations on specific columns rather than entire rows. Columnar storage allows for better performance in read-heavy operations.
- Schema Evolution: Parquet supports schema evolution, meaning you can add or remove columns without affecting existing data. This flexibility is crucial for managing large datasets that may change over time.
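To make the first two points concrete, here is a minimal sketch intended for spark-shell (the output path data/parquet/compressed/ is a placeholder, and the column names come from the sample data shown below). It writes Parquet with an explicit compression codec and then reads back only the columns a query needs.

// Sketch for spark-shell, where a SparkSession is already available as `spark`.
// Assumes the sample CSV shown below exists at data/csv/test.csv.
val df = spark.read.option("header", "true").csv("data/csv/test.csv")

// Write Parquet with an explicit codec (snappy is Spark's default).
df.write.option("compression", "snappy").parquet("data/parquet/compressed/")

// Columnar storage: selecting only Name and Age lets Spark scan just those columns.
spark.read.parquet("data/parquet/compressed/").select("Name", "Age").show()

// Schema evolution: Parquet files written with slightly different schemas
// can be reconciled on read with the mergeSchema option.
// spark.read.option("mergeSchema", "true").parquet("data/parquet/compressed/")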
Sample Data
Roll | Name   | Age
1    | Rahul  | 30
2    | Sanjay | 67
3    | Ranjan | 67
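Note that when a CSV is read with only the header option, as in the program below, every column is loaded as a string, and those string types are what end up in the Parquet schema. If numeric types are wanted for Roll and Age, the read can optionally use schema inference; the variant below is a sketch, not part of the program that follows.

// Optional variant; assumes the sparkSession built in the program below.
val typedDF = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true") // let Spark detect Roll and Age as integers
  .csv("data/csv/test.csv")

The complete program to read the sample CSV, display it, and save it as Parquet is shown below.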
import org.apache.spark.sql.SparkSession

object SaveDataFrame {
  def main(args: Array[String]): Unit = {
    // Create a local SparkSession
    val sparkSession = SparkSession
      .builder()
      .appName("Our First scala spark code")
      .master("local")
      .getOrCreate()

    // Read the sample CSV file, treating the first line as the header
    val testDF = sparkSession.read.option("header", "true").csv("data/csv/test.csv")
    testDF.show()

    // Save the DataFrame as Parquet files under data/parquet/data/
    testDF.write.parquet("data/parquet/data/")

    sparkSession.stop()
  }
}
Output:
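Assuming the CSV matches the sample data above, testDF.show() prints the DataFrame like this:

+----+------+---+
|Roll|  Name|Age|
+----+------+---+
|   1| Rahul| 30|
|   2|Sanjay| 67|
|   3|Ranjan| 67|
+----+------+---+

The Parquet write creates the directory data/parquet/data/ containing one or more part-*.parquet files (snappy-compressed by default) plus a _SUCCESS marker, rather than a single file. Rerunning the job as written fails because the output path already exists (Spark's default save mode is ErrorIfExists); a save mode can be set to change that, for example:

// Overwrite the existing output directory instead of failing on rerun.
testDF.write.mode("overwrite").parquet("data/parquet/data/")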