Comprehensive Guide to DataFrames in Scala Spark
DataFrames are a key feature in Spark, representing distributed collections of data organized into named columns. They provide a higher-level abstraction than RDDs and benefit from automatic query optimization through Spark's Catalyst optimizer.
Conceptually, a DataFrame is similar to a table in a relational database or a data frame in R/Python. DataFrames can be queried with SQL and can be created from various data sources, including structured data files, Hive tables, external databases, or existing RDDs.
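To illustrate the SQL side, the sketch below registers a small DataFrame as a temporary view and queries it with spark.sql. The view name and column names here are illustrative assumptions, not part of any fixed API.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SQL on DataFrames").getOrCreate()
import spark.implicits._

// Build a small DataFrame from an in-memory sequence (columns chosen for illustration)
val people = Seq(("John", 30), ("Doe", 25)).toDF("name", "age")

// Register it as a temporary view so it can be queried with SQL
people.createOrReplaceTempView("people")

// Run a SQL query against the view and show the result
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()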
DataFrames can be created using various methods in Spark:
You can create a DataFrame from an existing RDD using a case class and the toDF method:
import org.apache.spark.sql.SparkSession

// Case classes used with toDF should be defined outside the method that uses them
case class Person(name: String, age: Int)

val spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()
import spark.implicits._  // required for the toDF implicit conversion on RDDs

val rdd = spark.sparkContext.parallelize(Seq(Person("John", 30), Person("Doe", 25)))
val df = rdd.toDF()
df.show()
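If you need to control the schema explicitly, or your RDD holds generic Rows rather than case class instances, you can pass a StructType to createDataFrame. This is a minimal sketch that reuses the spark session from the example above; the column names are assumptions for illustration.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Describe the columns explicitly instead of inferring them from a case class
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

// An RDD of generic Rows matching the schema above
val rowRdd = spark.sparkContext.parallelize(Seq(Row("John", 30), Row("Doe", 25)))

val peopleDF = spark.createDataFrame(rowRdd, schema)
peopleDF.printSchema()
peopleDF.show()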
You can create DataFrames from structured data files such as CSV, JSON, and Parquet using the read method:
val testDF = sparkSession.read
  .option("header", "true")
  .csv("data/csv/test.csv")
testDF.show()
Here is the same CSV read wrapped in a complete, runnable program:

import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadCSV {
  def main(args: Array[String]): Unit = {
    // Create (or reuse) a local SparkSession for this job
    val sparkSession = SparkSession
      .builder()
      .appName("create dataframe scala spark")
      .master("local")
      .getOrCreate()

    // Read a CSV file, treating the first line as a header row
    val testDF = sparkSession.read
      .option("header", "true")
      .csv("data/csv/test.csv")
    testDF.show()

    sparkSession.stop()
  }
}
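The same read API covers the other structured formats mentioned above. The snippet below is a short sketch of JSON and Parquet reads using the sparkSession from the previous program; the file paths are illustrative placeholders.

// Read a JSON file (by default, one JSON object per line); path is a placeholder
val jsonDF = sparkSession.read.json("data/json/test.json")
jsonDF.show()

// Parquet files carry their own schema, so no options are needed; path is a placeholder
val parquetDF = sparkSession.read.parquet("data/parquet/test.parquet")
parquetDF.show()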