Comprehensive Guide to DataFrames in Scala Spark

Introduction to DataFrames in Scala Spark

DataFrames are a key feature in Spark, representing distributed collections of data organized into named columns. They provide a higher-level abstraction than RDDs and benefit from automatic query optimization through Spark's Catalyst optimizer.

1. What is a DataFrame?

A DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in R/Python. DataFrames allow for the use of SQL queries and can be created from various data sources, including structured data files, tables in Hive, external databases, or existing RDDs.
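
For example, a DataFrame can be registered as a temporary view and queried with SQL. A minimal sketch, assuming a SparkSession named spark and a DataFrame df with name and age columns (the view name people is a placeholder):

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

// Run a SQL query against the view; the result is itself a DataFrame
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()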

2. Creating DataFrames

DataFrames can be created using various methods in Spark:

2.1. Creating DataFrames from RDDs

You can create a DataFrame from an existing RDD using a case class and the toDF method. Note that toDF is made available by importing the implicits from your SparkSession instance:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("DataFrame Example").getOrCreate()

// Brings toDF() and other implicit conversions into scope
import spark.implicits._

// In compiled code, define the case class outside the method that calls toDF()
case class Person(name: String, age: Int)

val rdd = spark.sparkContext.parallelize(Seq(Person("John", 30), Person("Doe", 25)))
val df = rdd.toDF()

df.show()
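
When a case class is impractical (for example, when the schema is only known at runtime), you can instead build a DataFrame from an RDD of Row objects with an explicit schema. A minimal sketch, reusing the spark session from above:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Describe the columns explicitly instead of deriving them from a case class
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val rowRDD = spark.sparkContext.parallelize(Seq(Row("John", 30), Row("Doe", 25)))
val dfWithSchema = spark.createDataFrame(rowRDD, schema)
dfWithSchema.show()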

2.2. Creating DataFrames from Structured Data Files

You can create DataFrames from structured data files such as CSV, JSON, and Parquet using the read method:

val testDF = sparkSession.read.option("header", "true").csv("data/csv/test.csv")
testDF.show()
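
The same reader handles other formats through dedicated methods; for example (a sketch, with the file paths as placeholders):

val jsonDF = sparkSession.read.json("data/json/test.json")
val parquetDF = sparkSession.read.parquet("data/parquet/test.parquet")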

Complete Code

import org.apache.spark.sql.SparkSession

object ReadCSV {

  def main(args: Array[String]): Unit = {

    // Build a local SparkSession for this example
    val sparkSession = SparkSession
      .builder()
      .appName("create dataframe scala spark")
      .master("local")
      .getOrCreate()

    // Read the CSV file, treating the first line as a header
    val testDF = sparkSession.read.option("header", "true").csv("data/csv/test.csv")
    testDF.show()

    sparkSession.stop()

  }

}
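
By default, csv reads every column as a string. If you want Spark to infer column types, add the inferSchema option, which costs an extra pass over the data. A small sketch with the same placeholder path:

val typedDF = sparkSession.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/csv/test.csv")
typedDF.printSchema()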