Comprehensive Guide to DataFrames in PySpark
A PySpark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a data frame in Pandas. Built on top of RDDs, DataFrames provide a higher-level abstraction for structured data processing, and Spark optimizes DataFrame queries automatically through its Catalyst optimizer instead of leaving that work to the programmer. DataFrames support a wide range of data sources, including JSON, CSV, and Parquet, and integrate with Spark SQL, so the same data can be queried with SQL while still being processed in parallel across large datasets.
DataFrames can be created using various methods in PySpark:
You can create a DataFrame from an existing RDD, or from a local list of Row objects, by calling toDF on the RDD or by passing the data and an explicit schema to createDataFrame:
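    # Row, the schema types, and spark_session come from the imports and
    # session setup in the complete listing.
    data = [Row("book", 100), Row("pen", 10), Row("bottle", 250)]
    schema = StructType([
        StructField("item", StringType(), True),
        StructField("price", IntegerType(), True),
    ])
    df = spark_session.createDataFrame(data=data, schema=schema)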
You can create DataFrames from structured data files such as CSV, JSON, and Parquet using the read property of the SparkSession, which returns a DataFrameReader:
    # from an external source
    df = spark_session.read.option("header", True).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data.csv")
The complete example, combining both approaches:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

    # using a list
    data = [Row("book", 100), Row("pen", 10), Row("bottle", 250)]
    schema = StructType([
        StructField("item", StringType(), True),
        StructField("price", IntegerType(), True),
    ])
    df = spark_session.createDataFrame(data=data, schema=schema)

    # from an external source
    df = spark_session.read.option("header", True).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data.csv")