PySpark: read a CSV file from the local file system

How to read CSV with header using PySpark

CSV files are a popular format for data storage, and Spark offers robust tools for handling them efficiently. In this guide, we’ll explore how to read a CSV file using PySpark. We'll cover setting up your Spark session, loading the CSV file into a DataFrame, and performing basic data operations. By the end, you'll be equipped to handle CSV files effectively in your PySpark applications. Let’s dive into the process step by step.

Use the read method on the SparkSession to load the CSV file into a DataFrame. For this example, consider the sample data below.

Item Prices

Item     Price
Book     100
Pen      10
Bottle   250
Sofa     10000
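
For reference, and assuming the files are comma-separated, the two files used below contain the same rows; data.csv has a header line while data_without_header.csv does not:

data.csv
Item,Price
Book,100
Pen,10
Bottle,250
Sofa,10000

data_without_header.csv
Book,100
Pen,10
Bottle,250
Sofa,10000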

Read CSV with header using PySpark

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# header=True tells Spark to treat the first line of the file as column names
raw_df_with_header = spark_session.read.option("header", True).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data.csv")
raw_df_with_header.show()

Output:

+------+-----+
|  Item|Price|
+------+-----+
|  Book|  100|
|   Pen|   10|
|Bottle|  250|
|  Sofa|10000|
+------+-----+
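
Note that when no schema is supplied and inferSchema is not enabled, Spark reads every CSV column as a string. As a minimal sketch, you can confirm this by calling printSchema() on the DataFrame created above:

raw_df_with_header.printSchema()

which prints:

root
 |-- Item: string (nullable = true)
 |-- Price: string (nullable = true)

The next section shows how to get properly typed columns by supplying an explicit schema.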

Read CSV with schema using PySpark


# Define the column names and types explicitly
schema = StructType([StructField("item", StringType(), True), StructField("price", IntegerType(), True)])

# Pass the schema with .schema() (an .option("schema", ...) call is ignored by the CSV reader);
# header=False because this file has no header row
csv_df_with_schema = spark_session.read.option("header", False).schema(schema).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data_without_header.csv")
csv_df_with_schema.show()

Output:

+------+-----+
|  item|price|
+------+-----+
|  Book|  100|
|   Pen|   10|
|Bottle|  250|
|  Sofa|10000|
+------+-----+

The list below covers some of the most commonly used options when reading a CSV file; a sketch combining several of them follows the list.

- header: whether the first line of the file should be used as column names (default false)
- inferSchema: whether Spark should scan the data and infer column types (default false; otherwise all columns are read as strings)
- sep: the field delimiter (default ,)
- quote: the quote character (default ")
- escape: the escape character (default \)
- nullValue: the string that should be interpreted as null
- dateFormat / timestampFormat: the patterns used to parse date and timestamp columns
- mode: how malformed rows are handled (PERMISSIVE, DROPMALFORMED, or FAILFAST)
- multiLine: whether a single record can span multiple lines (default false)
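
As a sketch reusing the same file path as above, several of these options can be chained on the same reader; the nullValue and mode settings here are illustrative choices, not requirements of the sample data:

df = (
    spark_session.read
    .option("header", True)        # first line contains column names
    .option("inferSchema", True)   # infer column types from the data
    .option("sep", ",")            # field delimiter (comma is the default)
    .option("nullValue", "NA")     # treat the string "NA" as null
    .option("mode", "PERMISSIVE")  # keep malformed rows, nulling bad fields
    .csv("/Users/apple/PycharmProjects/pyspark/data/csv/data.csv")
)
df.show()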


Complete Code Example


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# Explicit schema for the file that has no header row
schema = StructType([
    StructField("item", StringType(), True),
    StructField("price", IntegerType(), True),
])

# Read a CSV file whose first line is the header
raw_df_with_header = spark_session.read.option("header", True).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data.csv")
raw_df_with_header.show()

# Read a CSV file without a header row, applying the explicit schema
csv_df_with_schema = spark_session.read.option("header", False).schema(schema).csv("/Users/apple/PycharmProjects/pyspark/data/csv/data_without_header.csv")
csv_df_with_schema.show()
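
The introduction also mentioned basic data operations. As a minimal sketch continuing from csv_df_with_schema above (which has a properly typed integer price column), you can filter and aggregate the DataFrame:

from pyspark.sql import functions as F

# Items costing more than 50
csv_df_with_schema.filter(csv_df_with_schema.price > 50).show()

# Total price across all items
csv_df_with_schema.agg(F.sum("price").alias("total_price")).show()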