PySpark: Read CSV from the Local File System
CSV is one of the most common formats for storing tabular data, and Spark offers robust tools for handling it efficiently. In this guide, we'll walk through reading a CSV file from the local file system with PySpark: setting up a Spark session, loading the file into a DataFrame both with and without a header row, and running a few basic operations on the result.
Throughout this guide we'll use a small `data.csv` file containing the following rows:

| Item | Price |
|---|---|
| Book | 100 |
| Pen | 10 |
| Bottle | 250 |
| Sofa | 10000 |
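If you want to follow along, you can generate the two input files yourself. A minimal sketch using the standard library (it writes to relative paths for illustration; the listings below use absolute paths, so adjust to match your machine):

```python
import csv

# Sample rows from the table above
rows = [("Book", 100), ("Pen", 10), ("Bottle", 250), ("Sofa", 10000)]

# data.csv includes a header row
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Item", "Price"])
    writer.writerows(rows)

# data_without_header.csv holds the same rows, but no header line
with open("data_without_header.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```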
First, create a local Spark session and read the file, telling Spark that the first row is a header:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a local Spark session
spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# Read a CSV file whose first row is a header
raw_df_with_header = spark_session.read.option("header", True).csv(
    "/Users/apple/PycharmProjects/pyspark/data/csv/data.csv"
)
raw_df_with_header.show()
```
Output:

```
+------+-----+
|  Item|Price|
+------+-----+
|  Book|  100|
|   Pen|   10|
|Bottle|  250|
|  Sofa|10000|
+------+-----+
```
When the file has no header row, define the schema yourself and pass it with `.schema()`. Note that `.option("schema", schema)` is not a valid reader option and is silently ignored, which would leave every column typed as a string:

```python
# Explicit schema for the header-less file
schema = StructType([
    StructField("item", StringType(), True),
    StructField("price", IntegerType(), True),
])

# Pass the schema with .schema(), not .option("schema", ...)
csv_df_with_schema = spark_session.read.option("header", False).schema(schema).csv(
    "/Users/apple/PycharmProjects/pyspark/data/csv/data_without_header.csv"
)
csv_df_with_schema.show()
```
Output:

```
+------+-----+
|  item|price|
+------+-----+
|  Book|  100|
|   Pen|   10|
|Bottle|  250|
|  Sofa|10000|
+------+-----+
```
Putting it all together, the complete program:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a local Spark session
spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# Explicit schema for the header-less file
schema = StructType([
    StructField("item", StringType(), True),
    StructField("price", IntegerType(), True),
])

# Read the file whose first row is a header
raw_df_with_header = spark_session.read.option("header", True).csv(
    "/Users/apple/PycharmProjects/pyspark/data/csv/data.csv"
)
raw_df_with_header.show()

# Read the header-less file, supplying the schema with .schema()
csv_df_with_schema = spark_session.read.option("header", False).schema(schema).csv(
    "/Users/apple/PycharmProjects/pyspark/data/csv/data_without_header.csv"
)
csv_df_with_schema.show()
```