Introduction to RDD in PySpark - Comprehensive Guide
A Resilient Distributed Dataset (RDD) is the fundamental data structure of PySpark: an immutable, distributed collection of objects that can be processed in parallel. In this comprehensive guide, we will explore the creation, transformation, and operations on RDDs using PySpark.
RDDs are partitioned across the nodes of a cluster, are designed for fault tolerance, and can efficiently recompute lost partitions. They can be built from data stored in a variety of systems, such as HDFS or a local file system, enabling scalable and resilient big data processing.
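To make partitioning and immutability concrete, here is a minimal sketch. It assumes the spark_session object created in the full example at the end of this section; the numbers RDD and the numSlices value are purely illustrative.

# Minimal sketch: an RDD is split into partitions, and transformations
# such as map() produce a new RDD instead of modifying the original.
numbers = spark_session.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)  # illustrative data
print(numbers.getNumPartitions())       # 2 -> elements are spread across two partitions
doubled = numbers.map(lambda x: x * 2)  # transformation: builds a new RDD lazily
print(doubled.collect())                # [2, 4, 6, 8]; the original RDD is unchanged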
There are multiple ways to create RDDs in PySpark:
You can create an RDD from a local collection (such as a Python list) using the parallelize method:
# using a list
data = ["rahul", "hello", "hi"]
df = spark_session.sparkContext.parallelize(data)
df.foreach(lambda element: print(element))  # foreach runs on the executors; the output appears on the console only when running on a local machine
Because foreach executes on the workers, a common pattern for inspecting a small RDD on the driver is to collect it into a Python list first and then print it:

df = spark_session.sparkContext.parallelize(data)
collected = df.collect()  # brings all elements back to the driver as a Python list
for element in collected:
    print(element)
RDDs can be created from data in external storage systems, such as HDFS, S3, or any Hadoop-supported file system:
# from an external source
raw_df = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/data.txt")
raw_df.foreach(lambda line: print(line))  # each line of the file becomes one element of the RDD
Putting both approaches together, the complete example looks like this:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# using a list
data = ["rahul", "hello", "hi"]
df = spark_session.sparkContext.parallelize(data)
df.foreach(lambda element: print(element))

# from an external source
raw_df = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/data.txt")
raw_df.foreach(lambda line: print(line))
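As a brief preview of the transformations and operations explored later in this guide, a transformation such as filter can be chained onto the raw_df RDD created above, and an action such as count triggers the actual computation. This sketch assumes data.txt exists at the path shown above.

# transformation: build a new RDD containing only the non-empty lines
non_empty = raw_df.filter(lambda line: len(line.strip()) > 0)
# action: triggers execution across the partitions and returns a number to the driver
print(non_empty.count())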