Introduction to RDD in PySpark - Comprehensive Guide
A Resilient Distributed Dataset (RDD) is the fundamental data structure of PySpark: an immutable, distributed collection of objects that can be processed in parallel. In this comprehensive guide, we will explore the creation, transformation, and operations on RDDs using PySpark.
RDDs are partitioned across the nodes of a cluster, are designed for fault tolerance, and can efficiently recompute lost partitions. They can be built from data stored in a variety of systems, such as HDFS or a local file system, enabling scalable and resilient big data processing.
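To make partitioning and immutability concrete, here is a minimal sketch. It assumes the spark_session object created in the full example at the end of this section; the numbers RDD and the numSlices value are purely illustrative.

# Minimal sketch: an RDD is split into partitions, and transformations
# such as map() produce a new RDD instead of modifying the original.
numbers = spark_session.sparkContext.parallelize([1, 2, 3, 4], numSlices=2)  # illustrative data
print(numbers.getNumPartitions())       # 2 -> elements are spread across two partitions
doubled = numbers.map(lambda x: x * 2)  # transformation: builds a new RDD lazily
print(doubled.collect())                # [2, 4, 6, 8]; the original RDD is unchanged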
There are multiple ways to create RDDs in PySpark:
You can create an RDD from a local collection (such as a Python list) using the parallelize method:
# using a list
data = ["rahul", "hello", "hi"]
df = spark_session.sparkContext.parallelize(data)
df.foreach(lambda element: print(element))  # foreach runs on the executors; the output appears on the console only when running on a local machine
Because foreach executes on the workers, a common pattern for inspecting a small RDD on the driver is to collect it into a Python list first and then print it:

df = spark_session.sparkContext.parallelize(data)
collected = df.collect()  # brings all elements back to the driver as a Python list
for element in collected:
    print(element)
RDDs can be created from data in external storage systems, such as HDFS, S3, or any Hadoop-supported file system:
# from an external source
raw_df = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/data.txt")
raw_df.foreach(lambda line: print(line))  # each line of the file becomes one element of the RDD
Putting both approaches together, the complete example looks like this:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder.master("local").appName("testing").getOrCreate()

# using a list
data = ["rahul", "hello", "hi"]
df = spark_session.sparkContext.parallelize(data)
df.foreach(lambda element: print(element))

# from an external source
raw_df = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/data.txt")
raw_df.foreach(lambda line: print(line))
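As a brief preview of the transformations and operations explored later in this guide, a transformation such as filter can be chained onto the raw_df RDD created above, and an action such as count triggers the actual computation. This sketch assumes data.txt exists at the path shown above.

# transformation: build a new RDD containing only the non-empty lines
non_empty = raw_df.filter(lambda line: len(line.strip()) > 0)
# action: triggers execution across the partitions and returns a number to the driver
print(non_empty.count())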