PySpark: Print Word Frequency
Word count is one of the most common tasks in data processing, often used as a simple example to introduce big data processing frameworks like Apache Spark. PySpark, the Python API for Spark, allows you to leverage the power of distributed computing to perform word count operations on large datasets efficiently. In this blog post, we'll walk you through creating a word count program in PySpark that calculates word frequency.
The examples below read a small text file, word_count.txt, with the following contents:

i like pyspark program
pyspark is very simple
i like pyspark program
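The snippets that follow assume a SparkSession running in local mode has already been created; the setup, taken from the complete program at the end of this post, looks like this:

# Create a local SparkSession; the RDD API is accessed through its sparkContext.
from pyspark.sql import SparkSession

spark_session: SparkSession = SparkSession.builder.master("local").appName("world_count").getOrCreate()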
First, read the file into an RDD with textFile and print each line:

rdd = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/word_count.txt")
data = rdd.collect()
for x in data:
    print(x)
Output:

i like pyspark program
pyspark is very simple
i like pyspark program
Next, split each line into individual words using the flatMap transformation:

df_split = rdd.flatMap(lambda data: data.split(' '))
data = df_split.collect()
for x in data:
    print(x)
Output:

i
like
pyspark
program
pyspark
is
very
simple
i
like
pyspark
program
To count the frequency of each word, map each word to a key-value pair where the key is the word and the value is 1. This can be done using the map transformation.
df_pair = df_split.map(lambda value: (value, 1))
data = df_pair.collect()
for x in data:
    print(x)
Output:

('i', 1)
('like', 1)
('pyspark', 1)
('program', 1)
('pyspark', 1)
('is', 1)
('very', 1)
('simple', 1)
('i', 1)
('like', 1)
('pyspark', 1)
('program', 1)
Now, sum the values for each key (word) using the reduceByKey transformation. This will aggregate the counts for each word.
df_count = df_pair.reduceByKey(lambda x, y: x + y)
# df_count = df_pair.reduceByKey(add)  # works as well, given: from operator import add
data = df_count.collect()
for x in data:
    print(x)
Output (ordering may differ):

('pyspark', 3)
('i', 2)
('like', 2)
('program', 2)
('is', 1)
('very', 1)
('simple', 1)
Here is the complete program:

from operator import add
from pyspark.sql import SparkSession

spark_session: SparkSession = SparkSession.builder.master("local").appName("world_count").getOrCreate()

rdd = spark_session.sparkContext.textFile("/Users/apple/PycharmProjects/pyspark/data/text/word_count.txt")
data = rdd.collect()
for x in data:
    print(x)

df_split = rdd.flatMap(lambda data: data.split(' '))
data = df_split.collect()
for x in data:
    print(x)

df_pair = df_split.map(lambda value: (value, 1))
data = df_pair.collect()
for x in data:
    print(x)

df_count = df_pair.reduceByKey(lambda x, y: x + y)
# df_count = df_pair.reduceByKey(add)  # this works as well
data = df_count.collect()
for x in data:
    print(x)
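If you also want the most frequent words printed first, one possible extension (a sketch, not part of the program above) is to sort the (word, count) pairs by their counts using the standard RDD sortBy transformation before collecting:

# Continues from df_count defined above: sort pairs by count, highest first.
df_sorted = df_count.sortBy(lambda pair: pair[1], ascending=False)
for word, count in df_sorted.collect():
    print(word, count)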