PySpark mapPartitions evaluates the function more times than expected

I’m working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code block, the reformat function should be called four times but is called five times: four times when the DataFrame is cached and a fifth time when the show method is invoked.
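One way to check where each call happens is to count invocations with an accumulator and print the total after every step. The following is only a minimal sketch, not the code from the question: the body of reformat, the sample data, the column names, and the helper count_calls are all assumptions made for illustration. It builds the DataFrame with toDF and no explicit schema, which is one place an additional evaluation could come from, since Spark has to look at the data to infer column types. Whether that applies to the original code is not confirmed here.

```python
def reformat(rows, calls=None):
    """Hypothetical per-partition function; the body runs once each time a partition is evaluated."""
    if calls is not None:
        calls.add(1)  # record one evaluation of this partition
    for x in rows:
        yield (x, x * 2)


def count_calls():
    """Print how many times reformat has run after each step (requires PySpark)."""
    # Imported here so that reformat can be used and tested without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").appName("call-count").getOrCreate()
    sc = spark.sparkContext
    calls = sc.accumulator(0)  # workers add to it, the driver reads the total

    # Four partitions, so one full pass over the data means four calls.
    rdd = sc.parallelize(range(8), 4).mapPartitions(lambda it: reformat(it, calls))

    df = rdd.toDF(["x", "doubled"])  # no explicit schema given (assumption)
    print("after toDF:", calls.value)

    df.cache().count()  # materialize the cache
    print("after cache + count:", calls.value)

    df.show()
    print("after show:", calls.value)

    spark.stop()
```

Calling count_calls() in an environment with PySpark installed prints the running total after each step, which shows which operation triggers each evaluation. Accumulator updates made inside transformations are repeated whenever a partition is recomputed, which is exactly what makes them useful for this kind of counting.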
