PySpark mapPartitions evaluates the function more times than expected
I’m using PySpark to process large amounts of data, and I noticed that the function passed to mapPartitions
is executed one more time than expected. For instance, in the following code block, the reformat
function should be called four times but is actually called five times: four times when the DataFrame is cached and a fifth time when the show
method is invoked.