
Parallelize writing to S3 using foreach in PySpark

I have a use case where I need to write data from a list to S3 in parallel.
The list is a list of lists -> [[guid1, guid2], [guid3, guid4], ...]
The function get_guids_combined() returns the above list.
I have to parallelize the writes for each inner list by filtering it out of the main DataFrame.
I am running into issues when using the SparkContext (sc): it ends up being executed on a worker node, whereas it is only supposed to be used on the driver. How do I achieve this while circumventing that problem?
Code:
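The original code is not included here. For reference, one common pattern that avoids touching the SparkContext on executors is to keep every Spark call on the driver and fan the per-group writes out over a driver-side thread pool (Spark job submission blocks on the JVM, so threads overlap the S3 writes). The sketch below is an assumption about the setup, not the asker's code; `get_guids_combined`, `main_df`, and the S3 path are hypothetical names:

```python
from concurrent.futures import ThreadPoolExecutor


def write_groups_in_parallel(guid_groups, write_group, max_workers=4):
    """Run write_group(guids) for each GUID group on a driver-side thread pool.

    write_group is any callable that filters the main DataFrame and writes it;
    because the threads live on the driver, no SparkContext use happens on
    workers. Returns the per-group results in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_group, group) for group in guid_groups]
        return [future.result() for future in futures]


# Hypothetical PySpark usage (names and paths are assumptions):
# groups = get_guids_combined()        # [[guid1, guid2], [guid3, guid4], ...]
# write_groups_in_parallel(
#     groups,
#     lambda g: main_df.filter(main_df.guid.isin(g))
#                      .write.mode("overwrite")
#                      .parquet("s3://my-bucket/output/" + "_".join(g)),
# )
```

An alternative, if the groups map cleanly to a column value, is a single partitioned write (`df.write.partitionBy("group_id").parquet(...)`), which lets Spark parallelize the S3 output itself without any driver-side threading.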