
How do I speed up the execution time of my PySpark Script?

I am new to PySpark, and I've been tasked with developing a script that processes, transforms, and writes around 60 million records per hour (around 600 million records per day). So far I've built a script that applies almost 20 transformations and then saves all the data to multiple directories, and it currently processes around 65 million records in about 28 minutes. However, some stages take a lot of time; is there any way I can speed those up? Stage 10 and stage 37 (which is similar to stage 10) take up most of the time.