Using PySpark, what is the fastest way to find the most frequent combinations that appear in a list of lists?

I am using a Databricks PySpark notebook. I am trying to find the most efficient way of finding the most frequent combinations in a list of lists. There are 3.8 million combinations and 170,000 lists. I wrote an algorithm to do this in standard Python, but when I fed in 10 lists it took 1 minute to process, so at that rate it would take about 280 hours to process all of the lists. Since I have PySpark, I think there may be a more efficient way of handling this, or a more efficient way of using standard Python code, whichever works. I’d like to keep the processing time under 30 minutes, ideally 10.
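To make the problem concrete, here is a minimal sketch of one way it could be approached in standard Python. It rests on assumptions the description above does not confirm: that a combination is an unordered group of `k` items, that it "appears" in a list when all of its items are in that list, and that each list is short enough for its own `k`-item subsets to be enumerated. The function name `most_frequent_combinations`, the parameters `k` and `top_n`, and the sample data are all hypothetical. The idea is to generate combinations from each list and count them once, rather than testing every one of the 3.8 million combinations against every list.

```python
from collections import Counter
from itertools import combinations


def most_frequent_combinations(lists, k, top_n=10):
    """Count every k-item combination that occurs in at least one list.

    Assumption: a combination is unordered and counted once per list.
    """
    counts = Counter()
    for items in lists:
        # Deduplicate and sort so that (a, b) and (b, a) become the same key.
        unique_items = sorted(set(items))
        counts.update(combinations(unique_items, k))
    return counts.most_common(top_n)


# Hypothetical sample data, not taken from the question.
sample_lists = [
    ["a", "b", "c"],
    ["a", "b"],
    ["b", "c", "d"],
    ["a", "b", "d"],
]
top_pairs = most_frequent_combinations(sample_lists, k=2, top_n=3)
print(top_pairs)
```

If this shape fits the data, the same idea should carry over to PySpark by emitting `(combination, 1)` pairs with `RDD.flatMap` and summing them with `RDD.reduceByKey`. Whether that fits in the 30-minute target depends on how long the lists are, which is not stated here.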