
Tag Archive for: python, dataframe, pyspark

Popping chunks of rows from a pyspark dataframe

I’m looking for a way to process a PySpark DataFrame in chunks, regardless of how many rows it contains (5, 2,000,000, etc.): loop over it and “pop” a chunk of x rows at a time (or fewer, if fewer remain) until the whole DataFrame has been processed. It’s essentially the same idea as “pop”, but with multiple rows at a time.
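One possible approach, sketched below under assumptions not stated in the question, is to attach a stable row index and slice it range by range; the toy DataFrame, the chunk size, and the `process()` handler are purely illustrative.

```python
# A minimal sketch, assuming a SparkSession is available; the column names,
# chunk size, and the process() handler are illustrative, not from the question.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame standing in for the real one.
df = spark.range(0, 23).withColumnRenamed("id", "value")

CHUNK_SIZE = 5  # "x" rows per chunk


def process(chunk_df):
    # Hypothetical downstream handler; here it just reports the chunk size.
    print(chunk_df.count(), "rows in this chunk")


# Attach a stable row index so chunks can be sliced deterministically.
indexed = (
    df.rdd
      .zipWithIndex()
      .map(lambda pair: Row(idx=pair[1], **pair[0].asDict()))
      .toDF()
      .persist()  # avoid recomputing the lineage on every iteration
)

total = indexed.count()
for start in range(0, total, CHUNK_SIZE):
    chunk = indexed.filter(
        (indexed.idx >= start) & (indexed.idx < start + CHUNK_SIZE)
    ).drop("idx")
    process(chunk)  # the last chunk simply contains fewer than CHUNK_SIZE rows

indexed.unpersist()
```

Persisting the indexed DataFrame before the loop matters here: without it, the `zipWithIndex` lineage would be recomputed on every iteration.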

Calculate rolling counts from two different time series columns in pyspark

I have a PySpark DataFrame with two timestamp columns: arrival and departure. The idea is to calculate, for each arrival, the number of departure events that fall within a window derived from the arrival time.
For example, if an item arrived at 23:00, I would like to take a 12-hour look-back window [11:00, 23:00] and count the number of items that departed within that interval.
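A sketch of one way to do this, assuming a DataFrame with timestamp columns `arrival` and `departure`: self-join arrivals against departures on a 12-hour range condition and aggregate. This is a join-based sketch rather than a window-function solution, and the toy data, alias names, and output column name are illustrative.

```python
# A minimal sketch, assuming timestamp columns `arrival` and `departure`;
# the toy data and the output column name are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real DataFrame.
df = spark.createDataFrame(
    [
        ("2024-01-01 23:00:00", "2024-01-01 12:30:00"),
        ("2024-01-01 09:00:00", "2024-01-01 22:15:00"),
        ("2024-01-02 01:00:00", "2024-01-01 20:00:00"),
    ],
    ["arrival", "departure"],
).select(
    F.to_timestamp("arrival").alias("arrival"),
    F.to_timestamp("departure").alias("departure"),
)

arrivals = df.select("arrival").alias("a")
departures = df.select("departure").alias("d")

# For each arrival, keep departures that fall in the preceding 12 hours.
in_window = (
    (F.col("d.departure") >= F.col("a.arrival") - F.expr("INTERVAL 12 HOURS"))
    & (F.col("d.departure") <= F.col("a.arrival"))
)

rolling_counts = (
    arrivals.join(departures, on=in_window, how="left")
            .groupBy("a.arrival")
            .agg(F.count("d.departure").alias("departures_last_12h"))
)

rolling_counts.show(truncate=False)
```

Note that this is effectively a conditional self-join, so it can become expensive on large data; pre-filtering or bucketing the timestamps to a coarser grain before joining is a common way to limit the comparison space.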