I am using pandas to compute the correlation between the columns “day_lag” and “Rank”, grouped by “ID” and “Month”.

But the run time is quite long.

df_daylag_corr = -df_new.groupby(['ID','Month'])['day_lag'].corr(df['Rank'])

df_daylag_corr = df_daylag_corr.reset_index()

df_daylag_corr= df_daylag_corr.rename(columns={'day_lag': 'cor_result'})

df_new = df_new.merge(df_daylag_corr, on=['ID','Month'], how='left')

How can I convert the above code to PySpark?

I tried the code below but still get an error:

from pyspark.mllib.stat import Statistics
group_cols = ["ID", "Month"]
AttributeError: 'GroupedData' object has no attribute 'Statistics'

Do I really need to use PySpark, or is there another library that would reduce the run time of the correlation?
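
One direction that may not need PySpark at all is to stay in pandas and avoid the per-group Python call by computing Pearson's r from group-wise sums. The following is a rough sketch under assumptions, not a benchmarked solution: the helper name add_group_corr and its parameters are hypothetical, rows with a missing value in either column are dropped before the computation, and groups where either column is constant get NaN.

```python
import numpy as np
import pandas as pd


def add_group_corr(df, keys=("ID", "Month"), x="day_lag", y="Rank", out="cor_result"):
    """Attach the negated per-group Pearson correlation of x and y as column `out`."""
    keys = list(keys)
    # Use only rows where both values are present.
    valid = df.dropna(subset=[x, y])
    grouped = valid.groupby(keys)
    # Centre each column on its group mean (vectorised, no Python loop per group).
    dx = valid[x] - grouped[x].transform("mean")
    dy = valid[y] - grouped[y].transform("mean")
    # Per-group sums of the cross product and of the squared deviations.
    sums = (
        valid[keys]
        .assign(sxy=dx * dy, sxx=dx * dx, syy=dy * dy)
        .groupby(keys)
        .sum()
    )
    # r = sum(dx*dy) / sqrt(sum(dx^2) * sum(dy^2)); zero denominators become NaN.
    denom = np.sqrt(sums["sxx"] * sums["syy"])
    corr = -(sums["sxy"] / denom.where(denom > 0))  # sign flipped as in the original
    return df.merge(corr.rename(out).reset_index(), on=keys, how="left")


# Hypothetical sample data standing in for df_new.
sample = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Month": [1, 1, 1, 1, 1, 1],
    "day_lag": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "Rank": [2.0, 4.0, 6.0, 3.0, 2.0, 1.0],
})
print(add_group_corr(sample))
```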

I am looking for a solution that either reduces the run time or converts the code to PySpark.

