I am using pandas to compute the correlation between the columns "day_lag" and "Rank", grouped by "ID" and "Month", but the run time is quite long.

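# Negated per-group correlation between day_lag and Rank
df_daylag_corr = -df_new.groupby(['ID', 'Month'])['day_lag'].corr(df_new['Rank'])
# Turn the (ID, Month) index back into columns and name the result
df_daylag_corr = df_daylag_corr.reset_index()
df_daylag_corr = df_daylag_corr.rename(columns={'day_lag': 'cor_result'})
# Attach the per-group result to every row
df_new = df_new.merge(df_daylag_corr, on=['ID', 'Month'], how='left')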

How can I convert the code above to PySpark?

I tried the code below but still get an error:

from pyspark.mllib.stat import Statistics
group_cols = ["ID", "Month"]
df_spark.withColumn("cor_result",df_spark.groupby(group_cols).Statistics.corr("day_lag","Rank"))
AttributeError: 'GroupedData' object has no attribute 'Statistics'

Do I really need PySpark, or is there another library that would reduce the run time of the correlation?
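
One pandas-only option that might help is to compute the correlation from group-wise sums instead of calling `corr` once per group. The sketch below is an assumption-laden illustration: the function name `group_corr_vectorized` is hypothetical, it assumes "day_lag" and "Rank" contain no missing values, and whether it is actually faster depends on the number and size of the groups.

```python
import numpy as np
import pandas as pd


def group_corr_vectorized(df, keys=("ID", "Month"), x="day_lag", y="Rank"):
    """Negated per-group Pearson correlation using only vectorized operations.

    Assumes x and y have no missing values.
    """
    keys = list(keys)
    grouped = df.groupby(keys)
    # Center each column on its own group mean
    xc = df[x] - grouped[x].transform("mean")
    yc = df[y] - grouped[y].transform("mean")
    # Per-row products, then summed within each group
    parts = df[keys].assign(xy=xc * yc, xx=xc * xc, yy=yc * yc)
    sums = parts.groupby(keys)[["xy", "xx", "yy"]].sum()
    corr = sums["xy"] / np.sqrt(sums["xx"] * sums["yy"])
    # Negate to match the original code and return ID, Month, cor_result
    return (-corr).rename("cor_result").reset_index()
```

The result could then be merged back the same way as before, for example `df_new.merge(group_corr_vectorized(df_new), on=['ID', 'Month'], how='left')`.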

I am looking for a solution that either reduces the run time or converts the code to PySpark.
