I am trying to run a regression within each group and year. Here is what I have done:
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001],
    "Group": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "ID": [1, 2, 3, 5, 2, 1, 2, 3],
    "Value 1": [40, 20, 30, 45, 22, 34, 11, 88],
    "Value 2": [3, 22, 11, 55, 5, 9, 4, 15],
})

def func(group, var):
    # `group` is the sub-DataFrame for one (Year, Group) combination
    X = group[var]  # independent variable
    y = group["Value 2"]  # dependent variable
    X = sm.add_constant(X)  # add the intercept column
    group["Residual"] = sm.OLS(y, X, missing="drop").fit().resid
    return group

data.groupby(["Year", "Group"], group_keys=False).apply(func, var="Value 1")
It works just fine. However, my real dataset is huge. Is there a way to run this more efficiently, for example with matrix operations? On the real data it is pretty slow, and I keep getting the following error:
“MemoryError: Unable to allocate 5.73 GiB for an array with shape (229, 3359769) and data type float64”
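
One direction that might avoid fitting a separate statsmodels model per group is to compute the residuals in closed form with grouped, vectorized operations. The following is only a minimal sketch under a strong assumption: there is exactly one regressor plus an intercept, so the slope within each group is sum((x - mean x) * (y - mean y)) / sum((x - mean x)^2). The function name `groupwise_residuals` and its parameters are hypothetical, not part of pandas or statsmodels.

```python
import numpy as np
import pandas as pd


def groupwise_residuals(df, x_col, y_col, keys):
    """Residuals of y ~ const + x, fitted separately within each group.

    Assumes a single regressor; this is the closed-form simple OLS
    solution, not a general replacement for sm.OLS.
    """
    # Rough analogue of missing="drop": rows with a NaN in x or y are
    # excluded from the fit and receive a NaN residual.
    valid = df[x_col].notna() & df[y_col].notna()
    x = df[x_col].where(valid).astype(float)
    y = df[y_col].where(valid).astype(float)

    # Group by the key columns themselves, aligned on the index.
    g = [df[k] for k in keys]

    # Demean x and y within each group.
    xd = x - x.groupby(g).transform("mean")
    yd = y - y.groupby(g).transform("mean")

    # Per-group slope, broadcast back to every row of the group.
    sxy = (xd * yd).groupby(g).transform("sum")
    sxx = (xd * xd).groupby(g).transform("sum")
    slope = sxy / sxx  # NaN when x is constant within a group

    # Residual = (y - ybar) - slope * (x - xbar), since the intercept
    # is ybar - slope * xbar.
    return yd - slope * xd
```

With the example data this would be called as `data["Residual"] = groupwise_residuals(data, "Value 1", "Value 2", ["Year", "Group"])`. Everything stays in a handful of column-length arrays, so memory use grows with the number of rows rather than with any large intermediate matrix. With several regressors this closed form no longer applies.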