I am attempting to create a ‘counter’ of consecutive binary values = 1, resetting when the binary value = 0, for each group. Example of data:
data = {'city_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6],
'week': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
'binary': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
For each id, the first binary = 1
should begin with a consecutive_count = 1
rather than 0. And this should reset each time binary = 0
, along with each time we move on to a new id.
I have already created a solution that does this. It looks like this:
df['consecutive'] = 0
for city in df['city_id'].unique():
city_df = df[df['city_id'] == city]
consecutive_count = 0
for i in range(len(city_df)):
if city_df['binary'].iloc[i] == 1:
consecutive_count += 1
else:
consecutive_count = 0
df.loc[(df['city_id'] == city) & (df['week'] == city_df['week'].iloc[i]), 'consecutive'] = consecutive_count
The main issue is that my solution is extremely inefficient for large data. I have a large set of ids, ~2.5M, and this solution either times out or runs for hours and hours, so I am struggling in making this more efficient. TIA.
The issue with your approach is that you’re repeatedly slicing. You should use the builtin groupby
functions for efficiency.
You can form a custom group with groupby.cumsum
to reset the groups on 0s, then use this to compute the consecutive counts:
df['consecutive'] = df.groupby(
['city_id', df['binary'].eq(0).groupby(df['city_id']).cumsum()]
)['binary'].cumsum()
Output:
city_id week binary consecutive
0 1 1 0 0
1 1 2 1 1
2 1 3 1 2
3 1 4 1 3
4 1 5 0 0
5 1 6 0 0
6 1 7 0 0
7 2 1 0 0
8 2 2 1 1
9 2 3 1 2
10 2 4 0 0
11 2 5 1 1
12 2 6 0 0
13 2 7 1 1
14 3 1 1 1
15 3 2 1 2
16 3 3 1 3
17 3 4 0 0
18 3 5 0 0
19 3 6 0 0
20 3 7 0 0
21 4 1 0 0
22 4 2 0 0
23 4 3 1 1
24 4 4 1 2
25 4 5 1 3
26 4 6 1 4
27 4 7 1 5
28 5 1 1 1
29 5 2 1 2
30 5 3 1 3
31 5 4 0 0
32 5 5 0 0
33 5 6 0 0
34 5 7 0 0
35 6 1 1 1
36 6 2 0 0
37 6 3 1 1
38 6 4 0 0
39 6 5 1 1
40 6 6 0 0
41 6 7 1 1
Code
grp = df.groupby(['city_id'])['binary'].shift().ne(df['binary']).cumsum()
df['consecutive'] = df.groupby(['city_id', grp])['binary'].cumsum()