Consecutive count of binary column by group

I am attempting to create a ‘counter’ of consecutive binary values = 1, resetting when the binary value = 0, for each group. Example of data:

data = {'city_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6],
        'week': [1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7],
        'binary': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1]}

df = pd.DataFrame(data)

For each id, the first binary = 1 should begin with a consecutive_count = 1 rather than 0. And this should reset each time binary = 0, along with each time we move on to a new id.

I have already created a solution that does this. It looks like this:

df['consecutive'] = 0
for city in df['city_id'].unique():
    city_df = df[df['city_id'] == city]
    consecutive_count = 0
    for i in range(len(city_df)):
        if city_df['binary'].iloc[i] == 1:
            consecutive_count += 1
        else:
            consecutive_count = 0
        df.loc[(df['city_id'] == city) & (df['week'] == city_df['week'].iloc[i]), 'consecutive'] = consecutive_count

The main issue is that my solution is extremely inefficient for large data. I have a large set of ids, ~2.5M, and this solution either times out or runs for hours and hours, so I am struggling in making this more efficient. TIA.

The issue with your approach is that you’re repeatedly slicing. You should use the builtin groupby functions for efficiency.

You can form a custom group with groupby.cumsum to reset the groups on 0s, then use this to compute the consecutive counts:

df['consecutive'] = df.groupby(
    ['city_id', df['binary'].eq(0).groupby(df['city_id']).cumsum()]
)['binary'].cumsum()

Output:

    city_id  week  binary  consecutive
0         1     1       0            0
1         1     2       1            1
2         1     3       1            2
3         1     4       1            3
4         1     5       0            0
5         1     6       0            0
6         1     7       0            0
7         2     1       0            0
8         2     2       1            1
9         2     3       1            2
10        2     4       0            0
11        2     5       1            1
12        2     6       0            0
13        2     7       1            1
14        3     1       1            1
15        3     2       1            2
16        3     3       1            3
17        3     4       0            0
18        3     5       0            0
19        3     6       0            0
20        3     7       0            0
21        4     1       0            0
22        4     2       0            0
23        4     3       1            1
24        4     4       1            2
25        4     5       1            3
26        4     6       1            4
27        4     7       1            5
28        5     1       1            1
29        5     2       1            2
30        5     3       1            3
31        5     4       0            0
32        5     5       0            0
33        5     6       0            0
34        5     7       0            0
35        6     1       1            1
36        6     2       0            0
37        6     3       1            1
38        6     4       0            0
39        6     5       1            1
40        6     6       0            0
41        6     7       1            1

Code

grp = df.groupby(['city_id'])['binary'].shift().ne(df['binary']).cumsum()
df['consecutive'] = df.groupby(['city_id', grp])['binary'].cumsum()

Filed under: Kiến thức lập trình - @ 15:22

Thẻ: pythonpandasdataframeperformance

Thiết kế website giá rẻ

Danh mục

Consecutive count of binary column by group