Pandas Groupby and Filter based on first record having date greater than specific date

I have a dataframe that shows details about employees and the site they are at and the positions they have held. The dataframe has columns for Site Id, Employee ID, and StartDate (plus a lot more fields). I have this sorted by Site and Employee ID ASC and then EffectiveDate DESC (latest record is first)

Site     EmployeeID         StartDate
   1            123        2024-09-01
   1            123        2024-08-01
   1            123        2024-06-01
   1            123        2024-05-01
   2            100        2024-06-01
   2            100        2024-03-01

I need to create a new column called EndDate which is the date of the previous record minus 1 day. We are moving to a new system so we only care about the dates that include the range 7/1/24 (or after). So for my example df, it would look like

Site     EmployeeID         StartDate       EndDate    Import
   1            123        2024-09-01                       Y
   1            123        2024-08-01    2024-08-31         Y
   1            123        2024-06-01    2024-07-31         Y
   1            123        2024-05-01    2024-05-31         N
   2            100        2024-06-01                       Y
   2            100        2024-03-01    2024-05-31         N

And then filtering for df['Import'] ='Y'

My initial idea was to iterate over df.groupby(by=['Site','EmployeeID']) and use .iloc[] to get the next values date, subtract 1 day, check if the EndDate is greater than 7/1/24, then set Import to Y or N accordingly. The problem is that this is a very large dataset (~300K rows) and this operation would take a very long.

Convert dates to datetime, so you can groupby on ["Site", "EmployeeID"] and transform accordingly.

df["StartDate"] = pd.to_datetime(df["StartDate"])
df["EndDate"] = df.groupby(["Site", "EmployeeID"])["StartDate"].transform(
    lambda x: x.shift() - pd.Timedelta(days=1)
)
df["Import"] = np.where((df["EndDate"] > "2024-07-01") | df["EndDate"].isna(), "Y", "N")

   Site  EmployeeID  StartDate    EndDate Import
0     1         123 2024-09-01        NaT      Y
1     1         123 2024-08-01 2024-08-31      Y
2     1         123 2024-06-01 2024-07-31      Y
3     1         123 2024-05-01 2024-05-31      N
4     2         100 2024-06-01        NaT      Y
5     2         100 2024-03-01 2024-05-31      N

You can use groupby.shift, subtract 1 day with Timedelta. Create the Import column with numpy.where:

# ensure datetime
df['StartDate'] = pd.to_datetime(df['StartDate'])

# get previous date per group and remove one day
df['EndDate'] = (df.groupby('Site')['StartDate']
                   .shift().sub(pd.Timedelta('1d'))
                )

# if the date is before 2024-07-01 flag as N, else Y (includes NaNs)
df['Import'] = np.where(df['EndDate'].lt('2024-07-01'), 'N', 'Y')

Output:

   Site  EmployeeID  StartDate    EndDate Import
0     1         123 2024-09-01        NaT      Y
1     1         123 2024-08-01 2024-08-31      Y
2     1         123 2024-06-01 2024-07-31      Y
3     1         123 2024-05-01 2024-05-31      N
4     2         100 2024-06-01        NaT      Y
5     2         100 2024-03-01 2024-05-31      N

import pandas as pd
import numpy as np

data = {
    'site': [1, 1, 1, 1, 2, 2],
    'id': [123, 123, 123, 123, 100, 100],
    'startDate': ['2024-09-01', '2024-08-01', '2024-06-01', '2024-05-01', '2024-06-01', '2024-03-01']
}

df = pd.DataFrame(data)

df['startDate'] = pd.to_datetime(df['startDate'])
df['previous_start_date']  = aa = df.groupby(['site','id'])['startDate'].shift(1)
df['bb'] = bb = pd.Timedelta(days =1)
df['EndDate'] = ab = aa - bb  
cutoff_date = pd.Timestamp('2024-07-01')
df['Import'] = np.where(
df['EndDate'].isna() | (df['EndDate'] >= cutoff_date), 'Y','N'    
)
print(df)
'''
   site   id  startDate previous_start_date     bb    EndDate Import
0     1  123 2024-09-01                 NaT 1 days        NaT      Y
1     1  123 2024-08-01          2024-09-01 1 days 2024-08-31      Y
2     1  123 2024-06-01          2024-08-01 1 days 2024-07-31      Y
3     1  123 2024-05-01          2024-06-01 1 days 2024-05-31      N
4     2  100 2024-06-01                 NaT 1 days        NaT      Y
5     2  100 2024-03-01          2024-06-01 1 days 2024-05-31      N
'''

Filed under: Kiến thức lập trình - @ 14:52

Thẻ: pythonpandasgroup-by

Thiết kế website giá rẻ

Danh mục

Pandas Groupby and Filter based on first record having date greater than specific date