I have a dataframe that shows details about employees and the site they are at and the positions they have held. The dataframe has columns for Site Id, Employee ID, and StartDate (plus a lot more fields). I have this sorted by Site and Employee ID ASC and then EffectiveDate DESC (latest record is first)
Site EmployeeID StartDate
1 123 2024-09-01
1 123 2024-08-01
1 123 2024-06-01
1 123 2024-05-01
2 100 2024-06-01
2 100 2024-03-01
I need to create a new column called EndDate
which is the date of the previous record minus 1 day. We are moving to a new system so we only care about the dates that include the range 7/1/24 (or after). So for my example df, it would look like
Site EmployeeID StartDate EndDate Import
1 123 2024-09-01 Y
1 123 2024-08-01 2024-08-31 Y
1 123 2024-06-01 2024-07-31 Y
1 123 2024-05-01 2024-05-31 N
2 100 2024-06-01 Y
2 100 2024-03-01 2024-05-31 N
And then filtering for df['Import'] ='Y'
My initial idea was to iterate over df.groupby(by=['Site','EmployeeID'])
and use .iloc[]
to get the next values date, subtract 1 day, check if the EndDate
is greater than 7/1/24, then set Import to Y
or N
accordingly. The problem is that this is a very large dataset (~300K rows) and this operation would take a very long.
Convert dates to datetime, so you can groupby
on ["Site", "EmployeeID"]
and transform
accordingly.
df["StartDate"] = pd.to_datetime(df["StartDate"])
df["EndDate"] = df.groupby(["Site", "EmployeeID"])["StartDate"].transform(
lambda x: x.shift() - pd.Timedelta(days=1)
)
df["Import"] = np.where((df["EndDate"] > "2024-07-01") | df["EndDate"].isna(), "Y", "N")
Site EmployeeID StartDate EndDate Import
0 1 123 2024-09-01 NaT Y
1 1 123 2024-08-01 2024-08-31 Y
2 1 123 2024-06-01 2024-07-31 Y
3 1 123 2024-05-01 2024-05-31 N
4 2 100 2024-06-01 NaT Y
5 2 100 2024-03-01 2024-05-31 N
You can use groupby.shift
, subtract 1 day with Timedelta
. Create the Import column with numpy.where
:
# ensure datetime
df['StartDate'] = pd.to_datetime(df['StartDate'])
# get previous date per group and remove one day
df['EndDate'] = (df.groupby('Site')['StartDate']
.shift().sub(pd.Timedelta('1d'))
)
# if the date is before 2024-07-01 flag as N, else Y (includes NaNs)
df['Import'] = np.where(df['EndDate'].lt('2024-07-01'), 'N', 'Y')
Output:
Site EmployeeID StartDate EndDate Import
0 1 123 2024-09-01 NaT Y
1 1 123 2024-08-01 2024-08-31 Y
2 1 123 2024-06-01 2024-07-31 Y
3 1 123 2024-05-01 2024-05-31 N
4 2 100 2024-06-01 NaT Y
5 2 100 2024-03-01 2024-05-31 N
import pandas as pd
import numpy as np
data = {
'site': [1, 1, 1, 1, 2, 2],
'id': [123, 123, 123, 123, 100, 100],
'startDate': ['2024-09-01', '2024-08-01', '2024-06-01', '2024-05-01', '2024-06-01', '2024-03-01']
}
df = pd.DataFrame(data)
df['startDate'] = pd.to_datetime(df['startDate'])
df['previous_start_date'] = aa = df.groupby(['site','id'])['startDate'].shift(1)
df['bb'] = bb = pd.Timedelta(days =1)
df['EndDate'] = ab = aa - bb
cutoff_date = pd.Timestamp('2024-07-01')
df['Import'] = np.where(
df['EndDate'].isna() | (df['EndDate'] >= cutoff_date), 'Y','N'
)
print(df)
'''
site id startDate previous_start_date bb EndDate Import
0 1 123 2024-09-01 NaT 1 days NaT Y
1 1 123 2024-08-01 2024-09-01 1 days 2024-08-31 Y
2 1 123 2024-06-01 2024-08-01 1 days 2024-07-31 Y
3 1 123 2024-05-01 2024-06-01 1 days 2024-05-31 N
4 2 100 2024-06-01 NaT 1 days NaT Y
5 2 100 2024-03-01 2024-06-01 1 days 2024-05-31 N
'''