I have two pandas dataframes, and I’m looking to find a faster way to apply the labels defined in the first dataframe to the rows in the second:
- df1: The first dataframe has thresholds for a number of categories along with associated labels
id | Category A | Category B | Upper Bound | Lower Bound | Label |
---|---|---|---|---|---|
1 | Red | One | 1 | 3 | X |
2 | Yellow | Two | 3 | 5 | Y |
3 | Blue | Three | 5 | 7 | Z |
- df2: The second has the raw data values that need to be labeled according to the thresholds in the first table. If Value in df2 is between Upper Bound and Lower Bound in df1, then apply the appropriate label to that row (assuming match on Cat A and B).
id | Category A | Category B | Value |
---|---|---|---|
1 | Red | One | 2 |
2 | Yellow | Two | 4 |
3 | Blue | Three | 6 |
After the labels are applied from the first to the second, the resulting dataframe df2 should look like:
id | Category A | Category B | Value | Label |
---|---|---|---|---|
1 | Red | One | 2 | X |
2 | Yellow | Two | 4 | Y |
3 | Blue | Three | 6 | Z |
I’m currently able to do this using df.apply and iterrows like follows:
def applyLabels(row, dfLabel,_):
for index, rowLabel in dfLabel.iterrows():
if row[Category A] == rowLabel[Category A] & row[Value] between Upper and Lower Bound...
return dfLabel.at[index, 'Label']
df2['Label'] = df2.apply(applyLabels, args=(df2,_), axis=1)
But this is taking forever (I’m working w/ ~10 million rows) – is there a faster way to achieve this result without exporting this whole thing to SQL?