I have two DataFrames that contain some log data, lets call them CorrectData and WrongData
Each row has multiple columns. I cannot sort the DataFrames by columns value. I need a way to take the first row from WrongData and based on its value in two columns Cl1 and Cl2 check if a row in CorrectData DataFrame contains these two values in the same two columns. It can be anywhere in the dataframe and the row can be there multiple times.
They look something like this:
CorrectData:
Cl1 Time Cl2 OtherData Origin
0x18ddE322 21111 0000000000000000 0xFFFFF 2
0x18ffa800 1234 0000000004100ff3 0xDWAFF 1
0x18ddE322 22121 0000000000000000 0xFFFFF 2
0x18ffa800 11111 0000000004100ff3 0xD213F 1
0x18ddE322 21111 0000000000000000 0xFFFFF 2
WrongData:
Cl1 Time Cl2 OtherData Origin
0x18ddE322 21111 0000000000000000 0xFFFFF 2
0x18ddE322 21111 0000000000000000 0xFFFFF 2
0x18ffa800 2211 0000000004100ff3 0xDWAFF 1
0x18ddE322 21111 0000000000000000 0xFFFFF 2
0x18ffa800 2211 0000000004100ff3 0xDWAFF 1
0x18ffa800 92121 0000000004100ff3 0xD213F 1
I have tried this approach so far:
comparison = pd.merge(CorrectData, WrongData, left_on=["Cl1","Cl2"], right_on=["Cl1","Cl2"], how='inner', indicator=True)
comparison = comparison[comparison['_merge'] != 'both']
print(comparison.to_string())
I expected a DataFrame comparison
containing rows flagged by either left_only
or right_only
, but I get empty DataFrame. I know due to manual testing that there are rows with different values that doesn’t match between the DataFrames.
When I tried how="outer"
the returned DataFrame contained rows that were in both DataFrames flagged as left or right only. Both columns have the same type, there are no whitespaces in them or other characters. I have no clue why its not recognizing the data.
I also tried the following:
When how="outer"
I get a DataFrame containing some values as I said above. Let’s say the value is “0x18ffa800”, and its in Cl1 and other is 0000000004100ff3 and its in Cl2
When I tried:
first_row= CorrectData.loc[(CorrectData["Cl1"] == '0x18ffa800') & (CorrectData["Cl2"] == '0000000004100ff3')]
first_row
For BOTH DataFrames CorrectData and WrongData I get number of rows where these values are obtained. But pd.merge somehow doesn’t recognize them
Any help would be greatly appreciated.
7
I think you should post an example where there is a mismatch. Easier to debug. Also, inner join only contains rows from both the dataframe. So to find difference you should use outer join.
Sharing a small piece of code below which works well
# first drop duplicates for efficient merge (optional).
wrong_df_dup = wrong_df.drop_duplicates(subset=["Cl1", "Cl2"])
correct_df_dup = correct_df.drop_duplicates(subset=["Cl1", "Cl2"])
# Next introduce a row in wrong_df which is different from correct_df
wrong_df_dup = pd.concat([wrong_df_dup, pd.DataFrame({"Cl1": ["0x18ffa801"], "Time": [21112], "Cl2": ["0000000004100ff1"], "otherData": ["0xDWAFA"], "Origin": [4]})])
# Do outer merge
df_outer = wrong_df_dup.merge(correct_df_dup, on = ["Cl1", "Cl2"], how="outer",indicator=True)
df_outer[df_outer['_merge']=='left_only'] # This gives the one row we introduced artifically.