Merging 2 DataFrame to find different rows

  Kiến thức lập trình

I have two DataFrames that contain some log data, lets call them CorrectData and WrongData

Each row has multiple columns. I cannot sort the DataFrames by columns value. I need a way to take the first row from WrongData and based on its value in two columns Cl1 and Cl2 check if a row in CorrectData DataFrame contains these two values in the same two columns. It can be anywhere in the dataframe and the row can be there multiple times.

They look something like this:

CorrectData:

 Cl1             Time           Cl2                  OtherData       Origin
 0x18ddE322      21111        0000000000000000         0xFFFFF         2
 0x18ffa800      1234         0000000004100ff3         0xDWAFF         1      
 0x18ddE322      22121        0000000000000000         0xFFFFF         2
 0x18ffa800      11111        0000000004100ff3         0xD213F         1
 0x18ddE322      21111        0000000000000000         0xFFFFF         2

WrongData:

 Cl1             Time           Cl2                  OtherData       Origin
 0x18ddE322      21111        0000000000000000         0xFFFFF         2
 0x18ddE322      21111        0000000000000000         0xFFFFF         2
 0x18ffa800      2211         0000000004100ff3         0xDWAFF         1      
 0x18ddE322      21111        0000000000000000         0xFFFFF         2
 0x18ffa800      2211         0000000004100ff3         0xDWAFF         1  
 0x18ffa800      92121        0000000004100ff3         0xD213F         1

I have tried this approach so far:

comparison = pd.merge(CorrectData, WrongData, left_on=["Cl1","Cl2"], right_on=["Cl1","Cl2"], how='inner', indicator=True)
comparison = comparison[comparison['_merge'] != 'both']
print(comparison.to_string())

I expected a DataFrame comparison containing rows flagged by either left_only or right_only, but I get empty DataFrame. I know due to manual testing that there are rows with different values that doesn’t match between the DataFrames.

When I tried how="outer" the returned DataFrame contained rows that were in both DataFrames flagged as left or right only. Both columns have the same type, there are no whitespaces in them or other characters. I have no clue why its not recognizing the data.

I also tried the following:

When how="outer" I get a DataFrame containing some values as I said above. Let’s say the value is “0x18ffa800”, and its in Cl1 and other is 0000000004100ff3 and its in Cl2

When I tried:

first_row= CorrectData.loc[(CorrectData["Cl1"] == '0x18ffa800') & (CorrectData["Cl2"] == '0000000004100ff3')]
first_row

For BOTH DataFrames CorrectData and WrongData I get number of rows where these values are obtained. But pd.merge somehow doesn’t recognize them

Any help would be greatly appreciated.

7

I think you should post an example where there is a mismatch. Easier to debug. Also, inner join only contains rows from both the dataframe. So to find difference you should use outer join.
Sharing a small piece of code below which works well

# first drop duplicates for efficient merge (optional).
wrong_df_dup = wrong_df.drop_duplicates(subset=["Cl1", "Cl2"])
correct_df_dup = correct_df.drop_duplicates(subset=["Cl1", "Cl2"]) 

# Next introduce a row in wrong_df which is different from correct_df
wrong_df_dup = pd.concat([wrong_df_dup, pd.DataFrame({"Cl1": ["0x18ffa801"], "Time": [21112], "Cl2": ["0000000004100ff1"], "otherData": ["0xDWAFA"], "Origin": [4]})])

# Do outer merge
df_outer = wrong_df_dup.merge(correct_df_dup, on = ["Cl1", "Cl2"], how="outer",indicator=True)
df_outer[df_outer['_merge']=='left_only'] # This gives the one row we introduced artifically.

Theme wordpress giá rẻ Theme wordpress giá rẻ Thiết kế website Kho Theme wordpress Kho Theme WP Theme WP

LEAVE A COMMENT