Relative Content

Tag Archive for pythonsqlite

Finding partial-duplicate rows in a table

I have a fairly large sqlite3 database table (about 4 million rows) that I access from Python. Certain partial records in it should never repeat but, for various reasons, do, and I’d like to find an efficient way of cleaning up these near-duplicates.
For example, suppose I have the columns ‘First Name’, ‘Last Name’, ‘Birth Date’, and ‘Address’, and a row containing ‘John’, ‘Smith’, ‘June 8, 2004’, ‘123 Main St.’, followed by another row that is identical except that the first name is ‘Johnny’. It’s essentially a duplicate record, but not an exact one, because one column was entered differently.
I’m not a whiz with SQL, but after a little googling I thought that my solution to finding all such near-duplicates would be to run a query like:
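(The query itself is cut off in this archive excerpt. What follows is not the original query, only a minimal sketch of one common approach: group on every column except the one that varies and keep the groups with more than one row. The table name `people`, the snake_case column names, and the helper `find_near_duplicates` are all assumptions made for illustration.)

```python
import sqlite3


def find_near_duplicates(conn):
    """Return groups of rows that agree on every column except first_name.

    Assumes a hypothetical table `people` with columns
    first_name, last_name, birth_date, address.
    """
    query = """
        SELECT last_name, birth_date, address,
               COUNT(*) AS n,
               GROUP_CONCAT(first_name, '|') AS first_names
        FROM people
        GROUP BY last_name, birth_date, address
        HAVING COUNT(*) > 1
    """
    return conn.execute(query).fetchall()


# Small in-memory demo; the real table would live in a database file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people "
    "(first_name TEXT, last_name TEXT, birth_date TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("John", "Smith", "2004-06-08", "123 Main St."),
        ("Johnny", "Smith", "2004-06-08", "123 Main St."),
        ("Mary", "Jones", "1999-01-02", "9 Oak Ave."),
    ],
)

dupes = find_near_duplicates(conn)
for last_name, birth_date, address, n, first_names in dupes:
    # first_names holds the differing values, joined with '|'
    print(last_name, birth_date, address, n, first_names.split("|"))
```

On a table with millions of rows, an index covering the grouped columns (here `last_name, birth_date, address`) may speed this up, though that would need to be measured on the real data.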

Rangebreak error solved: I read the answer on Stack Overflow

import pandas as pd
import datetime as dt
# !pip install yfinance
import yfinance as yf
import plotly.graph_objects as go
import numpy as np

# Initial data & get dataframe
start = dt.date(2023, 4, 15)
end = dt.date(2023, 4, 21)
ticker = 'SPY'
df = yf.download(ticker, start, end, progress=False, interval='1m')

def calc_rangebreak(time_series: pd.Series):
    """
    Calculate […]