I am designing a plugin to uniquely identify content on various web pages, based on addresses.
So I may have one address which looks like:
1 someawesome street, anytown, F100 211
later I may find this address in a slightly different format.
1 someawesome street, F100 211,
or perhaps as vague as
someawesome street F100
These are technically the same address, but with a level of similarity. I would like to a) generate a unique identifier for each address to perform lookups, and b) figure out when a very similar address shows up.
What algorithms / techniques / String metrics should I be looking at? Levenshtein distance seems like an obvious choice, but curious if there’s any other approaches that would lend themselves here.
Levenstein’s algorithm is based on the number of insertions, deletions, and substitutions in strings.
Unfortunately it doesn’t take into account a common misspelling which is the transposition of 2 chars (e.g. someawesome vs someaewsome). So I’d prefer the more robust Damerau-Levenstein algorithm.
I don’t think it’s a good idea to apply the distance on whole strings because the time increases abruptly with the length of the strings compared. But even worse, when address components, like ZIP are removed, completely different addresses may match better (measured using online Levenshtein calculator):
1 someawesome street, anytown, F100 211 (reference) 1 someawesome st.,anytown (difference of 15, same address) 1 otherplaces street,anytown,F100211 (difference of 13, different ddress) 1 sameawesome street, othertown, CA98200 (difference of 13, different ddress) anytown, 1 someawesome street (28 different same address) anytown, F100 211, 1 someawesome street (37 different same address)
These effects tend to worsen for shorter street name.
So you’d better use smarter algorithms. For example, Arthur Ratz published on CodeProject an algorithm for smart text comparison. The algorithm doesn’t print out a distance (it can certainly be enriched accordingly), but it identifies some difficult things such as moving of text blocks (e.g. the swap between town and street between my first example and my last example).
If such an algorithm is too general for your case, you should then really work by components and compare only comparable components. This is not an easy thing if you want to parse any address format in the world. But if the target is more specific, say US, it is certainly feasible. For example, “street”, “st.”, “place”, “plazza”, and their usual misspellings could reveal the street part of the address, the leading part of which would in principle be the number. The ZIP code would help to locate the town, or alternatively it is probably the last element of the address, or if you don’t like guessing, you could look for a list of city names (e.g. downloading a free zip code database). You could then apply Damerau-Levenshtein on the relevant components only.
You ask about string similarity algorithms but your strings are addresses. I would submit the addresses to a location API such as Google Place Search and use the
formatted_address as a point of comparison. That seems like the most accurate approach.
For address strings which can’t be located via an API, you could then fall back to similarity algorithms.
Levenshtein distance is better for words
If words are (mainly) spelled correctly then look at bag of words.
I may seem like over kill but TF-IDF and cosine similarity.
Or you could use free Lucene. I think they do cosine similarity.
Firstly, you’d have to parse the webpage for addresses, RegEx is one wrote to take however it can be very difficult to parse addresses using RegEx. You’d likely end up having to go through a list of potential addressing formats and great one or more expressions that match them. I’m not too familiar with address parsing, but I’d recommend taking a look at this question which follows a similar line of thought: General Address Parser for Freeform Text.
Levenshtein distance is useful but only after you’ve seperated the address into it’s parts. Consider the following addresses.
123 someawesome st. and
124 someawesome st. These addresses are totally different locations, but their Levenshtein distance is only 1. This can also be applied to something like
8th st. and
9th st. Similar street names don’t typically appear on the same webpage, but it’s not unheard of. A school’s webpage might have the address of the library across the street for example, or the church a few blocks down. This means that the only data the Levenshtein distance is easily usable for is the distance between 2 data points, such as the distance between the street and the city.
As far as figuring out how to separate the different fields, it’s pretty simple once we get the addresses themselves. Thankfully most addresses come in very specific formats, with a bit of RegEx wizardry it should be possible to separate them into different fields of data. Even if the address aren’t formatted well, there is still some hope. Addresses always(almost) follow the order of magnitude. Your address should fall somewhere on a linear grid like this one depending on how much information is provided, and what it is:
StreetNumber < Street < City < State < Country
It happens rarely, if at all that the address skips from one field to a non adjacent one. You aren’t going to see a Street then Country, or StreetNumber then City, very often.
One cool algorithm that is useful but requires a preset database of prior answers is called: Line edit distance.
Line edit distance, as a function, can return back a “how much different are those two words”.
A word like “dogma” and “dog”, you’ll get back a value of 3 (for 3 extra characters).
Or “cat” and “hat”, get back a value of 1 (for one different character).
(Source: https://en.wikipedia.org/wiki/Edit_distance )
Indeed using some distance function seems like a good approach. But the problem then is to find the closest string from a given address, which is far from trivial.
You are describing a wide category of algorithms here. Check out Nearest neighbor search
As mentioned in a comment, if you find a way to separate the components of the address (street name, number, etc), it will make the task much easier.
LongestCommonSubsequence (from Apache commons-text) can be another approach to try with addresses. If you define similarity of two as ratio of “common subsequence length / max(address lengths)“, then you can apply tolerance threshold – e.g. 0.8 that will define match/no match. This way it will allow you match addresses like “1 someawesome st., anytown” and “1 someawesome street., anytown“.
It is not super fast algorithm, so you might want to apply quick failbacks to minimize comparisons. Example would be – avoid comparison if zip codes do not match, or extracted digit only sequence is different.