How to detect duplicate data?


I have a simple contacts database, but I'm having problems with users entering duplicate data. I have implemented a simple data comparison, but unfortunately the duplicated data that is being entered is not exactly the same. For example, names are incorrectly spelled, or one person will put in 'Bill Smith' and another will put in 'William Smith' for the same person.
So is there some sort of algorithm that can give a percentage for how similar an entry is to another?

Algorithms such as Soundex and edit distance (as suggested in a previous post) can solve some of your problems. However, if you are serious about cleaning your data, this will not be enough. As others have stated, "Bill" does not sound anything like "William".
The best solution I have found is to use a reduction algorithm and a mapping table to reduce each name to its root name.
To your regular Address table, add root versions of the names, e.g.
Person (Firstname, RootFirstName, Surname, Rootsurname....)
Now, create a mapping table.
FirstNameMappings (Primary KEY Firstname, Rootname)
Populate your Mapping table by:
INSERT IGNORE INTO FirstNameMappings (Firstname, Rootname) SELECT DISTINCT Firstname, 'UNDEFINED' FROM Person;
This will add all first names that you have in your Person table, together with the RootName of "UNDEFINED".
Now, sadly, you will have to go through all the unique first names and map them to a RootName. For example, "Bill", "Billl" and "Will" should all be translated to "William".
This is very time-consuming, but if data quality really is important for you, I think it's one of the best ways.
Now use the newly created mapping table to update the "Rootfirstname" field in your Person table. Repeat for surname and address. Once this is done you should be able to detect duplicates without suffering from spelling errors.
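The steps above can be sketched end to end with Python's built-in sqlite3 module. This is only an illustration under assumptions: the sample rows and the manual_roots dictionary are made up, and SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE Person (Firstname TEXT, RootFirstName TEXT, Surname TEXT);
    CREATE TABLE FirstNameMappings (Firstname TEXT PRIMARY KEY, Rootname TEXT);
    INSERT INTO Person (Firstname, Surname) VALUES
        ('Bill', 'Smith'), ('William', 'Smith'), ('Billl', 'Smith'), ('Jane', 'Doe');
""")

# Step 1: seed the mapping table with every distinct first name.
cur.execute("""
    INSERT OR IGNORE INTO FirstNameMappings (Firstname, Rootname)
    SELECT DISTINCT Firstname, 'UNDEFINED' FROM Person
""")

# Step 2: the manual part. A person reviews the names and assigns roots.
# These assignments are hypothetical sample data.
manual_roots = {"Bill": "William", "Billl": "William",
                "William": "William", "Jane": "Jane"}
cur.executemany(
    "UPDATE FirstNameMappings SET Rootname = ? WHERE Firstname = ?",
    [(root, first) for first, root in manual_roots.items()],
)

# Step 3: copy the root names back onto the Person rows.
cur.execute("""
    UPDATE Person
    SET RootFirstName = (
        SELECT Rootname FROM FirstNameMappings
        WHERE FirstNameMappings.Firstname = Person.Firstname
    )
""")

# Step 4: rows sharing a root first name and surname are duplicate candidates.
duplicates = cur.execute("""
    SELECT RootFirstName, Surname, COUNT(*) FROM Person
    GROUP BY RootFirstName, Surname
    HAVING COUNT(*) > 1
""").fetchall()
print(duplicates)
```

The same pattern would be repeated for surname and address, each with its own mapping table.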

You can compare the names with the Levenshtein distance. If the names are the same, the distance is 0, else it is given by the minimum number of operations needed to transform one string into the other.
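A minimal sketch of the Levenshtein distance in plain Python follows. The similarity_percent helper is my own addition (not a standard definition): it scales the distance by the longer string's length to get the kind of percentage the question asks about.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical helper: 100 for identical strings, 0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("Bill", "Billl"), similarity_percent("Bill", "Billl"))
```

Note that "Bill" and "William" are still four edits apart, so edit distance alone does not solve the nickname problem.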

I imagine that this problem is well understood but what occurs to me on first reading is:
compare fields individually
count those that match (for a possibly loose definition of match, and possibly weighing the fields differently)
present for human intervention any cases which pass some threshold
Use your existing database to get a good first guess for the threshold, and correct as you accumulate experience.
You may prefer a fairly strong bias toward false positives, at least at first.
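As a rough sketch of that idea: compare each field loosely, weight the fields, and flag pairs above a threshold for a human to review. The field names, weights and threshold below are assumptions chosen for illustration; difflib's SequenceMatcher is just one possible "loose match".

```python
from difflib import SequenceMatcher

# Hypothetical weights and threshold; tune them against your own data.
WEIGHTS = {"name": 0.5, "email": 0.3, "phone": 0.2}
REVIEW_THRESHOLD = 0.75


def field_similarity(a: str, b: str) -> float:
    """Loose match between two field values, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(r1: dict, r2: dict) -> float:
    """Weighted sum of the per-field similarities."""
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def needs_review(r1: dict, r2: dict) -> bool:
    """True when the pair is similar enough to show to a human."""
    return record_score(r1, r2) >= REVIEW_THRESHOLD


a = {"name": "Bill Smith", "email": "bsmith@example.com", "phone": "555-0100"}
b = {"name": "Bil Smith", "email": "bsmith@example.com", "phone": "555-0100"}
print(needs_review(a, b))
```

Lowering REVIEW_THRESHOLD gives the bias toward false positives mentioned above.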

While I do not have an algorithm for you, my first action would be to take a look at the process involved in entering a new contact. Perhaps users do not have an easy way to find the contact they are looking for. Much like on Stack Overflow's new question form, you could suggest contacts that already exist on the new contact screen.
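One way to sketch such a suggestion list is with difflib.get_close_matches from the Python standard library. The suggest_existing function and its defaults are hypothetical; a real form would query the contacts database instead of an in-memory list.

```python
from difflib import get_close_matches


def suggest_existing(new_name, existing_names, limit=5, cutoff=0.6):
    """Return existing contact names that look like the one being typed."""
    # Compare case-insensitively, but return the names as stored.
    by_lower = {name.lower(): name for name in existing_names}
    hits = get_close_matches(new_name.lower(), list(by_lower), n=limit, cutoff=cutoff)
    return [by_lower[h] for h in hits]


contacts = ["Bill Smith", "Jane Doe", "William Smith"]
print(suggest_existing("Bil Smith", contacts))
```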

If you have access to SSIS, check out the Fuzzy Grouping and Fuzzy Lookup transformations.
http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services
http://msdn.microsoft.com/en-us/library/ms137786.aspx

If you have a large database with string fields, you can very quickly find a lot of duplicates by using the simhash algorithm.
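For readers who have not seen it, here is a small sketch of the simhash idea. The choices are mine, not from any particular library: whitespace tokens as features, equal weights, and MD5 as the per-token hash. Similar strings tend to get fingerprints that differ in only a few bits, so candidates can be found by comparing fingerprints instead of full strings.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Fingerprint in which each bit is a majority vote over the token hashes."""
    counts = [0] * bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


x = simhash("Bill Smith 12 Main Street Springfield")
y = simhash("Bill Smith 12 Main St Springfield")
print(hamming_distance(x, y))
```

Because the vote ignores token order and case, reordered fields produce the same fingerprint; how many differing bits still count as a duplicate is something you would have to tune.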

This may or may not be related, but minor misspellings might be detected by a Soundex search; e.g., this will allow you to consider Britney Spears, Britanny Spares, and Britny Spears as duplicates.
Nickname contractions, however, are difficult to treat as duplicates, and I doubt that it is wise to try. There are bound to be multiple people named Bill Smith and William Smith, and you would have to repeat that for Charles->Chuck, Robert->Bob, etc.
Also, if you are considering, say, Muslim users, the problems become more difficult (there are too many Muslims, for example, that are named Mohammed/Mohammad).
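To make the Soundex comparison above concrete, here is a sketch of the classic American Soundex rules in plain Python (many databases also ship a built-in SOUNDEX function, which may differ in small details).

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character Soundex code, e.g. 'Robert' gives 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate repeated codes
            prev = code
    return (result + "000")[:4]


for name in ("Britney", "Britanny", "Britny", "Bill", "William"):
    print(name, soundex(name))
```

The three Britney spellings collapse to one code, while Bill and William stay apart, which matches the point about nicknames.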

I'm not sure it will work well for the names vs nicknames problem, but the most common algorithm in this sort of area would be the edit distance / Levenshtein distance algorithm. It's basically a count of the number of character changes, additions and removals required to turn one item into another.
For names, I'm not sure you're ever going to get good results with a purely algorithmic approach - What you really need is masses of data. Take, for example, how much better Google spelling suggestions are than those in a normal desktop application. This is because Google can process billions of web queries and look at what queries lead to each other, what 'did you mean' links actually get clicked etc.
There are a few companies which specialise in the name matching problem (mostly for national security and fraud applications). The one I could remember, Search Software America, seems to have been bought out by these guys: http://www.informatica.com/products_services/identity_resolution/Pages/index.aspx, but I suspect any of these sorts of solutions would be far too expensive for a contacts application.

FullContact.com has APIs that can solve this for you; see their documentation here: http://www.fullcontact.com/developer/docs/?category=name.
They have APIs for Name Normalization (Bill into William), Name Deducer (for raw text), and Name Similarity (comparing two names).
All APIs are free at the moment, so it could be a good way to get started.

You might also want to look into probabilistic matching.
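A common form of probabilistic matching is the Fellegi-Sunter model: each field contributes log2(m/u) when the two records agree on it and log2((1-m)/(1-u)) when they disagree, where m is the probability of agreement among true matches and u the probability of agreement among non-matches. The m and u values below are invented for illustration; in practice they are estimated from data.

```python
import math

# Hypothetical (m, u) probabilities per field.
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "first_name": (0.85, 0.05),
    "city": (0.90, 0.20),
}


def match_weight(agreements: dict) -> float:
    """Sum of per-field log-likelihood ratios for one pair of records."""
    total = 0.0
    for field, agrees in agreements.items():
        m, u = FIELD_PROBS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


# Surname and city agree, first name differs (e.g. Bill vs William).
print(match_weight({"surname": True, "first_name": False, "city": True}))
```

Pairs with a high total weight are treated as matches, pairs with a low one as non-matches, and the band in between goes to manual review.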

For those wandering around the web who end up here, might I suggest that you try a Google Sheets add-on I created called Flookup.
It's particularly good with names and it has a couple of other awesome features which I'll describe below:
Say you have a list of names and there are 2 people called "John Smith". You can use the rank parameter from Flookup to instruct the algorithm to return the 1st, 2nd, 3rd or nth best match. This is helpful if you have additional information that you can use to identify the "John Smith" you want.
Say you have an additional database/list of apartment numbers. You can specify which "John Smith" you want by typing: John Smith & Apartment A or John Smith & Apartment B as the lookup parameter to help distinguish between the two names.
I hope you find Flookup as beneficial as others have.

Related

Fuzzy identity fingerprinting

I have a spreadsheet with values like address, name, IBAN, e-mail and want to identify when a customer last bought something.
The problem is: some fields contain spelling mistakes, others were deliberately entered wrong.
On GitHub, several libraries like https://github.com/seatgeek/fuzzywuzzy, https://github.com/seamusabshere/fuzzy_match or https://github.com/atom/fuzzaldrin are available to perform fuzzy searches based on a single and comparable column. But I want to combine multiple fields - this sounds like a common problem and I expected to find existing solutions out there.
Can you recommend approaches for such a problem? Are there existing projects for such a problem which I am missing?
Is a regular string-distance over all the fields usually good enough?

I mentioned it in your other question, but the dedupe python library does what you want.
Basically, it calculates the distance between each field in a pair of rows, then learns optimal weights to combine those distances into a single record-pair score.
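To illustrate the idea only (this is not dedupe's API, and not its actual training procedure), here is a sketch that turns each record pair into a vector of per-field similarities and fits a tiny logistic regression to learn the weights. The field names, training pairs and hyperparameters are all made up.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("name", "email")  # hypothetical record fields


def features(r1: dict, r2: dict) -> list:
    """One similarity value per field for a pair of records."""
    return [SequenceMatcher(None, r1[f].lower(), r2[f].lower()).ratio() for f in FIELDS]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(pairs, labels, epochs=2000, lr=0.5):
    """Stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(FIELDS), 0.0
    xs = [features(a, b) for a, b in pairs]
    for _ in range(epochs):
        for x, y in zip(xs, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def pair_score(r1, r2, weights, bias) -> float:
    """Single record-pair score between 0 and 1."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, features(r1, r2))))


bill = {"name": "Bill Smith", "email": "bsmith@example.com"}
bil = {"name": "Bil Smith", "email": "bsmith@example.com"}
jane = {"name": "Jane Doe", "email": "jane@example.org"}
jane2 = {"name": "Jane Do", "email": "jane.doe@example.org"}
zoe = {"name": "Zoe Quark", "email": "zq@other.net"}
omar = {"name": "Omar Haddad", "email": "omar@mail.test"}

# Labels: 1 = same customer, 0 = different customers.
weights, bias = train([(bill, bil), (jane, jane2), (bill, zoe), (jane, omar)], [1, 1, 0, 0])
print(pair_score(bill, bil, weights, bias), pair_score(bill, zoe, weights, bias))
```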

So far I believe http://blog.yhat.com/posts/fuzzy-matching-with-yhat.html and using fuzzyWuzzy seems to be the best approach.

How to check if two unstructured street address strings are the same?

I need to compare two unstructured addresses and be able to identify if they are the same (or similar enough).
Scenario
Address is supplied by the end user in plain text.
There is nothing to help the user write in a more identifiable manner (no autocomplete, nothing, just an empty textbox).
"#102 Nice-Looking Street, Gotham City, NY" should match with "Nice Loking St., Gotham City, New York, apt 102".
Using a third-party service is not an option.
Search is not a problem. I already have the two strings. What I need is to check if they represent the same address, despite their differences in structure.
What I have found
I know we can use some Fuzzy logic for this kind of comparison, with some tolerance for misspelling, but...
There are some keywords (like, for instance, comparing "Street" to "St." or comparing "#102" to "apt 102", or "NY" to "New York") that are not supposed to penalize the degree of reliability.
Some words can be placed in a different order (like the apartment in the above example).
I do not want to reinvent the wheel. This problem seems like a common concern in different contexts, and I think there is an algorithm (with some slight modifications, maybe) that might be a fit for this scenario.
Thanks in advance

I've helped build some open source tools to do this.
Basically, the approach is to try to split an address into its constituent parts and then intelligently compare those parts.
Both parts of the problem are hard.
The first part is often called address parsing. Here's what we use: https://github.com/datamade/usaddress
The second part has many, many names, but let's call it fuzzy matching. Here's the library we made for that: https://github.com/datamade/dedupe
We also provided some facilities for using them together: http://dedupe.readthedocs.io/en/latest/Variable-definition.html#address-type
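As a much cruder, standard-library-only stand-in for "parse, then compare" (and not how usaddress or dedupe work internally), one could normalize both strings and compare the sorted tokens. The abbreviation table and the list of ignored unit words below are tiny, hypothetical samples; a real table would be far larger and would have to deal with ambiguities such as "St" meaning "Saint".

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample tables.
EXPANSIONS = {"st": ["street"], "ave": ["avenue"], "rd": ["road"], "ny": ["new", "york"]}
NOISE = {"apt", "apartment", "unit", "suite"}


def normalize(address: str) -> str:
    """Lowercase, strip punctuation, expand abbreviations, sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", address.lower()).split()
    out = []
    for token in tokens:
        if token in NOISE:
            continue
        out.extend(EXPANSIONS.get(token, [token]))
    return " ".join(sorted(out))


def address_similarity(a: str, b: str) -> float:
    """Similarity of the normalized forms, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same = address_similarity(
    "#102 Nice-Looking Street, Gotham City, NY",
    "Nice Loking St., Gotham City, New York, apt 102",
)
print(same)
```

Sorting the tokens makes word order irrelevant, and the remaining misspelling ("Loking") only costs a little similarity.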

Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.
I have 2 data sets right now, each about 5 GB with 2 million observations. They are the assessed and historical data gathered for property listings of a given area for a certain amount of time. What I need to do is match properties to one another. A property may appear in the historical data several times, since it may get sold 2 or 3 times during the period. In the historical data I have the seller info, the loan info, and sale info. In the assessor data I have all of the characteristics that would describe the property sold. So in order to do any pricing model, I need to match the two.
I have variables that are similar in each, however they are going to differ slightly (misspellings, abbreviations, etc). Does anyone have any recommendations for me about going through this? First off, what program would I want to do this in? I have experience in STATA, R and a little bit of SAS and Matlab, but I'd prefer to use the former two.
I read through this:
Data matching algorithm
Where he uses .NET and one user suggested a Levenshtein approach (where the distance between strings is calculated), so for fields like Address I could use this and weight the approximate accuracy between the two strings. Then it was suggested to maybe use Soundex for the name of the seller/owner.
But I'm really lost in how to implement any of this, and before I approach anyone in my department I really need to have some sort of idea of what I'm doing!
Any help or advice would be immensely helpful.

Yes, there are several good algorithms for the string matching problem you describe, namely:
Jaro-Winkler,
Smith-Waterman,
Sørensen-Dice,
Soundex,
Damerau-Levenshtein, and
Monge-Elkan
to name a few.
I recommend A Comparison of String Distance Metrics for Name-Matching Tasks, by W. W. Cohen, P. Ravikumar, S. Fienberg for an overview of what might be working the best for what.
SoftTFIDF claims to be the best one. It is available as a Java package. There are other implementations of string matching and record linkage algorithms available in:
Java (SecondString),
Python (JellyFish),
C# (FuzzyString), and
Scala (StringMetric)
libraries.
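To give a feel for how simple some of these metrics are, here is a sketch of the Sørensen-Dice coefficient over character bigrams. It uses sets of bigrams, which is one common variant; the libraries listed above may count repeated bigrams differently.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs, case-insensitive."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings too short to have bigrams
    return 2 * len(x & y) / (len(x) + len(y))


print(dice_coefficient("night", "nacht"))  # one shared bigram out of 4 + 4 -> 0.25
```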

Best strategy for splitting English-style names into first and last name

I've got a list of names and I need to split them up into first and last names. Since some names have 2-3 spaces in them, a simple split for a space won't do.
What sort of heuristics do people use to perform the split?
Note that this isn't a duplicate of questions that effectively ask how to split at a space; I'm looking for heuristics and algorithms, not actual code help.
Update: I'm limiting the problem set to English-style names. This is all I need to solve and likely all that anyone approaching this (English language) question will need as well.

I've read a very interesting and comprehensive post on this subject:
http://www.w3.org/International/questions/qa-personal-names
It even suggests to ask yourself whether you really need separate fields for first and last names. It seems to depend on the target region(s) of your application.

Two approaches can help, though not fully solve this problem.
Programmatically separate the easy ones; the ones that are not easy get pushed into a different list, "remaining to be split". Manually sort that list. As you manually sort, some heuristics might emerge which could be coded, further reducing the size of the remaining list. If this is a one-time thing, and the list is not super massive, this will get the job done.
A closely related problem is when a name is split, but you don't know which is the first and which is last. Some systems work around this problem by doing fuzzy lookups such that if on the first attempt no match is found, flip the first and last name and try again. You didn't say why you need to split the names. If it is to lookup against reference data, consider some kind of similar fuzzy lookup heuristics which allow for trying different splits instead of trying to get the split correct up-front.
Not really an answer, but in this case there really is no perfect answer.

Different countries and regions have different formats for names. For example, in Asia the family name usually comes first and the given names follow. In the West, you've got the first name and last name convention, but it gets complicated when people double-barrel or include middle names. And in some regions people are given only one name.
Personally, I don't think there's one single algorithm that can give you 100% accurate results, I'm afraid.

The following assumes English-style surnames. If that's not the case, please update your question.
It's usually safe to assume that the last space character signals the start of a person's surname. But since there are exceptions, one strategy would be to compile a large database of known multi-word surnames from some other source. You could then test for these surnames, and treat them as exceptions.
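A sketch of that strategy follows: split at the last space unless the tail of the name matches a known multi-word surname. The three surnames in the set are placeholders for the large database mentioned above.

```python
# Placeholder for a real database of multi-word surnames (lowercase).
KNOWN_MULTIWORD_SURNAMES = {"lloyd webber", "van beethoven", "de la cruz"}
MAX_WORDS = max(len(s.split()) for s in KNOWN_MULTIWORD_SURNAMES)


def split_name(full_name: str) -> tuple:
    """Return (first names, surname) for an English-style name."""
    tokens = full_name.split()
    if len(tokens) < 2:
        return (full_name.strip(), "")
    # Try the longest possible surname first, always leaving one first name.
    for n in range(min(MAX_WORDS, len(tokens) - 1), 1, -1):
        candidate = " ".join(tokens[-n:])
        if candidate.lower() in KNOWN_MULTIWORD_SURNAMES:
            return (" ".join(tokens[:-n]), candidate)
    # Default rule: the last space starts the surname.
    return (" ".join(tokens[:-1]), tokens[-1])


for name in ("John Smith", "Mary Jane Watson", "Andrew Lloyd Webber"):
    print(split_name(name))
```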

algorithm for ranking search results based on previous usage

First and foremost: no, I'm not asking "please tell me how Google is built in two sentences". What I am asking is slightly different. I have a database filled with textual data that users input. We also give them the functionality to search for this data later. The problem is, we do a simple full text search now and return the results in no particular order. I'd like to rank the results by a weight based on how often the user types something in. As an example, a user might type in the following:
"foo"
"bo"
"bob"
"bob"
"bob"
"bo"
"foo2"
Based on the above data, a search on 'b' should return bo and bob, but bob should be listed first. It is the most relevant based on usage.
Curious, what algorithm should I research to build this in an effective fashion? Any books based on common web algorithms (I know this isn't just web specific) out there that will explain this?

There are various search algorithms out there.
Here's a little guidepost to some of them:
http://en.wikipedia.org/wiki/Search_algorithm
I'm not an expert in this area myself, so I cannot recommend a specific one.

I don't know how you'd do this in the context of a database, but here's one way to go about it:
Use a trie to store each unique word and the count of how often it was used. When your user starts typing, the trie allows you to efficiently grab all the strings with the given prefix, which you can then sort using the words' counts as keys.
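A minimal in-memory sketch of that trie, using the sample input from the question (the class and method names are my own):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how many times the word ending here was typed


class UsageTrie:
    def __init__(self):
        self.root = TrieNode()

    def record(self, word: str) -> None:
        """Store one use of a word."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def suggest(self, prefix: str) -> list:
        """Words starting with prefix, most used first (ties alphabetical)."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, word = stack.pop()
            if current.count:
                found.append((word, current.count))
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        found.sort(key=lambda item: (-item[1], item[0]))
        return [word for word, _ in found]


trie = UsageTrie()
for term in ["foo", "bo", "bob", "bob", "bob", "bo", "foo2"]:
    trie.record(term)
print(trie.suggest("b"))  # ['bob', 'bo']
```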

We use Apache Solr for our search.
In this technology, I think, this is normally done via boosting. So index your data, and then every day or so boost individual documents based on user queries.
