I've never built an algorithm for matching before and don't really know where to start. So here is my basic setup and why I'm doing it. Feel free to correct me if I'm not asking the right questions.
I have a database of names and unique identifiers for people. Several identifiers (some internally generated, some from third parties), last name, first name, and birth date are the primary fields I would be using.
Several times throughout the year I receive a list from a third party that needs to be imported and tied to the existing people in my database but the data is never as clean as mine. IDs could change, birth dates could have typos, names could have typos, last names could change, etc.
Each import could have 20,000 records so even if it's 99% accurate that's still 200 records I'd have to go in manually and match. I think I'm looking for more like 99.9% accuracy when it comes to matching the incoming people to my users.
So, how do I go about making an algorithm that can figure this out?
P.S. Even if you don't have an exact answer, pointers to materials I could reference would also be helpful.
P.P.S. Here are some examples, similar to what m3rLinEz wrote:
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate: 01/20/84 (original)
ID: 9876234 Fname: Jose LName: Guitierrez Birthdate: 10/20/84 (typo in birth date)
ID: 0876234 Fname: Jose LName: Guitierrez Birthdate: 01/20/84 (wrong ID)
ID: 9876234 Fname: Jose LName: Guitierrez-Brown Birthdate: 01/20/84 (hyphenated last name)
ID: 9876234 Fname: Jose, A. LName: Guitierrez Birthdate: 01/20/84 (added middle initial)
ID: 3453555 Fname: Joseph LName: Guitierrez Birthdate: 01/20/84 (probably someone else with the same birth date and last name)
You might be interested in Levenshtein distance.
The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.
It is possible to compare each of your fields and compute the total distance, and by trial and error you may discover the appropriate threshold for records to be interpreted as matched. I have not implemented this myself, just thought of the idea :}
For example:
Record A - ID: 4831213321, Name: Jane
Record B - ID: 431213321, Name: Jann
Record C - ID: 4831211021, Name: John
The distance between A and B will be lower than between A and C or between B and C, which indicates a better match.
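If it helps, here is a minimal sketch of the idea in Python; the field names, example records, and any threshold you'd pick are illustrative assumptions rather than anything from the original question:

```python
# Sketch: score candidate matches by summing Levenshtein distances across
# fields, then accept pairs whose total distance falls under a threshold
# found by trial and error. Field names and values are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def record_distance(rec_a: dict, rec_b: dict, fields=("id", "name")) -> int:
    """Total edit distance across the compared fields."""
    return sum(levenshtein(str(rec_a[f]), str(rec_b[f])) for f in fields)

a = {"id": "4831213321", "name": "Jane"}
b = {"id": "431213321",  "name": "Jann"}
c = {"id": "4831211021", "name": "John"}

print(record_distance(a, b))  # 2 -> likely the same person
print(record_distance(a, c))  # 5 -> likely a different person
```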
When it comes to something like this, do not reinvent the wheel. The Levenshtein distance is probably your best bet if you HAVE to do this yourself, but otherwise, do some research on existing solutions that do database querying and fuzzy searching. They've been doing it longer than you; it'll probably be better, too.
Good luck!
If you're dealing with data sets of this size and different resources being imported, you may want to look into an Identity Management solution. I'm mostly familiar with Sun Identity Manager, but it may be overkill for what you're trying to do. It might be worth looking into.
If the data you are getting from third parties is consistent (same format each time), I'd probably create a table for each of the third parties you are getting data from, then import each new set of data to the same table each time. There's a way to join the two tables based on common columns using an SQL statement; that way you can perform SQL queries and get data from multiple tables but make it look like it came from one single unified table. Similarly, records that don't have matches in both tables could be found and then manually paired. This way you keep your 'clean' data separate from the junk you get from third parties. If you wanted a true import you could then use that joined table to create a third table containing all your data.
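Here is a minimal sketch of that join-and-review idea using Python and pandas rather than raw SQL; the file names, the person_id key, and the column names are all assumptions:

```python
# Sketch: keep the clean table and the third-party import separate, join
# them on a shared key, and pull out the rows that need manual review.
# File names and column names are assumptions.
import pandas as pd

clean = pd.read_csv("my_people.csv")        # your authoritative data
incoming = pd.read_csv("third_party.csv")   # the latest third-party list

# An outer join keeps rows from both sides; the indicator column shows
# whether each row matched ("both") or exists on only one side.
merged = clean.merge(incoming, on="person_id", how="outer",
                     suffixes=("_clean", "_import"), indicator=True)

matched   = merged[merged["_merge"] == "both"]
unmatched = merged[merged["_merge"] != "both"]   # pair these up manually
```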
I would start with the easy, near-100%-certain matches and handle them first, so now you have a list of, say, 200 that need fixing.
For the remaining rows you can use a simplified version of Bayes' Theorem.
For each unmatched row, calculate the likelihood that it is a match for each row in your data set, assuming that the data contains certain changes which occur with certain probabilities. For example, a person changes their surname with probability 0.1% (possibly also depending on gender), changes their first name with probability 0.01%, and has a single typo with probability 0.2% (use Levenshtein distance to count the number of typos). Other fields also change with certain probabilities. For each row, calculate the likelihood that the row matches, considering all the fields that have changed. Then pick the one that has the highest probability of being a match.
For example, a row with only a small typo in one field but equal on all others would have a 0.2% chance of a match, whereas rows which differ in many fields might have only a 0.0000001% chance. So you pick the row with the small typo.
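A sketch of that scoring step in Python, reusing the levenshtein() helper from the earlier sketch in this thread; the probabilities are the illustrative numbers from this answer, not calibrated values:

```python
# Score each candidate by multiplying the probabilities of the changes that
# would have had to happen for it to be the same person, then keep the
# candidate with the highest score. Field names and probabilities are
# illustrative assumptions; levenshtein() is the helper sketched earlier.

def match_likelihood(incoming: dict, candidate: dict) -> float:
    p = 1.0
    if incoming["lname"] != candidate["lname"]:
        p *= 0.001    # surname change, assumed 0.1%
    if incoming["fname"] != candidate["fname"]:
        p *= 0.0001   # first-name change, assumed 0.01%
    typos = sum(levenshtein(str(incoming[f]), str(candidate[f]))
                for f in ("id", "birthdate"))
    p *= 0.002 ** typos   # each single-character typo, assumed 0.2%
    return p

def best_match(incoming: dict, existing: list) -> dict:
    """Return the existing record most likely to be the same person."""
    return max(existing, key=lambda cand: match_likelihood(incoming, cand))
```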
Regular expressions are what you need; why reinvent the wheel?
I've been trying to solve this problem for some days now, but it seems I have reached a dead end. Maybe someone will be able to help me.
I have two sheets. The first one contains the list of my clients and their delivery number depending on the weekday.
In my second sheet I would like to get the delivery number of the client (red cells) depending on the weekday I select (yellow cells).
I tried the VLOOKUP formula, INDEX/MATCH, and QUERY, but I wasn't able to find a way to get the delivery number based on the client's name and the weekday. I think the main issue is that in the first sheet the weekday is a column title.
Maybe the solution is simply to build my tables differently...
Thank you for your help
You can try something like this, assuming A2 and B2 are the cells holding the first name and the first day to look up:
=INDEX(Sheet1!$1:$1000,MATCH(A2,Sheet1!$A:$A,0),MATCH(B2,Sheet1!$1:$1,0))
Or, if you want this same formula for the full column:
=byrow(A2:A,lambda(each,if(each="","",INDEX(Sheet1!$1:$1000,MATCH(each,Sheet1!$A:$A,0),MATCH(offset(each,0,1),Sheet1!$1:$1,0)))))
Also doable (and perhaps more simply) using MAP/FILTER; with your 'Caption 1' table in Sheet1!A1:D4 and your 'Caption 2' table at the top-left of Sheet2, the following in Sheet2!C2 gives you the delivery number for as many names/days as you enter in the columns alongside:
=map(A2:A,B2:B,lambda(name,day,ifna(filter(filter(Sheet1!B2:D4,Sheet1!A2:A4=name),Sheet1!B1:D1=day))))
N.B. The IFNA blanks out errors for those rows where a Name/Day pair hasn't been entered yet. Extend the ranges in the filter to suit your real data.
All you need is a simple VLOOKUP:
=INDEX(IFNA(VLOOKUP(A9:A11&B9:B11,
SPLIT(FLATTEN(A2:A4&B1:D1&""&B2:D4), ""), 2, )))
I've looked at many threads regarding COUNT and COUNTA, but I can't seem to figure out how to use them correctly.
I am new to DAX and am learning my way around. I have attempted to look this up and have gotten a little ways to where I need to be but not exactly. I think I am confused about how to apply a filter.
Here's the situation:
Four separate queries are used to generate the data in the report, but I only need two of them for the DAX function (Products and Display).
I have three columns I need to filter by, as follows:
Customer (Display or Products query; can do either)
Brand (Products query)
Location (Display query)
I want to count rows based on whether the combination of those columns is unique.
Here's an example:
Customer: Big Box Buy; Item: Lego Big Blocks; Brand: Lego; Location: Toys
Customer: Big Box Buy; Item: Lego Star Wars; Brand: Lego; Location: Toys
Customer: Big Box Buy; Item: Surface Pro; Brand: Microsoft; Location: Electronics
Customer: Little Shop on the Corner; Item: Red Bicycle; Brand: Trek; Location: Racks
In this example, no matter the fact that the items are different, we want to look at just the customer, the brand, and the location. We see that in the first two records the customer is "Big Box Buy", the brand is "Lego", and the location is "Toys". This combination appears twice, but I want to count it distinctly as "1". The next "Big Box Buy" record has the brand "Microsoft" and the location "Electronics". It appears once and only once, and thus the distinct count is "1" anyway. This means that there are two separate entries for "Big Box Buy", both with a count of 1. And lastly there is "Little Shop on the Corner", which appears just once and is counted just once.
The "skeleton" of the code I have is basically just to see if I can get a count to work at all, which I can. It's the FILTER that I think is the problem (not used in the below example) judging by other threads I've read.
TotalDisplays = CALCULATE(COUNTA(products[Brand]))
Obviously I can't just count the amount of times a brand appears as that would give me duplicates. I need it unique based on if the following conditions are met:
Customer must be the same
Brand must be the same
Location must be the same
If so, we distinctly count it as one.
I know I ranted a bit and may seem to have gone in circles, but I was trying to figure out how to explain it. Please let me know if I need to edit this post or post clarification.
Many thanks in advance as I go through my journey with DAX!
I believe I have the answer. I used NATURALINNERJOIN in DAX to create a new, merged table since I needed to reference all values in the same query (I couldn't figure out how to do it otherwise). I also created a "unique identity" calculated column that combined data from multiple columns but was hidden behind the scenes (not actually displayed on the report), so I could then take a measure of the unique values that way.
TotalDisplays = COUNTROWS(DISTINCT('GD-DP-Merge'[DisplayCountCalcCol]))
My calculated column is as follows:
DisplayCountCalcCol = 'GD-DP-Merge'[CustID] & 'GD-DP-Merge'[Brand] & 'GD-DP-Merge'[Location] & 'GD-DP-Merge'[Order#]
So the measure TotalDisplays now reports back the distinct count of rows based on the unique value of the customer ID, the brand, and the location of the item. I also threw in an order number just in case.
Thanks!
I am semi-new to DAX and was struggling with the COUNT and COUNTA formulas; your post helped me find my answer. I would like to add the solution I got for my query: I wanted a count of "Right Time Start Achieved" (RTSA). If anyone is looking for this kind of answer, use the measure below; the filter argument selects the table column and the string you want to match:
RTSA := CALCULATE(COUNTA(VEO_Daily_Services[RTS]), VEO_Daily_Services[RTS] = "RTSA")
I am new to Qlik and trying to solve the following issue.
I have a table with two dimensions, one with the entry's unique ID, and one with a category, as in the example below.
[Table example]
My goal is to create a new column with a ranking of 'Score' - my measure - per category:
[Table with desired output]
If I use the expression
Rank(Score)
I get a column of ones, as the command takes the most granular dimension (Unique ID) as the default one. If I use
Rank(TOTAL Score)
It obviously returns a ranking regardless of all the dimensions. By reading the documentation and similar questions asked by other users I reckon that it should be possible to specify which dimension to use for TOTAL, with the following syntax:
Rank(TOTAL <Category> Score)
Yet the formula returns an error and only null column values. I've tried different syntax and use of brackets, but I still cannot grasp what I am doing wrong.
Please note that I cannot create the ranking column when loading the data.
I would immensely appreciate it if someone were so kind as to help with this!
Try with
=aggr(rank(sum(Score)), Category, UniqueID)
I have a dilemma that I've encountered before. What's the best in terms of usability when one displays personal names in a table? Should there be a single column for the name? If so, is "firstname lastname" or "lastname, firstname" preferable? Or would a column for "firstname" and a column for "lastname" be best? I'm thinking in terms of the user's desire to sort the columns. I like having a column for each name component because I can imagine that in some cases the first name will be more important to the user whereas in other cases the last name would be more important.
I would assume that many out there have had this dilemma and am looking for pearls of wisdom based on past experience.
Definitely have a column for each part. That gives you much more flexibility. So you could sort by surname, but print "firstname surname", for example.
If you don't have the screen real estate to have a column for each part, you can combine them into a single string whose format represents the sorting order. Each click on the column header cycles to the next sort order. For example:
Default: sort by last, first (ASC)
Bimbleman, Wally P.
Zonkenstein, Arnold Q.
1st click: sort by last, first (DESC)
Zonkenstein, Arnold Q.
Bimbleman, Wally P.
2nd click: sort by first, middle, last (ASC)
Arnold Q. Zonkenstein
Wally P. Bimbleman
3rd click: sort by first, middle, last (DESC)
Wally P. Bimbleman
Arnold Q. Zonkenstein
etc...
Easier to read an entire name this way (vs. having it span across columns), takes up less screen real estate, and frees you from having to decide upon a single format & sort.
As far as I know, each country has its own rules for sorting names; some countries sort by first name and some by last name. I believe the right questions here are: what is your app for? How many users will appear in those columns? And which users (age/nationality/context) are going to use your app?
Really, I agree with Skilldrick - a good UI has at least separate columns for first and last names...
But don't forget that CONSISTENCY in a UI is actually more important and makes things usable: giving the end user an implied expectation of how things are done.
You might consider calling the fields "Given Name" and "Family Name" to account for people who put their family name first. Of course this doesn't cover everyone (some people only have a given name) but it might reduce potential confusion with Chinese and Japanese names, for example.
In most cases you will find that these fields cover most scenarios: Title, Firstname, Middlename, Lastname.
In most systems I have worked with here in Australia, data is sorted by last name in the default display. Also, if you provide search on screen, the Lastname field usually comes before Firstname. Sorting by first name is just as common, though, so your system should always allow the view to switch to sorting by Firstname.
Here is a solution for a single column. I don't think separate columns can be scanned and read as quickly, although I don't have any data to back that up.
The primary focus of a user-oriented solution should be to display names as they would be read aloud, i.e. Title Firstname Middlename Lastname.
For most domains where the names are known to the user, sorting by first name is acceptable. Here is an example where a person's title is ignored in the sorting, and the sort field is clear because it is highlighted:
Arnold Q. Zonkenstein
Mr. David Cliff
Marty P. Bimbleman
For formal business oriented applications, the default sorting could be by surname. You can preserve reading order, while still sorting by last name, again using highlighting:
Marty P. Bimbleman
Mr. David Cliff
Arnold Q. Zonkenstein
If you want the sorting field to be configurable, use an explicit checkbox; the solution of clicking multiple times on the column heading to cycle between sort fields will be jarring to the user (toggling sort direction by clicking on the heading is more acceptable).
IMO this is the simplest solution without any compromises.
I have been looking at using MapReduce to build a parallelized record combining system. The language doesn't matter, I can use a pre-existing library such as Hadoop or build my own if necessary, I'm not worried about that.
The problem that I keep running into, however, is that I need the records to be matched on multiple criteria. For example: I may need to match the records based on person's name or the person's phone number, but not necessarily the person's name and phone number.
For instance, given the following keys for each record:
'John Smith' and '555-555-5555'
'Jane Smith' and '555-555-5555'
'John Smith' and '555-555-1111'
I want the system to take all three records, figure out that they match on one of the keys, and combine them into a single combined record that has both names ('John Smith' and 'Jane Smith') as well as both phone numbers ('555-555-5555' and '555-555-1111').
Is this something that I can accomplish using MapReduce? If so, how would I go about matching the keys produced by the Map function so that all of the matched records can be passed into the Reduce function?* Alternatively, is there a different/better way I could be doing this? My only real requirement is that I need it parallelized.
[*] Please note: I am assuming that the Reduce function could be used in such a way that each call to the Reduce function produces a single combined record, rather than the Reduce function producing a single result for the entire job.
You can definitely do this in the map/reduce paradigm.
Let's say that you're matching on anything containing "smith" or phone numbers starting with "555". You would canonicalize your search string into "smith|^555", for example. In the Map phase, you would do:
John Smith / 555-555-5555 → K: smith|^555, V = (John Smith,555-555-5555)
Jane Smith / 555-555-5555 → K: smith|^555, V = (Jane Smith,555-555-5555)
John Smith / 555-555-1111 → K: smith|^555, V = (John Smith,555-555-1111)
Since you've given them all the same key ("smith|^555") they will all be handed off to the same reducer instance, which would now get, as input:
K: smith|^555, V: [(John Smith,555-555-5555),(Jane Smith,555-555-5555),(John Smith,555-555-1111)]
Now, in your reducer step, you can instantiate a hashset for names and another one for numbers, and then when done processing the array of values, output all the keys from the names hashset and all the keys from the numbers hashset.
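A rough sketch of that mapper/reducer pair in plain Python (not a specific Hadoop API); the canonical key and the record layout come from the example above and are assumptions about the real matching rules:

```python
# Sketch of the mapper/reducer described above, run locally so the shuffle
# step is explicit. The canonical key "smith|^555" is the assumed match
# group from the example; real code would derive it from the matching rules.
from collections import defaultdict

def mapper(record):
    name, phone = record
    key = "smith|^555"            # same key for every record in this group
    yield key, (name, phone)

def reducer(key, values):
    names, phones = set(), set()
    for name, phone in values:
        names.add(name)
        phones.add(phone)
    # One combined record per key: every distinct name and phone number seen.
    yield key, (sorted(names), sorted(phones))

records = [("John Smith", "555-555-5555"),
           ("Jane Smith", "555-555-5555"),
           ("John Smith", "555-555-1111")]

# Minimal local "shuffle": group mapper output by key, then reduce each group.
groups = defaultdict(list)
for rec in records:
    for k, v in mapper(rec):
        groups[k].append(v)
for k, vs in groups.items():
    print(list(reducer(k, vs)))
```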
I don't think Map is useful here, because you can't really create a meaningful key for each record that will help identify the groupings of records.
It is not possible to implement this using Reduce either. Consider the example you yourself gave... If you query for 'Jane Smith', you cannot detect at the time that the first record is related to the query, and so you will ignore it. In fact you could end up chaining names and numbers together until you've got every record in the file. The only way to pick up all the matches is to iteratively scan over the list until you stop finding new links.
This is very easy to parallelize, though: you can just share out the records amongst some number of threads, and each can search its own records for new links. I'd suggest treating these sets as rings of data, so that you can record the point you were searching with the most up-to-date information, and you know you're finished once all threads have done a complete loop. A single-threaded sketch of the linking idea follows.
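For what it's worth, here is a single-threaded Python sketch of that transitive-linking idea using a union-find structure; the threaded ring-scanning scheme itself isn't modeled, and the records are the ones from the question:

```python
# Sketch: records that share a name OR a phone number end up in the same
# group, including transitive chains, via a union-find (disjoint-set)
# structure. Single-threaded; the ring-of-threads scheme isn't modeled.
from collections import defaultdict

records = [("John Smith", "555-555-5555"),
           ("Jane Smith", "555-555-5555"),
           ("John Smith", "555-555-1111")]

parent = list(range(len(records)))

def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path compression
        i = parent[i]
    return i

def union(i, j):
    parent[find(i)] = find(j)

# Link any two records that share a name or a phone number.
by_key = defaultdict(list)
for idx, (name, phone) in enumerate(records):
    by_key[("name", name)].append(idx)
    by_key[("phone", phone)].append(idx)
for idxs in by_key.values():
    for other in idxs[1:]:
        union(idxs[0], other)

# Collect each group into one combined record.
combined = defaultdict(lambda: (set(), set()))
for idx, (name, phone) in enumerate(records):
    names, phones = combined[find(idx)]
    names.add(name)
    phones.add(phone)
print(list(combined.values()))
# -> one group containing both names and both phone numbers
```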