algorithm for customer name verification - algorithm

Is there any algorithm or standard to verify customer names in different formats.
I mean,
J. Smith
John Smith
John L. Smith
J. Louis Smith
John Louis S.
Could be the same person and should pass the validation.
Thanks

The accepted answer of Figure out if a business name is very similar to another one - Python will definitely help you out as I myself have worked on a very similar approach to normalize names.
Note that a single standalone metric is not going to suffice. An ensemble approach will have to implemented taking character N Gram matching, Edit Distance and so on into account which ultimately returns a strength of the matched words. Devise a formula for calculating strength of your matched keywords and once your list of names is exhausted just re-run the Algorithm for the names/words which have a strength below a particular threshold set by you. This make the names then resonate to some other cluster of names where the match/strength value is more strong.
Also you will have to watch out for precision/recall trade-off. With the above approach I have seen that the precision is too good but the recall is not that great.

Related

Which algorithm for assigning pattern shifts

I'm looking for a good algorithm or technic to find the best solution for the following problem. First, I’ll introduce the context and then, the problem.
I work for a company with more than 2000 employees; all of them work with pattern shift, this means that any employee has a pattern which specifies the sequence of workday and free day. We have these patterns:
5-2-5-2 (5 days work, 2 free, 5 days work, 2 free) and so on.
5-2-4-3
5-4-5-3
5-3-5-3
At this moment we have all these patterns and different numbers of start, that is to say, a pattern can start at a certain date at a specific part inside the pattern, for e.g., the pattern 5-4-5-3 has 17 possible starting sequences, this number is a sum of 5+4+5+3 = 17 possible sequences.
https://en.wikipedia.org/wiki/Shift_plan
Now the problem,
Every 6 months each employee can change the pattern and start in any sequence number of the pattern.
But we must analyze all the requirements and accept or reject in order to obtain the better combination for the company operation, because we need that every day have the same work force but we understand that this is impossible but the algorithm will help us find a good solution, not perfect.
I was reading about the "Nurse scheduling problem" with Google Or-Tool but I don’t understand how to set Pattern Sequence to create a solution for this problem. I read some opinions about GA (genetic algorithms) and all of them said that this kind of solution is not good for this kind of problem.
Does anyone have a similar problem? Can someone give me a more accurate example with Google OR-tools than the example in GitHub.
it's not necessary to find a strictly optimal solution; the roster is currently done manual, and I'm pretty sure the result is considerably sub-optimal most of the time.
Does anyone have a similar problem?
Sounds a lot like OptaWeb Employee Rostering, which is a vertical on top of OptaPlanner, the constraint solver. Take a look at the source code. It's all open source.
I think this can be modeled as a MIP model.
Thinking out loud:
Introduce a binary decision variable:
δ(i,p) = 1 if pattern i is selected for person p
0 otherwise
This includes the current pattern (say i=0). This would allow the cases:
an employee does not submit a new pattern (then we only have i=0 for this employee)
an employee submits one or more preferable patterns
We have the constraints:
sum(i, δ(i,p)) = 1 ∀p
sum((i,p), pattern(i,p,t)*δ(i,p)) ≈ requiredlevel(t) ∀t
δ(i,p) ∈ {0,1}
here pattern(i,p,t) describes pattern i: it is 1 if period t is covered when pattern (i,p) is used and 0 otherwise. Here I use ≈ to indicate "approximately". (This is easily modeled using slacks and possibly penalty terms in the objective).
Now we maximize
maximize sum((i,p), weight(i,p) * δ(i,p))
where weight(i,p) indicates the preference for a pattern (e.g. weight(0,p)=0 i.e. no bonus points when not selected a newly, preferred pattern).
Something like this should not be too difficult to set up. Of course many refinements are possible. These type of model tend to solve quite quickly.
What is the workflow ?
If you have a fixed roster, and one person proposes a new pattern. Just remove this person contribution, test all (17) starting points of the new pattern and score them.
If you can change patterns, or starting points for more than 1 employee, create an integer variable per starting point. From this starting point, it is easy to compute the persons contribution for each shifted day of the pattern. Then you can optimize quality of service w.r.t. the starting points of each pattern, summing potential contributions per day of the week for each employee.
Is that clear ?

most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant to detect two different persons names which are actually the same? (different because of a mispelling). The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimple example. Are their programmers and computer scientists who worked out this issue in more detail?
I can suggest an information-retrieval technique of doing so, but it requires a large collection of documents in order to work properly.
Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.
Once you get a name (Obaama for example): retrieve the set of collections the word Obaama appears in. Let this set be D1.
Now, for each word w in D11 search for Obaama AND w (using your IR system). Let the set be D2.
The score |D2|/|D1| is an estimation how much w is connected to Obaama, and most likely will be close to 1 for w=Obama2.
You can manually label a set of examples and find the value from which words will be expected.
Using a standard lexicographical similarity technique you can chose to filter out words that are definetly not spelling mistakes (Like Barack).
Another solution that is often used requires a query log - find a correlation between searched words, if obaama has correlation with obama in the query log - they are connected.
1: You can improve performance by first doing the 2nd filter, and check only for candidates who are "similar enough" lexicographically.
2: Usually a normalization is also used, because more frequent words are more likely to be in the same documents with any word, regardless of being related or not.
You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Privacy and Anonymization "Algorithm"

I read this problem in a book (Interview Question), and wanted to discuss this problem, in detail over here. Kindly throw some lights on it.
The problem is as follows:-
Privacy & Anonymization
The Massachusetts Group Insurance Commission had a bright idea back in the mid 1990s - it decided to release "anonymized" data on state employees that showed every single hospital visit they had.
The goal was to help the researchers. The state spent time removing identifiers such as name, address and social security no. The Governor of Masachusetts assured the public that this was sufficient to protect patient privacy.
Then a graduate student, saw significant pitfalls in this approach. She requested a copy of the data and by collating the data in multiple columns, she was able to identify the health records of the Governor.
This demonstrated that extreme care needs to be taken in anonymizing data. One way of ensuring privacy is to aggregate data such that any record can be mapped to at least k individuals, for some large value of k.
I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization. I hope you are clear with the question.....
I have no experienced person, who can help me deal with such kind of problems. Kindly don't put votes to close this question..... As I would be helpless, if this happens...
Thanks & if any more explanation in question required, kindly shoot with the questions.
I just copy pasted part of your text, and stumbled upon this
This helps understanding your problem :
At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.
Boom! But it was only an early mile marker in Sweeney's career; in 2000, she showed that 87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.
Well, as you stated it, you need a random database, and ensure that any record can be mapped to at least k individuals, for some large value of k.
In other words, you need to clear the database of discriminative information. For example, if you keep in the database only the sex (M/F), then there is no way to found out who is who. Because there are only two entries : M and F.
But, if you take the birthdate, then your total number of entries become more or less 2*365*80 ~=50.000. (I chose 80 years). Even if your database contain 500.000 people, there is a chance that one of them (let's say a male born on 03/03/1985) is the ONLY one with such entry, thus you can recognize him.
This is only a simplistic approach that relies on combinatorial stuff. If you're wanting something more complex, look for correlated information and PCA
Edit : Let's give an example. Let's suppose I'm working with medical things. If I keep only
The sex : 2 possibilities (M, F)
The blood group : 4 possibilities (O, A, B, AB)
The rhesus : 2 possibilities (+, -)
The state they're living in : 50 possibilities (if you're in the USA)
The month of birth : 12 possibilities (affects death rate of babies)
Their age category : 10 possibilities (0-9 years old, 10-19 years old ... 90-infinity)
That leads to a total number of category of 2*4*2*50*12*10 = 96.000 categories. Thus, if your database contains 200.000.000 entries (rough approximation of the number of inhabitants in the USA that are in your database) there is NO WAY you can identify someone.
This also implies that you do not give out any further information, no ZIP code, etc... With only the 6 information given, you can compute some nice statistics (do persons born in december live longer?) but there is no identification possible because 96.000 is very inferior to 200.000.000.
However, if you only have the database of the city you live in, who has for example 200.000 inhabitants, the you cannot guaranty anonymization. Because 200.000 is "not much bigger" than 96.000. ("not much bigger" is a true complex scientifical term that requires knowledge in probabilities :P )
"I wanted to actually experience this problem, with some kind of example set, and then what it actually takes to do this anonymization."
You can also construct your own dataset by finding one alone, "anonymizing" it, and trying to reconstruct it.
Here is a very detailed discussion of the de-identification/anonymization problem, and potential tools & techniques for solving them.
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CDQQFjAA&url=https%3A%2F%2Fwww.infoway-inforoute.ca%2Findex.php%2Fcomponent%2Fdocman%2Fdoc_download%2F624-tools-for-de-identification-of-personal-health-information&ei=QiO0VL72J-3nsATkl4CQBg&usg=AFQjCNF3YUE2cl9QZTuw-L4PYtWnzmwlIQ&sig2=JE8bYkqg04auXstgF0f7Aw&bvm=bv.83339334,d.cWc
The jurisdiction for the document above is within the rules of the Canadian public health system, but they are conceptually applicable to other jurisdictions.
For the U.S., you would specifically need to comply with the HIPAA de-identification requirements. http://www.hhs.gov/ocr/privacy/hipaa/understanding/coveredentities/De-identification/guidance.html
"Conceptually applicable" does not mean "compliant". To be compliant, with the EU, for example, you would need to dig into their specific EU requirements as well as the country requirements and potentially State/local requirements.

Converting preferences to ratings

Suppose I have a list of (e.g.) restaurants. A lot of users get a list of pairs of restaurants, and select the one of the two they prefer (a la hotornot).
I would like to convert these results into absolute ratings: For each restaurant, 1-5 stars (rating can be non-integer, if necessary).
What are the general ways to go with this problem?
Thanks
I would consider each pairwise decision as a vote in favor of one of the restaurants, and each non-preferred partner as a downvote. Count the votes across all users and restaurants, and then sort cluster them equally (so that that each star "weighs" for a number of votes).
Elo ratings come to mind. It's how the chess world computes a rating from your win/loss/draw record. Losing a matchup against an already-high-scoring restaurant gets penalized less than against a low-scoring one, a little like how PageRank cares more about a link from a website it also ranks highly. There's no upper bound to your possible score; you'd have to renormalize somehow for a 1-5 star system.

K Nearest Neighbour Algorithm doubt

I am new to Artificial Intelligence. I understand K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a scale?
For example, distance of age can be easily calculated, but how do you calculate how near is red to blue? Maybe colours is a bad example because you still can say use the frequency. How about a burger to pizza to fries for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for very nice answers. It really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to do a prediction for a person whether he/she will eat at my restaurant that serves all three of the above food. Of course, there's other factors but to keep it simple, for the field of favourite food, out of 300 people, 150 loves burger, 100 loves pizza, and 50 loves fries. Common sense tells me favourite food affect peoples' decision on whether to eat or not.
So now a person enters his/her favourite food as burger and I am going to predict whether he/she's going to eat at my restaurant. Ignoring other factors, and based on my (training) previous knowledge base, common sense tells me that there's a higher chance the k nearest neighbours' distance for this particular field favourite food is nearer as compared to if he entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know and probably can't calculate the actual distance. I also worry about this field putting too much/too little weight on my prediction because the distance probably isn't to scale with other factors (price, time of day, whether the restaurant is full, etc that I can easily quantify) but I guess I might be able to get around it with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.
Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
Burger | Pizza | Fries | Burritos | Likes my food
person1 1 | 0 | 1 | 1 | 1
person2 0 | 0 | 1 | 0 | 0
person3 1 | 1 | 0 | 1 | 1
person4 0 | 1 | 1 | 1 | 0
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find K nearest neighbors and make some decision..
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).
Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.
This is one of the problems of knowledge representation in AI. Subjectively plays a big part. Would you and me agree, for example, on the "closeness" of a burger, pizza and fries?
You'd probably need a look up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.
If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.
I would actually present pairs of these attributes to users and ask them to define their proximity. You would present them with a scale reaching from [synonym..very foreign] or similar. Having many people do this you will end up with a widely accepted proximity function for the non-linear attribute values.
There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.
Good answers. You could just make up a metric, or, as malach suggests, ask some people. To really do it right, it sounds like you need bayesian analysis.

Resources