Statistical Spell Checking: General Approach and Avoiding Feedback Loops

A database I am building has a large number of names that are often repeated. However, many names have misspellings, and I want to attempt to automatically correct the misspellings. I do not know the correct list of names beforehand.
My current approach has been to keep a list of the top N names along with their frequencies in the corpus. Then, when a new name is input into the database, I find the name which maximizes term_freq(name) / edit_distance(new_name, name). That is, among the top N names I find the existing name with the highest frequency divided by its edit distance from the new name.
Is this a sound approach to checking for names? Am I going about this incorrectly?
I am concerned that if the system sees a group of documents that misspell a name, the misspelling can make it into the top N names and then cause other names to be "corrected" toward it as well.

So, first of all... I'm obligated to say that I suspect you'll create more trouble with this than you save, and that you might be happier just requiring everyone to spell things correctly (and accept that things will break in a nice, predictable way if they don't). That being said, if you're going to do things this way, I would look to Bayes' law:
P(H|E) = P(H) * P(E|H) / P(E)
(That reads: "The probability of a hypothesis given certain evidence is equal to the probability of the hypothesis multiplied by the probability of the evidence given the hypothesis, divided by the probability of the evidence.")
In this case, you would be considering the probability that a name is a misspelled version of another name, given the frequencies and some preconceived expectations about the likelihood of typos:
P(H) would be the probability of (some number of specific typos | length of name)
P(E) would be the base probability of seeing this name (for this, I would keep the un-modified list of all entered words)
P(E|H) would be the sum of probabilities of all names that are (some number of typos) away from this name.
I would then compare the results of that to the probability that they were right in the first place, i.e., P(Zero Typos | Length of Name) * P(This Spelling).
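For concreteness, here is a minimal Python sketch of that comparison. Everything specific in it is an assumption on my part rather than something from the post: the example corpus, the per-character typo rate, the binomial typo model, and the add-one smoothing are placeholders meant only to show the shape of the calculation.

```python
from collections import Counter
from math import comb

def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]

# Illustrative corpus of previously entered names and their frequencies.
corpus = Counter({"jonathan": 120, "johnathan": 15, "jon": 40})
total = sum(corpus.values())

TYPO_RATE = 0.02  # assumed per-character typo probability; tune to your data

def p_typos(k, length):
    """Rough P(exactly k typos | name of this length), binomial-style."""
    return comb(length, k) * TYPO_RATE ** k * (1 - TYPO_RATE) ** (length - k)

def best_interpretation(entered):
    """Compare 'entered is spelled correctly' vs. 'entered is a typo of a known name'."""
    # P(correct) ~ P(zero typos | length) * P(this spelling), with add-one smoothing.
    best = entered
    best_p = p_typos(0, len(entered)) * (corpus.get(entered, 0) + 1) / (total + 1)
    for name, freq in corpus.items():
        if name == entered:
            continue
        k = edit_distance(entered, name)
        p = p_typos(k, len(name)) * freq / total   # P(k typos | length) * P(name)
        if p > best_p:
            best, best_p = name, p
    return best

print(best_interpretation("jonathen"))  # corrected to "jonathan" under these assumptions
```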

Related

What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or 55%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/datastructure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.
A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix by a matrix of weights for character positions.
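For what it's worth, the plain position-by-position comparison described in the question, with optional per-position weights, is only a few lines. A minimal sketch (the function name and the equal-length assumption are mine):

```python
def weighted_match(seq_a, seq_b, weights=None):
    """Position-by-position similarity: the fraction of total weight carried by
    positions where the two sequences agree. Comparison is strictly pairwise,
    so a one-position shift counts as a mismatch."""
    if weights is None:
        weights = [1.0] * len(seq_a)
    matched = sum(w for a, b, w in zip(seq_a, seq_b, weights) if a == b)
    return matched / sum(weights)

print(weighted_match("abchytreq", "jbohytbbq"))                  # 5/9, about 0.56
print(weighted_match("abchytreq", "jbohytbbq",
                     [2, 1, 1, 1, 1, 1, 1, 1, 1]))               # 0.5 with the first element weighted double
```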
I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first sqrt(100) = 10 of them, then marries the next mate with a better score than the best of the first ten; she might not be perfect, but she's good enough. There's some theorem of statistics that says the square root of the population size is the right cutoff, but I don't know what it's called.
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
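A minimal sketch of that two-stage idea, with a deliberately cheap prefix-based partial score standing in for whatever partial function fits your data (both scoring functions here are placeholders, not anything from the question):

```python
def cheap_score(target, word, prefix_len=5):
    """Cheap partial score: agreement over only the first few characters."""
    a, b = target[:prefix_len], word[:prefix_len]
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def full_score(target, word):
    """Placeholder for the real, expensive scoring function."""
    return sum(x == y for x, y in zip(target, word)) / max(len(target), len(word))

def best_matches(target, dictionary, keep=0.1, top_n=10):
    # Stage 1: rank everything by the cheap score and keep only a small fraction.
    survivors = sorted(dictionary, key=lambda w: cheap_score(target, w), reverse=True)
    survivors = survivors[:max(1, int(len(survivors) * keep))]
    # Stage 2: apply the full scoring function only to the survivors.
    return sorted(survivors, key=lambda w: full_score(target, w), reverse=True)[:top_n]

dictionary = ["jbohytbbq", "abchytxyz", "zzzzzzzzz", "abchytre"]
print(best_matches("abchytreq", dictionary, keep=0.5))
```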

Alternative to Levenshtein distance for prefixes / suffixes

I have a big city database which was compiled from many different sources. I am trying to find a way to easily spot duplicates based on city name. The naive answer would be to use the levenshtein distance. However, the problem with cities is that they often have prefixes and suffixes which are common to the country they are in.
For example:
Boulleville vs. Boscherville
These are almost certainly different cities. However, because they both end with "ville" (and both begin with "Bo") they have a rather small Levenshtein distance.
I am looking for a string distance algorithm that takes into account the position of the character to minimize the effect of prefixes and suffixes by weighting letters in the middle of the word higher than letters at the ends of the word.
I could probably write something myself but I would find it hard to believe that no one has yet published a suitable algorithm.
This is similar to stemming in Natural Language Processing (NLP).
In that field, the stem of a word is found before performing further analysis, e.g.
run => run
running => run
runs => run
(of course things like ran do not stem to run. For that one can use a lemmatizer. But I digress...). Even though stemming is far from perfect in NLP, it works remarkably well.
In your case, it may work well to stem the city names using rules specific to city names before applying Levenshtein. I'm not aware of a stemmer implementation for cities, but the rules seem on the surface to be fairly simple.
You might start with a list of prefixes and a list of suffixes (including any common variant / typo spellings) and simply remove such a prefix / suffix before checking the Levenshtein distance.
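As a rough sketch of that suggestion (the prefix and suffix lists here are illustrative examples, not a real inventory of city-name affixes):

```python
PREFIXES = ("saint-", "sankt ", "san ")        # illustrative only
SUFFIXES = ("ville", "burg", "berg", "stadt")  # illustrative only

def levenshtein(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def stem_city(name):
    """Strip one known prefix and one known suffix so the comparison
    focuses on the middle of the name."""
    n = name.lower()
    for p in PREFIXES:
        if n.startswith(p) and len(n) > len(p) + 2:
            n = n[len(p):]
            break
    for s in SUFFIXES:
        if n.endswith(s) and len(n) > len(s) + 2:
            n = n[:-len(s)]
            break
    return n

def stemmed_distance(a, b):
    return levenshtein(stem_city(a), stem_city(b))

print(stem_city("Boulleville"), stem_city("Boscherville"))   # boulle boscher
print(stemmed_distance("Boulleville", "Boscherville"))       # large relative to the short stems
```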
On a side note, if you have additional address information (such as a street address or zip/postal code), there exists address normalization software for many countries that will find the best match based on address-specific algorithms.
A pretty simple way to do it would be to just remove the common prefix and suffix before doing the distance calculation. The absolute distance between the resulting strings will be the same as with the full strings, but relative to the shorter remaining length the distance is much greater.
Also keep in mind that in general even grievous misspellings get the first letter right. It's highly likely, then, that Cowville and Bowville are different cities, even though their Levenshtein distance is only 1.
You can make your job a lot easier by, at least at first, not doing the distance calculation if two words start with different letters. They're likely to be different. Concentrate first on removing duplicates of words that start with the same letter. If, after that, you still have a large number of potential duplicates, you can refine your distance threshold to more closely examine words that start with different letters.
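A tiny sketch of that first-letter blocking idea (the city names are purely illustrative); it simply never generates cross-letter pairs on the first pass:

```python
from collections import defaultdict

def candidate_pairs(cities):
    """Only pair up names that share a first letter; cross-letter pairs are left
    for a later, stricter pass (if they are examined at all)."""
    buckets = defaultdict(list)
    for c in cities:
        buckets[c[0].lower()].append(c)
    for group in buckets.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                yield group[i], group[j]

for a, b in candidate_pairs(["Bowville", "Cowville", "Boulleville", "Boscherville"]):
    print(a, "<->", b)   # Cowville is never compared against the B-names
```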

How to apply the Levenshtein distance to a set of target strings?

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where m is the cardinality of TARGET and n is the cardinality of SOURCE. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also perform the sum of all the standard deviations. The result is a column vector, in which each element (let's call it standard_deviation_sum(i)) refers to a string of TARGET.
The string associated with the smallest standard_deviation_sum could be the sentence pronounced by the user. Can the method I used be considered correct? Or are there other methods?
Obviously, values that are too high indicate that the sentence pronounced by the user probably does not belong to TARGET.
I'm not an expert, but your proposal does not make sense. First of all, in practice I'd expect the cardinality of TARGET to be very large if not infinite. Second, I don't believe the Levenshtein distance or some similar similarity metric will be useful.
If:
you could really define SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted of phonemes rather than characters,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p,q) is minimum. Since you especially cannot guarantee the equal-probability part, I think you should think about the problem from scratch, do some research and make a completely different design. The usual methodology for speech recognition is to use Hidden Markov Models. I would start from there.
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word is "chees" and the target set is "cheese", "chess". You must calculate P(cheese|chees) and P(chess|chees). What I'm trying to say is that not every substitution is equiprobable. If you model probabilities as distances between strings, then at the very least you must allow that, for example, d("c","s") < d("c","q") (it is common to confuse c and s, but it is not common to confuse c and q). Adapting the distance calculation algorithm is easy; coming up with good values for all pairs is difficult.
Also you must somehow estimate P(cheese|context) and P(chess|context). If we are talking about board games, chess is more probable. If we are talking about dairy products, cheese is more probable. This is why you'll need large amounts of data to come up with such estimates. This is also why Hidden Markov Models are good for this kind of problem.
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (e.g., assume there are k different sets of letters whose substitution rates differ...).
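For illustration, here is a sketch of an edit distance with those log-probability penalties. The operation probabilities are made up (in practice you would estimate them from aligned training data as described above), and the costs are written as negative logs so the usual minimizing dynamic program applies; a smaller total then means a more probable rendering of the source.

```python
from math import log

# Assumed, made-up operation probabilities; estimate these from data (e.g. via EM).
K  = 26           # number of symbols in the system
PI = 0.01         # insertion probability
PD = 0.01         # deletion probability
PS = 0.03         # substitution probability
PP = 1 - PS - PD  # probability of observing the same symbol

# Negative log penalties, following the log(pi/pp/k), log(pd/pp), log(ps/pp/(k-1)) form.
COST_INS   = -log(PI / PP / K)
COST_DEL   = -log(PD / PP)
COST_SUB   = -log(PS / PP / (K - 1))
COST_MATCH = 0.0   # matches are the baseline; pp cancels out of the ratios

def weighted_edit_distance(source, target):
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + COST_DEL
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + COST_INS
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = COST_MATCH if source[i - 1] == target[j - 1] else COST_SUB
            d[i][j] = min(d[i - 1][j] + COST_DEL,
                          d[i][j - 1] + COST_INS,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

print(weighted_edit_distance("chees", "cheese"))
print(weighted_edit_distance("chees", "chess"))
```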

Grouping individuals into families

We have a simulation program where we take a very large population of individual people and group them into families. Each family is then run through the simulation.
I am in charge of grouping the individuals into families, and I think it is a really cool problem.
Right now, my technique is pretty naive/simple. Each individual record has some characteristics, including married/single, age, gender, and income level. For married people I select an individual and loop through the population and look for a match based on a match function. For people/couples with children I essentially do the same thing, looking for a random number of children (selected according to an empirical distribution) and then loop through all of the children and pick them out and add them to the family based on a match function. After this, not everybody is matched, so I relax the restrictions in my match function and loop through again. I keep doing this, but I stop before my match function gets too ridiculous (marries 85-year-olds to 20-year-olds for example). Anyone who is leftover is written out as a single person.
This works well enough for our current purposes, and I'll probably never get time or permission to rework it, but I at least want to plan for the occasion or learn some cool stuff - even if I never use it. Also, I'm afraid the algorithm will not work very well for smaller sample sizes. Does anybody know what type of algorithms I can study that might relate to this problem or how I might go about formalizing it?
For reference, I'm comfortable with chapters 1-26 of CLRS, but I haven't really touched NP-Completeness or Approximation Algorithms. Not that you shouldn't bring up those topics, but if you do, maybe go easy on me because I probably won't understand everything you are talking about right away. :) I also don't really know anything about evolutionary algorithms.
Edit: I am specifically looking to improve the following:
Less ridiculous marriages.
Fewer single people at the end.
Perhaps what you are looking for is cluster analysis?
Let's try to think of your problem like this (starting by solving the spouse matching):
If you were to have a matrix where each row is a male and each column is a female, and every cell in that matrix is the match function's returned value, what you are now looking for is a selection of cells such that no row or column contains more than one selected cell, and the total sum of all selected cells is maximal. This is very similar to the N Queens Problem, with the modification that each allocation of a "queen" has a reward (which we should maximize).
You could solve this problem by using a graph where:
You have a root,
each of the first row's cells' values is an edge's weight leading to first depth vertices
each of the second row's cells' values is an edge's weight leading to second depth vertices..
Etc.
(Notice that when you find a match to the first female, you shouldn't consider her anymore, and so for every other female you find a match to)
Then finding the maximum allocation can be done by BFS, or better still by A* (notice A* typically looks for minimum cost, so you'll have to modify it a bit).
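As a side note, the "one cell per row and per column, maximal total" formulation is exactly the classic assignment problem, so instead of hand-rolling the graph search you can hand the score matrix to an off-the-shelf solver. A sketch using SciPy's Hungarian-style solver (the score matrix here is invented for the example):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian-style assignment solver

# Illustrative match-function scores: rows are males, columns are females.
scores = np.array([
    [0.9, 0.2, 0.4],
    [0.3, 0.8, 0.1],
    [0.5, 0.6, 0.7],
])

# Pick one cell per row and per column so the total score is maximal.
rows, cols = linear_sum_assignment(scores, maximize=True)
for m, f in zip(rows, cols):
    print(f"male {m} <-> female {f} (score {scores[m, f]})")
print("total score:", scores[rows, cols].sum())
```

The same matrix can then be rebuilt with relaxed match functions for whoever remains unmatched, mirroring the looser passes described in the question.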
For matching between couples (or singles, more on that later...) and children, I think KNN with some modifications is your best bet, but you'll need to optimize it to your needs. But now I have to address your edit...
How do you measure your algorithm's efficiency?
You need a function that receives the expected distribution of all states (single, married with one child, single with two children, etc.), and the distribution of all states in your solution, and grades the solution accordingly. How do you calculate the expected distribution? That's quite a bit of statistics work...
First you need to know the distribution of all states (single, married.. as mentioned above) in the population,
then you need to know the distribution of ages and genders in the population,
and last thing you need to know - the distribution of ages and genders in your population.
Only then, according to those three, can you calculate how many people you expect to be in each state. And then you can measure the distance between what you expected and what you got... That is a lot of typing... sorry for the general parts.
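A minimal sketch of such a grading function, comparing expected and observed state proportions with a sum of squared differences (the state names and numbers are purely illustrative):

```python
def distribution_error(expected, observed):
    """Sum of squared differences between expected and observed state proportions;
    lower is better."""
    states = set(expected) | set(observed)
    return sum((expected.get(s, 0.0) - observed.get(s, 0.0)) ** 2 for s in states)

expected = {"single": 0.30, "married_no_kids": 0.25, "married_with_kids": 0.45}
observed = {"single": 0.38, "married_no_kids": 0.22, "married_with_kids": 0.40}
print(distribution_error(expected, observed))
```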

Subset generation by rules

Let's say that we have 5000 users in a database. Each user row has a sex column, a place-of-birth column, and a status (married or not married) column.
How to generate a random subset (let's say 100 users) that would satisfy these conditions:
40% should be male and 60% female
50% should be born in USA, 20% born in UK, 20% born in Canada, 10% in Australia
70% should be married and 30% not.
These conditions are independent of one another; that is, we cannot simply do this:
(0.4 * 0.5 * 0.7) * 100 = 14 users that are males, born in USA and married
(0.4 * 0.5 * 0.3) * 100 = 6 users that are males, born in USA and not married.
Is there an algorithm for this kind of generation?
Does the breakdown need to be exact, or approximate? Typically if you are generating a sample like this then you are doing some statistical study, so it is sufficient to generate an approximate sample.
Here's how to do this:
Have a function genRandomIndividual().
Each time you generate an individual, use the random function to choose the sex: male with probability 40%.
Choose the birth location using the random function again (generate a real number in the interval 0-1; if it falls in 0-.5, choose USA, if .5-.7 then UK, if .7-.9 then Canada, otherwise Australia).
Choose married status using random function (again generate in 0-1, if 0-.7 then married, otherwise not).
Once you have a set of characteristics, search the database for the first individual who satisfies these characteristics, add them to your sample, and tag them as already added in the database. Keep doing this until you have fulfilled your sample size.
There may be no individual who satisfies the characteristics; in that case, just generate a new random individual instead. Since the draws are independent and generate the characteristics according to the required probabilities, in the end you will have a sample of the correct size whose individuals were generated randomly according to the specified probabilities.
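Here is a rough Python rendering of that procedure. The user records are plain (sex, country, married) tuples, and the attempt cap is my addition so the sketch cannot loop forever on an unlucky or unsatisfiable database:

```python
import random

def gen_random_individual():
    """Draw a target profile according to the required marginal probabilities."""
    sex = "male" if random.random() < 0.40 else "female"
    r = random.random()
    if r < 0.50:
        country = "USA"
    elif r < 0.70:
        country = "UK"
    elif r < 0.90:
        country = "Canada"
    else:
        country = "Australia"
    married = random.random() < 0.70
    return sex, country, married

def sample(users, size=100, max_attempts=100_000):
    """users: list of (sex, country, married) tuples; returns up to `size` of them."""
    remaining = list(users)
    chosen = []
    attempts = 0
    while len(chosen) < size and attempts < max_attempts:
        attempts += 1
        target = gen_random_individual()
        for i, user in enumerate(remaining):
            if user == target:                 # first user matching the drawn profile
                chosen.append(remaining.pop(i))
                break
        # If nobody matched, fall through and draw a new random profile.
    return chosen

# Tiny illustrative database; in practice this comes from your real user table.
pool = [gen_random_individual() for _ in range(5000)]
print(len(sample(pool, 100)))
```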
You could try something like this:
Pick a random initial set of 100
Until you have the right distribution (or give up):
Pick a random record not in the set, and a random one that is
If swapping in the other record gets you closer to the set you want, exchange them. Otherwise, don't.
I'd probably use the sum of squares of distance to the desired distribution as the metric for deciding whether to swap.
That's what comes to mind that keeps the set random. Keep in mind that there may be no subset which matches the distribution you're after.
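A sketch of that swap-and-keep-if-better loop, using the suggested sum of squares of distances to the desired proportions as the metric (the attribute tests, target shares, and step limit are example values I picked):

```python
import random

def proportion_error(subset, targets):
    """Sum of squared differences between the subset's proportions and the targets.
    `targets` is a list of (test_function, desired_share) pairs."""
    n = len(subset)
    return sum((sum(1 for row in subset if test(row)) / n - want) ** 2
               for test, want in targets)

def refine(population, size, targets, steps=10_000):
    """Start from a random subset, then keep proposing single swaps that reduce
    the error; stop when the targets are hit exactly or we give up."""
    idx = list(range(len(population)))
    random.shuffle(idx)
    chosen, rest = idx[:size], idx[size:]
    err = proportion_error([population[i] for i in chosen], targets)
    for _ in range(steps):
        a, b = random.randrange(size), random.randrange(len(rest))
        chosen[a], rest[b] = rest[b], chosen[a]            # tentative swap
        new_err = proportion_error([population[i] for i in chosen], targets)
        if new_err < err:
            err = new_err                                   # keep the swap
        else:
            chosen[a], rest[b] = rest[b], chosen[a]         # undo it
        if err == 0:
            break
    return [population[i] for i in chosen]

population = [{"sex": random.choice(["male", "female"]),
               "married": random.random() < 0.6}
              for _ in range(5000)]
targets = [(lambda r: r["sex"] == "male", 0.40),   # 40% male
           (lambda r: r["married"], 0.70)]         # 70% married
subset = refine(population, 100, targets)
print(proportion_error(subset, targets))
```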
It is important to note that you may not be able to find a subset that satisfies these conditions. To take an example, suppose your database contained only American males, and only Australian females. Clearly you could not generate any subset that satisfies your distribution constraints.
(Rewrote my post completely (actually, wrote a new one and deleted the old) because I thought of a much simpler and more efficient way to do the same thing.)
I'm assuming you actually want the exact proportions and not just to satisfy them on average. This is a pretty simple way to accomplish that, but depending on your data it might take a while to run.
First, arrange your original data so that you can access each combination of types easily, that is, group married US men in one pile, unmarried US men in another, and so on. Then, assuming that you have p conditions and you want to select k elements, make p arrays of size k each; one array will represent one condition. Make the elements of each array be the types of that condition, in the proportions that you require. So, in your example, the gender array would have 40 males and 60 females.
Now, shuffle each of the p arrays independently (actually, you can leave one array unshuffled if you like). Then, for each index i, take the type of the picked element to be the combination of the shuffled p arrays at index i, and pick one element of that type at random from the remaining ones in your original data, removing it. If there are no elements of that type left, the algorithm has failed, so reshuffle the arrays and start again to pick elements.
To use this, you need to first make sure that the conditions are satisfiable at all because otherwise it will just loop infinitely. To be honest, I don't see a simple way to verify that the conditions are satisfiable, but if the number of elements in your original data is large compared to k and their distribution isn't too skewed, there should be solutions. Also, if there are only a few ways in which the conditions can be satisfied, it might take a long time to find one; though the method will terminate with probability 1, there is no upper bound that you can place on the running time.
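A sketch of that shuffled-parallel-arrays method, assuming each record is a dict and that the per-column counts already sum to the subset size k (the column names, the demo data, and the restart cap are mine):

```python
import random
from collections import defaultdict

def parallel_array_sample(rows, k, columns, max_restarts=1000):
    """columns maps a column name to {value: count out of k},
    e.g. {"sex": {"male": 40, "female": 60}, ...}."""
    # Group the original data by its full combination of column values ("piles").
    piles = defaultdict(list)
    for row in rows:
        piles[tuple(row[c] for c in columns)].append(row)

    for _ in range(max_restarts):
        # One array per condition, holding its values in the required proportions.
        arrays = []
        for col, counts in columns.items():
            arr = [value for value, count in counts.items() for _ in range(count)]
            random.shuffle(arr)
            arrays.append(arr)

        pools = {combo: list(members) for combo, members in piles.items()}
        sample = []
        for i in range(k):
            combo = tuple(arr[i] for arr in arrays)
            if not pools.get(combo):
                break                          # no element of this type left: reshuffle and restart
            sample.append(pools[combo].pop(random.randrange(len(pools[combo]))))
        else:
            return sample                      # all k picks succeeded
    raise RuntimeError("gave up: the constraints may be unsatisfiable or very tight")

columns = {"sex": {"male": 40, "female": 60},
           "born": {"USA": 50, "UK": 20, "Canada": 20, "Australia": 10},
           "married": {True: 70, False: 30}}
rows = [{"sex": random.choice(["male", "female"]),
         "born": random.choice(["USA", "UK", "Canada", "Australia"]),
         "married": random.choice([True, False])}
        for _ in range(5000)]
print(len(parallel_array_sample(rows, 100, columns)))
```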
Algorithm may be too strong a word, since to me that implies formalism and publication, but there is a method to select subsets with exact proportions (assuming your percentages yield whole numbers of subjects from the sample universe), and it's much simpler than the other proposed solutions. I've built one and tested it.
Incidentally, I'm sorry to be a slow responder here, but my time is constrained these days. I wrote a hard-coded solution fairly quickly, and since then I've been refactoring it into a decent general-purpose implementation. Because I've been busy, that's still not complete yet, but I didn't want to delay answering any longer.
The method:
Basically, you're going to consider each row separately, and decide whether it's selectable based on whether your criteria give you room to select each of its column values.
In order to do that, you'll consider each of your column rules (e.g., 40% males, 60% females) as an individual target (e.g., given a desired subset size of 100, you're looking for 40 males, 60 females). Make a counter for each.
Then you loop, until you've either created your subset, or you've examined all the rows in the sample universe without finding a match (see below for what happens then). This is the loop in pseudocode:
- Randomly select a row.
- Mark the row examined.
- For each column constraint:
* Get the value for the relevant column from the row
* Test for selectability:
If there's a value target for the value,
and if we haven't already selected our target number of incidences of this value,
then the row is selectable with respect to this column
* Else: the row fails.
- If the row didn't fail, select it: add it to the subset
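A compact Python rendering of that loop, assuming each row is a dict and that the targets map each column to the required count of each value (the column names and demo data are illustrative); the retry wrapper corresponds to the "patience test" discussed below:

```python
import random

def select_subset(rows, size, targets):
    """targets: {column: {value: required count}}, with counts per column summing to `size`.
    Returns a subset matching the targets exactly, or None if this pass fails."""
    remaining = {col: dict(counts) for col, counts in targets.items()}
    order = list(range(len(rows)))
    random.shuffle(order)                       # "randomly select a row"
    subset = []
    for i in order:                             # each row is examined at most once
        row = rows[i]
        # Selectable only if every column value still has quota left
        # (a value with no target at all counts as quota zero, i.e. the row fails).
        if all(remaining[col].get(row[col], 0) > 0 for col in remaining):
            subset.append(row)
            for col in remaining:
                remaining[col][row[col]] -= 1
            if len(subset) == size:
                return subset
    return None                                 # ran out of rows without filling the subset

def select_with_retries(rows, size, targets, retries=50):
    """The 'patience test': retry a bounded number of times before giving up."""
    for _ in range(retries):
        result = select_subset(rows, size, targets)
        if result is not None:
            return result
    return None

targets = {"sex": {"male": 40, "female": 60},
           "married": {True: 70, False: 30}}
rows = [{"sex": random.choice(["male", "female"]),
         "married": random.choice([True, False])}
        for _ in range(5000)]
print(select_with_retries(rows, 100, targets) is not None)
```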
That's the core of it. It will provide a subset which matches your rules, or it will fail to do so... which brings me to what happens when we can't find a match.
Unsatisfiability:
As others have pointed out, it's not always possible to satisfy any arbitrary set of rules for any arbitrary sample universe. Even assuming that the rules are valid (percentages for each value sum to 100), the subset size is less than the universe size, and the universe does contain enough individuals with each selected value to hit the targets, it's still possible to fail if the values are actually not distributed independently.
Consider the case where all the males in the sample universe are Australian: in this case, you can only select as many males as you can select Australians, and vice-versa. So a set of constraints (subset size: 100; male: 40%; Australian 10%) cannot be satisfied at all from such a universe, even if all the Australians we select are male.
If we change the constraints (subset size: 100; male: 40%; Australian 40%), now we can possibly make a matching subset, but all of the Australians we select must be male. And if we change the constraints again (subset size: 100; male: 20%; Australian 40%), now we can possibly make a matching subset, but only if we don't pick too many Australian women (no more than half in this case).
In this latter case, selection order is going to matter. Depending on our random seed, sometimes we might succeed, and sometimes we might fail.
For this reason, the algorithm must be prepared to retry (and my implementation is). I think of this as a patience test: the question is how many times we are willing to let it fail before we decide that the constraints are not compatible with the sample population.
Suitability:
This method is well suited to the OP's task as described: selecting a random subset which matches given criteria. It is not suited to answering a slightly different question: "is it possible to form a subset with the given criteria?"
My reasoning for this is simple: the situations in which the algorithm fails to find a subset are those in which the data contains unknown linkages, or where the criteria allow a very limited number of subsets from the sample universe. In these cases, the use of any such subset for statistical analysis would be questionable, at least without further thought.
But for the purpose of answering the question of whether it's possible to form a subset, this method is non-deterministic and inefficient. It would be better to use one of the more complex shuffle-and-sort algorithms proposed by others.
Pre-Validation:
The immediate thought upon discovering that not all subsets can be satisfied is to perform some initial validation, and perhaps to analyze the data to see whether it's answerable or only conditionally answerable.
My position is that other than initially validating that each of the column rules is valid (i.e., the column percentages sum to 100, or near enough) and that the subset size is less than the universe size, there's no other prior validation which is worth doing. An argument can be made that you might want to check that the universe contains enough individuals with each selected value (e.g., that there actually are 40 males and 60 females in the universe), but I haven't implemented that.
Other than those, any analysis to identify linkages in the population is itself so time-consuming that you might be better served just running the thing with more retries. Maybe that's just my lack of statistics background talking.
Not quite the subset sum problem:
It has been suggested that this problem is like the subset sum problem. I contend that this is subtly and yet significantly different. My reasoning is as follows: for the subset sum problem, you must form and test a subset in order to answer the question of whether it meets the rules: it is not possible (except in certain edge conditions) to test an individual element before adding it to the subset.
For the OP's question, though, it is possible. As I'll explain, we can randomly select rows and test them individually, because each has a weight of one.

Resources