How to apply the Levenshtein distance to a set of target strings? - algorithm

Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where m is the cardinality of TARGET and n is the cardinality of SOURCE. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also compute the sum of all the standard deviations. The result is a column vector in which each element (let's call it standard_deviation_sum(i)) refers to a string of TARGET.
The string associated with the smallest standard_deviation_sum could be the sentence pronounced by the user. Can this method be considered correct? Or are there other methods?
Obviously, values that are too high indicate that the sentence pronounced by the user probably does not belong to TARGET.

I'm not an expert, but your proposal does not make sense to me. First of all, in practice I'd expect the cardinality of TARGET to be very large, if not infinite. Second, I don't believe the Levenshtein distance or some similar similarity metric will be useful.
If:
you could really define SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted of not characters but phonemes,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p,q) is minimum. Especially since you cannot guarantee the equal-probability part, I think you should rethink the problem from scratch, do some research and come up with a completely different design. The usual methodology for speech recognition is to use hidden Markov models. I would start from there.
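For what it's worth, the "minimum distance pair" idea can be sketched in a few lines. This is only a rough illustration under the idealized assumptions listed above, using plain unit-cost Levenshtein distance; the function names levenshtein and closest_pair are made up for this example. It works on any sequences, so lists of phonemes can be passed instead of strings.

```python
from itertools import product


def levenshtein(a, b):
    """Unit-cost edit distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def closest_pair(source, target):
    """Return (distance, p, q) for the pair p in source, q in target
    with the smallest edit distance (ties broken alphabetically)."""
    return min((levenshtein(p, q), p, q) for p, q in product(source, target))
```

Calling closest_pair(SOURCE, TARGET) then gives the chosen target as the third element of the result. Note that this treats every edit as equally likely, which is exactly the weakness discussed in the rest of this answer.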
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word is "chees" and the target set is "cheese", "chess". You must calculate P(cheese|chees) and P(chess|chees). What I'm trying to say is that not every substitution is equiprobable. If you model probabilities as distances between strings, then at the very least you must allow that, for example, d("c","s") < d("c","q"). (It is common to confuse the letters c and s, but it is not common to confuse c and q.) Adapting the distance calculation algorithm is easy; coming up with good values for all pairs is difficult.
Also, you must somehow estimate P(cheese|context) and P(chess|context). If we are talking about board games, chess is more probable. If we are talking about dairy products, cheese is more probable. This is why you'll need large amounts of data to come up with such estimates. This is also why hidden Markov models are good for this kind of problem.
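To make the two points above concrete, here is a rough sketch of an edit distance with non-uniform substitution costs, combined with a context prior. Everything numeric here is invented for illustration: the CONFUSABLE table, the cost 0.3 and the priors passed to best_target are assumptions, not measured values. Treating the distance as a negative log-likelihood is also an assumption of this sketch.

```python
import math

# Hypothetical confusion costs: cheaper than the default 1.0 for pairs
# that are easily confused. Real values would have to be estimated from data.
CONFUSABLE = {frozenset("cs"): 0.3}


def sub_cost(x, y):
    """Cost of substituting symbol x by symbol y (symmetric)."""
    return CONFUSABLE.get(frozenset((x, y)), 1.0)


def weighted_distance(a, b, sub_cost, indel_cost=1.0):
    """Edit distance where each substitution has its own cost."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                                # delete x
                curr[j - 1] + indel_cost,                            # insert y
                prev[j - 1] + (0.0 if x == y else sub_cost(x, y)),   # substitute
            ))
        prev = curr
    return prev[-1]


def best_target(heard, priors):
    """Pick the target t maximizing log P(t|context) - distance(heard, t).

    priors maps each target to an assumed P(target|context).
    """
    return max(
        priors,
        key=lambda t: math.log(priors[t]) - weighted_distance(heard, t, sub_cost),
    )
```

With these made-up numbers, "chees" is at distance 1.0 from both "cheese" and "chess", so the prior decides: best_target("chees", {"cheese": 0.9, "chess": 0.1}) picks "cheese", and swapping the priors picks "chess".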

You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different...).
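As a small illustration of the context-independent penalties described above, the sketch below turns the three probabilities into costs. One assumption of mine: the logs are negated so that the penalties are positive and a minimizing edit-distance routine can use them directly. The function name edit_penalties and the example probabilities are made up.

```python
import math


def edit_penalties(p_ins, p_del, p_sub, k):
    """Per-operation costs from operation probabilities.

    k is the number of symbols in the alphabet. The logs are negated
    (an assumption of this sketch) so that lower cost = more probable.
    """
    p_same = 1.0 - p_sub - p_del  # probability of observing the same symbol
    return {
        "insert": -math.log(p_ins / p_same / k),
        "delete": -math.log(p_del / p_same),
        "substitute": -math.log(p_sub / p_same / (k - 1)),
    }


# Example with invented probabilities and a 26-letter alphabet.
penalties = edit_penalties(p_ins=0.05, p_del=0.05, p_sub=0.10, k=26)
```

In an EM-style loop you would align the training pairs using the current penalties, count how often each operation occurs in the alignments, and recompute the probabilities from those counts.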

Related

Create matching of row and column entries such that values are maximized

Currently I am facing the following optimization problem and I can't seem to find the right algorithm for it. It seems related to combinatorial optimization problems such as the knapsack problem, but my mathematical knowledge is limited.
Assume we have a list of the following words: ["apple", "banana", "cookie", "donut", "ear", "force"]. Further, assume we have a dataset of texts which, among others, include these words. At some point I compute a cofrequency matrix, that is, a matrix that gives, for each pair of words, the fraction of files in which both words occur, e.g. cofreq("apple", "banana") = (number of files which contain both apple and banana)/(total number of files). Therefore, cofreq(apple, banana) = cofreq(banana, apple). We ignore cofreq(apple, apple).
Assume we have the following computed matrix (as an image, adding tables seems to be impossible): Table
The goal now is to create unique word pairs such that the sum of the pairs' cofrequencies is maximized and every word has a "partner" (we assume we have an even number of words). In this example it would be:
(apple, force) 0.4
(cookie, donut) 0.5
(banana, ear) 0.05
------------------+--
.95
In this case I did it by hand, but I know that there is a good algorithm for it; I just can't seem to find it. I was hoping someone could point me in the right direction in the form of a research paper or such.
You need to use a maximum weight matching algorithm to compute this maximal sum pairing.
The table you have in input can be seen as the adjacency matrix of a graph, where the values in the table correspond to the graph's edges weight. You can do it since the cofreq value is commutative (meaning cofreq(apple, banana) == cofreq(banana, apple)).
The matching algorithm you can use here is called the blossom algorithm. It is not trivial, but very elegant. If you have some experience in implementing complex algorithms, you can implement it yourself. Otherwise, there are implementations of it in graph libraries for most of the common languages.
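To check a result on a small instance, a brute-force search over all perfect pairings is enough. To be clear, the sketch below is not the blossom algorithm: it enumerates every pairing, which is only feasible for a handful of words. The original table was posted as an image, so the COFREQ values here are partly invented: the three pairs from the question keep their values, the two marked entries are hypothetical, and every other pair is assumed to be 0.

```python
WORDS = ["apple", "banana", "cookie", "donut", "ear", "force"]

COFREQ = {
    frozenset(("apple", "force")): 0.4,
    frozenset(("cookie", "donut")): 0.5,
    frozenset(("banana", "ear")): 0.05,
    frozenset(("apple", "banana")): 0.3,   # hypothetical value
    frozenset(("cookie", "force")): 0.3,   # hypothetical value
}


def cofreq(a, b):
    """Symmetric lookup; pairs missing from the table count as 0."""
    return COFREQ.get(frozenset((a, b)), 0.0)


def best_pairing(words, weight):
    """Exhaustively find the perfect pairing with the largest total weight.

    Assumes an even number of words. Exponential time: small inputs only.
    """
    if not words:
        return 0.0, []
    first, rest = words[0], words[1:]
    best_score, best_pairs = float("-inf"), []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        score, pairs = best_pairing(remaining, weight)
        score += weight(first, partner)
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + pairs
    return best_score, best_pairs


score, pairs = best_pairing(WORDS, cofreq)
```

With this table the search recovers the pairing from the question (apple-force, banana-ear, cookie-donut). For real data sizes, a library implementation of maximum weight matching, such as max_weight_matching in NetworkX, is the better choice.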

What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or about 56%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/datastructure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.
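For reference, the position-by-position scoring described in this question can be sketched as follows. The names weighted_match and top_matches are made up, and the linear scan over the database is just the naive baseline, not a recommendation for 100000 sequences.

```python
import heapq


def weighted_match(query, candidate, weights=None):
    """Fraction of (weighted) positions where both sequences agree.

    Positions are compared pairwise, with no shifting. If the candidate
    is shorter than the query, the missing positions count as mismatches.
    """
    if weights is None:
        weights = [1.0] * len(query)
    total = sum(weights)
    matched = sum(w for q, c, w in zip(query, candidate, weights) if q == c)
    return matched / total if total else 0.0


def top_matches(query, database, n, weights=None):
    """Return the n stored sequences most similar to the query."""
    return heapq.nlargest(
        n, database, key=lambda seq: weighted_match(query, seq, weights)
    )
```

weighted_match("abchytreq", "jbohytbbq") gives 5/9, and passing weights=[2, 1, 1, 1, 1, 1, 1, 1, 1] gives 0.5, matching the two examples above. To link each sequence to another object, the database could be a dict keyed by sequence.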
A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix with a matrix of weights for character positions.
I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first sqrt(100) = 10 of them, then marries the next mate with a better score than the best of the first ten; she might not be perfect, but she's good enough. There's some theorem of statistics that says the square root of the population size is the right cutoff, but I don't know what it's called.
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
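A minimal sketch of the second suggestion, assuming the "total" scoring function is a plain position-by-position match and the "partial" one is the same score restricted to a prefix. The names position_score and two_stage_search, and the default prefix length and threshold, are placeholders to tune.

```python
def position_score(query, candidate, length=None):
    """Fraction of positions that agree, optionally on a prefix only."""
    q, c = query[:length], candidate[:length]
    return sum(x == y for x, y in zip(q, c)) / max(len(q), 1)


def two_stage_search(query, database, prefix=5, threshold=0.4, n=10):
    """Cheap prefix filter first, full scoring only on the survivors."""
    survivors = [
        s for s in database
        if position_score(query, s, prefix) >= threshold
    ]
    survivors.sort(key=lambda s: position_score(query, s), reverse=True)
    return survivors[:n]
```

Keep in mind that the prefilter is a heuristic: a sequence that matches badly in the first few positions but well everywhere else gets discarded, so the threshold trades speed against missed matches.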

Normalization of a multi-dimensional space, what algorithm is this?

I'm not a trained statistician so I apologize for the incorrect usage of some words. I'm just trying to get some good results from the Weka Nearest Neighbor algorithms. I'll use some redundancy in my explanation as a means to try to get the concept across:
Is there a way to normalize a multi-dimensional space so that the distances between any two instances are always proportional to the effect on the dependent variable?
In other words, I have a statistical data set and I want to use a "nearest neighbor" algorithm to find instances that are most similar to a specified test instance. Unfortunately my initial results are useless, because two attributes that are very close in value but only weakly correlated with the dependent variable incorrectly bias the distance calculation.
For example let's say you're trying to find the nearest-neighbor of a given car based on a database of cars: make, model, year, color, engine size, number of doors. We know intuitively that the make, model, and year have a bigger effect on price than the number of doors. So a car with identical color, door count, may not be the nearest neighbor to a car with different color/doors but same make/model/year. What algorithm(s) can be used to appropriately set the weights of each independent variable in the Nearest Neighbor distance calculation so that the distance will be statistically proportional (correlated, whatever) to the dependent variable?
Application: This can be used for a more accurate "show me products similar to this other product" on shopping websites. Back to the car example, this would have cars of same make and model bubbling up to the top, with year used as a tie-breaker, and then within cars of the same year, it might sort the ones with the same number of cylinders (4 or 6) ahead of the ones with the same number of doors (2 or 4). I'm looking for an algorithmic way to derive something similar to the weights that I know intuitively (make >> model >> year >> engine >> doors) and actually assign numerical values to them to be used in the nearest-neighbor search for similar cars.
A more specific example:
Data set:
Blue,Honda,6-cylinder
Green,Toyota,4-cylinder
Blue,BMW,4-cylinder
now find cars similar to:
Blue,Honda,4-cylinder
in this limited example, it would match the Green,Toyota,4-cylinder ahead of the Blue,Honda,6-cylinder because the two brands are statistically almost interchangeable and cylinder is a stronger determinant of price rather than color. BMW would match lower because that brand tends to double the price, i.e. placing the item a larger distance.
Final note: the prices are available during training of the algorithm, but not during calculation.
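To show the kind of distance function I have in mind, here is a sketch with hand-picked numbers. The attribute weights in WEIGHTS and the brand-to-brand distances in MAKE_DISTANCE are invented for illustration; deriving such numbers automatically from the training prices is exactly what I am asking for.

```python
# Hypothetical attribute weights (larger = stronger effect on price).
WEIGHTS = {"color": 0.1, "make": 1.0, "cylinders": 0.5}

# Hypothetical distances between makes: Honda and Toyota are nearly
# interchangeable, BMW is far from both.
MAKE_DISTANCE = {
    frozenset(("Honda", "Toyota")): 0.1,
    frozenset(("Honda", "BMW")): 1.0,
    frozenset(("Toyota", "BMW")): 1.0,
}


def attribute_distance(name, a, b):
    """Distance in [0, 1] between two values of one attribute."""
    if a == b:
        return 0.0
    if name == "make":
        return MAKE_DISTANCE.get(frozenset((a, b)), 1.0)
    return 1.0  # any other categorical mismatch


def car_distance(x, y):
    """Weighted sum of per-attribute distances."""
    return sum(
        w * attribute_distance(name, x[name], y[name])
        for name, w in WEIGHTS.items()
    )


def rank_similar(query, cars):
    """Cars sorted from most to least similar to the query."""
    return sorted(cars, key=lambda car: car_distance(query, car))


CARS = [
    {"color": "Blue", "make": "Honda", "cylinders": 6},
    {"color": "Green", "make": "Toyota", "cylinders": 4},
    {"color": "Blue", "make": "BMW", "cylinders": 4},
]
QUERY = {"color": "Blue", "make": "Honda", "cylinders": 4}
ranked = rank_similar(QUERY, CARS)
```

With these made-up numbers the ranking is Toyota, then Honda, then BMW, which is the order described in the example above.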
You should possibly look at Solr/Lucene for this purpose. Solr provides similarity search based on field value frequency, and it already has MoreLikeThis functionality for finding similar items.
Maybe nearest neighbor is not a good algorithm for this case? As you want to classify discrete values, it can become quite hard to define reasonable distances. I think a C4.5-like algorithm may better suit the application you describe. On each step the algorithm would optimize the information entropy, so you would always select the feature that gives you the most information.
Found something in the IEEE website. The algorithm is called DKNDAW ("dynamic k-nearest-neighbor with distance and attribute weighted"). I couldn't locate the actual paper (probably needs a paid subscription). This looks very promising assuming that the attribute weights are computed by the algorithm itself.

Genetic Algorithm Implementation for weight optimization

I am a data mining student and I have a problem that I was hoping that you guys could give me some advice on:
I need a genetic algo that optimizes the weights between three inputs. The weights need to be positive values AND they need to sum to 100%.
The difficulty is in creating an encoding that satisfies the sum to 100% requirement.
As a first pass, I thought that I could simply create a chrom with a series of numbers (ex.4,7,9). Each weight would simply be its number divided by the sum of all of the chromosome's numbers (ex. 4/20=20%).
The problem with this encoding method is that any change to the chromosome will change the sum of all the chromosome's numbers resulting in a change to all of the chromosome's weights. This would seem to significantly limit the GA's ability to evolve a solution.
Could you give any advice on how to approach this problem?
I have read about real valued encoding and I do have an implementation of a GA but it will give me weights that may not necessarily add up to 100%.
It is mathematically impossible to change one value without changing at least one more if you need the sum to remain constant.
One way to make changes would be exactly what you suggest: weight = value/sum. In this case when you change one value, the difference to be made up is distributed across all the other values.
The other extreme is to only change pairs. Start with a set of values that add to 100, and whenever 1 value changes, change another by the opposite amount to maintain your sum. The other could be picked randomly, or by a rule. I'd expect this would take longer to converge than the first method.
If your chromosome is only 3 values long, then mathematically, these are your only two options.
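Both options are easy to sketch. The function names to_weights and mutate_pair are made up, and the clamping in mutate_pair is an extra assumption so that no weight goes negative.

```python
def to_weights(chromosome):
    """Option 1: weight = value / sum, expressed in percent."""
    total = sum(chromosome)
    return [100.0 * gene / total for gene in chromosome]


def mutate_pair(weights, i, j, delta):
    """Option 2: move delta from weight j to weight i, keeping the sum.

    delta is clamped so that neither weight drops below zero.
    """
    delta = max(-weights[i], min(delta, weights[j]))
    mutated = list(weights)
    mutated[i] += delta
    mutated[j] -= delta
    return mutated
```

For the chromosome (4, 7, 9), to_weights gives 20%, 35% and 45%. In a GA, i, j and delta would be drawn at random by the mutation operator.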

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first unify your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.
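A rough sketch of the "unify, split, then match the pieces" idea from above. Python's built-in substring test stands in for an explicit Horspool implementation here, and the normalization rule (lowercase, anything non-alphanumeric is a separator) is my own assumption.

```python
import re


def tokens(name):
    """Lowercase the string and split it on any non-alphanumeric run."""
    return [t for t in re.split(r"[^a-z0-9]+", name.lower()) if t]


def shared_token_count(a, b):
    """Number of distinct pieces of a that occur somewhere in b."""
    normalized_b = " ".join(tokens(b))
    return sum(1 for t in set(tokens(a)) if t in normalized_b)
```

With this, "Turing, Alan." shares two pieces with "Alan Turing" but only one with "Turing Machine", which is the ordering the question asks for.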
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
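The calculation is short enough to sketch. The function name jaccard_similarity is made up, and n is the n-gram length (n=1 means single characters, as in the example above). Strictly speaking, the value computed here is the Jaccard similarity; the Jaccard distance is usually defined as 1 minus this value.

```python
def jaccard_similarity(a, b, n=1):
    """Intersection over union of the sets of n-grams of a and b."""
    def grams(s):
        return {s[i:i + n] for i in range(len(s) - n + 1)}

    ga, gb = grams(a), grams(b)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0
```

jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF") returns 0.875, the 7/8 from the example.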
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction
(C(AB) - min(C(A), C(B))) / max(C(A), C(B))
This value is called NCD(A,B) and measures similarity much like edit distance does, but it supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar, the compressed size together will be roughly the sum of the two compressed sizes, and so the result will be near 1.
This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. That is because most of the "hard" work, such as heuristics and optimization, has already been done in the data compression part, and this formula simply extracts the amount of similar patterns it found, using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information on this and a sample implementation, just search for Normalized Compression Distance (NCD) or have a look at the following paper and GitHub project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
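As a minimal sketch of the formula above using Python's standard zlib module (the function names compressed_size and ncd are made up, and encoding the strings as UTF-8 is my assumption):

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two strings."""
    xa, xb = a.encode("utf-8"), b.encode("utf-8")
    ca, cb = compressed_size(xa), compressed_size(xb)
    cab = compressed_size(xa + xb)
    return (cab - min(ca, cb)) / max(ca, cb)
```

One caveat: zlib adds a fixed header and checksum to every output, so for very short inputs that overhead makes up a large part of the compressed size and the scores are less discriminating. It is worth checking the behaviour on a sample of your own data first.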
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(k n log(n)) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
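The steps above can be sketched like this, using the sorted-list variant rather than the count vector. The function names ktuples and shared_ktuples are made up; repeated k-tuples are counted once per matching occurrence.

```python
def ktuples(s, k):
    """Sorted list of all length-k substrings of s."""
    return sorted(s[i:i + k] for i in range(len(s) - k + 1))


def shared_ktuples(a, b, k=4):
    """Count the k-tuples common to a and b by merging two sorted lists."""
    xs, ys = ktuples(a, k), ktuples(b, k)
    i = j = shared = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return shared
```

For example, with k=4, "Turing, Alan" shares 4 k-tuples with "Alan Turing" but only 3 with "Turing Machine". You may want to divide by the number of k-tuples in the shorter string to get a score that is comparable across string lengths.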
I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.
