Find the words in a long stream of characters. Auto-tokenize - algorithm

How would you find the correct words in a long stream of characters?
Input :
"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
Google's Output:
"The revised report on syntactic theories sequential controlandstate"
(which is close enough considering the time that they produced the output)
How do you think Google does it?
How would you increase the accuracy?

I would try a recursive algorithm like this:
Try inserting a space at each position. If the left part is a word, then recur on the right part.
Count the number of valid words / number of total words in all the final outputs. The one with the best ratio is likely your answer.
For example, giving it "thesentenceisgood" would run:
thesentenceisgood
the sentenceisgood
sent enceisgood
enceisgood: OUT1: the sent enceisgood, 2/3
sentence isgood
is good
go od: OUT2: the sentence is go od, 4/5
is good: OUT3: the sentence is good, 4/4
sentenceisgood: OUT4: the sentenceisgood, 1/2
these ntenceisgood
ntenceisgood: OUT5: these ntenceisgood, 1/2
So you would pick OUT3 as the answer.
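The recursive idea above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `segmentations` and `best_segmentation` and the tiny word set in the usage note are made up for the example, and the remainder of the string is always allowed to stay unsplit as a final piece. The running time is exponential in the worst case.

```python
def segmentations(text, words):
    """Yield candidate splits: every piece except possibly the last is a dictionary word."""
    if not text:
        yield []
        return
    # Option 1: leave the remainder unsplit (it may or may not be a word).
    yield [text]
    # Option 2: split off every prefix that is a word and recurse on the rest.
    for i in range(1, len(text)):
        head = text[:i]
        if head in words:
            for rest in segmentations(text[i:], words):
                yield [head] + rest


def best_segmentation(text, words):
    """Pick the candidate with the best ratio of valid words to total pieces."""
    def ratio(pieces):
        return sum(p in words for p in pieces) / len(pieces)
    return max(segmentations(text, words), key=ratio)
```

With a toy dictionary such as {"the", "these", "sent", "sentence", "is", "go", "good"}, calling best_segmentation("thesentenceisgood", ...) picks the split corresponding to OUT3.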

Try a stochastic regular grammar (equivalent to hidden markov models) with the following rules:
for every word in a dictionary:
stream -> word_i stream with probability p_w
word_i -> letter_i1 ... letter_in with probability q_w (this is the spelling of word_i)
stream -> letter stream with prob p (for any letter)
stream -> epsilon with prob 1
The probabilities could be derived from a training set, but see the following discussion.
The most likely parse is computed using the Viterbi algorithm, which has quadratic time complexity in the number of hidden states (in this case your vocabulary), so you could run into speed issues with large vocabularies. But what if you set all p_w = 1, q_w = 1 and p = .5? These are probabilities in an artificial language model where all words are equally likely and all non-words are equally unlikely. Of course you could segment better without this simplification, but the algorithm's complexity goes down by quite a bit. If you look at the recurrence relation in the Wikipedia entry, you can try to simplify it for this special case. The Viterbi parse probability up to position k simplifies to
VP(k) = max_l( VP(k-l) * (1 if text[k-l:k] is a word else .5^l) )
You can bound l by the maximum length of a word and check whether l letters form a word with a hash lookup. The complexity is then independent of the vocabulary size: O(<text length> * <max l>). Sorry, this is not a proof, just a sketch, but it should get you going.
Another potential optimization: if you create a trie of the dictionary, you can check whether a substring is a prefix of any correct word. So when you query text[k-l:k] and get a negative answer, you already know that the same is true for text[k-l:k+d] for any d. To take advantage of this you would have to rearrange the recursion significantly, so I am not sure this can be fully exploited (it can, see comment).
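A minimal Python sketch of the simplified recurrence, assuming the dictionary is a plain set of strings. The function name `segment` and its parameters are illustrative, not from any library, and log-probabilities are used instead of raw products so that long runs of non-words do not underflow.

```python
import math


def segment(text, words, p=0.5):
    """Most likely split under the simplified model: a word costs nothing,
    a non-word chunk of l letters costs l * log(p)."""
    max_len = max(map(len, words))
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[k] = log VP(k)
    back = [0] * (n + 1)          # back[k] = length of the last chunk ending at k
    best[0] = 0.0
    for k in range(1, n + 1):
        for l in range(1, min(k, max_len) + 1):
            chunk = text[k - l:k]
            score = best[k - l] + (0.0 if chunk in words else l * math.log(p))
            if score > best[k]:
                best[k], back[k] = score, l
    # Walk the back-pointers from the end to recover the chunks.
    pieces, k = [], n
    while k > 0:
        pieces.append(text[k - back[k]:k])
        k -= back[k]
    return pieces[::-1]
```

Each position tries at most max_len chunk lengths with one hash lookup each, which is where the O(<text length> * <max l>) bound comes from.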

Here is some code in Mathematica that I started to develop for a recent code golf.
It is a minimal matching, non-greedy, recursive algorithm. That means that the sentence "the pen is mighter than the sword" (without spaces) returns {"the pen is might er than the sword"} :)
findAll[s_] :=
 Module[{a = s, b = "", c, sy = "="},
  While[
   StringLength[a] != 0,
   j = "";
   While[(c = findFirst[a]) == {} && StringLength[a] != 0,
    j = j <> StringTake[a, 1];
    sy = "~";
    a = StringDrop[a, 1];
    ];
   b = b <> " " <> j ;
   If[c != {},
    b = b <> " " <> c[[1]];
    a = StringDrop[a, StringLength[c[[1]]]];
    ];
   ];
  (* collapse the double spaces left behind when j is empty *)
  Return[{StringTrim[StringReplace[b, "  " -> " "]], sy}];
  ]
findFirst[s_] :=
 If[s != "" && (c = DictionaryLookup[s]) == {},
  findFirst[StringDrop[s, -1]], Return[c]];
Sample Input
ss = {"twodreamstop",
"onebackstop",
"butterfingers",
"dependentrelationship",
"payperiodmatchcode",
"labordistributioncodedesc",
"benefitcalcrulecodedesc",
"psaddresstype",
"ageconrolnoticeperiod",
"month05",
"as_benefits",
"fname"}
Output
twodreamstop = two dreams top
onebackstop = one backstop
butterfingers = butterfingers
dependentrelationship = dependent relationship
payperiodmatchcode = pay period match code
labordistributioncodedesc ~ labor distribution coded es c
benefitcalcrulecodedesc ~ benefit c a lc rule coded es c
psaddresstype ~ p sad dress type
ageconrolnoticeperiod ~ age con rol notice period
month05 ~ month 05
as_benefits ~ as _ benefits
fname ~ f name
HTH

Look at spelling correction algorithms. Here is a link to an article on the algorithm used at Google: http://www.norvig.com/spell-correct.html. Here you will find a scientific paper on this topic from Google.

After doing the recursive splitting and dictionary lookup, you might want to use the mutual information of word pairs to improve the quality of the word pairs in your phrase.
This essentially means going through a training set and finding the M.I. values of word pairs, which tell you that Albert Simpson is less likely than Albert Einstein :)
You can try searching Science Direct for academic papers on this theme. For basic information on mutual information see http://en.wikipedia.org/wiki/Mutual_information
Last year I was involved in the phrase search part of a search engine project, in which I parsed the Wikipedia dataset and ranked each word pair. I've got the code in C++ and could share it with you if you can find some use for it. It parses Wikimedia and computes the mutual information for every word pair.
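As a rough illustration of scoring word pairs (this is not the C++ code mentioned above), here is a small Python sketch that computes pointwise mutual information for adjacent word pairs in a token list. The function name `pmi_table` and the toy corpus in the usage note are assumptions made for the example.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Pointwise mutual information log2(p(a,b) / (p(a) * p(b))) for adjacent pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table
```

On a toy corpus such as "albert einstein homer simpson albert einstein bart simpson albert simpson", the pair (albert, einstein) gets a higher score than (albert, simpson). With real data you would want a large corpus and some smoothing, since PMI overrates pairs that are seen only once or twice.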

Related

dynamic programming word segmentation

Suppose I have a string like 'meetateight' and I need to segment it into meaningful words like 'meet' 'at' 'eight' using dynamic programming.
To judge how “good” a block/segment "x = x1x2x3" is, I am given a black box that, on input x, returns a real number quality(x) such that: A large positive value for quality(x) indicates x is close to an English word, and a large negative number indicates x is far from an English word.
I need help with designing an algorithm for the same.
I tried thinking over an algorithm in which I would iteratively add letters based on their quality and segment whenever there is a dip in quality.
But this fails in the above example because it cuts out me instead of meet.
I need suggestions for a better algorithm.
Thanks
What about building a trie from an English dictionary and navigating down it while scanning your string, following all the possible paths to a leaf (backtracking when you have more than one choice)?
You can use dynamic programming, and keep track of the score for each prefix of your input, adding one letter at a time. Each time you add a letter, see if any suffixes can be added on to a prefix you've already used (choosing the one with the best score). For example:
m = 0
me = 1
mee = 0
meet = 1
meeta = 1 (meet + a)
meetat = 1 (meet + at)
meetate = 1 (meet + ate)
meetatei = 1 (meetate + i)
meetateig = 0
meetateigh = 0
meetateight = 1 (meetat + eight)
To handle values between 0 and 1, you can multiply them together. Also save which words you've used so you can split the whole string at the end.
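A sketch of this prefix-by-prefix dynamic program in Python, written for the question's black box: it assumes segment scores are added (which suits a quality function with positive and negative values) rather than multiplied. The name `segment_by_quality` and the toy quality function in the usage note are assumptions for illustration only.

```python
def segment_by_quality(text, quality):
    """Split text so that the sum of quality(piece) over all pieces is maximal."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[k] = best total score for text[:k]
    cut = [0] * (n + 1)               # cut[k] = start index of the last piece
    best[0] = 0.0
    for k in range(1, n + 1):
        for j in range(k):
            score = best[j] + quality(text[j:k])
            if score > best[k]:
                best[k], cut[k] = score, j
    # Follow the saved cut points backwards to recover the pieces.
    pieces, k = [], n
    while k > 0:
        pieces.append(text[cut[k]:k])
        k = cut[k]
    return pieces[::-1]
```

With a stand-in quality function that returns 1.0 for entries of a small word set and minus the length otherwise, 'meetateight' comes back as meet, at, eight. The black box is called O(n^2) times for a string of length n.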
I wrote a program to do this at my blog; it's too long to include here. The basic idea is to chop off prefixes that form words, then recursively process the rest of the input, backtracking when it is not possible to split the entire string.

Understanding the Viterbi algorithm

I was looking for a precise step by step example of the Viterbi algorithm.
Considering sentence tagging with the input sentence as:
The cat saw the angry dog jump
And from this I would like to generate the most probable output as:
D N V T A N V
How do we use the Viterbi algorithm to get the above output using a trigram-HMM?
(PS: I'm looking for a precise step by step explanation, not a piece of code, or math representation. Assume all probabilities as numbers.)
Thanks a ton!
I suggest that you look it up in one of the books available, e.g. Chris Bishop "Pattern Recognition and Machine Learning". Viterbi algorithm is a really basic thing and has been described in various levels of detail in the literature.
For Viterbi algorithm and Hidden Markov Model, you first need the transition probability and emission probability.
In your example, the transition probabilities are P(D->N), P(N->V) and the emission probabilities (assuming a bigram model) are P(the|D), P(cat|N).
Of course, in a real-world example there are a lot more words than the, cat, saw, etc. You have to loop through all your training data to estimate P(the|D), P(cat|N), P(car|N). Then we use the Viterbi algorithm to find the most likely sequence of tags, such as
D N V T A N V
given your observation.
Here is my implementation of Viterbi.
import numpy as np
def viterbi(vocab, vocab_tag, words, tags, t_bigram_count, t_unigram_count, e_bigram_count, e_unigram_count, ADD_K):
    # All probabilities are add-k smoothed log2 values; ADD_K must be > 0 to avoid log2(0).
    vocab_size = len(vocab)
    V = [{}]
    for t in vocab_tag:
        # Emission log-probability of the very first word under tag t
        prob = np.log2(float(e_bigram_count.get((words[0], t), 0) + ADD_K)) - np.log2(float(e_unigram_count[t] + vocab_size * ADD_K))
        V[0][t] = {"prob": prob, "prev": None}
    for i in range(1, len(words)):
        V.append({})
        for t in vocab_tag:
            max_trans_prob = -np.inf
            best_prev = None
            for prev_tag in vocab_tag:
                # Transition log-probability prev_tag -> t
                trans_prob = np.log2(float(t_bigram_count.get((t, prev_tag), 0) + ADD_K)) - np.log2(float(t_unigram_count[prev_tag] + vocab_size * ADD_K))
                if V[i-1][prev_tag]["prob"] + trans_prob > max_trans_prob:
                    max_trans_prob = V[i-1][prev_tag]["prob"] + trans_prob
                    # Remember which previous tag gave the maximum, not just the last one tried
                    best_prev = prev_tag
            max_prob = max_trans_prob + np.log2(float(e_bigram_count.get((words[i], t), 0) + ADD_K)) - np.log2(float(e_unigram_count[t] + vocab_size * ADD_K))
            V[i][t] = {"prob": max_prob, "prev": best_prev}
    opt = []
    previous = None
    max_prob = max(value["prob"] for value in V[-1].values())
    # Get the most probable final state
    for st, data in V[-1].items():
        if data["prob"] == max_prob:
            opt.append(st)
            previous = st
            break
    # Follow the back-pointers from the last position to the first
    for t in range(len(V) - 2, -1, -1):
        previous = V[t + 1][previous]["prev"]
        opt.insert(0, previous)
    return opt

Spell checker with fused spelling error correction algorithm

Recently I've looked through several spell checker algorithms, including simple ones (like Peter Norvig's) and much more complex ones (like Brill and Moore's). But there's a type of error which none of them can handle. If, for example, I type stackoverflow instead of stack overflow, these spellcheckers will fail to correct the mistype (unless stack overflow is in the dictionary of terms). Storing all pairs of words is too expensive (and it will not help if the error is 3 single words without spaces between them).
Is there an algorithm which can correct this type of error (in addition to the usual mistypes)?
Some examples of what I need:
spel checker -> spell checker
spellchecker -> spell checker
spelcheker -> spell checker
I hacked up Norvig's spell corrector to do this. I had to cheat a bit and add the word 'checker' to Norvig's data file because it never appears. Without that cheating, the problem is really hard.
expertsexchange expert exchange
spel checker spell checker
spellchecker spell checker
spelchecker she checker # can't win them all
baseball base all # baseball isn't in the dictionary either :(
hewent he went
Basically you need to change the code so that:
you add space to the alphabet to automatically explore the word breaks.
you first check that all of the words that make up a phrase are in the dictionary to consider the phrase valid, rather than just dictionary membership directly (the dict contains no phrases).
you need a way to score a phrase against plain words.
The latter is the trickiest, and I use a braindead independence assumption for phrase composition that the probability of two adjacent words is the product of their individual probabilities (here done with sum in log prob space), with a small penalty. I am sure that in practice, you'll want to keep some bigram stats to do that splitting well.
import re, collections, math
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
    # Word counts with add-one smoothing, converted to log probabilities
    counts = collections.defaultdict(lambda: 1.0)
    for f in features:
        counts[f] += 1.0
    tot = float(sum(counts.values()))
    model = collections.defaultdict(lambda: math.log(.1 / tot))
    for f in counts:
        model[f] = math.log(counts[f] / tot)
    return model
NWORDS = train(words(open('big.txt').read()))
# The space is part of the alphabet so that edits can insert word breaks
alphabet = 'abcdefghijklmnopqrstuvwxyz '
def valid(w):
    # A phrase is valid only if every word in it is in the dictionary
    return all(s in NWORDS for s in w.split())
def score(w):
    # Sum of word log probabilities, minus a penalty of 1 per space
    return sum(NWORDS[s] for s in w.split()) - w.count(' ')
def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if valid(e2))
def known(words): return set(w for w in words if valid(w))
def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=score)
def t(w):
    print(w, correct(w))
t('expertsexchange')
t('spel checker')
t('spellchecker')
t('spelchecker')
t('baseball')
t('hewent')
This problem is very similar to the problem of compound splitting as applied to German or Dutch, but also noisy English data. See Monz & De Rijke for a very simple algorithm (which can I think be implemented as a finite state transducer for efficiency) and Google for "compound splitting" and "decompounding".
I sometimes get such suggestions when spell-checking in kate, so there certainly is an algorithm that can correct some such errors. I am sure one can do better, but one idea is to split the candidate in likely places and check whether close matches for the components exist. The hard part is to decide what are likely places. In the languages I'm sort of familiar with, there are letter combinations that occur rarely in words. For example, the combinations dk or lh are, as far as I'm aware rare in English words. Other combinations occur often at the start of words (e.g. un, ch), so those would be good guesses for splitting too. In the example spelcheker, the lc combination is not too widespread, and ch is a common start of words, so the split spel and cheker is a prime candidate, and any decent algorithm would then find spell and checker (but it would probably also find spiel, so don't auto-correct, just give suggestions).

Getting the closest string match

I need a way to compare multiple strings to a test string and return the string that closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
CHOICE A : THE RED COW JUMPED OVER THE GREEN CHICKEN
CHOICE B : THE RED COW JUMPED OVER THE RED COW
CHOICE C : THE RED FOX JUMPED OVER THE BROWN COW
(If I did this correctly) The closest string to the "TEST STRING" should be "CHOICE C". What is the easiest way to do this?
I plan on implementing this into multiple languages including VB.net, Lua, and JavaScript. At this point, pseudo code is acceptable. If you can provide an example for a specific language, this is appreciated too!
I was presented with this problem about a year ago when it came to looking up user-entered information about an oil rig in a database of miscellaneous information. The goal was to do some sort of fuzzy string search that could identify the database entry with the most common elements.
Part of the research involved implementing the Levenshtein distance algorithm, which determines how many changes must be made to a string or phrase to turn it into another string or phrase.
The implementation I came up with was relatively simple, and involved a weighted comparison of the length of the two phrases, the number of changes between each phrase, and whether each word could be found in the target entry.
The article is on a private site so I'll do my best to append the relevant contents here:
Fuzzy String Matching is the process of performing a human-like estimation of the similarity of two words or phrases. In many cases, it involves identifying words or phrases which are most similar to each other. This article describes an in-house solution to the fuzzy string matching problem and its usefulness in solving a variety of problems which can allow us to automate tasks which previously required tedious user involvement.
Introduction
The need to do fuzzy string matching originally came about while developing the Gulf of Mexico Validator tool. What existed was a database of known Gulf of Mexico oil rigs and platforms, and people buying insurance would give us some badly typed out information about their assets and we had to match it to the database of known platforms. When there was very little information given, the best we could do was rely on an underwriter to "recognize" the one they were referring to and call up the proper information. This is where this automated solution comes in handy.
I spent a day researching methods of fuzzy string matching, and eventually stumbled upon the very useful Levenshtein distance algorithm on Wikipedia.
Implementation
After reading about the theory behind it, I implemented it and found ways to optimize it. This is what my code looks like in VBA:
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef S1 As String, ByVal S2 As String) As Long
    Dim L1 As Long, L2 As Long, D() As Long 'Length of input strings and distance matrix
    Dim i As Long, j As Long, cost As Long 'loop counters and cost of substitution for current letter
    Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
    L1 = Len(S1): L2 = Len(S2)
    ReDim D(0 To L1, 0 To L2)
    For i = 0 To L1: D(i, 0) = i: Next i
    For j = 0 To L2: D(0, j) = j: Next j
    For j = 1 To L2
        For i = 1 To L1
            cost = Abs(StrComp(Mid$(S1, i, 1), Mid$(S2, j, 1), vbTextCompare))
            cI = D(i - 1, j) + 1
            cD = D(i, j - 1) + 1
            cS = D(i - 1, j - 1) + cost
            If cI <= cD Then 'Insertion or Substitution
                If cI <= cS Then D(i, j) = cI Else D(i, j) = cS
            Else 'Deletion or Substitution
                If cD <= cS Then D(i, j) = cD Else D(i, j) = cS
            End If
        Next i
    Next j
    LevenshteinDistance = D(L1, L2)
End Function
Simple, speedy, and a very useful metric. Using this, I created two separate metrics for evaluating the similarity of two strings. One I call "valuePhrase" and one I call "valueWords". valuePhrase is just the Levenshtein distance between the two phrases, and valueWords splits the string into individual words, based on delimiters such as spaces, dashes, and anything else you'd like, and compares each word to each other word, summing up the shortest Levenshtein distance connecting any two words. Essentially, it measures whether the information in one 'phrase' is really contained in another, just as a word-wise permutation. I spent a few days as a side project coming up with the most efficient way possible of splitting a string based on delimiters.
valueWords, valuePhrase, and Split function:
Public Function valuePhrase#(ByRef S1$, ByRef S2$)
    valuePhrase = LevenshteinDistance(S1, S2)
End Function
Public Function valueWords#(ByRef S1$, ByRef S2$)
    Dim wordsS1$(), wordsS2$()
    wordsS1 = SplitMultiDelims(S1, " _-")
    wordsS2 = SplitMultiDelims(S2, " _-")
    Dim word1%, word2%, thisD#, wordbest#
    Dim wordsTotal#
    For word1 = LBound(wordsS1) To UBound(wordsS1)
        wordbest = Len(S2)
        For word2 = LBound(wordsS2) To UBound(wordsS2)
            thisD = LevenshteinDistance(wordsS1(word1), wordsS2(word2))
            If thisD < wordbest Then wordbest = thisD
            If thisD = 0 Then GoTo foundbest
        Next word2
foundbest:
        wordsTotal = wordsTotal + wordbest
    Next word1
    valueWords = wordsTotal
End Function
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' SplitMultiDelims
' This function splits Text into an array of substrings, each substring
' delimited by any character in DelimChars. Only a single character
' may be a delimiter between two substrings, but DelimChars may
' contain any number of delimiter characters. It returns a single element
' array containing all of text if DelimChars is empty, or a 1 or greater
' element array if the Text is successfully split into substrings.
' If IgnoreConsecutiveDelimiters is true, empty array elements will not occur.
' If Limit greater than 0, the function will only split Text into 'Limit'
' array elements or less. The last element will contain the rest of Text.
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
Function SplitMultiDelims(ByRef Text As String, ByRef DelimChars As String, _
Optional ByVal IgnoreConsecutiveDelimiters As Boolean = False, _
Optional ByVal Limit As Long = -1) As String()
Dim ElemStart As Long, N As Long, M As Long, Elements As Long
Dim lDelims As Long, lText As Long
Dim Arr() As String
lText = Len(Text)
lDelims = Len(DelimChars)
If lDelims = 0 Or lText = 0 Or Limit = 1 Then
ReDim Arr(0 To 0)
Arr(0) = Text
SplitMultiDelims = Arr
Exit Function
End If
ReDim Arr(0 To IIf(Limit = -1, lText - 1, Limit))
Elements = 0: ElemStart = 1
For N = 1 To lText
If InStr(DelimChars, Mid(Text, N, 1)) Then
Arr(Elements) = Mid(Text, ElemStart, N - ElemStart)
If IgnoreConsecutiveDelimiters Then
If Len(Arr(Elements)) > 0 Then Elements = Elements + 1
Else
Elements = Elements + 1
End If
ElemStart = N + 1
If Elements + 1 = Limit Then Exit For
End If
Next N
'Get the last token terminated by the end of the string into the array
If ElemStart <= lText Then Arr(Elements) = Mid(Text, ElemStart)
'Since the end of string counts as the terminating delimiter, if the last character
'was also a delimiter, we treat the two as consecutive, and so ignore the last element
If IgnoreConsecutiveDelimiters Then If Len(Arr(Elements)) = 0 Then Elements = Elements - 1
ReDim Preserve Arr(0 To Elements) 'Chop off unused array elements
SplitMultiDelims = Arr
End Function
Measures of Similarity
Using these two metrics, and a third which simply computes the distance between two strings, I have a series of variables which I can run an optimization algorithm to achieve the greatest number of matches. Fuzzy string matching is, itself, a fuzzy science, and so by creating linearly independent metrics for measuring string similarity, and having a known set of strings we wish to match to each other, we can find the parameters that, for our specific styles of strings, give the best fuzzy match results.
Initially, the goal of the metric was to have a low search value for an exact match, and increasing search values for increasingly permuted measures. In an impractical case, this was fairly easy to define using a set of well defined permutations, and engineering the final formula such that they had increasing search value results as desired.
In the above screenshot, I tweaked my heuristic to come up with something that I felt scaled nicely to my perceived difference between the search term and result. The heuristic I used for Value Phrase in the above spreadsheet was =valuePhrase(A2,B2)-0.8*ABS(LEN(B2)-LEN(A2)). I was effectively reducing the penalty of the Levenshtein distance by 80% of the difference in the length of the two "phrases". This way, "phrases" that have the same length suffer the full penalty, but "phrases" which contain 'additional information' (longer) but aside from that still mostly share the same characters suffer a reduced penalty. I used the Value Words function as is, and then my final SearchVal heuristic was defined as =MIN(D2,E2)*0.8+MAX(D2,E2)*0.2 - a weighted average. Whichever of the two scores was lower got weighted 80%, and the higher score got 20%. This was just a heuristic that suited my use case to get a good match rate. These weights are something that one could then tweak to get the best match rate with their test data.
As you can see, the last two metrics, which are fuzzy string matching metrics, already have a natural tendency to give low scores to strings that are meant to match (down the diagonal). This is very good.
Application
To allow the optimization of fuzzy matching, I weight each metric. As such, every application of fuzzy string matching can weight the parameters differently. The formula that defines the final score is a simple combination of the metrics and their weights:
value = Min(phraseWeight*phraseValue, wordsWeight*wordsValue)*minWeight
+ Max(phraseWeight*phraseValue, wordsWeight*wordsValue)*maxWeight
+ lengthWeight*lengthValue
Using an optimization algorithm (a neural network is best here because it is a discrete, multi-dimensional problem), the goal is now to maximize the number of matches. I created a function that detects the number of correct matches of each set to each other, as can be seen in this final screenshot. A column or row gets a point if the lowest score is assigned to the string that was meant to be matched, and partial points are given if there is a tie for the lowest score and the correct match is among the tied strings. I then optimized it. You can see that a green cell is the column that best matches the current row, and a blue square around the cell is the row that best matches the current column. The score in the bottom corner is roughly the number of successful matches, and this is what we tell our optimization problem to maximize.
The algorithm was a wonderful success, and the solution parameters say a lot about this type of problem. You'll notice the optimized score was 44, and the best possible score is 48. The 5 columns at the end are decoys, and do not have any match at all to the row values. The more decoys there are, the harder it will naturally be to find the best match.
In this particular matching case, the length of the strings is irrelevant, because we are expecting abbreviations that represent longer words, so the optimal weight for length is -0.3, which means we do not penalize strings which vary in length. We reduce the score in anticipation of these abbreviations, giving more room for partial word matches to supersede non-word matches that simply require fewer substitutions because the string is shorter.
The word weight is 1.0 while the phrase weight is only 0.5, which means that we penalize whole words missing from one string and value more the entire phrase being intact. This is useful because a lot of these strings have one word in common (the peril) where what really matters is whether or not the combination (region and peril) are maintained.
Finally, the min weight is optimized at 10 and the max weight at 1. What this means is that if the best of the two scores (value phrase and value words) isn't very good, the match is greatly penalized, but we don't greatly penalize the worst of the two scores. Essentially, this puts emphasis on requiring either the valueWord or valuePhrase to have a good score, but not both. A sort of "take what we can get" mentality.
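To make the weighted formula concrete, here is a hedged Python sketch that plugs in the weights quoted above (phrase 0.5, words 1.0, min 10, max 1, length -0.3). It is not a port of the VBA code: the function names are made up, comparison is made case-insensitive by lowercasing, and `lengthValue` is assumed to be the absolute difference in string lengths, which the article does not spell out.

```python
import re


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions), case-insensitive."""
    a, b = a.lower(), b.lower()
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def value_words(s1, s2):
    """For each word of s1, add the distance to its closest word in s2."""
    words1 = re.split(r"[ _-]", s1)
    words2 = re.split(r"[ _-]", s2)
    return sum(min([len(s2)] + [levenshtein(w1, w2) for w2 in words2])
               for w1 in words1)


def fuzzy_score(query, candidate, phrase_weight=0.5, words_weight=1.0,
                min_weight=10.0, max_weight=1.0, length_weight=-0.3):
    """Lower is better; 0 for an exact match."""
    phrase = phrase_weight * levenshtein(query, candidate)
    words = words_weight * value_words(query, candidate)
    length = abs(len(query) - len(candidate))  # assumed definition of lengthValue
    return (min(phrase, words) * min_weight
            + max(phrase, words) * max_weight
            + length_weight * length)
```

To pick the closest candidate you would take min(candidates, key=lambda c: fuzzy_score(query, c)). The default weights are only the ones reported for this one data set and would need re-tuning for other kinds of strings.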
It's really fascinating what the optimized value of these 5 weights say about the sort of fuzzy string matching taking place. For completely different practical cases of fuzzy string matching, these parameters are very different. I've used it for 3 separate applications so far.
While unused in the final optimization, a benchmarking sheet was established which matches columns to themselves for all perfect results down the diagonal, and lets the user change parameters to control the rate at which scores diverge from 0, and note innate similarities between search phrases (which could in theory be used to offset false positives in the results)
Further Applications
This solution has potential to be used anywhere where the user wishes to have a computer system identify a string in a set of strings where there is no perfect match. (Like an approximate match vlookup for strings).
So what you should take from this, is that you probably want to use a combination of high level heuristics (finding words from one phrase in the other phrase, length of both phrases, etc) along with the implementation of the Levenshtein distance algorithm. Because deciding which is the "best" match is a heuristic (fuzzy) determination - you'll have to come up with a set of weights for any metrics you come up with to determine similarity.
With the appropriate set of heuristics and weights, you'll have your comparison program quickly making the decisions that you would have made.
This problem turns up all the time in bioinformatics. The accepted answer above (which was great by the way) is known in bioinformatics as the Needleman-Wunsch (compare two strings) and Smith-Waterman (find an approximate substring in a longer string) algorithms. They work great and have been workhorses for decades.
But what if you have a million strings to compare? That's a trillion pairwise comparisons, each of which is O(n*m)! Modern DNA sequencers easily generate a billion short DNA sequences, each about 200 DNA "letters" long. Typically, we want to find, for each such string, the best match against the human genome (3 billion letters). Clearly, the Needleman-Wunsch algorithm and its relatives will not do.
This so-called "alignment problem" is a field of active research. The most popular algorithms are currently able to find inexact matches between 1 billion short strings and the human genome in a matter of hours on reasonable hardware (say, eight cores and 32 GB RAM).
Most of these algorithms work by quickly finding short exact matches (seeds) and then extending these to the full string using a slower algorithm (for example, the Smith-Waterman). The reason this works is that we are really only interested in a few close matches, so it pays off to get rid of the 99.9...% of pairs that have nothing in common.
How does finding exact matches help finding inexact matches? Well, say we allow only a single difference between the query and the target. It is easy to see that this difference must occur in either the right or left half of the query, and so the other half must match exactly. This idea can be extended to multiple mismatches and is the basis for the ELAND algorithm commonly used with Illumina DNA sequencers.
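The half-must-match argument can be sketched in a few lines of Python. This is a toy illustration of the seed-and-verify idea for substitutions only, not the ELAND algorithm itself; the function names are made up and the query is assumed to have at least two letters.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_one_mismatch(query, target):
    """Positions in target where query matches with at most one substitution."""
    n, half = len(query), len(query) // 2
    # With at most one mismatch, at least one of the two halves matches exactly.
    seeds = [(0, query[:half]), (half, query[half:])]
    hits = set()
    for offset, seed in seeds:
        start = target.find(seed)
        while start != -1:
            pos = start - offset  # where the full query would begin
            if 0 <= pos <= len(target) - n and hamming(query, target[pos:pos + n]) <= 1:
                hits.add(pos)
            start = target.find(seed, start + 1)
    return sorted(hits)
```

Here `str.find` stands in for the fast exact-match index; only the few positions where a seed hits are checked with the slower full comparison.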
There are many very good algorithms for doing exact string matching. Given a query string of length 200, and a target string of length 3 billion (the human genome), we want to find any place in the target where there is a substring of length k that matches a substring of the query exactly. A simple approach is to begin by indexing the target: take all k-long substrings, put them in an array and sort them. Then take each k-long substring of the query and search the sorted index. Sorting the index takes O(n log n) time once, and each search then takes O(log n) time.
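A small Python sketch of the sorted k-mer index, using binary search from the standard library. The names `build_index` and `lookup` are illustrative; a real aligner would not store the substrings explicitly like this.

```python
from bisect import bisect_left


def build_index(target, k):
    """Sorted list of (k-long substring, position) pairs for the target."""
    return sorted((target[i:i + k], i) for i in range(len(target) - k + 1))


def lookup(index, kmer):
    """All positions where kmer occurs, found by binary search in the sorted index."""
    i = bisect_left(index, (kmer, -1))  # first entry whose substring is >= kmer
    positions = []
    while i < len(index) and index[i][0] == kmer:
        positions.append(index[i][1])
        i += 1
    return positions
```

To seed an alignment you would call lookup for every k-long substring of the query and then extend each hit with a slower inexact comparison.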
But storage can be a problem. An index of the 3 billion letter target would need to hold 3 billion pointers and 3 billion k-long words. It would seem hard to fit this in less than several tens of gigabytes of RAM. But amazingly we can greatly compress the index, using the Burrows-Wheeler transform, and it will still be efficiently queryable. An index of the human genome can fit in less than 4 GB RAM. This idea is the basis of popular sequence aligners such as Bowtie and BWA.
Alternatively, we can use a suffix array, which stores only the pointers, yet represents a simultaneous index of all suffixes in the target string (essentially, a simultaneous index for all possible values of k; the same is true of the Burrows-Wheeler transform). A suffix array index of the human genome will take 12 GB of RAM if we use 32-bit pointers.
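For illustration, a naive suffix array in Python: only the start positions are stored, and a query of any length is answered with two binary searches. The construction shown here sorts whole suffixes and is far too slow and memory-hungry for a genome; it is only meant to show the data structure, and the function names are made up.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """All positions where pattern occurs, via binary search over the suffix array."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank]
        return text[start:start + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-letter prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])
```

Because the pattern length is not fixed when the array is built, the same index serves every value of k.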
The links above contain a wealth of information and links to primary research papers. The ELAND link goes to a PDF with useful figures illustrating the concepts involved, and shows how to deal with insertions and deletions.
Finally, while these algorithms have basically solved the problem of (re)sequencing single human genomes (a billion short strings), DNA sequencing technology improves even faster than Moore's law, and we are fast approaching trillion-letter datasets. For example, there are currently projects underway to sequence the genomes of 10,000 vertebrate species, each a billion letters long or so. Naturally, we will want to do pairwise inexact string matching on the data...
I contend that choice B is closer to the test string, as it is only 4 character changes (and 2 deletes) away from the original string. You see C as closer because it includes both brown and red; it would, however, have a greater edit distance.
There is an algorithm called Levenshtein Distance which measures the edit distance between two inputs.
Here is a tool for that algorithm.
Rates choice A as a distance of 15.
Rates choice B as a distance of 6.
Rates choice C as a distance of 9.
EDIT: Sorry, I keep mixing strings in the levenshtein tool. Updated to correct answers.
Lua implementation, for posterity:
function levenshtein_distance(str1, str2)
    local len1, len2 = #str1, #str2
    local char1, char2, distance = {}, {}, {}
    str1:gsub('.', function (c) table.insert(char1, c) end)
    str2:gsub('.', function (c) table.insert(char2, c) end)
    for i = 0, len1 do distance[i] = {} end
    for i = 0, len1 do distance[i][0] = i end
    for i = 0, len2 do distance[0][i] = i end
    for i = 1, len1 do
        for j = 1, len2 do
            distance[i][j] = math.min(
                distance[i-1][j  ] + 1,
                distance[i  ][j-1] + 1,
                distance[i-1][j-1] + (char1[i] == char2[j] and 0 or 1)
            )
        end
    end
    return distance[len1][len2]
end
You might find this library helpful!
http://code.google.com/p/google-diff-match-patch/
It is currently available in Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python
It works pretty well too. I use it in a couple of my Lua projects.
And I don't think it would be too difficult to port it to other languages!
You might be interested in this blog post.
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Fuzzywuzzy is a Python library that provides easy distance measures such as Levenshtein distance for string matching. It is built on top of difflib in the standard library and will make use of the C implementation Python-levenshtein if available.
http://pypi.python.org/pypi/python-Levenshtein/
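If you just want a feel for this kind of scoring without installing anything, a rough sketch using only the standard library's difflib (which the post says Fuzzywuzzy is built on) could look like this. This is not Fuzzywuzzy's own API; `similarity` and `best_match` are names made up for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

def best_match(query, choices):
    """Pick the choice with the highest similarity ratio to the query."""
    return max(choices, key=lambda choice: similarity(query, choice))

choices = [
    "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "THE RED COW JUMPED OVER THE RED COW",
    "THE RED FOX JUMPED OVER THE BROWN COW",
]
query = "THE BROWN FOX JUMPED OVER THE RED COW"
for choice in choices:
    print(round(similarity(query, choice), 3), choice)
```

Note that difflib's ratio is based on matching blocks, not on Levenshtein distance, so its ranking can differ from the edit distances quoted earlier.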
If you're doing this in the context of a search engine or frontend against a database, you might consider using a tool like Apache Solr, with the ComplexPhraseQueryParser plugin. This combination allows you to search against an index of strings with the results sorted by relevance, as determined by Levenshtein distance.
We've been using it against a large collection of artists and song titles when the incoming query may have one or more typos, and it's worked pretty well (and remarkably fast considering the collections are in the millions of strings).
Additionally, with Solr, you can search against the index on demand via JSON, so you won't have to reinvent the solution between the different languages you're looking at.
The problem is hard to solve by hand if the input data is too large (say millions of strings). I used Elasticsearch to solve this.
Quick start : https://www.elastic.co/guide/en/elasticsearch/client/net-api/6.x/elasticsearch-net.html
Just insert all the input data into the index and you can quickly search for any string by edit distance. Here is a C# snippet which will give you a list of results sorted by edit distance (smallest to largest):
var res = client.Search<ClassName>(s => s
    .Query(q => q
        .Match(m => m
            .Field(f => f.VariableName)
            .Query("SAMPLE QUERY")
            // Elasticsearch fuzzy matching accepts edit distances of 0, 1 or 2 (or Auto)
            .Fuzziness(Fuzziness.EditDistance(2))
        )
    ));
A very, very good resource for these kinds of algorithms is Simmetrics: http://sourceforge.net/projects/simmetrics/
Unfortunately the awesome website containing a lot of the documentation is gone :(
In case it comes back up again, its previous address was this:
http://www.dcs.shef.ac.uk/~sam/simmetrics.html
Voila (courtesy of "Wayback Machine"): http://web.archive.org/web/20081230184321/http://www.dcs.shef.ac.uk/~sam/simmetrics.html
You can study the source code; there are dozens of algorithms for these kinds of comparisons, each with a different trade-off. The implementations are in Java.
To query a large set of text efficiently, you can use the concept of Edit Distance / Prefix Edit Distance.
Edit Distance ED(x,y): the minimal number of transformations (insertions, deletions, substitutions) needed to get from term x to term y.
But computing the ED between every term and the query text is resource- and time-intensive. Therefore, instead of calculating the ED for each term, we can first extract the possibly matching terms using a technique called a Qgram index, and then apply the ED calculation only to those selected terms.
An advantage of the Qgram index technique is that it supports fuzzy search.
One possible way to adopt a Qgram index is to build an inverted index using Qgrams: under each Qgram we store all the words that contain that Qgram. (Instead of storing the full string, you can store a unique ID for each string.) You can use the TreeMap data structure in Java for this.
Following is a small example on storing of terms
col : colmbia, colombo, gancola, tacolama
Then when querying, we calculate the number of common Qgrams between query text and available terms.
Example: x = HILLARY, y = HILARI(query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
number of q-grams in common = 4
For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest the term to the end user.
You can find an implementation of this theory in the following project (see "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To learn more about Edit Distance, Prefix Edit Distance and the Qgram index, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
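For readers who prefer code to prose, the Qgram filtering step described in this answer can be sketched in a few lines of Python. This is an independent illustration, not the code from the linked Java project; `qgrams`, `build_index` and `candidates` are hypothetical names, and the `$` padding follows the HILLARY example.

```python
from collections import defaultdict

def qgrams(term, q=3):
    """Pad with q-1 '$' on both sides and slide a window of width q."""
    padded = "$" * (q - 1) + term + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def build_index(terms, q=3):
    """Inverted index: Qgram -> set of IDs of the terms containing it."""
    index = defaultdict(set)
    for term_id, term in enumerate(terms):
        for gram in qgrams(term, q):
            index[gram].add(term_id)
    return index

def candidates(index, terms, query, min_common, q=3):
    """Terms sharing at least min_common distinct Qgrams with the query."""
    common = defaultdict(int)
    for gram in set(qgrams(query, q)):
        for term_id in index.get(gram, ()):
            common[term_id] += 1
    found = [(terms[t], n) for t, n in common.items() if n >= min_common]
    return sorted(found, key=lambda pair: -pair[1])

terms = ["HILLARY", "COLOMBO"]
index = build_index(terms)
print(candidates(index, terms, "HILARI", min_common=3))
```

Only the terms returned by `candidates` would then go through the more expensive ED/PED computation.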
Here is a Go proof of concept for calculating the distances between the given words. You can tune minDistance and difference for other use cases.
Playground: https://play.golang.org/p/NtrBzLdC3rE
package main
import (
"errors"
"fmt"
"log"
"math"
"strings"
)
var data string = `THE RED COW JUMPED OVER THE GREEN CHICKEN-THE RED COW JUMPED OVER THE RED COW-THE RED FOX JUMPED OVER THE BROWN COW`
const minDistance float64 = 2
const difference float64 = 1
type word struct {
data string
letters map[rune]int
}
type words struct {
words []word
}
// Print prettify the data present in word
func (w word) Print() {
var (
lenght int
c int
i int
key rune
)
fmt.Printf("Data: %s\n", w.data)
lenght = len(w.letters) - 1
c = 0
for key, i = range w.letters {
fmt.Printf("%s:%d", string(key), i)
if c != lenght {
fmt.Printf(" | ")
}
c++
}
fmt.Printf("\n")
}
func (ws words) fuzzySearch(data string) ([]word, error) {
var (
w word
err error
founds []word
)
w, err = initWord(data)
if err != nil {
log.Printf("Errors: %s\n", err.Error())
return nil, err
}
// Iterating all the words
for i := range ws.words {
letters := ws.words[i].letters
//
var similar float64 = 0
// Iterating the letters of the input data
for key := range w.letters {
if val, ok := letters[key]; ok {
if math.Abs(float64(val-w.letters[key])) <= minDistance {
similar += float64(val)
}
}
}
lenSimilarity := math.Abs(similar - float64(len(data)-strings.Count(data, " ")))
log.Printf("Comparing %s with %s i've found %f similar letter, with weight %f", data, ws.words[i].data, similar, lenSimilarity)
if lenSimilarity <= difference {
founds = append(founds, ws.words[i])
}
}
if len(founds) == 0 {
return nil, errors.New("no similar found for data: " + data)
}
return founds, nil
}
func initWords(data []string) []word {
var (
err error
words []word
word word
)
for i := range data {
word, err = initWord(data[i])
if err != nil {
log.Printf("Error in index [%d] for data: %s", i, data[i])
} else {
words = append(words, word)
}
}
return words
}
func initWord(data string) (word, error) {
var word word
word.data = data
word.letters = make(map[rune]int)
for _, r := range data {
if r != 32 { // avoid to save the whitespace
word.letters[r]++
}
}
return word, nil
}
func main() {
var ws words
words := initWords(strings.Split(data, "-"))
for i := range words {
words[i].Print()
}
ws.words = words
solution, _ := ws.fuzzySearch("THE BROWN FOX JUMPED OVER THE RED COW")
fmt.Println("Possible solutions: ", solution)
}
A sample using C# is here.
public static void Main()
{
Console.WriteLine("Hello World " + LevenshteinDistance("Hello","World"));
Console.WriteLine("Choice A " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED COW JUMPED OVER THE GREEN CHICKEN"));
Console.WriteLine("Choice B " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED COW JUMPED OVER THE RED COW"));
Console.WriteLine("Choice C " + LevenshteinDistance("THE BROWN FOX JUMPED OVER THE RED COW","THE RED FOX JUMPED OVER THE BROWN COW"));
}
public static float LevenshteinDistance(string a, string b)
{
var rowLen = a.Length;
var colLen = b.Length;
var maxLen = Math.Max(rowLen, colLen);
// Step 1
if (rowLen == 0 || colLen == 0)
{
return maxLen;
}
/// Create the two vectors
var v0 = new int[rowLen + 1];
var v1 = new int[rowLen + 1];
/// Step 2
/// Initialize the first vector
for (var i = 1; i <= rowLen; i++)
{
v0[i] = i;
}
// Step 3
/// For each column
for (var j = 1; j <= colLen; j++)
{
/// Set the 0'th element to the column number
v1[0] = j;
// Step 4
/// For each row
for (var i = 1; i <= rowLen; i++)
{
// Step 5
var cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
// Step 6
/// Find minimum
v1[i] = Math.Min(v0[i] + 1, Math.Min(v1[i - 1] + 1, v0[i - 1] + cost));
}
/// Swap the vectors
var vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// The vectors were swapped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
return v0[rowLen];
}
The output is:
Hello World 4
Choice A 15
Choice B 6
Choice C 8
There is one more similarity measure which I once implemented in our system and was giving satisfactory results :-
Use Case
There is a user query which needs to be matched against a set of documents.
Algorithm
Extract keywords from the user query (relevant POS TAGS - Noun, Proper noun).
Now calculate score based on below formula for measuring similarity between user query and given document.
For every keyword extracted from user query :-
Start searching the document for given word and for every subsequent occurrence of that word in the document decrease the rewarded points.
In essence, if first keyword appears 4 times in the document, the score will be calculated as :-
first occurrence will fetch '1' point.
Second occurrence will add 1/2 to calculated score
Third occurrence would add 1/3 to total
Fourth occurrence gets 1/4
Total similarity score = 1 + 1/2 + 1/3 + 1/4 = 2.083
Similarly, we calculate it for other keywords in user query.
Finally, the total score will represent the extent of similarity between user query and given document.
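A minimal Python sketch of the scoring formula above, assuming the keywords have already been extracted (the POS tagging step is left out) and that the document is split on whitespace. `keyword_score` and `similarity_score` are names invented for this example.

```python
def keyword_score(keyword, document_words):
    """The n-th occurrence of the keyword contributes 1/n to the score."""
    occurrences = sum(1 for word in document_words if word == keyword)
    return sum(1.0 / n for n in range(1, occurrences + 1))

def similarity_score(keywords, document):
    """Sum the per-keyword scores over all keywords of the query."""
    words = document.lower().split()
    return sum(keyword_score(keyword.lower(), words) for keyword in keywords)

print(round(keyword_score("cow", "cow cow cow cow".split()), 3))
print(similarity_score(["fox", "cow"], "The fox saw a cow and another cow"))
```

The diminishing reward means a document that mentions many of the keywords scores higher than one that repeats a single keyword many times.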
Here is a quick solution that doesn't depend on any libraries, and works well enough for things like autocomplete forms:
function compare_strings(str1, str2) {
  const arr1 = str1.split("");
  const arr2 = str2.split("");
  // count the characters of str1 that also occur somewhere in str2
  const res = arr1.reduce((a, c) => a + arr2.includes(c), 0);
  return res;
}
Can use in an autocomplete input like this:
HTML:
<div id="wrapper">
<input id="tag_input" placeholder="add tags..."></input>
<div id="hold_tags"></div>
</div>
CSS:
body {
background: #2c2c54;
display: flex;
justify-content: center;
align-items: center;
}
input {
height: 40px;
width: 400px;
border-radius: 4px;
outline: 0;
border: none;
padding-left: 5px;
font-size: 18px;
}
#wrapper {
height: auto;
background: #40407a;
}
.tag {
background: #ffda79;
margin: 4px;
padding: 5px;
border-radius: 4px;
box-shadow: 2px 2px 2px black;
font-size: 18px;
font-family: arial;
cursor: pointer;
}
JS:
const input = document.getElementById("tag_input");
const wrapper = document.getElementById("wrapper");
const hold_tags = document.getElementById("hold_tags");
const words = [
"machine",
"data",
"platform",
"garbage",
"twitter",
"knowledge"
];
input.addEventListener("input", function (e) {
const value = document.getElementById(e.target.id).value;
hold_tags.replaceChildren();
if (value !== "") {
words.forEach(function (word) {
if (compare_strings(word, value) > value.length - 1) {
const tag = document.createElement("div");
tag.className = "tag";
tag.innerText = word;
hold_tags.append(tag);
}
});
}
});
function compare_strings(str1, str2) {
  const arr1 = str1.split("");
  const arr2 = str2.split("");
  // count the characters of str1 that also occur somewhere in str2
  const res = arr1.reduce((a, c) => a + arr2.includes(c), 0);
  return res;
}

Algorithm for linear pattern matching?

I have a linear list of zeros and ones and I need to match multiple simple patterns and find the first occurrence. For example, I might need to find 0001101101, 01010100100, OR 10100100010 within a list of length 8 million. I only need to find the first occurrence of either, and then return the index at which it occurs. However, doing the looping and accesses over the large list can be expensive, and I'd rather not do it too many times.
Is there a faster method than doing
foreach (pattern in patterns) {
    for (i = 0; i + patternlength <= listLength; i++) {
        for (t = 0; t < patternlength; t++) {
            if (list[i+t] != pattern[t]) {
                break;
            }
        }
        if (t == patternlength) {
            return i; // pattern found!
        }
    }
}
Edit: BTW, I have implemented this program according to the above pseudocode, and performance is OK, but nothing spectacular. I'm estimating that I process about 6 million bits a second on a single core of my processor. I'm using this for image processing, and it's going to have to go through a few thousand 8 megapixel images, so every little bit helps.
Edit: If it's not clear, I'm working with a bit array, so there's only two possibilities: ONE and ZERO. And it's in C++.
Edit: Thanks for the pointers to BM and KMP algorithms. I noted that, on the Wikipedia page for BM, it says
"The algorithm preprocesses the target string (key) that is being searched for, but not the string being searched in (unlike some algorithms that preprocess the string to be searched and can then amortize the expense of the preprocessing by searching repeatedly)."
That looks interesting, but it didn't give any examples of such algorithms. Would something like that also help?
The key for Googling is "multi-pattern" string matching.
Back in 1975, Aho and Corasick published a (linear-time) algorithm, which was used in the original version of fgrep. The algorithm subsequently got refined by many researchers. For example, Commentz-Walter (1979) combined Aho&Corasick with Boyer&Moore matching. Baeza-Yates (1989) combined AC with the Boyer-Moore-Horspool variant. Wu and Manber (1994) did similar work.
An alternative to the AC line of multi-pattern matching algorithms is Rabin and Karp's algorithm.
I suggest starting by reading the Aho-Corasick and Rabin-Karp Wikipedia pages and then deciding whether that would make sense in your case. If so, there may already be an implementation available for your language/runtime.
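To give an idea of what the Aho-Corasick approach looks like, here is a compact Python sketch that scans the data once for all patterns. It is a textbook-style illustration rather than a tuned implementation, and `build_automaton` and `first_match` are names chosen here. One caveat: it reports the match that ends first, which is not always the match that starts first when the patterns differ in length.

```python
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]  # per state: transitions, failure link, pattern ids
    for idx, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({}); fail.append(0); out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())  # breadth-first, so shallower states are done first
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches of the suffix state
    return goto, fail, out

def first_match(data, patterns):
    """Return (start_index, pattern_index) of the earliest-ending match, or None."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    for pos, ch in enumerate(data):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if out[state]:
            # several patterns may end here; prefer the longest, which starts earliest
            idx = max(out[state], key=lambda i: len(patterns[i]))
            return pos - len(patterns[idx]) + 1, idx
    return None

patterns = ["0001101101", "01010100100", "10100100010"]
print(first_match("111" + "10100100010" + "0001101101", patterns))
```

Every element of the data is read exactly once, regardless of how many patterns there are.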
Yes.
The Boyer–Moore string search algorithm
See also: Knuth–Morris–Pratt algorithm
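You could build a suffix array and search it. The query time is very good: about O(length(pattern) * log(length(text))) with a plain binary search, or close to O(length(pattern)) with additional LCP information.
But you have to build that array first.
It is only worth it when the text is static and the patterns are dynamic.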
A solution that could be efficient:
1. store the patterns in a trie data structure
2. start searching the list
3. check if the next pattern_length chars are in the trie, stop on success (an O(pattern_length) operation, so effectively constant for short patterns)
4. step one char and repeat #3
If the list isn't mutable you can store the offset of matching patterns to avoid repeating calculations the next time.
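The steps above could be sketched in Python roughly as follows. This is one possible reading of the answer, with made-up names (`build_trie`, `first_occurrence`, `END`); it returns the earliest starting position at which any pattern matches.

```python
END = object()  # marker key meaning "a pattern ends here"; cannot collide with a symbol

def build_trie(patterns):
    """Nested dicts: each level maps a symbol to the next trie node."""
    root = {}
    for idx, pattern in enumerate(patterns):
        node = root
        for symbol in pattern:
            node = node.setdefault(symbol, {})
        node[END] = idx
    return root

def first_occurrence(data, trie):
    """Return (start_index, pattern_index) of the earliest-starting match, or None."""
    for start in range(len(data)):
        node, pos = trie, start
        # walk the trie for as long as the data keeps matching some pattern prefix
        while pos < len(data) and data[pos] in node:
            node = node[data[pos]]
            pos += 1
            if END in node:
                return start, node[END]
    return None

patterns = ["0001101101", "01010100100", "10100100010"]
trie = build_trie(patterns)
print(first_occurrence("11110100100010", trie))
```

The walk at each position costs at most the length of the longest pattern, and all patterns are checked in the same walk instead of one pass per pattern.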
If your strings need to be flexible, I would also recommend a modified "The Boyer–Moore string search algorithm" as per Mitch Wheat. If your strings do not need to be flexible, you should be able to collapse your pattern matching even more. The model of Boyer-Moore is incredibly efficient for searching a large amount of data for one of multiple strings to match against.
Jacob
If it's a bit array, I suppose doing a rolling sum would be an improvement. If pattern is length n, sum the first n bits and see if it matches the pattern's sum. Store the first bit of the sum always. Then, for every next bit, subtract the first bit from the sum and add the next bit, and see if the sum matches the pattern's sum. That would save the linear loop over the pattern.
It seems like the BM algorithm isn't as awesome for this as it looks, because here I only have two possible values, zero and one, so the first table doesn't help a whole lot. Second table might help, but that means BMH is mostly worthless.
Edit: In my sleep-deprived state I couldn't understand BM, so I just implemented this rolling sum (it was really easy) and it made my search 3 times faster. Thanks to whoever mentioned "rolling hashes". I can now search through 321,750,000 bits for two 30-bit patterns in 5.45 seconds (and that's single-threaded), versus 17.3 seconds before.
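A small Python sketch of the rolling sum filter described above, assuming the data and the pattern are sequences of 0/1 integers. The name `rolling_sum_search` is invented for this example. Equal sums are only a necessary condition, so the sketch still compares the window against the pattern whenever the sums agree.

```python
def rolling_sum_search(bits, pattern):
    """Return the index of the first occurrence of pattern in bits, or -1."""
    n = len(pattern)
    if n == 0 or n > len(bits):
        return -1
    target = sum(pattern)
    window = sum(bits[:n])  # number of ones in the current window
    for i in range(len(bits) - n + 1):
        # cheap filter first; do the full comparison only when the sums agree
        if window == target and list(bits[i:i + n]) == list(pattern):
            return i
        if i + n < len(bits):
            window += bits[i + n] - bits[i]  # slide: add the entering bit, drop the leaving one
    return -1

bits = [int(c) for c in "11110100100010"]
pattern = [int(c) for c in "10100100010"]
print(rolling_sum_search(bits, pattern))
```

Most windows are rejected by a single integer comparison, which is where the speedup over the nested loop comes from.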
If it's just alternating 0's and 1's, then encode your text as runs. A run of n 0's is -n and a run of n 1's is n. Then encode your search strings. Then create a search function that uses the encoded strings.
The code looks like this:
try:
    import psyco
    psyco.full()
except ImportError:
    pass

def encode(s):
    def calc_count(count, c):
        return count * (-1 if c == '0' else 1)
    result = []
    c = s[0]
    count = 1
    for i in range(1, len(s)):
        d = s[i]
        if d == c:
            count += 1
        else:
            result.append(calc_count(count, c))
            count = 1
            c = d
    result.append(calc_count(count, c))
    return result

def search(encoded_source, targets):
    def match(encoded_source, t, max_search_len, len_source):
        x = len(t)-1
        # Get the indexes of the longest segments and search them first
        most_restrictive = [bb[0] for bb in sorted(((i, abs(t[i])) for i in range(1,x)), key=lambda x: x[1], reverse=True)]
        # Align the signs of the source and target
        index = (0 if encoded_source[0] * t[0] > 0 else 1)
        unencoded_pos = sum(abs(c) for c in encoded_source[:index])
        start_t, end_t = abs(t[0]), abs(t[x])
        for i in range(index, len(encoded_source)-x, 2):
            if all(t[j] == encoded_source[j+i] for j in most_restrictive):
                encoded_start, encoded_end = abs(encoded_source[i]), abs(encoded_source[i+x])
                if start_t <= encoded_start and end_t <= encoded_end:
                    return unencoded_pos + (abs(encoded_source[i]) - start_t)
            unencoded_pos += abs(encoded_source[i]) + abs(encoded_source[i+1])
            if unencoded_pos > max_search_len:
                return len_source
        return len_source
    len_source = sum(abs(c) for c in encoded_source)
    i, found, target_index = len_source, None, -1
    for j, t in enumerate(targets):
        x = match(encoded_source, t, i, len_source)
        print "Match at: ", x
        if x < i:
            i, found, target_index = x, t, j
    return (i, found, target_index)

if __name__ == "__main__":
    import datetime
    def make_source_text(len):
        from random import randint
        item_len = 8
        item_count = 2**item_len
        table = ["".join("1" if (j & (1 << i)) else "0" for i in reversed(range(item_len))) for j in range(item_count)]
        return "".join(table[randint(0,item_count-1)] for _ in range(len//item_len))
    targets = ['0001101101'*2, '01010100100'*2, '10100100010'*2]
    encoded_targets = [encode(t) for t in targets]
    data_len = 10*1000*1000
    s = datetime.datetime.now()
    source_text = make_source_text(data_len)
    e = datetime.datetime.now()
    print "Make source text(length %d): " % data_len, (e - s)
    s = datetime.datetime.now()
    encoded_source = encode(source_text)
    e = datetime.datetime.now()
    print "Encode source text: ", (e - s)
    s = datetime.datetime.now()
    (i, found, target_index) = search(encoded_source, encoded_targets)
    print (i, found, target_index)
    print "Target was: ", targets[target_index]
    print "Source matched here: ", source_text[i:i+len(targets[target_index])]
    e = datetime.datetime.now()
    print "Search time: ", (e - s)
On a string twice as long as you offered, it takes about seven seconds to find the earliest match of three targets in 10 million characters. Of course, since I am using random text, that varies a bit with each run.
psyco is a python module for optimizing the code at run-time. Using it, you get great performance, and you might estimate that as an upper bound on the C/C++ performance. Here is recent performance:
Make source text(length 10000000): 0:00:02.277000
Encode source text: 0:00:00.329000
Match at: 2517905
Match at: 494990
Match at: 450986
(450986, [1, -1, 1, -2, 1, -3, 1, -1, 1, -1, 1, -2, 1, -3, 1, -1], 2)
Target was: 1010010001010100100010
Source matched here: 1010010001010100100010
Search time: 0:00:04.325000
It takes about 300 milliseconds to encode 10 million characters and about 4 seconds to search three encoded strings against it. I don't think the encoding time would be high in C/C++.

Resources