Algorithms for "fuzzy matching" strings - algorithm

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.

I've finally understood what you were looking for. The issue is interesting; however, looking at the two algorithms you found, it seems that people have widely different opinions about the specification ;)
I think it would be useful to state the problem and the requirements more clearly.
Problem:
We are looking for a way to speed up typing by letting users type only a few letters of the keyword they intend and then offering them a list from which to select.
It is expected that all the letters of the input be in the keyword
It is expected that the letters in the input be in the same order in the keyword
The list of keywords returned should be presented in a consistent (reproducible) order
The algorithm should be case insensitive
Analysis:
The first two requirements can be summed up as follows: for an input axg, we are looking for words matching the regular expression [^a]*a[^x]*x[^g]*g.*
The third requirement is purposely loose. The order in which the words appear in the list needs to be consistent; however, it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremely long, a scoring approach could be better, but for a short list it's easier for the eye to find a particular item in a list sorted in an obvious manner.
Also, alphabetical order has the advantage of consistency during typing: adding a letter does not completely reorder the list (painful for the eye and brain); it merely filters out the items that no longer match.
Nothing is specified about handling Unicode characters: for example, is à similar to a, or a different character altogether? Since I know of no language that currently uses such characters in its keywords, I'll let it slip for now.
My solution:
For any input, I would build the regular expression described earlier. It is well suited to Python because the language already features case-insensitive matching.
I would then match my (alphabetically sorted) list of keywords, and output it so filtered.
In pseudo-code:
import re
WORDS = ['Bar', 'Foo', 'FooBar', 'Other']
def GetList(input, words=WORDS):
    # Build e.g. '[^a]*a[^x]*x[^g]*g' for the input 'axg'
    expr = ''.join('[^' + re.escape(c) + ']*' + re.escape(c) for c in input)
    return [w for w in words if re.match(expr, w, re.IGNORECASE)]
I could have used a one-liner but thought it would obscure the code ;)
This solution works very well for incremental situations (i.e., when you match as the user types and thus keep rebuilding the list), because when the user adds a character you can simply refilter the result you just computed. Thus:
Either there are few characters, so the matching is quick and the length of the list does not matter much
Or there are a lot of characters, which means we are filtering a short list, so it does not matter too much if matching each element takes a bit longer
I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.
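The state-machine remark can be made concrete. What follows is only a minimal Python sketch under stated assumptions: the names fuzzy_match and refilter are hypothetical, case folding is done with plain lower(), and Unicode normalisation is ignored. The single integer state counts how many letters of the input have been matched so far.

```python
WORDS = ['Bar', 'Foo', 'FooBar', 'Other']  # assumed to be sorted already

def fuzzy_match(query, word):
    """Return True if the letters of query occur in word, in the same order."""
    query, word = query.lower(), word.lower()
    state = 0  # number of query letters matched so far
    for ch in word:
        if state < len(query) and ch == query[state]:
            state += 1
    return state == len(query)

def refilter(query, candidates):
    """Keep the candidates that still match; the input order is preserved."""
    return [w for w in candidates if fuzzy_match(query, w)]

first = refilter('o', WORDS)     # ['Foo', 'FooBar', 'Other']
second = refilter('ob', first)   # ['FooBar'], computed from the previous result
print(first, second)
```

Because refilter only ever removes items, feeding it the previous result keeps the list order stable while the user types.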

Levenshtein 'Edit Distance' algorithms will definitely work on what you're trying to do: they will give you a measurement of how closely two words or addresses or phone numbers, psalms, monologues and scholarly articles match each other, allowing you to rank the results and choose the best match.
A more lightweight approach is to count up the common substrings: it's not as good as Levenshtein, but it provides usable results and runs quickly in slow languages which have access to fast 'InString' functions.
I published an Excel 'Fuzzy Lookup' in Excellerando a few years ago, using a 'FuzzyMatchScore' function that is, as far as I can tell, exactly what you need:
http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html
It is, of course, in Visual Basic for Applications. Proceed with caution, crucifixes and garlic:
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2
Dim s3 As String
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is an exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = 0
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Now clip out this substring from s1 and s2...
' And search the fragments before and after this excision:
If i > 1 And j > 1 Then
sBefore1 = left(s1, i - 1)
sBefore2 = left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = right(s1, n + 1 - i - len1)
sAfter2 = right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
Exit Function
End If
Next
Next
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
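For readers who do not use VBA, the scoring idea (find the longest shared substring, add its length to the score, then recurse on the fragments to its left and to its right) can be sketched in a few lines of Python. This is a simplified re-expression under stated assumptions, not a line-by-line port: the function name is hypothetical, the comparison is made case-insensitive with lower(), and the fragment boundaries are handled with plain slicing.

```python
def sum_of_common_substrings(s1, s2):
    """Score how much of the shorter string is made of substrings of the longer."""
    s1, s2 = s1.lower(), s2.lower()
    if len(s1) > len(s2):          # s1 should be the shorter string
        s1, s2 = s2, s1
    if not s1:
        return 0
    # Try the longest candidate substrings of s1 first.
    for length in range(len(s1), 0, -1):
        for i in range(len(s1) - length + 1):
            j = s2.find(s1[i:i + length])
            if j >= 0:
                # Clip out the match and score the fragments on either side.
                before = sum_of_common_substrings(s1[:i], s2[:j])
                after = sum_of_common_substrings(s1[i + length:], s2[j + length:])
                return length + before + after
    return 0

print(sum_of_common_substrings('Wednesday', 'WedXesday'))     # 8
print(sum_of_common_substrings('Wednesday', 'WednesXXX'))     # 6
print(sum_of_common_substrings('Wednesday', 'WednesXXXday'))  # 9
```

As the comments in the VBA warn, the last call scores the same as an exact match, so a caller should divide by the length of the longer string to get a similarity ratio.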

Two algorithms I've found so far:
LiquidMetal
Better Ido Flex-Matching

I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I have just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match. So we create a nested map, where at the top level each key is a character that appears somewhere in the search set, mapping to the indices of all the strings in the search set that contain it. Each of those indices then maps to a list of character offsets at which that particular character appears in that string.
In pseudo code, for the strings:
controller
model
view
We'd build a map like this:
{
"c" => {
0 => [0]
},
"o" => {
0 => [1, 5],
1 => [1]
},
"n" => {
0 => [2]
},
"t" => {
0 => [3]
},
"r" => {
0 => [4, 9]
},
"l" => {
0 => [6, 7],
1 => [4]
},
"e" => {
0 => [8],
1 => [3],
2 => [2]
},
"m" => {
1 => [0]
},
"d" => {
1 => [2]
},
"v" => {
2 => [0]
},
"i" => {
2 => [1]
},
"w" => {
2 => [3]
}
}
So now you have a mapping like this:
{
character-1 => {
word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
word-index-n => [ ... ],
...
},
character-n => {
...
},
...
}
Now searching for the string "oe":
Initialize a new map where the keys will be the indices of strings that match, and the values the offset read through that string so far.
Consume the first char from the search string "o" and look it up in the lookup table.
Since strings at indices 0 and 1 match the "o", put them into the map {0 => 1, 1 => 1}.
Now consume the next char in the input string, "e", and look it up in the table.
Here 3 strings match, but we know that we only care about strings 0 and 1.
For each of them, check whether there is an offset greater than the current offset. If not, eliminate the item from our map; otherwise update the offset: {0 => 8, 1 => 3}.
Now by looking at the keys in our map that we've accumulated, we know which strings matched the fuzzy search.
Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function. I think this will be a lot faster than iterating all search strings and performing a full wildcard search on each one.
The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it's not hard to get that logic added too.
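The lookup table and the incremental search described above could be sketched in Python as follows. This is an illustration under assumptions rather than the plugin's actual code: build_index, refine and search are hypothetical names, and the search starts every string at offset -1 so that the first character is handled like any other.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(words):
    """Map character -> {word index -> sorted list of offsets of that character}."""
    index = defaultdict(dict)
    for word_index, word in enumerate(words):
        for offset, ch in enumerate(word):
            index[ch].setdefault(word_index, []).append(offset)
    return index

def refine(index, state, ch):
    """Advance every surviving word past the next occurrence of ch, or drop it."""
    postings = index.get(ch, {})
    survivors = {}
    for word_index, current in state.items():
        offsets = postings.get(word_index, [])
        k = bisect_right(offsets, current)   # first offset > current
        if k < len(offsets):
            survivors[word_index] = offsets[k]
    return survivors

def search(index, word_count, query):
    state = {i: -1 for i in range(word_count)}
    for ch in query:
        state = refine(index, state, ch)
    return state

words = ['controller', 'model', 'view']
index = build_index(words)
print(search(index, len(words), 'oe'))   # {0: 8, 1: 3}
```

When searching as the user types, the state returned for the previous input can be passed straight to refine with the newly typed character instead of starting again from search.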

I recently had to solve the same problem. My solution involves scoring strings with consecutively matched letters highly and excluding strings that don't contain the typed letters in order.
I've documented the algorithm in detail here: http://blog.kazade.co.uk/2014/10/a-fuzzy-filename-matching-algorithm.html

If your text is predominantly English then you may try your hand at various Soundex algorithms
1. Classic soundex
2. Metaphone
These algorithms will let you choose words which sound like each other and will be a good way to find misspelled words.

Related

How to split a string into an array of individual characters

I have this textbox named txtnum in which I have to enter a 15-digit number and assign it to the variable num. I want to split the number into individual characters so that I can carry out calculations on each. Something like: product = arraynum[2]*2. How do I split the string in the text box into an array of characters?
Nothing is built in (as far as I know), but it is easy enough to write a function which takes a string and returns an array of characters:
Function ToArray(s As String) As Variant
Dim A As Variant
Dim i As Long, n As Long
n = Len(s)
ReDim A(0 To n - 1)
For i = 0 To n - 1
A(i) = Mid(s, i + 1, 1)
Next i
ToArray = A
End Function
Having done this, there is little actual gain from using a function like this as opposed to simply using Mid().
Here is another option:
Dim s As Variant
s = "012345678901234"
s = StrConv(s, vbUnicode)
s = Split(s, vbNullChar)
s will contain an array of characters.

Combining two lists with minimum distance between elements

I have two lists like these:
a = ["1a","2a","3a","4a","5a","6a","7a","8a","9a","10a","11a","12a","13a","14a"]
b = ["1b","2b","3b","4b","5b","6b","7b","8b","9b","10b","11b","12b","13b","14b"]
And what I want is to combine them, so that there is at least a difference of n elements between an element from a and its corresponding element in b.
As an example, if my n is 10, and "3a" is in position 3 and "3b" is in position 5, that isn't a solution since there's only a distance of 2 between these corresponding elements.
I have already solved this for the purpose I want through a brute force method: shuffle the union of the two arrays and see if the constraint is met; if not, shuffle again and so on... Needless to say, for 14-element arrays there is sometimes a 5 to 10 second computation to yield a solution with a minimum distance of 10. Even though that's kind of OK for what I am looking for, I am curious about how I could solve this in a more optimized way.
I am currently using Python, but code in any language (or pseudo-code) is more than welcome.
EDIT: The context of this problem is something like a questionnaire, in which around 100 participants are expected to take part. Therefore, I am not necessarily interested in all the solutions, but rather something like the first 100.
Thanks.
For your specific scenario, you could use a randomized approach -- though not as random as what you've already tried. Something like this:
1. Start with a random permutation of the items in both lists.
2. Create a new permutation by copying the current one and randomly swapping two items.
3. Measure the quality of the permutations, e.g., the sum of the distances of each pair of related items, or the minimum of such distances.
4. If the quality of the new permutation is better than that of the original permutation, keep the new one; otherwise discard the new one and continue with the original permutation.
5. Repeat from 2. until each distance is at least 10 or until the quality does not improve over a number of iterations.
The difference to shuffling the whole list in each iteration (as in your approach) is that in each iteration the permutation can only get better, until a satisfying solution is found.
Each time you run this algorithm, the result will be slightly different, so you can run it 100 times for 100 different solutions. Of course, this algorithm does not guarantee to find a solution (much less all such solutions), but it should be fast enough so that you could just restart it in case it fails.
In Python, this could look somewhat like this (slightly simplified, but still working):
import random
def shuffle(A, B):
    # original positions, i.e. types of questions
    kind = dict([(item, i) for i, item in list(enumerate(A)) + list(enumerate(B))])
    # get positions of elements of kinds, and return sum of their distances
    def quality(perm):
        pos = dict([(kind[item], i) for i, item in enumerate(perm)])
        return sum(abs(pos[kind[item]] - i) for i, item in enumerate(perm))
    # initial permutation and quality
    current = A + B
    random.shuffle(current)
    best = quality(current)
    # improve upon initial permutation by randomly swapping items
    for g in range(1000):
        i = random.randint(0, len(current)-1)
        j = random.randint(0, len(current)-1)
        copy = current[:]
        copy[i], copy[j] = copy[j], copy[i]
        q = quality(copy)
        if q > best:
            current, best = copy, q
    return current
Example output for print shuffle(a, b):
['14b', '2a', '13b', '3a', '9b', '4a', '6a', '1a', '8a', '5b', '12b', '11a', '10b', '7b', '4b', '11b', '5a', '7a', '8b', '12a', '13a', '14a', '1b', '2b', '3b', '6b', '10a', '9a']
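Since the approach is not guaranteed to reach the required distance, it may help to check a result before using it. The helper below is a small sketch with a hypothetical name, min_pair_distance; it assumes every item is a number followed by a one-letter suffix, as in the lists from the question, and that each number appears exactly twice.

```python
def min_pair_distance(perm):
    """Return the smallest index distance between two items sharing a number."""
    first_position = {}
    smallest = None
    for position, item in enumerate(perm):
        key = item[:-1]                      # strip the trailing 'a' or 'b'
        if key in first_position:
            distance = position - first_position[key]
            if smallest is None or distance < smallest:
                smallest = distance
        else:
            first_position[key] = position
    return smallest

print(min_pair_distance(['1a', '2a', '1b', '2b']))   # 2
```

A caller could rerun the randomized search until min_pair_distance reports at least the desired n.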
As I understand from your question, it is possible to perform all the ordering by relying exclusively on the indices of the arrays (i.e., on pure integers) and thus the problem can be reduced to create (valid) ranges instead of analysing each element.
for each a <= total_items-n , valid b = if(a + n == total_items){total_items} else{[a + n, total_items]}
For example:
n = 10;
total_items = 15;
for a = 1 -> valid b = [11, 15]
for a = 2 -> valid b = [12, 15]
, etc.
This would be performed 4 times: forwards and backwards for a with respect to b, and the same for b with respect to a.
In this way you would reduce the number of iterations to its minimum expression and would get, as an output, a set of "solutions" for each position, rather than a one-to-one binding (that is what you have right now, isn't it?).
If there are equivalents in Python to .NET's Lists and LINQ, then you may be able to directly convert the following code. It generates up to 100 lists really quickly: I press "debug" to run it and up pops a window with the results in much less than a second.
' VS2012
Option Infer On
Module Module1
Dim minDistance As Integer = 10
Dim rand As New Random ' a random number generator
Function OkToAppend(current As List(Of Integer), x As Integer) As Boolean
' see if the previous minDistance values contain the number x
Return Not (current.Skip(current.Count - minDistance).Take(minDistance).Contains(x))
End Function
Function GenerateList() As List(Of String)
' We don't need to start with strings: integers will make it faster.
' The "a" and "b" suffixes can be sprinkled on at random once the
' list is created.
Dim numbersToUse() As Integer = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}
Dim pool As New List(Of Integer)
' we need all the numbers twice
pool.AddRange(numbersToUse)
pool.AddRange(numbersToUse)
Dim newList As New List(Of Integer)
Dim pos As Integer
For i = 0 To pool.Count - 1
' limit the effort it puts in
Dim sanity As Integer = pool.Count * 10
Do
pos = rand.Next(0, pool.Count)
sanity -= 1
Loop Until OkToAppend(newList, pool(pos)) OrElse sanity = 0
If sanity > 0 Then ' it worked
' append the value to the list
newList.Add(pool(pos))
' remove the value which has been used
pool.RemoveAt(pos)
Else ' give up on this arrangement
Return Nothing
End If
Next
' Create the final list with "a" and "b" stuck on each value.
Dim stringList As New List(Of String)
Dim usedA(numbersToUse.Length) As Boolean
Dim usedB(numbersToUse.Length) As Boolean
For i = 0 To newList.Count - 1
Dim z = newList(i)
Dim suffix As String = ""
If usedA(z) Then
suffix = "b"
ElseIf usedB(z) Then
suffix = "a"
End If
' rand.Next(2) generates an integer in the range [0,2)
If suffix.Length = 0 Then suffix = If(rand.Next(2) = 1, "a", "b")
If suffix = "a" Then
usedA(z) = True
Else
usedB(z) = True
End If
stringList.Add(z.ToString & suffix)
Next
Return stringList
End Function
Sub Main()
Dim arrangements As New List(Of List(Of String))
For i = 1 To 100
Dim thisArrangement = GenerateList()
If thisArrangement IsNot Nothing Then
arrangements.Add(thisArrangement)
End If
Next
'TODO: remove duplicate entries and generate more to make it up to
' the required quantity.
For Each a In arrangements
' outputs the elements of a with ", " as a separator
Console.WriteLine(String.Join(", ", a))
Next
' wait for user to press enter
Console.ReadLine()
End Sub
End Module
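A rough Python counterpart of the idea is sketched below. It is an adaptation under assumptions, not a direct conversion: generate_arrangement and arrangements are hypothetical names, and instead of drawing from the pool until a usable value turns up (with a retry limit), it picks at random among the values that are currently allowed and gives up when there are none. As in the VB version, a value may not reappear among the previous min_distance entries, and duplicates are skipped until the requested quantity is reached.

```python
import random

def generate_arrangement(numbers, min_distance, rng=random):
    """Return one random arrangement as labelled strings, or None if stuck."""
    pool = list(numbers) * 2              # every number is needed twice
    order = []
    while pool:
        recent = set(order[-min_distance:]) if min_distance > 0 else set()
        candidates = [k for k, value in enumerate(pool) if value not in recent]
        if not candidates:                # dead end: give up on this arrangement
            return None
        order.append(pool.pop(rng.choice(candidates)))
    # The first occurrence gets a random suffix, the second gets the other one.
    first_suffix = {}
    labelled = []
    for value in order:
        if value in first_suffix:
            suffix = 'b' if first_suffix[value] == 'a' else 'a'
        else:
            suffix = first_suffix[value] = rng.choice('ab')
        labelled.append(f'{value}{suffix}')
    return labelled

def arrangements(numbers, min_distance, wanted, max_tries=10000):
    """Collect up to `wanted` distinct arrangements, skipping failed attempts."""
    found = []
    for _ in range(max_tries):
        if len(found) == wanted:
            break
        result = generate_arrangement(numbers, min_distance)
        if result is not None and result not in found:
            found.append(result)
    return found

for arrangement in arrangements(range(1, 15), 10, 3):
    print(', '.join(arrangement))
```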

Working with arbitrary inequalities and checking which, if any, are satisfied

Given a non-negative integer n and an arbitrary set of inequalities that are user-defined (in say an external text file), I want to determine whether n satisfies any inequality, and if so, which one(s).
Here is a points list.
n = 0: 1
n < 5: 5
n = 5: 10
If you draw a number n that's equal to 5, you get 10 points.
If n less than 5, you get 5 points.
If n is 0, you get 1 point.
The stuff left of the colon is the "condition", while the stuff on the right is the "value".
All entries will be of the form:
n1 op n2: val
In this system, equality takes precedence over inequality, so the order that they appear in will not matter in the end. The inputs are non-negative integers, though intermediate values and results may be negative. The results may not even be numbers (e.g., they could be strings). I have designed it so that it will only accept the most basic inequalities, to make it easier to write a parser (and to see whether this idea is feasible).
My program has two components:
a parser that will read structured input and build a data structure to store the conditions and their associated results.
a function that will take an argument (a non-negative integer) and return the result (or, as in the example, the number of points I receive)
If the list was hardcoded, that is an easy task: just use a case-when or if-else block and I'm done. But the problem isn't as easy as that.
Recall the list at the top. It can contain an arbitrary number of (in)equalities. Perhaps there's only 3 like above. Maybe there are none, or maybe there are 10, 20, 50, or even 1000000. Essentially, you can have m inequalities, for m >= 0
Given a number n and a data structure containing an arbitrary number of conditions and results, I want to be able to determine whether it satisfies any of the conditions and return the associated value. So as with the example above, if I pass in 5, the function will return 10.
The condition/value pairs are not unique in their raw form. You may have multiple instances of the same (in)equality but with different values. eg:
n = 0: 10
n = 0: 1000
n > 0: n
Notice the last entry: if n is greater than 0, then it is just whatever you got.
If multiple inequalities are satisfied (eg: n > 5, n > 6, n > 7), all of them should be returned. If that is not possible to do efficiently, I can return just the first one that satisfied it and ignore the rest. But I would like to be able to retrieve the entire list.
I've been thinking about this for a while and I'm thinking I should use two hash tables: the first one will store the equalities, while the second will store the inequalities.
Equality is easy enough to handle: Just grab the condition as a key and have a list of values. Then I can quickly check whether n is in the hash and grab the appropriate value.
However, for inequality, I am not sure how it will work. Does anyone have any ideas how I can solve this problem in as little computational steps as possible? It's clear that I can easily accomplish this in O(n) time: just run it through each (in)equality one by one. But what happens if this checking is done in real-time? (eg: updated constantly)
For example, it is pretty clear that if I have 100 inequalities and 99 of them check for values > 100 while the other one checks for value <= 100, I shouldn't have to bother checking those 99 inequalities when I pass in 47.
You may use any data structure to store the data. The parser itself is not included in the calculation because that will be pre-processed and only needs to be done once, but it may be problematic if it takes too long to parse the data.
Since I am using Ruby, I likely have more flexible options when it comes to "messing around" with the data and how it will be interpreted.
class RuleSet
  Rule = Struct.new(:op1,:op,:op2,:result) do
    def <=>(r2)
      # Op of "=" sorts before others
      [op=="=" ? 0 : 1, op2.to_i] <=> [r2.op=="=" ? 0 : 1, r2.op2.to_i]
    end
    def matches(n)
      @op2i ||= op2.to_i
      case op
        when "=" then n == @op2i
        when "<" then n < @op2i
        when ">" then n > @op2i
      end
    end
  end
  def initialize(text)
    @rules = text.each_line.map do |line|
      Rule.new *line.split(/[\s:]+/)
    end.sort
  end
  def value_for( n )
    if rule = @rules.find{ |r| r.matches(n) }
      rule.result=="n" ? n : rule.result.to_i
    end
  end
end
set = RuleSet.new( DATA.read )
-1.upto(8) do |n|
puts "%2i => %s" % [ n, set.value_for(n).inspect ]
end
#=> -1 => 5
#=> 0 => 1
#=> 1 => 5
#=> 2 => 5
#=> 3 => 5
#=> 4 => 5
#=> 5 => 10
#=> 6 => nil
#=> 7 => 7
#=> 8 => nil
__END__
n = 0: 1
n < 5: 5
n = 5: 10
n = 7: n
I would parse the input lines and separate them into predicate/result pairs and build a hash of callable procedures (using eval - oh noes!). The "check" function can iterate through each predicate and return the associated result when one is true:
class PointChecker
  def initialize(input)
    @predicates = Hash[input.split(/\r?\n/).map do |line|
      parts = line.split(/\s*:\s*/)
      [Proc.new {|n| eval(parts[0].sub(/=/,'=='))}, parts[1].to_i]
    end]
  end
  def check(n)
    # Collect the result of every predicate that holds for n
    @predicates.map { |p,r| p.call(n) ? r : nil }.compact
  end
end
Here is sample usage:
p = PointChecker.new <<__HERE__
n = 0: 1
n = 1: 2
n < 5: 5
n = 5: 10
__HERE__
p.check(0) # => [1, 5]
p.check(1) # => [2, 5]
p.check(2) # => [5]
p.check(5) # => [10]
p.check(6) # => []
Of course, there are many issues with this implementation. I'm just offering a proof-of-concept. Depending on the scope of your application you might want to build a proper parser and runtime (instead of using eval), handle input more generally/gracefully, etc.
I'm not spending a lot of time on your problem, but here's my quick thought:
Since the points list is always in the format n1 op n2: val, I'd just model the points as an array of hashes.
So first step is to parse the input point list into the data structure, an array of hashes.
Each hash would have values n1, op, n2, value
Then, for each data input you run through all of the hashes (all of the points) and handle each (determining if it matches to the input data or not).
Some tricks of the trade
Spend time in your parser handling bad input. Eg
n < = 1000 # no colon
n < : 1000 # missing n2
x < 2 : 10 # n1, n2 and val are either number or "n"
n # too short, missing :, n2, val
n < 1 : 10x # val is not a number and is not "n"
etc
Also politely handle non-numeric input data
Added
Re: n1 doesn't matter. Be careful, this could be a trick. Why wouldn't
5 < n : 30
be a valid points list item?
Re: multiple arrays of hashes, one array per operator, one hash per point list item -- sure that's fine. Since each op is handled in a specific way, handling the operators one by one is fine. But....ordering then becomes an issue:
Since you want multiple results returned from multiple matching point list items, you need to maintain the overall order of them. Thus I think one array of all the point lists would be the easiest way to do this.
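If the linear scan ever becomes a bottleneck, the two-table idea from the question can be pushed further: keep equalities in a hash, keep the thresholds of each inequality operator in a sorted list, and use binary search to find where the matching rules start. The Python below is only a sketch under assumptions: RuleIndex is a hypothetical name, rules are assumed to be already parsed into (op, threshold, value) tuples of the form n op threshold, and only =, < and > are supported. Results are grouped by operator (equalities first), so the original ordering across operators is not preserved.

```python
import bisect
from collections import defaultdict

class RuleIndex:
    """Look up every rule of the form 'n op threshold: value' satisfied by n."""

    def __init__(self, rules):
        self.equal = defaultdict(list)
        less, greater = [], []
        for op, threshold, value in rules:
            if op == '=':
                self.equal[threshold].append(value)
            elif op == '<':
                less.append((threshold, value))
            elif op == '>':
                greater.append((threshold, value))
            else:
                raise ValueError('unsupported operator: %r' % (op,))
        less.sort(key=lambda pair: pair[0])
        greater.sort(key=lambda pair: pair[0])
        self.less_keys = [t for t, _ in less]
        self.less_values = [v for _, v in less]
        self.greater_keys = [t for t, _ in greater]
        self.greater_values = [v for _, v in greater]

    def matches(self, n):
        found = list(self.equal.get(n, []))        # equality takes precedence
        # n < t holds exactly for the thresholds t that are greater than n.
        start = bisect.bisect_right(self.less_keys, n)
        found.extend(self.less_values[start:])
        # n > t holds exactly for the thresholds t that are smaller than n.
        stop = bisect.bisect_left(self.greater_keys, n)
        found.extend(self.greater_values[:stop])
        return found

rules = RuleIndex([('=', 0, 1), ('<', 5, 5), ('=', 5, 10)])
print(rules.matches(0), rules.matches(3), rules.matches(5), rules.matches(6))
# [1, 5] [5] [10] []
```

Finding the boundary costs O(log m) for m rules; the remaining cost is proportional to the number of rules that actually match, which cannot be avoided if all of them must be returned.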

How do I modify the Damerau-Levenshtein algorithm, such that it also includes the start index, and the end index of the larger substring?

Here is my code:
#http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
# used for fuzzy matching of two strings
# for indexing, seq2 must be the parent string
def dameraulevenshtein(seq1, seq2)
oneago = nil
min = 100000000000 #index
max = 0 #index
thisrow = (1..seq2.size).to_a + [0]
seq1.size.times do |x|
twoago, oneago, thisrow = oneago, thisrow, [0] * seq2.size + [x + 1]
seq2.size.times do |y|
delcost = oneago[y] + 1
addcost = thisrow[y - 1] + 1
subcost = oneago[y - 1] + ((seq1[x] != seq2[y]) ? 1 : 0)
thisrow[y] = [delcost, addcost, subcost].min
if (x > 0 and y > 0 and seq1[x] == seq2[y-1] and seq1[x-1] == seq2[y] and seq1[x] != seq2[y])
thisrow[y] = [thisrow[y], twoago[y-2] + 1].min
end
end
end
return thisrow[seq2.size - 1], min, max
end
There has to be some way to get the starting and ending index of the substring, seq1, within the parent string, seq2, right?
I'm not entirely sure how this algorithm works, even after reading the wiki article on it. I mean, I understand the highest-level explanation, as it finds the insertion, deletion, and transposition differences (the lines in the second loop), but beyond that I'm a bit lost.
Here is an example of something that I want to be able to do with this (^):
substring = "hello there"
search_string = "uh,\n\thello\n\t there"
the indexes should be:
start: 5
end: 18 (last char of string)
Ideally, the search_string will never be modified. But, I guess I could take out all the white space characters (since there are only.. 3? \n \r and \t) store the indexes of each white space character, get the indexes of my substring, and then re-add in the white space characters, making sure to compensate the substring's indexes as I offset them with the white space characters that were originally in there in the first place. -- but if this could all be done in the same method, that would be amazing, as the algorithm is already O(n^2).. =(
At some point, I'd like to only allow white space characters to split up the substring (s1).. but one thing at a time
I don't think this algorithm is the right choice for what you want to do. The algorithm is simply calculating the distance between two strings in terms of the number of modifications you need to make to turn one string into another. If we rename your function to dlmatch for brevity and only return the distance, then we have:
dlmatch("hello there", "uh, \n\thello\n\t there")
=> 7
meaning that you can convert one string into the other in 7 steps (effectively by removing seven characters from the second). The problem is that 7 steps is a pretty big difference:
dlmatch("hello there", "panda here")
=> 6
This would actually imply that "hello there" and "panda here" are closer matches than the first example.
If what you are trying to do is "find a substring that mostly matches", I think you are stuck with an O(n^3) algorithm as you feed the first string to a series of substrings of the second string, and then selecting the substring that provides you the closest match.
Alternatively, you may be better off trying to do pre-processing on the search string and then doing regexp matching with the substring. For example, you could strip off all special characters and then build a regexp that looks for words in the substring that are case insensitive and can have any amount of whitespace between them.
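The regexp route could look roughly like this in Python (the question uses Ruby, so treat this as a language-neutral sketch). The name find_loose is hypothetical, and the sketch assumes that only whitespace may separate the words and that the words themselves must match exactly, ignoring case. It leaves the search string untouched and reports the indices directly.

```python
import re

def find_loose(needle, haystack):
    """Return (start, end) of needle in haystack, allowing any whitespace
    between the words of needle; end is exclusive. Return None if absent."""
    words = needle.split()
    if not words:
        return None
    pattern = r'\s+'.join(re.escape(word) for word in words)
    match = re.search(pattern, haystack, re.IGNORECASE)
    return match.span() if match else None

print(find_loose('hello there', 'uh,\n\thello\n\t there'))   # (5, 18)
```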

Algorithm Issue: letter combinations

I'm trying to write a piece of code that will do the following:
Take the numbers 0 to 9 and assign one or more letters to this number. For example:
0 = N,
1 = L,
2 = T,
3 = D,
4 = R,
5 = V or F,
6 = B or P,
7 = Z,
8 = H or CH or J,
9 = G
When I have a code like 0123, it's an easy job to encode it. It will obviously make up the code NLTD. When a number like 5,6 or 8 is introduced, things get different. A number like 051 would result in more than one possibility:
NVL and NFL
It should be obvious that this gets even "worse" with longer numbers that include several digits like 5,6 or 8.
Being pretty bad at mathematics, I have not yet been able to come up with a decent solution that will allow me to feed the program a bunch of numbers and have it spit out all the possible letter combinations. So I'd love some help with it, 'cause I can't seem to figure it out. Dug up some information about permutations and combinations, but no luck.
Thanks for any suggestions/clues. The language I need to write the code in is PHP, but any general hints would be highly appreciated.
Update:
Some more background: (and thanks a lot for the quick responses!)
The idea behind my question is to build a script that will help people to easily convert numbers they want to remember to words that are far more easily remembered. This is sometimes referred to as "pseudo-numerology".
I want the script to give me all the possible combinations that are then held against a database of stripped words. These stripped words just come from a dictionary and have all the letters I mentioned in my question stripped out of them. That way, the number to be encoded can usually easily be related to a one or more database records. And when that happens, you end up with a list of words that you can use to remember the number you wanted to remember.
It can be done easily recursively.
The idea is that to handle the whole code of size n, you must handle first the n - 1 digits.
Once you have all the answers for n-1 digits, the answers for the whole code are deduced by appending to each of them the possible char(s) for the last digit.
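The same idea can also be written as a loop that extends the answers one digit at a time. The Python below is a minimal sketch (the question asks for PHP, but the structure carries over); DIGIT_TO_LETTERS and encodings are hypothetical names, and the mapping is copied from the question.

```python
DIGIT_TO_LETTERS = {
    '0': ['N'], '1': ['L'], '2': ['T'], '3': ['D'], '4': ['R'],
    '5': ['V', 'F'], '6': ['B', 'P'], '7': ['Z'],
    '8': ['H', 'CH', 'J'], '9': ['G'],
}

def encodings(number):
    """Return every letter string that the digit string can be encoded as."""
    results = ['']                       # the only encoding of zero digits
    for digit in number:
        # Extend each answer for the digits so far with each option for this digit.
        results = [prefix + letters
                   for prefix in results
                   for letters in DIGIT_TO_LETTERS[digit]]
    return results

print(encodings('051'))   # ['NVL', 'NFL']
```

The number of results is the product of the number of options per digit, so a long number with many 5s, 6s and 8s grows quickly.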
There's actually a much better solution than enumerating all the possible translations of a number and looking them up: Simply do the reverse computation on every word in your dictionary, and store the string of digits in another field. So if your mapping is:
0 = N,
1 = L,
2 = T,
3 = D,
4 = R,
5 = V or F,
6 = B or P,
7 = Z,
8 = H or CH or J,
9 = G
your reverse mapping is:
N = 0,
L = 1,
T = 2,
D = 3,
R = 4,
V = 5,
F = 5,
B = 6,
P = 6,
Z = 7,
H = 8,
J = 8,
G = 9
Note there's no mapping for 'ch', because the 'c' will be dropped, and the 'h' will be converted to 8 anyway.
Then, all you have to do is iterate through each letter in the dictionary word, output the appropriate digit if there's a match, and do nothing if there isn't.
Store all the generated digit strings as another field in the database. When you want to look something up, just perform a simple query for the number entered, instead of having to do tens (or hundreds, or thousands) of lookups of potential words.
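The reverse computation itself is only a few lines. The following Python sketch uses hypothetical names (LETTER_TO_DIGIT, word_to_digits) and assumes the dictionary words contain plain ASCII letters; the result is what would be stored in the extra database field.

```python
LETTER_TO_DIGIT = {
    'N': '0', 'L': '1', 'T': '2', 'D': '3', 'R': '4',
    'V': '5', 'F': '5', 'B': '6', 'P': '6', 'Z': '7',
    'H': '8', 'J': '8', 'G': '9',
}

def word_to_digits(word):
    """Translate a dictionary word into its digit string, skipping unmapped letters."""
    return ''.join(LETTER_TO_DIGIT[ch] for ch in word.upper() if ch in LETTER_TO_DIGIT)

print(word_to_digits('null'))    # 011
print(word_to_digits('chart'))   # 842
```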
The general structure you want to hold your number -> letter assignments is an array of arrays, similar to:
// 0 = N, 1 = L, 2 = T, 3 = D, 4 = R, 5 = V or F, 6 = B or P, 7 = Z,
// 8 = H or CH or J, 9 = G
$numberMap = new Array (
0 => new Array("N"),
1 => new Array("L"),
2 => new Array("T"),
3 => new Array("D"),
4 => new Array("R"),
5 => new Array("V", "F"),
6 => new Array("B", "P"),
7 => new Array("Z"),
8 => new Array("H", "CH", "J"),
9 => new Array("G"),
);
Then, a bit of recursive logic gives us a function similar to:
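function GetEncoding($number) {
    // The map is defined outside the function, so pull it into scope.
    global $numberMap;
    $ret = array();
    for ($i = 0; $i < strlen($number); $i++) {
        // We're just translating here, nothing special.
        // $var + 0 is a cheap way of forcing a variable to be numeric
        $ret[] = $numberMap[$number[$i]+0];
    }
    // Return the list of letter options, one entry per digit.
    return $ret;
}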
function PrintEncoding($enc, $string = "") {
    // If we're at the end of the line, then print!
    if (count($enc) === 0) {
        print $string."\n";
        return;
    }
    // Otherwise, soldier on through the possible values.
    // Grab the next 'letter' and cycle through the possibilities for it.
    foreach ($enc[0] as $letter) {
        // And call this function again with it!
        PrintEncoding(array_slice($enc, 1), $string.$letter);
    }
}
Three cheers for recursion! This would be used via:
PrintEncoding(GetEncoding("052384"));
And if you really want it as an array, play with output buffering and explode using "\n" as your split string.
This kind of problem is usually solved with recursion. In Ruby, one (quick and dirty) solution would be
@values = Hash.new([])
@values["0"] = ["N"]
@values["1"] = ["L"]
@values["2"] = ["T"]
@values["3"] = ["D"]
@values["4"] = ["R"]
@values["5"] = ["V","F"]
@values["6"] = ["B","P"]
@values["7"] = ["Z"]
@values["8"] = ["H","CH","J"]
@values["9"] = ["G"]
def find_valid_combinations(buffer, number)
  first_char = number.shift
  @values[first_char].each do |key|
    if number.length == 0
      puts buffer + key
    else
      find_valid_combinations(buffer + key, number.dup)
    end
  end
end
find_valid_combinations("", ARGV[0].split(""))
And if you run this from the command line you will get:
$ ruby r.rb 051
NVL
NFL
This is related to brute-force search and backtracking.
Here is a recursive solution in Python.
#!/usr/bin/env python
import sys

ENCODING = {'0': ['N'],
            '1': ['L'],
            '2': ['T'],
            '3': ['D'],
            '4': ['R'],
            '5': ['V', 'F'],
            '6': ['B', 'P'],
            '7': ['Z'],
            '8': ['H', 'CH', 'J'],
            '9': ['G']
            }

def decode(str):
    if len(str) == 0:
        return ''
    elif len(str) == 1:
        return ENCODING[str]
    else:
        result = []
        # Prepend each option for the first digit to every decoding of the rest.
        for prefix in ENCODING[str[0]]:
            result.extend([prefix + suffix for suffix in decode(str[1:])])
        return result

if __name__ == '__main__':
    print(decode(sys.argv[1]))
Example output:
$ ./demo 1
['L']
$ ./demo 051
['NVL', 'NFL']
$ ./demo 0518
['NVLH', 'NVLCH', 'NVLJ', 'NFLH', 'NFLCH', 'NFLJ']
Could you do the following:
Create a results array.
Create an item in the array with value ""
Loop through the digits of the number, say 051, handling each one individually.
Each time a digit maps to exactly one letter, append that letter to every item in the results array.
So "" becomes N.
Each time a digit maps to several letters, add new rows to the results array for the additional options, and update the existing rows with the first option.
So N becomes NV and a new item, NF, is created.
The last digit again maps to a single letter, so the items in the results array become
NVL and NFL
To produce the output, loop through the results array and print the items, or do whatever else you need with them.
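The steps above could be sketched in Python as follows. The names `DIGIT_LETTERS` and `expand` are illustrative only, and the table is assumed to be the mapping from the question. Instead of updating rows in place, the sketch rebuilds the results list for each digit, which gives the same rows.

```python
# Assumed mapping, copied from the question's table.
DIGIT_LETTERS = {
    "0": ["N"], "1": ["L"], "2": ["T"], "3": ["D"], "4": ["R"],
    "5": ["V", "F"], "6": ["B", "P"], "7": ["Z"],
    "8": ["H", "CH", "J"], "9": ["G"],
}

def expand(number):
    """Build all encodings iteratively, one digit at a time."""
    results = [""]  # start with a single empty item
    for digit in number:
        # Every existing row is extended with every option for this digit;
        # a digit with k options multiplies the number of rows by k.
        results = [item + letter
                   for letter in DIGIT_LETTERS[digit]
                   for item in results]
    return results

for row in expand("051"):
    print(row)
```

No recursion is needed here, and the intermediate lists match the walk-through: first N, then NV and NF, then NVL and NFL.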
Let p(n) be the list of all possible letter combinations of a given number string s up to the nth digit.
Then, the following algorithm will generate p(n+1):
digit = s[n+1];
foreach(letter l that digit maps to)
{
foreach(entry e in p(n))
{
newEntry = append l to e;
add newEntry to p(n+1);
}
}
The first iteration is somewhat of a special case, since p(-1) is undefined. You can simply initialize p(0) as the list of all possible characters for the first digit.
So, your 051 example:
Iteration 0:
p(0) = {N}
Iteration 1:
digit = 5
foreach({V, F})
{
foreach(p(0) = {N})
{
newEntry = N + V or N + F
p(1) = {NV, NF}
}
}
Iteration 2:
digit = 1
foreach({L})
{
foreach(p(1) = {NV, NF})
{
newEntry = NV + L or NF + L
p(2) = {NVL, NFL}
}
}
The form you want is probably something like:
function combinations( $str ){
    // $codes maps each digit to its array of letter codes; it is defined elsewhere.
    global $codes;
    $l = strlen( $str );
    $results = array( );
    if ($l == 0) { return $results; }
    if ($l == 1)
    {
        foreach( $codes[ $str[0] ] as $code )
        {
            $results[] = $code;
        }
        return $results;
    }
    $cur = $str[0];
    // All combinations for the rest of the string, without its first digit.
    $combs = combinations( substr( $str, 1 ) );
    foreach ($codes[ $cur ] as $code)
    {
        foreach ($combs as $comb)
        {
            $results[] = $code.$comb;
        }
    }
    return $results;
}
This is ugly, pidgin PHP, so please verify it first. The basic idea is to generate every combination for the substring str[1..n] and then prepend each possible code for str[0] to all of those combinations. Bear in mind that in the worst case the running time is exponential in the length of your string, because that much ambiguity is actually present in your coding scheme.
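To make the growth concrete: the number of combinations is the product of the number of letter options for each digit. The sketch below only counts them. The names `DIGIT_LETTERS` and `count_combinations` are illustrative, and the table is assumed to be the mapping from the question.

```python
# Assumed mapping, copied from the question's table.
DIGIT_LETTERS = {
    "0": ["N"], "1": ["L"], "2": ["T"], "3": ["D"], "4": ["R"],
    "5": ["V", "F"], "6": ["B", "P"], "7": ["Z"],
    "8": ["H", "CH", "J"], "9": ["G"],
}

def count_combinations(number):
    """Count the encodings of `number` without generating them."""
    total = 1
    for digit in number:
        total *= len(DIGIT_LETTERS[digit])
    return total

print(count_combinations("0518"))  # 1 * 2 * 1 * 3 = 6
print(count_combinations("888"))   # 3 * 3 * 3 = 27
```

With this table no digit has more than three options, so a number of n digits has at most 3^n combinations, reached when every digit is 8.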
The trick is not only to generate all possible letter combinations that match a given number, but to select the letter sequence that is easiest to remember. One suggestion would be to run the Soundex algorithm on each sequence and try to match the result against an English-language dictionary such as WordNet to find the most 'real-word-sounding' sequences.
