Count ways to choose K equal strings - algorithm

Given a string S consisting of N lowercase English letters, suppose we have a list L consisting of all non-empty substrings of the string S.
Now we need to answer Q queries: for the i-th query, count the number of ways to choose exactly K equal strings from the list L.
NOTE: each query has its own value of K.
To avoid overflow, the count is taken modulo 10^9+7.
Example: Let S = ababa and we have 2 queries. The value of K for each query is:
2
3
Then the answer for the first query is 7 and for the second query it's 1.
As List L = {"a", "b", "a", "b", "a", "ab", "ba", "ab", "ba", "aba", "bab", "aba", "abab", "baba", "ababa"}
For Query 1 : There are seven ways to choose two equal strings ("a", "a"), ("a", "a"), ("a", "a"), ("b", "b"), ("ab", "ab"), ("ba", "ba"), ("aba", "aba").
For Query 2 : There is one way to choose three equal strings - ("a", "a", "a").
Now the problem is that N <= 5000 and there can be 100000 queries, so a brute-force solution won't work. What would be a better way to do it?
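For reference, the quantity being asked for can be stated directly: if a distinct substring occurs c times in L, it contributes "c choose K" ways, so each query's answer is the sum of C(c, K) over all distinct substrings. A brute-force sketch (fine for checking the example, hopeless for N = 5000; names are mine):

```python
from collections import Counter
from math import comb

MOD = 10**9 + 7

def count_ways_brute(s, k):
    # Tally every substring occurrence, then sum C(count, k)
    # over the distinct substrings.
    counts = Counter(s[i:j]
                     for i in range(len(s))
                     for j in range(i + 1, len(s) + 1))
    return sum(comb(c, k) for c in counts.values()) % MOD

print(count_ways_brute("ababa", 2))  # 7
print(count_ways_brute("ababa", 3))  # 1
```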

One way to optimize the search would be to reduce the set of possible candidates for pairs.
The basic idea: for each substring of length n > 1 that matches the constraints (the substring appears at least K times in the input string), 2 substrings of length n - 1 that match the requirements must exist as well. Example: if ("abab", "abab") is a pair of substrings of the input string, ("aba", "aba") and ("bab", "bab") must also be pairs of substrings of the input string.
This can be used to eliminate candidates for pairs in the following way:
Start with an initial set of substrings of length 1, such that the set only contains substrings for which at least K - 1 equal substrings can be found. Extend each of these substrings by adding the next character. Now we can eliminate substrings for which not enough matches can be found. Repeat this until all substrings are eliminated.
Now, from theory to practice:
The basic data structure simply represents a substring by its start and end point (both inclusive) in the input string.
define substr:
int start , end
A helper method for getting the string represented by a substr:
define getstr:
input: string s , substr sub
return string(s , sub.start , sub.end)
First generate a lookup-table for all characters and their position in the string. The table will be needed later on.
define posMap:
input: string in
output: multimap
multimap pos
for int i in [0 , length(in)[
put(pos , in[i] , i)//store the position of character in[i] in the map
return pos
Another helper method generates a set of the indices of all characters that appear only once in the input string:
define listSingle:
input: multimap pos
output: set
set single
for char c in keys(pos)
if length(get(pos , c)) == 1
add(single , get(get(pos , c) , 0))
return single
A method that creates the initial set of matching pairs. These pairs consist of substrings of length 1. The pairs themselves aren't specified; the algorithm only maps the text of a substring to all of its occurrences. (NOTE: I'm using "pair" here, though the correct term would be a set of size K.)
define listSinglePairs:
input: multimap pos , int K
output: multimap
multimap result
for char key in keys(pos)
list ind = get(pos , key)
if length(ind) < K
continue
string k = toString(key)
for int i in ind
put(result , k , substr(i , i))
return result
Furthermore a method is required that lists all substrings representing the same string as a given substring:
define matches:
input: string in , substr sub , multimap charmap
output: list
list result
string txt = getstr(in , sub)
list candidates = get(charmap , txt[0])
for int i in [1 , length(txt)[
//increment all elements in candidates
for int c in [0 , size(candidates)[
replace(candidates , c , get(candidates , c) + 1)
list next = get(charmap , txt[i])
//since the indices of all candidates were incremented (index of the previous character in
//in) they now are equal to the indices of the next character in the substring, if it matches
candidates = intersection(candidates , next)
if isEmpty(candidates)
return EMPTY
//candidates now holds the indices of the end of all substrings that
//match the given substring -> convert to list of substr
for int i in candidates
add(result , substr(i - length(txt) + 1 , i))
return result
This is the main-routine that does the work:
define listMatches:
input: string in , int K
output: multimap
multimap chars = posMap(in)
set single = listSingle(chars)
multimap clvl = listSinglePairs(chars , K)
multimap result
while NOT isEmpty(clvl)
multimap nextlvl
for string sub in clvl
list pairs = get(clvl , sub)
list tmp
//extend all substrings by one character
//substrings that end in a character that only appears once in the
//input string can be ignored
for substr s in pairs
if s.end + 1 >= length(in) OR contains(single , s.end + 1)
continue
add(tmp , substr(s.start , s.end + 1))
//map all substrs to their respective string
while NOT isEmpty(tmp)
substr s = remove(tmp , 0)//take and remove the first element
string txt = getstr(in , s)
list match = matches(in , s , chars)
removeAll(tmp , match)//equal substrings only need to be processed once
//this substring doesn't have enough pairs
if size(match) < K
continue
//save all matches as solution and candidates for the next round
for substr m in match
put(result , txt , m)
put(nextlvl , txt , m)
//overwrite candidates for the next round with the given candidates
clvl = nextlvl
return result
NOTE: this algorithm generates a map from each substring, for which enough equal copies exist, to the positions of those copies.
I hope this is comprehensible (I'm horrible at explaining things).
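To make the level-by-level pruning concrete, here is a rough Python sketch of the same idea (identifiers are mine, not from the pseudocode above). For K >= 2 only substrings occurring at least twice can contribute, so singletons are never extended:

```python
from math import comb

def repeated_substring_counts(s):
    # text of substring -> list of start positions, extended level by level
    level = {}
    for i, ch in enumerate(s):
        level.setdefault(ch, []).append(i)
    counts = {}
    length = 1
    while level:
        nxt = {}
        for txt, starts in level.items():
            counts[txt] = len(starts)
            for i in starts:
                j = i + length
                if j < len(s):
                    nxt.setdefault(txt + s[j], []).append(i)
        # prune substrings that no longer occur at least twice
        level = {t: p for t, p in nxt.items() if len(p) >= 2}
        length += 1
    return counts

def answer(s, k, mod=10**9 + 7):
    # ways to choose k equal strings = sum over substrings of C(count, k)
    return sum(comb(c, k) for c in repeated_substring_counts(s).values()) % mod
```

For S = ababa this yields 7 for K = 2 and 1 for K = 3, matching the example above.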

Related

Finding all the shortest unique substring which are of same length?

Given a string sequence which contains only four letters, ['a','g','c','t'],
for example: agggcttttaaaatttaatttgggccc.
Find all the shortest unique substrings of the sequence, which are all of equal length (the minimum length over all unique substrings)?
For example : aaggcgccttt
answer: ['aa', 'ag', 'gg','cg', 'cc','ct']
explanation:shortest unique sub-string of length 2
I have tried using suffix arrays coupled with longest common prefixes, but I am unable to work out the solution completely.
I'm not sure what you mean by "minimum unique sub-string", but looking at your example I assume you mean "shortest runs of a single letter". If this is the case, you just need to iterate through the string once (character by character) and count all the shortest runs you find. You should keep track of the length of the minimum run found so far (infinity at start) and the length of the current run.
If you need to find the exact runs, you can add all the minimum runs you find to e.g. a list as you iterate through the string (and modify that list accordingly if a shorter run is found).
EDIT:
I thought more about the problem and came up with the following solution.
We find all the unique sub-strings of length i (in ascending order). So, first we consider all sub-strings of length 1, then all sub-strings of length 2, and so on. If we find any, we stop, since the sub-string length can only increase from this point.
You will have to use a list to keep track of the sub-strings you've seen so far, and a list to store the actual sub-strings. You will also have to maintain them accordingly as you find new sub-strings.
Here's the Java code I came up with, in case you need it:
String str = "aaggcgccttt";
String curr = "";
ArrayList<String> uniqueStrings = new ArrayList<String>();
ArrayList<String> alreadySeen = new ArrayList<String>();
for (int i = 1; i < str.length(); i++) {
    for (int j = 0; j < str.length() - i + 1; j++) {
        curr = str.substring(j, j + i);
        if (!alreadySeen.contains(curr)) { // Sub-string hasn't been seen yet
            uniqueStrings.add(curr);
            alreadySeen.add(curr);
        } else { // Repeated sub-string found
            uniqueStrings.remove(curr);
        }
    }
    if (!uniqueStrings.isEmpty()) // We have found non-repeating sub-string(s)
        break;
    alreadySeen.clear();
}
// Output
if (uniqueStrings.isEmpty())
    System.out.println(str);
else {
    for (String s : uniqueStrings)
        System.out.println(s);
}
The uniqueStrings list contains all the unique sub-strings of minimum length (used for output). The alreadySeen list keeps track of all the sub-strings that have already been seen (used to exclude repeating sub-strings).
I'll write some code in Python, because that's what I find the easiest.
I actually wrote both the overlapping and the non-overlapping variants. As a bonus, it also checks that the input is valid.
You seem to be interested only in the overlapping variant:
import itertools

def find_all(
        text,
        pattern,
        overlap=False):
    """
    Find all occurrences of the pattern in the text.

    Args:
        text (str|bytes|bytearray): The input text.
        pattern (str|bytes|bytearray): The pattern to find.
        overlap (bool): Detect overlapping patterns.

    Yields:
        position (int): The position of the next finding.
    """
    len_text = len(text)
    offset = 1 if overlap else (len(pattern) or 1)
    i = 0
    while i < len_text:
        i = text.find(pattern, i)
        if i >= 0:
            yield i
            i += offset
        else:
            break

def is_valid(text, tokens):
    """
    Check if the text only contains the specified tokens.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.

    Returns:
        result (bool): The result of the check.
    """
    return set(text).issubset(set(tokens))

def shortest_unique_substr(
        text,
        tokens='acgt',
        overlapping=True,
        check_valid_input=True):
    """
    Find the shortest unique substrings.

    Args:
        text (str|bytes|bytearray): The input text.
        tokens (str|bytes|bytearray): The valid tokens for the text.
        overlapping (bool): Detect overlapping patterns.
        check_valid_input (bool): Check if the input is valid.

    Returns:
        result (set): The set of the shortest unique substrings.
    """
    def add_if_single_match(
            text,
            pattern,
            result,
            overlapping):
        match_gen = find_all(text, pattern, overlapping)
        try:
            next(match_gen)  # first match
        except StopIteration:
            # the pattern is not found, nothing to do
            pass
        else:
            try:
                next(match_gen)
            except StopIteration:
                # the pattern was found only once, so add it to the results
                result.add(pattern)
            else:
                # the pattern is found twice, nothing to do
                pass

    # just some sanity check
    if check_valid_input and not is_valid(text, tokens):
        raise ValueError('Input text contains invalid tokens.')
    result = set()
    # the shortest unique substring cannot be longer than this
    if overlapping:
        max_lim = len(text) // 2 + 1
        for n in range(1, max_lim + 1):
            for pattern_gen in itertools.product(tokens, repeat=n):
                pattern = ''.join(pattern_gen)
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    else:
        max_lim = len(text)
        for n in range(1, max_lim + 1):
            for i in range(len(text) - n + 1):
                pattern = text[i:i + n]
                add_if_single_match(text, pattern, result, overlapping)
            if len(result) > 0:
                break
    return result
After some sanity check for the correctness of the outputs:
import functools

shortest_unique_substr_ovl = functools.partial(shortest_unique_substr, overlapping=True)
shortest_unique_substr_ovl.__name__ = 'shortest_unique_substr_ovl'
shortest_unique_substr_not = functools.partial(shortest_unique_substr, overlapping=False)
shortest_unique_substr_not.__name__ = 'shortest_unique_substr_not'
funcs = shortest_unique_substr_ovl, shortest_unique_substr_not

test_inputs = (
    'aaa',
    'aaaa',
    'aaggcgccttt',
    'agggcttttaaaatttaatttgggccc',
)

for func in funcs:
    print('Func:', func.__name__)
    for test_input in test_inputs:
        print(func(test_input))
    print()
Func: shortest_unique_substr_ovl
set()
set()
{'cg', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct'}
Func: shortest_unique_substr_not
{'aa'}
{'aaa'}
{'cg', 'tt', 'ag', 'gg', 'ct', 'aa', 'cc'}
{'tg', 'ag', 'ct', 'cc'}
After that, it is wise to benchmark how fast we actually are.
Below you can find some benchmarks, produced using some template code from here (the overlapping variant is in blue):
and the rest of the code for completeness:
import random

def gen_input(n, tokens='acgt'):
    return ''.join([tokens[random.randint(0, len(tokens) - 1)] for _ in range(n)])

def equal_output(a, b):
    return a == b

input_sizes = tuple(2 ** (1 + i) for i in range(16))

runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)

plot_benchmarks(runtimes, input_sizes, labels, units='ms')
plot_benchmarks(runtimes, input_sizes, labels, units='μs', zoom_fastest=2)
As far as the asymptotic time complexity is concerned, considering only the overlapping case: let N be the input size and K the number of tokens (4 in your case). find_all() is O(N), and for a candidate length n the body of shortest_unique_substr tries O(K^n) patterns. Since the loops stop at the first length that yields a unique substring, and that length stays small for a fixed alphabet and typical inputs, the overall behaviour is roughly linear in N, as the benchmarks seem to indicate.

Find the longest subsequence containing as many 1 as 0 time O(n)

I had a question in an interview and I couldn't find the optimal solution (and it's quite frustrating lol).
So you have an n-long list of 1s and 0s:
110000110101110..
The goal is to extract the longest subsequence containing as many 1s as 0s.
Here, for example, it is "110000110101" or "100001101011" or "0000110101110".
I have an idea for O(n^2), just scanning all possibilities from the beginning to the end, but apparently there is a way to do it in O(n).
Any ideas?
Thanks a lot!
Consider '10110':
Create a variable S. Create array A=[0].
Iterate from first number and add 1 to S if you notice 1 and subtract 1 from S if you notice 0 and append S to A.
For our example sequence A will be: [0, 1, 0, 1, 2, 1]. A is simply an array which stores a difference between number of 1s and 0s preceding the index. The sequence has to start and end at the place which has the same difference between 1s and 0s. So now our task is to find the longest distance between same numbers in A.
Now create 2 empty dictionaries (hash maps) First and Last.
Iterate through A and save position of first occurrence of every number in A in dictionary First.
Iterate through A (starting from the end) and save position of the last occurrence of each number in A in dictionary Last.
So for our example array First will be {0:0, 1:1, 2:4} and Last will be {0:2, 1:5, 2:4}
Now find the key (max_key) for which the difference between the corresponding values in First and Last is the largest. This max difference is the length of the subsequence; the subsequence starts at First[max_key] and ends at Last[max_key].
I know it is a bit hard to understand, but it has complexity O(n) - four loops, each of complexity N. You can of course replace the dictionaries with arrays, but that is more complicated than using dictionaries.
Solution in Python.
def find_subsequence(seq):
    S = 0
    A = [0]
    for e in seq:
        if e == '1':
            S += 1
        else:
            S -= 1
        A.append(S)
    First = {}
    Last = {}
    for pos, e in enumerate(A):
        if e not in First:
            First[e] = pos
    for pos, e in enumerate(reversed(A)):
        if e not in Last:
            Last[e] = len(seq) - pos
    max_difference = 0
    max_key = None
    for key in First:
        difference = Last[key] - First[key]
        if difference > max_difference:
            max_difference = difference
            max_key = key
    if max_key is None:
        return ''
    return seq[First[max_key]:Last[max_key]]
find_subsequence('10110') # Gives '0110'
find_subsequence('1') # Gives ''
J.F. Sebastian's code is more optimised.
EXTRA
This problem is related to Maximum subarray problem. Its solution is also based on summing elements from start:
def max_subarray(arr):
    max_diff = total = min_total = start = tstart = end = 0
    for pos, val in enumerate(arr, 1):
        total += val
        if min_total > total:
            min_total = total
            tstart = pos
        if total - min_total > max_diff:
            max_diff = total - min_total
            end = pos
            start = tstart
    return max_diff, arr[start:end]

count and iterate arrangements of a sequence

I have a string and a number of spaces e.g. the string is "accttgagattcagt" and I have 10 spaces to insert.
How can you iterate over all combinations of that string and spaces? The letters in the string cannot be reordered, and all spaces must be inserted.
And how can you calculate the number of rearrangements (without iterating them)?
And what is the proper word for this? Permutations, combinations or something else?
(I visualise this as strings of 1s and 0s where the 1s are used by the string and the 0s are spaces.
So a short string of 3 letters and 2 spaces would be asking for all 5-bit numbers with 3 1s and 2 0s, e.g. 11100, 11010, 11001, 10110, 10101, 10011, 01110, 01101, 01011, 00111?
But easy as short sequences are to make on paper, I am struggling to make a for-loop to do it :(. So nice pseudocode to create this sequence, and count how long it would be, please?
Recursion will be easier to understand but can it be faster if recursion is avoided somehow?)
That's combinations.
So a short string of 3 letters and 2 spaces would be asking for all 5-bit numbers with 3 1s and 2 0s, e.g. 11100, 11010, 11001, 10110, 10101, 10011, 01110, 01101, 01011, 00111?
You put the three '1's on three of the 5 indices and order does not matter. So it's "5 choose 3":
5!/((5-3)!3!) = 5*4/(2*1) = 10
The article on wikipedia.org has an image illustrating a random sequence of 3 red and 2 white squares.
This might be useful:
Statistics: combinations in Python
Hmm, a bit of pseudocode, but you should get the idea:
list doThat(string, spaces){
    returnList
    spacesTemp = spaces;
    for(c = 0; c < string.length; c++){
        subString = string.getSubString(c, string.length);
        tmpString = string.insertStringAtPosition(c, createSpaceString(spacesTemp));
        retSubStringList = doThat(subString, spaces - spacesTemp);
        retCombinedList = addStringInFrontOfAllStringsInList(tmpString, retSubStringList);
        returnList.addList(retCombinedList);
        spacesTemp--;
    }
    return returnList;
}
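Since the question asks whether recursion can be avoided: the 1s-and-0s picture maps directly onto choosing which positions hold letters, so a standard combinations generator does the iteration without recursion. A Python sketch (function names are mine):

```python
from itertools import combinations
from math import comb

def arrangements(s, spaces):
    # Choose which of the len(s) + spaces positions hold the letters
    # (the '1' bits); every other position gets a space (a '0' bit).
    n = len(s) + spaces
    for letter_pos in combinations(range(n), len(s)):
        pos = set(letter_pos)
        letters = iter(s)
        yield ''.join(next(letters) if i in pos else ' ' for i in range(n))

def count_arrangements(s, spaces):
    # "n + m choose n" arrangements, no iteration needed
    return comb(len(s) + spaces, len(s))

print(count_arrangements("abc", 2))  # 10
```

Because combinations() yields position tuples in lexicographic order, the output order matches the bit-string enumeration in the question.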
n - number of letters
m - number of spaces
Counting
Denote the number of spaces before the first letter as a_0, the number between the i-th and (i+1)-th letter as a_i, and the number after the last letter as a_n. Your question can now be stated as: in how many different ways can a_0, a_1, ..., a_n be chosen so that each number is not less than 0 and their sum satisfies a_0 + a_1 + ... + a_n = m? The answer to this question is n + m over n (by "over" I mean the Newton symbol, i.e. the binomial coefficient).
Why is it so? Visualize this problem as n + m empty bins in a row. If we fill exactly n of them with sand, the gaps between the filled bins give exactly one choice of a_0, ..., a_n.
Generation
def generate(s, num_spaces):
    ans = generate_aux("_" + s, num_spaces)
    return [x[1:] for x in ans]

def generate_aux(s, num_spaces):  # returns list of arrangements
    if num_spaces == 0:
        return [s]
    if s == "":
        return []
    val = []
    for i in range(0, num_spaces + 1):
        tmp = generate_aux(s[1:], num_spaces - i)
        pref = s[0] + (" " * i)
        val.extend([pref + x for x in tmp])
    return val

print(generate("abc", 2))

Mapping integers to strings in a given string space

Suppose I have an alphabet of 'abcd' and a maximum string length of 3. This gives me 85 possible strings, including the empty string. What I would like to do is map an integer in the range [0,85) to a string in my string space without using a lookup table. Something like this:
0 => ''
1 => 'a'
...
4 => 'd'
5 => 'aa'
6 => 'ab'
...
84 => 'ddd'
This is simple enough to do if the string is fixed length using this pseudocode algorithm:
str = ''
n = input integer
for i in 0..maxLen do
str += alphabet[n % alphabet.length]
n /= alphabet.length
done
I can't figure out a good, efficient way of doing it, though, when the length of the string could be anywhere in the range [0,3]. This is going to be running in a tight loop with random inputs, so I would like to avoid any unnecessary branching or lookups.
Shift your index by one and ignore the empty string temporarily. So you'd map 0 -> "a", ..., 83 -> "ddd".
Then the mapping is
n -> base-4-encode(n - number of shorter strings)
With 26 symbols, that's the Excel-column-numbering scheme.
With s symbols, there are s + s^2 + ... + s^l nonempty strings of length at most l. Leaving aside the trivial case s = 1, that sum is (a partial sum of a geometric series) s*(s^l - 1)/(s-1).
So, given n, find the largest l such that s*(s^l - 1)/(s-1) <= n, i.e.
l = floor(log((s-1)*n/s + 1) / log(s))
Then let m = n - s*(s^l - 1)/(s-1) and encode m as an l+1-symbol string in base s ('a' ~> 0, 'b' ~> 1, ...).
For the problem including the empty string, map 0 to the empty string and for n > 0 encode n-1 as above.
In Haskell
encode cs n = reverse $ encode' n where
  len = length cs
  encode' 0 = ""
  encode' n = (cs !! ((n-1) `mod` len)) : encode' ((n-1) `div` len)
Check:
*Main> map (encode "abcd") [0..84]
["","a","b","c","d","aa","ab","ac","ad","ba","bb","bc","bd","ca","cb","cc","cd","da","db","dc","dd","aaa","aab","aac","aad","aba","abb","abc","abd","aca","acb","acc","acd","ada","adb","adc","add","baa","bab","bac","bad","bba","bbb","bbc","bbd","bca","bcb","bcc","bcd","bda","bdb","bdc","bdd","caa","cab","cac","cad","cba","cbb","cbc","cbd","cca","ccb","ccc","ccd","cda","cdb","cdc","cdd","daa","dab","dac","dad","dba","dbb","dbc","dbd","dca","dcb","dcc","dcd","dda","ddb","ddc","ddd"]
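The same bijective-numeration scheme ports directly to Python; here is a sketch of both directions (the decoder is the inverse mapping, which the Haskell above doesn't include):

```python
def encode(n, alphabet='abcd'):
    # 0 -> '', 1 -> 'a', ..., 5 -> 'aa', ..., 84 -> 'ddd'
    s = len(alphabet)
    out = []
    while n > 0:
        n -= 1                       # shift so there is no zero digit
        out.append(alphabet[n % s])
        n //= s
    return ''.join(reversed(out))

def decode(t, alphabet='abcd'):
    # inverse of encode: string back to its integer index
    s = len(alphabet)
    n = 0
    for ch in t:
        n = n * s + alphabet.index(ch) + 1
    return n
```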
Figure out the number of strings for each length: N0, N1, N2 & N3 (actually, you won't need N3). Then, use those values to partition your space of integers: 0..N0-1 are length 0, N0..N0+N1-1 are length 1, etc. Within each partition, you can use your fixed-length algorithm.
At worst, you've greatly reduced the size of your lookup table.
Here is a C# solution:
static string F(int x, int alphabetSize)
{
    string ret = "";
    while (x > 0)
    {
        x--;
        ret = (char)('a' + (x % alphabetSize)) + ret;
        x /= alphabetSize;
    }
    return ret;
}
If you want to optimize this further, you may want to do something to avoid the string concatenations. For example, you could store the result into a preallocated char[] array.

Algorithms for "fuzzy matching" strings

By fuzzy matching I don't mean similar strings by Levenshtein distance or something similar, but the way it's used in TextMate/Ido/Icicles: given a list of strings, find those which include all characters in the search string, but possibly with other characters between, preferring the best fit.
I've finally understood what you were looking for. The issue is interesting however looking at the 2 algorithms you found it seems that people have widely different opinions about the specifications ;)
I think it would be useful to state the problem and the requirements more clearly.
Problem:
We are looking for a way to speed up typing by allowing users to only type a few letters of the keyword they actually intended and propose them a list from which to select.
It is expected that all the letters of the input be in the keyword
It is expected that the letters in the input be in the same order in the keyword
The list of keywords returned should be presented in a consistent (reproducible) order
The algorithm should be case insensitive
Analysis:
The first two requirements can be summed up as follows: for an input axg we are looking for words matching the regular expression [^a]*a[^x]*x[^g]*g.*
The third requirement is purposely loose. The order in which the words appear in the list needs to be consistent... however, it's difficult to guess whether a scoring approach would be better than alphabetical order. If the list is extremely long, then a scoring approach could be better; however, for a short list it's easier for the eye to look for a particular item down a list sorted in an obvious manner.
Also, the alphabetical order has the advantage of consistency during typing: ie adding a letter does not completely reorder the list (painful for the eye and brain), it merely filters out the items that do not match any longer.
There is no precision about handling Unicode characters, for example whether à is similar to a or another character altogether. Since I know of no language that currently uses such characters in its keywords, I'll let it slip for now.
My solution:
For any input, I would build the regular expression expressed earlier. It is suitable for Python because the language already features case-insensitive matching.
I would then match my (alphabetically sorted) list of keywords, and output it so filtered.
In pseudo-code:
import re

WORDS = ['Bar', 'Foo', 'FooBar', 'Other']

def GetList(input, words=WORDS):
    expr = ''.join('[^' + i + ']*' + i for i in input)
    return [w for w in words if re.match(expr, w, re.IGNORECASE)]
I could have used a one-liner but thought it would obscure the code ;)
This solution works very well for incremental situations (ie, when you match as the user type and thus keep rebuilding) because when the user adds a character you can simply refilter the result you just computed. Thus:
Either there are few characters, and the matching is quick because the length of the list does not matter much,
or there are lots of characters, which means we are filtering a short list, so it does not matter too much if the matching takes a bit longer element-wise.
I should also note that this regular expression does not involve back-tracking and is thus quite efficient. It could also be modeled as a simple state machine.
Levenshtein 'edit distance' algorithms will definitely work on what you're trying to do: they will give you a measurement of how closely two words or addresses or phone numbers, psalms, monologues and scholarly articles match each other, allowing you to rank the results and choose the best match.
A more lightweight approach is to count up the common substrings: it's not as good as Levenshtein, but it provides usable results and runs quickly in slow languages which have access to fast 'InString' functions.
I published an Excel 'Fuzzy Lookup' in Excellerando a few years ago, using a 'FuzzyMatchScore' function that is, as far as I can tell, exactly what you need:
http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html
It is, of course, in Visual Basic for Applications. Proceed with caution, crucifixes and garlic:
Public Function SumOfCommonStrings( _
ByVal s1 As String, _
ByVal s2 As String, _
Optional Compare As VBA.VbCompareMethod = vbTextCompare, _
Optional iScore As Integer = 0 _
) As Integer
Application.Volatile False
' N.Heffernan 06 June 2006
' THIS CODE IS IN THE PUBLIC DOMAIN
' Function to measure how much of String 1 is made up of substrings found in String 2
' This function uses a modified Longest Common String algorithm.
' Simple LCS algorithms are unduly sensitive to single-letter
' deletions/changes near the midpoint of the test words, eg:
' Wednesday is obviously closer to WedXesday on an edit-distance
' basis than it is to WednesXXX. So it would be better to score
' the 'Wed' as well as the 'esday' and add up the total matched
' Watch out for strings of differing lengths:
'
' SumOfCommonStrings("Wednesday", "WednesXXXday")
'
' This scores the same as:
'
' SumOfCommonStrings("Wednesday", "Wednesday")
'
' So make sure the calling function uses the length of the longest
' string when calculating the degree of similarity from this score.
' This is coded for clarity, not for performance.
Dim arr() As Integer ' Scoring matrix
Dim n As Integer ' length of s1
Dim m As Integer ' length of s2
Dim i As Integer ' start position in s1
Dim j As Integer ' start position in s2
Dim subs1 As String ' a substring of s1
Dim len1 As Integer ' length of subs1
Dim sBefore1 ' documented in the code
Dim sBefore2
Dim sAfter1
Dim sAfter2
Dim s3 As String
SumOfCommonStrings = iScore
n = Len(s1)
m = Len(s2)
If s1 = s2 Then
SumOfCommonStrings = n
Exit Function
End If
If n = 0 Or m = 0 Then
Exit Function
End If
's1 should always be the shorter of the two strings:
If n > m Then
s3 = s2
s2 = s1
s1 = s3
n = Len(s1)
m = Len(s2)
End If
n = Len(s1)
m = Len(s2)
' Special case: s1 is an exact substring of s2
If InStr(1, s2, s1, Compare) Then
SumOfCommonStrings = n
Exit Function
End If
For len1 = n To 1 Step -1
For i = 1 To n - len1 + 1
subs1 = Mid(s1, i, len1)
j = 0
j = InStr(1, s2, subs1, Compare)
If j > 0 Then
' We've found a matching substring...
iScore = iScore + len1
' Now clip out this substring from s1 and s2...
' And search the fragments before and after this excision:
If i > 1 And j > 1 Then
sBefore1 = left(s1, i - 1)
sBefore2 = left(s2, j - 1)
iScore = SumOfCommonStrings(sBefore1, _
sBefore2, _
Compare, _
iScore)
End If
If i + len1 < n And j + len1 < m Then
sAfter1 = right(s1, n + 1 - i - len1)
sAfter2 = right(s2, m + 1 - j - len1)
iScore = SumOfCommonStrings(sAfter1, _
sAfter2, _
Compare, _
iScore)
End If
SumOfCommonStrings = iScore
Exit Function
End If
Next
Next
End Function
Private Function Minimum(ByVal a As Integer, _
ByVal b As Integer, _
ByVal c As Integer) As Integer
Dim min As Integer
min = a
If b < min Then
min = b
End If
If c < min Then
min = c
End If
Minimum = min
End Function
Two algorithms I've found so far:
LiquidMetal
Better Ido Flex-Matching
I'm actually building something similar to Vim's Command-T and ctrlp plugins for Emacs, just for fun. I just had a productive discussion with some clever workmates about ways to do this most efficiently. The goal is to reduce the number of operations needed to eliminate files that don't match. So we create a nested map, where at the top level each key is a character that appears somewhere in the search set, mapping to the indices of all the strings in the search set. Each of those indices then maps to a list of character offsets at which that particular character appears in the search string.
In pseudo code, for the strings:
controller
model
view
We'd build a map like this:
{
"c" => {
0 => [0]
},
"o" => {
0 => [1, 5],
1 => [1]
},
"n" => {
0 => [2]
},
"t" => {
0 => [3]
},
"r" => {
0 => [4, 9]
},
"l" => {
0 => [6, 7],
1 => [4]
},
"e" => {
0 => [8],
1 => [3],
2 => [2]
},
"m" => {
1 => [0]
},
"d" => {
1 => [2]
},
"v" => {
2 => [0]
},
"i" => {
2 => [1]
},
"w" => {
2 => [3]
}
}
So now you have a mapping like this:
{
character-1 => {
word-index-1 => [occurrence-1, occurrence-2, occurrence-n, ...],
word-index-n => [ ... ],
...
},
character-n => {
...
},
...
}
Now searching for the string "oe":
Initialize a new map where the keys will be the indices of strings that match, and the values the offset read through that string so far.
Consume the first char from the search string "o" and look it up in the lookup table.
Since the strings at indices 0 and 1 match the "o", put them into the map: {0 => 1, 1 => 1}.
Now consume the next char in the input string, "e", and look it up in the table.
Here 3 strings match, but we know that we only care about strings 0 and 1.
Check whether there are any offsets greater than the current offsets. If not, eliminate the item from our map; otherwise update the offset: {0 => 8, 1 => 3}.
Now by looking at the keys in our map that we've accumulated, we know which strings matched the fuzzy search.
Ideally, if the search is being performed as the user types, you'll keep track of the accumulated hash of results and pass it back into your search function. I think this will be a lot faster than iterating all search strings and performing a full wildcard search on each one.
The interesting thing about this is that you could also efficiently store the Levenshtein distance along with each match, assuming you only care about insertions, not substitutions or deletions. Though perhaps it's not hard to get that logic added too.
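A minimal Python sketch of the nested-map search described above (identifiers are mine; it returns the indices of the matching words rather than ranked scores):

```python
from bisect import bisect_right

def build_index(words):
    # char -> {word index -> sorted offsets of that char in the word}
    index = {}
    for w, word in enumerate(words):
        for off, ch in enumerate(word):
            index.setdefault(ch, {}).setdefault(w, []).append(off)
    return index

def fuzzy_match(query, words, index):
    # alive: word index -> offset of the last matched character
    alive = {w: -1 for w in range(len(words))}
    for ch in query:
        entry = index.get(ch, {})
        nxt = {}
        for w, prev in alive.items():
            offs = entry.get(w)
            if offs is None:
                continue  # this word lacks the character entirely
            i = bisect_right(offs, prev)
            if i < len(offs):
                nxt[w] = offs[i]  # earliest occurrence after prev
        alive = nxt
    return sorted(alive)
```

With words = ["controller", "model", "view"], searching "oe" keeps indices 0 and 1, as in the walkthrough above.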
I recently had to solve the same problem. My solution involves scoring strings with consecutively matched letters highly and excluding strings that don't contain the typed letters in order.
I've documented the algorithm in detail here: http://blog.kazade.co.uk/2014/10/a-fuzzy-filename-matching-algorithm.html
If your text is predominantly English, then you may try your hand at various Soundex algorithms:
1. Classic Soundex
2. Metaphone
These algorithms will let you choose words which sound like each other and will be a good way to find misspelled words.