Optimized way of finding similar strings - algorithm

Suppose I have a large list of words (about 4-5 thousand, and growing). Someone searches for a string, but it is not found in the word list. What would be the best and most optimized way to find words similar to the input string? The first thing that came to my mind was calculating the Levenshtein distance between each entry of the word list and the input string, but is that the optimal way to do it?
(Note that this is not a language-specific question.)

EDIT: new solution
Yes, calculating Levenshtein distances between your input and the word list is a reasonable approach, but it takes a lot of time. BK-trees improve on this, but they quickly become slow as the allowed Levenshtein distance grows. It seems we can speed up the Levenshtein distance calculations by using a trie, as described in this excellent blog post:
Fast and Easy Levenshtein distance using a Trie
It relies on the fact that the dynamic programming table for Levenshtein distance has common rows across different invocations, e.g. levenshtein(kate, cat) and levenshtein(kate, cats).
Running the Python program given on that page with the TWL06 dictionary gives:
> python dict_lev.py HACKING 1
Read 178691 words into 395185 nodes
('BACKING', 1)
('HACKING', 0)
('HACKLING', 1)
('HANKING', 1)
('HARKING', 1)
('HAWKING', 1)
('HOCKING', 1)
('JACKING', 1)
('LACKING', 1)
('PACKING', 1)
('SACKING', 1)
('SHACKING', 1)
('RACKING', 1)
('TACKING', 1)
('THACKING', 1)
('WHACKING', 1)
('YACKING', 1)
Search took 0.0189998 s
That's really fast, and it would be even faster in other languages. Most of the time is spent building the trie, but that is irrelevant: it needs to be done just once and then kept in memory.
The only minor downside is that a trie takes up a lot of memory (which can be reduced with a DAWG, at the cost of some speed).
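For reference, here is a condensed sketch of the search described in that post (the node layout is my own minimal version; the post itself builds the trie from the dictionary file):

class TrieNode:
    def __init__(self):
        self.word = None              # full word stored at terminal nodes
        self.children = {}            # letter -> TrieNode

    def insert(self, word):
        node = self
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.word = word

def search(root, word, max_cost):
    # one DP row per trie node; shared prefixes share all their rows
    first_row = list(range(len(word) + 1))
    results = []
    for letter, child in root.children.items():
        _search(child, letter, word, first_row, results, max_cost)
    return results

def _search(node, letter, word, prev_row, results, max_cost):
    row = [prev_row[0] + 1]
    for col in range(1, len(word) + 1):
        row.append(min(row[col - 1] + 1,                               # insertion
                       prev_row[col] + 1,                              # deletion
                       prev_row[col - 1] + (word[col - 1] != letter))) # substitution
    if row[-1] <= max_cost and node.word is not None:
        results.append((node.word, row[-1]))
    if min(row) <= max_cost:          # prune: no descendant can get back under the cap
        for next_letter, child in node.children.items():
            _search(child, next_letter, word, row, results, max_cost)

Build the trie once by calling insert for every dictionary word, then something like search(root, 'HACKING', 1) reproduces the run above.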
Another approach: Peter Norvig has a great article (with complete source code) on spelling correction.
http://norvig.com/spell-correct.html
The idea is to generate all possible edits of the word, and then choose the most likely spelling correction among them.
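The heart of it is a function that generates every string one edit away from the input; condensed from the article:

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

Candidates are then intersected with the known word list and ranked by word frequency; two-edit candidates are just edits1 applied to every result of edits1.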

I think something better than this exists, but BK-trees are at least a good optimization over brute force.
They use the property that Levenshtein distance forms a metric space: if your query is at Levenshtein distance d from an arbitrary string s in the dict, then all your results must be at a distance between d-n and d+n from s. Here n is the maximum Levenshtein distance from the query you want to allow.
It's explained in detail here: http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
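A minimal BK-tree sketch, assuming some levenshtein(a, b) function is available:

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                     # node = [word, {distance: child-node}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]            # descend along the edge for distance d
            else:
                node[1][d] = [word, {}]
                return

    def query(self, word, n):
        """All stored words within distance n of `word`."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= n:
                results.append((node[0], d))
            # triangle inequality: only subtrees keyed in [d-n, d+n] can contain hits
            for k, child in node[1].items():
                if d - n <= k <= d + n:
                    stack.append(child)
        return results

A query only descends into subtrees allowed by the triangle inequality, so for small n most of the dictionary is never touched.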

If you are interested in the code itself: I implemented an algorithm for finding the optimal alignment between two strings. It basically shows how to transform one string into the other with k operations (where k is the Levenshtein/edit distance of the strings). It could be simplified a bit for your needs (as you only need the distance itself). By the way, it works in O(mn) time, where m and n are the lengths of the strings. My implementation is based on: this and this.
# optimization: using ints instead of strings:
# 1  ~ "left",    "insertion"
# 7  ~ "up",      "deletion"
# 17 ~ "up-left", "match/mismatch"
def GetLevenshteinDistanceWithBacktracking(sequence1, sequence2):
    distances = [[0 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)]
    backtracking = [[1 for y in range(len(sequence2)+1)] for x in range(len(sequence1)+1)]
    for i in range(1, len(sequence1)+1):
        distances[i][0] = i
    for i in range(1, len(sequence2)+1):
        distances[0][i] = i
    for j in range(1, len(sequence2)+1):
        for i in range(1, len(sequence1)+1):
            if sequence1[i-1] == sequence2[j-1]:
                distances[i][j] = distances[i-1][j-1]
                backtracking[i][j] = 17
            else:
                deletion = distances[i-1][j] + 1
                substitution = distances[i-1][j-1] + 1
                insertion = distances[i][j-1] + 1
                distances[i][j] = min(deletion, substitution, insertion)
                if distances[i][j] == deletion:
                    backtracking[i][j] = 7
                elif distances[i][j] == insertion:
                    backtracking[i][j] = 1
                else:
                    backtracking[i][j] = 17
    return (distances[len(sequence1)][len(sequence2)], backtracking)

def Alignment(sequence1, sequence2):
    cost, backtracking = GetLevenshteinDistanceWithBacktracking(sequence1, sequence2)
    alignment1 = alignment2 = ""
    i = len(sequence1)
    j = len(sequence2)
    # walk the backtracking matrix from the bottom-right corner
    # to recover the optimal alignment
    while i > 0 or j > 0:
        if i > 0 and j > 0 and backtracking[i][j] == 17:
            # match/mismatch: consume a character from both sequences
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = sequence2[j-1] + alignment2
            i -= 1
            j -= 1
        elif i > 0 and backtracking[i][j] == 7:
            # deletion: consume from sequence1 only
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
        elif j > 0 and backtracking[i][j] == 1:
            # insertion: consume from sequence2 only
            alignment2 = sequence2[j-1] + alignment2
            alignment1 = "-" + alignment1
            j -= 1
        elif i > 0:
            # first column: only deletions remain
            alignment1 = sequence1[i-1] + alignment1
            alignment2 = "-" + alignment2
            i -= 1
    return (cost, (alignment1, alignment2))
It depends on the broader context and on how accurate you want to be. But here is what I would (probably) start with:
Only consider the subset of words that start with the same character as the query word. That would decrease the amount of work roughly 20-fold for a single query.
I would also categorize words according to their lengths, and allow a different maximal distance for each category. With 4 categories, e.g.:
max distance 0 if the length is between 0 and 2; 1 if between 3 and 5; 2 if between 6 and 8; 3 if 9+. Then, based on the query length, you only need to check the words from the matching category. Moreover, it should not be hard to make the algorithm stop as soon as the maximal distance has been exceeded (see the sketch below).
If more is needed, I would start thinking about some machine learning approach.
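A rough sketch of the first two suggestions combined (the names and bucket boundaries are just the ones proposed above; an illustration, not a tuned implementation):

from collections import defaultdict

def max_dist(length):
    if length <= 2: return 0
    if length <= 5: return 1
    if length <= 8: return 2
    return 3

def build_index(words):
    index = defaultdict(list)                  # (first char, length) -> words
    for w in words:
        index[(w[0], len(w))].append(w)
    return index

def bounded_levenshtein(a, b, limit):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:                   # no cell can get back under the cap
            return limit + 1
        prev = cur
    return prev[-1]

def similar(index, query):
    limit = max_dist(len(query))
    for length in range(len(query) - limit, len(query) + limit + 1):
        for w in index.get((query[0], length), ()):
            d = bounded_levenshtein(query, w, limit)
            if d <= limit:
                yield w, d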

Related

Product of consecutive numbers f(n) = n(n-1)(n-2)(n-3)(n- ...) find the value of n

Is there a way to find the consecutive natural numbers programmatically?
On the Internet I found some examples using either factorization or polynomial solving.
Example 1
For n(n−1)(n−2)(n−3) = 840
n = 7, -4, (3+i√111)/2, (3-i√111)/2
Example 2
For n(n−1)(n−2)(n−3) = 1680
n = 8, −5, (3+i√159)/2, (3-i√159)/2
Both of those examples give 4 results (because both are 4th-degree equations), but for my use case I'm only interested in the natural value. Also, the solution should work for any sequence size of consecutive numbers, in other words n(n−1)(n−2)(n−3)(n−4)...
The solution can be an algorithm or come from any open math library. The parameters passed to the algorithm will be the product and the degree (the sequence size); for those two examples the product is 840 or 1680 and the degree is 4 for both.
Thank you
If you're interested only in the natural solution n, this reasoning may help:
Let's say n(n-1)(n-2)(n-3)...(n-k) = A
The solution n = s then satisfies:
remainder of A/s = 0
remainder of A/(s-1) = 0
remainder of A/(s-2) = 0
and so on. (Divisibility alone is necessary but not sufficient, so in the end you should verify that the product of the candidate factors really equals A.)
Now, the left side is a product of k+1 factors whose geometric mean lies between s-k and s, so s is on the order of t = A^(1/(k+1)); more precisely, t <= s <= t+k. So we can start with v = t and finish at v = t+k+1. The solution will be between these two values.
So the algo may be, roughly:

s = 0
t = (int) (A^(1/(k+1)))        // truncation may undershoot by 1; the loop window covers that
theLoop:
for (v = t to t+k+1, step = +1)
{
    p = 1
    for (i = 0; i <= k; i = i+1)
    {
        if (A % (v - i) > 0)    // % operator to find the remainder
            continue at theLoop // v-i does not divide A, so v cannot be the solution
        p = p * (v - i)
    }
    if (p == A)                 // all k+1 factors divide A and their product matches
    {
        s = v
        break
    }
}
if (s == 0)
    no natural solution
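For what it's worth, a small Python sketch of that corrected search (a float root estimate is used, which is fine for moderate A; for very large A an exact integer root would be safer):

def find_n(A, k):
    """Smallest natural n with n*(n-1)*...*(n-k) == A, or None."""
    t = round(A ** (1.0 / (k + 1)))          # rough (k+1)-th root of A
    for v in range(max(t - 1, k + 1), t + k + 2):
        p = 1
        for f in range(v - k, v + 1):        # the k+1 candidate factors
            p *= f
        if p == A:
            return v
    return None

print(find_n(840, 3))    # 7
print(find_n(1680, 3))   # 8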
Assuming that:
n is an integer, and
n > 0, and
k < n
Then approximately:
n = FLOOR( product ** (1/(k+1)) + (k+1)/2 )
The only cases I have found where this isn't exactly right are when k is very close to n. You can of course check it by back-calculating the product and seeing whether it matches. If not, n is almost certainly only 1 or 2 off this estimate, so just keep adjusting n until the product matches. (I can write this up in pseudocode if you need it.)
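In code, a sketch of that estimate-and-adjust idea (prod recomputes the product naively, which is fine for the sizes discussed here):

def solve(product, k):
    def prod(n):                     # n*(n-1)*...*(n-k)
        result = 1
        for i in range(k + 1):
            result *= n - i
        return result
    # initial estimate from the formula above, then correct by stepping
    n = int(product ** (1.0 / (k + 1)) + (k + 1) / 2)
    while prod(n) < product:
        n += 1
    while prod(n) > product:
        n -= 1
    return n if prod(n) == product else None

print(solve(840, 3))    # 7
print(solve(1680, 3))   # 8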

Is this a correct recurrence relationship I've found for the coin change challenge?

I'm trying to solve the "coin change problem" and I think I've come up with a recursive solution, but I want to verify it.
As an example, let's suppose we have pennies, nickels and dimes and are trying to make change for 22 cents.
C = { penny = 1, nickel = 5, dime = 10 }
K = 22
Then the number of ways to make change is
f(C,K) = f({1,5,10},22)
=
(# of ways to make change with 0 dimes)
+ (# of ways to make change with 1 dime)
+ (# of ways to make change with 2 dimes)
= f(C\{dime},22-0*10) + f(C\{dime},22-1*10) + f(C\{dime},22-2*10)
= f({1,5},22) + f({1,5},12) + f({1,5},2)
and
f({1,5},22)
= f({1,5}\{nickle},22-0*5) + f({1,5}\{nickle},22-1*5) + f({1,5}\{nickle},22-2*5) + f({1,5}\{nickle},22-3*5) + f({1,5}\{nickle},22-4*5)
= f({1},22) + f({1},17) + f({1},12) + f({1},7) + f({1},2)
= 5
and so forth.
In other words, my algorithm is:
let f(C,K) be the number of ways to make change for K cents with coins C
with the following implementation:
if (C is empty or K == 0)
    return 0
sum = 0
m = C.PopLargest()
A = {0, 1, ..., K / m}
for (i in A)
    sum += f(C, K - i*m)
return sum
Is there any flaw in that?
It would be linear time, I think.
Rethink your base cases:
1. What if K < 0? Then no solution exists, i.e. the number of ways = 0.
2. When K = 0, there is exactly 1 way to make change: take zero coins of each type.
3. When the coin array is empty, the number of ways = 0.
The rest of the logic is correct. But your perception that the algorithm is linear is absolutely wrong.
Let's compute the complexity:
Popping the largest element is O(C.length). However, this step can be optimized away if you sort the whole array at the beginning.
Your for loop runs O(K/C.max) times in every call, and every iteration calls the function recursively.
So if you write the recurrence for it, it should be something like:
T(N) = O(N) + K*T(N-1)
And this is going to be exponential in terms of N (the size of the array).
If you are looking for an improvement, I would suggest using dynamic programming.
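For example, a minimal bottom-up version (counting combinations, matching f above):

def count_ways(coins, K):
    # ways[x] = number of ways to make x cents using the coin types seen so far
    ways = [1] + [0] * K            # one way to make 0 cents: take no coins
    for c in coins:
        for x in range(c, K + 1):
            ways[x] += ways[x - c]
    return ways[K]

print(count_ways([1, 5, 10], 22))   # 9 = f({1,5},22) + f({1,5},12) + f({1,5},2) = 5 + 3 + 1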

Homework: Implementing Karp-Rabin; for the hash values modulo q, explain why it is a bad idea to use a power of 2 for q?

I have a two-fold homework problem: implement Karp-Rabin and run it on a test file, and then the second part:
For the hash values modulo q, explain why it is a bad idea to use a power of 2 for q. Can you construct a terrible example, e.g. for q = 64 and n = 15?
This is my implementation of the algorithm:
def karp_rabin(text, pattern):
    # setup
    alphabet = 'ACGT'
    d = len(alphabet)
    n = len(pattern)
    d_n = d**n
    q = 2**32 - 1
    m = {char: i for i, char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))
...The second part of the question refers to this part of the code/algorithm:
pattern_hash = kr_hash(pattern)
for i in range(0, len(text) - n + 1):
    text_hash = update_hash() if i else kr_hash(text[i:n])
    # the modulo q is used to check whether the hashes are congruent
    if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
        positions.append(i)
I don't understand why it would be a bad idea to use a power of 2 for q. I've tried running the algorithm on the test file provided (the genome of E. coli) and there's no discernible difference.
I tried looking at the formula for how the hash is derived (I'm not good at math), trying to find common factors that would be really bad for powers of two, but found nothing. I feel like if q is a power of 2 it should cause a lot of clashes between the hashes, so you'd need to compare strings much more often, but I didn't find anything along those lines either.
I'd really appreciate help on this since I'm stumped. If someone wants to point out what I can do better in the first part (code efficiency, readability, correctness, etc.), I'd also be thrilled to hear your input on that.
There is a problem if q divides some power of d, because then only a few characters contribute to the hash. For example, in your code d = 4, so if you take q = 64, only the last three characters determine the hash (d**3 = 64).
I don't really see a problem if q is a power of 2 but gcd(d,q) = 1.
Your implementation looks a bit strange because instead of
if pattern_hash % q == text_hash % q and pattern == text[i:i+n]:
you could also use
if pattern_hash == text_hash and pattern == text[i:i+n]:
which would be better because you get fewer collisions.
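To see the power-of-d problem concretely: with d = 4 and q = 64 = d**3, every term of the hash except the last three vanishes modulo q, so strings that share their last three characters always collide. A quick check, using the same hash formula as the question:

d, q = 4, 64
m = {c: i for i, c in enumerate('ACGT')}

def h(s):
    n = len(s)
    return sum(d**(n-i-1) * m[s[i]] for i in range(n)) % q

print(h('AAAACGT'), h('TTTTCGT'), h('GCGACGT'))   # all 27: only the final 'CGT' matters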
The Thue–Morse sequence has, among its properties, that its polynomial hash quickly becomes zero when the hash modulus is a power of 2, for any polynomial base d. So if you try to search for a short Thue–Morse sequence inside a longer one, you will get a great many hash collisions.
For example, your code, slightly adapted:
def karp_rabin(text, pattern):
    # setup
    alphabet = '01'
    d = 15
    n = len(pattern)
    d_n = d**n
    q = 32
    m = {char: i for i, char in enumerate(alphabet)}
    positions = []

    def kr_hash(s):
        return sum(d**(n-i-1) * m[s[i]] for i in range(n))

    def update_hash():
        return d*text_hash + m[text[i+n-1]] - d_n * m[text[i-1]]

    pattern_hash = kr_hash(pattern)
    for i in range(0, len(text) - n + 1):
        text_hash = update_hash() if i else kr_hash(text[i:n])
        if pattern_hash % q == text_hash % q:  # and pattern == text[i:i+n]:
            positions.append(i)
    return ' '.join(map(str, positions))

print(karp_rabin('0110100110010110100101100110100110010110011010010110100110010110', '0110100110010110'))
outputs a lot of positions, although only three of them are proper matches.
Note that I have dropped the and pattern == text[i:i+n] check. Obviously, if you restore it the result will be correct, but it is equally obvious that the algorithm will do much more work checking this additional condition than it would for other values of q. In fact, with so many collisions the whole idea of the algorithm stops working: you could almost as effectively write a naive algorithm that checks every position for a match.
Also note that your implementation is quite strange. The whole idea of polynomial hashing is to apply the modulo every time you update the hash. Otherwise your pattern_hash and text_hash become very big numbers. In other languages this might cause arithmetic overflow, but in Python it invokes big-integer arithmetic, which is slow and again defeats the point of the algorithm.
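A sketch of the same function with the modulo folded into every step (here q is a large prime, my own choice; all intermediate values then stay small):

def karp_rabin_mod(text, pattern):
    alphabet = 'ACGT'
    d = len(alphabet)
    q = 2**31 - 1                    # a large prime, not a power of 2
    m = {c: i for i, c in enumerate(alphabet)}
    n = len(pattern)
    d_top = pow(d, n - 1, q)         # d**(n-1) mod q, used to drop the leading character

    def kr_hash(s):
        h = 0
        for c in s:
            h = (h * d + m[c]) % q   # reduce at every step: no big integers
        return h

    pattern_hash = kr_hash(pattern)
    text_hash = kr_hash(text[:n])
    positions = []
    for i in range(len(text) - n + 1):
        if i:
            # slide the window: drop text[i-1], append text[i+n-1]
            text_hash = (text_hash - d_top * m[text[i - 1]]) % q
            text_hash = (text_hash * d + m[text[i + n - 1]]) % q
        if pattern_hash == text_hash and pattern == text[i:i + n]:
            positions.append(i)
    return positions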

Approximate matching of two lists of events (with duration)

I have a black box algorithm that analyses a time series and "detects" certain events in the series. It returns a list of events, each containing a start time and end time. The events do not overlap.
I also have a list of the "true" events, again with start time and end time for each event, not overlapping.
I want to compare the two lists and match detected and true events that fall within a certain time tolerance (True Positives). The complication is that the algorithm may detect events that are not really there (False Positives) or might miss events that were there (False Negatives).
What is an algorithm that optimally pairs events from the two lists and leaves the remaining events unpaired? I am pretty sure I am not the first to tackle this problem and that such a method exists, but I haven't been able to find it, perhaps because I do not know the right terminology.
Speed requirement:
The lists will contain no more than a few hundred entries, and speed is not a major factor. Accuracy is more important. Anything taking less than a few seconds on an ordinary computer will be fine.
Here's a quadratic-time algorithm that gives a maximum-likelihood estimate with respect to the following model. Let A1 < ... < Am be the true intervals and B1 < ... < Bn the reported intervals. The quantity sub(i, j) is the log-likelihood that Ai becomes Bj, del(i) is the log-likelihood that Ai is deleted, and ins(j) is the log-likelihood that Bj is inserted. Make independence assumptions everywhere! I'm going to choose sub, del, and ins so that, for every i < i' and every j < j', we have
sub(i, j') + sub(i', j) <= max { sub(i, j) + sub(i', j'),
                                 del(i) + ins(j') + sub(i', j),
                                 sub(i, j') + del(i') + ins(j) }.
This ensures that the optimal matching between intervals is noncrossing and thus that we can use the following Levenshtein-like dynamic program.
The dynamic program is presented as a memoized recursive function, score(i, j), that computes the optimal score of matching A1, ..., Ai with B1, ..., Bj. The root of the call tree is score(m, n). It can be modified to return the sequence of sub(i, j) operations in the optimal solution.
score(i, j) | i == 0 && j == 0 = 0
            | i > 0  && j == 0 = del(i) + score(i - 1, 0)
            | i == 0 && j > 0  = ins(j) + score(0, j - 1)
            | i > 0  && j > 0  = max { sub(i, j) + score(i - 1, j - 1),
                                       del(i) + score(i - 1, j),
                                       ins(j) + score(i, j - 1) }
Here are some possible definitions for sub, del, and ins. I'm not sure if they will be any good; you may want to multiply their values by constants or use powers other than 2. If Ai = [s, t] and Bj = [u, v], then define
sub(i, j) = -(|u - s|^2 + |v - t|^2)
del(i) = -(t - s)^2
ins(j) = -(v - u)^2.
(Apologies to the undoubtedly extant academic who published something like this in the bioinformatics literature many decades ago.)
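A small Python version of this dynamic program, using the example sub/del/ins definitions above (intervals as (start, end) pairs, both lists sorted by time; it returns only the optimal score, but the usual backpointer bookkeeping recovers the actual pairing):

def match_score(A, B):
    m, n = len(A), len(B)

    def sub(i, j):                          # log-likelihood that A[i-1] became B[j-1]
        (s, t), (u, v) = A[i - 1], B[j - 1]
        return -((u - s) ** 2 + (v - t) ** 2)

    def dele(i):                            # log-likelihood that A[i-1] was missed
        s, t = A[i - 1]
        return -(t - s) ** 2

    def ins(j):                             # log-likelihood that B[j-1] is spurious
        u, v = B[j - 1]
        return -(v - u) ** 2

    score = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + dele(i)
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + ins(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(sub(i, j) + score[i - 1][j - 1],
                              dele(i) + score[i - 1][j],
                              ins(j) + score[i][j - 1])
    return score[m][n]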

Find the minimum number of operations required to compute a number using a specified range of numbers

Let me start with an example -
I have a range of numbers from 1 to 9, and let's say the target number I want is 29.
In this case the minimum number of operations required is 2: (9*3)+2 = 29. Similarly, for 18 the minimum number of operations is 1 (9*2 = 18).
I can use any of the 4 arithmetic operators - +, -, / and *.
How can I programmatically find out the minimum number of operations required?
Thanks in advance for any help provided.
Clarification: integers only, no decimals allowed mid-calculation, i.e. the following would not be valid (from comments below): ((9/2) + 1) * 4 == 22.
EDIT: I must admit I didn't think about this thoroughly; for my purpose it doesn't actually matter if decimal numbers appear mid-calculation, so ((9/2) + 1) * 4 == 22 is valid. Sorry for the confusion.
For the special case where set Y = [1..9] and n > 0:
n <= 9  : 0 operations
n <= 18 : 1 operation (+)
otherwise: remove any divisor found in Y. If this is not enough, recurse on the remainder for all offsets -9 .. +9. Offset 0 can be skipped, as it has already been tried.
Notice how division is not needed in this case. For other Y this does not hold.
This algorithm is exponential in log(n). The exact analysis is a job for somebody with more knowledge of algebra than I have.
For more speed, add pruning to eliminate some of the search for larger numbers.
Sample code:
def findop(n, maxlen=9999):
    # Return a short postfix list of numbers and operations
    # Simple solution for small numbers
    if n <= 9: return [n]
    if n <= 18: return [9, n-9, '+']
    # Find a direct multiply
    x = divlist(n)
    if len(x) > 1:
        mults = len(x) - 1
        x[-1:] = findop(x[-1], maxlen - 2*mults)
        x.extend(['*'] * mults)
        return x
    shortest = 0
    for o in list(range(1, 10)) + list(range(-1, -10, -1)):
        x = divlist(n - o)
        if len(x) == 1: continue
        mults = len(x) - 1
        # We spend len(divlist) + mults + 2 fields for the offset.
        # The last number is expanded by the recursion, so it doesn't count.
        recursion_maxlen = maxlen - len(x) - mults - 2 + 1
        if recursion_maxlen < 1: continue
        x[-1:] = findop(x[-1], recursion_maxlen)
        x.extend(['*'] * mults)
        if o > 0:
            x.extend([o, '+'])
        else:
            x.extend([-o, '-'])
        if shortest == 0 or len(x) < shortest:
            shortest = len(x)
            maxlen = shortest - 1
            solution = x[:]
    if shortest == 0:
        # Fake solution, it will be discarded
        return '#' * (maxlen + 1)
    return solution

def divlist(n):
    l = []
    for d in range(9, 1, -1):
        while n % d == 0:
            l.append(d)
            n //= d
    if n > 1: l.append(n)
    return l
The basic idea is to test all possibilities with k operations, for k starting from 0. Imagine you create a tree of height k that branches for every possible new operation and operand (4*9 branches per level). You need to traverse and evaluate the leaves of the tree for each k before moving on to the next k.
I didn't test this pseudo-code:
for every k from 0 to infinity:
    for every n from 1 to 9:
        if compute(n, 0, k):
            return k

boolean compute(n, j, k):
    if (j == k):
        return (n == target)
    else:
        for each operator in {+, -, *, /}:
            for every i from 1 to 9:
                if compute((n operator i), j+1, k):
                    return true
        return false
It doesn't take operator precedence and parentheses into account; that would require some rework.
Really cool question :)
Notice that you can start from the end! From your example, (9*3)+2 = 29 is equivalent to saying (29-2)/3 = 9. That way we can avoid the double loop in cyborg's answer. This suggests the following algorithm for set Y and result r:
nextleaves = {r}
nops = 0
while (true):
    nops = nops + 1
    leaves = nextleaves
    nextleaves = {}
    for leaf in leaves:
        for y in Y:
            if (leaf+y) or (leaf-y) or (leaf*y) or (leaf/y) is in Y:
                return nops
            else:
                add (leaf+y) and (leaf-y) and (leaf*y) and (leaf/y) to nextleaves
This is the basic idea; performance can certainly be improved, for instance by avoiding "backtracks" such as r+a-a or r*a*b/a.
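A runnable sketch of this backwards BFS (the frontier bound is my own crude addition to keep the search finite, and division is only followed when it is exact, i.e. the integer-only reading; drop that check if fractional intermediates are allowed):

def min_ops(r, Y=frozenset(range(1, 10))):
    """Minimum number of +,-,*,/ operations needed to reach r from a number in Y."""
    if r in Y:
        return 0
    leaves, seen, nops = {r}, {r}, 0
    while leaves:
        nops += 1
        nextleaves = set()
        for leaf in leaves:
            for y in Y:
                steps = [leaf + y, leaf - y, leaf * y]      # undo -, +, /
                if leaf % y == 0:
                    steps.append(leaf // y)                 # undo * (exact only)
                for s in steps:
                    if s in Y:
                        return nops
                    if s not in seen and abs(s) <= 10 * abs(r) + 100:  # crude pruning bound
                        seen.add(s)
                        nextleaves.add(s)
        leaves = nextleaves
    return None

print(min_ops(29))   # 2: (29-2)/3 = 9, i.e. 9*3+2
print(min_ops(18))   # 1: 9*2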
I guess my idea is similar to Peer Sommerlund's:
For big numbers, you advance fast by multiplying by large digits.
Is the target (29) prime? If not, divide it by its largest divisor between 2 and 9.
If it is, subtract a number to reach a divisible one. 27 is fine, since it is divisible by 9, so
(29-2)/9 = 3 =>
3*9+2 = 29
So maybe (I didn't think this through to the end): search for the nearest number below the target that is divisible by 9. If you haven't reached a single digit yet, repeat.
The formula is the steps reversed.
(I'll try it for some numbers. :) )
I tried it with 2551, which gives
echo $((((3*9+4)*9+4)*9+4))
but I didn't test every intermediate result for primality.
However,
echo $((8*8*8*5-9))
is 2 operations fewer. Maybe I can investigate this later.
