Improving the Efficiency Of This Code With Tracking Variable? - algorithm

I have written the below code outline, basically to sum an array (a) where each element is multiplied by a value x^i:
y = a(0)
i = 0
{y = sum from i=0 to (n-1) a(i) * x^i AND 0 <= n <= a.length} //Invariant
while (i < (n-1))
{y = sum from i=0 to (n-1) a(i) * x^i AND 0 <= n <= a.length AND i < (n-1)}
y = y + a(i)*x^i
i = i + 1
end while
{y = sum from i=0 to (n-1) a(i) * x^i} //Postcondition
Note that I do not expect the code to compile - it's just a sensible outline of how the code should work. I need to improve the efficiency of the code by using a tracking variable, and thus, a linking invariant to bridge said variable with the rest of the code. This is where I am stuck. What would be useful to track in this case? I have thought about retaining sum values at each iteration, but I'm not sure if that does the trick. If I could figure out what to track, I'm pretty sure it would be trivial to link it to the space. Can anyone see how my algorithm might be improved via a tracking variable?

Your invariant logic has off-by-1 problems. Here is a corrected version that tracks partial power operations.
// Precondition: 1 <= n <= a.length
// Invariant:
{ 0 <= i < n AND xi = x^i AND y = sum(j = 0..i) . a(j) * x^j }
// Establish invariant at i = 0:
// xi = x^0 = 1 AND y = sum(j=0..0) . a(j) * x^j = a(0) * x^0 = a(0)
i = 0;
xi = 1;
y = a(0);
while (i < n - 1) {
i = i + 1; // Break the invariant
xi = xi * x; // Re-establish it
y = y + a(i) * xi
}
// Invariant was last established at i = n-1, so we have post condition:
{ y = sum(j = 0..n-1) . a(j) * x^j }
The more common and numerically stable way to calculate polynomials is with Horner's Rule
y = 0
for i = n-1 downto 0 do y = y * x + a(i)
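A minimal Python sketch of the two approaches above, assuming the coefficients are given lowest degree first in a list. The function names are illustrative only and do not come from the question.

```python
def poly_running_power(a, x):
    """Evaluate sum(a[i] * x**i) while tracking xi = x**i."""
    if not a:
        return 0
    # Invariant: xi == x**i and y == sum(a[j] * x**j for j in 0..i)
    i, xi, y = 0, 1, a[0]
    while i < len(a) - 1:
        i += 1          # break the invariant
        xi *= x         # re-establish xi == x**i
        y += a[i] * xi  # re-establish the partial sum
    return y


def poly_horner(a, x):
    """Evaluate the same polynomial with Horner's rule (highest degree first)."""
    y = 0
    for coeff in reversed(a):
        y = y * x + coeff
    return y


print(poly_running_power([1, 2, 3], 2))  # 1 + 2*2 + 3*4 = 17
print(poly_horner([1, 2, 3], 2))         # 17
```

Both versions use one multiplication by x per coefficient; the running-power version needs a second multiplication for a(i) * xi, which Horner's rule avoids.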

So it seems like you're trying to end up with this:
(a(0)*x^0) + (a(1)*x^1) + ... + (a(n-1)*x^(n-1))
Is that right?
The only way I can see to improve performance would be if the ^ operation is more costly than the * operation. In that case, you could keep track of the current power x^i as you go, multiplying it by x on each iteration.
In fact, in that case you could probably start at the end of the array and work your way backwards, multiplying by x each time, to produce:
(((...((a(n-1)*x+a(n-2))*x+...)+a(2))*x+a(1))*x)+a(0)
That would theoretically be slightly faster than recalculating x^i each time, but it's not going to be algorithmically faster. It probably wouldn't be an order of magnitude faster.

Related

Given two sequences, find the maximal overlap between ending of one and beginning of the other

I need to find an efficient (pseudo)code to solve the following problem:
Given two sequences of (not necessarily distinct) integers (a[1], a[2], ..., a[n]) and (b[1], b[2], ..., b[n]), find the maximum d such that a[n-d+1] == b[1], a[n-d+2] == b[2], ..., and a[n] == b[d].
This is not homework, I actually came up with this when trying to contract two tensors along as many dimensions as possible. I suspect an efficient algorithm exists (maybe O(n)?), but I cannot come up with something that is not O(n^2). The O(n^2) approach would be the obvious loop on d and then an inner loop on the items to check the required condition until hitting the maximum d. But I suspect something better than this is possible.
You can utilize the z algorithm, a linear time (O(n)) algorithm that:
Given a string S of length n, the Z Algorithm produces an array Z
where Z[i] is the length of the longest substring starting from S[i]
which is also a prefix of S
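You need to concatenate your arrays (b+a, where b has length m and a has length n) and run the algorithm on the resulting array until the first i >= m (that is, a position inside the a part) such that Z[i]+i == m+n. The overlap is then Z[i].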
For example, for a = [1, 2, 3, 6, 2, 3] & b = [2, 3, 6, 2, 1, 0], the concatenation would be [2, 3, 6, 2, 1, 0, 1, 2, 3, 6, 2, 3] which would yield Z[10] = 2 fulfilling Z[i] + i = 12 = m + n.
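The steps above can be sketched in Python as follows. This is a minimal sketch, not a tuned implementation: `z_array` and `max_overlap_z` are names chosen here, and the scan is restricted to positions where the overlap cannot exceed the length of either input (an assumption added to handle inputs of different lengths).

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # s[left:right] is the rightmost match of a prefix found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def max_overlap_z(a, b):
    """Largest d such that the last d items of a equal the first d items of b."""
    s = list(b) + list(a)
    z = z_array(s)
    start = len(s) - min(len(a), len(b))  # never report more than either length
    for i in range(start, len(s)):
        if i + z[i] == len(s):  # the match runs to the very end of a
            return len(s) - i
    return 0


print(max_overlap_z([1, 2, 3, 6, 2, 3], [2, 3, 6, 2, 1, 0]))  # 2
```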
For O(n) time/space complexity, the trick is to evaluate hashes for each subsequence. Consider the array b:
[b1 b2 b3 ... bn]
Using Horner's method, you can evaluate all the possible hashes for each subsequence. Pick a base value B (bigger than any value in both of your arrays):
from b1 to b1 = b1 * B^1
from b1 to b2 = b1 * B^1 + b2 * B^2
from b1 to b3 = b1 * B^1 + b2 * B^2 + b3 * B^3
...
from b1 to bn = b1 * B^1 + b2 * B^2 + b3 * B^3 + ... + bn * B^n
Note that you can evaluate each sequence in O(1) time, using the result of the previous sequence, hence all the job costs O(n).
Now you have an array Hb = [h(b1), h(b2), ... , h(bn)], where Hb[i] is the hash from b1 until bi.
Do the same thing for the array a, but with a little trick:
from an to an = (an * B^1)
from an-1 to an = (an-1 * B^1) + (an * B^2)
from an-2 to an = (an-2 * B^1) + (an-1 * B^2) + (an * B^3)
...
from a1 to an = (a1 * B^1) + (a2 * B^2) + (a3 * B^3) + ... + (an * B^n)
You must note that, when you step from one sequence to another, you multiply the whole previous sequence by B and add the new value multiplied by B. For example:
from an to an = (an * B^1)
for the next sequence, multiply the previous by B: (an * B^1) * B = (an * B^2)
now sum with the new value multiplied by B: (an-1 * B^1) + (an * B^2)
hence:
from an-1 to an = (an-1 * B^1) + (an * B^2)
Now you have an array Ha = [h(an), h(an-1), ... , h(a1)], where Ha[d] is the hash of the last d elements of a (from an-d+1 until an).
Now, you can compare Ha[d] == Hb[d] for all d values from n down to 1; the first d for which they match is your answer.
ATTENTION: this is a hash method. The values can be large, so you may have to use a fast exponentiation method and modular arithmetic, which may (rarely) give you collisions, making this method not totally safe. A good practice is to pick the base B as a really big prime number (at least bigger than the biggest value in your arrays). You should also be careful, as the numbers may overflow at each step, so you'll have to use (modulo K) in each operation (where K can be a prime bigger than B).
This means that two different sequences might have the same hash, but two equal sequences will always have the same hash.
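A minimal Python sketch of this hashing idea, assuming both inputs are lists of integers. The base and modulus below are arbitrary choices made for illustration (1,000,003 and the Mersenne prime 2^61 - 1), and the final slice comparison is an extra safeguard against collisions that the description above does not include.

```python
def max_overlap_hash(a, b, base=1_000_003, mod=(1 << 61) - 1):
    """Largest d such that the last d items of a equal the first d items of b."""
    limit = min(len(a), len(b))
    hash_b = [0] * (limit + 1)  # hash_b[d] = b1*B^1 + ... + bd*B^d  (mod K)
    hash_a = [0] * (limit + 1)  # hash_a[d] = same formula over the last d items of a
    power = 1
    for d in range(1, limit + 1):
        power = power * base % mod  # power == B^d
        hash_b[d] = (hash_b[d - 1] + b[d - 1] * power) % mod
        # Multiply the previous suffix hash by B and add the new front item times B.
        hash_a[d] = (hash_a[d - 1] + a[len(a) - d]) * base % mod
    for d in range(limit, 0, -1):
        # Equal hashes might be a collision, so confirm with a direct comparison.
        if hash_a[d] == hash_b[d] and a[len(a) - d:] == b[:d]:
            return d
    return 0


print(max_overlap_hash([1, 2, 3, 6, 2, 3], [2, 3, 6, 2, 1, 0]))  # 2
```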
This can indeed be done in linear time, O(n), and O(n) extra space. I will assume the input arrays are character strings, but this is not essential.
A naive method would -- after matching k characters that are equal -- find a character that does not match, and go back k-1 units in a, reset the index in b, and then start the matching process from there. This clearly represents a O(n²) worst case.
To avoid this backtracking process, we can observe that going back is not useful if we have not encountered the b[0] character while scanning the last k-1 characters. If we did find that character, then backtracking to that position would only be useful, if in that k sized substring we had a periodic repetition.
For instance, if we look at substring "abcabc" somewhere in a, and b is "abcabd", and we find that the final character of b does not match, we must consider that a successful match might start at the second "a" in the substring, and we should move our current index in b back accordingly before continuing the comparison.
The idea is then to do some preprocessing based on string b to log back-references in b that are useful to check when there is a mismatch. So for instance, if b is "acaacaacd", we could identify these 0-based backreferences (put below each character):
index: 0 1 2 3 4 5 6 7 8
b:     a c a a c a a c d
ref:   0 0 0 1 0 0 1 0 5
For example, if we have a equal to "acaacaaca" the first mismatch happens on the final character. The above information then tells the algorithm to go back in b to index 5, since "acaac" is common. And then with only changing the current index in b we can continue the matching at the current index of a. In this example the match of the final character then succeeds.
With this we can optimise the search and make sure that the index in a can always progress forwards.
Here is an implementation of that idea in JavaScript, using the most basic syntax of that language only:
function overlapCount(a, b) {
// Deal with cases where the strings differ in length
let startA = 0;
if (a.length > b.length) startA = a.length - b.length;
let endB = b.length;
if (a.length < b.length) endB = a.length;
// Create a back-reference for each index
// that should be followed in case of a mismatch.
// We only need B to make these references:
let map = Array(endB);
let k = 0; // Index that lags behind j
map[0] = 0;
for (let j = 1; j < endB; j++) {
if (b[j] == b[k]) {
map[j] = map[k]; // skip over the same character (optional optimisation)
} else {
map[j] = k;
}
while (k > 0 && b[j] != b[k]) k = map[k];
if (b[j] == b[k]) k++;
}
// Phase 2: use these references while iterating over A
k = 0;
for (let i = startA; i < a.length; i++) {
while (k > 0 && a[i] != b[k]) k = map[k];
if (a[i] == b[k]) k++;
}
return k;
}
console.log(overlapCount("ababaaaabaabab", "abaababaaz")); // 7
Although there are nested while loops, these do not have more iterations in total than n. This is because the value of k strictly decreases in the while body, and cannot become negative. This can only happen when k++ was executed that many times to give enough room for such decreases. So all in all, there cannot be more executions of the while body than there are k++ executions, and the latter is clearly O(n).
To complete, here is the I/O handling of an interactive snippet built around the same overlapCount function as above: you can input your own strings and see the result interactively:
// I/O handling
let [inputA, inputB] = document.querySelectorAll("input");
let output = document.querySelector("pre");
function refresh() {
let a = inputA.value;
let b = inputB.value;
let count = overlapCount(a, b);
let padding = a.length - count;
// Apply some HTML formatting to highlight the overlap:
if (count) {
a = a.slice(0, -count) + "<b>" + a.slice(-count) + "</b>";
b = "<b>" + b.slice(0, count) + "</b>" + b.slice(count);
}
output.innerHTML = count + " overlapping characters:\n" +
a + "\n" +
" ".repeat(padding) + b;
}
document.addEventListener("input", refresh);
refresh();
body { font-family: monospace }
b { background:yellow }
input { width: 90% }
a: <input value="acacaacaa"><br>
b: <input value="acaacaacd"><br>
<pre></pre>

Finding the continued fraction of 2^(1/3) to very high precision

Here I'll use the notation [a0; a1, a2, ...] = a0 + 1/(a1 + 1/(a2 + ...)) for a continued fraction.
It is possible to find the continued fraction of a number by computing it and then applying the definition, but that requires at least O(n) bits of memory to find a0, a1 ... an, and in practice it is much worse. Using double floating-point precision it is only possible to find a0, a1 ... a19.
An alternative is to use the fact that if a, b, c are rational numbers (not all zero) then there exist unique rationals x, y, z such that 1/(a + b*2^(1/3) + c*2^(2/3)) = x + y*2^(1/3) + z*2^(2/3), namely
x = (a^2 - 2bc)/D, y = (2c^2 - ab)/D, z = (b^2 - ac)/D, where D = a^3 + 2b^3 + 4c^3 - 6abc.
So if I represent x, y, and z exactly using the Boost rational library, I can obtain floor(x + y*2^(1/3) + z*2^(2/3)) accurately using only double precision for 2^(1/3) and 2^(2/3), because I only need it to be within 1/2 of the true value. Unfortunately the numerators and denominators of x, y, and z grow considerably fast, and if you use regular floats instead the errors pile up quickly.
This way I was able to compute a0, a1 ... a10000 in under an hour, but somehow Mathematica can do that in 2 seconds. Here's my code for reference:
#include <iostream>
#include <boost/multiprecision/cpp_int.hpp>
namespace mp = boost::multiprecision;
int main()
{
const double t_1 = 1.259921049894873164767210607278228350570251;
const double t_2 = 1.587401051968199474751705639272308260391493;
mp::cpp_rational p = 0;
mp::cpp_rational q = 1;
mp::cpp_rational r = 0;
for(unsigned int i = 1; i != 10001; ++i) {
double p_f = static_cast<double>(p);
double q_f = static_cast<double>(q);
double r_f = static_cast<double>(r);
uint64_t floor = p_f + t_1 * q_f + t_2 * r_f;
std::cout << floor << ", ";
p -= floor;
//std::cout << floor << " " << p << " " << q << " " << r << std::endl;
mp::cpp_rational den = (p * p * p + 2 * q * q * q +
4 * r * r * r - 6 * p * q * r);
mp::cpp_rational a = (p * p - 2 * q * r) / den;
mp::cpp_rational b = (2 * r * r - p * q) / den;
mp::cpp_rational c = (q * q - p * r) / den;
p = a;
q = b;
r = c;
}
return 0;
}
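The inverse formula that the C++ code relies on can be checked with exact rational arithmetic. The sketch below is a small Python check, not the asker's program: `multiply`, `inverse` and `first_quotients` are names introduced here, and floats are used only to take floors, as in the question, so it is meant for the first few quotients only.

```python
from fractions import Fraction
from math import floor


def multiply(u, v):
    """Multiply two numbers a + b*t + c*t^2 given as triples, using t^3 = 2."""
    a, b, c = u
    d, e, f = v
    return (a * d + 2 * (b * f + c * e),
            a * e + b * d + 2 * c * f,
            a * f + b * e + c * d)


def inverse(u):
    """Exact inverse of a + b*t + c*t^2 (not all zero) as a triple of Fractions."""
    a, b, c = (Fraction(v) for v in u)
    den = a**3 + 2 * b**3 + 4 * c**3 - 6 * a * b * c
    return ((a * a - 2 * b * c) / den,
            (2 * c * c - a * b) / den,
            (b * b - a * c) / den)


def first_quotients(count):
    """First partial quotients of t = 2^(1/3)."""
    t1 = 2 ** (1 / 3)
    t2 = t1 * t1
    u = (Fraction(0), Fraction(1), Fraction(0))  # the number t itself
    quotients = []
    for _ in range(count):
        q = floor(float(u[0]) + float(u[1]) * t1 + float(u[2]) * t2)
        quotients.append(q)
        u = inverse((u[0] - q, u[1], u[2]))  # 1 / (u - q), kept exact
    return quotients


u = (Fraction(-2), Fraction(1), Fraction(1))
print(multiply(u, inverse(u)) == (1, 0, 0))  # True
print(first_quotients(5))                    # [1, 3, 1, 5, 1]
```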
The Lagrange algorithm
The algorithm is described for example in Knuth's book The Art of Computer Programming, vol 2 (Ex 13 in section 4.5.3 Analysis of Euclid's Algorithm, p. 375 in 3rd edition).
Let f be a polynomial of integer coefficients whose only real root is an irrational number x0 > 1. Then the Lagrange algorithm calculates the consecutive quotients of the continued fraction of x0.
I implemented it in Python:
def cf(a, N=10):
    """
    a : list - coefficients of the polynomial,
        i.e. f(x) = a[0] + a[1]*x + ... + a[n]*x^n
    N : number of quotients to output
    """
    # Degree of the polynomial
    n = len(a) - 1
    # List of consecutive quotients
    ans = []
    def shift_poly():
        """
        Replaces polynomial f(x) with f(x+1) (shifts its graph to the left).
        """
        for k in range(n):
            for j in range(n - 1, k - 1, -1):
                a[j] += a[j+1]
    for _ in range(N):
        quotient = 1
        shift_poly()
        # While the root is >1 shift it left
        while sum(a) < 0:
            quotient += 1
            shift_poly()
        # Otherwise, we have the next quotient
        ans.append(quotient)
        # Replace polynomial f(x) with -x^n * f(1/x)
        a.reverse()
        a = [-x for x in a]
    return ans
It takes about 1s on my computer to run cf([-2, 0, 0, 1], 10000). (The coefficients correspond to the polynomial x^3 - 2 whose only real root is 2^(1/3).) The output agrees with the one from Wolfram Alpha.
Caveat
The coefficients of the polynomials evaluated inside the function quickly become quite large integers. So this approach needs some bigint implementation in other languages (Pure python3 deals with it, but for example numpy doesn't.)
You might have more luck computing 2^(1/3) to high accuracy and then trying to derive the continued fraction from that, using interval arithmetic to determine if the accuracy is sufficient.
Here's my stab at this in Python, using Halley iteration to compute 2^(1/3) in fixed point. The dead code is an attempt to compute fixed-point reciprocals more efficiently than Python via Newton iteration -- no dice.
Timing from my machine is about thirty seconds, spent mostly trying to extract the continued fraction from the fixed point representation.
prec = 40000
a = 1 << (3 * prec + 1)
two_a = a << 1
x = 5 << (prec - 2)
while True:
    x_cubed = x * x * x
    two_x_cubed = x_cubed << 1
    x_prime = x * (x_cubed + two_a) // (two_x_cubed + a)
    if -1 <= x_prime - x <= 1: break
    x = x_prime
cf = []
four_to_the_prec = 1 << (2 * prec)
for i in range(10000):
    q = x >> prec
    r = x - (q << prec)
    cf.append(q)
    if True:
        x = four_to_the_prec // r
    else:
        x = 1 << (2 * prec - r.bit_length())
        while True:
            delta_x = (x * ((four_to_the_prec - r * x) >> prec)) >> prec
            if not delta_x: break
            x += delta_x
print(cf)
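The interval check mentioned at the start of this answer could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the program above: it brackets 2^(1/3) between two rationals using an integer cube root, expands both bounds with Euclid's algorithm, and keeps only the quotients on which they agree (dropping the last agreeing one as a safety margin). All function names are made up for this example.

```python
def integer_cube_root(n):
    """Largest integer r with r**3 <= n, found by binary search (n >= 0)."""
    lo, hi = 0, 1
    while hi ** 3 <= n:
        hi *= 2
    # Invariant: lo**3 <= n < hi**3
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid
    return lo


def rational_cf(num, den):
    """Continued fraction of the rational num/den via Euclid's algorithm."""
    terms = []
    while den:
        q, r = divmod(num, den)
        terms.append(q)
        num, den = den, r
    return terms


def cube_root_two_cf(bits):
    """Quotients of 2^(1/3) that are certain given `bits` fractional bits."""
    scale = 1 << bits
    low = integer_cube_root(2 * scale ** 3)  # low/scale <= 2^(1/3) < (low+1)/scale
    lower = rational_cf(low, scale)
    upper = rational_cf(low + 1, scale)
    common = []
    for p, q in zip(lower, upper):
        if p != q:
            break
        common.append(p)
    return common[:-1]  # drop the last agreeing term as a safety margin


print(cube_root_two_cf(200)[:10])  # [1, 3, 1, 5, 1, 1, 4, 1, 1, 8]
```

If the returned list is shorter than the number of quotients wanted, the precision was not sufficient and `bits` has to be increased.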

Find the Maximum Element in any SubMatrix of Matrix

I am given a matrix of size N x M. For a submatrix of side length X which starts at position (a, b), I have to find the largest element present in that submatrix.
My Approach:
Do as the question says:
Simple 2 loops
for(i in range(a, a + x))
for(j in range(b, b + x)) max = max(max,A[i][j]) // N * M
A little Advance:
1. Make a segment tree for every i in range(0, N)
2. for i in range(a, a + x) query(b, b + x) // N * logM
Is there any better solution having O(log n) complexity only ?
A Sparse Table Algorithm Approach: <O(N x M x log(N) x log(M)), O(1)>.
Precomputation Time - O( N x M x log(N) x log(M))
Query Time - O(1)
For understanding this method you should have knowledge of finding RMQ using sparse Table Algorithm for one dimension.
We can use 2D Sparse Table Algorithm for finding Range Minimum Query.
What we do in One Dimension:-
We preprocess RMQ for sub arrays of length 2^k using dynamic programming. We keep an array M[0, N-1][0, logN] where M[i][j] is the index of the minimum value in the sub array of length 2^j starting at i.
For calculating M[i][j] we must search for the minimum value in the first and second half of the interval. The two halves have length 2^(j-1), and the second half starts at i + 2^(j-1), so the pseudo code for calculating this is:-
if (A[M[i][j-1]] < A[M[i + 2^(j-1)][j-1]])
M[i][j] = M[i][j-1]
else
M[i][j] = M[i + 2^(j-1)][j-1]
Here A is the actual array which stores the values. Once we have these values preprocessed, let's show how we can use them to calculate RMQ(i, j). The idea is to select two blocks that entirely cover the interval [i..j] and find the minimum between them. Let k = floor(log2(j - i + 1)). For computing RMQ(i, j) we can use the following formula:-
if (A[M[i][k]] <= A[M[j - 2^k + 1][k]])
RMQ(i, j) = A[M[i][k]]
else
RMQ(i , j) = A[M[j - 2^k + 1][k]]
For 2 Dimension :-
Similarly We can extend above rule for 2 Dimension also , here we preprocess RMQ for sub matrix of length 2^K, 2^L using dynamic programming & keep an array M[0,N-1][0, M-1][0, logN][0, logM]. Where M[x][y][k][l] is the index of the minimum value in the sub matrix starting at [x , y] and having length 2^K, 2^L respectively.
pseudo code for calculation M[x][y][k][l] is:-
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j-1], M[x + (2^(i-1))][y][i-1][j-1], M[x][y+(2^(j-1))][i-1][j-1], M[x + (2^(i-1))][y+(2^(j-1))][i-1][j-1])
Here GetMinimum function will return the index of minimum element from provided elements. Now we have preprocessed, let's see how to calculate RMQ(x, y, x1, y1). Here [x, y] starting point of sub matrix and [x1, y1] represent end point of sub matrix means bottom right point of sub matrix. Here we have to select four sub matrices blocks that entirely cover [x, y, x1, y1] and find minimum of them. Let k = [log(x1 - x + 1)] & l = [log(y1 - y + 1)]. For computing RMQ(x, y, x1, y1) we can use following formula:-
RMQ(x, y, x1, y1) = GetMinimum(M[x][y][k][l], M[x1 - (2^k) + 1][y][k][l], M[x][y1 - (2^l) + 1][k][l], M[x1 - (2^k) + 1][y1 - (2^l) + 1][k][l]);
pseudo code for above logic:-
// remember Array 'M' store index of actual matrix 'P' so for comparing values in GetMinimum function compare the values of array 'P' not of array 'M'
SparseMatrix(n , m){ // n , m is dimension of matrix.
for i = 0 to 2^i <= n:
for j = 0 to 2^j <= m:
for x = 0 to x + 2^i -1 < n :
for y = 0 to y + (2^j) -1 < m:
if i == 0 and j == 0:
M[x][y][i][j] = Pair(x , y) // store x, y
else if i == 0:
M[x][y][i][j] = GetMinimum(M[x][y][i][j-1], M[x][y+(2^(j-1))][i][j-1])
else if j == 0:
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j], M[x+ (2^(i-1))][y][i-1][j])
else
M[x][y][i][j] = GetMinimum(M[x][y][i-1][j-1], M[x + (2^(i-1))][y][i-1][j-1], M[x][y+(2^(j-1))][i-1][j-1], M[x + (2^(i-1))][y+(2^(j-1))][i-1][j-1]);
}
RMQ(x, y, x1, y1){
k = floor(log2(x1 - x + 1))
l = floor(log2(y1 - y + 1))
ans = GetMinimum(M[x][y][k][l], M[x1 - (2^k) + 1][y][k][l], M[x][y1 - (2^l) + 1][k][l], M[x1 - (2^k) + 1][y1 - (2^l) + 1][k][l]);
return P[ans->x][ans->y] // ans->x represent Row number stored in ans and similarly ans->y represent column stored in ans
}
Here is the sample code in C++ for the pseudo code given by @Chapta, as was requested by some user.
#include <algorithm>
#include <cmath>
using namespace std;
int M[1000][1000][10][10]; // M[x][y][i][j]: max of the 2^i x 2^j block starting at (x, y); about 400 MB
int **matrix;
int n, m; // matrix dimensions, to be set before calling precompute_max()
void precompute_max(){
    for (int i = 0 ; (1<<i) <= n; i += 1){
        for(int j = 0 ; (1<<j) <= m ; j += 1){
            for (int x = 0 ; x + (1<<i) -1 < n; x+= 1){
                for (int y = 0 ; y + (1<<j) -1 < m; y+= 1){
                    if (i == 0 and j == 0)
                        M[x][y][i][j] = matrix[x][y]; // store the value itself
                    else if (i == 0)
                        M[x][y][i][j] = max(M[x][y][i][j-1], M[x][y+(1<<(j-1))][i][j-1]);
                    else if (j == 0)
                        M[x][y][i][j] = max(M[x][y][i-1][j], M[x+ (1<<(i-1))][y][i-1][j]);
                    else // std::max needs an initializer list for more than two values
                        M[x][y][i][j] = max({M[x][y][i-1][j-1], M[x + (1<<(i-1))][y][i-1][j-1], M[x][y+(1<<(j-1))][i-1][j-1], M[x + (1<<(i-1))][y+(1<<(j-1))][i-1][j-1]});
                    // cout << "from i="<<x<<" j="<<y<<" of length="<<(1<<i)<<" and length="<<(1<<j) <<"max is: " << M[x][y][i][j] << endl;
                }
            }
        }
    }
}
int compute_max(int x, int y, int x1, int y1){
    int k = log2(x1 - x + 1);
    int l = log2(y1 - y + 1);
    // cout << "Value of k="<<k<<" l="<<l<<endl;
    int ans = max({M[x][y][k][l], M[x1 - (1<<k) + 1][y][k][l], M[x][y1 - (1<<l) + 1][k][l], M[x1 - (1<<k) + 1][y1 - (1<<l) + 1][k][l]});
    return ans;
}
This code first precomputes the two-dimensional sparse table and then queries it in constant time.
Additional info: the sparse table stores the maximum element and not the indices to the maximum element.
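For comparison, here is a compact Python sketch of the same 2D sparse table for maxima. It is an illustration rather than a port of the C++ above: the table is indexed as table[i][j][x][y], each block is built by splitting along one dimension only (which gives the same result as combining four quarter blocks), and the function names are chosen here.

```python
def build_sparse_table(matrix):
    """table[i][j][x][y] = max of the 2^i x 2^j block whose top-left cell is (x, y)."""
    n, m = len(matrix), len(matrix[0])
    levels_n, levels_m = n.bit_length(), m.bit_length()
    table = [[None] * levels_m for _ in range(levels_n)]
    for i in range(levels_n):
        for j in range(levels_m):
            rows, cols = n - (1 << i) + 1, m - (1 << j) + 1
            block = [[0] * cols for _ in range(rows)]
            for x in range(rows):
                for y in range(cols):
                    if i == 0 and j == 0:
                        block[x][y] = matrix[x][y]
                    elif i == 0:
                        half = 1 << (j - 1)
                        prev = table[0][j - 1]  # left and right halves
                        block[x][y] = max(prev[x][y], prev[x][y + half])
                    else:
                        half = 1 << (i - 1)
                        prev = table[i - 1][j]  # top and bottom halves
                        block[x][y] = max(prev[x][y], prev[x + half][y])
            table[i][j] = block
    return table


def query_max(table, x, y, x1, y1):
    """Maximum over rows x..x1 and columns y..y1 (inclusive) in O(1)."""
    k = (x1 - x + 1).bit_length() - 1  # floor(log2(height))
    l = (y1 - y + 1).bit_length() - 1  # floor(log2(width))
    t = table[k][l]
    bottom, right = x1 - (1 << k) + 1, y1 - (1 << l) + 1
    return max(t[x][y], t[bottom][y], t[x][right], t[bottom][right])


grid = [[1, 5, 3], [7, 2, 9], [4, 8, 6]]
table = build_sparse_table(grid)
print(query_max(table, 0, 0, 1, 1))  # 7
```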
AFAIK, there can be no O(log n) approach, as the matrix follows no order. However, if every row is sorted in ascending order from left to right and every column is sorted in ascending order from top to bottom, then you know that A[a+x-1][b+x-1] (the bottom-right cell of the submatrix) is the largest element in that submatrix. Thus, finding the maximum takes O(1) time once the matrix is sorted. However, sorting the matrix, if it is not already sorted, will cost O(N x M x log(N x M)).

Avoiding Brute Force: Counting Solutions

In a programming contest, a problem was:
Count all solutions to the equation: x + 4y + 4z = n. You will be
given n and you will determine the count of solutions. Assume x, y and z are positive integers.
I have considered using triple for loops (brute force), but it was inefficient, causing TIME LIMIT EXCEEDED (since n may be as large as 1,000,000):
int sol = 0;
for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= n / 4; j++)
{
for (int k = 1; k <= n / 4; k++)
{
if (i + 4 * j + 4 * k == n)
sol++;
}
}
}
My friend could solve the problem. When I asked him, he said that he didn't use brute force at all. Instead, he converted the equation to a 'series' (i.e. a summation). I asked him to tell me how, but he refused :)
Can I know how?
This is a particular case of the coin change problem, which is solved in general by dynamic programming.
But here we can work out a simple solution. I assume x, y, z > 0.
x + 4*(y+z) = n
Let y + z = q = p + 1 (q > 1, p > 0)
x + 4*q = n
x + 4*p = n - 4
There are M = Floor((n-5)/4) variants for x and p, hence there are M possible values of
q = 2..M+1
For every q > 1 there are (q-1) variants of y and z: q = 1 + (q-1) = 2 + (q-2) = ... = (q-1) + 1
So we have N = 1 + 2 + 3 + ... + M = M * (M + 1)/2 solutions
Example:
n = 15;
M = (15 - 5) div 4 = 2
N = 3
(3,1,2),(3,2,1),(7,1,1)
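The closed form above can be sketched in Python and compared against a direct enumeration. The function names are illustrative, and the brute-force counter is included only as a check.

```python
def count_solutions(n):
    """Number of positive-integer triples (x, y, z) with x + 4y + 4z == n."""
    m = (n - 5) // 4  # number of admissible values of q = y + z
    return m * (m + 1) // 2 if m > 0 else 0


def count_brute_force(n):
    """Reference count that enumerates y and z directly (for checking only)."""
    return sum(1
               for y in range(1, n // 4 + 1)
               for z in range(1, n // 4 + 1)
               if n - 4 * y - 4 * z >= 1)


print(count_solutions(15))  # 3, matching the example above
print(all(count_solutions(n) == count_brute_force(n) for n in range(100)))  # True
```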
First note that n-x must be divisible by 4. Start by finding the smallest value that x can take (x must be at least 1):
start = 1
while ((n - start) % 4 != 0)
{
start = start + 1
}
From now on, you know that x will take values from [start, start+4, start+8 ...]. Now you can count the number of solutions by a simple counting loop:
count = 0
for (x = start; x < n - 4; x = x + 4)
{
y_z_sum = (n - x) / 4
count = count + y_z_sum - 1
}
For each choice of x, we can compute the value of y+z. For each value for y+z, there are y+z-1 possible choices (since y ranges from 1 to y+z-1, assuming that y and z are both positive integers).
Instead of a brute force solution with O(n^3) running time, you can achieve O(n) this way.
This is a classic linear algebra problem. Please refer to any linear algebra textbook on how to solve a system of linear equations. One such method is called Gaussian Elimination.

Most efficient way to calculate Levenshtein distance

I just implemented a best match file search algorithm to find the closest match to a string in a dictionary. After profiling my code, I found out that the overwhelming majority of time is spent calculating the distance between the query and the possible results. I am currently implementing the algorithm to calculate the Levenshtein Distance using a 2-D array, which makes the implementation an O(n^2) operation. I was hoping someone could suggest a faster way of doing the same.
Here's my implementation:
public int calculate(String root, String query)
{
int arr[][] = new int[root.length() + 2][query.length() + 2];
for (int i = 2; i < root.length() + 2; i++)
{
arr[i][0] = (int) root.charAt(i - 2);
arr[i][1] = (i - 1);
}
for (int i = 2; i < query.length() + 2; i++)
{
arr[0][i] = (int) query.charAt(i - 2);
arr[1][i] = (i - 1);
}
for (int i = 2; i < root.length() + 2; i++)
{
for (int j = 2; j < query.length() + 2; j++)
{
int diff = 0;
if (arr[0][j] != arr[i][0])
{
diff = 1;
}
arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
}
}
return arr[root.length() + 1][query.length() + 1];
}
public int min(int n1, int n2, int n3)
{
return (int) Math.min(n1, Math.min(n2, n3));
}
The wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation -- the most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!) you can reduce the computation to O(n times k) instead of O(n squared) (basically by giving up as soon as the minimum possible distance becomes > k).
Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.
I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).
According to a comment on this blog, Speeding Up Levenshtein, you can use VP-Trees and achieve O(nlogn). Another comment on the same blog points to a python implementation of VP-Trees and Levenshtein. Please let us know if this works.
The Wikipedia article discusses your algorithm, and various improvements. However, it appears that at least in the general case, O(n^2) is the best you can get.
There are however some improvements if you can restrict your problem (e.g. if you are only interested in the distance if it's smaller than d, complexity is O(dn) - this might make sense as a match whose distance is close to the string length is probably not very interesting ). See if you can exploit the specifics of your problem...
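A minimal Python sketch of the bounded variant discussed in the answers above, assuming a caller-supplied threshold k >= 0. The inner loop visits only the cells within k of the diagonal; for simplicity each row is still allocated in full, which a tighter implementation would avoid. The function name and the choice to return None above the threshold are assumptions of this example.

```python
def bounded_levenshtein(s, t, k):
    """Return the Levenshtein distance of s and t if it is <= k, else None."""
    n, m = len(s), len(t)
    if abs(n - m) > k:
        return None  # at least |n - m| edits are needed
    big = k + 1      # stands for "more than k"
    prev = [j if j <= k else big for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [big] * (m + 1)
        if i <= k:
            cur[0] = i
        lo, hi = max(1, i - k), min(m, i + k)
        for j in range(lo, hi + 1):  # only cells within k of the diagonal
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
        if min(cur[lo - 1:hi + 1]) > k:
            return None  # every alignment already costs more than k
        prev = cur
    return prev[m] if prev[m] <= k else None


print(bounded_levenshtein("kitten", "sitting", 3))  # 3
print(bounded_levenshtein("kitten", "sitting", 2))  # None
```

When searching a dictionary for the closest match, k can be lowered to the best distance found so far, as suggested above.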
I modified the Levenshtein distance VBA function found on this post to use a one dimensional array. It performs much faster.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2
L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1
For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,...
For j = 1 To L2
ssL = (L1 + 1) * j
For i = (ssL + 1) To (ssL + L1)
If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
cI = D(i - 1) + 1
cD = D(i - L1p1) + 1
cS = D(i - L1p2) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i) = cI Else D(i) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i) = cD Else D(i) = cS
End If
Next i
Next j
LevenshteinDistance2 = D(LD)
End Function
I have tested this function with string 's1' of length 11,304 and 's2' of length 5,665 ( > 64 million character comparisons). With the above single dimension version of the function, the execution time is ~24 seconds on my machine. The original two dimensional function that I referenced in the link above requires ~37 seconds for the same strings. I have optimized the single dimensional function further as shown below and it requires ~10 seconds for the same strings.
'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long 'loop counters, loop step
Dim ssL As Long, cost As Long 'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String 'Character arrays for string S1 & S2
L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1
For i = 0 To L1 Step 1: D(i) = i: Next i 'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j 'setup array positions 0,1,2,3,4,...
ReDim sss1(1 To L1) 'Size character array S1
ReDim sss2(1 To L2) 'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i 'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i 'Fill S2 character array
For j = 1 To L2
ssL = (L1 + 1) * j
For i = (ssL + 1) To (ssL + L1)
If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
cI = D(i - 1) + 1
cD = D(i - L1p1) + 1
cS = D(i - L1p2) + cost
If cI <= cD Then 'Insertion or Substitution
If cI <= cS Then D(i) = cI Else D(i) = cS
Else 'Deletion or Substitution
If cD <= cS Then D(i) = cD Else D(i) = cS
End If
Next i
Next j
LevenshteinDistance = D(LD)
End Function
Commons-lang has a pretty fast implementation. See http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm.
Here's my translation of that into Scala:
// The code below is based on code from the Apache Commons lang project.
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with this
* work for additional information regarding copyright ownership. The ASF
* licenses this file to You under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance with the
* License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
// Inside impl n <= m!
val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
val d = new Array[Int](n + 1) // cost array, horizontally
@tailrec def fillP(i: Int) {
p(i) = i
if (i < n) fillP(i + 1)
}
fillP(0)
@tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
d(0) = j
@tailrec def eachI(i: Int) {
val a = d(i - 1) + 1
val b = p(i) + 1
d(i) = if (a < b) a else {
val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
if (b < c) b else c
}
if (i < n)
eachI(i + 1)
}
eachI(1)
if (j < m)
eachJ(j + 1, t.charAt(j), p, d)
else
d(n)
}
eachJ(1, t.charAt(0), d, p)
}
val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}
}
I know this is very late but it is relevant to the discussion at hand.
As mentioned by others, if all you want to do is check whether the edit distance between two strings is within some threshold k, you can reduce the time complexity to O(kn). A more precise expression would be O((2k+1)n). You take a strip which spans k cells either side of the diagonal cell (length of strip 2k+1) and compute the values of cells lying on this strip.
Interestingly, there has been an improvement by Li et al., which further reduces this to O((k+1)n).
