Finding longest common subsequence in O(NlogN) time - algorithm

Is there any way of finding the longest common subsequence of two sequences in O(NlogN) time?
I read somewhere that there is a way to achieve this using binary search.
I know the DP approach that takes O(N^2) time.

For the general case, the O(N^2) dynamic programming algorithm is the best you can do. However, there exist better algorithms in some special cases.
Alphabet size is bounded
This is a very common situation. Sequences consisting of letters from some alphabet (e.g. English) fall into this category. For this case, the O(N*M) algorithm can be optimised to O(N^2/log N) using the Method of Four Russians. I don't know exactly how; you can search for it.
Both sequences consist of distinct elements
An example problem is "Given two permutations of numbers from 1 to N, find their LCS". This one can be solved in O(N*logN). Let the sequences be A and B.
Define a sequence C. C[i] is the index of B[i] in A. (A[C[i]] = B[i])
LCS of A and B is the longest increasing subsequence of C.
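To illustrate, here is a minimal sketch of that reduction (the wrapper name LCSOfPermutations is mine, and it reuses the O(N log N) LongestIncreasingSubsequence routine from a later answer below):

#include <vector>
using namespace std;

int LongestIncreasingSubsequence(const vector<int>& arr);  // defined in a later answer

// Assumes A and B are permutations of 1..N, so every element of B
// occurs exactly once in A.
int LCSOfPermutations(const vector<int>& A, const vector<int>& B) {
    int n = A.size();
    vector<int> pos(n + 1);                        // pos[v] = index of value v in A
    for (int i = 0; i < n; i++) pos[A[i]] = i;
    vector<int> C(n);
    for (int i = 0; i < n; i++) C[i] = pos[B[i]];  // so that A[C[i]] == B[i]
    return LongestIncreasingSubsequence(C);        // LCS length = LIS length of C
}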

The dynamic programming approach is O(n^2) for the general case. For certain other cases, there are lower-complexity algorithms:
For a fixed alphabet size (one which doesn't grow with n), there's the Method of Four Russians, which brings the time down to O(n^2/log n) (see here).
See here for another, further-optimised special case.

Assuming the Strong Exponential Time Hypothesis (which is stricter than P ≠ NP, but is still widely believed to be true), it is not possible to do it in time O(N^{2 - eps}) for any constant eps > 0; see "Quadratic Conditional Lower Bounds for String Problems and Dynamic Time Warping" by Karl Bringmann and Marvin Künnemann (a pre-print is available on arXiv).
Roughly speaking, this means that the general case of this problem cannot be solved in time much better than O(N^2/log N), so if you want faster algorithms you have to consider additional constraints (some properties of the strings) or look for an approximate solution.

Computing the longest common subsequence of two sequences is essentially an n-squared problem.
Masek and Paterson (1980) made a minor improvement, to n-squared / log n, using the so-called "Four Russians" technique.
In most cases the additional complexity introduced by such convoluted approaches is not justified by the small gains. For practical purposes you can consider the n-squared approach the reasonable optimum in typical applications.

#include <algorithm>
#include <vector>
using namespace std;

// O(n log n) longest increasing subsequence. LIS holds the smallest
// possible tail element of an increasing subsequence of each length,
// so its final size is the answer.
int LongestIncreasingSubsequence(const vector<int>& arr) {
    int n = arr.size();
    if (!n) return 0;
    vector<int> LIS;
    LIS.emplace_back(arr[0]);
    for (int i = 1; i < n; i++) {
        if (arr[i] > LIS.back()) LIS.emplace_back(arr[i]);
        else *lower_bound(LIS.begin(), LIS.end(), arr[i]) = arr[i];
    }
    return LIS.size();
}
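For example, LongestIncreasingSubsequence({3, 1, 2, 5, 4}) returns 3 (one longest increasing subsequence being 1, 2, 4); applied to the index sequence C from the permutation answer above, this value is exactly the LCS length.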

Related

Computational complexity (Big-O notation) of a geometrical weighted centroid among access points

I need to compute the computational complexity of the following equations using Big-O notations:
Here, m is the total number of access points (perhaps the number of iterations in terms of complexity; i is an individual access point). I learned about Big-O notation from this blog. Moreover, I found a similar question at this link. In the above equation, d is a distance computed with 4 operations (multiply, subtraction, division, and power). As seen in the above equation, w is computed with two operations (power and division). x_w and y_w are computed with two operations each (multiplication and division).
Hence, I've figured out the Big-O notation of above algorithm as:
4*[m]+2*[m]+2*[m]+2*[m]
Is this correct? Can it be approximated as O(m)?
Moreover, the above algorithm (equations) is combined with a next algorithm whose computational complexity is O(N), N being the number of iterations. Here, N >> m. What will be the final computational complexity in terms of Big-O notation?
Thank you.
UPDATE:
The subscript w on x and y is just notation; it is not an iteration index. The only iteration is over m, e.g. i = 1, 2, 3, ..., m. The two algorithms operate in a pipeline fashion: first the algorithm with m iterations runs, and its output is fed (as input) to the next algorithm with N iterations. So when the m iterations (algorithm 1) are completed, they are followed by the N iterations (algorithm 2). My problem is similar to two loops that are not nested and have different iteration counts, where N >> m.
for(int i=0; i<m; i++){
System.out.println(i);
}
for(int j=0; j<N; j++){
System.out.println(j);
}
What will be the final computational complexity?
Yes, your sum from i=1 to i=m takes O(m) time. All other operations are constant; you don't have any sub-sum inside the sum or anything like that.
About your N value, you did not provide enough information. We have to know how N is computed or how it is related to m.
You should also consider the following constraint: can you provide some maximum value (even an incredibly big one) that can never be exceeded by the numbers in your equations? Usually operations on numbers are considered constant time, because they are done on 32- or 64-bit numbers, which always take constant time.
However, if your equations involve extremely long numbers (hundreds of characters long or more), the size of the numbers has to be taken into account in the complexity. (You can probably imagine that multiplying two numbers that are a million characters long takes longer than doing the same with 2x2.)
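To make the sequential-loop case from the update concrete, here is the worked sum, assuming each loop body is constant time:
T(m, N) = c_1*m + c_2*N <= c*(m + N), with c = max(c_1, c_2)
so the total is O(m + N), which collapses to O(N) under the stated assumption N >> m.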

Is this Dynamic Programming Solution to Text Justification just Brute Force?

I'm having trouble understanding the dynamic programming solution to the text justification problem as specified in the MIT open courseware lecture here. Some notes from that lecture are here, and page 3 of the notes is what I am referring to.
I thought that Dynamic Programming meant you memoize some of the computations so that you don't need to recompute, thus saving you time, but in the algorithm given in the lecture, I don't see any use of memoization, just a whole bunch of deep recursive calls, i.e. the main function is this:
DP[i] = min(badness (i, j) + DP[j] for j in range (i + 1, n + 1))
DP[n] = 0
where badness is a function that determines the amount of unused space after subtracting the length of the words from the line length. To me it looks like this algorithm calculates all possible "badness" values and chooses the smallest one, which seems like brute force to me. Where is the advantage Dynamic Programming usually gives us by memoizing past calculations so we don't have to recompute?
If you memoize the results, you don't have to compute each DP[i] several times.
That is, DP[0] "calls" DP[2] for example, but so does DP[1]. In the second time DP[2] is called, it won't be necessary to compute it again, you can just return the memoized value.
This also makes it easy to verify a polynomial upper bound for this algorithm. Since each DP[i] will perform O(n) operations, and there are n of them, the overall algorithm is O(n^2), assuming, of course, that badness(i, j) is O(1).
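To make the memoization explicit, here is a minimal C++ sketch of the recurrence (the helper names and the cubic badness formula are assumptions in the spirit of the lecture, not copied from it):

#include <algorithm>
#include <climits>
#include <vector>
using namespace std;

const long long INF = LLONG_MAX / 2;
vector<int> wordLen;       // lengths of the words to justify
int lineWidth;             // page width
vector<long long> memo;    // memo[i] = DP[i], or -1 if not yet computed

// Cost of putting words i..j-1 on one line: (unused space)^3 if they
// fit (with single spaces between them), "infinity" otherwise.
long long badness(int i, int j) {
    long long total = j - i - 1;
    for (int k = i; k < j; k++) total += wordLen[k];
    if (total > lineWidth) return INF;
    long long slack = lineWidth - total;
    return slack * slack * slack;
}

long long DP(int i) {
    int n = wordLen.size();
    if (i == n) return 0;                 // base case: DP[n] = 0
    if (memo[i] != -1) return memo[i];    // memoized: each DP[i] computed once
    long long best = INF;
    for (int j = i + 1; j <= n; j++)
        best = min(best, badness(i, j) + DP(j));
    return memo[i] = best;
}

Before calling DP(0), initialise memo.assign(wordLen.size(), -1). As written, badness(i, j) does O(j - i) work; precomputing prefix sums of the word lengths makes it O(1), which recovers the O(n^2) total mentioned above.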

Finding maximum subsequence below or equal to a certain value

I'm learning dynamic programming and I've been having a great deal of trouble understanding the more complex problems. When given a problem, I've been taught to find a recursive algorithm, memoize it, and then create an iterative, bottom-up version. At almost every step I have an issue. For the recursive algorithm, I can write it in several different ways, but often only one of them is suitable for dynamic programming, and I can't distinguish which aspects of a recursive algorithm make memoization easier. For the memoization, I don't understand which values to use as indices. For the conversion to a bottom-up version, I can't figure out in which order to fill the array/double array.
This is what I understand:
- it should be possible to split the main problem into subproblems
In terms of the problem mentioned, I've come up with a recursive algorithm that has these important lines of code:
int optionOne = values[i] + find(values, i+1, limit - values[i]);
int optionTwo = find(values, i+1, limit);
If I'm unclear or this is not the correct Q&A site, let me know.
Edit:
Example: Given array x: [4,5,6,9,11] and max value m: 20
Maximum subsequence in x under or equal to m would be [4,5,11] as 4+5+11 = 20
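For context, here is one minimal way to complete those two lines into a full recursion (the wrapper signature is my guess at the asker's find; it returns the best sum achievable from index i with remaining capacity limit):

#include <algorithm>
#include <vector>
using namespace std;

// Best subsequence sum achievable from values[i..] without exceeding limit.
int find(const vector<int>& values, int i, int limit) {
    if (i == (int)values.size() || limit <= 0) return 0;
    int optionTwo = find(values, i + 1, limit);        // skip values[i]
    if (values[i] > limit) return optionTwo;           // values[i] doesn't fit
    int optionOne = values[i] + find(values, i + 1, limit - values[i]);
    return max(optionOne, optionTwo);
}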
I think this problem is NP-hard, meaning that unless P = NP there isn't a polynomial-time algorithm for solving the problem.
There's a simple reduction from the subset-sum problem to this problem. In subset-sum, you're given a set of n numbers and a target number k and want to determine whether there's a subset of those numbers that adds up to exactly k. You can solve subset-sum with a solver for your problem as follows: create an array of the numbers in the set and find the largest subsequence whose sum is less than or equal to k. If that adds up to exactly k, the set has a subset that adds up to k. Otherwise, it does not.
This reduction takes polynomial time, so because subset-sum is NP-hard, your problem is NP-hard as well. Therefore, I doubt there's a polynomial-time algorithm.
That said - there is a pseudopolynomial-time algorithm for subset-sum, which is described on Wikipedia. This algorithm uses DP in two variables and isn't strictly polynomial time, but it will probably work in your case.
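For illustration, a minimal sketch of that pseudopolynomial DP adapted to this problem (a boolean table of reachable sums, O(n * limit) time; the function name is hypothetical):

#include <vector>
using namespace std;

// Largest subsequence sum that does not exceed limit.
int maxSubsequenceSumAtMost(const vector<int>& values, int limit) {
    vector<bool> reachable(limit + 1, false);
    reachable[0] = true;                    // the empty subsequence sums to 0
    for (int v : values)
        for (int s = limit; s >= v; s--)    // descending: use each value once
            if (reachable[s - v]) reachable[s] = true;
    for (int s = limit; s >= 0; s--)
        if (reachable[s]) return s;         // largest reachable sum
    return 0;
}

On the example above, maxSubsequenceSumAtMost({4, 5, 6, 9, 11}, 20) returns 20.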
Hope this helps!

Why would an O(n^2) algorithm run quicker than a O(n) algorithm on the same input?

Two algorithms, say A and B, are written to solve the same problem.
Algorithm A is O(n).
Algorithm B is O(n^2).
You expect algorithm A to perform better.
However, when you run a specific example on the same machine, Algorithm B runs quicker.
Give reasons to explain how such a thing can happen.
Algorithm A, for example, can run in time 10000000*n, which is O(n).
If algorithm B runs in time n*n, which is O(n^2), A will be slower for every n < 10000000.
O(n) and O(n^2) are asymptotic runtimes that describe the behavior as n -> infinity.
EDIT - EXAMPLE
Suppose you have the two following functions:
boolean flag;

void algoA(int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < 1000000; j++)
            flag = !flag;
}

void algoB(int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            flag = !flag;
}
algoA has n*1000000 flag flip operations so it is O(n) whereas algoB has n^2 flag flip operations so it is O(n^2).
Just solve the inequality 1000000n > n^2 and you'll get that for n < 1000000 it holds. That is, the O(n) method will be slower.
Knowing the algorithms would help give a more exact answer.
But for the general case, I could think of a few relevant factors:
Hardware related
e.g. if the slower algorithm makes good use of caching and locality or similar low-level mechanisms (see Quicksort's performance compared to theoretically faster sorting algorithms). Timsort is also worth reading about, as an example where an "efficient" algorithm is used to break the problem up into smaller input sets and a "simpler", theoretically "higher complexity" algorithm is used on those sets, because it's faster.
Properties of the input set
e.g. if the input size is small, the efficiency will not come through in a test; also, for example with sorting again, if the input is mostly pre-sorted vs completely random, you will see different results. Many different inputs should be used in a test of this type for an accurate result. Using just one example is simply not enough, as the input can be engineered to favor one algorithm instead of another.
Specific implementation of either algorithm
e.g. there's a long way to go from the theoretical description of an algorithm to implementation; poor use of data structures, recursion, memory management etc. can seriously affect performance
Big-O-notation says nothing about the speed itself, only about how the speed will change when you change n.
If both algorithms take the same time for a single iteration, @Itay's example is also correct.
While all of the answers so far seem correct... none of them feel really "right" in the context of a CS class. In a computational complexity course you want to be precise and use definitions. I'll outline a lot of the nuances of this question and of computational complexity in general. By the end, we'll conclude why Itay's solution at the top is probably what you should've written. My main issue with Itay's solution is that it lacks definitions which are key to writing a good proof for a CS class. Note that my definitions may differ slightly from your class' so feel free to substitute in whatever you want.
When we say "an algorithm is O(n)" we actually mean "this algorithm is in the set O(n)". And the set O(n) contains all algorithms whose worst-case asymptotic complexity f(n) has the property that f(n) <= c*n + c_0 for some constants c, c_0 > 0.
Now we want to prove your claim. First of all, the way you stated the problem, it has a trivial solution. That's because our asymptotic bounds are "worst-case". For many "slow" algorithms there is some input on which they run remarkably quickly. For instance, insertion sort is linear if the input is already sorted! So take insertion sort (O(n^2) in the worst case) and merge sort (O(n log n)) and notice that insertion sort will run faster if you pass in a sorted array! Boom, proof done.
But I am assuming that your exam meant something more like "show why a linear algorithm might run faster than a quadratic algorithm in the worst-case." As Alex noted above, this is an open ended question. The crux of the issue is that runtime analysis makes assumptions that certain operations are O(1) (e.g. for some problem you might assume that multiplication is O(1) even though it becomes quadratically slower for large numbers (one might argue that the numbers for a given problem are bounded to be 100-bits so it's still "constant time")). Since your class is probably focusing specifically on computational complexity then they probably want you to gloss over this issue. So we'll prove the claim assuming that our O(1) assumptions are right, and so there aren't details like "caching makes this algorithm way faster than the other one".
So now we have one algorithm which runs in f(n), which is O(n), and some other algorithm which runs in g(n), which is O(n^2). We want to use the definitions above to show that for some n we can have g(n) < f(n). The trick is that our assumptions have not fixed the constants c, c_0, c', c_0'. As Itay mentions, we can choose values for those constants such that g(n) < f(n) for many n. And the rest of the proof is what he wrote above (e.g. let c, c_0 be the constants for f(n) and say they are both 100, while c', c_0' are the constants for g(n) and they are both 1. Then g(n) < f(n) => n^2 + 1 < 100n + 100 => n^2 - 100n - 99 < 0 => (factor to get actual bounds for n))
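Working those bounds out concretely: n^2 - 100n - 99 = 0 has roots n = (100 ± sqrt(100^2 + 4*99))/2 ≈ -0.98 and ≈ 100.98, so with these constants the "quadratic" g(n) stays below the "linear" f(n) for every integer 1 <= n <= 100.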
It depends on the scenario. There are three types of scenario: best, average and worst case. If you know sorting techniques, you'll see the same thing happen there. For more information see the following link:
http://en.wikipedia.org/wiki/Sorting_algorithm
Please correct me if I am wrong.

Count the number of operations for a sorting algorithm

This is my assignment question:
Explain, with an example, quick sort, merge sort and heap sort.
Further, count the number of operations performed by each of these sorting methods.
I don't understand what exactly I have to answer in the context of "count the number of operations"?
I found something in the Cormen book, in chapter 2: they explain insertion sort and analyse its running time by calculating the running time of each statement...
Do I have to do it in a similar way?
To count the number of operations is also known as analysing the algorithm's complexity. The idea is to get a rough idea of how many operations are needed, in the worst case, to execute the algorithm on an input of size N, which gives you an upper bound on the computational resources required for that algorithm. And since each operation by itself (like multiplication or comparison, for example) is a finite operation and takes deterministic time (even though it might differ between machines), to get an idea of how good or bad an algorithm is, especially compared to other algorithms, all you need to know is the rough number of operations.
Here's an example with bubble sort. Let's say you have an array of two numbers. To sort it you need to compare both numbers and potentially exchange them. Since comparing and exchanging are single operations, the exact time for executing them is minimal and not important by itself. Thus, you can say that with N=2, the number of operations is 1. For three numbers, though, you need three operations in the worst case: compare the first and the second one and potentially exchange them, then compare the second one and the third one and potentially exchange them, then compare the first one with the second one again. When you generalise bubble sort, you find that to sort N numbers you potentially need N-1 operations for the first number, N-2 for the second, and so on. In other words, the count is (N-1) + (N-2) + ... + 2 + 1 = N * (N-1) / 2, which for big enough N grows like N^2, i.e. the algorithm is O(N^2).
Of course, you could just cheat and find out on the web the O(N) number for each of the three sort algorithms, but I would urge you to spend the time and try to come up with that number yourself at first. Even if you get it wrong, comparing your estimate and how you got it with the actual way to estimate their complexity will help you understand better the process of analyzing the complexity of particular piece of software you write in future.
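As a quick sanity check of that formula, here is a small sketch of my own (not from the answer above) that counts the comparisons bubble sort makes on a worst-case input:

#include <iostream>
#include <utility>
#include <vector>
using namespace std;

int main() {
    vector<int> a = {5, 4, 3, 2, 1};        // worst case: reverse-sorted
    long long comparisons = 0;
    for (size_t i = 0; i + 1 < a.size(); i++)
        for (size_t j = 0; j + 1 < a.size() - i; j++) {
            comparisons++;                   // count every basic operation
            if (a[j] > a[j + 1]) swap(a[j], a[j + 1]);
        }
    cout << comparisons << "\n";             // prints 10 = 5*4/2
    return 0;
}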
This is called the big O notation.
This page shows you the most common sorting algorithms and their comparison expressed through big O.
Computational complexity (worst, average and best number of comparisons for several typical test cases, see below). Typically, good average number of comparisons/operations is O(n log n) and bad is O(n^2)
From http://www.softpanorama.org/Algorithms/sorting.shtml
I think this assignment is meant to give you an idea of how the complexity of an algorithm is calculated. For example, the bubble sort algorithm has a complexity of O(n^2).
// Bubble sort method.
// ref: [http://www.metalshell.com/source_code/105/Bubble_Sort.html][1]
int iarray[ARRAY_SIZE];   // the array to be sorted
int x, y, holder;

for(x = 0; x < ARRAY_SIZE; x++)
    for(y = 0; y < ARRAY_SIZE-1; y++)
        if(iarray[y] > iarray[y+1]) {      // swap out-of-order neighbours
            holder = iarray[y+1];
            iarray[y+1] = iarray[y];
            iarray[y] = holder;
        }
As you can see above, two loops are used to sort the array. Let ARRAY_SIZE be n. Then the number of operations is n*(n-1). That makes n^2 - n, which is denoted O(N^2). That is big O notation: we just take the term with the largest exponent, the highest growth rate. If it were 2n^2 + 2n, it would still be O(N^2), because constants are also omitted in calculating complexity. The Wikipedia article on Big O Notation is really helpful (as Leniel mentioned in his post).
That's your homework, so I did not get into the details of the algorithms you mentioned, but you need to do the math like this. I think what you are asked for is the actual number of operations, so for the example above, if ARRAY_SIZE is 10, the answer is 10*9 = 90. To see the differences you need to use the same array in your sample codes.
