Array merging and sorting complexity calculation - algorithm

I have one exercise from my algorithm text book and I am not really sure about the solution. I need to explain why this solution:
function array_merge_sorted(array $foo, array $bar)
{
    $baz = array_merge($foo, $bar);
    $baz = array_unique($baz);
    sort($baz);
    return $baz;
}
which merges two arrays and sorts them, is not the most efficient, and I need to provide the most optimized solution and prove that no better solution exists.
My idea was to use a mergesort algorithm, which is O(n log n), to merge and order the two arrays passed as parameters. But how can I prove that it is the best possible solution?

Algorithm
As you have said that both inputs are already sorted, you can use a simple zipper-like approach.
You have one pointer for each input array, pointing to its beginning. Then you compare both elements, adding the smaller one to the result and advancing the pointer of the array with the smaller element. Then you repeat this step until both pointers have reached the end and all elements have been added to the result.
You can find a collection of such algorithms on Wikipedia under Merge algorithm, with the approach presented here listed as Merging two lists.
Here is some pseudocode:
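function Array<Element> mergeSorted(Array<Element> first, Array<Element> second) {
    Array<Element> result = new Array<Element>(first.length + second.length);
    int firstPointer = 0;
    int secondPointer = 0;
    // Zip both arrays together while each still has unread elements.
    while (firstPointer < first.length && secondPointer < second.length) {
        Element elementOfFirst = first.get(firstPointer);
        Element elementOfSecond = second.get(secondPointer);
        if (elementOfFirst < elementOfSecond) {
            result.add(elementOfFirst);
            firstPointer = firstPointer + 1;
        } else {
            result.add(elementOfSecond);
            secondPointer = secondPointer + 1;
        }
    }
    // One array is exhausted; append whatever remains of the other.
    while (firstPointer < first.length) {
        result.add(first.get(firstPointer));
        firstPointer = firstPointer + 1;
    }
    while (secondPointer < second.length) {
        result.add(second.get(secondPointer));
        secondPointer = secondPointer + 1;
    }
    return result;
}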
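For reference, here is a minimal runnable sketch of the same zipper idea in Python. The name merge_sorted_unique is made up for illustration, and dropping duplicates is an assumption added only to mirror the array_unique call in the question. It assumes both inputs are already sorted in ascending order.

```python
def merge_sorted_unique(first, second):
    """Merge two ascending lists into one ascending list without duplicates."""
    result = []
    i = j = 0
    while i < len(first) or j < len(second):
        # Take from `first` when `second` is exhausted or the head of `first` is not larger.
        if j == len(second) or (i < len(first) and first[i] <= second[j]):
            candidate = first[i]
            i += 1
        else:
            candidate = second[j]
            j += 1
        # Equal values arrive consecutively, so one look-back removes duplicates.
        if not result or result[-1] != candidate:
            result.append(candidate)
    return result
```

Every element is looked at exactly once, so the running time stays linear in the combined input length even with the duplicate check.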
Proof
The algorithm obviously runs in O(n), where n is the size of the resulting list. More precisely, it is O(max(n, n')), with n being the size of the first list and n' the size of the second list (or O(n + n'), which is the same set).
This is also obviously optimal, since at some point you need to traverse all elements at least once in order to build the result and know the final ordering. This yields a lower bound of Omega(n) for this problem, thus the algorithm is optimal.
A more formal proof assumes an arbitrary better algorithm A that solves the problem without looking at each element at least once (more precisely, in o(n) time, that is, asymptotically fewer than n steps).
We call that element, which the algorithm does not look at, e. We can now construct an input I such that e has a value which fulfills the order in its own array but will be placed wrong by the algorithm in the resulting array.
We are able to do so for every algorithm A and since A always needs to work correctly on all possible inputs, we are able to find a counter-example I such that it fails.
Thus A can not exist and Omega(n) is a lower bound for that problem.
Why the given algorithm is worse
Your given algorithm first merges the two arrays; this works in O(n), which is good. But after that it sorts the array.
Sorting (more precisely, comparison-based sorting) has a lower bound of Omega(n log n). This means no such algorithm can do better than that.
Thus the given algorithm has a total time complexity of O(n log n) because of the sorting part, which is worse than O(n), the complexity of the other algorithm and also of the optimal solution.
However, to be super-correct, we would also need to argue whether the sort method truly yields that complexity, since it does not get arbitrary inputs but always the result of the merge method. Thus it could be possible that a specific sorting method works especially well for such specific inputs, yielding O(n) in the end.
But I doubt that this is in the focus of your task.

Related

Search a Sorted Array for First Occurrence of K

I'm trying to solve question 11.1 in Elements of Programming Interviews (EPI) in Java: Search a Sorted Array for First Occurrence of K.
The problem description from the book:
Write a method that takes a sorted array and a key and returns the index of the first occurrence of that key in the array.
The solution they provide in the book is a modified binary search algorithm that runs in O(logn) time. I wrote my own algorithm also based on a modified binary search algorithm with a slight difference - it uses recursion. The problem is I don't know how to determine the time complexity of my algorithm - my best guess is that it will run in O(logn) time because each time the function is called it reduces the size of the candidate values by half. I've tested my algorithm against the 314 EPI test cases that are provided by the EPI Judge so I know it works, I just don't know the time complexity - here is the code:
public static int searchFirstOfKUtility(List<Integer> A, int k, int Lower, int Upper, Integer Index)
{
    while (Lower <= Upper) {
        int M = Lower + (Upper - Lower) / 2;
        if (A.get(M) < k)
            Lower = M + 1;
        else if (A.get(M) == k) {
            Index = M;
            if (Lower != Upper)
                Index = searchFirstOfKUtility(A, k, Lower, M - 1, Index);
            return Index;
        }
        else
            Upper = M - 1;
    }
    return Index;
}
Here is the code that the tests cases call to exercise my function:
public static int searchFirstOfK(List<Integer> A, int k) {
    Integer foundKey = -1;
    return searchFirstOfKUtility(A, k, 0, A.size() - 1, foundKey);
}
So, can anyone tell me what the time complexity of my algorithm would be?
Assuming that passing arguments is O(1) instead of O(n), performance is O(log(n)).
The usual theoretical approach for analyzing recursion is to apply the Master Theorem. It says that if the running time of a recursive algorithm follows a relation:
T(n) = a T(n/b) + f(n)
then there are 3 cases. In plain English they correspond to:
1. Performance is dominated by all the calls at the bottom of the recursion, so is proportional to how many of those there are.
2. Performance is equal between each level of recursion, and so is proportional to how many levels of recursion there are, times the cost of any layer of recursion.
3. Performance is dominated by the work done in the very first call, and so is proportional to f(n).
You are in case 2. Each recursive call costs the same, and so performance is dominated by the fact that there are O(log(n)) levels of recursion times the cost of each level. Assuming that passing a fixed number of arguments is O(1), that will indeed be O(log(n)).
Note that this assumption is true for Java because you don't make a complete copy of the array before passing it. But it is important to be aware that it is not true in all languages. For example, I recently did a bunch of work in PL/pgSQL, and there arrays are passed by value, meaning that your algorithm would have been O(n log(n)).
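To see the logarithmic behaviour concretely, here is a small Python sketch that counts how many elements the search inspects. It is an iterative restatement of the first-occurrence search, not the asker's Java code; the name first_occurrence and the probe counter are made up for illustration. Each probe at least halves the remaining range, which corresponds to the recurrence T(n) = T(n/2) + O(1).

```python
def first_occurrence(a, k):
    """Return (index of the first k in sorted list a, or -1; number of probes)."""
    lower, upper = 0, len(a) - 1
    index, probes = -1, 0
    while lower <= upper:
        mid = lower + (upper - lower) // 2
        probes += 1
        if a[mid] < k:
            lower = mid + 1
        else:
            if a[mid] == k:
                index = mid   # remember the match, then keep looking further left
            upper = mid - 1
    return index, probes
```

For a list of n elements the probe count should never exceed n.bit_length(), which is floor(log2(n)) + 1.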

Complexity analysis of a solution to minimizing concat cost

This is about analyzing the complexity of a solution to a popular interview problem.
Problem
There is a function concat(str1, str2) that concatenates two strings. The cost of the function is measured by the lengths of the two input strings len(str1) + len(str2). Implement concat_all(strs) that concatenates a list of strings using only the concat(str1, str2) function. The goal is to minimize the total concat cost.
Warnings
Usually in practice, you would be very cautious about concatenating pairs of strings in a loop. Some good explanations can be found here and here. In reality, I have witnessed a severity-1 accident caused by such code. Warnings aside, let's say this is an interview problem. What's really interesting to me is the complexity analysis around the various solutions.
You can pause here if you would like to think about the problem. I am gonna reveal some solutions below.
Solutions
Naive solution. Loop through the list and concatenate
def concat_all(strs):
    result = ''
    for str in strs:
        result = concat(result, str)
    return result
Min-heap solution. The idea is to concatenate shorter strings first. Maintain a min-heap of the strings keyed on their length. Each step pops the 2 shortest strings off the min-heap, concatenates them, and pushes the result back onto the min-heap, until only one string is left on the heap.
def concat_all(strs):
    heap = MinHeap(strs, key_func=len)
    while len(heap) > 1:
        str1 = heap.pop()
        str2 = heap.pop()
        heap.push(concat(str1, str2))
    return heap.pop()
Binary concat. It may not be intuitively clear, but another good solution is to recursively split the list in half and concatenate.
def concat_all(strs):
    if len(strs) == 1:
        return strs[0]
    if len(strs) == 2:
        return concat(strs[0], strs[1])
    mid = len(strs) // 2
    str1 = concat_all(strs[:mid])
    str2 = concat_all(strs[mid:])
    return concat(str1, str2)
Complexity
What I am really struggling and asking here is the complexity of the 2nd approach above that uses a min-heap.
Let's say the number of strings in the list is n and the total number of characters in all the strings is m. The upper bound for the naive solution is O(mn). The binary-concat has an exact bound of theta(mlog(n)). It is the min-heap approach that is elusive to me.
I am kind of guessing it has an upper bound of O(mlog(n) + nlog(n)). The second term, nlog(n), is associated with maintaining the heap; there are n concats and each concat updates the heap in log(n). If we only focus on the cost of concatenations and ignore the cost of maintaining the min-heap, the overall complexity of the min-heap approach can be reduced to O(mlog(n)). Then min-heap is a more optimal approach than binary-concat, because for the former mlog(n) is the upper bound while for the latter it is the exact bound.
But I can't seem to prove it, or even find a good intuition to support that guessed upper bound. Can the upper bound be even lower than O(mlog(n))?
Let us call m_1, ..., m_n the lengths of the strings, and let m be the sum of all these values.
For the naive solution, the worst case clearly appears when m_1 is almost equal to m, and you obtain O(nm) complexity, as you pointed out.
For the min-heap, the worst case is a bit different: it consists of all strings having the same length. In that case, it works exactly like your solution 3 (binary concat), but you also have to maintain the min-heap structure. So yes, it will be a bit more costly than solution 3 in practice. Nevertheless, from a complexity point of view, both are in O(m log n), since we have m > n and O(m log n + n log n) can be reduced to O(m log n).
To prove the min-heap complexity more rigorously, take the two smallest strings in a set of k strings with total length m, and denote by S the sum of their lengths. For k > 2 we have (m-S)/(k-2) >= S/2 (it simply means that the mean of the two smallest strings is at most the mean of the k-2 other strings). Rearranging leads to S <= 2m/k, which also holds for k = 2, where S = m. Let us apply it to the min-heap algorithm, noting that the total length m never changes and that each step reduces the number of strings by one:
at the first step, the 2 strings we take are of total length at most 2m/n
at the second step, the 2 strings we take are of total length at most 2m/(n-1)
...
at the last step, the 2 strings we take are of total length at most 2m/2
Hence the concatenation cost of min-heap is at most 2m*[1/n + 1/(n-1) + ... + 1/3 + 1/2], which is in O(m log n).
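To get a feel for these bounds, here is a hedged Python sketch that computes only the total concat cost of each strategy from the string lengths, without building any strings. The function names are made up for illustration, and heapq stands in for the MinHeap class used in the question.

```python
import heapq


def naive_cost(lengths):
    """Total cost of concatenating left to right; each concat costs len(str1) + len(str2)."""
    total, acc = 0, 0
    for n in lengths:
        total += acc + n   # concat(result, next) costs both lengths
        acc += n
    return total


def heap_cost(lengths):
    """Total cost when the two shortest strings are always concatenated first."""
    heap = list(lengths)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        s = heapq.heappop(heap) + heapq.heappop(heap)
        total += s
        heapq.heappush(heap, s)
    return total


def binary_cost(lengths):
    """Total cost of the recursive split-in-half strategy."""
    if len(lengths) <= 1:
        return 0
    mid = len(lengths) // 2
    # The final concat at this level touches every character once.
    return binary_cost(lengths[:mid]) + binary_cost(lengths[mid:]) + sum(lengths)
```

Trying it on one long string followed by several short ones, for example [100, 1, 1, 1], shows how the naive order keeps re-copying the long string while the heap order postpones it to the last concat.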

Best case of fractional knapsack

The worst-case running time of fractional knapsack is O(n), so what should its best case be? Is it O(1), because if the weight limit is 16 and the first item you get has that value, is that right?
True, if you assume that the input is given in sorted order of value.
But as per the definition, the algorithm is expected to take non-sorted input too. See this.
If you are considering a normal input that may or may not be sorted, then there are two approaches to solve the problem:
Sort the input, which cannot take less than O(n) even in the best case, and that only if you use bubble/insertion sort. That looks completely foolish, because both of these sorting algorithms have O(n^2) average/worst-case performance.
Use the weighted medians approach. That will cost you O(n), as finding the weighted median takes O(n). The code for this approach is given below.
Weighted median approach for fractional knapsack:
We will work on value per unit of item in the following code. The code will first find the middle value (i.e. mid of values per unit of items if given in sorted order) and place it in its correct position. We will use quick sort partition method for this. Once we get the middle (call it mid) element, following two cases need to be taken into consideration:
When the sum of the weights of all items on the right side of mid is more than W, we need to search for our answer in the right side of mid.
Otherwise, take every item on the right side of mid: sum their values (call it v1) and their weights (call it w1), add v1 to the result, and search for the remaining capacity W-w1 in the left side of mid (including mid as well).
The following is an implementation in Python (use only floating-point numbers everywhere):
Please note that I am not providing production-level code, and there are cases that will fail as well. Think about what can cause the worst case or a failure when finding the kth max in an array (when all values are the same, maybe).
def partition(weights, values, start, end):
    # Lomuto partition on value per unit of weight, using the last item as pivot.
    x = values[end] / weights[end]
    i = start
    for j in range(start, end):
        if values[j] / weights[j] < x:
            values[i], values[j] = values[j], values[i]
            weights[i], weights[j] = weights[j], weights[i]
            i += 1
    values[i], values[end] = values[end], values[i]
    weights[i], weights[end] = weights[end], weights[i]
    return i


def _find_kth(weights, values, start, end, k):
    # Put the k-th smallest ratio (1-based, within start..end) at its sorted index.
    ind = partition(weights, values, start, end)
    if ind - start == k - 1:
        return ind
    if ind - start > k - 1:
        return _find_kth(weights, values, start, ind - 1, k)
    # Skip the (ind - start) smaller items and the pivot itself.
    return _find_kth(weights, values, ind + 1, end, k - (ind - start) - 1)


def find_kth(weights, values, k):
    return _find_kth(weights, values, 0, len(weights) - 1, k)


def fractional_knapsack(weights, values, w):
    if w == 0 or len(weights) == 0:
        return 0
    if len(weights) == 1:
        # A single item: take as much of it as fits.
        return min(w, weights[0]) * (values[0] / weights[0])
    mid = find_kth(weights, values, len(weights) // 2)
    w1 = sum(weights[mid + 1:])
    v1 = sum(values[mid + 1:])
    if w1 > w:
        return fractional_knapsack(weights[mid + 1:], values[mid + 1:], w)
    return v1 + fractional_knapsack(weights[:mid + 1], values[:mid + 1], w - w1)
(Editing and rewriting the answer after discussion with #Shasha99, since I feel answers before 2016-12-06 are a bit deceiving)
Summary
O(1) best case is possible if the items are already sorted. Otherwise best case is O(n).
Discussion
If the items are not sorted, you need to find the best item (for the case where one item already fills the knapsack), and that alone will take O(n), since you have to check all of them. Therefore, best case O(n).
On the opposite end, you could have a knapsack where all the items fit. Searching for best would not be needed, but you need to put all of them in, so it's still O(n).
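As an illustration of the sorted case, here is a minimal Python sketch of the usual greedy pass with an early exit. It assumes the items arrive as (value, weight) pairs already sorted by value per unit of weight, best first, and that all weights are positive; the function name is made up for illustration.

```python
def fractional_knapsack_sorted(items, capacity):
    """items: (value, weight) pairs sorted by value/weight, best ratio first."""
    total = 0.0
    for value, weight in items:
        if capacity <= 0:
            break                      # knapsack is full: stop early
        take = min(weight, capacity)   # the whole item, or the fraction that fits
        total += value * take / weight
        capacity -= take
    return total
```

If the first item alone fills the knapsack, the loop stops after one step, which is the O(1) best case; if every item fits, all n items are visited.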
More analysis
Funny enough, an O(n) worst case does not require the items to be sorted.
Apparently the idea comes from http://algo2.iti.kit.edu/sanders/courses/algdat03/sol12.pdf, paired with a fast median selection algorithm (weighted medians or maybe median of medians?). Thanks to #Shasha99 for finding this algorithm.
Note that plain quickselect is O(n) expected, O(n*n) worst, but if you use median-of-medians that becomes O(n) worst case. The downside is quite a complicated algorithm.
I'd be interested in a working implementation of any algorithm. More sources to (hopefully simple) algorithms also wouldn't hurt.

Efficient algorithm to determine if two sets of numbers are disjoint

Practicing for software developer interviews and got stuck on an algorithm question.
Given two sets of unsorted integers, stored in one array of length m and another of length n where m < n, find an efficient algorithm to determine whether the sets are disjoint. I've found solutions in O(nm) time, but haven't found any that are more efficient than this, such as in O(n log m) time.
Using a data structure that has O(1) lookup/insertion, you can easily insert all elements of the first set.
Then, for each element in the second set, check whether it exists in the structure: if any does, the sets are not disjoint; otherwise they are disjoint.
Pseudocode
function isDisjoint(list1, list2)
    seen = new HashMap();
    foreach (x in list1)
        seen.put(x, true);
    foreach (y in list2)
        if (seen.hasKey(y))
            return false;
    return true;
This will give you an O(n + m) solution
Fairly obvious approach - sort the array of length m - O(m log m).
For every element in the array of length n, use binary search to check if it exists in the array of length m - O(log m) per element = O(n log m). Since m<n, this adds up to O(n log m).
Here's a link to a post that I think answers your question.
3) Sort smaller O((m + n)logm)
Say, m < n, sort A
Binary search for each element of B into A
Disadvantage: Modifies the input
Looks like Cheruvian beat me to it, but you can use a hash table to get O(n+m) in average case:
*Insert all elements of m into the table, taking (probably) constant time for each, assuming there aren't a lot with the same hash. This step is O(m)
*For each element of n, check to see if it is in the table. If it is, return false. Otherwise, move on to the next. This takes O(n).
*If none are in the table, return true.
As I said before, this works because a hash table gives constant lookup time in average case. In the rare event that many unique elements in m have the same hash, it will take slightly longer. However, most people don't need to care about hypothetical worst cases. For example, quick sort is used more than merge sort because it gives better average performance, despite the O(n^2) upper bound.
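For comparison, here is a hedged Python sketch of both approaches from this thread: sorting the shorter array and binary-searching it, and the hash-based lookup. The function names are made up for illustration. Sorting a copy is a choice made here to avoid the "modifies the input" drawback mentioned above, at the cost of O(m) extra space.

```python
from bisect import bisect_left


def is_disjoint_sorted(small, large):
    """Sort a copy of the shorter list, then binary-search each element of the longer one."""
    ordered = sorted(small)              # O(m log m); the copy leaves the input untouched
    for y in large:                      # n lookups of O(log m) each
        pos = bisect_left(ordered, y)
        if pos < len(ordered) and ordered[pos] == y:
            return False
    return True


def is_disjoint_hashed(small, large):
    """Hash-based variant: O(n + m) on average."""
    return set(small).isdisjoint(large)
```

Both return the same answers; the hashed variant is usually the simpler choice when the elements are hashable.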

What is the n in big-O notation?

The question is rather simple, but I just can't find a good enough answer. On the most upvoted SO question regarding the big-O notation, it says that:
For example, sorting algorithms are typically compared based on comparison operations (comparing two nodes to determine their relative ordering).
Now let's consider the simple bubble sort algorithm:
for (int i = arr.length - 1; i > 0; i--) {
    for (int j = 0; j < i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...)
        }
    }
}
I know that worst case is O(n²) and best case is O(n), but what is n exactly? If we attempt to sort an already sorted array (best case), we would end up doing nothing, so why is it still O(n)? We are looping through 2 for-loops still, so if anything it should be O(n²). n can't be the number of comparison operations, because we still compare all the elements, right?
When analyzing the Big-O performance of sorting algorithms, n typically represents the number of elements that you're sorting.
So, for example, if you're sorting n items with Bubble Sort, the runtime performance in the worst case will be on the order of O(n²) operations. This is why Bubble Sort is considered to be an extremely poor sorting algorithm, because it doesn't scale well with increasing numbers of elements to sort. As the number of elements to sort increases linearly, the worst case runtime increases quadratically.
Here is an example graph demonstrating how various algorithms scale in terms of worst-case runtime as the problem size N increases. The dark-blue line represents an algorithm that scales linearly, while the magenta/purple line represents a quadratic algorithm.
Notice that for sufficiently large N, the quadratic algorithm eventually takes longer than the linear algorithm to solve the problem.
Graph taken from http://science.slc.edu/~jmarshall/courses/2002/spring/cs50/BigO/.
See Also
The formal definition of Big-O.
I think two things are getting confused here, n and the function of n that is being bounded by the Big-O analysis.
By convention, for any algorithm complexity analysis, n is the size of the input if nothing different is specified. For any given algorithm, there are several interesting functions of the input size for which one might calculate asymptotic bounds such as Big-O.
The commonest such function for a sorting algorithm is the worst case number of comparisons. If someone says a sorting algorithm is O(n^2), without specifying anything else, I would assume they mean the worst case comparison count is O(n^2), where n is the input size.
Another interesting function is the amount of work space, that is, space in addition to the array being sorted. Bubble sort's work space is O(1), constant space, because it only uses a few variables regardless of the array size.
Bubble sort can be coded to do only n-1 array element comparisons in the best case, by finishing after any pass that does no exchanges. See this pseudo code implementation, which uses swapped to remember whether there were any exchanges. If the array is already sorted the first pass does no exchanges, so the sort finishes after one pass.
n is usually the size of the input. For an array, that would be the number of elements.
To see the different cases, you would need to change the algorithm:
for (int i = arr.length - 1; i > 0; i--) {
    boolean swapped = false;
    for (int j = 0; j < i; j++) {
        if (arr[j] > arr[j+1]) {
            switchPlaces(...);
            swapped = true;
        }
    }
    if (!swapped) {
        break;
    }
}
Your algorithm's best/worst cases are both O(n^2), but with the possibility of returning early, the best-case is now O(n).
n is the array length. You want to find T(n), the complexity of the algorithm.
Accessing memory is much more expensive than checking an if condition, so you define T(n) to be the number of memory accesses.
In the given algorithm, both the best case (BC) and the worst case (WC) use O(n^2) memory accesses, because you check the if condition O(n^2) times.
To make the complexity better, hold a flag: if you don't do any swaps in a pass of the main loop, your array is sorted and you can break.
Now, in the BC the array is already sorted and you access all elements once, so it is O(n).
And the WC is still O(n^2).
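To make the best and worst cases tangible, here is a small Python sketch of the early-exit bubble sort that counts comparisons. The function name and the counter are made up for illustration; the logic follows the flag-based variant described above.

```python
def bubble_sort_comparisons(arr):
    """Sort arr in place with early exit; return the number of comparisons made."""
    comparisons = 0
    for i in range(len(arr) - 1, 0, -1):
        swapped = False
        for j in range(i):
            comparisons += 1
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:
            break   # a pass without swaps means the array is sorted
    return comparisons
```

On an already sorted array of n elements this does n-1 comparisons, while a reversed array needs n(n-1)/2, matching the O(n) best case and O(n^2) worst case.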

Resources