Time complexity - algorithm

The problem is finding the majority element in an array.
I understand how this algorithm works, but I don't know why it has O(nlogn) as a time complexity.
a. Both return "no majority." Then neither half of the array has a majority element, and the combined array cannot have a majority element. Therefore, the call returns "no majority."
b. The right side is a majority, and the left isn't. The only possible majority for this level is with the value that formed a majority on the right half; therefore, just compare every element in the combined array and count the number of elements that are equal to this value. If it is a majority element then return that element, else return "no majority."
c. Same as above, but with the left returning a majority, and the right returning "no majority."
d. Both sub-calls return a majority element. Count the number of elements equal to both of the candidates for majority element. If either is a majority element in the combined array, then return it. Otherwise, return "no majority."
The top level simply returns either a majority element or that no majority element
exists in the same way.
Therefore, T(1) = 0 and T(n) = 2T(n/2) + 2n = O(nlogn)
I think that in every recursion it compares the candidate majority element to the whole array, which takes 2n, so:
T(n) = 2T(n/2) + 2n = 2(2T(n/4) + 2n) + 2n = ... = 2^k T(n/2^k) + 2n + 4n + 8n + ... + 2^k n = O(n^2)

T(n) = 2T(n/2) + 2n
The question is how many iterations it takes for n to get to 1.
We divide by 2 in each iteration, so we get the series: n, n/2, n/4, n/8, ..., n/(2^k)
So, let's find the k that brings us to 1 (the last iteration):
n/(2^k) = 1  =>  n = 2^k  =>  k = log(n)
So we get log(n) iterations.
Now, in each iteration (i.e., at each level of the recursion) the calls do a total of roughly 2n operations, so in the worst case let's say 2n.
So in total, we get log(n) iterations with O(n) operations each: nlog(n)
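If it helps to see this numerically, here is a tiny sketch (mine, assuming n is a power of two) that evaluates the recurrence directly and compares it with 2n*log2(n); the two agree exactly for powers of two.

from math import log2

def T(n):
    # the recurrence from the question: T(1) = 0, T(n) = 2*T(n/2) + 2n
    return 0 if n == 1 else 2 * T(n // 2) + 2 * n

for k in range(1, 11):
    n = 2 ** k
    print(n, T(n), int(2 * n * log2(n)))  # the last two columns are identical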

I'm not sure if I understand, but couldn't you just create a hash map, walk over the array, incrementing hash[value] at every step, then sort the hash map by count (m log m time complexity) and compare the top two elements? This would cost you O(n) + O(m log m) + O(1) = O(n + m log m), with n the size of the array and m the number of distinct elements in the array.
Am I mistaken here? Or ...?
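For reference, here is a rough sketch of the counting idea described in the comment above (just an illustration of that idea, not the divide-and-conquer algorithm from the question).

from collections import Counter

def majority_by_counting(arr):
    counts = Counter(arr)                    # O(n): one hash-map increment per element
    value, freq = counts.most_common(1)[0]   # the most frequent element
    return value if freq * 2 > len(arr) else None

print(majority_by_counting([3, 1, 3, 3, 2]))  # 3
print(majority_by_counting([1, 2, 3]))        # None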

When you do this recursively, you split the array in two at each level, make a call for each half, and then perform one of the tests a - d. Test a requires no looping; the other tests require looping through the entire array. On average you will loop through (0 + 1 + 1 + 1) / 4 = 3/4 of the array at each level of the recursion.
The number of levels in the recursion depends on the size of the array. As you split the array in half at each level, the number of levels will be log2(n).
So, the total work is (n * 3/4) * log2(n). As constant factors are irrelevant to the time complexity, and logarithms of different bases differ only by a constant factor, the complexity is O(n * log n).
Edit:
If someone is wondering about the algorithm, here's a C# implementation. :)
// requires "using System.Linq;" for Skip/Take/Count
private int? FindMajority(int[] arr, int start, int len) {
    if (len == 1) return arr[start];

    // split the range in two and find a majority candidate in each half
    int len1 = len / 2, len2 = len - len1;
    int? m1 = FindMajority(arr, start, len1);
    int? m2 = FindMajority(arr, start + len1, len2);

    // count each candidate over the whole range; a majority must occur more than len/2 times
    int cnt1 = m1.HasValue ? arr.Skip(start).Take(len).Count(n => n == m1.Value) : 0;
    if (cnt1 * 2 > len) return m1;
    int cnt2 = m2.HasValue ? arr.Skip(start).Take(len).Count(n => n == m2.Value) : 0;
    if (cnt2 * 2 > len) return m2;
    return null;
}

This guy has a lot of videos on recurrence relations and the different techniques you can use to solve them:
https://www.youtube.com/watch?v=TEzbkIggJfo&list=PLj68PAxAKGoyyBwi6qrfcsqE_4trSO1yL
Basically for this problem I would use the Master Theorem:
https://youtu.be/i5kTZof1LRY
T(1) = 0 and T(n) = 2T(n/2) + 2n
The Master Theorem form is T(n) = A·T(n/B) + O(n^D); in this case A = 2, B = 2, D = 1.
Since log_B(A) = log_2(2) = 1 = D, the Master Theorem gives O(nlogn).
You can also use another method to solve this (below); it would just take a little bit more time:
https://youtu.be/TEzbkIggJfo?list=PLj68PAxAKGoyyBwi6qrfcsqE_4trSO1yL
I hope this helps you out!

Related

Divide-and-conquer algorithms' property example

I'm having trouble with understanding the following property of divide-and-conquer algorithms.
A recursive method that divides a problem of size N into two independent
(nonempty) parts that it solves recursively calls itself less than N times.
The proof is
A recursive function that divides a problem of size N into two independent
(nonempty) parts that it solves recursively calls itself less than N times.
If the parts are one of size k and one of size N-k, then the total number of
recursive calls that we use is T(N) = T(k) + T(N-k) + 1, for N >= 1, with T(1) = 0.
The solution T(N) = N-1 is immediate by induction. If the sizes sum to a value
less than N, the proof that the number of calls is less than N-1 follows from
the same inductive argument.
I perfectly understand the formal proof above. What I don't understand is how this property is connected to the examples that are usually used to demonstrate the divide-and-conquer idea, particularly to the problem of finding the maximum:
static double max(double a[], int l, int r)
{
    if (l == r) return a[l];
    int m = (l+r)/2;
    double u = max(a, l, m);
    double v = max(a, m+1, r);
    if (u > v) return u; else return v;
}
In this case, when a consists of N=2 elements, max(0,1) will call itself 2 more times, that is max(0,0) and max(1,1), which equals N. If N=4, max(0,3) will call itself 2 times, and then each of those calls will also call max 2 times, so the total number of calls is 6 > N. What am I missing?
You're not missing anything. The theorem and its proof are wrong. The error is here:
T(n) = T(k) + T(n-k) + 1
The constant term of 1 should be 2, as the function makes one recursive call for each of the two pieces into which it divides the problem. The correct bound is 2N-1, rather than N. Hopefully, this error will be fixed in the next edition of your textbook, or at least in the errata.
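To see the corrected count concretely, here is a small experiment (my own sketch, not from the book or the answer) that counts the self-calls of the divide-and-conquer max for a few sizes; the counts come out to 2N - 2, consistent with the corrected bound of "less than 2N - 1".

calls = 0

def rec_max(a, l, r):
    global calls
    if l == r:
        return a[l]
    m = (l + r) // 2
    calls += 2                   # the two recursive calls made below
    u = rec_max(a, l, m)
    v = rec_max(a, m + 1, r)
    return u if u > v else v

for n in (2, 4, 8, 16):
    calls = 0
    rec_max(list(range(n)), 0, n - 1)
    print(n, calls)              # prints 2, 6, 14, 30, i.e. 2N - 2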

Time complexity analysis of function with recursion inside loop

I am trying to analyze the time complexity of the function below. This function is used to check if a string is made up of other strings.
set<string> s; // s has been initialized and stores all the strings
bool fun(string word) {
    int len = word.size();
    // something else that can also return true or false with O(1) complexity
    for (int i = 1; i <= len; ++i) {
        string prefix = word.substr(0, i);
        string suffix = word.substr(i);
        if (prefix in s && fun(suffix))
            return true;
        else
            return false;
    }
}
I think the time complexity is O(n) where n is the length of word (am I right?). But as the recursion is inside the loop, I don't know how to prove it.
Edit:
This is not valid C++ code (e.g., prefix in s). I am just showing the idea of the function, and want to know how to analyze its time complexity.
The way to analyze this is by developing a recurrence relation based on the length of the input and the (unknown) probability that a prefix is in s. Let's assume that the probability of a prefix being in s is given by some function pr(L) of the length L of the prefix. Let the complexity (number of operations) be given by T(len).
If len == 0 (word is the empty string), then T = 1. (The function is missing a final return statement after the loop, but we're assuming that the actual code is only a sketch of the idea, not what's actually executing).
For each loop iteration, denote the loop body complexity by T(len; i). If the prefix is not in s, then the body has constant complexity (T(len; i) = 1). This event has probability 1 - pr(i).
If the prefix is in s, then the function returns true or false according to the recursive call to fun(suffix), which has complexity T(len - i). This event has probability pr(i).
So for each value of i, the loop body complexity is:
T(len; i) = 1 * (1 - pr(i)) + T(len - i) * pr(i)
Finally (and this depends on the intended logic, not the posted code), we have
T(len) = sum i=1...len(T(len; i))
For simplicity, let's treat pr(i) as a constant function with value 0.5. Then the recursive relationship for T(len) is (up to a constant factor, which is unimportant for O() calculations):
T(len) = sum i=1...len(1 + T(len - i)) = len + sum i=0...len-1(T(i))
As noted above, the boundary condition is T(0) = 1. This can be solved by standard recursive function methods. Let's look at the first few terms:
len  T(len)
0    1
1    1 + 1 = 2
2    2 + (2 + 1) = 5
3    3 + (5 + 2 + 1) = 11
4    4 + (11 + 5 + 2 + 1) = 23
5    5 + (23 + 11 + 5 + 2 + 1) = 47
The pattern is clearly T(len) = 2 * T(len - 1) + 1. This corresponds to exponential complexity:
T(n) = O(2^n)
Of course, this result depends on the assumption we made about pr(i). (For instance, if pr(i) = 0 for all i, then T(n) = O(1). There would also be non-exponential growth if pr(i) had a maximum prefix length—pr(i) = 0 for all i > M for some M.) The assumption that pr(i) is independent of i is probably unrealistic, but this really depends on how s is populated.
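As a quick sanity check of the table above (a small sketch added here, not part of the original answer), the recurrence can be evaluated directly; note that the T(len) = 2*T(len - 1) + 1 pattern holds from len = 2 onwards.

# evaluate T(len) = len + sum(T(i) for i = 0..len-1), with T(0) = 1
T = [1]
for length in range(1, 8):
    T.append(length + sum(T))
print(T)                                          # [1, 2, 5, 11, 23, 47, 95, 191]
print([2 * T[i - 1] + 1 for i in range(2, 8)])    # [5, 11, 23, 47, 95, 191]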
Assuming that you've fixed the bugs others have noted, then the i values are the places that the string is being split (each i is the leftmost splitpoint, and then you recurse on everything to the right of i). This means that if you were to unwind the recursion, you are looking at up to n-1 different split points, and asking if each substring is a valid word. Things are ok if the beginning of word doesn't have a lot of elements from your set, since then you can skip the recursion. But in the worst case, prefix in s is always true, and you try every possible subset of the n-1 split points. This gives 2^{n-1} different splitting sets, multiplied by the length of each such set.
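To make the exponential behaviour concrete, here is a rough sketch (mine, not the posted code) of the intended logic, with the loop actually trying every split point; it counts calls for a worst-case input where every all-'a' prefix is in s but the word itself cannot be segmented.

calls = 0

def fun(word, s):
    global calls
    calls += 1
    if not word:
        return True                   # empty suffix: the whole word was segmented
    for i in range(1, len(word) + 1):
        if word[:i] in s and fun(word[i:], s):
            return True
    return False

for n in range(1, 15):
    word = "a" * n + "b"                      # the trailing 'b' forces full backtracking
    s = {"a" * k for k in range(1, n + 1)}    # every all-'a' prefix is in s
    calls = 0
    fun(word, s)
    print(n, calls)                           # the call count doubles with each extra 'a'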

Finding the median of an unsorted array

To find the median of an unsorted array, we can make a min-heap in O(nlogn) time for n elements, and then we can extract n/2 elements one by one to get the median. But this approach would take O(nlogn) time.
Can we do the same by some method in O(n) time? If we can, then please tell or suggest some method.
You can use the Median of Medians algorithm to find median of an unsorted array in linear time.
I have already upvoted the @dasblinkenlight answer since the Median of Medians algorithm in fact solves this problem in O(n) time. I only want to add that this problem can also be solved in O(n) time by using heaps. Building a heap can be done in O(n) time using the bottom-up approach. Take a look at the following article for a detailed explanation: Heap sort
Supposing that your array has N elements, you have to build two heaps: A MaxHeap that contains the first N/2 elements (or (N/2)+1 if N is odd) and a MinHeap that contains the remaining elements. If N is odd then your median is the maximum element of MaxHeap (O(1) by getting the max). If N is even, then your median is (MaxHeap.max()+MinHeap.min())/2 this takes O(1) also. Thus, the real cost of the whole operation is the heaps building operation which is O(n).
BTW this MaxHeap/MinHeap approach also works when you don't know the number of array elements beforehand (for example, if you have to solve the same problem for a stream of integers). You can see more details about how to solve this problem in the following article: Median Of integer streams
Quickselect works in O(n) expected time; it uses the same partition step as Quicksort.
The quickselect algorithm can find the k-th smallest element of an array in expected linear (O(n)) running time. Here is an implementation in Python:
import random

def partition(L, v):
    # split L into the elements smaller than, equal to, and bigger than the pivot v
    smaller, equal, bigger = [], [], []
    for val in L:
        if val < v:
            smaller.append(val)
        elif val > v:
            bigger.append(val)
        else:
            equal.append(val)
    return (smaller, equal, bigger)

def top_k(L, k):
    # return the k smallest elements of L (in no particular order)
    v = L[random.randrange(len(L))]
    (left, middle, right) = partition(L, v)
    if len(left) >= k:
        return top_k(left, k)
    if len(left) + len(middle) >= k:
        return left + middle[:k - len(left)]
    return left + middle + top_k(right, k - len(left) - len(middle))

def median(L):
    n = len(L)
    return max(top_k(L, n // 2 + 1))  # integer division; upper median for even n
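For example, calling the median function above:

print(median([7, 1, 9, 4, 3]))     # 4
print(median([7, 1, 9, 4, 3, 8]))  # 7 (the upper median for even length)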
No, there is no O(n) algorithm for finding the median of an arbitrary, unsorted dataset.
At least none that I am aware of in 2022. All answers offered here are variations/combinations using heaps, Median of Medians, Quickselect, all of which are strictly O(nlogn).
See https://en.wikipedia.org/wiki/Median_of_medians and http://cs.indstate.edu/~spitla/abstract2.pdf.
The problem appears to be confusion about how algorithms are classified, which is according to their limiting (worst case) behaviour. "On average" or "typically" O(n) with "worst case" O(f(n)) means (in textbook terms) "strictly O(f(n))". Quicksort, for example, is often discussed as being O(nlogn) (which is how it typically performs), although it is in fact an O(n^2) algorithm because there is always some pathological ordering of inputs for which it can do no better than n^2 comparisons.
It can be done using Quickselect Algorithm in O(n), do refer to Kth order statistics (randomized algorithms).
As Wikipedia says, Median-of-Medians is theoretically O(n), but it is not used in practice because the overhead of finding "good" pivots makes it too slow.
http://en.wikipedia.org/wiki/Selection_algorithm
Here is Java source for a Quickselect algorithm to find the k'th element in an array:
/**
 * Returns position of k'th smallest element of sub-list (k is 0-based).
 *
 * @param list list to search, whose sub-list may be shuffled before
 *             returning
 * @param lo first element of sub-list in list
 * @param hi just after last element of sub-list in list
 * @param k
 * @return position of k'th smallest element of (possibly shuffled) sub-list.
 */
static int select(double[] list, int lo, int hi, int k) {
    int n = hi - lo;
    if (n < 2)
        return lo;

    double pivot = list[lo + (k * 7919) % n]; // Pick a pseudo-random pivot

    // Triage list to [<pivot][=pivot][>pivot]
    int nLess = 0, nSame = 0, nMore = 0;
    int lo3 = lo;
    int hi3 = hi;
    while (lo3 < hi3) {
        double e = list[lo3];
        int cmp = compare(e, pivot);
        if (cmp < 0) {
            nLess++;
            lo3++;
        } else if (cmp > 0) {
            swap(list, lo3, --hi3);
            if (nSame > 0)
                swap(list, hi3, hi3 + nSame);
            nMore++;
        } else {
            nSame++;
            swap(list, lo3, --hi3);
        }
    }
    assert (nSame > 0);
    assert (nLess + nSame + nMore == n);
    assert (list[lo + nLess] == pivot);
    assert (list[hi - nMore - 1] == pivot);

    // Recurse into whichever of the three regions contains index k
    if (k >= n - nMore)
        return select(list, hi - nMore, hi, k - nLess - nSame);
    else if (k < nLess)
        return select(list, lo, lo + nLess, k);
    return lo + k;
}
I have not included the source of the compare and swap methods, so it's easy to change the code to work with Object[] instead of double[].
In practice, you can expect the above code to run in O(N) time.
Let the problem be: finding the Kth largest element in an unsorted array.
Divide the array into n/5 groups, each consisting of 5 elements.
Now a1, a2, a3, ..., a(n/5) represent the medians of each group.
x = median of the elements a1, a2, ..., a(n/5).
Now if k < n/2 then we can remove the largest, 2nd largest and 3rd largest element of the groups whose median is greater than x. We can now call the function again on the remaining 7n/10 elements, finding the kth largest value.
Else if k > n/2 then we can remove the smallest, 2nd smallest and 3rd smallest element of the groups whose median is smaller than x. We can now call the function again on the remaining 7n/10 elements, finding the (k - 3n/10)th largest value.
Time Complexity Analysis:
T(n) time complexity to find the kth largest in an array of size n.
T(n) = T(n/5) + T(7n/10) + O(n)
if you solve this you will find out that T(n) is actually O(n)
n/5 + 7n/10 = 9n/10 < n
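For illustration, here is a compact sketch of the same idea, written here in terms of the k-th smallest element (0-indexed); the names and details are mine, not taken from the answer above.

def mom_select(lst, k):
    if len(lst) <= 5:
        return sorted(lst)[k]
    # 1. split into groups of 5 and take the median of each group
    groups = [lst[i:i + 5] for i in range(0, len(lst), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # 2. recursively find the median of the medians: the pivot x
    x = mom_select(medians, len(medians) // 2)
    # 3. partition around x and recurse only into the part containing rank k
    less = [v for v in lst if v < x]
    equal = [v for v in lst if v == x]
    greater = [v for v in lst if v > x]
    if k < len(less):
        return mom_select(less, k)
    if k < len(less) + len(equal):
        return x
    return mom_select(greater, k - len(less) - len(equal))

def median(lst):
    return mom_select(lst, (len(lst) - 1) // 2)   # lower median

print(median([7, 1, 9, 4, 3, 8, 2, 6, 5]))        # 5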
Notice that building a heap actually takes O(n), not O(nlogn); you can check this using amortized analysis or simply look it up on YouTube.
Extract-Min takes O(logn); therefore, extracting n/2 elements will take (n/2)·O(logn) = O(nlogn) time.
Regarding your question, you can simply look at Median of Medians.

Complexity analysis of SelectionSort

Here's a SelectionSort routine I wrote. Is my complexity analysis that follows correct?
public static void selectionSort(int[] numbers) {
    // Iterate over each cell starting from the last one and working backwards
    for (int i = numbers.length - 1; i >= 1; i--)
    {
        // Always set the max pos to 0 at the start of each iteration
        int maxPos = 0;
        // Start at cell 1 and iterate up to the cell just before i
        for (int j = 1; j < i; j++)
        {
            // If the number in the current cell is larger than the one in maxPos,
            // set a new maxPos
            if (numbers[j] > numbers[maxPos])
            {
                maxPos = j;
            }
        }
        // We now have the position of the maximum number. If the maximum number is greater
        // than the number in the current cell, swap them
        if (numbers[maxPos] > numbers[i])
        {
            int temp = numbers[i];
            numbers[i] = numbers[maxPos];
            numbers[maxPos] = temp;
        }
    }
}
Complexity Analysis
Outer loop (comparison & assignment): 2 ops performed n times = 2n ops
Assigning maxPos: n ops
Inner loop (comparison & assignment): 2 ops performed n² times = 2n² ops
Comparison of array elements (2 array references & a comparison): 3n² ops
Assigning new maxPos: n² ops
Comparison of array elements (2 array references & a comparison): 3n² ops
Assignment & array reference: 2n² ops
Assignment & 2 array references: 3n² ops
Assignment & array reference: 2n² ops
Total number of primitive operations:
2n + n + 2n² + 3n² + n² + 3n² + 2n² + 3n² + 2n² = 16n² + 3n
Leading to Big O(n²)
Does that look correct? Particularly when it comes to the inner loop and the stuff inside it...
Yes, O(N²) is correct.
Edit: It's a little hard to guess at exactly what they may want as far as "from first principles" goes, but I would guess that they're looking for (in essence) something on the order of a proof (or at least indication) that the basic definition of big-O is met:
there exist positive constants c and n0 such that:
0 ≤ f(n) ≤ cg(n) for all n ≥ n0.
So, the next step after finding 16N² + 3N would be to find correct values for c and n0. At first glance, c = 17 and n0 = 3 work, since 16N² + 3N ≤ 17N² for all N ≥ 3.
Generally it is pointless (and incorrect) to add up actual operations, because operations take various numbers of processor cycles, some of them dereference values from memory which takes a lot more time, then it gets even more complex because compilers optimize code, then you have stuff like cache locality, etc, so unless you know really, really well how everything works underneath, you are adding up apples and oranges. You can't just add up "j < i", "j++", and "numbers[i] = numbers[maxPos]" as if they were equal, and you don't need to do so - for the purpose of complexity analysis, a constant time block is a constant time block. You are not doing low level code optimization.
The complexity is indeed N^2, but your coefficients are meaningless.
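If you do want a number, counting a single representative operation is enough to see the growth; here is a small sketch (mine, mirroring the loops above) that counts only the element comparisons, which come out to n(n-1)/2, i.e. proportional to n².

def selection_sort_comparisons(n):
    comparisons = 0
    for i in range(n - 1, 0, -1):   # outer loop, as in the Java code above
        for j in range(1, i):       # inner scan for the max position
            comparisons += 1        # numbers[j] > numbers[maxPos]
        comparisons += 1            # numbers[maxPos] > numbers[i]
    return comparisons

for n in (10, 100, 1000):
    print(n, selection_sort_comparisons(n), n * (n - 1) // 2)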

Exactly how many comparisons does merge sort make?

I have read that quicksort is much faster than mergesort in practice, and the reason for this is the hidden constant.
Well, the solution of the recurrence for randomized quicksort is 2n·ln(n) ≈ 1.39·n·log2(n), which means that the constant in quicksort is 1.39.
But what about mergesort? What is the constant in mergesort?
Let's see if we can work this out!
In merge sort, at each level of the recursion, we do the following:
Split the array in half.
Recursively sort each half.
Use the merge algorithm to combine the two halves together.
So how many comparisons are done at each step? Well, the divide step doesn't make any comparisons; it just splits the array in half. Step 2 doesn't (directly) make any comparisons; all comparisons are done by recursive calls. In step 3, we have two arrays of size n/2 and need to merge them. This requires at most n comparisons, since each step of the merge algorithm does a comparison and then consumes some array element, so we can't do more than n comparisons.
Combining this together, we get the following recurrence:
C(1) = 0
C(n) = 2C(n / 2) + n
(As mentioned in the comments, the linear term is more precisely (n - 1), though this doesn’t change the overall conclusion. We’ll use the above recurrence as an upper bound.)
To simplify this, let's define n = 2^k and rewrite this recurrence in terms of k:
C'(0) = 0
C'(k) = 2C'(k - 1) + 2^k
The first few terms here are 0, 2, 8, 24, ... . This looks something like k·2^k, and we can prove this by induction. As our base case, when k = 0, the first term is 0, and the value of k·2^k is also 0. For the inductive step, assume the claim holds for some k and consider k + 1. Then the value is 2(k·2^k) + 2^(k+1) = k·2^(k+1) + 2^(k+1) = (k + 1)·2^(k+1), so the claim holds for k + 1, completing the induction. Thus the value of C'(k) is k·2^k. Since n = 2^k, this means that, assuming that n is a perfect power of two, we have that the number of comparisons made is
C(n) = n lg n
Impressively, this is better than quicksort! So why on earth is quicksort faster than merge sort? This has to do with other factors that have nothing to do with the number of comparisons made. Primarily, since quicksort works in place while merge sort works out of place, the locality of reference is not nearly as good in merge sort as it is in quicksort. This is such a huge factor that quicksort ends up being much, much better than merge sort in practice, since the cost of a cache miss is pretty huge. Additionally, the time required to sort an array doesn't just take the number of comparisons into account. Other factors like the number of times each array element is moved can also be important. For example, in merge sort we need to allocate space for the buffered elements, move the elements so that they can be merged, then merge back into the array. These moves aren't counted in our analysis, but they definitely add up. Compare this to quicksort's partitioning step, which moves each array element exactly once and stays within the original array. These extra factors, not the number of comparisons made, dominate the algorithm's runtime.
This analysis is a bit less precise than the optimal one, but Wikipedia confirms that the analysis is roughly n lg n and that this is indeed fewer comparisons than quicksort's average case.
Hope this helps!
In the worst case and assuming a straight-forward implementation, the number of comparisons to sort n elements is
n·⌈lg n⌉ − 2^⌈lg n⌉ + 1
where lg n indicates the base-2 logarithm of n.
This result can be found in the corresponding Wikipedia article or recent editions of The Art of Computer Programming by Donald Knuth, and I just wrote down a proof for this answer.
Merging two sorted arrays (or lists) of size k resp. m takes k+m-1 comparisons at most, min{k,m} at best. (After each comparison, we can write one value to the target, when one of the two is exhausted, no more comparisons are necessary.)
Let C(n) be the worst case number of comparisons for a mergesort of an array (a list) of n elements.
Then we have C(1) = 0, C(2) = 1, pretty obviously. Further, we have the recurrence
C(n) = C(floor(n/2)) + C(ceiling(n/2)) + (n-1)
An easy induction shows
C(n) <= n*log_2 n
On the other hand, it's easy to see that we can come arbitrarily close to the bound (for every ε > 0, we can construct cases needing more than (1-ε)*n*log_2 n comparisons), so the constant for mergesort is 1.
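A quick numeric check (mine, not part of the answer) that the worst-case recurrence matches the closed form quoted above:

from functools import lru_cache
from math import ceil, log2

@lru_cache(maxsize=None)
def C(n):
    # worst-case comparisons: C(n) = C(floor(n/2)) + C(ceil(n/2)) + (n - 1), C(1) = 0
    if n <= 1:
        return 0
    return C(n // 2) + C(n - n // 2) + (n - 1)

for n in range(2, 20):
    closed = n * ceil(log2(n)) - 2 ** ceil(log2(n)) + 1
    assert C(n) == closed, (n, C(n), closed)
print("recurrence matches n*ceil(lg n) - 2^ceil(lg n) + 1 for n = 2..19")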
Merge sort is O(n log n), and in the worst case (for the number of comparisons) it performs a comparison at essentially every step of each merge.
Quicksort, on the other hand, is O(n^2) in the worst case.
C++ program to count the number of comparisons in merge sort.
First the program will sort the given array, then it will show the number of comparisons.
#include <iostream>
#include <vector>
using namespace std;

int comparisonCount = 0; /* counts the number of comparisons */

int merge(int arr[], int l, int m, int r)
{
    int i = l;      /* index into the left subarray */
    int j = m + 1;  /* index into the right subarray */
    int k = l;      /* index into the temporary array */
    vector<int> temp(r + 1);
    while (i <= m && j <= r)
    {
        if (arr[i] <= arr[j])
        {
            temp[k] = arr[i];
            i++;
        }
        else
        {
            temp[k] = arr[j];
            j++;
        }
        k++;
        comparisonCount++;
    }
    while (i <= m) /* copy what is left of the left subarray */
    {
        temp[k] = arr[i];
        i++;
        k++;
    }
    while (j <= r) /* copy what is left of the right subarray */
    {
        temp[k] = arr[j];
        j++;
        k++;
    }
    for (int p = l; p <= r; p++)
    {
        arr[p] = temp[p];
    }
    return comparisonCount;
}

int mergesort(int arr[], int l, int r)
{
    if (l < r)
    {
        int m = (l + r) / 2;
        mergesort(arr, l, m);
        mergesort(arr, m + 1, r);
        merge(arr, l, m, r);
    }
    return comparisonCount;
}

int main()
{
    int size;
    cout << "Enter the size of the array" << endl;
    cin >> size;
    vector<int> myarr(size);
    cout << "Enter the elements of the array" << endl;
    for (int i = 0; i < size; i++)
    {
        cin >> myarr[i];
    }
    cout << "Elements of the array before sorting:" << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    int c = mergesort(myarr.data(), 0, size - 1);
    cout << "Elements of the array after sorting:" << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    cout << "Number of comparisons while sorting the given array: " << c << endl;
    return 0;
}
I am assuming the reader knows merge sort. Comparisons happen only when two sorted arrays are merged. For simplicity, assume n is a power of 2. Merging two arrays of size n/2 takes at most (n - 1) comparisons in the worst case; the -1 appears because the last element left over in a merge does not require a comparison. First count the total comparisons as if each merge cost the full subarray size, then correct for the -1 parts. The number of levels of merging is log2(n) (picture it as a tree structure). At each level there are about n comparisons (minus some number, due to the -1 parts), so the total is n*log2(n) minus a correction term. The correction term is not proportional to n*log2(n); it is one -1 per merge, i.e. 1 + 2 + 4 + 8 + ... + n/2 = n - 1.
Number of total comparisons in merge sort = n*log2(n) - (n - 1).
So, your constant is 1.
