Can {2,3,4,5,...,n,1} be called a worst case of bubble sort?

It's mentioned in many places that bubble sort's worst case occurs when the array is sorted in reverse order. But we can get $\Theta(n^2)$ complexity in other cases too.
I have written the following bubble sort code in C:
#include <stdbool.h>

void swap(int *a, int *b)
{
    int temp = *a;
    *a = *b;
    *b = temp;
}

void bubbleSort(int *arr, int size)
{
    for (int i = size; i > 1; i--) // i is the size of the unsorted part of the array
    {
        bool swapped = false;
        for (int j = 0; j <= i - 2; j++)
        {
            if (arr[j] > arr[j + 1])
            {
                swap(arr + j, arr + j + 1);
                swapped = true;
            }
        }
        if (swapped == false)
        {
            break;
        }
    }
}
Now, for the worst case, I was thinking that we should never hit the break, which means every pass must perform at least one swap. So I started constructing example arrays, beginning with size = 2. One array satisfying the condition is {2,3,4,5,1}.
For any such array I get $\Theta(n^2)$ complexity, which is the same as the worst-case complexity described elsewhere.
But is this array really a worst case? In the descending-order array we have one swap for each comparison, but in my example there is only a single swap per pass. So the actual run time would be lower for my example (though both examples make the same number of comparisons).
I get that big theta is just an asymptotic approximation, but is the "worst case" defined with respect to that asymptotic approximation or to the actual run time?
Edit: In the book "Data Structures with C" (Schaum's Outline Series), on page 2.15, it is written:
The time is measured by counting the number of key operations - in sorting and searching algorithms, the number of comparisons.
Both examples above have the same number of comparisons, so should they both be worst cases?
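To make this concrete, here is a small instrumented variant of the code above (my own sketch, not part of the original question) that counts comparisons and swaps for a reversed array and for the rotated array {2,3,4,5,1}:

#include <stdio.h>
#include <stdbool.h>

static long comparisons, swaps;

static void swapCounted(int *a, int *b) { int t = *a; *a = *b; *b = t; swaps++; }

static void bubbleSortCounted(int *arr, int size)
{
    for (int i = size; i > 1; i--)           // i is the size of the unsorted prefix
    {
        bool swapped = false;
        for (int j = 0; j <= i - 2; j++)
        {
            comparisons++;
            if (arr[j] > arr[j + 1])
            {
                swapCounted(arr + j, arr + j + 1);
                swapped = true;
            }
        }
        if (!swapped)
            break;
    }
}

int main(void)
{
    int reversed[] = {5, 4, 3, 2, 1};
    int rotated[]  = {2, 3, 4, 5, 1};

    comparisons = swaps = 0;
    bubbleSortCounted(reversed, 5);
    printf("reversed: %ld comparisons, %ld swaps\n", comparisons, swaps); // 10 and 10

    comparisons = swaps = 0;
    bubbleSortCounted(rotated, 5);
    printf("rotated:  %ld comparisons, %ld swaps\n", comparisons, swaps); // 10 and 4
    return 0;
}

Both inputs defeat the early exit, so for n = 5 each costs 10 = n(n-1)/2 comparisons, but the reversed array also pays for 10 swaps while the rotated one needs only n - 1 = 4.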

Related

What is the algorithmic time complexity, if the exact count is ∛(n²)?

I have calculated the answer to be n raised to the power 2/3. Could anyone tell me what the worst-case Big O() is?
Whether it is the worst case or not depends on the algorithm you are using.
But in general, if the number of operations is n^(2/3), then the complexity in O-notation is O(n^(2/3)).
Let me explain this in a bit more detail (so that we can get rid of the "general" word and give a definitive answer).
Consider a simple algorithm that finds a specific number in an array of n elements.
If your algorithm is something like this:
find(arr, number) {
    boolean found = false;
    for (i = 0; i < arr.length; i++) {
        if (arr[i] == number) {
            found = true;
        }
    }
    return found;
}
The time complexity to find a number in an array using the above algorithm is O(n), always. By "always" I mean that whatever the input is, the above algorithm will perform n iterations (assuming the length of the array is n).
Now compare and contrast this algorithm with the following:
find(arr, number) {
    for (i = 0; i < arr.length; i++) {
        if (arr[i] == number) {
            return true;
        }
    }
    return false;
}
Now the time complexity depends on the contents of the array. (If you have an array with 10^8 elements and the first element matches the element you are searching for, then you are done and can return instantly without iterating over the whole array.) So the worst-case complexity here becomes O(n), but the best case is O(1).
So it basically depends on how your algorithm works. I assume that by "exact" you mean you have the first version of find described above, and if that is the case then yes, the worst-case time complexity is O(n^(2/3)).
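For intuition, here is a toy C loop (my own illustration, not from the answer) whose operation count is roughly ∛(n²), and which is therefore O(n^(2/3)):

#include <stdio.h>

int main(void)
{
    long long n = 1000000;   /* problem size (arbitrary choice) */
    long long steps = 0;

    /* The body runs while i^3 <= n^2, i.e. roughly cbrt(n^2) = n^(2/3) times. */
    for (long long i = 1; i * i * i <= n * n; i++)
        steps++;

    printf("n = %lld, steps = %lld (n^(2/3) ~ 10000)\n", n, steps);
    return 0;
}

Here the step count depends only on n, so best, average, and worst case coincide, just as with the first version of find above.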

Big O - is n always the size of the input?

I made up my own interview-style problem, and have a question on the big O of my solution. I will state the problem and my solution below, but first let me say that the obvious solution involves a nested loop and is O(n²). I believe I found an O(n) solution, but then I realized it depends not only on the size of the input, but on the largest value in the input. It seems like my running time of O(n) is only a technicality, and that it could easily run in O(n²) time or worse in real life.
The problem is:
For each item in a given array of positive integers, print all the other items in the array that are multiples of the current item.
Example Input:
[2 9 6 8 3]
Example Output:
2: 6 8
9:
6:
8:
3: 9 6
My solution (in C#):
private static void PrintAllDivisibleBy(int[] arr)
{
    Dictionary<int, bool> dic = new Dictionary<int, bool>();
    if (arr == null || arr.Length < 2)
        return;
    int max = arr[0];
    for (int i = 0; i < arr.Length; i++)
    {
        if (arr[i] > max)
            max = arr[i];
        dic[arr[i]] = true;
    }
    for (int i = 0; i < arr.Length; i++)
    {
        Console.Write("{0}: ", arr[i]);
        int multiplier = 2;
        while (true)
        {
            int product = multiplier * arr[i];
            if (dic.ContainsKey(product))
                Console.Write("{0} ", product);
            if (product >= max)
                break;
            multiplier++;
        }
        Console.WriteLine();
    }
}
So, if two of the array items are 1 and n, where n is the array length, the inner while loop will run n times, making this equivalent to O(n²). But since the performance depends on the size of the input values, not the length of the list, that makes it O(n), right?
Would you consider this a true O(n) solution? Is it only O(n) due to technicalities, but slower in real life?
Good question! The answer is that, no, n is not always the size of the input: You can't really talk about O(n) without defining what the n means, but often people use imprecise language and imply that n is "the most obvious thing that scales here". Technically we should usually say things like "This sort algorithm performs a number of comparisons that is O(n) in the number of elements in the list": being specific about both what n is, and what quantity we are measuring (comparisons).
If you have an algorithm that depends on the product of two different things (here, the length of the list and the largest element in it), the proper way to express that is in the form O(m*n), and then define what m and n are for your context. So, we could say that your algorithm performs O(m*n) multiplications, where m is the length of the list and n is the largest item in the list.
An algorithm is O(n) when you have to iterate over n elements and perform some constant-time operation in each iteration. The inner while loop of your algorithm is not constant time, as it depends on the magnitude of the largest number in your array.
Your algorithm's best-case run time is O(n). This is the case when all n numbers are the same.
Your algorithm's worst-case run time is O(k*n), where k is the maximum possible int value on your machine, if you really insist on putting an upper bound on k. For a 32-bit int the maximum value is 2,147,483,647. You can argue that this k is a constant, but this constant is clearly
not fixed for every input array; and,
not negligible.
Would you consider this a true O(n) solution?
The runtime actually is O(nm), where m is the maximum element in arr. If the elements in your array are bounded by a constant, you can consider the algorithm to be O(n).
Can you improve the runtime? Here's what else you can do. First, notice that you can ensure the elements are distinct (compress the array into a hashmap that stores how many times each element occurs in the array). Then the total work is max/a[0] + max/a[1] + max/a[2] + ... <= max/1 + max/2 + ... + max/max = O(max*log(max)) (assuming the array arr is sorted, so that its distinct values satisfy a[i] >= i+1). If you combine this with the obvious O(n^2) algorithm, you get an O(min(n^2, max*log(max))) algorithm.
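A minimal C sketch of this "enumerate multiples over a presence table" idea (my own illustration; it assumes all values are positive and small enough that a table of size max+1 fits in memory):

#include <stdio.h>
#include <stdlib.h>

/* Sketch of the multiples-enumeration idea described above.
   Duplicates are tolerated but weaken the O(max log max) bound. */
void print_all_divisible_by(const int *arr, int n)
{
    int max = 0;
    for (int i = 0; i < n; i++)
        if (arr[i] > max) max = arr[i];

    char *present = calloc((size_t)max + 1, 1);
    for (int i = 0; i < n; i++) present[arr[i]] = 1;

    for (int i = 0; i < n; i++)
    {
        printf("%d:", arr[i]);
        /* Over distinct values the total inner work is
           max/1 + max/2 + ... + max/max = O(max * log(max)). */
        for (int p = 2 * arr[i]; p <= max; p += arr[i])
            if (present[p]) printf(" %d", p);
        printf("\n");
    }
    free(present);
}

int main(void)
{
    int arr[] = {2, 9, 6, 8, 3};
    print_all_divisible_by(arr, 5); /* the example input from the question */
    return 0;
}

The output order differs from the example (multiples are printed in increasing order), but the counting argument is the same.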

Complexity of Multiple Binary Search and comparison algorithms

Hello,
I have two questions regarding complexity:
1) What would be the best/worst complexity of a binary search called multiple times? In other words, a binary search used to compare two arrays. I believe one of them is O(mlog(n)) (no matches between arrays). However, since I can't figure out the other, I can't tell if this is the best or worst.
2) For the following segment of code: two arrays A of size m and B of size n, where m >= n, and where A and B have each gone through a bubble sort. In addition, no elements in A are repeated, and no elements in B are repeated. However, A and B may have common elements. The following pseudo-code computes the number of common elements:
count = 0
for i from 0 to m:
    if i < n:
        for j from 0 to n:
            if A[i] == B[j]:
                count += 1
                break
    else:
        break
I seem to come up with the following complexities for the sort and comparison:
bubble sort has worst case O(n^2) and best case O(n);
the search has two bounds (I think): situation 1: no matches, O(mn); situation 2: the first n elements of array A (size m) match all of the elements of array B (size n) -> O(n^2).
Complexity possibilities (sort and search): O(n^2+mn), O(n+mn), O(n^2+n)=O(n^2). The best seems to be sorted and no matches, O(mn+n). The worst seems to be not sorted and no matches, O(n^2+mn). Does this seem valid?
Thank you.
P.S. Sorry the whole thing is in block format; it would not let me submit without doing so.
If the arrays are sorted, why use binary search (or an m:n all-combinations comparison) at all, when you can do both operations in O(n) <= O(n+m-c) <= O(n+m)?
Example of how to do it for question 2:
int c = 0; // common elements
for (int i = 0, j = 0; (i < m) && (j < n); )
{
    if      (a[i] == b[j]) { c++; i++; j++; }
    else if (a[i] <  b[j]) i++;
    else                   j++;
}
Just add some ifs for the ends of the arrays if needed (I am not sure it does not ignore the last common element, and too lazy to analyze it), but anyway you can see how to approach this kind of problem faster.
The same can be done for the comparison, but you did not provide any details on its purpose ...
Now back to your questions:
If you take each element of array A and binary-search for it in B, then yes, the worst-case complexity is O(m·log(n)).
But so is the best case, because for every element of A you binary-search the whole of B without reusing the last found index. (A sketch of this per-element binary search appears after the code update below.)
By the way, if you instead iterated over B and searched in A, then O(n·log(m)) is better when m > n.
Your code for computing the common elements is confusing: why is there if (i < n)? What if you have A={1,2,3,4,5} and B={3,5}? Then this if will reject both common elements. My opinion is that it should be removed, and after the removal the complexity is O(m·n) for all cases.
[edit1] code update to handle the edge cases
int c = 0; // common elements
if (m && n) for (int i = 0, j = 0; ; ) // loop only if both arrays are non-empty
{
    if (a[i] == b[j]) { c++; if ((i == m - 1) || (j == n - 1)) break; i++; j++; }
    else if (a[i] <  b[j]) { if (i == m - 1) break; i++; } // no larger a[i] left, so no more matches
    else                   { if (j == n - 1) break; j++; } // no larger b[j] left, so no more matches
}
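For completeness, here is a hedged C sketch of the O(m·log(n)) approach from question 1 (my own illustration, assuming both arrays are sorted in ascending order):

/* Returns 1 if key is present in the sorted array b of length n. */
static int binary_search(const int *b, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (b[mid] == key) return 1;
        if (b[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;
    }
    return 0;
}

/* Counts common elements by binary-searching each a[i] in b: O(m log n) comparisons. */
int count_common_binsearch(const int *a, int m, const int *b, int n)
{
    int count = 0;
    for (int i = 0; i < m; i++)
        if (binary_search(b, n, a[i]))
            count++;
    return count;
}

Note that this always performs all m searches, which is why the best and worst cases coincide at O(m·log(n)); the merge-style walk above is cheaper when both arrays are already sorted.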

Why should Insertion Sort be used after threshold crossover in Merge Sort

I have read everywhere that for divide-and-conquer sorting algorithms like Merge-Sort and Quicksort, instead of recursing until only a single element is left, it is better to switch to Insertion-Sort when a certain threshold, say 30 elements, is reached. That is fine, but why only Insertion-Sort? Why not Bubble-Sort or Selection-Sort, both of which have similar O(N^2) performance? Insertion-Sort should only come in handy when many elements are pre-sorted (although that advantage should also apply to Bubble-Sort), but otherwise, why should it be more efficient than the other two?
And secondly, at this link, in the 2nd answer and its accompanying comments, it says that O(N log N) performs poorly compared to O(N^2) up to a certain N. How come? N^2 should always perform worse than N log N, since N > log N for all N >= 2, right?
If you bail out of each branch of your divide-and-conquer Quicksort when it hits the threshold, your data looks like this:
[the least 30-ish elements, not in order] [the next 30-ish ] ... [last 30-ish]
Insertion sort has the rather pleasing property that you can call it just once on that whole array, and it performs essentially the same as it does if you call it once for each block of 30. So instead of calling it in your loop, you have the option to call it last. This might not be faster, especially since it pulls the whole data through cache an extra time, but depending how the code is structured it might be convenient.
Neither bubble sort nor selection sort has this property, so I think the answer might quite simply be "convenience". If someone suspects selection sort might be better then the burden of proof lies on them to "prove" that it's faster.
Note that this use of insertion sort also has a drawback -- if you do it this way and there's a bug in your partition code then provided it doesn't lose any elements, just partition them incorrectly, you'll never notice.
Edit: apparently this modification is by Sedgewick, who wrote his PhD on QuickSort in 1975. It was analyzed more recently by Musser (the inventor of Introsort). Reference https://en.wikipedia.org/wiki/Introsort
Musser also considered the effect on caches of Sedgewick's delayed small sorting, where small ranges are sorted at the end in a single pass of insertion sort. He reported that it could double the number of cache misses, but that its performance with double-ended queues was significantly better and should be retained for template libraries, in part because the gain in other cases from doing the sorts immediately was not great.
In any case, I don't think the general advice is "whatever you do, don't use selection sort". The advice is, "insertion sort beats Quicksort for inputs up to a surprisingly non-tiny size", and this is pretty easy to prove to yourself when you're implementing a Quicksort. If you come up with another sort that demonstrably beats insertion sort on the same small arrays, none of those academic sources is telling you not to use it. I suppose the surprise is that the advice is consistently towards insertion sort, rather than each source choosing its own favorite (introductory teachers have a frankly astonishing fondness for bubble sort -- I wouldn't mind if I never hear of it again). Insertion sort is generally thought of as "the right answer" for small data. The issue isn't whether it "should be" fast, it's whether it actually is or not, and I've never particularly noticed any benchmarks dispelling this idea.
One place to look for such data would be in the development and adoption of Timsort. I'm pretty sure Tim Peters chose insertion for a reason: he wasn't offering general advice, he was optimizing a library for real use.
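To make the "bail out at a threshold, then insertion-sort once at the end" idea above concrete, here is a hedged C sketch (the threshold value and helper names are my own choices, not taken from any particular library or from the answer):

#define THRESHOLD 30   /* arbitrary cutoff; real libraries tune this empirically */

static void insertion_sort_range(int *v, int lo, int hi)   /* sorts v[lo..hi] */
{
    for (int i = lo + 1; i <= hi; i++)
    {
        int tmp = v[i], j = i;
        while (j > lo && v[j - 1] > tmp) { v[j] = v[j - 1]; j--; }
        v[j] = tmp;
    }
}

static int partition(int *v, int lo, int hi)   /* Lomuto partition, last element as pivot */
{
    int pivot = v[hi], i = lo;
    for (int j = lo; j < hi; j++)
        if (v[j] < pivot) { int t = v[i]; v[i] = v[j]; v[j] = t; i++; }
    int t = v[i]; v[i] = v[hi]; v[hi] = t;
    return i;
}

static void quicksort_coarse(int *v, int lo, int hi)
{
    if (hi - lo + 1 <= THRESHOLD) return;      /* leave small blocks unsorted */
    int p = partition(v, lo, hi);
    quicksort_coarse(v, lo, p - 1);
    quicksort_coarse(v, p + 1, hi);
}

void hybrid_sort(int *v, int n)
{
    quicksort_coarse(v, 0, n - 1);       /* each element ends up in a block of at most THRESHOLD positions */
    insertion_sort_range(v, 0, n - 1);   /* one final pass finishes in O(n * THRESHOLD) */
}

The final insertion-sort pass is cheap because partitioning has already placed every element within a block of at most THRESHOLD positions of its final location, so no element has far to move.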
Insertion sort is faster in practice than bubble sort, at least. Their asymptotic running time is the same, but insertion sort has better constants (fewer/cheaper operations per iteration). Most notably, it requires only a linear number of swaps of pairs of elements, and in each inner loop it performs comparisons between each of n/2 elements and a "fixed" element that can be stored in a register (while bubble sort has to read values from memory). I.e., insertion sort does less work in its inner loop than bubble sort.
The answer claims that 10000 n lg n > 10 n² for "reasonable" n. This is true up to about 14000 elements.
I am surprised no-one's mentioned the simple fact that insertion sort is simply much faster for "almost" sorted data. That's the reason it's used.
The easier one first: why insertion sort over selection sort? Because insertion sort is in O(n) for optimal input sequences, i.e. if the sequence is already sorted. Selection sort is always in O(n^2).
Why insertion sort over bubble sort? Both need only a single pass for already sorted input sequences, but insertion sort degrades better. To be more specific, insertion sort usually performs better than bubble sort when there is a small number of inversions. (Source) This can be explained because bubble sort always iterates over N-i elements in pass i, while insertion sort works more like a "find" and only needs to iterate over (N-i)/2 elements on average (in pass N-i-1) to find the insertion position. So, insertion sort is expected to be about two times faster than bubble sort on average.
Here is an empirical demonstration that insertion sort is faster than bubble sort (for 30 elements, on my machine, with the attached implementation, using Java...).
I ran the attached code and found that bubble sort ran in 6338.515 ns on average, while insertion sort took 3601.0 ns.
I used the Wilcoxon signed-rank test to check the probability that this is a mistake and they should actually be the same - but the result is below the range of numerical error (effectively P_VALUE ~= 0).
private static void swap(int[] arr, int i, int j) {
    int temp = arr[i];
    arr[i] = arr[j];
    arr[j] = temp;
}

public static void insertionSort(int[] arr) {
    for (int i = 1; i < arr.length; i++) {
        int j = i;
        while (j > 0 && arr[j-1] > arr[j]) {
            swap(arr, j, j-1);
            j--;
        }
    }
}

public static void bubbleSort(int[] arr) {
    for (int i = 0; i < arr.length; i++) {
        boolean bool = false;
        for (int j = 0; j < arr.length - i; j++) {
            if (j + 1 < arr.length && arr[j] > arr[j+1]) {
                bool = true;
                swap(arr, j, j+1);
            }
        }
        if (!bool) break;
    }
}

public static void main(String... args) throws Exception {
    Random r = new Random(1);
    int SIZE = 30;
    int N = 1000;
    int[] arr = new int[SIZE];
    int[] millisBubble = new int[N];
    int[] millisInsertion = new int[N];
    System.out.println("start");
    // warm up:
    for (int t = 0; t < 100; t++) {
        insertionSort(arr);
    }
    for (int t = 0; t < N; t++) {
        arr = generateRandom(r, SIZE);
        int[] tempArr = Arrays.copyOf(arr, arr.length);

        long start = System.nanoTime();
        insertionSort(tempArr);
        millisInsertion[t] = (int) (System.nanoTime() - start);

        tempArr = Arrays.copyOf(arr, arr.length);

        start = System.nanoTime();
        bubbleSort(tempArr);
        millisBubble[t] = (int) (System.nanoTime() - start);
    }
    int sum1 = 0;
    for (int x : millisBubble) {
        System.out.println(x);
        sum1 += x;
    }
    System.out.println("end of bubble. AVG = " + ((double) sum1) / millisBubble.length);
    int sum2 = 0;
    for (int x : millisInsertion) {
        System.out.println(x);
        sum2 += x;
    }
    System.out.println("end of insertion. AVG = " + ((double) sum2) / millisInsertion.length);
    System.out.println("bubble took " + ((double) sum1) / millisBubble.length
            + " while insertion took " + ((double) sum2) / millisBubble.length);
}

private static int[] generateRandom(Random r, int size) {
    int[] arr = new int[size];
    for (int i = 0; i < size; i++)
        arr[i] = r.nextInt(size);
    return arr;
}
EDIT:
(1) Optimizing the bubble sort (updated above) reduced its total time to 6043.806 ns, not enough to make a significant change. The Wilcoxon test is still conclusive: insertion sort is faster.
(2) I also added a selection sort test (code attached) and compared it against insertion sort. The results: selection sort took 4748.35 ns while insertion sort took 3540.114 ns.
The P_VALUE for the Wilcoxon test is still below the range of numerical error (effectively ~= 0).
Code for the selection sort used:
public static void selectionSort(int[] arr) {
    for (int i = 0; i < arr.length; i++) {
        int min = arr[i];
        int minElm = i;
        for (int j = i + 1; j < arr.length; j++) {
            if (arr[j] < min) {
                min = arr[j];
                minElm = j;
            }
        }
        swap(arr, i, minElm);
    }
}
EDIT: As IVlad points out in a comment, selection sort does only n swaps (and therefore only 3n writes) for any dataset, so insertion sort is very unlikely to beat it on account of doing fewer swaps -- but it will likely do substantially fewer comparisons. The reasoning below better fits a comparison with bubble sort, which will do a similar number of comparisons but many more swaps (and thus many more writes) on average.
One reason why insertion sort tends to be faster than the other O(n^2) algorithms like bubble sort and selection sort is because in the latter algorithms, every single data movement requires a swap, which can be up to 3 times as many memory copies as are necessary if the other end of the swap needs to be swapped again later.
With insertion sort, OTOH, if the next element to be inserted isn't already the largest so far, it can be saved into a temporary location, and all larger elements in the sorted prefix shunted forward, starting from the right, using single data copies (i.e. without swaps). This opens up a gap to put the saved element into.
C code for insertion-sorting integers without using swaps:
void insertion_sort(int *v, int n) {
    int i = 1;
    while (i < n) {
        int temp = v[i];   // Save the current element here
        int j = i;
        // Shunt everything forwards
        while (j > 0 && v[j - 1] > temp) {
            v[j] = v[j - 1];   // Look ma, no swaps! :)
            --j;
        }
        v[j] = temp;
        ++i;
    }
}

Exactly how many comparisons does merge sort make?

I have read that quicksort is much faster than mergesort in practice, and the reason for this is the hidden constant.
Well, the solution of the recurrence for randomized quicksort's expected comparison count is 2n ln n ≈ 1.39 n log2 n, which means that the constant in quicksort is 1.39.
But what about mergesort? What is the constant in mergesort?
Let's see if we can work this out!
In merge sort, at each level of the recursion, we do the following:
Split the array in half.
Recursively sort each half.
Use the merge algorithm to combine the two halves together.
So how many comparisons are done at each step? Well, the divide step doesn't make any comparisons; it just splits the array in half. Step 2 doesn't (directly) make any comparisons; all comparisons are done by recursive calls. In step 3, we have two arrays of size n/2 and need to merge them. This requires at most n comparisons, since each step of the merge algorithm does a comparison and then consumes some array element, so we can't do more than n comparisons.
Combining this together, we get the following recurrence:
C(1) = 0
C(n) = 2C(n / 2) + n
(As mentioned in the comments, the linear term is more precisely (n - 1), though this doesn’t change the overall conclusion. We’ll use the above recurrence as an upper bound.)
To simplify this, let's define n = 2^k and rewrite the recurrence in terms of k:
C'(0) = 0
C'(k) = 2C'(k - 1) + 2^k
The first few terms here are 0, 2, 8, 24, ... . This looks something like k·2^k, and we can prove this by induction. As our base case, when k = 0, the first term is 0, and the value of k·2^k is also 0. For the inductive step, assume the claim holds for some k and consider k + 1. Then the value is 2(k·2^k) + 2^(k+1) = k·2^(k+1) + 2^(k+1) = (k + 1)·2^(k+1), so the claim holds for k + 1, completing the induction. Thus the value of C'(k) is k·2^k. Since n = 2^k, this means that, assuming that n is a perfect power of two, the number of comparisons made is
C(n) = n lg n
Impressively, this is better than quicksort! So why on earth is quicksort faster than merge sort? This has to do with other factors that have nothing to do with the number of comparisons made. Primarily, since quicksort works in place while merge sort works out of place, the locality of reference is not nearly as good in merge sort as it is in quicksort. This is such a huge factor that quicksort ends up being much, much better than merge sort in practice, since the cost of a cache miss is pretty huge. Additionally, the time required to sort an array doesn't just take the number of comparisons into account. Other factors like the number of times each array element is moved can also be important. For example, in merge sort we need to allocate space for the buffered elements, move the elements so that they can be merged, then merge back into the array. These moves aren't counted in our analysis, but they definitely add up. Compare this to quicksort's partitioning step, which moves each array element exactly once and stays within the original array. These extra factors, not the number of comparisons made, dominate the algorithm's runtime.
This analysis is a bit less precise than the optimal one, but Wikipedia confirms that the analysis is roughly n lg n and that this is indeed fewer comparisons than quicksort's average case.
Hope this helps!
In the worst case and assuming a straight-forward implementation, the number of comparisons to sort n elements is
n ⌈lg n⌉ − 2^⌈lg n⌉ + 1
where lg n indicates the base-2 logarithm of n.
This result can be found in the corresponding Wikipedia article or recent editions of The Art of Computer Programming by Donald Knuth, and I just wrote down a proof for this answer.
Merging two sorted arrays (or lists) of size k resp. m takes k+m-1 comparisons at most, min{k,m} at best. (After each comparison, we can write one value to the target, when one of the two is exhausted, no more comparisons are necessary.)
Let C(n) be the worst case number of comparisons for a mergesort of an array (a list) of n elements.
Then we have C(1) = 0, C(2) = 1, pretty obviously. Further, we have the recurrence
C(n) = C(floor(n/2)) + C(ceiling(n/2)) + (n-1)
An easy induction shows
C(n) <= n*log_2 n
On the other hand, it's easy to see that we can come arbitrarily close to the bound (for every ε > 0, we can construct cases needing more than (1-ε)*n*log_2 n comparisons), so the constant for mergesort is 1.
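As a quick sanity check (a small C sketch of my own, not part of the original answers), the recurrence above can be evaluated numerically and compared against the closed form n⌈lg n⌉ − 2^⌈lg n⌉ + 1:

#include <stdio.h>

/* Worst-case merge sort comparisons via the recurrence
   C(n) = C(floor(n/2)) + C(ceil(n/2)) + (n - 1), with C(1) = 0. */
long long C(long long n) {
    if (n <= 1) return 0;
    return C(n / 2) + C(n - n / 2) + (n - 1);
}

/* Closed form n*ceil(lg n) - 2^ceil(lg n) + 1 */
long long closed_form(long long n) {
    long long k = 0, p = 1;
    while (p < n) { p *= 2; k++; }   /* k = ceil(lg n), p = 2^k */
    return n * k - p + 1;
}

int main(void) {
    for (long long n = 1; n <= 20; n++)
        printf("n=%2lld  recurrence=%3lld  closed form=%3lld\n", n, C(n), closed_form(n));
    return 0;
}

For n a power of two the closed form reduces to n·lg n − (n − 1).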
Merge sort is O(n log n) and, in the worst case for the number of comparisons, performs one comparison at each of those steps.
Quicksort, on the other hand, is O(n^2) in the worst case.
C++ program to count the number of comparisons in merge sort.
First the program will sort the given array, then it will show the number of comparisons.
#include <iostream>
using namespace std;

int comparison_count = 0; /* counts the number of comparisons (renamed to avoid clashing with std::count) */

int merge(int arr[], int l, int m, int r)
{
    int i = l;       /* index into the left subarray  */
    int j = m + 1;   /* index into the right subarray */
    int k = l;       /* index into the temporary array */
    int temp[r + 1]; /* variable-length array: relies on a compiler extension (e.g. g++) */

    while (i <= m && j <= r)
    {
        if (arr[i] <= arr[j])
        {
            temp[k] = arr[i];
            i++;
        }
        else
        {
            temp[k] = arr[j];
            j++;
        }
        k++;
        comparison_count++;
    }
    while (i <= m)
    {
        temp[k] = arr[i];
        i++;
        k++;
    }
    while (j <= r)
    {
        temp[k] = arr[j];
        j++;
        k++;
    }
    for (int p = l; p <= r; p++)
    {
        arr[p] = temp[p];
    }
    return comparison_count;
}

int mergesort(int arr[], int l, int r)
{
    int comparisons = 0;
    if (l < r)
    {
        int m = (l + r) / 2;
        mergesort(arr, l, m);
        mergesort(arr, m + 1, r);
        comparisons = merge(arr, l, m, r);
    }
    return comparisons;
}

int main()
{
    int size;
    cout << " Enter the size of an array " << endl;
    cin >> size;
    int myarr[size]; /* variable-length array: relies on a compiler extension (e.g. g++) */
    cout << " Enter the elements of array " << endl;
    for (int i = 0; i < size; i++)
    {
        cin >> myarr[i];
    }
    cout << " Elements of array before sorting are " << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    int c = mergesort(myarr, 0, size - 1);
    cout << " Elements of array after sorting are " << endl;
    for (int i = 0; i < size; i++)
    {
        cout << myarr[i] << " ";
    }
    cout << endl;
    cout << " Number of comparisons while sorting the given array: " << c << endl;
    return 0;
}
I am assuming the reader knows merge sort. Comparisons happen only when two sorted arrays are merged. For simplicity, assume n is a power of 2. Merging two arrays of size n/2 each needs at most (n - 1) comparisons in the worst case; the -1 appears because the last element left over in a merge does not require any comparison. First count the total comparisons pretending each merge costs n, then correct for the -1 parts. The number of merge levels is log2(n) (picture the recursion as a tree). Each level costs n comparisons (minus something due to the -1 parts), so the total is n*log2(n) minus a correction. The correction is the total number of merges, (1 + 2 + 4 + 8 + ... + n/2) = n - 1.
Number of total comparisons in merge sort = n*log2(n) - (n - 1). For example, for n = 8 this gives 8*3 - 7 = 17 comparisons in the worst case.
So, your constant is 1.
