Insertion Sort with binary search - algorithm

When implementing Insertion Sort, a binary search could be used to locate the position within the first i - 1 elements of the array into which element i should be inserted.
How would this affect the number of comparisons required? How would using such a binary search affect the asymptotic running time for Insertion Sort?
I'm pretty sure this would decrease the number of comparisons, but I'm not exactly sure why.

Straight from Wikipedia:
If the cost of comparisons exceeds the cost of swaps, as is the case for example with string keys stored by reference or with human interaction (such as choosing one of a pair displayed side-by-side), then using binary insertion sort may yield better performance. Binary insertion sort employs a binary search to determine the correct location to insert new elements, and therefore performs ⌈log2(n)⌉ comparisons per insertion in the worst case, for O(n log n) comparisons overall. The algorithm as a whole still has a running time of O(n^2) on average because of the series of swaps required for each insertion.
Source:
http://en.wikipedia.org/wiki/Insertion_sort#Variants
Here is an example:
http://jeffreystedfast.blogspot.com/2007/02/binary-insertion-sort.html
"I'm pretty sure this would decrease the number of comparisons, but I'm not exactly sure why."
Well, if you know insertion sort and binary search already, then it's pretty straightforward. When you insert a piece in insertion sort, you may have to compare it against all of the previous pieces. Say you want to move this [2] to the correct place; you would have to compare it against 7 pieces before you find the right spot.
[1][3][3][3][4][4][5] ->[2]<- [11][0][50][47]
However, if you start the comparison at the halfway point (like a binary search), then you'll only need about 3 comparisons (⌈log2(7+1)⌉ = 3). You can do this because you know the left pieces are already in order (you can only binary search pieces that are in order!).
Now imagine if you had thousands of pieces (or even millions), this would save you a lot of time. I hope this helps. |=^)
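To make that concrete, here is a minimal sketch of binary insertion sort in Python (the function name and details are my own illustration, not taken from the linked example): a binary search finds where each element belongs in the sorted prefix, but the elements after that position still have to be shifted over to make room.

    def binary_insertion_sort(a):
        """Sort list a in place, using binary search to find each insertion point."""
        for i in range(1, len(a)):
            key = a[i]
            # Binary search for the insertion point within the sorted prefix a[0:i].
            lo, hi = 0, i
            while lo < hi:
                mid = (lo + hi) // 2
                if a[mid] <= key:      # keep equal keys in their original order (stable)
                    lo = mid + 1
                else:
                    hi = mid
            # Shift the larger elements one slot to the right and drop the key in.
            a[lo + 1:i + 1] = a[lo:i]
            a[lo] = key
        return a

    print(binary_insertion_sort([1, 3, 3, 3, 4, 4, 5, 2, 11, 0, 50, 47]))

On the [2] example above, the binary search over the 7 sorted pieces settles on the right slot after 3 comparisons.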

If you have a good data structure for efficient binary searching, it is unlikely to have O(log n) insertion time. Conversely, a good data structure for fast insertion at an arbitrary position is unlikely to support binary search.
To achieve the O(n log n) performance of the best comparison sorts with insertion sort, you would need both an O(log n) binary search and an O(log n) arbitrary insert.

Binary Insertion Sort - take this array => {4, 5, 3, 2, 1}
Now inside the main loop, imagine we are at the 3rd element. Using binary search on the sorted prefix {4, 5}, we know where to insert 3, i.e. before 4.
Binary search uses O(log n) comparisons, which is an improvement, but we still need to insert 3 in the right place. For that we need to swap 3 with 5 and then with 4.
Because the insertion (the shifting) takes the same amount of time as it would without binary search, the worst-case complexity still remains O(n^2).
I hope this helps.
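To see both costs at once, here is a small instrumented sketch (my own, not part of the answer above) that counts comparisons and element moves separately. On a reverse-sorted input the comparison count grows roughly like n log n, while the move count still grows like n^2, which is why the overall worst case stays O(n^2).

    def binary_insertion_sort_counts(a):
        """Sort a in place; return (comparisons, moves) performed along the way."""
        comparisons = moves = 0
        for i in range(1, len(a)):
            key = a[i]
            lo, hi = 0, i
            while lo < hi:                 # binary search: O(log i) comparisons
                mid = (lo + hi) // 2
                comparisons += 1
                if a[mid] <= key:
                    lo = mid + 1
                else:
                    hi = mid
            moves += i - lo                # shifting is still O(i) in the worst case
            a[lo + 1:i + 1] = a[lo:i]
            a[lo] = key
        return comparisons, moves

    for n in (100, 200, 400):
        print(n, binary_insertion_sort_counts(list(range(n, 0, -1))))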

If the input array is already sorted, binary search will not reduce the number of comparisons: the ordinary inner loop already ends after a single comparison, because the previous element is smaller. In general, the number of comparisons in insertion sort is at most the number of inversions plus the array size minus 1.
Since the number of inversions in a sorted array is 0, the maximum number of comparisons on an already sorted array is N - 1.
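As a quick sanity check (my own sketch, not from the answer), you can count the comparisons made by ordinary insertion sort and set them against the inversion count; on an already sorted input the count comes out to exactly N - 1.

    def insertion_sort_comparisons(a):
        """Count comparisons made by plain (linear-scan) insertion sort on list a."""
        comparisons = 0
        for i in range(1, len(a)):
            key, j = a[i], i - 1
            while j >= 0:
                comparisons += 1           # compare key with a[j]
                if a[j] <= key:
                    break
                a[j + 1] = a[j]            # shift the larger element right
                j -= 1
            a[j + 1] = key
        return comparisons

    def inversions(a):
        """Naive O(n^2) inversion count, fine for a small check."""
        return sum(1 for i in range(len(a)) for j in range(i + 1, len(a)) if a[i] > a[j])

    data = [5, 1, 4, 2, 3]
    print(insertion_sort_comparisons(sorted(data)))               # already sorted: N - 1 = 4
    print(insertion_sort_comparisons(data[:]), inversions(data))  # 9 <= 6 + (5 - 1)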

With binary search, locating each insertion point takes O(log n) comparisons, but the shifts for each insertion are still on the order of n.
For n elements in the worst case, that is n * (log n + n), which is of order n^2.

Related

Data structure with O(1) insertion and O(log(n)) search complexity?

Is there any data structure available that would provide O(1) -- i.e. constant -- insertion complexity and O(log(n)) search complexity even in the worst case?
A sorted vector can do an O(log(n)) search, but insertion would take O(n) (given that I am not always inserting the elements at the front or the back). Whereas a list would do O(1) insertion but would fall short of providing O(log(n)) lookup.
I wonder whether such a data structure can even be implemented.
Yes, but you would have to bend the rules a bit in two ways:
1) You could use a structure that has O(1) insertion and O(1) search (such as the CritBit tree, also called a bitwise trie) and add an artificial cost to turn the search into O(log n).
A CritBit tree is like a binary radix tree for bits. It stores keys by walking along the bits of a key (say 32 bits) and using each bit to decide whether to navigate left ('0') or right ('1') at every node. The maximum complexity for both search and insertion is O(32), which becomes O(1).
2) I'm not sure that this is O(1) in a strict theoretical sense, because O(1) works only if we limit the key range (to, say, 32 bits or 64 bits), but for practical purposes this seems a reasonable limitation.
Note that the perceived performance will be O(log n) until a significant fraction of the possible keys has been inserted. For example, for 16-bit keys you would probably have to insert a significant part of 2^16 = 65536 keys.
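For illustration only, here is a minimal fixed-width bitwise trie in Python (a simplified stand-in for a real crit-bit tree, which additionally compresses single-child paths): both insert and search walk at most 32 bits, i.e. a bounded number of steps per key.

    class BitTrieNode:
        __slots__ = ("children", "present")
        def __init__(self):
            self.children = [None, None]   # child for bit 0 and child for bit 1
            self.present = False           # True if a key ends at this node

    class BitTrie:
        """Set of 32-bit unsigned integer keys; insert/search cost O(32) = O(1)."""
        BITS = 32

        def __init__(self):
            self.root = BitTrieNode()

        def insert(self, key):
            node = self.root
            for i in range(self.BITS - 1, -1, -1):   # walk from the most significant bit
                bit = (key >> i) & 1
                if node.children[bit] is None:
                    node.children[bit] = BitTrieNode()
                node = node.children[bit]
            node.present = True

        def search(self, key):
            node = self.root
            for i in range(self.BITS - 1, -1, -1):
                node = node.children[(key >> i) & 1]
                if node is None:
                    return False
            return node.present

    t = BitTrie()
    for k in (3, 65536, 42):
        t.insert(k)
    print(t.search(42), t.search(7))   # True False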
No (at least in a model where the elements stored in the data structure can be compared for order only; hashing does not help for worst-case time bounds because there can be one big collision).
Let's suppose that every insertion requires at most c comparisons. (Heck, let's make the weaker assumption that n insertions require at most c*n comparisons.) Consider an adversary that inserts n elements and then looks up one. I'll describe an adversarial strategy that, during the insertion phase, forces the data structure to have Omega(n) elements that, given the comparisons made so far, could be ordered any which way. Then the data structure can be forced to search these elements, which amount to an unsorted list. The result is that the lookup has worst-case running time Omega(n).
The adversary's goal is to give away as little information as possible. Elements are sorted into three groups: winners, losers, and unknown. Initially, all elements are in the unknown group. When the algorithm compares two unknown elements, one chosen arbitrarily becomes a winner and the other becomes a loser. The winner is deemed greater than the loser. Similarly, unknown-loser, unknown-winner, and loser-winner comparisons are resolved by designating one of the elements a winner and the other a loser, without changing existing designations. The remaining cases are loser-loser and winner-winner comparisons, which are handled recursively (so the winners' group has a winner-unknown subgroup, a winner-winners subgroup, and a winner-losers subgroup). By an averaging argument, since at least n/2 elements are compared at most 2*c times, there exists a subsub...subgroup of size at least n/2 / 3^(2*c) = Omega(n). It can be verified that none of these elements are ordered by previous comparisons.
I wonder whether such a data structure can even be implemented.
I am afraid the answer is no.
Searching OK, Insertion NOT
When we look at data structures like the binary search tree, B-tree, red-black tree, and AVL tree, they have an average search complexity of O(log N), but at the same time the average insertion complexity is also O(log N). The reason is straightforward: the search has to follow (or navigate through) the same path along which the insertion happens.
Insertion OK, Searching NOT
Data structures like the singly linked list and doubly linked list have an average insertion complexity of O(1), but searching in a singly or doubly linked list is a painful O(N), simply because they don't support any index-based element access.
The answer to your question lies in the skip list, which is a linked list, yet it still needs O(log N) on average for insertion (whereas plain lists do insertion in O(1)).
On a closing note, a hashmap comes very close to meeting the speedy-search and speedy-insertion requirement, at the cost of a lot of space, but if poorly implemented it can degrade to O(N) for both insertion and searching.
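Here is a minimal skip-list sketch in Python (my own, assuming comparable keys and the usual p = 0.5 coin flips), which makes the point above visible: insertion has to perform essentially the same expected O(log N) descent a search does before it can splice the new node in.

    import random

    class _Node:
        __slots__ = ("key", "forward")
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * level   # forward[i] = next node at level i

    class SkipList:
        MAX_LEVEL = 16
        P = 0.5

        def __init__(self):
            self.head = _Node(None, self.MAX_LEVEL)
            self.level = 1                  # number of levels currently in use

        def _random_level(self):
            lvl = 1
            while random.random() < self.P and lvl < self.MAX_LEVEL:
                lvl += 1
            return lvl

        def search(self, key):              # expected O(log N)
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]
            node = node.forward[0]
            return node is not None and node.key == key

        def insert(self, key):              # expected O(log N): same descent, then splice
            update = [self.head] * self.MAX_LEVEL
            node = self.head
            for i in range(self.level - 1, -1, -1):
                while node.forward[i] is not None and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            lvl = self._random_level()
            self.level = max(self.level, lvl)
            new = _Node(key, lvl)
            for i in range(lvl):
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new

    s = SkipList()
    for k in (30, 10, 20):
        s.insert(k)
    print(s.search(20), s.search(25))   # True False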

Why is insertion sort the best algorithm for sorted or nearly sorted arrays?

So I guess it's because it just compares A[k] and A[k-1], and does the work in one sweep, but it's still not clear to me. Can someone explain this better?
Thanks
This link shows a graphical representation of sorting algorithms on different types of data sets.
As you can see, when the data is already sorted the algorithm's cost is reduced to N, which is just the number of input elements.
The link provided gives a clear picture of why it's more efficient.
You answered your own question: for a nearly sorted array, insertion sort will only need a handful of O(n) passes to complete. Contrast that with a divide-and-conquer sorting algorithm like merge sort, which takes O(n*lg n). For any non-trivial value of n, a divide-and-conquer algorithm will need many O(n) passes, even if the array is almost completely sorted, whereas insertion sort might only require a few.
Insertion sort is a faster and more refined sorting algorithm than selection sort. In selection sort the algorithm iterates through all of the data on every pass, whether it is already sorted or not. Insertion sort works differently: instead of iterating through all of the data after every pass, the algorithm only traverses the data it needs to until the segment being sorted is sorted. There are two loops required by insertion sort, and therefore two main variables, which in this case are named 'i' and 'j'. Variables 'i' and 'j' begin on the same index after every pass of the first loop; the second loop only executes if variable 'j' is greater than index 0 AND arr[j] < arr[j - 1]. In other words, it runs while 'j' hasn't reached the start of the data AND the value at index 'j' is smaller than the value at the index to the left of 'j', and 'j' is decremented on each iteration. As long as these two conditions are met, the second loop keeps executing; this is what sets insertion sort apart from selection sort: only the data that needs to be sorted is sorted.
The general goal of a sorting algorithm is to minimize the number of comparisons. Sorting algorithms have a lower bound and an upper bound on the number of comparisons (n log n worst case for merge and heap sorts, n log n average case for quicksort). In the most general case, you'd go with an algorithm that happens to have the best average or best worst-case number of comparisons. However, when you know something about the data (e.g., the array is already sorted, or almost sorted), you can exploit the fact that insertion sort's lower bound is far below that of the "n log n" sorts.
For example, if you have the array [1,2,3,4,5,6,7,9] and you need to insert 8 into it, you could append it and re-sort the whole array with a vanilla n log n sort (roughly 28 comparisons to get to [1,2,3,4,5,6,7,8,9]), or you can let insertion sort place the 8 in the right position in at most about 8 comparisons.
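To put numbers on that last example, here is my own small sketch (note it scans from the right end the way insertion sort's inner loop usually does, so in this instance it needs even fewer comparisons than a left-to-right scan would): a single insertion pass places the 8 after just 2 comparisons, far below the roughly 28 a full re-sort would spend.

    def insert_one(a, x):
        """One insertion-sort pass: place x into the already-sorted list a,
        scanning from the right end; returns the number of comparisons used."""
        a.append(x)
        comparisons, j = 0, len(a) - 2
        while j >= 0:
            comparisons += 1
            if a[j] <= x:
                break
            a[j + 1] = a[j]     # shift the larger element one slot right
            j -= 1
        a[j + 1] = x
        return comparisons

    a = [1, 2, 3, 4, 5, 6, 7, 9]
    print(insert_one(a, 8), a)   # 2 [1, 2, 3, 4, 5, 6, 7, 8, 9]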

Algorithm comparison in unsorted array

If I have an unsorted array A[1.....n], I can either:
1. use linear search to search for a number x, or
2. sort the array A in ascending order with bubble sort, then use binary search to find x in the sorted array.
Which way will be more efficient, 1 or 2?
How can I justify it?
If you need to search for a single number, nothing can beat a linear search: sorting cannot be done faster than O(n), and even that is achievable only in special cases. Moreover, bubble sort is extremely inefficient, taking O(n^2) time. Binary search is faster than that, so the overall cost of option 2 is dominated by the O(n^2) sort.
Hence you are comparing O(n) to O(n^2); obviously, O(n) wins.
The picture would be different if you needed to search for k different numbers. The linear searches then cost about k*n, while sorting once and binary-searching costs about n^2 + k*log n, so once k grows to roughly n or more the comparison can come out the other way.
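A back-of-the-envelope comparison (my own sketch; the counts are dominant terms only, ignoring constant factors) shows where that crossover sits:

    from math import ceil, log2

    n = 10_000
    for k in (1, 100, n, 10 * n):
        linear = k * n                                 # k separate linear searches
        sort_then_search = n * n + k * ceil(log2(n))   # bubble sort once + k binary searches
        print(f"k={k}: linear={linear}, sort+binary={sort_then_search}")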

Is there a name for this sorting algorithm?

I thought of a sorting algorithm but I am not sure if this already exists.
Say we have a container with n items:
We choose the 3rd element and do a binary search on the first 2, putting it in the correct position. The first 3 items in the container are sorted.
We choose the 4th element and do a binary search on the first 3 and put it in the correct position. Now the first 4 items are sorted.
We choose the 5th element and do a binary search on the first 4 items and put it in the correct position. Now 5 items are sorted.
.
.
.
We choose the nth element and do a binary search on the other n-1 elements putting it in the correct position. All the items are sorted.
Binary search takes log k for k elements, and let's say that the insertion takes constant time. Shouldn't this take:
log 2 to put the 3rd element in the correct spot.
log 3 to put the 4th element in the correct spot.
log 4 to put the 5th element in the correct spot.
.
.
.
log(n-1) to put the nth element in the correct spot.
log 2 + log 3 + log 4 + ... + log(n-1) = log((n-1)!) ?
I may be talking nonsense but this looked interesting.
EDIT:
I did not take the insertion time into consideration. What if the sorting were done in an array with gaps between the elements? This would allow for fast insertion without having to shift many elements. After a number of inserts, we could redistribute the elements. Considering that the input is not sorted (we could shuffle it to make sure of that), I think the results could be quite fast.
It sounds like insertion sort modified to use binary search. It's fairly well-known, but not particularly well-used (as far as I know), possibly because it doesn't affect the O(n²) worst case, but makes the O(n) best case take O(n log n) instead, and because insertion sort isn't commonly used on anything but really small arrays or those already sorted or nearly sorted.
The problem is that you can't really insert in O(1). Random-access insert into an array takes O(n), which is of course what the well-known O(n²) complexity of insertion sort assumes.
One could consider a data structure like a binary search tree, which has O(log n) insert - it's not O(1), but we still end up with an O(n log n) algorithm.
Oh, and O(log(n!)) = O(n log n) (by Stirling's approximation), in case you were wondering about that.
Tree sort (generic binary search tree) and splaysort (splay tree) both use binary search trees to sort. Adding elements to a balanced binary search tree is equivalent to doing a binary search to find where to add the elements then some tree operations to keep the tree balanced. Without a tree of some type, this becomes insertion sort as others have mentioned.
In the worst case the tree can become highly unbalanced, resulting in O(N^2) for tree sort. Using a self-balancing binary search tree yields O(N log N), at least on average. Splay sort is an adaptive sort, making it rather efficient when the input is already nearly sorted.
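For reference, here is a minimal (unbalanced) tree sort sketch in Python, my own illustration rather than anyone's answer: inserting into the BST plays the role of the binary search, and an in-order traversal reads the sorted result back out. Without balancing (or a splay tree), an already-sorted input degenerates the tree into a list and gives the O(N^2) worst case mentioned above.

    class Node:
        def __init__(self, key):
            self.key, self.left, self.right = key, None, None

    def bst_insert(root, key):
        """Insert key into an (unbalanced) binary search tree; duplicates go right."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = bst_insert(root.left, key)
        else:
            root.right = bst_insert(root.right, key)
        return root

    def tree_sort(items):
        root = None
        for x in items:
            root = bst_insert(root, x)
        out = []
        def inorder(node):               # left, node, right -> ascending order
            if node is not None:
                inorder(node.left)
                out.append(node.key)
                inorder(node.right)
        inorder(root)
        return out

    print(tree_sort([4, 5, 3, 2, 1]))   # [1, 2, 3, 4, 5]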
I think by binary search, he meant that the insertion happens at a position found by searching the already-sorted part for where we would expect to find the item being inserted, in which case it would be called insertion sort... Either way the comparisons still total N*log(N).

Why Binary Search Trees?

I was reading about binary search trees and was wondering: why do we need BSTs at all? Everything I know of can also be achieved using a simple sorted array. For example, to build a BST with n elements, we need n*O(log n) time, i.e. O(n log n), and the lookup time is O(log n). But the same can be achieved with an array: we can build a sorted array (which requires O(n log n) time), and the lookup time is also O(log n), i.e. the binary search algorithm. So why do we need another data structure at all? Are there any other uses/applications of BSTs that make them so special?
--Ravi
Arrays are great if you're talking about write-once, read-many-times types of interactions. It's when you get down to inserting, swapping, and deleting that BSTs really start to shine compared to an array. Since they're node-based, rather than based on a contiguous chunk of memory, the cost of moving an element into or out of the collection is low while still maintaining the sorted nature of the collection.
Think of it as you would the difference in insertion between linked lists and arrays. This is an oversimplification, but it highlights an aspect of the advantage I've noted above.
Imagine you have an array with a million elements.
You want to insert an element at location 5.
So you insert at the end of the array and then sort.
Let's say the array is full; that's O(n log n), which is roughly 1,000,000 * 20 = 20,000,000 operations.
Imagine you have a balanced tree.
That's O(log n), plus a bit for balancing: about 20 + a bit, call it 25 operations.
So, you've just spent 20,000,000 ops sorting your array. You then want to find that element. What do you do? Binary search - O(log n) - which is exactly the same as what you're going to do when you search in the tree!
Now imagine you want to allocate -another- element.
Your array is full! What do you do? Re-allocate the array with n extra elements and memcpy the lot? Do you really want to memcpy 4 MB?
In a tree, you just add another element...
How about sorted insertion time?
In graphics programming, if you have extended objects (i.e. objects that represent an interval in each dimension and not just a point), you can add them to the smallest level of a binary tree (typically an octree) in which they fit entirely.
And if you don't pre-calculate the tree/sorted list, the O(n) random-insertion time into a list can be prohibitively slow. Insertion time into a tree, on the other hand, is only O(log(n)).
