Is binary search optimal in the worst case? My instructor has said so, but I could not find a book that backs it up. We start with an ordered array, and in the worst case (the worst case for that algorithm), any algorithm will take at least as many pairwise comparisons as binary search.
Many people said that the question was unclear. Sorry! So the input is any general sorted array. I am looking for a proof which says that any search algorithm will take at least log2(N) comparisons in the worst case (the worst case for the algorithm under consideration).
Yes, binary search is optimal.
This is easily seen by appealing to information theory. It takes log2(N) bits merely to identify a unique element out of N elements. But each (true/false) comparison gives you at most one bit of information. Therefore, you must perform at least log2(N) comparisons in the worst case to identify a unique element.
More verbosely... Consider a hypothetical algorithm X that outperforms binary search in the worst case. For a particular element of the array, run the algorithm and record the questions it asks; i.e., the sequence of comparisons it performs. Or rather, record the answers to those questions (like "true, false, false, true").
Convert that sequence into a binary string (1,0,0,1). Call this binary string the "signature of the element with respect to algorithm X". Do this for each element of the array, assigning a "signature" to each element.
Now here is the key. If two elements have the same signature, then algorithm X cannot tell them apart! All the algorithm knows about the array are the answers it gets from the questions it asks; i.e., the comparisons it performs. And if the algorithm cannot tell two elements apart, then it cannot be correct. (Put another way, if two elements have the same signature, meaning they result in the same sequence of comparisons by the algorithm, which one did the algorithm return? Contradiction.)
Finally, prove that if every signature has fewer than log N bits, then there must exist two elements with the same signature (pigeonhole principle). Done.
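The signature argument above can be checked on a small array. The following is a minimal sketch, assuming distinct elements, a target that is known to be present, and a binary search that asks only true/false questions; the names `signature` and `min_bits` are made up for this illustration.

```python
def signature(sorted_items, target):
    """Answers (True/False) a two-way binary search receives while locating target.

    Assumes all items are distinct and target is present.
    """
    lo, hi = 0, len(sorted_items) - 1
    answers = []
    while lo < hi:
        mid = (lo + hi) // 2
        answer = target <= sorted_items[mid]  # one comparison yields one bit
        answers.append(answer)
        if answer:
            hi = mid
        else:
            lo = mid + 1
    return tuple(answers)


def min_bits(n):
    """ceil(log2(n)) for n >= 1, using integer arithmetic only."""
    return (n - 1).bit_length()


items = list(range(10, 110, 10))  # 10 distinct sorted values: 10, 20, ..., 100
signatures = [signature(items, x) for x in items]
longest = max(len(s) for s in signatures)
```

Every element gets its own signature, and the longest one has exactly ceil(log2(10)) = 4 bits, which is the smallest length that leaves room for 10 different signatures.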
[update]
One quick additional comment. The above assumes that the algorithm does not know anything about the array except what it learns from performing comparisons. Of course, in real life, sometimes you do know something about the array a priori. As a toy example, if I know that the array has (say) 10 elements all between 1 and 100, and that they are distinct, and that the numbers 92 through 100 are all present in the array... Then clearly I do not need to perform four comparisons even in the worst case.
More realistically, if I know that the elements are uniformly distributed (or roughly uniformly distributed) between their min and their max, again I can do better than binary search.
But in the general case, binary search is still optimal.
Worst case for which algorithm? There's not one universal "worst case." If your question is...
"Is there a case where binary search takes more comparisons than another algorithm?"
Then, yes, of course. A simple linear search takes less time if the element happens to be the first one in the list.
"Is there even an algorithm with a better worst-case running time than binary search?"
Yes, in cases where you know more about the data. For example, a radix tree or trie is at worst constant-time with regard to the number of entries (but linear in length of the key).
"Is there a general search algorithm with a better worst-case running time than binary search?"
If you can only assume you have a comparison function on keys, no, the best worst-case is O(log n). But there are algorithms that are faster, just not in a big-O sense.
... so I suppose you really would have to define the question first!
Binary search has a worst case complexity of O(log(N)) comparisons - which is optimal for a comparison based search of a sorted array.
In some cases it might make sense to do something other than a purely comparison based search. In this case you might be able to beat the O(log(N)) barrier; for example, check out interpolation search.
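As a rough illustration of the interpolation search mentioned above, here is a minimal sketch. It assumes a sorted list of integers that are roughly evenly spaced; the function name and the probe counter are illustrative additions, not part of any library.

```python
def interpolation_search(a, target):
    """Return (index, probes); index is -1 if target is absent.

    Assumes a is sorted and holds numbers that are roughly evenly spaced.
    """
    lo, hi = 0, len(a) - 1
    probes = 0
    while lo <= hi and a[lo] <= target <= a[hi]:
        if a[hi] == a[lo]:
            # All remaining values are equal, so target == a[lo] here.
            return lo, probes + 1
        # Guess the position from the value instead of always halving.
        pos = lo + (target - a[lo]) * (hi - lo) // (a[hi] - a[lo])
        probes += 1
        if a[pos] == target:
            return pos, probes
        if a[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1, probes


evenly_spaced = list(range(0, 1000, 10))  # 0, 10, ..., 990
```

On perfectly even data such as `evenly_spaced`, the first guess already lands on the target. On skewed data the guesses can be poor, and the worst case degrades to a linear number of probes.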
It depends on the nature of the data. For example the English language and a dictionary. You could write an algorithm to achieve better than a binary search by making use of the fact that certain letters occur within the English language with different frequencies.
But in general a binary search is a safe bet.
I think the question is a little unclear, but still here are my thoughts.
Worst case of Binary search would be when the element you are searching for is found after all log n comparisons. But the same data can be best case for linear search. It depends on the data arrangement and what you are searching for but the worst case for Binary search would end up to be log n. Now, this cannot be compared with same data and search for linear search since its worst case would be different. The worst case for Linear search could be finding an element which happens to be at the end of the array.
For example: array A = 1, 2, 3, 4, 5, 6 and Binary Search on A for 1 would be the worst case. Whereas for the same array, linear search for 6 would be the worst case, not search for 1.
Related
"Every comparison-based algorithm to sort n elements must take Ω(nlogn) comparisons in the worst case. With this fact, what would be the complexity of constructing a n-node binary search tree and why?"
Based on this question, I am thinking that the construction complexity must be at least O(nlogn). That said, I can't seem to figure out how to find the total complexity of construction.
The title of the question and the text you quote are asking different things. I am going to address what the quote is saying because finding how expensive BST construction is can be done just by looking at an algorithm.
Assume that for a second it was possible to construct a BST in better than Ω(nlogn). With a binary search tree you can read out the sorted list in Θ(n) time. This means I could create a sorting algorithm as follows.
Algorithm sort(L)
    B <- buildBST(L)
    Sorted <- inOrderTraversal(B)
    return Sorted
With this algorithm I would be able to sort a list in better than Ω(nlogn). But as you stated this is not possible because Ω(nlogn) is a lower bound. Therefore it is not possible to create a binary search tree in better than Ω(nlogn) time.
Furthermore, since an algorithm exists to create a BST in O(nlogn) time, you can actually say that the algorithm is optimal under the comparison based model.
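The reduction above (build a BST, then read it out in order) can be written as runnable code. This is a minimal sketch using a plain, unbalanced BST, so it shows the reduction itself rather than an O(nlogn) construction; `Node`, `build_bst`, `in_order` and `bst_sort` are names chosen for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def build_bst(items):
    """Insert items one by one into an unbalanced BST and return the root."""
    root = None
    for key in items:
        new = Node(key)
        if root is None:
            root = new
            continue
        cur = root
        while True:
            if key < cur.key:
                if cur.left is None:
                    cur.left = new
                    break
                cur = cur.left
            else:  # equal keys go to the right
                if cur.right is None:
                    cur.right = new
                    break
                cur = cur.right
    return root


def in_order(root):
    """Iterative in-order traversal; visits every node once, so Θ(n)."""
    out, stack, cur = [], [], root
    while stack or cur is not None:
        while cur is not None:
            stack.append(cur)
            cur = cur.left
        cur = stack.pop()
        out.append(cur.key)
        cur = cur.right
    return out


def bst_sort(items):
    return in_order(build_bst(items))
```

Since `in_order` is linear, any comparison-based `build_bst` that ran in better than Ω(nlogn) would make `bst_sort` beat the sorting lower bound.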
The construction of the BST will be O(n log(n)).
You will need to insert each and every node, which means n insertions.
Each insertion needs O(log(n)) comparisons, provided the tree stays balanced.
Hence the total will be O(n log(n)).
Only in the special case where the array is already sorted can the tree be built in O(n), by recursively taking the middle element as the root. (Inserting sorted keys one at a time into a plain, unbalanced BST instead degrades to O(n^2).)
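For the already-sorted case, a balanced tree can be built without any key comparisons by always picking the middle element as the root. The sketch below represents a tree as a nested `(left, key, right)` tuple purely for brevity; that representation and the helper names are assumptions of this example.

```python
def build_balanced(sorted_keys, lo=0, hi=None):
    """Build a balanced BST from sorted_keys[lo:hi] as (left, key, right) tuples."""
    if hi is None:
        hi = len(sorted_keys)
    if lo >= hi:
        return None
    mid = (lo + hi) // 2  # the middle element becomes the root of this subtree
    return (
        build_balanced(sorted_keys, lo, mid),
        sorted_keys[mid],
        build_balanced(sorted_keys, mid + 1, hi),
    )


def height(tree):
    """Number of nodes on the longest root-to-leaf path."""
    if tree is None:
        return 0
    return 1 + max(height(tree[0]), height(tree[2]))


small = build_balanced(list(range(15)))
large = build_balanced(list(range(1000)))
```

Each element is touched once, so the construction is O(n), and the resulting height stays logarithmic (4 levels for 15 keys, 10 levels for 1000 keys).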
Does every algorithm have a 'best case' and a 'worst case'? This question was raised by someone who answered it with no! I thought that every algorithm has cases depending on its input, so that one algorithm finds a particular set of inputs to be its best case while other algorithms consider it their worst case.
So which answer is correct, and if there are algorithms that don't have a best case, can you give an example?
Thank You :)
No, not every algorithm has a best and worst case. An example of that is the linear search to find the max/min element in an unsorted array: it always checks all items in the array no matter what. Its time complexity is therefore Theta(N) and it's independent of the particular input.
Best case input is the case in which your code would take the least number of procedure calls. E.g. you have an if in your code and in that, you iterate over every element, with no such functionality in the else part. So, any input for which the code does not enter the if block will be the best case input and conversely, any input for which the code enters this if will be the worst case for this algorithm.
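A small sketch of the max-finding scan described above, with an explicit comparison counter added for illustration (the counter is not part of the algorithm itself):

```python
def find_max(items):
    """Return (maximum, comparisons) for a non-empty list."""
    comparisons = 0
    best = items[0]
    for value in items[1:]:
        comparisons += 1  # exactly one comparison per remaining element
        if value > best:
            best = value
    return best, comparisons
```

Whatever the order of the input, a list of n items costs n - 1 comparisons, so the best and worst cases coincide.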
If, for any algorithm, branching or recursion or looping causes a difference in complexity factor for that algorithm, it will have a possible best case or possible worst case scenario. Otherwise, you can say that it does not or that it has similar complexity for best case or worst case.
Talking about sorting algorithms, let's take the example of merge sort and quick sort. (I believe you know them well, and their complexities for that matter.)
In merge sort every time, array is divided into two equal parts thus taking log n factor in splitting while in recombining, it takes O(n) time (for every split, of course). So, total complexity is always O(n log n) and it does not depend on the input. So, you can either say merge sort has no best/worst case conditions or its complexity is same for best/worst cases.
On the other hand, if quick sort (not randomized, pivot always the 1st element) is given a random input, it will always divide the array in two parts, (may or may not be equal, doesn't matter) and if it does this, log factor of its complexity comes into picture (though base won't always be 2). But, if the input is sorted already (ascending or descending) it will always split it into 1 element + rest of array, so will take n-1 iterations to split the array, which changes its O(log n) factor to O(n) thereby changing complexity to O(n^2). So, quick sort will have best and worst cases with different time complexities.
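The difference between quick sort's cases can be made concrete by counting comparisons. This is a minimal sketch, assuming the first element is always the pivot and charging one comparison per non-pivot element in each partition step; the function name is invented for this example.

```python
import random


def quicksort_comparisons(items):
    """Comparisons made by quick sort with the first element as pivot."""
    if len(items) <= 1:
        return 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    # Each element of rest is compared against the pivot once.
    return len(rest) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)


n = 100
sorted_cost = quicksort_comparisons(list(range(n)))

shuffled = list(range(n))
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable
shuffled_cost = quicksort_comparisons(shuffled)
```

For the already-sorted input the count is n(n-1)/2 = 4950, while a shuffled input of the same size needs far fewer comparisons.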
Well, I believe every algorithm has a best and worst case though there's no guarantee that they will differ. For example, the algorithm to return the first element in an array has an O(1) best, worst and average case.
Contrived, I know, but what I'm saying is that it depends entirely on the algorithm what their best and worst cases are, but the cases will exist, even if they're the same, or unbounded at the top end.
I think it's reasonable to say that most algorithms have a best and a worst case. If you think about algorithms in terms of Asymptotic Analysis you can say that an O(n) search algorithm will perform worse than an O(log n) algorithm. However, if you provide the O(n) algorithm with data where the search item is early on in the data set and the O(log n) algorithm with data where the search item is in the last node to be found, the O(n) algorithm will perform faster than the O(log n) one.
However an algorithm that has to examine each of the inputs every time such as an Average algorithm won't have a best/worst as the processing time would be the same no matter the data.
If you are unfamiliar with Asymptotic Analysis (AKA big O) I suggest you learn about it to get a better understanding of what you are asking.
Trying to wrap my head around some basic and common algorithms.. my current understanding of the question is in bold.
( 1 ) Assuming we have a sorted array with n items: at most how many times will a binary search compare elements?
I keep seeing ' O(log(n)) ' pop up as a general answer for this type of question but I do not understand why. Is there not a whole number that answers this question (i.e. 2 or 3?)
( 2 ) Assuming we have an array with n items: at most how many times will a linear search compare elements?
Again, same as above, but now ' O(n) ' seems to be the general answer to this question. Again, I do not really understand the power behind this answer and question why there is not some whole number answer?
( 3 ) Can someone explain an example when a linear search would be better than a binary search?
From the information I have gathered, it generally seems like a binary search is a better option, if possible, because of its quickness. I'm having trouble seeing when a linear search would be a better option.
regarding 1&2, an absolute number as an answer would've been possible if an absolute number was provided as the size of the input. since the question asks about an arbitrarily sized array (of length n) then answer is also given in these terms.
you can read more about big O notation for details, but basically O(n) & O(log n) mean order of n & order of log(n) respectively. i.e. if the input size is 100, for example, the number of compared elements using linear search will also be in the order of 100 while using a binary search would require comparing ~ log(100) elements.
as for 3, binary search requires the input to be sorted...
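Once a concrete n is given, the "whole number" asked about in 1 and 2 can be computed. The sketch below brute-forces the worst case of a textbook binary search (one three-way probe per step) over every possible target position; the helper names are made up for this example.

```python
def binary_search_probes(n, target_index):
    """Probes a textbook binary search needs to reach target_index among n sorted items."""
    lo, hi, probes = 0, n - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if mid == target_index:
            return probes
        if mid < target_index:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes


def worst_case_probes(n):
    """Largest probe count over all n possible target positions."""
    return max(binary_search_probes(n, i) for i in range(n))
```

For n = 100 the worst case is 7 probes and for n = 1000 it is 10, i.e. floor(log2(n)) + 1, whereas a linear search may have to look at all 100 or 1000 elements.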
The O notation is about limiting behaviour. A binary search divides the list into two: either you have found the item or you have half of the list left to search. Hence the limiting behaviour of O(log n), i.e. the depth at which you reach a leaf of the search tree.
Linear search just starts at the beginning. The worst case (the limit) is when the element is at the end.
For (3), if the item is the first in the list you have hit the jackpot, so linear search would be better in that case.
First of all, you are dealing with O notation, so a "whole number" isn't possible. Answers are always in O(f(n)) format where f(n) is some function of n. If you aren't sure what Big O notation is all about, then starting from http://en.wikipedia.org/wiki/Big_O_notation may help.
(1) With binary search on a sorted array, the search space is repeatedly reduced by half, until the item is located. If we think about an ideal binary search implementation where every operation takes constant time, in the worst case we will need to examine approximately log n items, which will take O(log n) time. As to the math behind log n: it isn't terribly difficult but it's hard to type it out on an iPhone. Hint: Google is your friend.
(2) In a linear search of an unsorted array, we may have to examine every item in the array. Again, this is a simplification, but assuming that every operation in our algorithm takes constant time, in the worst case we must look at all n items. Hence O(n).
(3) What is the difference in the data you must search in (1) and (2)? Remember that sorting is, optimally, O(nlogn).
The following video provides a comprehensive explanation. I hope it will help you to understand the difference between binary search and linear search.
For question3:
binary search requires, sorted sequence.
for sequence 1,2,3,...,100. When you want to find element 1, linear search will be faster. It will just check the first element.
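The 1..100 example above can be checked by counting how many elements each search inspects. A minimal sketch with illustrative helper names:

```python
def linear_probes(a, target):
    """Number of elements a linear search inspects before stopping."""
    for count, value in enumerate(a, start=1):
        if value == target:
            return count
    return len(a)


def binary_probes(a, target):
    """Number of elements a textbook binary search inspects before stopping."""
    lo, hi, count = 0, len(a) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1
        if a[mid] == target:
            return count
        if a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return count


sequence = list(range(1, 101))
```

Searching for 1 costs the linear search a single look but the binary search 6; searching for 100 costs 100 looks versus 7.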
Problem
I have an application where I want to sort an array a of elements a0, a1,...,an-1. I have a comparison function cmp(i,j) that compares elements ai and aj and a swap function swap(i,j), that swaps elements ai and aj of the array. In the application, execution of the cmp(i,j) function might be extremely expensive, to the point where one execution of cmp(i,j) takes longer than any other steps in the sort (except for other cmp(i,j) calls, of course) together. You may think of cmp(i,j) as a rather lengthy IO operation.
Please assume for the sake of this question that there is no way to make cmp(i,j) faster. Assume all optimizations that could possibly make cmp(i,j) faster have already been done.
Questions
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
It is possible in my application to write a predicate expensive(i,j) that is true iff a call to cmp(i,j) would take a long time. expensive(i,j) is cheap and expensive(i,j) ∧ expensive(j,k) → expensive(i,k) mostly holds in my current application. This is not guaranteed though.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I'd like pointers to further material on this topic.
Example
This is an example that is not entirely unlike the application I have.
Consider a set of possibly large files. In this application the goal is to find duplicate files among them. This essentially boils down to sorting the files by some arbitrary criterion and then traversing them in order, outputting sequences of equal files that were encountered.
Of course reading in large amounts of data is expensive; therefore one can, for instance, read only the first megabyte of each file and calculate a hash function on this data. If the files compare equal, so do the hashes, but the reverse may not hold. Two large files could differ in only one byte near the end.
The implementation of expensive(i,j) in this case is simply a check whether the hashes are equal. If they are, an expensive deep comparison is necessary.
I'll try to answer each question as best as I can.
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Traditional sorting methods may have some variation, but in general, there is a mathematical limit to the minimum number of comparisons necessary to sort a list, and most algorithms take advantage of that, since comparisons are often not inexpensive. You could try sorting by something else, or try using a shortcut that may be faster that may approximate the real solution.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I don't think you can get around the necessity of doing at least the minimum number of comparisons, but you may be able to change what you compare. If you can compare hashes or subsets of the data instead of the whole thing, that could certainly be helpful. Anything you can do to simplify the comparison operation will make a big difference, but without knowing specific details of the data, it's hard to suggest specific solutions.
I'd like pointers to further material on this topic.
Check these out:
Apparently Donald Knuth's The Art of Computer Programming, Volume 3 has a section on this topic, but I don't have a copy handy.
Wikipedia of course has some insight into the matter.
Sorting an array with minimal number of comparisons
How do I figure out the minimum number of swaps to sort a list in-place?
Limitations of comparison based sorting techniques
The theoretical minimum number of comparisons needed to sort an array of n elements on average is lg (n!), which is about n lg n - 1.44 n. There's no way to do better than this on average if you're using comparisons to order the elements.
Of the standard O(n log n) comparison-based sorting algorithms, mergesort makes the lowest number of comparisons (just about n lg n, compared with about 1.39 n lg n on average for quicksort and about n lg n + 2n for heapsort), so it might be a good algorithm to use as a starting point. Typically mergesort is slower than heapsort and quicksort, but that's usually under the assumption that comparisons are fast.
If you do use mergesort, I'd recommend using an adaptive variant of mergesort like natural mergesort so that if the data is mostly sorted, the number of comparisons is closer to linear.
There are a few other options available. If you know for a fact that the data is already mostly sorted, you could use insertion sort or a standard variation of heapsort to try to speed up the sorting. Alternatively, you could use mergesort but use an optimal sorting network as a base case when n is small. This might shave off enough comparisons to give you a noticeable performance boost.
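To see how close mergesort gets to the lg(n!) bound discussed above, one can count its comparisons over every permutation of a small input. This is a minimal sketch of a plain top-down mergesort (not the adaptive variant); the names are chosen for this example.

```python
import math
from itertools import permutations


def merge_sort(items):
    """Return (sorted_list, number_of_comparisons)."""
    if len(items) <= 1:
        return list(items), 0
    mid = len(items) // 2
    left, c_left = merge_sort(items[:mid])
    right, c_right = merge_sort(items[mid:])
    merged, comparisons = [], c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


def lower_bound(n):
    """ceil(lg(n!)): minimum worst-case comparisons for any comparison sort."""
    return math.ceil(math.log2(math.factorial(n)))


worst = max(merge_sort(p)[1] for p in permutations(range(6)))
```

For n = 6 the bound is ceil(lg 720) = 10, and this mergesort never needs more than 11 comparisons, so there is little left to gain from a cleverer general-purpose comparison sort.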
Hope this helps!
A technique called the Schwartzian transform can be used to reduce any sorting problem to that of sorting integers. It requires you to apply a function f to each of your input items, where f(x) < f(y) if and only if x < y.
(Python-oriented answer, when I thought the question was tagged [python])
If you can define a function f such that f(x) < f(y) if and only if x < y, then you can sort using
L.sort(key=f)
Python guarantees that key is called at most once for each element of the iterable you are sorting. This provides support for the Schwartzian transform.
Python 3 does not support specifying a cmp function, only the key parameter. This page provides a way of easily converting any cmp function to a key function.
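A short sketch of both points, assuming the standard-library helper `functools.cmp_to_key` as one way to do the cmp-to-key conversion (it may or may not be what the linked page describes); the call log is only there to show how often the key function runs.

```python
from functools import cmp_to_key

call_log = []


def expensive_key(word):
    """Stand-in for a costly key computation; records each call."""
    call_log.append(word)
    return len(word)


words = ["pear", "fig", "banana", "kiwi", "apple"]
by_length = sorted(words, key=expensive_key)  # key is computed once per element


def compare(a, b):
    """Old-style cmp function: negative, zero or positive."""
    return (a > b) - (a < b)


alphabetical = sorted(words, key=cmp_to_key(compare))
```

Note that `cmp_to_key` only adapts the interface: the underlying `compare` is still called once per comparison the sort makes, so it does not reduce the number of expensive comparisons the way a cheap precomputed key does.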
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
Edit: Ah, sorry. There are algorithms that minimize the number of comparisons (below), but not that I know of for specific elements.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
Not that I know of, but perhaps you'll find it in these papers below.
I'd like pointers to further material on this topic.
On Optimal and Efficient in Place Merging
Stable Minimum Storage Merging by Symmetric Comparisons
Optimal Stable Merging (this one seems to be O(n log2 n) though)
Practical In-Place Mergesort
If you implement any of them, posting them here might be useful for others too! :)
Is there a sorting algorithm that minimizes the number of calls to cmp(i,j)?
The merge insertion algorithm, described in D. Knuth's "The Art of Computer Programming", Vol 3, chapter 5.3.1, uses fewer comparisons than other comparison-based algorithms. But it still needs O(N log N) comparisons.
Would the existence of expensive(i,j) allow for a better algorithm that tries to avoid expensive comparing operations? If yes, can you point me to such an algorithm?
I think some of existing sorting algorithms may be modified to take into account expensive(i,j) predicate. Let's take the simplest of them - insertion sort. One of its variants, named in Wikipedia as binary insertion sort, uses only O(N log N) comparisons.
It employs a binary search to determine the correct location to insert new elements. We could apply expensive(i,j) predicate after each binary search step to determine if it is cheap to compare the inserted element with "middle" element found in binary search step. If it is expensive we could try the "middle" element's neighbors, then their neighbors, etc. If no cheap comparisons could be found we just return to the "middle" element and perform expensive comparison.
There are several possible optimizations. If the predicate and/or cheap comparisons are not so cheap, we could roll back to the "middle" element earlier, before all other possibilities are tried. Also, if move operations cannot be considered very cheap, we could use some order statistics data structure (like an indexable skiplist) to reduce insertion cost to O(N log N).
This modified insertion sort needs O(N log N) time for data movement, O(N^2) predicate computations and cheap comparisons, and O(N log N) expensive comparisons in the worst case. But more likely there would be only O(N log N) predicates and cheap comparisons and O(1) expensive comparisons.
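The modified binary insertion sort described above could look roughly like this. It is a minimal sketch under a few assumptions: it works on values rather than on indices via cmp(i,j)/swap(i,j), it takes the comparison as a `less(a, b)` function, it uses a plain Python list (so data movement is not O(N log N)), and it does not implement the early roll-back optimization.

```python
def pick_probe(x, out, lo, hi, expensive):
    """Index in [lo, hi) closest to the middle whose comparison with x is cheap.

    Falls back to the middle itself if every comparison in the range is expensive.
    """
    mid = (lo + hi) // 2
    for d in range(hi - lo):
        candidates = (mid,) if d == 0 else (mid - d, mid + d)
        for cand in candidates:
            if lo <= cand < hi and not expensive(x, out[cand]):
                return cand
    return mid


def insertion_sort_prefer_cheap(items, less, expensive):
    """Return (sorted_list, counts of cheap and expensive comparisons)."""
    out = []
    counts = {"cheap": 0, "expensive": 0}
    for x in items:
        lo, hi = 0, len(out)
        while lo < hi:
            p = pick_probe(x, out, lo, hi, expensive)
            counts["expensive" if expensive(x, out[p]) else "cheap"] += 1
            if less(x, out[p]):
                hi = p        # x belongs somewhere before position p
            else:
                lo = p + 1    # x belongs somewhere after position p
        out.insert(lo, x)
    return out, counts
```

Any probe inside the current range keeps the search correct; choosing one away from the middle only costs extra cheap steps. For example, `insertion_sort_prefer_cheap(data, lambda a, b: a < b, lambda a, b: abs(a - b) <= 1)` treats comparisons between neighbouring integers as the expensive ones.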
Consider a set of possibly large files. In this application the goal is to find duplicate files among them.
If the only goal is to find duplicates, I think sorting (at least comparison sorting) is not necessary. You could just distribute the files between buckets depending on hash value computed for first megabyte of data from each file. If there are more than one file in some bucket, take other 10, 100, 1000, ... megabytes. If still more than one file in some bucket, compare them byte-by-byte. Actually this procedure is similar to radix sort.
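The bucketing idea above can be sketched as follows. This simplified version uses a single prefix hash followed directly by a byte-by-byte comparison (it skips the 10, 100, 1000 MB escalation), and it additionally buckets by file size, which is an assumption of this example rather than something stated above.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def prefix_hash(path, prefix_bytes):
    """SHA-256 of the first prefix_bytes bytes of the file."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read(prefix_bytes)).hexdigest()


def find_duplicates(paths, prefix_bytes=1 << 20):
    """Return a list of groups (lists of paths) whose contents are identical."""
    buckets = defaultdict(list)
    for path in paths:
        key = (os.path.getsize(path), prefix_hash(path, prefix_bytes))
        buckets[key].append(path)

    duplicates = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue  # a file alone in its bucket cannot have a duplicate
        groups = []
        for path in candidates:
            for group in groups:
                # Expensive step: full byte-by-byte comparison.
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                groups.append([path])
        duplicates.extend(g for g in groups if len(g) > 1)
    return duplicates
```

Only files that share a size and a prefix hash ever reach the full comparison, so files that differ early are never read beyond their first megabyte.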
Most sorting algorithms out there try to minimize the number of comparisons during sorting.
My advice:
Pick quick-sort as a base algorithm and memoize the results of comparisons just in case you happen to compare the same elements again. This should help you in the O(N^2) worst case of quick-sort. Bear in mind that this will make you use O(N^2) memory.
Now if you are really adventurous you could try the Dual-Pivot quick-sort.
Something to keep in mind is that if you are continuously sorting the list with new additions, and the comparison between two elements is guaranteed to never change, you can memoize the comparison operation which will lead to a performance increase. In most cases this won't be applicable, unfortunately.
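A memoized comparison could be sketched like this, assuming a `cmp(i, j)` that returns a negative, zero or positive number, whose result never changes, and for which `cmp(j, i)` is always the negation of `cmp(i, j)`. The wrapper name is made up for this example.

```python
def memoize_cmp(cmp):
    """Wrap cmp(i, j) so each unordered pair of indices is compared at most once."""
    cache = {}

    def cached(i, j):
        if i == j:
            return 0
        key = (i, j) if i < j else (j, i)
        if key not in cache:
            cache[key] = cmp(*key)  # the only place the expensive call happens
        result = cache[key]
        # Flip the sign when the caller asked for the reversed pair.
        return result if key == (i, j) else -result

    return cached
```

The cache can grow to one entry per pair of elements, which is where the O(N^2) memory mentioned above comes from.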
We can look at your problem from another direction. It seems your problem is IO-related, so you can take advantage of parallel sorting algorithms. In fact, you can run many threads to perform comparisons on files, then sort them with one of the well-known parallel algorithms, such as sample sort.
Mergesort and (on average) quicksort are asymptotically as fast as a comparison sort can be, unless you have some additional information about the elements you want to sort. They need O(n log(n)) comparisons, where n is the size of your array (quicksort degrades to O(n^2) in its worst case).
It is mathematically proven that no generic comparison-based sorting algorithm can be more efficient than that.
If you want to make the procedure faster, you might consider adding some metadata to accelerate the computation (can't be more precise unless you are, too).
If you know something stronger, such as the existence of a maximum and a minimum, you can use faster sorting algorithms, such as radix sort or bucket sort.
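As an illustration of sorting without comparisons when the range is known, here is a counting sort, which is the simplest relative of the bucket and radix sorts mentioned above (it is not either of those algorithms itself). It assumes integer values with a known minimum and maximum.

```python
def counting_sort(values, minimum, maximum):
    """Sort integers known to lie in [minimum, maximum] without comparing them."""
    counts = [0] * (maximum - minimum + 1)
    for v in values:
        counts[v - minimum] += 1  # tally how often each value occurs
    out = []
    for offset, count in enumerate(counts):
        out.extend([minimum + offset] * count)
    return out
```

The running time is O(n + k), where k is the size of the value range, so this only pays off when the range is not much larger than the input.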
You can look for all the mentioned algorithms on wikipedia.
As far as I know, you can't benefit from the expensive relationship. Even if you know that, you still need to perform such comparisons. As I said, you'd better try and cache some results.
EDIT I took some time to think about it, and I came up with a slightly customized solution, that I think will make the minimum possible amount of expensive comparisons, but totally disregards the overall number of comparisons. It will make at most (n-m)*log(k) expensive comparisons, where
n is the size of the input vector
m is the number of distinct component which are easy to compare between each other
k is the maximum number of elements which are hard to compare and have consecutive ranks.
Here is the description of the algorithm. It's worth noting that it will perform much worse than a simple merge sort, unless m is big and k is small. The total running time is O[n^4 + E(n-m)log(k)], where E is the cost of an expensive comparison (I assumed E >> n, to prevent it from being wiped out from the asymptotic notation). That n^4 can probably be further reduced, at least in the mean case.
EDIT The file I posted contained some errors. While trying it, I also fixed them (I overlooked the pseudocode for the insert_sorted function, but the idea was correct). I made a Java program that sorts a vector of integers, with delays added as you described. Even though I was skeptical, it actually does better than mergesort if the delay is significant (I used a 1s delay against integer comparison, which usually takes nanoseconds to execute).
In most of the calculation analysis of running times, we have assumed
that all inputs are equally likely. This is not true, because nearly
sorted input, for instance, occurs much more often than is
statistically expected, and this causes problems, particularly for
quicksort and binary search trees.
By using a randomized algorithm, the particular input is no longer
important. The random numbers are important, and we can get an
expected running time, where we now average over all possible random
numbers instead of over all possible inputs. Using quicksort with a
random pivot gives an O(n log n)-expected-time algorithm. This means
that for any input, including already-sorted input, the running time
is expected to be O(n log n), based on the statistics of random
numbers. An expected running time bound is somewhat stronger than an
average-case bound but, of course, is weaker than the corresponding
worst-case bound.
First, we will see a novel scheme for supporting the binary search
tree operations in O(log n) expected time. Once again, this means that
there are no bad inputs, just bad random numbers. From a theoretical
point of view, this is not terribly exciting, since balanced search
trees achieve this bound in the worst case. Nevertheless, the use of
randomization leads to relatively simple algorithms for searching,
inserting, and especially deleting.
My questions on the above text are:
What does the author mean by "An expected running time bound is somewhat stronger than an average-case bound but, of course, is weaker than the corresponding worst-case bound"?
Regarding binary search trees, what does the author mean by "since balanced search trees achieve this bound in the worst case"? My understanding is that the worst case for binary search trees is O(d), where d is the depth of the node, and this can be N, i.e. O(N). Does the author mean this is the same as the worst case above?
Thanks!
Like the author explained in the sentence before: An expected time must hold for any input. Average case is averaged over all inputs, that is, you get a reasonably mediocre input. Expected time means no matter how bad the input is, the algorithm must be able to compute it within the bound if the random number god is nice (i.e. gives you your expected value, and not the worst possible thing like she often does).
Balanced binary search trees. They can't reach depth N because they are balanced.
The author means that on average Quick Sort will produce slower results than O(n log n). (This is not true of all sorting algorithms, e.g. for merge sort expected time == average time == O(n log n) and no randomization is needed.)
O(d) = O(log n) for balanced trees
PS who is the author?
In randomized quicksort, we can't produce a bad input (one that would cause the worst case) even intentionally, since the random permutation makes the input order irrelevant. The randomized algorithm performs badly only if the random-number generator produces an unlucky permutation to be sorted. Nearly all permutations cause quicksort to perform close to the average case; there are very few permutations that cause near-worst-case behavior, and therefore the probability of a worst-case input is very low. So it almost always performs in O(n log n).
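A minimal sketch of this behavior: quicksort with a randomly chosen pivot, run on already-sorted input (the classic bad case for a fixed first-element pivot). The comparison counter charges one comparison per non-pivot element in each partition step, and the seed is fixed only to make the run repeatable; both are choices made for this example.

```python
import random


def randomized_quicksort(items, rng, counter):
    """Return a sorted copy of items; counter[0] accumulates pivot comparisons."""
    if len(items) <= 1:
        return list(items)
    pivot = items[rng.randrange(len(items))]  # the only source of randomness
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    counter[0] += len(items) - 1
    return (
        randomized_quicksort(smaller, rng, counter)
        + equal
        + randomized_quicksort(larger, rng, counter)
    )


n = 200
counter = [0]
result = randomized_quicksort(list(range(n)), random.Random(1), counter)
```

With a first-element pivot, this sorted input would cost n(n-1)/2 = 19900 comparisons; with a random pivot the count stays far below that, regardless of how the input was ordered.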