Performance of the binary search algorithm when there are many duplicates

http://katemats.com/interview-questions/ says:
You are given a sorted array and you want to find the number N. How do you do the search as quickly as possible (not just traversing each element)?
How would the performance of your algorithm change if there were lots of duplicates in the array?
My answer to the first question is binary search, which is O(log(n)), where n is the number of elements in the array.
According to this answer, "we have a maximum of log_2(n-1) steps" in the worst case when "element K is not present in A and smaller than all elements in A".
I think the answer to the second question is it doesn't affect the performance. Is this correct?
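For reference, a minimal sketch (in Java) of the binary search I have in mind; the array and parameter names are just illustrative:

// Standard iterative binary search on a sorted int array.
// Returns an index of key, or -1 if key is not present. O(log n) probes.
static int binarySearch(int[] a, int key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   // avoids overflow of (lo + hi) / 2
        if (a[mid] == key) return mid;
        else if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}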

If you are talking worst case / big O, then you are correct - log(n) is your bound. However, if your data is fairly uniformly distributed (or you can map it to that distribution), interpolating where to pick your partition can get log(log(n)) behavior. When you do the interpolation you also get rid of the worst cases where you are looking for one of the end elements (of course there are new pathological cases instead).
With many, many duplicates you might be willing to stride further away from the direct center on the next probe. With more duplicates you get a better margin for guessing correctly. While always choosing the half-way point gets you there in good time, educated guesses might get you some really excellent average behavior.
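For concreteness, here is a sketch of the interpolation idea, under the assumption of int keys that are roughly uniformly distributed (a production version needs more care with edge cases):

// Interpolation search: instead of probing the midpoint, estimate where key
// should lie, assuming values are spread roughly uniformly between a[lo] and
// a[hi]. Expected O(log log n) probes on uniform data, but O(n) on bad data.
static int interpolationSearch(int[] a, int key) {
    int lo = 0, hi = a.length - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo]) {                       // all remaining values equal
            return a[lo] == key ? lo : -1;
        }
        // Linear interpolation of the probe position (long math avoids overflow).
        int pos = lo + (int) (((long) key - a[lo]) * (hi - lo) / ((long) a[hi] - a[lo]));
        if (a[pos] == key) return pos;
        else if (a[pos] < key) lo = pos + 1;
        else hi = pos - 1;
    }
    return -1;
}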
When I interview, I like to hear those answers: knowledge of the textbook result and the theoretical bound, but also what can be done to specialize to the given situation. Often these constant factors can be really helpful (look at quicksort and its partition-choosing schemes).

I don't think having duplicates matters.
You're looking for a particular number N, what matters is whether or not the current node matches N.
If I'm looking for the number 1 in the list 1-2-3-4-5-6 the performance would be identical to searching the list 1-9-9-9-9-9.
If the number N is duplicated then you will have a chance of finding it a couple steps sooner. For example if the same search was done on the list 1-1-1-1-1-9.
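A quick way to check this is to count the midpoint probes directly; the sketch below is just the same binary search with a counter bolted on:

// Counts how many midpoint probes a plain binary search makes for a key.
static int probes(int[] a, int key) {
    int lo = 0, hi = a.length - 1, count = 0;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        count++;
        if (a[mid] == key) return count;
        else if (a[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return count;   // key absent: still the number of probes made
}

// probes(new int[]{1, 2, 3, 4, 5, 6}, 1) == 2
// probes(new int[]{1, 9, 9, 9, 9, 9}, 1) == 2   (identical, as claimed above)
// probes(new int[]{1, 1, 1, 1, 1, 9}, 1) == 1   (a duplicate is hit one step sooner)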

Related

Hashtable with chaining efficiency if linked lists are sorted

I am currently working on an exercise from CLRS; here is the problem:
11.2-3
Professor Marley hypothesizes that he can obtain substantial performance gains by modifying the chaining scheme to keep each list in sorted order. How does the professor's modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?
I saw on the internet that the answer is the following:
I do not understand why the result is like this. My answer is that since the linked lists are sorted, we can use binary search, so the expected search time (as well as the worst-case time) is Θ(log2(α)) (α being the load factor n/m, n being the number of keys actually stored in the table and m its capacity).
I agree that deletion still takes Θ(1) time (if the lists are doubly linked), and I said insertion will now take Θ(log2(α)) because you need to determine the correct place for the element you are adding to the list. Why is this not the correct answer?
A technical point: If you store the buckets as a linked list, then you can't use binary search to look over the items in that linked list in time O(log b), where b is the number of items in the bucket.
But let's suppose that instead of doing this you use dynamic arrays for each bucket. Then you could drop the runtimes down by a log factor, in an asymptotic sense. However, if you were to do that:
You're now using dynamic arrays rather than linked lists for your buckets. If you're storing large elements in your buckets, since most buckets won't be very loaded, the memory overhead of the unused slots in the array will start to add up.
From a practical perspective, you now need some way of comparing the elements you're hashing from lowest to highest. In Theoryland, that's not a problem. In practice, though, this could be a bit of a nuisance.
But more importantly, you might want to ask whether this is worthwhile in the first place. Remember that in a practical hash table the choice of α you'll be using is probably going to be very small (say, α ≤ 5 or something like that). For small α, a binary search might actually be slower than a linear scan, even if in theory for sufficiently large α it's faster.
So generally, you don't see this approach used in practice. If you're looking to speed up a hash table, it's probably better to change hashing strategies (say, use open addressing rather than chaining) or to try to squeeze performance out in other ways.
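That said, if you did want to experiment with the sorted dynamic-array variant, a rough sketch of one bucket might look like the following (the class and method names are made up for illustration, with plain int keys):

import java.util.ArrayList;
import java.util.Collections;

// One bucket of a chained hash table, stored as a sorted dynamic array
// instead of a linked list, so lookups can use binary search in O(log b).
class SortedBucket {
    private final ArrayList<Integer> keys = new ArrayList<>();

    // O(log b) lookup via binary search.
    boolean contains(int key) {
        return Collections.binarySearch(keys, key) >= 0;
    }

    // O(log b) to find the slot, but still O(b) to shift elements over.
    void insert(int key) {
        int pos = Collections.binarySearch(keys, key);
        if (pos < 0) keys.add(-pos - 1, key);   // decode the insertion point
    }
}

Note the insert: binary search finds the position quickly, but the shift inside the dynamic array is linear, which is one more reason the modification buys less than it first appears to.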

Comparison-based sorting takes at least n log n time in the worst case, so what about the best/average case?

There is a theorem in Cormen which says...
(Th 8.1)
"For comparison-based sorting techniques you cannot have an algorithm to sort a given list which takes less than n log n time (comparisons) in the worst case"
I.e.
the worst-case time complexity is Omega(n log n) for comparison-based sorting techniques...
Now what I was searching for is whether there exists a corresponding statement for the best case, or even for the average case,
which states something like:
"You cannot have a sorting algorithm which takes time less than some X to sort a given list of elements... in the best case."
Basically, do we have any lower bound for the best case, or for that matter for the average case? (I tried my best to find this, but couldn't find it anywhere.) Please also tell me whether the point I am raising is even worth it.
Great question! The challenge with defining “average case” complexity is that you have to ask “averaged over what?”
For example, if we assume that the elements of the array have an equal probability of being in any one of the n! possible permutations of n elements, then the Ω(n log n) bound on comparison sorting still holds, though I seem to remember that the proof of this is fairly complicated.
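A compressed, sketched version of that decision-tree argument for the uniform case (not the full proof):

% Sketch: a comparison sort corresponds to a binary decision tree with at
% least n! reachable leaves, one per input permutation. By Kraft's inequality
% and convexity, the average leaf depth of a binary tree with L leaves is at
% least \log_2 L, so averaging over uniformly random permutations gives
\[
  \mathbb{E}[\#\text{comparisons}]
    \;=\; \frac{1}{n!} \sum_{\pi} \operatorname{depth}(\ell_\pi)
    \;\ge\; \log_2(n!)
    \;=\; \Theta(n \log n).
\]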
On the other hand, suppose there are trends in the data (say, you're measuring temperatures over the course of a day, where you know they generally trend upward and then downward). Many real-world data sets look like this, and there are algorithms like Timsort that can take advantage of those patterns to speed up performance. So perhaps "average" here would mean "averaged over all possible plots formed by a rising and then falling sequence with noise terms added in." I haven't encountered anyone analyzing algorithms in those cases, but I'm sure some work has been done there, and there may even be some nice average-case measures that are less well known.

Sorting Algorithm that minimizes the maximum number of comparisons in which individual items are involved

I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose the theoretical minimum is O(log(n)), obtained by dividing the total number of comparisons above by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
(My previous question has been put on hold - This is now a very clear question, if you can't understand it then please explain why)
Yes, you definitely should do simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There cannot, however, be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour, while your question seems to target smaller problem sizes. So Big-O can hint at the best candidates for sufficiently large inputs to a sort run. (But if, for example, you are interested in size <= 5, the results may be completely different!)
To get a proper estimate of the comparison operations, you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the dataset being sorted.
Also, "on average" is not well defined in your context. I assume you mean the number of comparisons involving the participating objects within a given sort, and not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object takes part in may show a large standard deviation in one case and be (nearly) uniform in another.
As the complexity of a sorting algorithm is determined by the total number of comparisons (and the moves that follow from them), I do not expect theoretical analysis to contribute much to an answer.
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
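As a concrete starting point for such a simulation, here is a rough sketch (in Java) that records, per element, how many comparisons it took part in; the class and field names are illustrative, and the JDK's Arrays.sort merely stands in for whichever algorithms you want to compare:

import java.util.Arrays;
import java.util.Random;

public class ComparisonCountingDemo {

    // Wraps a value and counts every comparison this element participates in.
    static final class Item implements Comparable<Item> {
        final int value;
        int comparisons;
        Item(int value) { this.value = value; }
        public int compareTo(Item other) {
            this.comparisons++;
            other.comparisons++;
            return Integer.compare(this.value, other.value);
        }
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        Item[] items = new Item[1000];
        for (int i = 0; i < items.length; i++) items[i] = new Item(rng.nextInt());

        Arrays.sort(items);   // replace with your own sort(s) to compare algorithms

        int max = 0;
        long total = 0;
        for (Item it : items) {
            max = Math.max(max, it.comparisons);
            total += it.comparisons;
        }
        System.out.println("total comparisons: " + total / 2);  // each comparison counted twice
        System.out.println("max per element:   " + max);
    }
}

Repeating this over many random (or partially pre-sorted) inputs gives the two distributions you describe: the total comparison count and the per-element counts.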

How does random shuffling in quick sort help in increasing the efficiency of the code?

I was going through lecture videos by Robert Sedgwick on algorithms, and he explains that random shuffling ensures we don't get to encounter the worst case quadratic time scenario in quick sort. But I am unable to understand how.
It's really an admission that although we often talk about average case complexity, we don't in practice expect every case to turn up with the same probability.
Sorting an already sorted array is worst case in quicksort, because whenever you pick a pivot, you discover that all the elements get placed on the same side of the pivot, so you don't split into two roughly equal halves at all. And often in practice this already sorted case will turn up more often than other cases.
Randomly shuffling the data first is a quick way of ensuring that you really do end up with all cases turning up with equal probability, and therefore that this worst case will be as rare as any other case.
It's worth noting that there are other strategies that deal well with already sorted data, such as choosing the middle element as the pivot.
The assumption is that the worst case -- everything already sorted -- is frequent enough to be worth worrying about, and a shuffle is a black-magic, least-effort way to avoid that case without having to admit that by improving it you're moving the problem to a different input: the one which happens to get randomly shuffled into sorted order. Hopefully that bad case is much rarer, and even if it does come up, the randomness means the problem can't easily be reproduced and blamed on this cheat.
The concept of improving a common case at the expense of a rare one is fine. The randomness as an alternative to actually thinking about which cases will be more or less common is somewhat sloppy.
In the case of randomized QuickSort, since the pivot element is chosen randomly, we can expect the split of the input array to be reasonably well balanced on average, as opposed to the 1 and (n-1) split that the non-randomized version can be forced into. This helps prevent the worst-case behavior of QuickSort, which occurs with unbalanced partitioning.
Hence, the expected running time of the randomized version of QuickSort is O(n log n) on every input, not O(n^2).
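A sketch of that randomized-pivot version (a Lomuto-style partition is used here just for brevity; it is one common way to write it, not the only one):

import java.util.Random;

// Quicksort with a uniformly random pivot: no fixed input can force the
// repeated 1 / (n-1) splits that make a deterministic pivot rule quadratic.
class RandomizedQuickSort {
    private static final Random RNG = new Random();

    static void sort(int[] a) { sort(a, 0, a.length - 1); }

    private static void sort(int[] a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    private static int partition(int[] a, int lo, int hi) {
        swap(a, lo + RNG.nextInt(hi - lo + 1), hi);   // move a random pivot to the end
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) swap(a, i++, j);
        }
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}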
What does a random shuffle do to the distribution on the input space? To understand this, let's look at a probability distribution P defined over a set S, where P is not in our control. Let us create a probability distribution P' by applying a random shuffle over S to P. In other words, every time we get a sample from P, we map it, uniformly at random, to an element of S. What can you say about this resulting distribution P'?
P'(x) = Σ over all elements s in S of P(s) * (1/|S|) = 1/|S|
Thus, P' is just the uniform distribution over S. A random shuffle gives us control over the input probability distribution.
How is this relevant to quicksort? Well, we know the average complexity of quicksort. This is computed wrt the uniform probability distribution and that is a property we want to maintain on our input distribution, irrespective of what it really is. To achieve that, we do a random shuffle of our input array, ensuring that the distribution is not adversarial in any way.
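The shuffle itself can be a plain Fisher-Yates pass; a minimal sketch, assuming an int array:

// Fisher-Yates shuffle: each of the n! permutations of a comes out with
// probability exactly 1/n!, i.e. the uniform distribution P' described above.
static void shuffle(int[] a, java.util.Random rng) {
    for (int i = a.length - 1; i > 0; i--) {
        int j = rng.nextInt(i + 1);   // uniform in [0, i]
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}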
Is the video on Coursera?
Unfortunately, even with the shuffle, performance degrades to O(N^2) on data of the form n,n,...,n,1,1,...,1.
I have tested Quick.java with nn11.awk, which generates such data.
$ for N in 10000 20000 30000 40000; do time ./nn11.awk $N | java Quick; done | awk 'NF>1'
real 0m10.732s
user 0m10.295s
sys 0m0.948s
real 0m48.057s
user 0m44.968s
sys 0m3.193s
real 1m52.109s
user 1m48.158s
sys 0m3.634s
real 3m38.336s
user 3m31.475s
sys 0m6.253s

Upper bound and lower bound of sorting algorithm

This is a very simple question but I'm struggling too much to understand the concept completely.
I'm trying to understand the difference between the following statements:
There exists an algorithm which sorts an array of n numbers in O(n) in the best case.
Every algorithm sorts an array of n numbers in O(n) in the best case.
There exists an algorithm which sorts an array of n numbers in Omega(n) in the best case.
Every algorithm sorts an array of n numbers in Omega(n) in the best case.
I will first explain what is driving me crazy. I'm not sure regarding 1 and 3 - but I know that for one of them the answer is correct just by specifying one case and for the other one the answer is correct by examining all the possible inputs. Therefore I know one of them must be true just by specifying that the array is already sorted but I can't tell which.
My teacher always told me to think about it like checking who is the tallest guy in the class: for one of the options (1, 3) it's enough to point at him, and there is no reason to examine the whole class.
I do know that if we were to examine the worst case then none of these statements could be true because the best sorting algorithm without any assumptions or additional memory is Omega(nlogn).
IMPORTANT NOTE: I'm not looking for a solution (an algorithm which is able to do the matching sort) - only trying to understand the concept a little better.
Thank you!
For 1 + 3, ask yourself: do you know an algorithm that can sort an array in Theta(n) in the best case? If the answer is yes, then both 1 and 3 are true, since Theta(n) is O(n) ∩ Omega(n); so if you do have such an algorithm (that runs in Theta(n) best case), both 1 and 3 are correct.
Hint: optimized bubble sort (sketched below).
For 2: ask yourself - does EVERY algorithm sort an array of numbers in O(n) in the best case? Do you know an algorithm whose worst-case and best-case time complexity are identical? What happens to the mentioned bubble sort if you take all the optimizations off?
For 4: ask yourself - do you need to read all the elements in order to ensure the array is sorted? If you do, Omega(n) is a definite lower bound; you cannot do better than that.
Good Luck!
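The hinted optimization, sketched (assuming "optimized" means a bubble sort that stops after a pass with no swaps, so an already sorted array costs a single Theta(n) pass):

// Bubble sort with an early-exit flag. On an already sorted array the first
// pass makes n-1 comparisons, sees no swaps, and stops: best case Theta(n).
// The worst case is still Theta(n^2).
static void bubbleSort(int[] a) {
    boolean swapped = true;
    for (int pass = 0; swapped; pass++) {
        swapped = false;
        for (int i = 1; i < a.length - pass; i++) {
            if (a[i - 1] > a[i]) {
                int t = a[i - 1]; a[i - 1] = a[i]; a[i] = t;
                swapped = true;
            }
        }
    }
}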
The difference, obviously, is in the terms "O" and "Omega". One says "grows no faster than", the other says "grows no slower than".
Make sure that you understand the difference between those terms, and you'll see the difference in the sentences.
1 and 3 state completely different things, just as 2 and 4 do.
Look at those (those are NOT the same!):
1~ there exists an algorithm that, for 10 items, takes no more than 30 in the best case.
3~ there exists an algorithm that, for 10 items, takes no less than 30 in the best case.
2~ every algorithm, for 10 items, takes no more than 30 in the best case.
4~ every algorithm, for 10 items, takes no less than 30 in the best case.
Do you sense the difference now? With O/Omega the difference is similar, but the subject of investigation differs. The examples above talk about performance at some fixed point/case, while O/Omega notation tells you about performance relative to the size of the data, but only once the data is "large enough" (be it three items or millions), and it drops constant factors:
the function 1000000*n is O(n)
the function 0.000001*n*n is O(n^2)
For small amounts of data, the second one is obviously much better than the first. But as the quantity of data grows, the first soon becomes much better!
Rewriting the above examples into "more proper" terms, that are more similar to your original sentences:
1~ there exists an algorithm that, for more than N items, takes no more than X*N in the best case.
3~ there exists an algorithm that, for more than N items, takes no less than X*N in the best case.
2~ every algorithm, for more than N items, takes no more than X*N in the best case.
4~ every algorithm, for more than N items, takes no less than X*N in the best case.
I hope that this helps you with "seeing"/"feeling" the difference!

Resources