What's the worst case of random search? Say I have N elements, then search for one particular element.
Is the answer infinite? That makes sense to me, since I never find the element in worst case.
Then best case is just 1 right? What about average then?
You're in effect doing a simple random sample of the set (with replacement). The chance that any given element has been selected after n samples drawn from N elements is given by:
P(n) = 1 - (1 - 1/N)^n
Wikipedia article on Simple Random Sample.
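If it helps to see the numbers, here is a minimal simulation sketch (my own code, assuming "random search" means drawing a uniformly random index on every probe, i.e. with replacement): the number of draws until the target is found is then geometrically distributed, so the best case is 1, the expected number of draws is N, and there is no finite worst-case bound.
import random

def random_search(items, target):
    # Keep drawing random indices until the target is hit; return the draw count.
    draws = 0
    while True:
        draws += 1
        if items[random.randrange(len(items))] == target:
            return draws

N = 100
items = list(range(N))
trials = [random_search(items, 42) for _ in range(10_000)]
print(sum(trials) / len(trials))   # empirically close to N = 100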
Related
I've been learning data structures and algorithms from a book, which compares time efficiency in terms of the number of steps taken by various sorting algorithms. I'm confused as to what we define as one step while doing this.
So while counting the no. of steps we consider worst-case scenarios. I understood how we arrive at the no. of steps for bubble sort, but for selection sort I am confused about the part where we compare every element with the current lowest value.
For example, take the worst-case array, let's say 5, 4, 3, 2, 1, and let's say we are in the first pass-through. When we start, 5 is the current lowest value. When we move to 4 and compare it to 5, we change the current lowest value to 4.
Why isn't this action of changing the current lowest value to 4 counted as a swap or an additional step? It is a step separate from the comparison step. The book I am referring to states that in the first pass-through the number of comparisons is n-1 but the number of swaps is only 1, even in the worst case, for an array of size n. Here they are assuming that the step of changing the current lowest value is part of the comparison step, which I think is not a valid assumption, since there can be an array in which you compare but don't need to change the current lowest value, and hence your number of steps would come out smaller. The point being: we can't assume that the number of steps in the first pass-through for selection sort in the worst case is (n-1) comparisons + 1 swap. It should be more than (n-1) + 1.
I understand that both selection sort and bubble sort lie in the same classification of time complexity as per big O methodology, but the book goes on to claim that selection sort takes fewer steps than bubble sort in worst-case scenarios, and I'm doubting that. This is what the book says: https://ibb.co/dxFP0
Generally in these kinds of exercises you're interested in whether the algorithm is O(1), O(n), O(n^2) or something higher. You're generally not interested in whether a single step costs 1 unit or 2, or whether the total is 3n or 5n, because for sufficiently large n only the power of n matters.
To put it another way, small differences in the cost of each step, maybe factors of 2 or 3 or even 10, don't matter when weighed against choosing an algorithm that does a factor of n = 300 or more additional work.
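If you want to see how much the answer depends on what you decide to count, here is a small instrumented sketch (my own code and counter names, not the book's) that tallies comparisons, "new current lowest" updates, and swaps separately for selection sort:
def selection_sort_counts(a):
    # Selection sort that tallies comparisons, minimum updates, and swaps.
    comparisons = updates = swaps = 0
    n = len(a)
    for i in range(n - 1):
        lowest = i
        for j in range(i + 1, n):
            comparisons += 1
            if a[j] < a[lowest]:
                lowest = j          # the step the book folds into the comparison
                updates += 1
        a[i], a[lowest] = a[lowest], a[i]   # one swap per pass, as the book counts it
        swaps += 1
    return comparisons, updates, swaps

print(selection_sort_counts([5, 4, 3, 2, 1]))   # (10, 6, 4) for this worst-case-style input
Whether you count the updates changes the step total but not the O(n^2) growth, which is the point of the answer above.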
Best case is defined as the input of size n that is cheapest among all inputs of size n.
“The best case for my algorithm is n=1 because that is the fastest.” Is that right or wrong? If I give an input of large size N, it will take more time, and if I give an input with a smaller N, it will take less time, so doesn't that mean we depend on the size of the input? And if I search an N-sized array for some number (like 45) and the element is found at the end, does that also mean the worst case? (But where does N come from? Is it already fixed?)
I am confused about all this. If I consider both cases, I mean:
We fix the size of the array at N, i.e. I make an array of N items.
We give an element as the input to search for.
Does that mean the worst case, best case, and average case depend on both of the things mentioned above (the N-sized array, and the type of input)?
Am I right?
n is fixed, you cannot set it to 1: "is cheapest among all inputs of size n". Best case and worst case depend only on the type of input, which must be of size n.
For example, if you do a linear search among n elements, the best case is if you find it immediately on first try, the worst case is if you have to look at all n elements.
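A small sketch of that point (my own example; the size n stays fixed and only the position of the target changes, with the average assuming the target is equally likely to be at any position):
def linear_search_comparisons(a, target):
    # Return how many comparisons a linear search makes before stopping.
    for count, x in enumerate(a, start=1):
        if x == target:
            return count
    return len(a)   # not found: every element was examined

n = 10
a = list(range(n))
print(linear_search_comparisons(a, 0))       # best case for this n: 1 comparison
print(linear_search_comparisons(a, n - 1))   # worst case for this n: n comparisons
print(sum(linear_search_comparisons(a, t) for t in a) / n)   # average: (n + 1) / 2 = 5.5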
Well, the thing is, it is not the size of the input that is at issue here. Of course if you sort one element it will be fast, and if you search a one-element list it will be faster. We generalize this notion keeping in mind that the input size is n, and it is fixed with respect to this analysis. We can't say that mergesort with 1 element is faster than quicksort with 2 elements; it's not a valid comparison. With this being said:
Best case: an input for which the algorithm finishes fastest; the conditions and the input are as favourable as the algorithm could hope for.
Worst case: an input for which the algorithm takes the longest time.
Average case: the algorithm is run on many different inputs (not inputs of different sizes; the size is fixed at n), and then we take the average of the running times over all inputs of this given size n, weighted by a probability distribution.
So to answer your question: it's the type of input that we talk about, i.e. a property of the input. For example:
For quicksort the best case is O(n log n), the worst case is O(n^2), and the average case is O(n log n). (The worst case appears, for example, when the first element is always chosen as the pivot and the input is already sorted.)
Take the idea: for the best case we are not considering the size of the input. The best case of quicksort occurs when the pivot we pick happens to divide the array into two (almost) equal parts at every step. Again, notice that the input size we are considering is still n.
Check CLRS for the average-case analysis. Work through the math, or at least try to; it's fun to see how you derive it.
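In the meantime, here is a quick numerical sketch of where the best-case and worst-case bounds come from (my own code; best case: every pivot splits the array evenly, worst case: every pivot is an extreme element):
def t_best(n):
    # T(n) = 2*T(n/2) + n  ->  grows like n*log2(n)
    return 0 if n <= 1 else n + 2 * t_best(n // 2)

def t_worst(n):
    # T(n) = T(n-1) + n  ->  grows like n^2 / 2 (written as a loop to avoid deep recursion)
    total = 0
    while n > 1:
        total += n
        n -= 1
    return total

for n in (16, 256, 4096):
    print(n, t_best(n), t_worst(n))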
When it is stated that something is O(n), that means that the expected time is proportional to the number of elements in the input. This means that if you double the input, then you double the expected time of the work. An example of this is going through an array element by element until you find the result. Or adding all the elements of an array.
O(1) means that the function will take the same amount of time regardless of the amount of input. You'll see this when looking up a value in a hash. It is an indexed lookup, so it doesn't have to go through every element.
Something like O(n^2) means that the effort is proportional to the square of the number of elements involved. You'll see this when running over all pairs of elements: an array of 10 elements provides 100 different possible inputs to a function with 2 parameters.
Searching an ordered array can be done in O(log(n)) because you can probe an element in the middle, eliminate half of the array, and never have to search those elements.
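A quick sketch of that halving idea (my own code; it counts how many elements are probed on a sorted list):
def binary_search_probes(a, target):
    # Iterative binary search on a sorted list; returns (index or None, probes).
    lo, hi, probes = 0, len(a) - 1, 0
    while lo <= hi:
        probes += 1
        mid = (lo + hi) // 2
        if a[mid] == target:
            return mid, probes
        elif a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes

a = list(range(1024))
print(binary_search_probes(a, 0))   # (0, 10): about log2(1024) = 10 probes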
It depends on your algorithm. For example, if I want to access an element of an array, it takes the same time whatever the size is, because that is an O(1) operation. However, if you use an algorithm that takes O(N) time:
def find_max_element_in_an_array(A):
    a = float('-inf')        # running maximum
    for i in A:              # every element is examined once, hence O(N)
        if i > a:
            a = i
    return a
The bigger the array is, the slower the algorithm runs.
And there's a situation like this
def some_bored_pseudocode(A):
    if len(A) > 100:
        raise ValueError("oops, I don't need such a big array")
    i = 100
    while i != len(A):   # runs 100 - N times, so bigger arrays finish sooner
        i = i - 1
This one takes O(100 - N) time: about 100 - N iterations, so it actually gets faster as N grows (up to 100).
I know that Binary Search has time complexity of O(logn) to search for an element in a sorted array. But let's say if instead of selecting the middle element, we select a random element, how would it impact the time complexity. Will it still be O(logn) or will it be something else?
For example:
A traditional binary search in an array of size 18 will narrow down like 18 -> 9 -> 4 -> ...
My modified binary search picks a random element and decides to discard the right part or the left part based on its value.
My attempt:
Let C(N) be the average number of comparisons required by a search among N elements. For simplicity, we assume that the algorithm only terminates when there is a single element left (no early termination on strict equality with the key).
As the pivot value is chosen at random, the probabilities of the remaining sizes are uniform and we can write the recurrence
C(N) = 1 + (1/N) · Sum(1 <= i <= N : C(i))
Then
N·C(N) - (N-1)·C(N-1) = 1 + C(N)
and
C(N) - C(N-1) = 1/(N-1)
The solution of this recurrence is the harmonic series, hence the behavior is indeed logarithmic:
C(N) ~ ln(N-1) + gamma (Euler's constant)
Note that this is the natural logarithm, which is better than the base-2 logarithm by a factor of about 1.44!
My bet is that adding the early-termination test would improve the base of the logarithm further (while keeping the logarithmic behavior), but at the same time double the number of comparisons per step, so that globally it would be worse in terms of comparisons.
Let us assume we have an array of size 18 and the number I am looking for is in the 1st spot. In the worst case, the random pick always lands on the highest position (18 -> 17 -> 16 -> ...), effectively eliminating only one element in every iteration. So it becomes a linear search: O(n) time.
The recursion in the answer of @Yves Daoust relies on the assumption that the target element is located either at the beginning or the end of the array. In general, the position of the target within the array changes after each recursive call, making it difficult to write and solve the recurrence. Here is another solution that proves an O(log n) bound on the expected number of recursive calls.
Let T be the (random) number of elements checked by the randomized version of binary search. We can write T=sum I{element i is checked} where we sum over i from 1 to n and I{element i is checked} is an indicator variable. Our goal is to asymptotically bound E[T]=sum Pr{element i is checked}. For the algorithm to check element i it must be the case that this element is selected uniformly at random from the array of size at least |j-i|+1 where j is the index of the element that we are searching for. This is because arrays of smaller size simply won't contain the element under index i while the element under index j is always contained in the array during each recursive call. Thus, the probability that the algorithm checks the element at index i is at most 1/(|j-i|+1). In fact, with a bit more effort one can show that this probability is exactly equal to 1/(|j-i|+1). Thus, we have
E[T] = sum_i Pr{element i is checked} <= sum_i 1/(|j-i|+1) = O(log n),
where the last equality follows from the harmonic series bound.
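For anyone who wants to check this empirically, here is a small sketch (my own code; it terminates early on equality, so it matches the indicator-variable analysis above rather than the simplified recurrence):
import random

def randomized_binary_search(a, target):
    # Binary search on a sorted list, but the pivot index is chosen uniformly at random.
    lo, hi, probes = 0, len(a) - 1, 0
    while lo <= hi:
        probes += 1
        mid = random.randint(lo, hi)
        if a[mid] == target:
            return mid, probes
        elif a[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes

a = list(range(1 << 14))             # n = 16384, log2(n) = 14
trials = [randomized_binary_search(a, random.choice(a))[1] for _ in range(10_000)]
print(sum(trials) / len(trials))     # noticeably more than 14, but still growing like log n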
I was wondering if my line of thinking is correct.
I'm preparing for interviews (as a college student) and one of the questions I came across was to find the K largest numbers in an array.
My first thought was to just use a partial selection sort (e.g. scan the array from the first element, keep two variables for the lowest element seen and its index, swap with that index at the end of the array, and continue doing so until we've swapped K elements, then return a copy of the first K elements of that array).
However, this takes O(K*n) time. If I simply sorted the array using an efficient sorting method like Mergesort, it would only take O(n*log(n)) time to sort the entire array and return the K largest numbers.
Is it good enough to discuss these two methods during an interview (comparing log(n) and K of the input and going with the smaller of the two to compute the K largest) or would it be safe to assume that I'm expected to give a O(n) solution for this problem?
There exists an O(n) algorithm for finding the k'th smallest element, and once you have that element, you can simply scan through the list and collect the appropriate elements. It's based on Quicksort, but the reasoning behind why it works is rather hairy... There's also a simpler variation that probably will run in O(n). My answer to another question contains a brief discussion of this.
Here's a general discussion of this particular interview question found from googling:
http://www.geeksforgeeks.org/k-largestor-smallest-elements-in-an-array/
As for your question about interviews in general, it probably greatly depends on the interviewer. They usually like to see how you think about things. So, as long as you can come up with some sort of initial solution, your interviewer would likely ask questions from there depending on what they were looking for exactly.
IMHO, I think the interviewer wouldn't be satisfied with either of the methods if he says the dataset is huge (say a billion elements). In that case, if the K to be returned is also huge (nearing a billion), your partial selection sort would come close to O(n^2). I think it entirely depends on the intricacies of the question proposed.
EDIT: Aasmund Eldhuset's answer shows you how to achieve the O(n) time complexity.
If you want to find the K largest numbers (so for K = 5 you'll get five results, the five highest numbers), then the best you can get is O(n + k log n): you can build a priority queue in O(n) and then invoke pq.Dequeue() k times. If you are looking only for the K-th biggest number, then you can get it with an O(n) quicksort modification; it's called the k-th order statistic. The code looks like this (it's a randomized algorithm; average time is approximately O(n), but the worst case is O(n^2)):
import random

def quickselect(numbers, k):
    # Return the k-th smallest element (0-based k); expected O(n) time.
    # (For the K-th largest, ask for k = len(numbers) - K.)
    if len(numbers) == 1:
        return numbers[0]
    pivot = random.choice(numbers)
    # Partition around the pivot: smaller elements go left, bigger go right.
    left  = [x for x in numbers if x < pivot]
    right = [x for x in numbers if x > pivot]
    equal = len(numbers) - len(left) - len(right)   # copies of the pivot itself
    if k < len(left):
        return quickselect(left, k)
    elif k < len(left) + equal:
        return pivot
    else:
        return quickselect(right, k - len(left) - equal)
As I said, this algorithm is O(n^2) in the worst case because the pivot is chosen at random (however, the probability of a ~n^2 running time is something like 1/2^n). You can convert it into a deterministic algorithm with an O(n) worst case by using, for instance, the median of medians as the pivot, but it is slower in practice (due to the constant factor).
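For completeness, here is a minimal sketch of the O(n + k log n) priority-queue route mentioned at the start of this answer (using Python's heapq; negating the values is just one way to get max-heap behaviour):
import heapq

def k_largest(numbers, k):
    heap = [-x for x in numbers]   # negate so the min-heap acts as a max-heap
    heapq.heapify(heap)            # O(n)
    return [-heapq.heappop(heap) for _ in range(k)]   # k pops, O(k log n)

print(k_largest([7, 2, 9, 4, 11, 5, 3], 3))   # [11, 9, 7]
In practice, heapq.nlargest(k, numbers) does the same job in one call.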
I've been comparing the run times of various pivot selection algorithms. Surprisingly the simplest one where the first element is always chosen is the fastest. This may be because I'm filling the array with random data.
If the array has been randomized (shuffled), does it matter? For example, picking the median of 3 as the pivot is supposedly always(?) better than picking the first element as the pivot, but this isn't what I've noticed. Is it because, if the array is already randomized, there is no reason to assume sortedness, and using the median is only helpful when there is some degree of sortedness?
The worst-case runtime of quicksort is O(n²). Quicksort is a fast sorting algorithm only in the average case.
To reach an average runtime of O(n log n) you have to choose a random pivot element.
But instead of choosing a random pivot element, you can shuffle the list and choose the first element.
To see why this holds, look at it this way: let's say all elements start in some specific order. Shuffling means you apply a random permutation to the list, so a random element ends up in the first position, and likewise in every other position. You can also see it as shuffling the list by randomly choosing one of all the elements for the first position, then randomly choosing one of the remaining (not yet chosen) elements for the second position, and so on.
If your list is already a randomly generated list, you can directly choose the first element as the pivot, without shuffling again.
So, choosing the first element is the fastest here because of the randomly generated input, but choosing the third or the last element would be just as fast as choosing the first.
All other ways to choose a pivot element have to compute something (a median or a random number or something like this), but they have no advantage over a random choice.
A substantially late response, but I believe it will add some additional info.
Surprisingly the simplest one where the first element is always chosen is the fastest.
This is actually not surprising at all, since you mentioned that you test the algorithm with random data. In reality, the percentage of almost-sorted and sorted data is much greater than would be expected statistically. Take chronological data, for example: when you collect it into a log file, some elements can be out of order, but most of them are already sorted. Unfortunately, a Quicksort implementation that takes the first (or last) element as the pivot is vulnerable to such input, and it degenerates into O(n^2) complexity, because in the partition step you divide your array into two parts of size 1 and n-1 and therefore get n levels of partitioning instead of log n.
That's why people decided to add some sort of randomization that would make a probability of getting the problematic input as minimum as possible. There are three well-known approaches:
shuffle the input - to paraphrase Robert Sedgewick, the chance of getting O(n^2) performance with this approach is lower than the chance of being struck by lightning :)
choose the pivot element randomly - Wikipedia says that the expected number of comparisons in this case is 1.386 n log n
choose the pivot element as the median of three - Wikipedia says that the expected number of comparisons in this case is 1.188 n log n
However, randomization has a cost. If you shuffle the input array, that is O(n), which is dominated by O(n log n), but you need to take into account the cost of invoking the random(..) method n times. With your simple approach, that is avoided, and it is thus faster.
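If you want to see the degenerate case for yourself, here is a rough sketch (my own code; it counts the usual one-comparison-per-partitioned-element rather than wall-clock time, so the random() overhead discussed above is not captured):
import random, sys

def quicksort_comparisons(a, choose_pivot):
    # Count comparisons made by quicksort under a given pivot-selection rule
    # (assumes distinct elements, which range() guarantees below).
    if len(a) <= 1:
        return 0
    pivot = choose_pivot(a)
    left  = [x for x in a if x < pivot]
    right = [x for x in a if x > pivot]
    return (len(a) - 1) + quicksort_comparisons(left, choose_pivot) \
                        + quicksort_comparisons(right, choose_pivot)

first_pivot  = lambda a: a[0]
random_pivot = lambda a: random.choice(a)
median_of_3  = lambda a: sorted([a[0], a[len(a) // 2], a[-1]])[1]

sys.setrecursionlimit(10_000)        # the first-element rule recurses ~n deep on sorted input
data = list(range(2_000))            # already sorted: the problematic input
for name, rule in [("first", first_pivot), ("random", random_pivot), ("median3", median_of_3)]:
    print(name, quicksort_comparisons(data, rule))
# "first" does ~n^2/2 comparisons here; "random" and "median3" stay in the n*log(n) range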
See also:
Worst case for Quicksort - when can it occur?