Assuming a random process X of n elements: X={x1,...,xn}.
For a given probability p, the corresponding quantile is determined through the quantile function Q defined as:
Q(p)={x|Pr(X<=x)=p}
What is the time complexity of finding the quantile of a given probability p?
Quickselect can use median of medians pivot selection to get O(n) worst-case runtime to select the kth order statistic. This appears to be more or less what you are after unless I am misinterpreting (you want the (n*p)'th smallest element).
You cannot possibly improve upon this in the general case: you must at least look at the whole array or else the element you didn't look at might be the answer you need.
Therefore, the Quickselect is theoretically optimal in the worst case. Note: this pivot selection strategy has bad constants and is not used in practice. In practice, using random pivot selection gives good expected performance but O(n^2) worst-case.
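For concreteness, here is a minimal C++ sketch of the idea (assuming the rank is taken as ceil(p*n) and the sample is non-empty); std::nth_element gives expected linear time, and a median-of-medians based select could be swapped in if you need the worst-case O(n) guarantee:

#include <algorithm>
#include <cmath>
#include <vector>

// Return the p-quantile of an unsorted sample (the copy is intentional,
// since nth_element reorders its input). Assumes the sample is non-empty.
double quantile(std::vector<double> x, double p) {
    std::size_t n = x.size();
    std::size_t k = static_cast<std::size_t>(std::ceil(p * n));  // 1-based rank
    if (k > 0) --k;                                              // -> 0-based index
    if (k >= n) k = n - 1;
    std::nth_element(x.begin(), x.begin() + k, x.end());         // selection, not a full sort
    return x[k];
}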
Related
I am prepping for interview LeetCode-type problems and I came across the k closest problem, given a sorted array. The problem requires finding the k elements closest in value to an input value, drawn from the array. The answer to this problem was fairly straightforward and I did not have any issues determining a linear-time algorithm to solve it.
However, working on this problem got me thinking. Is it possible to solve this problem given an unsorted array in linear time? My first thought was to use a heap, and that would give an O(n log k) time complexity solution, but I am trying to determine if it's possible to come up with an O(n) solution. I was thinking about possibly using something like Quickselect, but the issue is that this has an expected time of O(n), not a worst-case time of O(n).
Is this even possible?
The median-of-medians algorithm makes Quickselect take O(n) time in the worst case.
It is used to select a pivot:
Divide the array into groups of 5 (O(n))
Find the median of each group (O(n))
Use Quickselect to find the median of the n/5 medians (O(n))
The resulting pivot is guaranteed to be greater than at least 30% of the elements and less than at least another 30%, so each recursive call discards a constant fraction of the array, which guarantees linear-time Quickselect.
After selecting the pivot, of course, you have to continue on with the rest of Quickselect, which includes a recursive call like the one we made to select the pivot.
The worst-case total time is T(n) = O(n) + T(0.7n) + T(n/5), which is still linear because the two recursive fractions sum to 0.7 + 0.2 = 0.9 < 1. Compared to the expected time of normal Quickselect, though, it's pretty slow, which is why we don't often use this in practice.
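If it helps to see the pivot-selection step in code, here is a minimal C++ sketch of just that step (the surrounding Quickselect partition and recursion are assumed and not shown; sorting each group of five stands in for "find the median of each group"):

#include <algorithm>
#include <vector>

// Approximate median, used as the Quickselect pivot.
int median_of_medians(std::vector<int> a) {
    if (a.size() <= 5) {                       // tiny base case: take the median directly
        std::sort(a.begin(), a.end());
        return a[a.size() / 2];
    }
    std::vector<int> medians;                  // one median per group of 5
    for (std::size_t i = 0; i < a.size(); i += 5) {
        std::size_t len = std::min<std::size_t>(5, a.size() - i);
        std::sort(a.begin() + i, a.begin() + i + len);   // constant work per group
        medians.push_back(a[i + len / 2]);
    }
    return median_of_medians(std::move(medians));        // recurse on the n/5 medians
}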
Your heap solution would be very welcome at an interview, I'm sure.
If you really want to get rid of the log k, which in practical applications should seldom be a problem, then yes, using Quickselect would be another option. Something like this:
Partition your array into values smaller than x and values larger than x. <- O(n)
For the lower half, run Quickselect to find the kth largest number, then take the right-side partition, which holds your k largest numbers. <- O(n)
Repeat step 2 for the higher half, but for the k smallest numbers. <- O(n)
Merge your k smallest and k largest numbers and extract the k closest numbers. <- O(k)
This gives you a total time complexity of O(n), as you said.
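If you want to see those four steps spelled out, here is a minimal C++ sketch; the function name and signature are just for illustration, std::nth_element stands in for Quickselect (expected linear time), and a worst-case-linear select could be substituted if required:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Return the k values closest to x from an unsorted array. Assumes k <= a.size().
std::vector<int> k_closest(std::vector<int> a, int x, std::size_t k) {
    // 1. Partition into values < x and values >= x: O(n).
    auto mid = std::partition(a.begin(), a.end(),
                              [x](int v) { return v < x; });
    std::vector<int> lo(a.begin(), mid), hi(mid, a.end());

    // 2. In the lower half, the closest candidates are the k largest: O(n).
    if (lo.size() > k) {
        std::nth_element(lo.begin(), lo.end() - k, lo.end());
        lo.erase(lo.begin(), lo.end() - k);
    }
    // 3. In the upper half, the closest candidates are the k smallest: O(n).
    if (hi.size() > k) {
        std::nth_element(hi.begin(), hi.begin() + k, hi.end());
        hi.erase(hi.begin() + k, hi.end());
    }
    // 4. From the <= 2k candidates, keep the k with the smallest |v - x|: O(k).
    lo.insert(lo.end(), hi.begin(), hi.end());
    auto closer = [x](int u, int v) { return std::abs(u - x) < std::abs(v - x); };
    if (lo.size() > k) {
        std::nth_element(lo.begin(), lo.begin() + k, lo.end(), closer);
        lo.resize(k);
    }
    return lo;
}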
However, a few points about your worry about expected time vs worst-case time. I understand that if an interview question explicitly insists on worst-case O(n), then this solution might not be accepted, but otherwise, this can well be considered O(n) in practice.
The key point is that for randomized Quickselect and random or well-behaved input, the probability that the running time goes beyond O(n) decreases exponentially as the input grows. Already at largeish inputs, that probability is about as small as the chance of guessing a specific atom in the known universe. The well-behaved-input assumption means the input is somewhat random in nature rather than adversarial. See this discussion on a similar (not identical) problem.
This is a question I came across somewhere.
Given a list of numbers in random order write a linear time algorithm to find the 𝑘th smallest number in the list. Explain why your algorithm is linear.
I have searched almost half the web, and what I got to know is that a linear-time algorithm is one whose time complexity must be O(n). (I may be wrong somewhere.)
We can solve the above question with different algorithms, e.g.:
Sort the array and select the element at index k-1 [O(n log n)]
Using a min-heap [O(n + k log n)]
etc.
Now the problem is that I couldn't find any algorithm which has O(n) time complexity, i.e., one that satisfies the requirement that the algorithm be linear.
What can be the solution for this problem?
This is std::nth_element
From cppreference:
Notes
The algorithm used is typically introselect although other selection algorithms with suitable average-case complexity are allowed.
Given a list of numbers
although it is not compatible with std::list, only std::vector, std::deque and std::array, as it requires RandomAccessIterator.
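As a minimal usage sketch (assuming the numbers are in a std::vector so we have random-access iterators, and k is a 0-based rank):

#include <algorithm>
#include <vector>

int kth_smallest(std::vector<int> v, std::size_t k) {
    std::nth_element(v.begin(), v.begin() + k, v.end());  // linear on average
    return v[k];                                          // the k-th smallest, 0-based
}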
A linear search remembering the k smallest values is O(n*k), but if k is considered constant then it's O(n) time.
However, if k is not considered constant, then using a histogram leads to O(n+m.log(m)) time and O(m) space complexity, where m is the number of possible distinct values (the value range) in your input data. The algorithm is like this:
create histogram counters for each possible value and set them to zero O(m)
process all data and count the values O(n)
sort the histogram O(m.log(m))
pick k-th element from histogram O(1)
In case we are talking about unsigned integers from 0 to m-1, the histogram is computed like this:
int data[n]={your data},cnt[m],i;      // n input values, m histogram counters
for (i=0;i<m;i++) cnt[i]=0;            // clear all counters
for (i=0;i<n;i++) cnt[data[i]]++;      // count how many times each value occurs
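For this ordered unsigned-integer case, the k-th smallest then falls out of a simple prefix-count scan with no sort step; here is a C++ sketch of the whole thing (names are illustrative, k is 1-based, and values are assumed to lie in 0..m-1 as above):

#include <vector>

int kth_smallest_by_histogram(const std::vector<int>& data, int m, int k) {
    std::vector<int> cnt(m, 0);              // O(m): one counter per possible value
    for (int v : data) ++cnt[v];             // O(n): count occurrences
    for (int value = 0; value < m; ++value) {
        k -= cnt[value];                     // consume all copies of this value
        if (k <= 0) return value;            // the k-th smallest is this value
    }
    return -1;                               // k exceeds the number of elements
}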
However, if your input data does not meet the above condition, you need to map the values into such a range by interpolation or hashing. And if m is huge (or the values contain huge gaps), this is a no-go, as such a histogram would either need buckets (which is not usable for your problem) or a list of values, which no longer gives linear complexity.
So, putting all this together, your problem is solvable with linear complexity when:
n >= m.log(m)
I want to use a divide-and-conquer procedure to compute the i-th greatest element in an array of integers and analyze the asymptotic time complexity of the algorithm.
Algorithm ith(A,low,high){                      // i is assumed to be in scope
    q=partition(A,low,high);                    // q = final index of the pivot
    if (high-i+1==q) return A[q];
    else if (high-i+1<q) return ith(A,low,q-1);
    else return ith(A,q+1,high);
}
Is it right? If so, how could we find its time complexity?
The time complexity is described by the following recurrence relation:
T(n)=T(n-q)+T(q-1)+Θ(n)
But how can we solve this recurrence relation, without knowing the value of q?
Or is there an algorithm with lower time complexity that computes the i-th greatest element in an array of integers?
This is a variation of the Quickselect algorithm (which finds the i-th smallest element rather than the i-th greatest element). It has a running time of O(n^2) in the worst case and O(n) in the average case.
To see the worst case, assume you are searching for the nth largest element, and it happens that the algorithm always picks q to be the largest element in the remaining range. So you will be calling the ith function n times. In addition, the partition subroutine takes O(n), so the total running time is O(n^2).
To understand the average-case analysis, check the explanation given by Professor Tim Roughgarden here.
I want to find the approximate median of an unsorted list. I know two algorithms:
Algorithm 1 - Quickselect
Algorithm 2 - Median of medians
I can't use Quickselect in my project because it takes O(n^2) in the worst case.
I heard about median of medians, but my colleagues suggest that it takes O(n) with some constant factor; therefore its time complexity is Cn, and the constant factor is large compared to Quickselect. I want to know: what is the constant factor associated with median of medians? And why does median of medians not use the pseudo-median of 9 elements?
Or is there any other algorithm to find an approximate median in linear time, O(n)?
Although I wouldn't be quick to discard Quickselect, since its worst-case performance is highly improbable with properly chosen pivots...
Perhaps introselect:
Introselect (short for "introspective selection") is a selection algorithm that is a hybrid of quickselect and median of medians which has fast average performance and optimal worst-case performance.
Introselect works by optimistically starting out with quickselect and only switching to the worst-time linear algorithm if it recurses too many times without making sufficient progress. The switching strategy is the main technical content of the algorithm. Simply limiting the recursion to constant depth is not good enough, since this would make the algorithm switch on all sufficiently large lists. Musser discusses a couple of simple approaches:
Keep track of the list of sizes of the subpartitions processed so far. If at any point k recursive calls have been made without halving the list size, for some small positive k, switch to the worst-case linear algorithm.
Sum the size of all partitions generated so far. If this exceeds the list size times some small positive constant k, switch to the worst-case linear algorithm. This sum is easy to track in a single scalar variable.
Both approaches limit the recursion depth to k ⌈log n⌉ = O(log n) and the total running time to O(n).
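A rough C++ sketch of the first switching strategy follows; the names and the exact budget are illustrative, and for brevity the fallback here is std::sort (O(n log n) worst case), whereas a real introselect switches to median of medians to keep the worst case linear:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Return the k-th smallest element (0-based k). Assumes a is non-empty and k < a.size().
int introselect(std::vector<int> a, std::size_t k) {
    std::size_t lo = 0, hi = a.size();               // current search range [lo, hi)
    int budget = 0;
    for (std::size_t n = a.size(); n > 1; n /= 2) ++budget;
    budget *= 2;                                     // allow ~2*log2(n) partitioning rounds

    while (hi - lo > 1) {
        if (budget-- == 0) {                         // too little progress: switch
            std::sort(a.begin() + lo, a.begin() + hi);
            break;
        }
        int pivot = a[lo + std::rand() % (hi - lo)]; // plain randomized quickselect step
        auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                                   [pivot](int v) { return v < pivot; });
        auto mid2 = std::partition(mid1, a.begin() + hi,
                                   [pivot](int v) { return v == pivot; });
        std::size_t i = mid1 - a.begin(), j = mid2 - a.begin();
        if (k < i)      hi = i;                      // answer lies left of the pivot block
        else if (k < j) return pivot;                // answer is inside the pivot block
        else            lo = j;                      // answer lies right of the pivot block
    }
    return a[k];
}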
For my algorithm design class homework came this brain teaser:
Given a list of n distinct positive integers, partition the list into two
sublists, each of size n/2, such that the difference between the sums of the
sublists is maximized.
Assume that n is even and determine the time complexity.
At first glance, the solution seems to be
sort the list via mergesort
select the element at the n/2 location
for all elements greater than it, add them to the high array
for all elements less than it, add them to the low array
This would have a time complexity of O((n log n) + n) = O(n log n).
Are there any better algorithm choices for this problem?
Since you can calculate the median in O(n) time, you can also solve this problem in O(n) time. Calculate the median and, using it as a threshold, create the high array and the low array.
See http://en.wikipedia.org/wiki/Median_search on calculating median in O(n) time.
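A minimal C++ sketch of that, with std::nth_element standing in for the linear-time median selection (expected linear; a median-of-medians select gives the worst-case O(n) bound); it assumes n is even and the values are distinct, as in the problem statement:

#include <algorithm>
#include <utility>
#include <vector>

std::pair<std::vector<int>, std::vector<int>> max_diff_split(std::vector<int> a) {
    std::size_t half = a.size() / 2;
    // Place the median at position half: everything before it is smaller,
    // everything from it onward is larger. O(n) on average.
    std::nth_element(a.begin(), a.begin() + half, a.end());
    std::vector<int> low(a.begin(), a.begin() + half);   // the n/2 smallest values
    std::vector<int> high(a.begin() + half, a.end());    // the n/2 largest values
    return {low, high};
}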
Try
http://en.wikipedia.org/wiki/Selection_algorithm#Linear_general_selection_algorithm_-_Median_of_Medians_algorithm
What you're effectively doing is finding the median. The trick is that once you've found it, you don't need the first n/2 and the last n/2 elements to be sorted at all.