Finding xth smallest element in unsorted array - algorithm

I've been trying some coding algorithm exercises, and one topic in particular has stood out to me. I've been trying to find a good answer to this, but I've been stuck in analysis paralysis. Let's say I have an array of unsorted integers and I want to determine the xth smallest element in this array.
I know of two options to go about this:
Option 1: Run a sort algorithm, sorting the elements from least to greatest, and look up the xth element. To my understanding, this takes O(n*log(n)) time and O(1) space.
Option 2: Heapify the array, turning it into a min heap. Then pop() the top of the heap x times. To my understanding, this is O(n) + O(x*log(n)).
I can't tell which is the optimal answer, and maybe I have a fundamental misunderstanding of priority queues and when to use them. I've tried to measure runtime, and I feel like I'm getting conflicting results, maybe because option 2 depends on how big x is. And maybe there is a better algorithm altogether. If someone could help, I'd appreciate it!

The worst-case time complexity of approach 2 is O(n + n*log(n)) = O(n*log(n)), since the maximum value of x is n.
For the average case (x uniformly distributed over 1..n), the time complexity is O(n + (1+2+3+...+n)/n * log(n)) = O(n + (n+1)/2 * log(n)) = O(n*log(n)).
So for arbitrary x, approach 2 is asymptotically no better than approach 1, and it only wins when x is small compared to n. Neither approach is optimal.
PS: Have a look at the quickselect algorithm, which runs in O(n) time on average.
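To make the two options from the question concrete, here is a minimal Python sketch. The function names kth_by_sort and kth_by_heap are made up for illustration, and x is assumed to be 1-based with 1 <= x <= n.

```python
import heapq

def kth_by_sort(nums, x):
    """Option 1: sort a copy, then index. O(n log n) time."""
    return sorted(nums)[x - 1]

def kth_by_heap(nums, x):
    """Option 2: heapify a copy in O(n), pop x - 1 times, then peek at the root."""
    heap = list(nums)
    heapq.heapify(heap)
    for _ in range(x - 1):
        heapq.heappop(heap)
    return heap[0]

data = [7, 2, 9, 4, 1]
print(kth_by_sort(data, 3), kth_by_heap(data, 3))  # prints: 4 4
```

Both functions copy the input first, so the caller's list is left untouched.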

The complexity of these algorithms depends on two parameters:
Value of x.
Value of n.
Space complexity
In both algorithms the space complexity remains O(1), assuming the sort and the heapify are done in place.
Time complexity
Approach 1
Best Case : O(nlog(n)) for sorting plus O(1) for the lookup when x == 1.
Average Case : O(nlog(n)) if all elements are unique, and O(x+nlog(n)) if there are duplicates.
Worst Case : O(n+nlog(n)) for the case x == n.
Approach 2:
Best Case : O(n), as only the heapify is required when x == 1.
Average Case : O(n + xlog(n))
Worst Case : O(n+nlog(n)) for the case x == n.
Now, coming to how to analyze these algorithms at runtime.
In general, the guidelines below should be followed.
1. Always test with large values of n.
2. Have a good spread of the values being tested (here x).
3. Do multiple iterations of the analysis in a clean environment
(array created anew before every experiment, etc.) and take the average of all
results.
4. Check the code complexity of any predefined functions for the exact implementation.
In this case, that means the sort (which can be 2nlogn etc.) and the various heap operations.
So, assuming ideal values for all of the above,
Method 2 should perform better than Method 1.
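The measurement guidelines above could be followed with a small harness like the one below. This is only a sketch: the helper names (kth_by_sort, kth_by_heap, benchmark) and the chosen sizes are assumptions for illustration, and real timings will vary by machine and Python version.

```python
import heapq
import random
import timeit

def kth_by_sort(nums, x):
    return sorted(nums)[x - 1]

def kth_by_heap(nums, x):
    heap = list(nums)
    heapq.heapify(heap)
    for _ in range(x - 1):
        heapq.heappop(heap)
    return heap[0]

def benchmark(n, xs, repeats=5):
    """Return {x: (sort_seconds, heap_seconds)}, each averaged over `repeats` runs."""
    data = [random.randrange(10**9) for _ in range(n)]
    results = {}
    for x in xs:
        # Both functions copy the input, so every run starts from the same unsorted data.
        t_sort = timeit.timeit(lambda: kth_by_sort(data, x), number=repeats) / repeats
        t_heap = timeit.timeit(lambda: kth_by_heap(data, x), number=repeats) / repeats
        results[x] = (t_sort, t_heap)
    return results

if __name__ == "__main__":
    n = 100_000
    # Spread x from very small to the maximum, as guideline 2 suggests.
    for x, (t_sort, t_heap) in benchmark(n, [1, n // 100, n // 2, n]).items():
        print(f"x={x:>7}  sort={t_sort:.4f}s  heap={t_heap:.4f}s")
```

Averaging over several repeats and covering both small and large x is what exposes the dependence of the heap approach on x.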

Approach 1 has the lower time complexity, but note that both approaches may use auxiliary space: typical std::sort implementations (introsort), for example, need O(log n) stack space rather than O(1). Another way of doing this in constant space is a binary search over the value range. Let l be the smallest element of the array and r the largest; the time complexity is then O(n log(r-l)).
int ans = l - 1;
while (l <= r) {
    int mid = l + (r - l) / 2;  // avoids overflow of l + r
    int cnt = 0;                // number of elements <= mid
    for (int i = 0; i < n; i++) {
        if (a[i] <= mid)
            cnt++;
    }
    if (cnt < x) {
        ans = mid;
        l = mid + 1;
    } else {
        r = mid - 1;
    }
}
Now look for the smallest element in the array that is larger than ans; that element is the xth smallest.
Time complexity: O(n log(r-l)) + O(n) (for the last step)
Space complexity: O(1)
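For readers who prefer Python, here is a sketch of the same value-range binary search, including the final pass. The function name is made up for illustration, and x is assumed to be 1-based with 1 <= x <= len(a).

```python
def kth_smallest_by_value_search(a, x):
    """Binary-search the value range [min(a), max(a)] for the x-th smallest element."""
    lo, hi = min(a), max(a)
    ans = lo - 1  # largest value v found so far with count(a[i] <= v) < x
    while lo <= hi:
        mid = (lo + hi) // 2
        cnt = sum(1 for v in a if v <= mid)
        if cnt < x:
            ans = mid
            lo = mid + 1
        else:
            hi = mid - 1
    # Final O(n) pass: the answer is the smallest element strictly greater than ans.
    return min(v for v in a if v > ans)

print(kth_smallest_by_value_search([7, 2, 9, 4, 1], 3))  # prints 4
```

Only a few integers are kept besides the input, so the extra space stays O(1).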

You can find the xth element in O(n); there are also two simple heap algorithms that improve on the complexity of your option 2. I'll start with the latter.
Simple heap algorithm №1: O(x + (n-x) log x) worst-case complexity
Create a max heap out of the first x elements; then, for each remaining element that is smaller than the heap's maximum, pop the max and push that element instead:
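import heapq
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    # Max heap of the k smallest values seen so far, stored negated
    # because heapq only provides a min heap.
    heap = [-n for n in nums[:k]]
    heapq.heapify(heap)
    for num in nums[k:]:
        if -num > heap[0]:  # num is smaller than the current k-th smallest
            heapq.heapreplace(heap, -num)
    return -heap[0]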
Simple heap algorithm №2: O(n + x log x)
Turn the whole array into a min heap, and insert its root into an auxiliary min heap.
Then, k-1 times, pop an element from the auxiliary heap and push that element's children (taken from the first heap) onto it.
Return the root of the auxiliary heap.
import heapq
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    x = nums.copy()
    heapq.heapify(x)
    s = [(x[0], 0)]  # auxiliary heap of (value, index in x)
    for _ in range(k-1):
        ind = heapq.heappop(s)[1]
        # push both children of the popped node, if they exist
        if 2*ind+1 < len(x):
            heapq.heappush(s, (x[2*ind+1], 2*ind+1))
        if 2*ind+2 < len(x):
            heapq.heappush(s, (x[2*ind+2], 2*ind+2))
    return s[0][0]
Which of these is faster? It depends on the values of x and n.
The more complicated Frederickson algorithm would allow you to find the xth smallest element in a heap in O(x), but that would be overkill, since the xth smallest element of an unsorted array can be found in O(n) worst-case time.
Median-of-medians algorithm: O(n) worst-case time
Described in [1].
Quickselect algorithm: O(n) average time, O(n^2) worst-case time
def partition(A, lo, hi):
    """rearrange A[lo:hi+1] and return j such that
    A[lo:j] <= pivot
    A[j] == pivot
    A[j+1:hi+1] >= pivot
    """
    pivot = A[lo]
    if A[hi] > pivot:
        A[lo], A[hi] = A[hi], A[lo]
    #now A[hi] <= A[lo], and A[hi] and A[lo] need to be exchanged
    i = lo
    j = hi
    while i < j:
        A[i], A[j] = A[j], A[i]
        i += 1
        while A[i] < pivot:
            i += 1
        j -= 1
        while A[j] > pivot:
            j -= 1
    #now put pivot in the j-th place
    if A[lo] == pivot:
        A[lo], A[j] = A[j], A[lo]
    else:
        #then A[hi] == pivot
        j += 1
        A[j], A[hi] = A[hi], A[j]
    return j

def quickselect(A, left, right, k):
    # k is a 0-based index with left <= k <= right
    pivotIndex = partition(A, left, right)
    if k == pivotIndex:
        return A[k]
    elif k < pivotIndex:
        return quickselect(A, left, pivotIndex - 1, k)
    else:
        return quickselect(A, pivotIndex + 1, right, k)
Introselect: O(n) worst-case time
Basically, do quickselect, but if recursion gets too deep, switch to median-of-medians.
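import numpy as np
from typing import List

def findKthSmallest(nums: List[int], k: int) -> int:
    # np.partition takes a 0-based index, while k is 1-based as in the heap versions above
    return np.partition(nums, k - 1, kind='introselect')[k - 1]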
Floyd-Rivest algorithm: O(n) average time, O(n^2) worst-case time
Another way to speed up quickselect:
import math

C1 = 600
C2 = 0.5
C3 = 0.5

def rivest_floyd(A, left, right, k):
    assert k < len(A)
    while right > left:
        if right - left > C1:
            #choose a smaller subrange around k and select within it first
            N = right - left + 1
            I = k - left + 1
            Z = math.log(N)
            S = C2 * math.exp(2/3 * Z) #sample size
            SD = C3 * math.sqrt(Z * S * (N - S) / N) * math.copysign(1, I - N/2)
            #select subsample such that kth element lies between newleft and newright most of the time
            #(bounds follow the published algorithm: k - I*S/N + SD and k + (N-I)*S/N + SD)
            newleft = max(left, int(k - I * S / N + SD))
            newright = min(right, int(k + (N - I) * S / N + SD))
            rivest_floyd(A, newleft, newright, k)
        A[left], A[k] = A[k], A[left]
        #the original called an unshown helper partition2; partition from the
        #quickselect section (pivot = A[left]) is assumed to have the same contract
        j = partition(A, left, right)
        if j <= k:
            left = j+1
        if k <= j:
            right = j-1
    return A[k]
[1] Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. ISBN 0-262-03384-4. pp. 220-223.

Related

Q: Count array pairs with bitwise AND > k ~ better than O(N^2) possible?

Given an array nums, count the number of pairs (two elements) whose bitwise AND is greater than k.
Brute force:
res = 0
for i in range(0, n):
    for j in range(i+1, n):
        if a[i] & a[j] > k:
            res += 1
Better version:
preprocess to remove all elements ≤ k
and then brute force.
But I was wondering, what would be the limit in complexity here?
Can we do better with a trie or a hashmap approach, like two-sum?
(I did not find this problem on LeetCode, so I thought of asking here.)
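The pruning step is valid because AND can only clear bits: for non-negative integers, a & b <= min(a, b), so an element that is <= k can never be part of a qualifying pair. A minimal Python sketch of the pruned brute force, assuming non-negative inputs (the function name is made up for illustration):

```python
def count_pairs_and_greater(nums, k):
    """Count pairs i < j with nums[i] & nums[j] > k (non-negative integers assumed)."""
    # x & y can only clear bits, so x & y <= min(x, y); elements <= k can never qualify.
    candidates = [v for v in nums if v > k]
    count = 0
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if candidates[i] & candidates[j] > k:
                count += 1
    return count

print(count_pairs_and_greater([5, 7, 2, 6, 1], 3))  # prints 3
```

This is still O(N^2) in the worst case; pruning only shrinks N.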
Let N be the size of the input array, and let its elements be B-bit numbers.
Here is an easy to understand and implement solution.
Eliminate all values <= k.
[Figure omitted: five example 10-bit numbers]
Step 1: Adjacency Graph
For each bit position, store the list of indices of the numbers that have that bit set. In our example, the 7th bit is set for the numbers at indices 0, 1, 2, and 3 of the input array.
Step 2: The challenge is to avoid counting the same pair more than once.
To solve this, we use a union-find data structure, as shown in the code below.
//unordered_map<int, vector<int>> adjacency_graph;
//adjacency_graph has been filled up in step 1
vector<int> parent;
for(int i = 0; i < input_array.size(); i++)
    parent.push_back(i);
int result = 0;
for(int i = 0; i < adjacency_graph.size(); i++){ // loop 1
    auto v = adjacency_graph[i];
    if(v.size() > 1){
        int different_parents = 1;
        for (int j = 1; j < v.size(); j++) { // loop 2
            int x = find(parent, v[j]);
            int y = find(parent, v[j - 1]);
            if (x != y) {
                different_parents++;
                unite(parent, x, y); // named unite because "union" is a reserved word in C++
            }
        }
        result += (different_parents * (different_parents - 1)) / 2;
    }
}
return result;
In the above code, find and unite are the find and union operations of the union-find data structure.
Time Complexity:
Step 1:
Build Adjacency Graph: O(BN)
Step 2:
Loop 1: O(B)
Loop 2: O(N * Inverse of Ackermann’s function which is an extremely slow-growing function)
Overall Time Complexity
= O(BN)
Space Complexity
Overall space complexity = O(BN)
First, prune everything <= k. Also sort the value list.
Going from the most significant bit to the least significant, we keep track of the range of numbers we are working with (initially all of them: start = 0, end = n).
Let p be the first position in the current range that contains a 1 at the current bit.
If the bit in k is 0, then every pair that yields a 1 here is definitely good, and we need to investigate further only the numbers that have a 0. We have (end - p) * (end - p - 1) / 2 pairs in the current range and (end - p) * <total 1s in this position at indices larger than or equal to end> combinations with larger, previously good numbers, all of which we can add to the solution. To continue, we update end = p. We count the 1s in all the numbers above because so far we have only counted those numbers in pairs with each other, not with the numbers this low in the range.
If the bit in k is 1, then we can't count any wins yet, but we need to eliminate everything below p, so we update start = p.
You can stop once you have gone through all the bits or start == end.
Details:
Since at each step we eliminate either everything that has a 0 or everything that has a 1, everything between start and end has the same bit prefix. Since the values are sorted, we can do a binary search to find p.
For <total 1s in this position at indices larger than or equal to end>: the values are already sorted, so we can compute partial sums and store, for every position in the sorted list, the number of 1s in every bit position among all numbers above it.
Complexity:
We go bit by bit, so there are L steps (L being the bit length of the numbers); at each step we do a binary search (log N) plus O(1) lookups and updates, so this is O(L logN).
We have to sort: O(NlogN).
We have to compute partial bit-wise sums: O(L*N).
Total: O(L logN + NlogN + L*N).
Since N >> L, L logN is subsumed by NlogN. Since L >> logN (probably: you may have 32-bit numbers, but you don't have 4 billion of them), NlogN is subsumed by L*N. So the complexity is O(L * N). Since we also need to keep the partial sums around, the memory complexity is also O(L * N).

Finding best algorithm for sum of a section of an array's values

Given an array of n integers in the locations A[1], A[2], …, A[n], describe an O(n^2) time algorithm to
compute the sum A[i] + A[i+1] + … + A[j] for all i, j, 1 ≤ i < j ≤ n.
I've tried multiple ways of solving this problem, but none of them run in O(n^2) time.
So for an array containing {1,2,3,4}
You would output:
1+2 = 3
1+2+3 = 6
1+2+3+4 = 10
2+3 = 5
2+3+4 = 9
3+4 = 7
The answer does not need to be in a specific language, pseudocode is preferred.
Good preparation is everything.
You could create an array of prefix sums ("integrals"):
I[0..n] = (0, I[0] + A[1], I[1] + A[2], ..., I[n-1] + A[n]);
This costs O(n) * O(1) (looping over all elements and doing one addition each).
Now you can calculate each Sum(A, i, j) with just a single subtraction: I[j] - I[i-1];
so each sum takes O(1).
Looping over all combinations of i and j with 1 <= i < j <= n takes O(n^2).
So you end up with O(n) * O(1) + O(n^2) * O(1) = O(n^2).
Edit:
Your array A starts at 1. I adapted the answer to this, which also solves the little quirk with i-1.
So the integral array I starts at index 0 and is one element larger than A.
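The prefix-sum idea above can be sketched in Python as follows. The function name all_range_sums is an assumption for illustration; i and j are kept 1-based as in the question, while the Python list itself is 0-based.

```python
def all_range_sums(a):
    """Yield (i, j, a[i] + ... + a[j]) for all 1 <= i < j <= n, with 1-based i and j."""
    n = len(a)
    prefix = [0] * (n + 1)  # prefix[m] = sum of the first m elements, prefix[0] = 0
    for m in range(1, n + 1):
        prefix[m] = prefix[m - 1] + a[m - 1]  # a is a 0-based Python list
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            yield i, j, prefix[j] - prefix[i - 1]  # one subtraction per range

for i, j, s in all_range_sums([1, 2, 3, 4]):
    print(i, j, s)
```

For the example array {1,2,3,4} this produces the sums 3, 6, 10, 5, 9, 7 in that order.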
Edit:
You may first have thought of the most naive idea:
Naive idea
Create a function that for given values of i and of j will return the sum A[i] + ... + A[j].
function sumRange(A, i, j):
    sum = 0
    for k = i to j
        sum = sum + A[k]
    return sum
Then generate all pairs of i and j (with i < j) and call the above function for each pair:
for i = 1 to n
    for j = i+1 to n
        output sumRange(A, i, j)
This is not O(n²), because already the two loops on i and j represent O(n²) iterations, and then the function will perform yet another loop, making it O(n³).
Better idea
The above can be improved. Look at the repetition it performs. The sum that was calculated for given values of i and j can be reused to calculate the sum for when j has increased by 1, without starting from scratch and summing the values between i and (now) j-1 again, only to add that one more value to it.
We should just remember what the previous sum was, and add A[j] to it.
So without a separate function:
for i = 1 to n
    sum = A[i]
    for j = i+1 to n
        sum = sum + A[j]
        output sum
Note how the sum is not reset to 0 once it is output. It is preserved, so that when j is incremented, only one value needs to be added to it.
Now it is O(n²). Note also how it does not require an extra array for storage. It only needs the memory for a few variables (i, j, sum), so its space complexity is O(1).
As the number of sums you need to output is O(n²), there is no way to improve this time complexity any further.
NB: I assume here that single array values do not constitute a "sum". As you stated in your question, i < j, and also in your example you only showed sums of at least two array values. The above can be easily adapted to also include single value "sums" if ever that were needed.
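As a runnable counterpart to the pseudocode, here is a minimal Python sketch of the running-sum approach. The function name is made up for illustration, and it collects the sums in a list instead of printing them, so the O(1) extra space mentioned above applies only if the sums are output directly.

```python
def range_sums_running(a):
    """Return the sums a[i] + ... + a[j] for all i < j, in the order they are produced."""
    sums = []
    for i in range(len(a)):
        total = a[i]  # start each row from the single value a[i]
        for j in range(i + 1, len(a)):
            total += a[j]  # extend the previous sum by one element
            sums.append(total)
    return sums

print(range_sums_running([1, 2, 3, 4]))  # prints [3, 6, 10, 5, 9, 7]
```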

Get the first x elements of a Heapsort

I'm preparing for a Google developer interview and working on algorithm questions. I need to figure out how to get the first x elements in an array of size n using the Heapsort algorithm. What part of the algorithm needs to be modified to get just the first x smallest elements?
This is the Heapsort algorithm from Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein (page 155):
HEAPSORT(A)
{
    BUILD-MAX-HEAP(A)
    for i = A.length down to 2
        exchange A[1] with A[i]
        A.heap-size = A.heap-size - 1
        MAX-HEAPIFY(A, 1)
}
These are the component algorithms:
BUILD-MAX-HEAP(A)
    A.heap-size = A.length
    for i = floor(A.length / 2) down to 1
        MAX-HEAPIFY(A, i)
MAX-HEAPIFY(A, i)
    l = LEFT(i)
    r = RIGHT(i)
    if l <= A.heap-size and A[l] > A[i]
        largest = l
    else largest = i
    if r <= A.heap-size and A[r] > A[largest]
        largest = r
    if largest != i
        exchange A[i] with A[largest]
        MAX-HEAPIFY(A, largest)
I can't figure out what part to modify to get the x smallest elements of the sorted array. I also need to find the time complexity of the modified algorithm.
By changing the comparisons in MAX-HEAPIFY, we can turn it into MIN-HEAPIFY, and thus easily build a min heap.
The first element of this heap is then the smallest element. We remove it, move the last element of the heap to the first position, and call MIN-HEAPIFY again to restore the heap property. Repeating this process n times yields the n smallest elements in ascending order.
Time complexity (for a heap of m elements, after the O(m) build, where n is the number of elements extracted): log(m) + log(m - 1) + ... + log(m - n + 1) ~ O(n log m)
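The modification described above can be sketched in Python as follows. The names min_heapify and first_x_smallest are illustrative, indices are 0-based (unlike the 1-based pseudocode in the question), and the sift-down is written iteratively rather than recursively.

```python
def min_heapify(a, i, heap_size):
    """Sift a[i] down until the subtree rooted at i is a min heap (0-based indices)."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < heap_size and a[left] < a[smallest]:
            smallest = left
        if right < heap_size and a[right] < a[smallest]:
            smallest = right
        if smallest == i:
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def first_x_smallest(values, x):
    """Return the x smallest values in ascending order: O(n) build + O(x log n) extraction."""
    a = list(values)
    heap_size = len(a)
    for i in range(heap_size // 2 - 1, -1, -1):  # counterpart of BUILD-MAX-HEAP
        min_heapify(a, i, heap_size)
    result = []
    for _ in range(min(x, len(a))):
        result.append(a[0])   # the root is the current minimum
        heap_size -= 1
        a[0] = a[heap_size]   # move the last heap element to the root
        min_heapify(a, 0, heap_size)
    return result

print(first_x_smallest([9, 4, 7, 1, 8, 2], 3))  # prints [1, 2, 4]
```

The loop stops after x extractions instead of running down to a heap of size 1, which is the only structural change compared with a full heapsort.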

How to find 2 numbers and their sum in an unsorted array

This was an interview question that I was asked to solve: Given an unsorted array, find 2 numbers and their sum in the array. (That is, find three numbers in the array such that one is the sum of the other two.) Please note, I have seen questions about finding 2 numbers when the sum (int k) is given. However, this question expects you to find both the numbers and the sum in the array. Can it be solved in O(n), O(log n) or O(nlogn)?
There is a standard solution of going through each pair of integers and then doing a binary search for their sum. Is there a better solution?
public static void findNumsAndSum(int[] l) {
    if (l == null || l.length < 3) {
        return;
    }
    // sort the array so that the binary search below is valid
    java.util.Arrays.sort(l);
    for (int i = 0; i < l.length; i++) {
        for (int j = i + 1; j < l.length; j++) {
            int sum = l[i] + l[j];
            if (l[l.length - 1] < sum) {
                continue;
            }
            // assumes non-negative values, so the sum can only sit after index j;
            // java.util.Arrays.binarySearch stands in for the unshown BinarySearch helper
            if (java.util.Arrays.binarySearch(l, j + 1, l.length, sum) >= 0) {
                System.out.println("Found the sum: " + l[i] + "+" + l[j]
                        + "=" + sum);
            }
        }
    }
}
This is very similar to the standard 3SUM problem, which many of the related questions are about.
Your solution is O(n^2 lg n); there are O(n^2) algorithms based on sorting the array, which work with slight modification for this variant. The best known lower bound is Ω(n lg n) (because you can use it to perform a comparison sort, if you're clever about it). If you can find a subquadratic algorithm or a tighter lower bound, you'll get some publications out of it. :)
Note that if you're willing to bound the integers to fall in the range [-u, u], there's a solution for the a + b + c = 0 problem in time O(n + u lg u) using the Fast Fourier Transform. It's not immediately obvious to me how to adjust it to the a + b = c problem, though.
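As an illustration of the sorting-based O(n^2) approach mentioned above, here is one possible sketch for the a + b = c variant. It is not taken from a specific reference: the function name and the choice to run a two-pointer scan for every candidate c are assumptions made for this example.

```python
def find_pair_and_sum(nums):
    """Return (a, b, c) with a + b == c taken from three distinct positions, or None."""
    s = sorted(nums)
    n = len(s)
    for c in range(n):  # try every element as the sum
        i, j = 0, n - 1
        while i < j:
            if i == c:  # skip the position used as the sum
                i += 1
                continue
            if j == c:
                j -= 1
                continue
            total = s[i] + s[j]
            if total == s[c]:
                return s[i], s[j], s[c]
            if total < s[c]:
                i += 1
            else:
                j -= 1
    return None

print(find_pair_and_sum([8, 3, 5, 1]))  # prints (3, 5, 8)
```

Sorting costs O(n log n) and each of the n two-pointer scans is O(n), so the total is O(n^2). Negative values are handled because the scan always covers the whole sorted array.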
You can solve it in O(nlog(n)) as follows:
Sort your array in ascending order in O(nlog(n)). You need 2 indices pointing to the left and right ends of your array. Let's call them i and j, i being the left one and j the right one.
Now calculate the sum array[i] + array[j].
If this sum is greater than k, reduce j by one.
If this sum is smaller than k, increase i by one.
Repeat until the sum equals k.
So with this algorithm you can find the solution in O(nlog(n)), and it is pretty simple to implement.
Sorry. It seems that I didn't read your post carefully enough ;)

Finding the median of an unsorted array

To find the median of an unsorted array, we can make a min-heap in O(nlogn) time for n elements, and then we can extract one by one n/2 elements to get the median. But this approach would take O(nlogn) time.
Can we do the same by some method in O(n) time? If we can, then please tell or suggest some method.
You can use the Median of Medians algorithm to find median of an unsorted array in linear time.
I have already upvoted @dasblinkenlight's answer, since the Median of Medians algorithm does in fact solve this problem in O(n) time. I only want to add that this problem can also be solved in O(n) time using heaps. Building a heap can be done in O(n) time with the bottom-up approach; see the linked article on heap sort for a detailed explanation.
Supposing that your array has N elements, you have to build two heaps: a MaxHeap that contains the first N/2 elements (or (N/2)+1 if N is odd) and a MinHeap that contains the remaining elements. If N is odd, then your median is the maximum element of the MaxHeap (O(1) by getting the max). If N is even, then your median is (MaxHeap.max()+MinHeap.min())/2, which also takes O(1). Thus, the real cost of the whole operation is building the heaps, which is O(n).
BTW, this MaxHeap/MinHeap algorithm also works when you don't know the number of array elements beforehand (for example, if you have to solve the same problem for a stream of integers). You can find more details in the linked article on the median of integer streams.
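For the streaming case mentioned above, a common way to maintain the two heaps is sketched below. The class name RunningMedian and its methods are made up for this example. Note that each insertion costs O(log n), so feeding n values through add takes O(n log n) in total.

```python
import heapq

class RunningMedian:
    def __init__(self):
        self._low = []   # max heap of the smaller half, stored as negated values
        self._high = []  # min heap of the larger half

    def add(self, value):
        heapq.heappush(self._low, -value)
        # Move the largest of the lower half up, so everything in low <= everything in high.
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so that low holds the extra element when the count is odd.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2

rm = RunningMedian()
for v in [5, 2, 8, 1]:
    rm.add(v)
    print(rm.median())  # prints 5, 3.5, 5, 3.5
```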
Quickselect works in O(n) on average; it uses the same partition step as Quicksort.
The quickselect algorithm can find the k-th smallest element of an array in expected linear (O(n)) running time. Here is an implementation in Python:
import random

def partition(L, v):
    smaller = []
    equal = []
    bigger = []
    for val in L:
        if val < v: smaller.append(val)
        elif val > v: bigger.append(val)
        else: equal.append(val)  # keep every copy of the pivot value
    return (smaller, equal, bigger)

def top_k(L, k):
    # returns the k smallest elements of L (in no particular order)
    v = L[random.randrange(len(L))]
    (left, middle, right) = partition(L, v)
    if len(left) == k: return left
    if len(left) < k <= len(left) + len(middle):
        return left + middle[:k - len(left)]
    if len(left) > k: return top_k(left, k)
    return left + middle + top_k(right, k - len(left) - len(middle))

def median(L):
    n = len(L)
    l = top_k(L, n // 2 + 1)  # integer division, so k stays an int in Python 3
    return max(l)
No, there is no O(n) algorithm for finding the median of an arbitrary, unsorted dataset.
At least none that I am aware of in 2022. All answers offered here are variations/combinations using heaps, Median of Medians, Quickselect, all of which are strictly O(nlogn).
See https://en.wikipedia.org/wiki/Median_of_medians and http://cs.indstate.edu/~spitla/abstract2.pdf.
The problem appears to be confusion about how algorithms are classified, which is according to their limiting (worst case) behaviour. "On average" or "typically" O(n) with "worst case" O(f(n)) means (in textbook terms) "strictly O(f(n))". Quicksort, for example, is often discussed as being O(nlogn) (which is how it typically performs), although it is in fact an O(n^2) algorithm because there is always some pathological ordering of inputs for which it can do no better than n^2 comparisons.
It can be done using Quickselect Algorithm in O(n), do refer to Kth order statistics (randomized algorithms).
As Wikipedia says, Median-of-Medians is theoretically O(N), but it is not used in practice because the overhead of finding "good" pivots makes it too slow.
http://en.wikipedia.org/wiki/Selection_algorithm
Here is Java source for a Quickselect algorithm to find the k'th element in an array:
/**
 * Returns position of k'th smallest element of sub-list (k is 0-based).
 *
 * @param list list to search, whose sub-list may be shuffled before
 *            returning
 * @param lo first element of sub-list in list
 * @param hi just after last element of sub-list in list
 * @param k
 * @return position of k'th smallest element of (possibly shuffled) sub-list.
 */
static int select(double[] list, int lo, int hi, int k) {
    int n = hi - lo;
    if (n < 2)
        return lo;
    double pivot = list[lo + (k * 7919) % n]; // Pick a pseudo-random pivot
    // Triage list to [<pivot][=pivot][>pivot]
    int nLess = 0, nSame = 0, nMore = 0;
    int lo3 = lo;
    int hi3 = hi;
    while (lo3 < hi3) {
        double e = list[lo3];
        int cmp = compare(e, pivot);
        if (cmp < 0) {
            nLess++;
            lo3++;
        } else if (cmp > 0) {
            swap(list, lo3, --hi3);
            if (nSame > 0)
                swap(list, hi3, hi3 + nSame);
            nMore++;
        } else {
            nSame++;
            swap(list, lo3, --hi3);
        }
    }
    assert (nSame > 0);
    assert (nLess + nSame + nMore == n);
    assert (list[lo + nLess] == pivot);
    assert (list[hi - nMore - 1] == pivot);
    if (k >= n - nMore)
        return select(list, hi - nMore, hi, k - nLess - nSame);
    else if (k < nLess)
        return select(list, lo, lo + nLess, k);
    return lo + k;
}
I have not included the source of the compare and swap methods, so it's easy to change the code to work with Object[] instead of double[].
In practice, you can expect the above code to be O(N).
Let the problem be: find the kth smallest element in an unsorted array.
Divide the array into n/5 groups, each consisting of 5 elements.
Let a1, a2, a3, ..., a(n/5) be the medians of the groups.
x = median of the elements a1, a2, ..., a(n/5).
Now if k < n/2, we can remove the largest, 2nd largest and 3rd largest elements of each group whose median is greater than x. We can then call the function again on the remaining 7n/10 elements to find the kth smallest value.
Otherwise, if k > n/2, we can remove the smallest, 2nd smallest and 3rd smallest elements of each group whose median is smaller than x. We can then call the function again on the remaining 7n/10 elements to find the (k - 3n/10)th smallest value.
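The steps above can be sketched in Python. This is a simplified variant rather than a literal transcription: instead of removing three elements per group, it partitions the whole list around x into smaller, equal, and greater values, which discards at least as much. It selects the k-th smallest element with a 0-based k, and the function name is an assumption for illustration.

```python
def median_of_medians_select(a, k):
    """Return the k-th smallest element of a (k is 0-based), in O(n) worst-case time."""
    if len(a) <= 5:
        return sorted(a)[k]
    # Median of each group of (at most) 5 elements.
    groups = [a[i:i + 5] for i in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # Pivot x: the median of the group medians, found recursively.
    x = median_of_medians_select(medians, len(medians) // 2)
    less = [v for v in a if v < x]
    equal = [v for v in a if v == x]
    greater = [v for v in a if v > x]
    if k < len(less):
        return median_of_medians_select(less, k)
    if k < len(less) + len(equal):
        return x
    return median_of_medians_select(greater, k - len(less) - len(equal))

values = [9, 1, 8, 2, 7, 3, 6, 4, 5, 0, 11, 10]
print(median_of_medians_select(values, len(values) // 2))  # prints 6
```

The median is then the element at index n // 2 (the upper median when n is even).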
Time Complexity Analysis:
T(n) = time needed to select the kth element (in sorted order) in an array of size n.
T(n) = T(n/5) + T(7n/10) + O(n)
Solving this recurrence gives T(n) = O(n), because the subproblem sizes sum to less than n:
n/5 + 7n/10 = 9n/10 < n
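One way to verify the O(n) bound is the substitution method. The constants a and c below are introduced only for this derivation: a bounds the linear work per call, and the induction hypothesis is T(m) <= c m for all m < n.

```latex
\begin{align*}
T(n) &\le T(n/5) + T(7n/10) + a\,n \\
     &\le \tfrac{c\,n}{5} + \tfrac{7c\,n}{10} + a\,n \\
     &= \tfrac{9}{10}\,c\,n + a\,n \\
     &\le c\,n \quad \text{whenever } c \ge 10a .
\end{align*}
```

So any constant c >= 10a works, which is exactly where the condition n/5 + 7n/10 < n is used.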
Notice that building a heap actually takes O(n), not O(nlogn); you can check this with a careful analysis of the bottom-up build, or simply look it up on YouTube.
Extract-Min takes O(logn); therefore, extracting n/2 elements takes (n/2)*logn = O(nlogn) time.
As for your question, you can simply look at Median of Medians.

Resources