Suppose you have an array A of n items, and you want to find the k items in A closest
to the median of A. For example, if A contains the 9 values {7, 14, 10, 12, 2, 11, 29, 3, 4}
and k = 5, then the answer would be the values {7, 14, 10, 12, 11}, since the median
is 10 and these are the five values in A closest to the value 10. Give an algorithm
to solve this problem in O(n) time.
I know that a selection algorithm (deep selection) is appropriate for this problem, but I think that would run in O(n log n) time instead of O(n). Any help would be greatly appreciated :)
You will first need to find the median, which can be done in O(n) (for example using Hoare's Quickselect algorithm).
Then you will need to implement a sorting algorithm which sorts the elements in the array according to their absolute distance to the median (smallest distances first).
If you were to sort the entire array this way, it would typically take somewhere from O(n log n) to O(n^2), depending on the algorithm used. However, since you only need the first k values, the complexity can be reduced to somewhere between O(k log n) and O(k * n).
Since k is a constant and does not depend on the size of the array, the overall worst-case complexity is O(n) (for finding the median) + O(k * n) (partial sorting), which is O(n) overall.
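For illustration, here is a minimal C++ sketch of this plan (assuming 0 < k <= n). It swaps the partial sort for a second selection pass with std::nth_element, which keeps the whole thing at expected O(n) even without treating k as a constant:

#include <algorithm>
#include <cstdlib>
#include <vector>

// Sketch: select the median, then select the k values closest to it.
// Both std::nth_element calls run in expected O(n).
std::vector<int> kClosestToMedian(std::vector<int> a, int k) {
    int n = a.size();
    std::nth_element(a.begin(), a.begin() + n / 2, a.end());
    int med = a[n / 2];
    std::nth_element(a.begin(), a.begin() + k, a.end(),
        [med](int x, int y) { return std::abs(x - med) < std::abs(y - med); });
    return std::vector<int>(a.begin(), a.begin() + k);
}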
I think you can do this using a variant on quicksort.
You start with a set S of n items and are looking for the "middle" k items. You can think of this as partitioning S into three parts of sizes (n - k)/2 (the "lower" items), k (the "middle" items), and (n - k)/2 (the "upper" items).
This gives us a strategy: first remove the lower (n - k)/2 items from S, leaving S'. Then remove the upper (n - k)/2 items from S', leaving S'', which is the middle k items of S.
You can easily partition a set this way using "half a quicksort": choose a pivot, partition the set into L and U (lower and upper elements w.r.t. the pivot), then you know the items to discard in the partition must be either all of L and some of U or vice versa: recurse accordingly.
[Thinking further, this may not be exactly what you want if you define "closest to the median" in some other way, but it's a start.]
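As a minimal sketch of that strategy (assuming k <= n, and leaning on std::nth_element to play the role of the "half a quicksort" partition):

#include <algorithm>
#include <vector>

// Keep the middle k of s by discarding the lower (n - k)/2 items,
// then the upper (n - k)/2 items. Each nth_element call is expected O(n).
std::vector<int> middleK(std::vector<int> s, int k) {
    int n = s.size();
    int lo = (n - k) / 2;
    std::nth_element(s.begin(), s.begin() + lo, s.end());          // drop the lower part
    std::nth_element(s.begin() + lo, s.begin() + lo + k, s.end()); // drop the upper part
    return std::vector<int>(s.begin() + lo, s.begin() + lo + k);
}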
Assumption: we care about the k values in A that are closest to the median. If we had an A={1,2,2,2,2,2,2,2,2,2,2,2,3}, and k=3, the answer is {2,2,2}. Similarly, if we have A={0,1,2,3,3,4,5,6}, and k=3, answers {2,3,3} and {3,3,4} are equally valid. Furthermore, we are not interested in the indices from which these values came, though I imagine some small tweaks to the algorithm would work.
As Grodrigues states, first find the median in O(n) time. While we're at it, keep track of the largest and smallest numbers.
Next, create an array K, k items long. This array will contain the distance of each kept item from the median.
Copy the first k items from A into K.
For each item A[i], compare the distance of A[i] from the median to each item of K. If A[i] is closer to the median than the farthest item from the median in K, replace that item. As an optimization, we could also track K's closest and farthest items from the median, so we have a faster comparison to K, or we could keep K sorted, but neither optimization is necessary to operate in O(n) time.
Pseudocode, C++ ish:
/* n = length of array
* array = A, given in the problem
* result is a pre-allocated array where the result will be placed
* k is the length of result
*
* returns
* 0 for success
* -1 for invalid input
* 1 for other errors
*
* Implementation note: optimizations are skipped.
*/
#define SUCCESS 0
#define INVALID_INPUT -1
#define ERROR 1
int find_k_closest(int n, int array[], int k, int result[])
{
    // If we're looking for more results than possible,
    // it's impossible to give a valid result.
    if (k > n) return INVALID_INPUT;

    // Populate result with the first k elements of array.
    for (int i = 0; i < k; i++)
        result[i] = array[i];

    // If we're looking for n items of an n-length array,
    // we don't need to do any comparisons.
    // Up to this point the function is O(k); worst case k == n,
    // and we're O(n).
    if (k == n) return SUCCESS;

    // Assume an O(n) median function.
    // Note that we don't bother finding the median if there's an
    // error or if the output is the input.
    int med = median(array, n);

    // Convert the result array to hold signed distances from the
    // median, not the actual numbers.
    for (int i = 0; i < k; i++)
    {
        result[i] = result[i] - med;
        // e.g. if result[i] = 1 and med = 3, result[i] becomes -2.
    }

    // Up to this point, the function is O(2k + n) = O(n).
    // Find the closest items.
    // The outer loop is O(n * order_inner_loop); the inner loop is
    // O(k), so the whole loop is O(k * n) = O(n) for constant k.
    // Note that we start at k, since the first k elements of array
    // are already in result.
    for (int i = k; i < n; i++)
    {
        int distance = array[i] - med;
        int abs_distance = abs(distance);

        // Find the element of result farthest from the median.
        int idx = 0;
        for (int j = 1; j < k; j++)
        {
            if (abs(result[j]) > abs(result[idx]))
                idx = j;
        }

        // If array[i] is closer to the median than the farthest
        // element of result, replace that farthest element.
        if (abs_distance < abs(result[idx]))
            result[idx] = distance;
    }

    // Up to this point, the function is O(2n) = O(n).
    // Convert result back from distances to values.
    for (int i = 0; i < k; i++)
    {
        result[i] = med + result[i];
        // e.g. if result[i] = -2 and med = 3, result[i] becomes 1.
    }

    return SUCCESS;
}
Given an array nums, count the number of pairs (of two elements) whose bitwise AND is greater than K.
Brute force
for i in range(0, n):
    for j in range(i+1, n):
        if a[i] & a[j] > k:
            res += 1
Better version: preprocess to remove all elements ≤ k (they can never contribute, since x & y ≤ min(x, y)), and then brute force.
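In C++, that pruning step is the erase-remove idiom (a sketch, assuming the input lives in a vector<int> named nums):

#include <algorithm>
#include <vector>

// Drop every element <= k; such values can never appear in a winning pair.
void pruneLeqK(std::vector<int>& nums, int k) {
    nums.erase(std::remove_if(nums.begin(), nums.end(),
                              [k](int x) { return x <= k; }),
               nums.end());
}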
But I was wondering: what would be the limit in complexity here? Can we do better with a trie or hashmap approach, like two-sum?
(I did not find this problem on Leetcode, so I thought of asking here.)
Let size_of_input_array = N, and let the input array consist of B-bit numbers.
Here is an easy to understand and implement solution.
Eliminate all values <= k.
[Image omitted: the running example is five 10-bit numbers.]
Step 1: Adjacency Graph
For each bit position, store the list of indices whose numbers have that bit set. In our example, the 7th bit is set for the numbers at indices 0, 1, 2, 3 of the input array.
Step 2: The challenge is to avoid counting the same pairs again.
To solve this challenge we take the help of a union-find data structure, as shown in the code below.
// unordered_map<int, vector<int>> adjacency_graph;
// adjacency_graph has been filled up in step 1
vector<int> parent;
for (int i = 0; i < input_array.size(); i++)
    parent.push_back(i);

int result = 0;
for (int b = 0; b < B; b++) { // loop 1: one pass per bit position
    auto& v = adjacency_graph[b];
    if (v.size() > 1) {
        int different_parents = 1;
        for (int j = 1; j < v.size(); j++) { // loop 2
            int x = find(parent, v[j]);
            int y = find(parent, v[j - 1]);
            if (x != y) {
                different_parents++;
                union_sets(parent, x, y); // "union" is a C++ keyword, hence the rename
            }
        }
        result += (different_parents * (different_parents - 1)) / 2;
    }
}
return result;
In the above code, find and union_sets are the usual union-find operations.
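For completeness, here is a minimal sketch of those two helpers (path compression only, no ranks; the names are illustrative):

#include <vector>
using namespace std;

// Find the representative of x's set, halving paths as we go.
int find(vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]]; // path halving
        x = parent[x];
    }
    return x;
}

// Merge the sets containing x and y.
void union_sets(vector<int>& parent, int x, int y) {
    parent[find(parent, x)] = find(parent, y);
}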
Time Complexity:
Step 1:
Build Adjacency Graph: O(BN)
Step 2:
Loop 1: O(B)
Loop 2: O(N * α(N)), where α is the inverse of Ackermann's function, an extremely slow-growing function
Overall Time Complexity
= O(BN)
Space Complexity
Overall space complexity = O(BN)
First, prune everything <= k. Also sort the value list.
Going from the most significant bit to the least significant, we keep track of the set of numbers we are currently working with (initially all of them: s = 0, e = n).
Let p be the first position in the current set that contains a 1 at the current bit.
If the bit in k is 0, then everything that would yield a 1 here is definitely good, and we only need to keep investigating the ones that yield a 0. We have (end - p) * (end - p - 1) / 2 pairs in the current range, plus (end - p) * <total 1s in this position larger or equal to end> combinations with larger, previously good numbers, all of which we can add to the solution. To continue, we update end = p. We want to count the 1s in all the numbers above because, so far, we only counted them in pairs with each other, not with the numbers this low in the set.
If the bit in k is 1, then we can't count any wins yet, but we need to eliminate everything below p, so we update start = p.
You can stop once you have gone through all the bits or once start == end.
Details:
Since at each step we eliminate either everything that has a 0 or everything that has a 1, everything between start and end will share the same bit prefix. Since the values are sorted, we can do a binary search to find p.
For <total 1s in this position larger or equal to end>: we already have the values sorted, so we can compute partial sums, storing for every position in the sorted list the number of 1s in every bit position among all the numbers above it.
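A minimal sketch of that precomputation (assuming v is the pruned, sorted value list and B is the bit width; the names are illustrative):

#include <vector>

// ones[b][i] = number of values v[j] with j >= i whose bit b is set, so
// <total 1s in this position larger or equal to end> is the O(1) lookup ones[b][end].
std::vector<std::vector<int>> bitSuffixCounts(const std::vector<int>& v, int B) {
    int n = v.size();
    std::vector<std::vector<int>> ones(B, std::vector<int>(n + 1, 0));
    for (int b = 0; b < B; b++)
        for (int i = n - 1; i >= 0; i--)
            ones[b][i] = ones[b][i + 1] + ((v[i] >> b) & 1);
    return ones;
}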
Complexity:
We go bit by bit, so L steps (L being the bit length of the numbers); at each step we do a binary search (log N) plus O(1) lookups and updates, so this part is O(L logN).
We have to sort O(NlogN).
We have to compute partial bit-wise sums O(L*N).
Total O(L logN + NlogN + L*N).
Since N >> L, L logN is subsumed by NlogN. Since L >> logN (probably, as in: you have 32-bit numbers but you don't have 4 billion of them), NlogN is in turn subsumed by L*N. So the time complexity is O(L * N). Since we also need to keep the partial sums around, the memory complexity is O(L * N) as well.
It's quite clear that there is an O(n^2) algorithm to choose the second largest number, and a tree-style (tournament) algorithm with O(n * log(n)), but with extra space cost, like below:
But, eh..., is there an in-place algorithm with time complexity O(n * log(n)) to select the second largest number in an array/vector?
Yes, in fact you can do this with a single pass over the range without modifying it. Here's an example algorithm:
Let m and M be the second largest and largest elements, respectively. Initialize them to the smallest possible values the input range could contain.
For each number n in the range, the new second largest number depends on the relative order between n, m and M. The 3 possible orderings are n < m < M, m < n < M, or m < M < n. The new second largest element must be m, n, and M respectively. Essentially, n must be clamped between m and M.
The new largest number can't be m, so it must be the larger of n and M.
Here's a demonstration in C++:

#include <algorithm> // std::clamp (C++17) and std::max

int m = 0, M = 0; // assuming a range with non-negative values
for (int n : v)
{
    m = std::clamp(n, m, M); // new second largest: n clamped between m and M
    M = std::max(n, M);      // new largest
}
If you are looking for something very simple O(n):
#include <climits> // INT_MIN
#include <vector>

int getSecondLargest(std::vector<int>& vec) {
    int firstLargest = INT_MIN, secondLargest = INT_MIN;
    for (auto i : vec) {
        if (i >= firstLargest) {
            if (firstLargest != INT_MIN) { // don't demote the sentinel
                secondLargest = firstLargest;
            }
            firstLargest = i;
        }
        else if (i > secondLargest) {
            secondLargest = i;
        }
    }
    return secondLargest;
}
nth_element:
Pros:
If tomorrow you want not the second largest but, say, the fifth largest, you won't need many code changes. The above algorithm I presented won't help.
Cons:
If you are just looking for the second largest, nth_element is overkill. The swaps and/or writes are more numerous compared to the above algorithm.
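For reference, a hypothetical nth_element call for the second largest looks like this (change the 1 to 4 for the fifth largest):

#include <algorithm>
#include <functional>
#include <vector>

std::vector<int> v = {3, 1, 4, 1, 5, 9, 2, 6};
// Partially order v descending so that v[1] is the second largest element.
std::nth_element(v.begin(), v.begin() + 1, v.end(), std::greater<int>());
int secondLargest = v[1]; // 6 in this example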
Why are you guys giving me O(n) when I am asking for O(nlogn)?
You can find various in-place O(nlogn) sorting algorithms. One of them is Block Sort.
No. I want it to solve it with a tree style and I want O(nlogn) and I want in place. Do you have something like that?
No. That is not possible. When you say in-place, you can't use extra space that depends on n; constant extra space is fine. But tree style would require O(logn) extra space.
How can we count the number of comparisons in insertion sort in less than O(n^2)?
When we're inserting an element, we alternate comparisons and swaps until either (1) the element compares not less than the element to its left, or (2) we hit the beginning of the array. In case (1), there is one comparison not paired with a swap. In case (2), every comparison is paired with a swap. The upward adjustment to the number of comparisons can be computed by counting the number of successive minima from left to right (or however your insertion sort works), in time O(n).
num_comparisons = num_swaps
min_so_far = array[0]
for i in range(1, len(array)):
    if array[i] < min_so_far:
        min_so_far = array[i]
    else:
        num_comparisons += 1
As commented, doing it in less than O(n^2) is hard, maybe impossible if you must pay the price for sorting. If you already knew the number of comparisons done at each external iteration, then it would be possible in O(n), but the price for sorting was paid sometime before.
Here is a way for counting the comparisons inside the method (in pseudo C++):
void insertion_sort(int p[], const size_t n, size_t & count)
{
    for (long i = 1, j; i < (long) n; ++i)
    {
        auto tmp = p[i];
        for (j = i - 1; j >= 0 and p[j] > tmp; --j) // shift right to open a gap for tmp
            p[j + 1] = p[j];
        count += i - j - (j < 0); // comparisons this iteration (one fewer if we fell off the front)
        p[j + 1] = tmp;
    }
}
n is the number of elements and count is the comparisons counter, which must be passed in as a variable set to zero.
If I remember correctly, this is how insertion sort works:
A = unsorted input array
B := [] // sorted output array
while (A is not empty) {
    remove first element from A and add it to B, preserving B's sorting
}
If the insertion to B is implemented by linear search from the left until you find a greater element, then the number of comparisons is the number of pairs (i,j) such that i < j and A[i] >= A[j] (I'm considering the stable variant).
In other words, for each element x, count the number of elements before x that have a less or equal value. That can be done by scanning A from the left, adding its elements to some balanced binary search tree that also remembers the number of elements under each node. In such a tree, you can find the number of elements less than or equal to a certain value in O(log n). Total time: O(n log n).
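Here is a minimal sketch of that counting idea, using a Fenwick (binary indexed) tree in place of the balanced BST; the bound is the same O(n log n), with values coordinate-compressed to ranks 1..m first (all names are illustrative):

#include <algorithm>
#include <vector>

long long countPairs(const std::vector<int>& a) {
    std::vector<int> sorted(a);
    std::sort(sorted.begin(), sorted.end());
    sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());
    int m = sorted.size();
    std::vector<int> tree(m + 1, 0); // 1-based Fenwick tree over ranks
    long long total = 0;
    for (int x : a) {
        int r = std::lower_bound(sorted.begin(), sorted.end(), x) - sorted.begin() + 1;
        for (int i = r; i > 0; i -= i & -i)  // how many earlier elements are <= x
            total += tree[i];
        for (int i = r; i <= m; i += i & -i) // record x at its rank
            tree[i] += 1;
    }
    return total;
}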
I want to choose k elements uniformly at random out of a possible n without choosing the same number twice. There are two trivial approaches to this.
Make a list of all n possibilities. Shuffle them (you don't need to shuffle all n numbers, just k of them, by performing the first k steps of Fisher-Yates). Choose the first k. This approach takes O(k) time (assuming allocating an array of size n takes O(1) time) and O(n) space. This is a problem if k is very small relative to n.
Store a set of seen elements. Choose a number at random from [0, n-1]; while the element is in the set, choose a new number. This approach takes O(k) space. The run-time is a little more complicated to analyze. If k = theta(n) then the run-time is O(k*lg(k)) = O(n*lg(n)) because it is the coupon collector's problem. If k is small relative to n then it takes slightly more than O(k) because of the (albeit low) probability of choosing the same number twice. This is better than the above solution in terms of space but worse in terms of run-time.
My question:
is there an O(k) time, O(k) space algorithm for all k and n?
With an O(1) hash table, the partial Fisher-Yates method can be made to run in O(k) time and space. The trick is simply to store only the changed elements of the array in the hash table.
Here's a simple example in Java:
import java.util.HashMap;
import java.util.Random;

public static int[] getRandomSelection(int k, int n, Random rng) {
    if (k > n) throw new IllegalArgumentException(
        "Cannot choose " + k + " elements out of " + n + "."
    );
    HashMap<Integer, Integer> hash = new HashMap<Integer, Integer>(2 * k);
    int[] output = new int[k];
    for (int i = 0; i < k; i++) {
        int j = i + rng.nextInt(n - i); // random index in [i, n-1]
        output[i] = (hash.containsKey(j) ? hash.remove(j) : j);
        if (j > i) hash.put(j, (hash.containsKey(i) ? hash.remove(i) : i));
    }
    return output;
}
This code allocates a HashMap of 2×k buckets to store the modified elements (which should be enough to ensure that the hash table is never rehashed), and just runs a partial Fisher-Yates shuffle on it.
Here's a quick test on Ideone; it picks two elements out of three 30,000 times, and counts the number of times each pair of elements gets chosen. For an unbiased shuffle, each ordered pair should appear approximately 5,000 (±100 or so) times, except for the impossible cases where both elements would be equal.
Your second approach does not take Theta(k log k) time on average; it takes about n/(n-k+1) + n/(n-k+2) + ... + n/n operations, which is less than k*(n/(n-k)) since you have k terms, each smaller than n/(n-k). For k <= n/2, it takes under 2*k operations on average. For k > n/2, you can choose a random subset of size n-k and take the complement. So this is already an O(k) average time and space algorithm.
What you could use is the following algorithm (using javascript instead of pseudocode):
var k = 3;
var n = [1, 2, 3, 4, 5, 6];

// O(k) iterations
for (var i = 0, tmp; i < k; ++i) {
    // Random index, O(1)
    var index = Math.floor(Math.random() * (n.length - i));
    // Output, O(1)
    console.log(n[index]);
    // Swap and lookup, O(1)
    tmp = n[index];
    n[index] = n[n.length - i - 1];
    n[n.length - i - 1] = tmp;
}
In short, you swap the selected value with the last item and in the next iteration sample from the reduced subset. This assumes your original set is wholly unique.
The storage is O(n); if you wish to retrieve the numbers as a set, just refer to the last k entries of n.
You may have heard about the well-known problem of finding the longest increasing subsequence. The optimal algorithm has O(n*log(n)) complexity.
I was thinking about the problem of finding all increasing subsequences in a given sequence. I have found a solution for the problem of counting the increasing subsequences of length k, which has O(n*k*log(n)) complexity (where n is the length of the sequence).
Of course, this algorithm can be used for my problem, but then the solution has O(n*k*log(n)*n) = O(n^2*k*log(n)) complexity, I suppose. I think there must be a better (I mean faster) solution, but I don't know of one yet.
If you know how to solve the problem of finding all increasing subsequences of a given sequence in optimal time/complexity (in this case, optimal = better than O(n^2*k*log(n))), please let me know.
In the end: this problem is not homework. The longest increasing subsequence problem was mentioned in one of my lectures, and I started thinking about the general idea of all increasing subsequences in a given sequence.
I don't know if this is optimal - probably not, but here's a DP solution in O(n^2).
Let dp[i] = number of increasing subsequences with i as the last element
for i = 1 to n do
    dp[i] = 1
    for j = 1 to i - 1 do
        if input[j] < input[i] then
            dp[i] = dp[i] + dp[j] // we can just append input[i] to every subsequence ending with j
Then it's just a matter of summing all the entries in dp
You can compute the number of increasing subsequences in O(n log n) time as follows.
Recall the algorithm for the length of the longest increasing subsequence:
For each element, find its best predecessor among the previous elements, and add one to that predecessor's subsequence length.
This algorithm runs naively in O(n^2) time, and runs in O(n log n) (or even better, in the case of integers), if you compute the predecessor using a data structure like a balanced binary search tree (BST) (or something more advanced like a van Emde Boas tree for integers).
To amend this algorithm for computing the number of sequences, store in the BST in each node the number of sequences ending at that element. When processing the next element in the list, you simply search for the predecessor, count the number of sequences ending at an element that is less than the element currently being processed (in O(log n) time), and store the result in the BST along with the current element. Finally, you sum the results for every element in the tree to get the result.
As a caveat, note that the number of increasing sequences could be very large, so that the arithmetic no longer takes O(1) time per operation. This needs to be taken into consideration.
Pseudocode:

ret = 0
T = empty_augmented_bst() // with an integer field in addition to the key
for x in X:
    // sum of auxiliary fields of keys less than x,
    // computed in O(log n) time using augmented BSTs
    count = 1 + T.sum_less(x)
    T.insert(x, count) // sets x's auxiliary field to count
    ret += count       // keep track of return value
return ret
I'm assuming, without loss of generality, that the input A[0..(n-1)] consists of all the integers in {0, 1, ..., n-1}.
Let DP[i] = number of increasing subsequences ending in A[i].
We have the recurrence DP[i] = 1 + sum of DP[j] over all j < i with A[j] < A[i].
To compute DP[i], we only need DP[j] for all j where A[j] < A[i]. Therefore, we can compute the DP array in ascending order of the values of A; this leaves DP[k] = 0 for all k where A[k] > A[i].
The problem boils down to computing the sum DP[0] to DP[i-1]. Supposing we have already calculated DP[0] to DP[i-1], we can calculate DP[i] in O(log n) using a Fenwick tree.
The final answer is then DP[0] + DP[1] + ... DP[n-1]. The algorithm runs in O(n log n).
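A minimal C++ sketch of this approach, under the stated assumption that A is a permutation of {0, 1, ..., n-1} (here the Fenwick tree is indexed by value and the array is scanned left to right, which evaluates the same recurrence):

#include <vector>

long long countIncreasingSubsequences(const std::vector<int>& A) {
    int n = A.size();
    std::vector<long long> tree(n + 1, 0); // 1-based Fenwick tree over values

    auto add = [&](int v, long long d) {
        for (int i = v + 1; i <= n; i += i & -i) tree[i] += d;
    };
    auto prefix = [&](int v) { // sum of DP over values 0..v (returns 0 when v < 0)
        long long s = 0;
        for (int i = v + 1; i > 0; i -= i & -i) s += tree[i];
        return s;
    };

    long long total = 0;
    for (int i = 0; i < n; ++i) {
        long long dp = 1 + prefix(A[i] - 1); // subsequences ending at A[i]
        add(A[i], dp);
        total += dp;
    }
    return total; // note: the count can overflow even 64 bits for large n
}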
This is an O(nklogn) solution where n is the length of the input array and k is the size of the increasing sub-sequences. It is based on the solution mentioned in the question.
vector<int> values, an array of length n, is the array to be searched for increasing sub-sequences.

vector<int> temp(n);    // array for sorting
map<int, int> mapIndex; // translates from a value to the 1-based count of values less than it
partial_sort_copy(values.cbegin(), values.cend(), temp.begin(), temp.end());
for (auto i = 0; i < n; ++i) {
    mapIndex.insert(make_pair(temp[i], i + 1)); // insert only adds each number the first time
}
mapIndex now contains a ranking of all numbers in values.
vector<vector<int>> binaryIndexTree(k, vector<int>(n)); // a 2D binary index tree with depth k
auto result = 0;
for (auto it = values.cbegin(); it != values.cend(); ++it) {
    auto rank = mapIndex[*it];
    auto value = 1; // number of sequences to be added to this rank and all subsequent ranks
    update(rank, value, binaryIndexTree[0]); // populate the tree for sub-sequences of length 1
    for (auto i = 1; i < k; ++i) { // iterate over all sub-sequence lengths 2 - k
        value = getValue(rank - 1, binaryIndexTree[i - 1]); // all shorter sub-sequences of lesser rank
        update(rank, value, binaryIndexTree[i]); // update the tree for sub-sequences of this length
    }
    result += value; // add the possible sub-sequences of length k for this rank
}
After placing all n elements of values into all k dimensions of binaryIndexTree, the values collected into result represent the total number of increasing sub-sequences of length k.
The binary index tree functions used to obtain this result are:
void update(int rank, int increment, vector<int>& binaryIndexTree)
{
    while (rank < binaryIndexTree.size()) { // increment the current rank and all higher ranks
        binaryIndexTree[rank - 1] += increment;
        rank += (rank & -rank);
    }
}

int getValue(int rank, const vector<int>& binaryIndexTree)
{
    auto result = 0;
    while (rank > 0) { // search the current rank and all lower ranks
        result += binaryIndexTree[rank - 1]; // sum any value found into result
        rank -= (rank & -rank);
    }
    return result;
}
The binary index tree is obviously O(nklogn), but it is the ability to sequentially fill it out that creates the possibility of using it for a solution.
mapIndex creates a rank for each number in values, such that the smallest number in values has a rank of 1. (For example, if values is "2, 3, 4, 3, 4, 1" then mapIndex will contain: {1, 1}, {2, 2}, {3, 3}, {4, 5}. Note that "4" has a rank of "5" because there are two "3"s in values.)
binaryIndexTree holds k different trees; level x represents the total number of increasing sub-sequences of length x that can be formed. Any number in values can create a sub-sequence of length 1, so each element increments its rank and all ranks above it by 1.
At the higher levels, an increasing sub-sequence depends on there already being a sub-sequence of shorter length and lower rank available.
Because elements are inserted into the binary index tree according to their order in values, the order of occurrence in values is preserved: if an element has already been inserted into binaryIndexTree, that is because it preceded the current element in values.
An excellent description of the binary index tree is available here: http://www.geeksforgeeks.org/binary-indexed-tree-or-fenwick-tree-2/
You can find an executable version of the code here: http://ideone.com/GdF0me
Let us take an example -
Take an array {7, 4, 6, 8}
Now, if you consider each individual element also as a subsequence, then the increasing subsequences that can be formed are:
{7} {4} {6} {4,6} {8} {7,8} {4,8} {6,8} {4,6,8}
A total of 9 increasing subsequences can be formed for this array.
So the answer is 9.
The code is as follows -
int arr[] = {7, 4, 6, 8};
int T[] = new int[arr.length];
for (int i = 0; i < arr.length; i++)
    T[i] = 1;
int sum = 1;
for (int i = 1; i < arr.length; i++) {
    for (int j = 0; j < i; j++) {
        if (arr[i] > arr[j]) {
            T[i] = T[i] + T[j];
        }
    }
    sum += T[i];
}
System.out.println(sum);
The complexity of the code is O(N^2).
You can use a sparse segment tree to get an optimal solution with O(nlog(n)).
The solution runs as follows:
for (int i = 0; i < n; i++)
{
    dp[i] = 1 + query(0, a[i] - 1); // sum of dp over values strictly less than a[i]
    update(a[i], dp[i]);
}
The query parameters are : query(first position, last position)
The update parameters are : update(position,value)
And the final answer is the sum of all values of dp array.
Java version as an example:
int[] A = {1, 2, 0, 0, 0, 4};
int[] dp = new int[A.length];
int total = 0;
for (int i = 0; i < A.length; i++) {
    dp[i] = 1; // A[i] by itself
    for (int j = 0; j <= i - 1; j++) {
        if (A[j] < A[i]) {
            dp[i] = dp[i] + dp[j];
        }
    }
    total += dp[i];
}
System.out.println(total); // sum of dp = number of increasing subsequences