Median and order statistics with O(n) time complexity - algorithm

Describe an O(n)-time algorithm that, given a set S of n distinct numbers and a positive
integer k ≤ n, outputs the k numbers in S that are closest to the median of S (excluding the
median). Hint: The target numbers may not be evenly placed around the median in the sorted
version of the array. E.g., consider 1,2,3,8,10; the 2 numbers closest to the median 3 are 1,2,
excluding the median itself, but they are both less than the median. (Note: this is just an
illustration; don't assume that the array is sorted.)
Here is the answer that I found:
Answer: Find the n/2 − k/2 largest element in linear time. Partition on that element. Then, find the k largest element in the bigger subarray formed from the partition. Then, the elements in the smaller subarray from partitioning on this element are the desired k numbers.
My illustration:
Suppose I have an array with 11 elements and the array is an unsorted array
index_number 1 2 3 4 5 6 7 8 9 10 11
arr_elements 2 5 3 10 4 7 1 12 6 13 8
Since there are 11 elements, the median position should be 11/2 = 5.5, approximately 6, so arr_element 7 is the median.
The solution says "Find the n/2 − k/2 largest element in linear time." Suppose k=4, so k/2 = 2; therefore I need to find the largest element from index 2 through index 6. The array elements from index 2 through 6 are {5,3,10,4,7}, so the largest element is 10.
The answer then says "Partition on that element." So there will be two subarrays after partitioning on arr_element 10: {2,5,3} and {4,7,1,12,6,13,8}.
Next, the answer says "Then, find the k largest element in the bigger subarray formed from the partition." With k=4, the kth largest element means the 4th largest element. The 4th largest element in the bigger subarray is 8.
Finally, the algorithm says "Then, the elements in the smaller subarray from partitioning on this element are the desired k numbers." I did not understand this statement.
The problem comes from Cormen's Introduction to Algorithms, Chapter 9: Medians and Order Statistics.
Any hints would be appreciated.

The problem is to find the median, then find the distance d such that exactly k or k+1 points are within that distance from the median, and then output those points.
Hint: Study quickselect.
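The two-step idea above (select the median, then select the k-th smallest distance to it) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are made up for this example, the median is taken to be the lower median, and the randomized quickselect shown here runs in expected O(n) time. A worst-case O(n) bound would need a deterministic selection routine such as median of medians.

```python
import random


def quickselect(values, rank):
    """Return the element with 0-based `rank` in sorted order (expected O(n))."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        smaller = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        larger = [v for v in items if v > pivot]
        if rank < len(smaller):
            items = smaller
        elif rank < len(smaller) + len(equal):
            return pivot
        else:
            rank -= len(smaller) + len(equal)
            items = larger


def k_closest_to_median(s, k):
    """Return k elements of s closest to its (lower) median, median excluded."""
    median = quickselect(s, (len(s) - 1) // 2)
    others = [x for x in s if x != median]
    if k >= len(others):
        return others
    # d is the k-th smallest distance to the median.
    d = quickselect([abs(x - median) for x in others], k - 1)
    closer = [x for x in others if abs(x - median) < d]
    # Two elements (median - d and median + d) may tie at distance d,
    # so take only as many of them as are still needed.
    ties = [x for x in others if abs(x - median) == d]
    return closer + ties[: k - len(closer)]
```

For the example in the question, `k_closest_to_median([1, 2, 3, 8, 10], 2)` returns 1 and 2 in some order.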

Related

find maximum possible min value of array

There is an array containing n integers. In each step we are allowed to increment all the elements present in any subarray of size w by 1. The maximum number of such steps allowed is m. Any element of the array cannot be incremented more than k times. We are required to maximize the minimum possible element in the array after these operations.
For example we are given n=6, m=2, w=3, k=1
And the array is 2 2 2 2 1 1.
Then the answer is 2, since k=1 (we can increment each element only once), so applying a window of size 3 at the end of the array gives the required answer. Also note that since m=2, the first 3 elements will be incremented in the next step.
How do I approach this problem?
Edit: Constraints are
1 ≤ w ≤ n ≤ 10^5
1 ≤ k ≤ m ≤ 10^5
Elements in the array are in range 1 to 10^9.

Sequence increasing and decreasing by turns

Let's assume we've got a sequence of integers of given length n. We want to delete some elements (maybe none), so that the sequence is increasing and decreasing by turns in result. It means, that every element should have neighbouring elements either both bigger or both smaller than itself.
For example 1 3 2 7 6 and 5 1 4 2 10 are both sequences increasing and decreasing by turns.
We want to delete some elements to transform our sequence that way, but we also want to maximize the sum of elements left. So, for example, from sequence 2 18 6 7 8 2 10 we want to delete 6 and make it 2 18 7 8 2 10.
I am looking for an efficient solution to this problem. The example above shows that the most naive greedy algorithm (delete the first element that breaks the pattern each time) won't work: it would delete 7 instead of 6, which would not maximize the sum of the remaining elements.
Any ideas on how to solve this efficiently (probably O(n) or O(n log n)) and correctly?
For every element of the sequence with index i we will calculate F(i, high) and F(i, low), where F(i, high) is the biggest sum of a subsequence with the wanted characteristics that ends with the i-th element, with this element being a "high peak". (I'll explain mainly the "high" part; the "low" part can be done similarly.) We can calculate these functions using the following relations:
F(i, high) = a[i] + max(F(j, low)) over all j < i with a[j] < a[i]
F(i, low) = a[i] + max(F(j, high)) over all j < i with a[j] > a[i]
(the max is taken as 0 when no such j exists)
The answer is the maximum among all F(i, high) and F(i, low) values.
That gives us a rather simple dynamic programming solution with O(n^2) time complexity. But we can go further.
We can optimize a calculation of max(F(j,low)) part. What we need to do is to find the biggest value among previously calculated F(j, low) with the condition that a[j] < a[i]. This can be done with segment trees.
First of all, we'll "squeeze" our initial sequence. We need the real value of the element a[i] only when calculating the sum. But we need only the relative order of the elements when checking that a[j] is less than a[i]. So we'll map every element to its index in the sorted elements array without duplicates. For example, sequence a = 2 18 6 7 8 2 10 will be translated to b = 0 5 1 2 3 0 4. This can be done in O(n*log(n)).
The biggest element of b will be less than n, as a result, we can build a segment tree on the segment [0, n] with every node containing the biggest sum within the segment (we need two segment trees for "high" and "low" part accordingly). Now let's describe the step i of the algorithm:
1. Find the biggest sum max_low on the segment [0, b[i]-1] using the "low" segment tree (initially all nodes of the tree contain zero).
2. F(i, high) is equal to max_low + a[i].
3. Find the biggest sum max_high on the segment [b[i]+1, n] using the "high" segment tree.
4. F(i, low) is equal to max_high + a[i].
5. Update the [b[i], b[i]] segment of the "high" segment tree with the F(i, high) value, recalculating the maximums of the parent nodes (and of the [b[i], b[i]] node itself).
6. Do the same for the "low" segment tree and F(i, low).
Complexity analysis: b sequence calculation is O(n*log(n)). Segment tree max/update operations have O(log(n)) complexity and there are O(n) of them. The overall complexity of this algorithm is O(n*log(n)).
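As a rough illustration of the steps above, here is a minimal Python sketch. The class and function names are invented for this example. It uses an iterative max segment tree, and when updating a leaf it keeps the larger of the old and new values, which is an assumption made here so that repeated values (such as the two 2s in the example) do not overwrite a better earlier sum.

```python
class MaxSegmentTree:
    """Point-update, range-maximum tree; every node starts at zero."""

    def __init__(self, size):
        self.n = 1
        while self.n < size:
            self.n *= 2
        self.tree = [0] * (2 * self.n)

    def update(self, pos, value):
        i = pos + self.n
        # Keep the best value seen for this rank (handles duplicates).
        self.tree[i] = max(self.tree[i], value)
        i //= 2
        while i >= 1:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def query(self, lo, hi):
        """Maximum on the inclusive range [lo, hi]; 0 if the range is empty."""
        best = 0
        lo += self.n
        hi += self.n + 1
        while lo < hi:
            if lo & 1:
                best = max(best, self.tree[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = max(best, self.tree[hi])
            lo //= 2
            hi //= 2
        return best


def max_alternating_sum(a):
    # "Squeeze" the values: map each one to its rank among distinct values.
    ranks = {v: r for r, v in enumerate(sorted(set(a)))}
    m = len(ranks)
    high_tree = MaxSegmentTree(m)  # stores F(i, high) by rank
    low_tree = MaxSegmentTree(m)   # stores F(i, low) by rank
    best = 0
    for x in a:
        r = ranks[x]
        f_high = x + low_tree.query(0, r - 1)       # previous element smaller
        f_low = x + high_tree.query(r + 1, m - 1)   # previous element larger
        high_tree.update(r, f_high)
        low_tree.update(r, f_low)
        best = max(best, f_high, f_low)
    return best
```

On the sequence from the question, `max_alternating_sum([2, 18, 6, 7, 8, 2, 10])` gives 47, which corresponds to deleting 6.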

Select pairs of numbers with the minimum overall difference

Given n pairs of numbers, select k pairs so that the difference between the minimum value and the maximum value is minimal. Note that 2 numbers in 1 pair cannot be separated. Example (n=5, k=3):
INPUT    OUTPUT (return the index of the pairs)
5 4      1 2 4
1 5
9 8
1 0
2 7
In this case, choosing (5,4) (1,5) (1,0) will give a difference of 5 (max is 5, min is 0). I'm looking for an efficient way (n log n) of doing this since the input will be pretty large and I don't want to go through every possible case.
Thank you.
NOTE: No code is needed. An explanation of the solution is enough.
Here's a method with O(n log n) time complexity:
First sort the array according to the smaller number in the pair. Now iterate back from the last element in the sorted array (the pair with the highest minimum).
As we go backwards, the elements already visited will necessarily have an equal or higher minimum than the current element. Store the visited pairs in a max heap according to the maximal number in the visited pair. If the heap size is smaller than k-1, keep adding to the heap.
Once the heap size equals k-1, begin recording and comparing the best interval so far. If the heap size exceeds k-1, pop the maximal element off. The heap is guaranteed to contain the first k-1 pairs where the minimal number is greater than or equal to the current minimal number and the maximal is smallest (since we keep popping off the maximal element when the heap size exceeds k-1).
Total time O(n log n) for sorting + O(n log n) to iterate and maintain the heap = O(n log n) in total.
Example:
5 4
1 5
9 8
1 0
2 7
k = 3
Sort pairs by the smaller number in each pair:
[(1,0),(1,5),(2,7),(5,4),(9,8)]
Iterate from end to start:
i = 4; Insert (9,8) into heap
i = 3; Insert (5,4) into heap
i = 2; Range = 2-9
i = 1; Pop (9,8) from heap; Range = 1-7
i = 0; Pop (2,7) from heap; Range = 0-5
Minimal interval [0,5] (find k matching indices in O(n) time)
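The sort-and-heap procedure traced above could look roughly like this in Python. It is a sketch, not a reference implementation: the function name is made up, indices are 0-based, and since `heapq` is a min-heap the maxima are stored negated to simulate a max heap.

```python
import heapq


def select_pairs(pairs, k):
    """Return (smallest difference, indices of k pairs achieving it)."""
    if k < 1 or k > len(pairs):
        raise ValueError("k must be between 1 and the number of pairs")
    # Sort pair indices by the smaller number of each pair.
    order = sorted(range(len(pairs)), key=lambda i: min(pairs[i]))
    heap = []    # negated maxima of the k-1 best pairs visited so far
    best = None  # (difference, interval low, interval high)
    for i in reversed(order):
        lo, hi = min(pairs[i]), max(pairs[i])
        if len(heap) == k - 1:
            # The current pair fixes the minimum; the heap top is the
            # largest maximum among the k-1 companions.
            top = max(hi, -heap[0]) if heap else hi
            if best is None or top - lo < best[0]:
                best = (top - lo, lo, top)
        heapq.heappush(heap, -hi)
        if len(heap) > k - 1:
            heapq.heappop(heap)  # drop the pair with the largest maximum
    diff, lo, hi = best
    # One O(n) pass to collect k pairs that fit inside the best interval.
    chosen = [i for i, p in enumerate(pairs) if lo <= min(p) and max(p) <= hi]
    return diff, chosen[:k]
```

For the example input, `select_pairs([(5, 4), (1, 5), (9, 8), (1, 0), (2, 7)], 3)` returns a difference of 5 with 0-based indices 0, 1 and 3, matching pairs 1, 2 and 4 in the question's 1-based numbering.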
Let's keep two sorted arrays: one sorted by the minimal number in each pair and the other by the maximal number. Iterate over the first array, fixing the minimal number of the answer. We can keep a pointer to the k-th number in the second array. When we move to the next pair, we remove from the second array all pairs with a smaller minimal value and advance the pointer if needed. To find a pair's position in the second array in O(log n) time, we can keep an additional map between pair and position.

Inversion distance

First of all, let's recall the definition of an inversion.
An inversion in a sequence S of numbers is a pair of positions i < j with S[i] > S[j]; put simply, it is a pair of elements that are out of order. For instance, for the sequence:
1 4 3 7 5 6 2
We have following inversions (4,3), (4,2), (3,2), (7,5), etc.
We state the problem as follows: the inversion distance is the maximum distance (in terms of indices) between two values that form an inversion. For our example, searching by hand gives the pair (4,2) <=> (S[1], S[6]), and therefore the index distance is 6-1 = 5, which is the maximum possible in this case.
This problem can be solved trivially in O(n^2) by finding all inversions and keeping the maximum distance (updating it whenever we find a better option).
We can also search for inversions more efficiently using merge sort and therefore do the same in O(n log n). Is an O(n) algorithm possible? Keep in mind that we only want the maximum distance; we don't want to find all inversions. Please elaborate.
Yes, an O(n) algorithm is possible.
We can extract a strictly increasing subsequence with a greedy algorithm:
source: 1 4 3 7 5 6 2
strictly increasing subsequence: 1 4 7
Then we can extract a strictly decreasing subsequence going backwards:
source: 1 4 3 7 5 6 2
strictly decreasing subsequence: 1 2
Note that once this strictly decreasing subsequence is found, we can read it as an increasing sequence (in the normal direction).
For each element of these subsequences we need to store its index in the source sequence.
Now the "inversion distance" can be found by merging these two subsequences (similar to the merge sort mentioned in the OP, but only one merge pass is needed):
merge 1 & 1 ... no inversion, advance both indices
merge 4 & 2 ... inversion found, distance=5, should advance second index,
but here is end of subsequence, so we are done, max distance = 5
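A minimal Python sketch of this single merge pass is given below. The function name is made up for illustration, and the pointer handling is one possible reading of the merge described above: the pointer into the second subsequence advances only past elements that formed an inversion, so an element that was not yet matched can still pair with a later, larger element of the first subsequence.

```python
def max_inversion_distance(s):
    """Largest j - i with i < j and s[i] > s[j]; 0 if there is no inversion."""
    # Greedy strictly increasing subsequence from the left, as (value, index).
    left = []
    for i, v in enumerate(s):
        if not left or v > left[-1][0]:
            left.append((v, i))
    # Greedy strictly decreasing subsequence going backwards, then reversed
    # so that it reads as an increasing sequence in the normal direction.
    right = []
    for j in range(len(s) - 1, -1, -1):
        if not right or s[j] < right[-1][0]:
            right.append((s[j], j))
    right.reverse()

    best = 0
    p = 0
    for v, i in left:
        # Every unmatched element of `right` smaller than v forms an
        # inversion with v, and v is the earliest element that can do so.
        while p < len(right) and right[p][0] < v:
            best = max(best, right[p][1] - i)
            p += 1
    return best
```

For the sequence in the question, `max_inversion_distance([1, 4, 3, 7, 5, 6, 2])` returns 5.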
Maybe my idea is the same as @Evgeny's.
Here is the explanation:
Make a strictly increasing array from the beginning; call it array1.
Make a strictly decreasing array from the end; call it array2 (but keep its values in increasing order).
***Keep track of the original indexes of the values in both arrays.
Now start from the beginning of both arrays, with index i for array1 and index j for array2.
Repeat the following until array1 or array2 is exhausted:
While( j is within array2 and array1[i] > array2[j] )
{
check the original distance between array1[i] and array2[j].
Update result accordingly.
increase j.
}
increase i (array2[j] may still form an inversion with a later, larger element of array1)
Continue with the loop
At the end of this process you will have the maximum result. Proof of this solution is not that complex, you can try it yourself.

divide and conquer - find median between two arrays of equal size that contain unique elements?

I am trying to solve a problem exactly like this:
nth smallest number among two databases of size n each using divide and conquer
From what I could figure out, the "comparing medians/median of medians" algorithm would give us the solution? My question is whether I am understanding this correctly.
array 1: [7 8 6 5 3]
array 2: [4 10 1 2 9]
First, find the median for each. we can do this by querying for k=n/2, where n is the size of that array. Being the 3rd smallest element in this case, this gives us 6 for the first array (call this m1), and 4 for the second array (call this m2).
Since m1 > m2, create 2 arrays using the elements that are less than m1 and greater than m2 in that array.
array 1: [5 3]
array 2: [10 9]
^ How would we find the elements that are less than m1 and greater than m2? Would we just take m1 and m2 and compare them with every element in their respective arrays? I know this works when the two arrays are both sorted, but would sorting them first allow us to still get O(log(n)) queries?
I'm assuming we can continue to use our special query (can we?) to get the k=n/2 smallest element (median) for that particular array. If this is the case, we query for k=n/2=1, leaving us with new m1 = 3, m2 = 9.
m1 < m2, so we make 2 arrays using elements that are greater than m1 and less than m2 in that array.
Since there are no elements in array 2 that are less than m2 = 9, we are only left with one array with one element greater than m1 = 3.
[5] <- this is the median
I am also interested in seeing the proof of correctness (that this finds the median) by induction.
The O(n) median-of-medians algorithm actually partitions the array so that the elements before the pivot are less than it and the elements after it are greater than it.
When you recurse with the median of medians as the pivot, you are partitioning the array so that it looks like
(elements less than the median) - p - (elements greater than the median)
On correctness: when you first query for k = n/2, you get m1 and m2 (with m1 > m2). Now you know that there are more than n elements that are less than m1, so the elements following it will never be candidates for the median.
Similarly for the elements before m2: there are more than n elements ahead of them, so they will never be candidates for the median. So the median must lie somewhere in the second half of the second array and the first half of the first array.
But when you recurse, keep in mind that n/2 elements of the second array are already accounted for, so you need to find the element that would occupy the n/2-th position in the sorted union of the two remaining halves (the second half of the second array and the first half of the first).
This seems asymptotically optimal, since you always halve the size of the arrays you recurse on:
something like O(n) + O(n/2) + O(n/4) + ... = O(n).
For sorted arrays you can do this in O(log n).
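For the query-based version of the problem (each database only answers "what is your k-th smallest element?"), the halving argument above can be sketched in Python as below. The `Database` class is a hypothetical stand-in that simulates such queries with a sorted list, and the function name is likewise made up. The sketch assumes all 2n values are distinct, as in the question, and uses O(log n) queries.

```python
class Database:
    """Hypothetical database that only answers k-th smallest queries."""

    def __init__(self, values):
        self._sorted = sorted(values)
        self.queries = 0

    def kth_smallest(self, k):
        """Return the k-th smallest stored value (k is 1-based)."""
        self.queries += 1
        return self._sorted[k - 1]


def nth_smallest_of_union(db_a, db_b, n):
    """n-th smallest of the 2n values held in two databases of size n."""
    a_lo = b_lo = 0  # how many of the smallest values were discarded so far
    size = n         # both remaining windows hold `size` candidates
    while size > 1:
        k = (size + 1) // 2  # position of the window median
        drop = size // 2
        if db_a.kth_smallest(a_lo + k) < db_b.kth_smallest(b_lo + k):
            # The lower part of A's window lies below the answer; the upper
            # part of B's window lies above it and is dropped implicitly.
            a_lo += drop
        else:
            b_lo += drop
        size = k
    return min(db_a.kth_smallest(a_lo + 1), db_b.kth_smallest(b_lo + 1))
```

With the arrays from the question, `nth_smallest_of_union(Database([7, 8, 6, 5, 3]), Database([4, 10, 1, 2, 9]), 5)` returns 5, the same value the walkthrough arrives at.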
