Finding the medians of multiple subarrays in an unsorted array - algorithm

Suppose you are given an unsorted array of integers S and a list of ranges T; return the median of each range.
For example, S = [3,6,1,5,0,0,1,-2], T = [[1,3],[0,5],[4,4]]. Return [5, 2, 0]. (Ranges are inclusive; for an even-length range the two middle values are averaged.)
Is there a better approach than running Median of Medians on each range? Can we somehow precompute/cache the results?

Let me introduce you to an interesting data structure called Wavelet Tree:
You build it by looking at the bit-string representation of your integers and recursively bisecting them:
You first separate your integers into those whose most significant bit (MSB) is 0 and those whose MSB is 1; however, you store only the MSBs, in their original order, in a bitvector. Then, for each of these two subsets, you ignore the MSB and recursively repeat the construction for the next-most-significant bit.
If you repeat this down to the least significant bit, you get a tree of bitvectors, one per node (indices are helpful for illustration, but you should store only the bitvectors).
You can easily see that the construction of this data structure takes O(n log N) time where n is the number of integers and N is their maximum value.
Wavelet trees have the nice property that they represent the original sequence and its sorted counterpart at the same time:
If you read the topmost bitvector, you get the MSBs of the input sequence. To reconstruct the next bit of an entry, you look in the bitvector of the root's left child (if the entry's MSB is 0) or of its right child (if the MSB is 1). For the following bits, continue recursively.
If you read the leaf nodes from left to right, you get the sorted sequence.
To use a Wavelet tree efficiently, you need two fundamental operations on the bitvectors:
rank1(k) tells you how many 1s come before the kth position in the bitvector; rank0(k) does the same for 0s.
select1(k) tells you the index of the kth 1 in the bitvector; select0(k) does the same for 0s.
Note that there are bitvector representations that require only o(n) (little-o) bits of additional storage and still support these operations in O(1) time.
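To pin down the exact semantics, here is a naive Python sketch of these four operations on a plain list of 0/1 values (the names and indexing conventions are my choices for this illustration; each call is O(n) here, whereas real succinct bitvectors answer them in O(1), and conventions differ between implementations):

    def rank1(bits, k):
        """Number of 1s strictly before position k, i.e. in bits[:k]."""
        return sum(bits[:k])

    def rank0(bits, k):
        """Number of 0s strictly before position k."""
        return k - rank1(bits, k)

    def select1(bits, k):
        """Index of the k-th 1, counting from k = 1."""
        seen = 0
        for i, b in enumerate(bits):
            seen += b
            if b and seen == k:
                return i
        raise ValueError("fewer than k ones in the bitvector")

    def select0(bits, k):
        """Index of the k-th 0, counting from k = 1."""
        seen = 0
        for i, b in enumerate(bits):
            seen += 1 - b
            if not b and seen == k:
                return i
        raise ValueError("fewer than k zeros in the bitvector")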
You can utilize them as follows:
If you are looking at the first 7 in the sequence above, it has index 3. If you now want to know which index it has in the right child node, you simply call rank1(3) on the root bitvector and get 2, which is exactly the index of the first 7 in the right child.
If you are at the child containing 4544 and want to know the position of the second 4 (with index 2) in the parent node containing 46754476, you call select0(2) on the parent's bitvector and get the index 5.
Now how can you implement a range median query with this? The most important realization is that finding the median of a range of size k is equivalent to selecting the ⌈k/2⌉th-smallest element of that range.
The basic idea of the algorithm is similar to Quickselect: bisect the value range and recurse only into the half containing the element you are looking for.
Let's say we want to find the median of the range starting at the second 2 (inclusive) and ending at the 1 (exclusive).
These are 7 elements, thus the median has rank 4 (fourth-smallest element) in that range.
Now using a rank0/1 call in the root bitvector at the beginning and end of this range, we find the corresponding ranges in the children of the root:
As you can see, the left range (which contains only smaller elements) has only 3 elements, so the element with rank 4 must be contained in the right child of the root. We can now recursively search for the element with rank 4 - 3 = 1 in that right child. By recursively descending the wavelet tree until you reach a leaf, you identify the median with only two rank operations (each O(1)) per level, so the whole range median query takes O(log N) time, where N is the maximum value in your input sequence.
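Here is a minimal Python sketch of the whole idea (wt_build and wt_quantile are names I made up; it stores plain lists with precomputed prefix counts instead of succinct bitvectors, and it bisects the value range, which also handles negative values and is equivalent to bisecting on bits for nonnegative ones):

    def wt_build(seq, lo, hi):
        """Build a wavelet tree node over the values lo..hi occurring in seq."""
        node = {"lo": lo, "hi": hi}
        if lo == hi:
            return node                        # leaf: every value here equals lo
        mid = (lo + hi) // 2                   # bisect the value range
        bits = [0 if x <= mid else 1 for x in seq]
        prefix = [0]                           # prefix[i] = rank1(i), precomputed
        for b in bits:
            prefix.append(prefix[-1] + b)
        node["prefix"] = prefix
        node["left"] = wt_build([x for x in seq if x <= mid], lo, mid)
        node["right"] = wt_build([x for x in seq if x > mid], mid + 1, hi)
        return node

    def wt_quantile(node, i, j, k):
        """k-th smallest value (1-indexed) in the original slice [i, j)."""
        while node["lo"] != node["hi"]:
            ones_i = node["prefix"][i]         # rank1(i) in this node's bitvector
            ones_j = node["prefix"][j]
            zeros = (j - i) - (ones_j - ones_i)
            if k <= zeros:                     # the answer is a "smaller" element
                node, i, j = node["left"], i - ones_i, j - ones_j
            else:                              # the answer is a "larger" element
                node, i, j, k = node["right"], ones_i, ones_j, k - zeros
        return node["lo"]

    S = [3, 6, 1, 5, 0, 0, 1, -2]
    root = wt_build(S, min(S), max(S))
    print(wt_quantile(root, 1, 4, 2))          # median of S[1..3] -> 5
    # even-length ranges: average the two middle ranks, e.g. S[0..5] -> 2.0
    print((wt_quantile(root, 0, 6, 3) + wt_quantile(root, 0, 6, 4)) / 2)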
If you want to see a practical implementation of these Wavelet trees, have a look at the Succinct Data Structures Library (SDSL) which implements the aforementioned bitvectors and different WT variants.

How to effectively answer range queries in an array of integers?
Queries are of one type only, which is, given a range [a,b], find the sum of elements that are less than x (here x is a part of each query, say of the form a b x).
Initially, I tried to literally go from a to b, check whether the current element is less than x, and add it up if so. But this is very inefficient, as each query takes O(n).
Now I am trying segment trees, sorting the numbers while merging. But my challenge is that if I sort, I lose the integers' relative order, so when a query comes I cannot use the sorted array to get the values from a to b.
Here are two approaches to solving this problem with segment trees:
Approach 1
You can use a segment tree of sorted arrays.
As usual, the segment tree divides your array into a series of subranges of different sizes. For each subrange you store a sorted list of the entries plus a cumulative sum of the sorted list. You can then use binary search to find the sum of entries below your threshold value in any subrange.
When given a query, you first work out the O(log n) subranges that cover your [a,b] range. For each of these you do an O(log n) binary search. Overall this is O(q log^2 n) to answer q queries (plus the preprocessing time).
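Here is a hedged Python sketch of this approach (sometimes called a merge sort tree; build and query_sum_below are names I chose, and the merge uses sorted() for brevity, which makes preprocessing O(n log^2 n) rather than the O(n log n) a linear merge would give):

    import bisect

    def build(a):
        """tree[v] = (sorted values, prefix sums) of node v's subrange."""
        size = 1
        while size < len(a):
            size *= 2
        tree = [([], [0]) for _ in range(2 * size)]
        for i, x in enumerate(a):
            tree[size + i] = ([x], [0, x])
        for v in range(size - 1, 0, -1):
            merged = sorted(tree[2 * v][0] + tree[2 * v + 1][0])
            pref = [0]
            for x in merged:
                pref.append(pref[-1] + x)
            tree[v] = (merged, pref)
        return tree, size

    def query_sum_below(tree, size, a, b, x):
        """Sum of elements < x among positions a..b inclusive, O(log^2 n)."""
        res = 0
        lo, hi = a + size, b + size + 1        # half-open over the leaves
        while lo < hi:
            if lo & 1:
                vals, pref = tree[lo]
                res += pref[bisect.bisect_left(vals, x)]
                lo += 1
            if hi & 1:
                hi -= 1
                vals, pref = tree[hi]
                res += pref[bisect.bisect_left(vals, x)]
            lo //= 2
            hi //= 2
        return res

    tree, size = build([5, 1, 4, 2, 3])
    print(query_sum_below(tree, size, 1, 3, 4))   # 1 + 2 = 3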
Approach 2
You can use a dynamic segment tree.
A segment tree allows you to answer queries of the form "compute the sum of elements from a to b" in O(log n) time, and also to modify a single entry in O(log n).
Therefore, if you start with an empty segment tree, you can insert the entries one at a time, in increasing order of value. Suppose we have added all entries with values from 1 to 5; our array may look like:
[0,0,0,3,0,0,0,2,0,0,0,0,0,0,1,0,0,0,4,4,0,0,5,1]
(The 0s represent entries that are bigger than 5 so haven't been added yet.)
At this point you can answer any query that has a threshold of 5: sort the queries by threshold, and answer each one as soon as every entry below its threshold has been inserted.
Overall this will cost O(n log n) to add all the entries into the segment tree, O(q log q) to sort the queries, and O(q log n) to answer the queries using the segment tree.
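Here is a hedged offline sketch of this approach in Python; I use a Fenwick (binary indexed) tree as the point-update/range-sum structure instead of a dynamic segment tree, which gives the same bounds, and the names are mine. It inserts the entries strictly below each query's threshold, matching the "sum of elements less than x" formulation:

    class Fenwick:
        def __init__(self, n):
            self.t = [0] * (n + 1)
        def add(self, i, v):                   # point update at 0-indexed i
            i += 1
            while i < len(self.t):
                self.t[i] += v
                i += i & -i
        def prefix(self, i):                   # sum over positions < i
            s = 0
            while i > 0:
                s += self.t[i]
                i -= i & -i
            return s
        def range_sum(self, a, b):             # sum over positions a..b
            return self.prefix(b + 1) - self.prefix(a)

    def answer_offline(arr, queries):
        """queries: list of (a, b, x); returns sum of arr[a..b] elements < x."""
        order = sorted(range(len(arr)), key=lambda i: arr[i])   # by value
        fen = Fenwick(len(arr))
        ans = [0] * len(queries)
        j = 0
        for q in sorted(range(len(queries)), key=lambda q: queries[q][2]):
            a, b, x = queries[q]
            while j < len(arr) and arr[order[j]] < x:
                fen.add(order[j], arr[order[j]])  # insert value at its position
                j += 1
            ans[q] = fen.range_sum(a, b)
        return ans

    print(answer_offline([3, 1, 4, 1, 5], [(0, 3, 4)]))   # 3 + 1 + 1 = 5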

Minimum amount of swaps needed so no number has two neighbours that are both greater?

The problem statement goes like this: Given a list of N < 500 000 distinct numbers, find the minimum number of swaps of adjacent elements required such that no number has two neighbours that are both greater. A number can only be swapped with a neighbour.
Hint given: Use a segment tree or a fenwick tree.
I don't really get the idea of how I should use a sum-tree to solve this problem.
Example inputs:
Input 1:
5 (number of elements in the list)
3 1 4 2 0
Output 1: 1
Input 2:
6
4 5 2 0 1 3
Output 2: 4
I can do it in O(n log n) time and O(n) extra space. But first, let's look at a simpler quadratic solution:
initialize the result accumulator to 0
while the input list has more than two elements:
    find the lowest element in the list
    add its distance from the closer end of the list to the accumulator
    remove the element from the list
output the accumulator
Why does this work? First, let's look at what a sequence requiring zero swaps looks like. Since there are no duplicates, if the lowest element is anywhere but at either end, it is surrounded by two elements that are both greater, violating the requirement; thus the lowest element must be at one of the ends. Then recurse into the subsequence that excludes it. To bring a sequence into this state, at least as many swaps involving the lowest element as the greedy algorithm uses are required to move it to one end, and since swaps involving the lowest element do not change the relative order of the remaining elements, there is no penalty in performing them first.
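Here is a direct, intentionally quadratic Python transcription of the greedy algorithm above (min_swaps_quadratic is my name for it), useful as a reference before the faster version:

    def min_swaps_quadratic(xs):
        xs = list(xs)
        total = 0
        while len(xs) > 2:
            i = xs.index(min(xs))              # lowest remaining element
            total += min(i, len(xs) - 1 - i)   # distance to the closer end
            del xs[i]
        return total

    print(min_swaps_quadratic([3, 1, 4, 2, 0]))     # -> 1
    print(min_swaps_quadratic([4, 5, 2, 0, 1, 3]))  # -> 4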
Unfortunately, implementing this directly with a list is quadratic. How do you make it faster? Keep a finger tree that tracks the weight (number of remaining elements) and the minimum value of each subtree, and update these as you remove successive minima:
To initialize the tree: First, think of each element in the list as a one-element sublist with its minimum equal to its value. Then, while you have more than one sublist, group the subsequences in pairs, building a tree of subsequences. The length of a sequence is the sum of lengths of both its halves, and its minimum is equal to whichever minimum from both halves is lower.
To remove the minimum from a subsequence while tracking its index in the sequence:
Decrease the length of the subsequence
Remove the minimum from whichever half's minimum is equal to this subsequence minimum
The new minimum is the lower of the two halves' new minima
The index of the minimum is equal to its index in its respective half, plus the length of the left half if the minimum was in the right half.
The distance from one end is then equal to either the index or (length before removal - index - 1), whichever is lower.
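The answer above uses a finger tree; as a hedged alternative that matches the task's own hint, the same O(n log n) bound can be reached with a Fenwick (binary indexed) tree in which a 1 marks an element that is still present, so a prefix sum gives an element's current position after earlier removals (the names here are mine):

    class Fenwick:
        def __init__(self, n):
            self.t = [0] * (n + 1)
        def add(self, i, v):                   # point update at 0-indexed i
            i += 1
            while i < len(self.t):
                self.t[i] += v
                i += i & -i
        def prefix(self, i):                   # sum over positions < i
            s = 0
            while i > 0:
                s += self.t[i]
                i -= i & -i
            return s

    def min_swaps(xs):
        n = len(xs)
        present = Fenwick(n)
        for i in range(n):
            present.add(i, 1)                  # everything present initially
        remaining, total = n, 0
        for i in sorted(range(n), key=lambda i: xs[i]):   # indices by value
            if remaining <= 2:
                break
            pos = present.prefix(i)            # current index of this minimum
            total += min(pos, remaining - 1 - pos)
            present.add(i, -1)                 # remove it from the list
            remaining -= 1
        return total

    print(min_swaps([3, 1, 4, 2, 0]))          # -> 1
    print(min_swaps([4, 5, 2, 0, 1, 3]))       # -> 4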

Binary search in 2 sorted integer arrays

There is a big array which consists of 2 small integer arrays written one after another. Both small arrays are sorted ascending. We have to find an element in the big array as fast as possible. My idea was to find the end of the left array by binary search in the big array and then run two binary searches on the small arrays. The problem is that I don't know how to find that end. If you have an idea how to find the element without finding the borders of the smaller arrays, you're welcome to share it!
Information about the arrays: both small arrays have integer elements, both are sorted ascending, both can have length from 0 to any positive integer, and there can be only one copy of each element.
Here are some examples of big arrays:
1 2 3 4 5 6 7 (all the elements of the second array are bigger than the maximum of the first array)
100 1 (both arrays have only one element)
1 3 5 2 4 6 or 2 4 6 1 3 5 (most common situations)
This problem cannot be solved in guaranteed time complexity faster than O(n), and is not solvable at all for certain arrays. Binary search runs in O(log n) on a sorted array, but the big array is not guaranteed to be sorted, and in the worst case one or more comparisons per element are required, which is O(n). The best guaranteed time complexity is O(n), using the trivial algorithm: compare every item with its neighbour until you find the "turning point" with A[i] > A[i+1]. However, if you use a breadth-first search, you may get lucky and find the turning point early.
Proof that the problem is unsolvable for some arrays: let the array M = [A B] be our big array. To find the point where the arrays meet, we look for an index i where M[i] > M[i+1]. Now let A = [1 2 3] and B = [4 5]. There is no index in M for which the condition holds true, thus the problem is unsolvable for some arrays.
Informal proof of the linear lower bound: let M = [A B] with A = [1..x] and B = [(x+1)..y] both sorted. Then swap the positions of elements x and y in M. We have no way of finding the index of x without, in the worst case, checking every index, thus the problem is Ω(n).
Binary search relies on being able to eliminate half the solution space with each comparison, but in this case we cannot eliminate anything from the array, so we cannot do better than a linear search.
(From a practical standpoint, you should never do this in a program. The two arrays should be separate. If this isn't possible, append the length of either array to the bigger array.)
Edit: changed my answer after question was updated. It's possible to do it faster than linear time for some arrays, but not all possible arrays. Here's my idea for an algorithm using breadth-first search:
Start with the interval [0..n-1] where n is the length of the big array.
Make a list of intervals and put the starting interval in it.
For each interval in the list:
    if the interval has only two elements and the first is greater than the last:
        we found the turning point; return it
    else if the interval has two elements or fewer:
        remove it from the list
    else if the first element of the interval is greater than the last:
        the turning point is in this interval
        clear the list
        split this interval into two equal parts and add them to the list
    else:
        split this interval into two equal parts and replace it in the list with the two parts
I think a breadth-first approach will increase the odds of finding an interval where A[first] > A[last] early. Note that this approach will not work if the turning point is between two intervals, but it's something to get you started. I would test this myself, but unfortunately I don't have the time now.
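Since the answer was untested, here is a hedged Python sketch of the idea (find_turning_point is my name for it; to sidestep the boundary problem mentioned above, the two halves share their middle element, so every adjacent pair eventually lands inside some interval):

    from collections import deque

    def find_turning_point(m):
        """Index i with m[i] > m[i+1], searched breadth-first; None if absent."""
        queue = deque([(0, len(m) - 1)])       # inclusive index intervals
        while queue:
            first, last = queue.popleft()
            if last - first == 1 and m[first] > m[last]:
                return first                   # found: m[first] > m[first+1]
            if last - first <= 1:
                continue
            if m[first] > m[last]:
                queue.clear()                  # turning point is in here: focus
            mid = (first + last) // 2
            queue.append((first, mid))         # halves overlap at mid
            queue.append((mid, last))
        return None

    print(find_turning_point([1, 3, 5, 2, 4, 6]))   # -> 2 (5 > 2)
    print(find_turning_point([1, 2, 3, 4, 5]))      # -> None (unsolvable case)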

Algorithm to generate a 'nearly sorted' or 'k sorted' list?

I want to generate some test data to test a function that merges 'k-sorted' lists (lists where each element is at most k positions away from its correct sorted position) into a single fully sorted list. I have an approach that works, but I'm not sure how well randomized it is, and I feel there should be a simpler / more elegant way to do this. My current approach:
Generate n random elements paired with an integer index.
Sort random elements.
Set paired index for each element to its sorted position.
Work backwards through the elements, swapping each element with an element a random distance between 1 and k positions behind it in the list. Only swap with the target element if its paired index is its current index (this avoids swapping an element that is already out of place and moving it further than k positions away from where it should be).
Copy the perturbed elements out into another list.
Like I say, this works but I'm interested in alternative / better approaches.
I think you could just fill an array with random integers and then run quicksort on it with a custom stopping condition.
If, in a particular quicksort recursion, your start and end indexes are less than k apart, just return instead of recursing further.
Because of how quicksort works, every number in the start..end interval belongs somewhere in that region; the worst case is that array[start] really belongs at array[end] (or vice versa) in fully sorted order. So ensuring that start and end are no more than k apart should be sufficient.
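A hedged Python sketch of this generator (k_sorted is my name; it uses a Hoare-style partition and simply stops recursing once a subrange is at most k long):

    import random

    def k_sorted(n, k, lo=0, hi=10**9):
        """Random list of n ints in [lo, hi], quicksorted only down to width k."""
        a = [random.randint(lo, hi) for _ in range(n)]

        def qs(start, end):
            if end - start < k:                # subrange short enough: leave it
                return
            p = a[random.randint(start, end)]  # random pivot value
            i, j = start, end
            while i <= j:                      # Hoare-style partition around p
                while a[i] < p:
                    i += 1
                while a[j] > p:
                    j -= 1
                if i <= j:
                    a[i], a[j] = a[j], a[i]
                    i += 1
                    j -= 1
            qs(start, j)
            qs(i, end)

        qs(0, n - 1)
        return a

    print(k_sorted(20, 3))                     # every element <= 3 places off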
You can generate an array of random numbers and then h-sort it as in Shellsort, but skip the last sorting passes, where the gap h is less than k.
Step 1: Randomly permute disjoint segments of length k (e.g. 1 to k, k+1 to 2k, ...).
Step 2: Permute again over the shifted segments (1+t to k+t, k+1+t to 2k+t, ...), where t is a number between 1 and k (preferably k/2), swapping only in ways that don't break the k-sorted property.
Probably repeat step 2 multiple times with different t.
If I understand the problem, you want an algorithm to randomly pick a single k-sorted list of length n, uniformly selected from the universe U of all k-sorted lists of length n. (You will then run this algorithm m times to produce m lists as input test data.)
The first step is to count them: what is the size |U| of U?
The next step is to enumerate them. Create any one-to-one mapping F between the integers (1,2,...,|U|) and k-sorted lists of length n.
Then randomly select an integer x between 1 and |U| inclusive, and then apply F(x) to get the list.

IOI Qualifier INOI task 2

I can't figure out how to solve question 2 in the following link in an efficient manner:
http://www.iarcs.org.in/inoi/2012/inoi2012/inoi2012-qpaper.pdf
You can do this in O(n log n) time. (Or linear, if you really care to.) First, pad the input array out to the next power of two using some very large negative number. Now build an interval-tree-like data structure: recursively partition your array by dividing it in half. Each node in the tree represents a subarray whose length is a power of two and which begins at a position that is a multiple of its length, and each non-leaf node has a "left half" child and a "right half" child.
Compute, for each node in your tree, what happens when you add 0,1,2,3,... to that subarray and take the maximum element. Notice that this is trivial for the leaves, which represent subarrays of length 1. For internal nodes, it is simply the maximum of the left child's value and the right child's value plus length/2. So you can build this tree in linear time.
Now we want to run a sequence of n queries on this tree and print out the answers. The queries are of the form "what happens if I add k, k+1, k+2, ..., n, 1, ..., k-1 to the array and report the maximum?"
Notice that, when we add that sequence to the whole array, the break between n and 1 either occurs at the beginning/end, or smack in the middle, or somewhere in the left half, or somewhere in the right half. So, partition the array into the k,k+1,k+2,...,n part and the 1,2,...,k-1 part. If you identify all of the nodes in the tree that represent subarrays lying completely inside one of the two sequences but whose parents either don't exist or straddle the break-point, you will have O(log n) nodes. You need to look at their values, add various constants, and take the maximum. So each query takes O(log n) time.
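Here is a hedged Python sketch of this solution (all names are mine and the array is 0-indexed; best[v] stores "add 0,1,2,... to node v's subarray and take the maximum", and each query decomposes its two runs of consecutive added values into O(log n) canonical nodes, each contributing best[v] plus the value added at its first position):

    NEG = float("-inf")

    def build(a):
        size = 1
        while size < len(a):
            size *= 2
        a = a + [NEG] * (size - len(a))        # pad with a very small sentinel
        best = [NEG] * (2 * size)
        for i, x in enumerate(a):
            best[size + i] = x                 # leaf: adding just 0
        for v in range(size - 1, 0, -1):
            half = (2 * size) >> (v.bit_length() + 1)   # each child's length
            best[v] = max(best[2 * v], best[2 * v + 1] + half)
        return best, size

    def range_best(best, size, lo, hi, c):
        """max of a[i] + (c + i - lo) over lo <= i <= hi."""
        if lo > hi:
            return NEG
        res = NEG
        l, r = lo + size, hi + size + 1
        lc, rc = c, c + (hi - lo + 1)          # added values at the boundaries
        while l < r:
            if l & 1:                          # node l starts at the left edge
                res = max(res, best[l] + lc)
                lc += (2 * size) >> l.bit_length()
                l += 1
            if r & 1:                          # node r-1 ends at the right edge
                r -= 1
                rc -= (2 * size) >> r.bit_length()
                res = max(res, best[r] + rc)
            l //= 2
            r //= 2
        return res

    def query(best, size, n, k):
        """Maximum after adding k, k+1, ..., n, 1, ..., k-1 to a[0..n-1]."""
        return max(range_best(best, size, 0, n - k, k),
                   range_best(best, size, n - k + 1, n - 1, 1))

    best, size = build([2, 0])
    print(query(best, size, 2, 1))             # add 1,2 -> max(3, 2) = 3
    print(query(best, size, 2, 2))             # add 2,1 -> max(4, 1) = 4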
