Here are two integer sets, say A and B,
and we can get another set C, in which every element is the sum of an element a in A and an element b in B.
For example, A = {1,2}, B = {3,4} and we get C = {4, 5, 6} where 4=1+3, 5=1+4=2+3, 6=2+4
Now I want to find out which number is the kth largest in set C; for example, 5 is the 2nd largest in the example above.
Is there an efficient solution?
I know that sorting pairwise sums is an open problem and has an n^2 lower time bound. But since only the kth largest number is needed, maybe we can learn from the O(n) algorithm for finding the median of an unsorted array.
Thanks.
If k is very close to 1 or N, any algorithm that generates the sorted sums lazily could simply be run until the kth or N-kth item pops out.
In particular, I'm thinking of best-first search of the following space: (a,b) means the ath item from A, the first list, added to the bth from B, the second.
Keep pairs (a,b) in a best=lowest priority queue, with cost(a,b) = A[a]+B[b].
Start with just (1,1) in the priority queue, which is the minimum.
Repeat until k items popped:
pop the top (a,b)
if a<|A|, push (a+1,b)
if a=1 and b<|B|, push (a,b+1)
This gives you a saw-tooth comb connectivity and saves you from having to mark each (a,b) visited in an array. Note that cost(a+1,b)>=cost(a,b) and cost(a,b+1)>=cost(a,b) because A and B are sorted.
Here's a picture of a comb to show the successor generation rule above (you start in the upper left corner; a is the horizontal direction):
|-------
|-------
|-------
It's just best-first exploration of (up to) all |A|*|B| tuples and their sums.
Note that the maximum number of items pushed before k items have been popped is 2*k, because each item has either 1 or 2 successors. Here's a possible queue state, where items pushed into the queue are marked *:
|--*----
|-*-----
*-------
Everything above and to the left of the * frontier has already been popped.
For the N-k<k case, do the same thing but with reversed priority queue order and exploration order (or, just negate and reverse the values, get the (N-k)th least, then negate and return the answer).
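For concreteness, here is a minimal Python sketch of this best-first search for the kth smallest sum (the reversed case works as described above). It assumes A and B are sorted ascending and 1 <= k <= |A|*|B|; the function name is illustrative, not from the original answer.

import heapq

def kth_smallest_sum(A, B, k):
    # Heap entries are (cost, a, b), with 0-based indices into sorted A and B.
    heap = [(A[0] + B[0], 0, 0)]
    for _ in range(k):
        cost, a, b = heapq.heappop(heap)
        if a + 1 < len(A):                 # comb tooth: advance along A
            heapq.heappush(heap, (A[a + 1] + B[b], a + 1, b))
        if a == 0 and b + 1 < len(B):      # comb spine: advance along B only at a = 0
            heapq.heappush(heap, (A[0] + B[b + 1], 0, b + 1))
    return cost

print(kth_smallest_sum([1, 2], [3, 4], 2))  # 5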
See also: sorted list of pairwise sums on SO, or the Open problems project.
Sort arrays A & B: O(m log m + n log n)
Apply a modified form of the algorithm for merging 2 sorted arrays: O(m+n)
i.e. at each point, you sum the two elements.
When you have got the (m+n-k+1)th element in C, stop merging. That element is essentially the kth largest.
E.g.
{1,2} & {3,4} : Sorted
C:
{1+3,(1+4)|(2+3),2+4}
Well, O(n) would be a lower bound (probably not tight, though); otherwise you could run the O(n) algorithm n times to get a sorted list in O(n^2).
Can you assume the two sets are sorted (you present them in sorted order above)? If so, you could possibly get something with an average case that's decently better by doing an "early out", starting at the last pair of elements, etc. Just a hunch though.
Related
Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exist two elements in the sequence, x and y, such that |x-y| <= (M-m)/n.
Now, is there a way to find an algorithm that finds such 2 elements in time complexity of O(n)?
I thought about sorting the sequence, but since I don't know anything about M, I cannot use radix/bucket sort or any other linear-time algorithm that I'm familiar with.
I'd appreciate any ideas.
Thanks in advance.
First find out n, M, m. If not already given they can be determined in O(n).
Then create a memory storage of n+1 elements; we will use the storage for n+1 buckets with width w=(M-m)/n.
The buckets cover the range of values equally: Bucket 1 goes from [m; m+w[, Bucket 2 from [m+w; m+2*w[, Bucket n from [m+(n-1)*w; m+n*w[ = [M-w; M[, and the (n+1)th bucket from [M; M+w[.
Now we go once through all the values and sort them into the buckets according to the assigned intervals. There should be at most 1 element per bucket. If a bucket is already filled, it means that the two elements are closer together than the width of the half-open interval, i.e. we have found elements x, y with |x-y| < w = (M-m)/n.
If no two such elements are found, then afterwards n of the n+1 buckets are filled with exactly one element each, and those elements are in sorted order across the buckets.
We go through all the buckets once more and compare the distances between the contents of neighbouring buckets only, checking whether there are two elements which fulfil the condition.
Due to the width of the buckets, the condition cannot hold for buckets which are not adjoining: for those the distance is always |x-y| > w.
(The fulfilment of the last inequality in step 4 is also the reason why the interval is half-open and cannot be closed, and why we need n+1 buckets instead of n. An alternative would be to use n buckets and make the last bucket a special case with [M; M+w]. But O(n+1) = O(n), and using n+1 buckets is preferable to special-casing the last one.)
The running time is O(n) for step 1, 0 for step 2 (we do not actually do anything there), O(n) for step 3, and O(n) for step 4, as there is at most 1 element per bucket. Altogether O(n).
This task shows that either sorting of elements which are not close together, or coarse sorting which ignores fine distances, can be done in O(n) instead of O(n*log(n)). It has useful applications: numbers on computers are discrete and have finite precision, and I have successfully used this sorting method for signal processing / fast sorting in real-time production code.
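For concreteness, here is a minimal Python sketch of steps 1-4 under the OP's stronger assumption; the function name and the clamping against float rounding are mine, not part of the original answer.

def close_pair(seq):
    n = len(seq)
    m, M = min(seq), max(seq)            # step 1: O(n)
    if m == M:
        return seq[0], seq[1]            # all values equal, distance 0
    w = (M - m) / n                      # bucket width
    buckets = [None] * (n + 1)           # step 2: n+1 buckets [m+i*w; m+(i+1)*w[
    for x in seq:                        # step 3: distribute the values
        i = min(int((x - m) / w), n)     # clamp against float rounding at x = M
        if buckets[i] is not None:
            return buckets[i], x         # two values in one bucket: |x-y| < w
        buckets[i] = x
    prev = None                          # step 4: compare neighbouring buckets
    for x in buckets:
        if x is None:
            continue
        if prev is not None and abs(x - prev) <= w:
            return prev, x
        prev = x
    return None                          # the stronger condition did not hold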
About @Damien's remark: the real threshold of (M-m)/(n-1) is provably true for every such sequence. In the answer so far I assumed that the sequence we are looking at is of a special kind where the stronger condition holds, or at least that, for all sequences where the stronger condition holds, we would find such elements in O(n).
If this was instead a small mistake by the OP (who said they had proven the stronger condition), and we should find two elements x, y with |x-y| <= (M-m)/(n-1) instead, we can simplify:
1.-3. We would do steps 1 to 3 as above, but with n buckets and the bucket width set to w = (M-m)/(n-1). Bucket n now goes from [M; M+w[.
For step 4 we would do the following alternative:
4. (alternative): n buckets are filled with one element each. The element in bucket n has to be M, and it sits at the left boundary of its bucket interval. For every possible element x in the (n-1)th bucket, the distance to y = M is |M-x| <= w = (M-m)/(n-1), so we have found x and y which fulfil the condition, q.e.d.
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M of the elements, in O(n).
You calculate the value mid = (m + M)/2.
You concentrate the values less than mid at the beginning of the array, and those greater than mid at the end.
You select the part with the larger number of elements, and you iterate until very few numbers are left.
If both parts have the same number of elements, you can select either of them. If the remaining part has many more than n/2 elements, then in order to maintain O(n) complexity you can keep only n/2 + 1 of them, as the goal is not to find the smallest difference, but merely one difference that is small enough.
As indicated in a comment by @btilly, this solution can fail in some cases, for example with the input [0, 2.1, 2.9, 5]. To handle that, one needs to calculate the max value of the left part and the min value of the right part, and to test whether the answer is right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct result is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.
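A sketch of this in Python, using clamping instead of putting boundary values into both buckets (it assumes n >= 2; by the pigeonhole argument above, the loop always returns):

def close_pair_pigeonhole(seq):
    n = len(seq)
    m, M = min(seq), max(seq)
    if m == M:
        return seq[0], seq[1]
    w = (M - m) / (n - 1)                # corrected threshold / bucket size
    buckets = [None] * (n - 1)
    for x in seq:
        i = min(int((x - m) / w), n - 2) # clamp M into the last bucket
        if buckets[i] is not None:
            return buckets[i], x         # same bucket of width w: |x-y| <= w
        buckets[i] = x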
Let's say that you are given n sorted arrays of numbers, and you need to pick one number from each array such that the minimum distance between the n chosen numbers is maximized.
Example:
arrays:
[0, 500]
[100, 350]
[200]
2<=n<=10 and every array could have ~10^3-10^4 elements.
In this example the optimal solution to maximize minimum distance is pick numbers: 500, 350, 200 or 0, 200, 350 where min distance is 150 and is the maximum possible of every combination.
I am looking for an algorithm to solve this. I know that I could binary-search the max-min distance, but I can't see how to decide whether there is a solution with min distance of at least d, which the binary search needs. I am thinking maybe dynamic programming could help, but I haven't managed to find a solution with DP.
Of course generating all combinations of n elements is not efficient. I have already tried backtracking, but it is slow since it tries every combination.
n ≤ 10 suggests that we can afford an exponential dependence on n. Here's an O(2^n · m · n)-time algorithm, where m is the total size of the arrays.
The dynamic programming approach I have in mind is, for each subset of
arrays, calculate all of the pairs (maximum number, minimum distance) on
the efficient frontier, where we have to choose one number from each of
the arrays in the subset. By efficient frontier I mean that if we have
two pairs (a, b) ≠ (c, d) with a ≤ c and b ≥ d, then (c, d) is not on
the efficient frontier. We'll want to keep these frontiers sorted for
fast merges.
The base case with the empty subset is easy: there's one pair, (minimum
distance = ∞, maximum number = −∞).
For every nonempty subset of arrays in some order that extends the
inclusion order, we compute a frontier for each array in the subset,
representing the subset of solutions where that array contributes the
maximum number. Then we merge these frontiers. (Naively this costs us
another factor of log n, which maybe isn't worth the hassle to avoid
given that n ≤ 10, but we can avoid it by merging the arrays once at the
beginning to enable future merges to use bucketing.)
To construct a new frontier from a subset of arrays and another array
also involves a merge. We initialize an iterator at the start of the
frontier (i.e., least maximum number) and an iterator at the start of
the array (i.e., least number). While neither iterator is past the end,
Emit a candidate pair (min(minimum distance, array number − maximum
number), array number).
If the min was less than or equal to minimum distance, increment the
frontier iterator. If the min was less than or equal to array number
− maximum number, increment the array iterator.
Cull the candidate pairs to leave only the efficient frontier. There is
an elegant way to do this in code that is more trouble to explain.
I am going to give an algorithm that for a given distance d, will output whether it is possible to make a selection where the distance between any pair of chosen numbers is at least d. Then, you can binary-search the maximum d for which the algorithm outputs "YES", in order to find the answer to your problem.
Assume the minimum distance d is given. Here is the algorithm:
from itertools import permutations
from bisect import bisect_left

def feasible(arrays, d):
    for p in permutations(range(len(arrays))):
        last = float('-inf')
        ok = True
        for i in p:
            # smallest element >= last + d in the i-th array, via binary search
            j = bisect_left(arrays[i], last + d)
            if j == len(arrays[i]):
                ok = False          # no such element; this ordering fails
                break
            last = arrays[i][j]
        if ok:
            return True             # "YES"
    return False                    # "NO"
So, we brute-force the order of arrays. Then, for every possible order, we use a greedy method to choose elements from each array, following the order. For example, take the example you gave:
arrays:
[0, 500]
[100, 350]
[200]
and assume d = 150. For the permutation 1 3 2, we first take 0 from the 1st array, then we find the smallest element in the 3rd array that is greater than or equal to 0+150 (it is 200), then we find the smallest element in the 2nd array which is greater than or equal to 200+150 (it is 350). Since we could find an element from every array, the algorithm outputs "YES". But for d = 200 for instance, the algorithm would output "NO" because none of the possible orderings would result in a successful selection.
The complexity of the above algorithm is O(n! * n * log(m)), where m is the maximum number of elements in an array. I believe it would be sufficient, since n is very small. (For m = 10^4, 10! * 10 * 13 ≈ 5*10^8, which can be computed in under a second on a modern CPU.)
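A hedged driver for the feasibility test above, assuming integer inputs so that d can be binary-searched over [0, max - min]; the name max_min_distance is mine.

def max_min_distance(arrays):
    lo = 0
    hi = max(max(a) for a in arrays) - min(min(a) for a in arrays)
    while lo < hi:
        mid = (lo + hi + 1) // 2    # try a larger d first
        if feasible(arrays, mid):
            lo = mid                # d = mid is achievable
        else:
            hi = mid - 1
    return lo

print(max_min_distance([[0, 500], [100, 350], [200]]))  # 150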
Let's look at an example with optimal choices marked x (horizontal arrays A, B, C, D):
A x
B b x b
C x c
D d x
Our recurrence based on range could be: let f(low, excluded) represent the maximum closest distance between two chosen elements (from arrays 1 to n) of the subset without elements in excluded, where low is the lowest chosen element. Then:
(1)
f(low, excluded) when |excluded| = n-1:
max(low)
for low in the only permitted array
(2)
f(low, excluded):
max(
min(
a - low,
f(a, excluded')
)
)
for a ≥ low, a not in excluded'
where excluded' = excluded ∪ {low's array}
We can limit a. For one thing the maximum we can achieve is
(3)
m = (highest - low) / (n - |excluded| - 1)
which means a need not go higher than low + m.
Secondly, we can store results for all f(a, excluded'), keyed by excluded' (we have 2^10 possible keys), each in a decorated binary tree ordered by a. The decoration will be the highest result achievable in the right subtree, meaning we can find the max for all f(v, excluded'), v ≥ a in logarithmic time.
The latter establishes a dominance relationship, and clearly we are interested in both a larger a and a larger f(a, excluded') so as to maximise the min function in (2). Picking an a in the middle, we can use a binary search. If we have:
a - low < max(v, excluded'), v ≥ a
where max(v, excluded') is the lookup
for a in the decorated tree
then we look to the right, since max(v, excluded') indicates there's a better answer on the right, where a - low is also larger.
And if we have:
a - low ≥ max(v, excluded'), v ≥ a
then we record this candidate and look to the left, since to the right the answer is fixed at max(v, excluded'), given that a - low cannot decrease.
In order to conduct the binary search on the range, [low, low + m] (see (3)), rather than merge and label all the arrays at the outset, we can keep them separate and compare the closest candidates to mid out of each array we are currently permitted to choose a from. (The trees have the mixed results, keyed by subset.) (The flow of this part is not completely clear to me.)
Worst case with this method, given that n = C is constant seems to be
O(C * array_length * 2^C * C * log(array_length) * log(C * array_length))
C * array_length is the iteration on low
Each low can be paired with 2^C inclusions
C * log(array_length) is the separated binary-search
And log(C * array_length) is the tree lookup
Simplifying:
= O(array_length * log^2(array_length))
although in practice, there could be many dead-end branches that exit early where a full selection wouldn't be possible.
In case it wasn't clear, the iteration is over a fixed lowest element in the selection. In other words, we want the best f(low, excluded) for all different lows (and excludeds). For bottom-up, we would iterate from the highest value down, so our results for a get stored as we iterate.
The problem is this: given an array of length say N, we have to find all windows (contiguous slices) of length W such that those W elements, when sorted, form an arithmetic progression with common difference 1. So for an array like [1,4,6,3,5,2,7,9], and W as 5, the slice [4,6,3,5,2] is one such window, since, when sorted, it yields [2,3,4,5,6], an A.P. with common difference 1.
The immediate solution that comes to mind is a sliding window: for each new element, pop the old one, push the new one, sort the window, and if, for that window, window[w-1] - window[0] + 1 = w, then it is such a window. However, that takes O(N log N) time, whereas the solution at Codechef proposes an O(N) algorithm that uses a double-ended queue. I am having difficulty understanding the algorithm: what is being pushed and popped, and why, and how it maintains the window in sorted order without re-sorting on each new element. Can anybody explain it?
You are correct in observing that a segment is valid if max(segment) - min(segment) + 1 = W. So, the problem reduces to finding the min and max of all length W segments in O(N).
For this, we can use a deque D. Suppose we want to find the min. We will store the indexes of elements in D, assuming 0-based indexing. Let A be the original array.
for i = 0 to N - 1:
    if D.first() == i - W:
        D.popFirst()   # this element is too old, so we no longer care about it
    while not D.empty() and A[D.last()] >= A[i]:
        D.popLast()
    D.pushBack(i)
For each i, this will give you the minimum in [i - W + 1, i] as the element at index D.first().
popFirst() removes the first element from D. We have to do this when the first element in D is more than W steps away from i, because it will not contribute to the minimum in the interval above.
popLast() removes the last element from D. We do this to maintain the sorted order: if the last element in D is the index of an element larger than A[i], then adding i at the end of D would break the order. So we have to keep removing the last element to ensure that D stays sorted.
pushBack() adds an element at the end of D. After adding it, D will definitely remain sorted.
This is amortized O(1) per element (the pseudocode above is O(N) overall) because each element is pushed to and popped from D at most once.
This works because D will always be a sliding window of indexes sorted by their associated value in A. When we are at an element that would break this order, we can pop elements from D (the sliding window) until the order is restored. Since the new element is smaller than those we are popping, there is no way those can contribute to a solution.
Note that you can implement this even without the methods I used by keeping two pointers associated with D: start and end. Then make D an array of length N and you are done.
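Here is a runnable Python version of the sketch above, returning the minimum of every length-W window (assuming 1 <= W <= len(A)). The same loop with the comparison flipped gives the maxima, and a window is then valid exactly when max - min + 1 == W.

from collections import deque

def window_mins(A, W):
    D = deque()                  # indexes; A[D[0]] <= ... <= A[D[-1]]
    mins = []
    for i in range(len(A)):
        if D and D[0] == i - W:
            D.popleft()          # too old for the current window
        while D and A[D[-1]] >= A[i]:
            D.pop()              # keep D sorted by associated value
        D.append(i)
        if i >= W - 1:
            mins.append(A[D[0]]) # min of A[i-W+1 .. i]
    return mins

print(window_mins([1, 4, 6, 3, 5, 2, 7, 9], 5))  # [1, 2, 2, 2]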
I'm working on a sorting/ranking algorithm that works with quite large number of items and I need to implement the following algorithm in an efficient way to make it work:
There are two lists of numbers. They are equally long, about 100-500 thousand items. From these I need to find the n-th biggest product between the lists, i.e. if you create a matrix with one list across the top and the other down the side, each cell being the product of the number above and the number on the side.
Example: The lists are A=[1, 3, 4] and B=[2, 2, 5]. Then the products are [2, 2, 5, 6, 6, 15, 8, 8, 20]. If I wanted the 3rd biggest from that it would be 8.
The naive solution would be to simply generate those numbers, sort them and then select the n-th biggest. But that is O(m^2 * log m^2) where m is the number of elements in the small lists, and that is just not fast enough.
I think what I need is to first sort the two small lists. That is O(m log m). Then I know for sure that the biggest product is A[0]*B[0]. The second biggest is either A[0]*B[1] or A[1]*B[0], ...
I feel like this could be done in O(f(n)) steps, independent of the size of the matrix. But I can't figure out an efficient way to do this part.
Edit: There was an answer that got deleted, which suggested to remember position in the two sorted sets and then look at A[a]*B[b+1] and A[a+1]*B[b], returning the bigger one and incrementing a/b. I was going to post this comment before it got deleted:
This won't work. Imagine two lists A=B=[3,2,1]. This will give you a
matrix like [9,6,3 ; 6,4,2 ; 3,2,1]. So you start at (0,0)=9, go to
(0,1)=6, and then the choice is (0,2)=3 or (1,1)=4. However, this will
miss (1,0)=6, which is bigger than both. So you can't just look at the
two neighbors; you have to backtrack.
I think it can be done in O(n log n + n log m). Here's a sketch of my algorithm, which I think will work. It's a little rough.
Sort A descending. (takes O(m log m))
Sort B descending. (takes O(m log m))
Let s be min(m, n). (takes O(1))
Create s lazy sequence iterators L[0] through L[s-1]. L[i] will iterate through the s values A[i]*B[0], A[i]*B[1], ..., A[i]*B[s-1]. (takes O(s))
Put the iterators in a priority queue q. The iterators will be prioritized according to their current value. (takes O(s) because initially they are already in order)
Pull n values from q. The last value pulled will be the desired result. When an iterator is pulled, it is re-inserted in q using its next value as the new priority. If the iterator has been exhausted, do not re-insert it. (takes O(n log s))
In all, this algorithm will take O(m log m + (s + n)log s), but s is equal to either m or n.
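Here is a minimal Python sketch of this algorithm using heapq as a max-heap via negated values; it assumes the numbers are non-negative (so sorting descending makes each row's products non-increasing) and that 1 <= n <= len(A)*len(B). The function name is mine.

import heapq

def nth_biggest_product(A, B, n):
    A = sorted(A, reverse=True)
    B = sorted(B, reverse=True)
    s = min(len(A), n)
    # One lazy iterator per row i, represented as (-A[i]*B[j], i, j).
    heap = [(-A[i] * B[0], i, 0) for i in range(s)]
    heapq.heapify(heap)
    for _ in range(n):
        neg, i, j = heapq.heappop(heap)
        if j + 1 < min(len(B), n):
            heapq.heappush(heap, (-A[i] * B[j + 1], i, j + 1))
    return -neg

print(nth_biggest_product([1, 3, 4], [2, 2, 5], 3))  # 8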
I don't think there is an algorithm of O(f(n)), which is independent of m.
But there is a relatively fast O(n log m) algorithm:
At first, we sort the two arrays descending, so we get A[0] > A[1] > ... > A[m-1] and B[0] > B[1] > ... > B[m-1]. (This is O(m log m), of course.)
Then we build a max-heap whose elements are A[0]*B[0], A[0]*B[1], ..., A[0]*B[m-1]. And we maintain a "pointer array" P[0], P[1], ..., P[m-1]. P[i] = x means that B[i]*A[x] is currently in the heap. All the P[i] are zero initially.
In each iteration, we pop the max element from the heap, which is the next largest product. Assuming it comes from B[i]*A[P[i]] (we can record which B[i] each heap element comes from), we then move the corresponding pointer forward: P[i] += 1, and push the new B[i]*A[P[i]] into the heap. (If P[i] moves out of range (>= m), we simply push a -inf into the heap.)
After the n-th iteration, we get the n-th largest product.
There are n iterations, and each one is O(logm).
Edit: added some details.
You don't need to sort the 500,000 elements to get the top 3.
Just take the first 3, put them in a SortedList, and iterate over the list, replacing the smallest of the 3 elements with the new value if that is higher, and re-sorting the resulting list.
Do this for both lists, and you'll end up with a 3*3 matrix, from which it should be easy to take the 3rd value.
Here is an implementation in scala.
If we assume n is smaller than m, and A=[1, 3, 4] and B=[2, 2, 5], n=2:
You would take (3, 4) => sort them (4,3)
Then take (2,5) => sort them (5, 2)
You could now do a zipped search. Of course the biggest product now is (5, 4). But the next one is either 4*2 or 5*3. For longer lists, you could keep in mind what the result of 4*2 was and compare it only with the next product taken the other way. That way you would only calculate one extra product.
Specifically in the domain of one-dimensional sets of items of the same type, such as a vector of integers.
Say, for example, you had a vector of size 32,768 containing the sorted integers 0 through 32,767.
What I mean by "next permutation" is performing the next permutation in a lexical ordering system.
Wikipedia lists two, and I'm wondering if there are any more (besides something bogo :P)
O(N) implementation
This is based on Eyal Schneider's mapping Zn! -> P(n)
from math import factorial as f

def get_permutation(k, lst):
    # Map an integer k in [0, N!) to a permutation of lst, Fisher-Yates style,
    # with the swap index derived from k instead of a random number.
    N = len(lst)
    while N:
        next_item = k // f(N - 1)
        lst[N - 1], lst[next_item] = lst[next_item], lst[N - 1]
        k -= next_item * f(N - 1)
        N -= 1
    return lst
It reduces his O(N^2) algorithm by integrating the conversion step with finding the permutation. It essentially has the same form as Fisher-Yates, but replaces the call to random with the next step of the mapping. If the mapping is in fact a bijection (which I'm working to prove), then this is a better algorithm than Fisher-Yates because it only calls out to the pseudo-random number generator once, and so will be more efficient. Note also that this returns the action of permutation (N! - k) rather than permutation k, but that's of little consequence, because if k is uniform on [0, N!], then so is N! - k.
old answer
This is slightly related to the idea of "next" permutation. If the items can be well ordered, then one can construct lexicographical ordering on the permutations. This allows you to construct a map from the integers into the space of permutations.
Then finding a random permutation is equivalent to choosing a random integer between 0 and N! and constructing the corresponding permutation. This algorithm will be as efficient as (and as difficult to implement) as calculating the n'th permutation of the set in question. This trivially gives a uniform choice of permutation if our choice of n is uniform.
A little more detail about ordering the permutations. Given a set S = {a, b, c, d}, mathematicians view the set of permutations of S as a group under composition. If p is one permutation, let's say (b a c d), then p operates on S by taking b to a, a to c, c to d, and d to b. If q is another permutation, let's say (d b c a), then pq is obtained by first applying q and then p, which gives (d a b)(c). For example, q takes d to b and p takes b to a, so pq takes d to a. You'll see that pq has two cycles, because it takes b to d and fixes c. It's customary to omit 1-cycles, but I left it in for clarity.
We're going to use some facts from group theory.
Disjoint cycles commute: (a b)(c d) is the same as (c d)(a b).
We can arrange the elements of a cycle in any cyclic order: (a b c) = (b c a) = (c a b).
So given a permutation, order the cycles so that the largest cycles come first. When two cycles are the same length, arrange their items so that the largest item (we can always order a denumerable set, even if arbitrarily) comes first. Then we have a lexicographical ordering, first on the length of the cycles, then on their contents. This is a well-ordering because two permutations that consist of the same cycles must be the same permutation, so if p ≥ q and q ≥ p then p = q.
This algorithm can be trivially executed in O(N! log N! + N!) time: just construct all the permutations (EDIT: just to be clear, I had my mathematician hat on when I proposed this, and it was tongue in cheek anyway), quicksort them, and find the n'th. It is a different algorithm than the two you mention, though.
Here is an idea on how to improve aaronasterling's answer. It avoids generating all N! permutations and sorting them according to their lexicographic order, and therefore has a much better time complexity.
Internally it uses an unusual permutation representation, that simulates a selection & removal process from a shrinking array. For example, the sequence <0,1,0> represents a permutation resulting from removing item #0 from [0,1,2], then removing item #1 from [1,2], and then removing item #0 from [1]. The resulting permutation is <0,2,1>. With this representation, the first permutation will always be <0,0,...0>, and the last one will always be <N-1,N-2,...0>. I will call this special representation the "array representation".
Clearly, an array representation of size N can be converted to a standard permutation representation in O(N^2) time, by using an array and shrinking it when necessary.
The following function can be used to return the Kth permutation on {0,1,2...,N-1}, in the array representation:
getPermutation(k, N) {
    while (N > 0) {
        nextItem = floor(k / (N-1)!)
        output nextItem
        k = k - nextItem * (N-1)!
        N = N - 1
    }
}
This algorithm works in O(N^2) time (due to the representation conversion), instead of O(N! log N) time.
--Example--
getPermutation(4,3) returns <2,0,0>. This array representation corresponds to <C,A,B>, which is really the permutation at index 4 in the ordered list of permutations on {A,B,C}:
ABC
ACB
BAC
BCA
CAB
CBA
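As a runnable illustration, here is a Python sketch that combines getPermutation with the representation conversion, by popping from a shrinking list; the name kth_permutation is mine, not from the original answer.

from math import factorial

def kth_permutation(k, items):
    items = list(items)          # the shrinking array
    out = []
    N = len(items)
    while N > 0:
        next_item = k // factorial(N - 1)
        out.append(items.pop(next_item))   # select & remove
        k -= next_item * factorial(N - 1)
        N -= 1
    return out

print(kth_permutation(4, "ABC"))  # ['C', 'A', 'B']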
You can adapt merge sort such that it will shuffle the input randomly instead of sorting it.
In particular, when merging two lists, you choose the new head element at random instead of choosing it to be the smallest head element. The probability of choosing the element from the first list must be n/(n+m) where n is the length of the first and m the length of the second list for this to work.
I've written a detailed explanation here: Random Permutations and Sorting.
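A small Python sketch of this randomized merge, recursive for clarity; the head-selection probability (remaining in first list) / (total remaining) is the essential point.

import random

def merge_shuffle(lst):
    if len(lst) <= 1:
        return list(lst)
    mid = len(lst) // 2
    a = merge_shuffle(lst[:mid])
    b = merge_shuffle(lst[mid:])
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        # Take from a with probability (remaining in a) / (total remaining).
        if random.randrange(len(a) - i + len(b) - j) < len(a) - i:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

print(merge_shuffle(list(range(10))))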
Another possibility is to build an LFSR or PRNG with a period equal to the number of items you want.
Start with a sorted array. Pick 2 random indexes, switch the elements at those indexes. Repeat O(n lg n) times.
You need to repeat O(n lg n) times to ensure that the distribution approaches uniform. (You need to make sure that each index is picked at least once, which is a balls-in-bins problem.)
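A direct transcription in Python, as a sketch only; as the answer notes, the distribution only approaches uniform, unlike Fisher-Yates.

import math
import random

def swap_shuffle(lst):
    n = len(lst)
    if n < 2:
        return lst
    for _ in range(int(n * math.log2(n)) + 1):   # O(n lg n) random transpositions
        i, j = random.randrange(n), random.randrange(n)
        lst[i], lst[j] = lst[j], lst[i]
    return lst

print(swap_shuffle(list(range(10))))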