Number of reshuffles used to sort - algorithm

I found an interesting problem recently, which looks like this:
There is a dull sorting algorithm which looks at the 1st number of the array, finds the element that is lower by 1 than it (or the highest element when there is no lower one), and puts that element at the front. The cost of putting the element with index x (counting from 0) at the front is equal to x. The algorithm continues this process until the array is sorted. The task is to count the total cost of sorting all n! permutations of the numbers from 1 to n. The answer might be big, so it should be given modulo m (n and m are given in the input).
Example:
Input (n,m): 3 40
Answer: 15
There are 6 permutations of the numbers from 1 to 3. The costs of sorting them are:
(1,2,3)->0
(1,3,2)->5
(2,1,3)->1
(2,3,1)->2
(3,1,2)->4
(3,2,1)->3
sum = 15
For example, sorting (1,3,2): the first element is 1 and there is no 0, so the highest element, 3 (index 1), is moved to the front at cost 1, giving (3,1,2); then 2 (index 2) is moved at cost 2, giving (2,3,1); then 1 (index 2) is moved at cost 2, giving (1,2,3); total cost 1+2+2 = 5.
My program generates all the possible arrays and sorts them one by one. Its complexity is O(n!*n^2), which is way too high. I am stuck at this brute-force solution and out of ideas.
There are also some funny things I have discovered:
When I group the sorting costs of the permutations by how many permutations have each cost, I get a strange figure:
n=7
https://cdn.discordapp.com/attachments/503942261051228160/905511062546292807/unknown.png
(Each row shows two numbers x * y, preceded by y spaces: x is a sorting cost and y is the number of permutations of 1..n whose sorting cost equals x.) The score is the sum of x*y over all rows. I can provide this figure for other n's if needed.
It might be related to Fibonacci numbers; just a feeling.
As you can see, the figure from point 1 has a segment in which all the y's are the same. It starts at x = n^2 - 4n + 3.
Question: How can I solve this problem more efficiently?

The sorting algorithm has two phases: it first sorts the permutation
into some rotation of the identity and then rotates it to the identity.
We account for the cost of these phases separately.
The first phase consists of at most n−2 moves. After n−1−j moves, the
permutation consists of n−j values x, x+1, x+2, … mod n followed by a
permutation of the remaining j values that, assuming that we start from
a random permutation, are equally likely to be in any particular order.
The expected distance that we have to move x−1 mod n is ((n−j)+(n−1))/2.
But hang on, we only count the move if we’re still in the first phase.
Thus we need to discount the cases where the permutation is already a
rotation. There are n!/j! of them, and they all have x−1 at the end, so
the discount for each is n−1.
The second phase consists on average of (n−1)/2 moves from the end of
the permutation to the beginning, each costing n−1. The average cost
over all n! permutations is thus (n−1)²/2.
I’ll leave the modular arithmetic/strength reduction of the Python below
as an exercise.
from itertools import permutations
from math import factorial


# Returns the total cost of sorting all n-element permutations.
def fast_total_sort_cost(n):
    cost = 0
    for j in range(n - 1, 0, -1):
        cost += factorial(n) * ((n - j) + (n - 1)) // 2
        cost -= (factorial(n) // factorial(j)) * (n - 1)
    return cost + factorial(n) * (n - 1) ** 2 // 2


# Reference implementation and test.
def reference_total_sort_cost(n):
    return sum(sort_cost(perm) for perm in permutations(range(n)))


def sort_cost(perm):
    cost = 0
    perm = list(perm)
    while not is_sorted(perm):
        i = perm.index((perm[0] - 1) % len(perm))
        cost += i
        perm.insert(0, perm.pop(i))
    return cost


def is_sorted(perm):
    return all(perm[i - 1] <= perm[i] for i in range(1, len(perm)))


if __name__ == "__main__":
    for m in range(1, 9):
        print(fast_total_sort_cost(m), reference_total_sort_cost(m))

From the sorted permutation 123...n, you can build a tree using the reverse of the rule, and get all the permutations. See this tree for n=4.
Now, observe that
if node==1234 then cost(node) = 0
if node!=1234 then cost(node) = blue_label(node) + cost(parent)
What you need is to formulate the reverse rule to generate the tree.
Maybe use some memoization technique to avoid recomputing the cost every time.
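Even without formulating the reverse rule, the same parent/child relation can be exploited: one forward step of the shuffle rule links each permutation to its parent, so with memoization every one of the n! costs is computed only once, roughly O(n!*n) work instead of O(n!*n^2). A minimal sketch (the function names are mine, not from the problem statement):

from functools import lru_cache
from itertools import permutations

def total_cost(n, m):
    identity = tuple(range(1, n + 1))

    @lru_cache(maxsize=None)
    def cost(perm):
        if perm == identity:
            return 0
        first = perm[0]
        target = first - 1 if first > 1 else n        # element moved to the front
        i = perm.index(target)                        # the cost of this step is its index
        parent = (target,) + perm[:i] + perm[i + 1:]  # permutation after the step
        return i + cost(parent)

    return sum(cost(p) for p in permutations(range(1, n + 1))) % m

print(total_cost(3, 40))  # 15, matching the example in the question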

Related

Algorithm to generate permutations by order of fewest positional changes

I'm looking for an algorithm to generate or iterate through all permutations of a list of objects such that:
They are generated in order from fewest to most positional changes from the original. So first all the permutations with a single pair of elements swapped, then all the permutations with only two pairs of elements swapped, etc.
The list generated is complete, so for n objects in a list there should be n! total, unique permutations.
Ideally (but not necessarily) there should be a way of specifying (and generating) a particular permutation without having to generate the full list first and then reference the index.
The speed of the algorithm is not particularly important.
I've looked through all the permutation algorithms that I can find, and none so far have met criteria 1 and 2, let alone 3.
I have an idea how I could write this algorithm myself using recursion, and filtering for duplicates to only get unique permutations. However, if there is any existing algorithm I'd much rather use something proven.
This code answers your requirement #3, which is to compute permutation at index N directly.
This code relies on the following principle:
The first permutation is the identity; then the next (n choose 2) permutations just swap two elements; then the next (n choose 3)(subfactorial(3)) permutations derange 3 elements; then the next (n choose 4)(subfactorial(4)) permutations derange 4 elements; etc. To find the Nth permutation, first figure out how many elements it deranges by finding the largest K such that sum_{k=0..K} (n choose k)·subfactorial(k) ≤ N.
This number K is found by function number_of_derangements_for_permutation_at_index in the code.
Then, the relevant subset of indices which must be deranged is computed efficiently using more_itertools.nth_combination.
However, I didn't have a function nth_derangement to find the relevant derangement of the deranged subset of indices. Hence the last step of the algorithm, which computes this derangement, could be optimised if there exists an efficient function to find the nth derangement of a sequence efficiently.
As a result, this last step takes time proportional to idx_r, where idx_r is the index of the derangement, a number between 0 and factorial(k), where k is the number of elements which are deranged by the returned permutation.
from sympy import subfactorial
from math import comb
from itertools import count, accumulate, pairwise, permutations
from more_itertools import nth_combination, nth


def number_of_derangements_for_permutation_at_index(n, idx):
    # n = len(seq)
    for k, (low_acc, high_acc) in enumerate(
            pairwise(accumulate((comb(n, k) * subfactorial(k) for k in count(2)), initial=1)),
            start=2):
        if low_acc <= idx < high_acc:
            return k, low_acc


def is_derangement(seq, perm):
    return all(i != j for i, j in zip(seq, perm))


def lift_permutation(seq, deranged, permutation):
    result = list(seq)
    for i, j in zip(deranged, permutation):
        result[i] = seq[j]
    return result


# THIS FUNCTION NOT EFFICIENT
def nth_derangement(seq, idx):
    return nth((p for p in permutations(seq) if is_derangement(seq, p)),
               idx)


def nth_permutation(seq, idx):
    if idx == 0:
        return list(seq)
    n = len(seq)
    k, acc = number_of_derangements_for_permutation_at_index(n, idx)
    idx_q, idx_r = divmod(idx - acc, subfactorial(k))
    deranged = nth_combination(range(n), k, idx_q)
    derangement = nth_derangement(deranged, idx_r)  # TODO: FIND EFFICIENT VERSION
    return lift_permutation(seq, deranged, derangement)
Testing for correctness on small data:
print( [''.join(nth_permutation('abcd', i)) for i in range(24)] )
# ['abcd',
# 'bacd', 'cbad', 'dbca', 'acbd', 'adcb', 'abdc',
# 'bcad', 'cabd', 'bdca', 'dacb', 'cbda', 'dbac', 'acdb', 'adbc',
# 'badc', 'bcda', 'bdac', 'cadb', 'cdab', 'cdba', 'dabc', 'dcab', 'dcba']
Testing for speed on medium data:
from math import factorial
seq = 'abcdefghij'
n = len(seq) # 10
N = factorial(n) // 2 # 1814400
perm = ''.join(nth_permutation(seq, N))
print(perm)
# fcjdibaehg
Imagine a graph with n! nodes labeled with every permutation of n elements. If we add edges to this graph such that nodes which can be obtained by swapping one pair of elements are connected, an answer to your problem is obtained by doing a breadth-first search from whatever node you like.
You can actually generate the graph or just let it be implied and just deduce at each stage what nodes should be adjacent (and of course, keep track of ones you've already visited, to avoid revisiting them).
I concede this probably doesn't help with point 3, but maybe is a viable strategy for getting points 1 and 2 answered.
To solve 1 & 2, you could first generate all possible permutations, keeping track of how many swaps occurred during generation for each list. Then sort them by number of swaps. Which I think is O(n! + nlgn) = O(n!)
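For what it's worth, a brute-force version of that last idea is short to write down: generate every permutation and sort them by the number of positions at which they differ from the original. This satisfies requirements 1 and 2 (not 3); the function name below is my own:

from itertools import permutations

def permutations_by_displacement(seq):
    seq = tuple(seq)

    def displaced(perm):
        # Number of positions where perm differs from the original sequence.
        return sum(a != b for a, b in zip(seq, perm))

    return sorted(permutations(seq), key=displaced)

for p in permutations_by_displacement('abc'):
    print(''.join(p))
# abc first, then the three single swaps (acb, bac, cba), then the two 3-cycles (bca, cab)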

Given N people, some of which are enemies, find number of intervals with no enemies

A friend gave me this problem as a challenge, and I've tried to find a problem like this on LeetCode, but sadly could not.
Question
Given a line of people numbered from 1 to N, and a list of pairs of M enemies, find the total number of sublines with people that contain no two people that are enemies.
Example: N = 5, enemies = [[3,4], [3,5]]
Answer: 9
Explanation: These continuous subintervals are:
[1,1], [2,2], [3,3], [1,2], [2,3], [1,3], [4,4], [4,5], [5,5]
My approach
We define a non-conflicting interval as a contiguous interval from (and including) [a,b] where no two people are enemies in that interval.
Working backwards, if I know there is a non conflicting interval from [1,3] like in the example given above, I know the number of contiguous intervals between those two numbers is n(n+1)/2 where n is the length of the interval. In this case, the interval length is 3, and so there are 6 intervals between (and including) [1,3] that count.
Extending this logic, if I have a list of all non-conflicting intervals, then the answer is simply the sum of (n_i*(n_i+1))/2 for every interval length n_i.
Then all I need to do is find these intervals. This is where I'm stuck.
I can't really think of a similar programming problem. This seems similar, but the opposite of what the Merge Intervals problem on leetcode asks for. In that problem we're sorta given the good intervals and are asked to combine them. Here we're given the bad.
Any guidance?
EDIT: Best I could come up with:
Does this work?
So let's define max_enemy[i] as the largest enemy that is less than a particular person i, where i is in the usual range [1,N]. We can generate these values in O(M) time simply using the following loop:
max_enemy = [-1] * (N+1)  # -1 means it doesn't exist
for e1, e2 in enms:
    e1, e2 = min(e1, e2), max(e1, e2)
    max_enemy[e2] = max(max_enemy[e2], e1)
Then we go through the array of people keeping a sliding window. The sliding window ends as soon as we find a person i who has max_enemy[i] < i. This way we know that including this person would break our contiguous interval. So we now know our interval is [s, i-1] and we can do our math. We reset s = i and continue.
Here is a visualization of how this works. We draw a path between any two enemies:
N=5, enemies = [[3,4], [3,5]]
1   2   3   4   5
        |   |   |
        -----
        |       |
        ---------
EDIT2: I know this doesn't work for N=5, enemies=[[1,4],[3,5]]; currently working on a fix, still stuck.
You can solve this in O(M log M) time and O(M) space.
Let ENDINGAT(i) be the number of enemy-free intervals ending at position/person i. This is also the size of the largest enemy-free interval ending at i.
The answer you seek is the sum of all ENDINGAT(i) for every person i.
Let NEAREST(i) be the nearest enemy of person i that precedes person i. Let it be -1 if i has no preceding enemies.
Now we can write a simple formula to calculate all the ENDINGAT(i) values:
ENDINGAT(1) = 1, since there is only one interval ending at 1. For larger values:
ENDINGAT(i) = MIN( ENDINGAT(i-1)+1, i-NEAREST(i) )
So, it is very easy to calculate all the ENDINGAT(i) in order, as long as we can have all the NEAREST(i) in order. To get that, all you need to do is sort the enemy pairs by the highest member. Then for each i you can walk over all the pairs ending at i to find the closest one.
That's it -- it turns out to be pretty easy. The time is dominated by the O(M log M) time required to sort the enemy pairs, unless N is much bigger than M. In that case, you can skip runs of ENDINGAT for people with no preceding enemies, calculating their effect on the sum mathematically.
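A minimal sketch of this recurrence in Python (using an array indexed by person for NEAREST instead of sorting the pairs, which makes it O(N + M); the names are mine):

def count_enemy_free_intervals(N, enemies):
    # nearest[i] is NEAREST(i): the closest enemy of i that precedes i, or 0 if none.
    nearest = [0] * (N + 1)
    for a, b in enemies:
        lo, hi = min(a, b), max(a, b)
        nearest[hi] = max(nearest[hi], lo)

    total = 0
    ending_at = 0                                   # ENDINGAT(i-1)
    for i in range(1, N + 1):
        # ENDINGAT(i) = MIN( ENDINGAT(i-1)+1, i-NEAREST(i) )
        ending_at = min(ending_at + 1, i - nearest[i])
        total += ending_at
    return total

print(count_enemy_free_intervals(5, [[3, 4], [3, 5]]))  # 9
print(count_enemy_free_intervals(5, [[1, 4], [3, 5]]))  # 11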
There's a cool visual way to see this!
Instead of focusing on the line, let's look at the matrix of pairs of players. If i and j are enemies, then the effect of this enemiship is precisely to eliminate from consideration (1) this interval, and (2) any interval strictly larger than it. Because enemiship is symmetric, we may as well just look at the upper-right half of the matrix, and the diagonal; we'll use the characters
"X" to denote that a pair is enemies,
"*" to indicate that a pair has been obscured by a pair of enemies, and
"%" in the lower half to mark it as not part of the upper-half matrix.
For the two examples in your code, observe their corresponding matrices:
# intervals: 9

  0 1 2 3 4
 -----------
        * * | 0
  %     * * | 1
  % %   X X | 2
  % % %     | 3
  % % % %   | 4

# intervals: 10

  0 1 2 3 4
 -----------
        * * | 0
  %     X * | 1
  % %     X | 2
  % % %     | 3
  % % % %   | 4
The naive solution, provided below, solves the problem in O(N^2 M) time and O(N^2) space.
def matrix(enemies):
    m = [[' ' for j in range(N)] for i in range(N)]
    for (i, j) in enemies:
        m[i][j] = 'X'  # Mark Enemiship
        # Now mark larger intervals as dead.
        for q in range(0, i + 1):
            for r in range(j, N):
                if m[q][r] == ' ':
                    m[q][r] = '*'
    num_int = 0
    for i in range(N):
        for j in range(N):
            if j < i:
                m[i][j] = '%'
            elif m[i][j] == ' ':
                num_int += 1
    print("# intervals: ", num_int)
    return m
To convince yourself further, here are the matrices where
player 2 is enemies with himself, so that there is a barrier and there are two smaller versions of the puzzle on the intervals [0,1] and [3,4], each of which admits 3 sub-intervals;
Every player is enemies with the person two to their left, so that only length-(1 or 0) intervals are allowed (of which there are 4+5=9 intervals)
# intervals: 6

  0 1 2 3 4
 ---------[===========+
      * * * || 0
  %   * * * || 1
  % % X * * II 2
  % % %      | 3
  % % % %    | 4

# intervals: 9

  0 1 2 3 4
 --------[============+
      X * * || 0
  %     X * || 1
  % %     X II 2
  % % %      | 3
  % % % %    | 4
Complexity: mathematically the same as sorting a list, or validating that it is sorted. That is, O(M log M) in the worst case, with O(M) space to sort, and still at least O(M) time in the best case to recognize whether the list is sorted.
Bonus: This is also an excellent example to illustrate the power of looking at the identity of a problem, rather than its solution. Such a view of the problem will also inform smarter solutions. We can clearly do much better than the code I gave above...
We would clearly be done, for instance, if we could count the number of un-shaded points, which is the area of the smallest convex polygon covering the enemiships, together with the two boundary points. (Finding the two additional points can be done in O(M) time.) Now, this is probably not a problem you can solve in your sleep, but fortunately the problem of finding a convex hull, is so natural that the algorithms used to do it are well known.
In particular, a Graham scan can do it in O(M) time, so long as we happen to be given the pairs of enemies so that one of their coordinates is sorted. Better still, once we have the set of points in the convex hull, the area can be calculated by dividing it into at most M axis-aligned rectangles. Therefore, if the enemy pairs are sorted, the entire problem could be solved in O(M) time. Keep in mind that M could be dramatically smaller than N, and we don't even need to store N numbers in an array! This corresponds to the arithmetic proposed to skip lines in the other answer to this question.
If they are not sorted, other convex hull algorithms yield an O(M log M) running time, with O(M) space, as given by @Matt Timmermans's solution. In fact, this is the general lower bound! This can be shown with a more complicated geometric reduction: if you can solve the problem, then you can compute the sum of the heights of each number, multiplied by its distance to "the new zero", of agents satisfying j+i = N. This sum can be used to compute distances to the diagonal line, which is enough to sort a list of M numbers---a problem which cannot be solved in under O(M log M) time for adversarial inputs.
Ah, so why is it the case that we can get an O(N + M) solution by manually performing this integration, as done explicitly in the other solution? It is because we can sort the M numbers if we know that they fall into N bins, by Bucket Sort.
Thanks for sharing the puzzle!

How many times variable m is updated

Given the following pseudo-code, the question is how many times, on average, the variable m is updated.

A[1...n]: array with n random elements

m = A[1]
for i = 2 to n do
    if A[i] < m then m = A[i]
end for
One might answer that since all elements are random, then the variable will be updated on average on half the number of iterations of the for loop plus one for the initialization.
However, I suspect that there must be a better (and possibly the only correct) way to prove it using binomial distribution with p = 1/2. This way, the average number of updates on m would be
M = 1 + Σ_{k=1}^{n-1} k · C(n,k) · p^k · (1-p)^(n-k)
where C(n,k) is the binomial coefficient. I have tried to solve this, but I got stuck a few steps in since I do not know how to continue.
Could someone explain me which of the two answers is correct and if it is the second one, show me how to calculate M?
Thank you for your time
Assuming the elements of the array are distinct, the expected number of updates of m is the nth harmonic number, Hn, which is the sum of 1/k for k ranging from 1 to n.
The summation formula can also be represented by the recursion:
H1 = 1
Hn = Hn−1 + 1/n    (n > 1)
It's easy to see that the recursion corresponds to the problem.
Consider all permutations of n−1 numbers, and assume that the expected number of assignments is Hn−1. Now, every permutation of n numbers consists of a permutation of n−1 numbers, with a new smallest number inserted in one of n possible insertion points: either at the beginning, or after one of the n−1 existing values. Since it is smaller than every number in the existing series, it will only be assigned to m in the case that it was inserted at the beginning. That has a probability of 1/n, and so the expected number of assignments of a permutation of n numbers is Hn−1 + 1/n.
Since the expected number of assignments for a vector of length one is obviously 1, which is H1, we have an inductive proof of the recursion.
Hn is asymptotically equal to ln n + γ, where γ is the Euler–Mascheroni constant, approximately 0.577. So it increases without limit, but quite slowly.
The values for which m is updated are called left-to-right maxima, and you'll probably find more information about them by searching for that term.
I liked @rici's answer, so I decided to elaborate its central argument a little bit more to make it clearer to me.
Let H[k] be the expected number of assignments needed to compute the min m of an array of length k, as indicated in the algorithm under consideration. We know that
H[1] = 1.
Now assume we have an array of length n > 1. The min can be in the last position of the array or not. It is in the last position with probability 1/n. It is not with probability 1 - 1/n. In the first case the expected number of assignments is H[n-1] + 1. In the second, H[n-1].
If we multiply the expected number of assignments of each case by their probabilities and sum, we get
H[n] = (H[n-1] + 1)*1/n + H[n-1]*(1 - 1/n)
= H[n-1]*1/n + 1/n + H[n-1] - H[n-1]*1/n
= 1/n + H[n-1]
which shows the recursion.
Note that the argument is valid only if the min is either in the last position or in one of the first n-1 positions, not in both. Thus we are using the fact that all the elements of the array are distinct.
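For anyone who wants to sanity-check the harmonic-number claim numerically, here is a small brute-force comparison (exact arithmetic via fractions; the initial assignment m = A[1] is counted as an update, as above):

from fractions import Fraction
from itertools import permutations
from math import factorial

def harmonic(n):
    return sum(Fraction(1, k) for k in range(1, n + 1))

def average_updates(n):
    total = 0
    for perm in permutations(range(n)):
        m, updates = perm[0], 1          # the initial assignment counts as an update
        for x in perm[1:]:
            if x < m:
                m, updates = x, updates + 1
        total += updates
    return Fraction(total, factorial(n))

for n in range(1, 8):
    assert average_updates(n) == harmonic(n)
print("average number of updates matches H[n] for n = 1..7")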

How to find pair with kth largest sum?

Given two sorted arrays of numbers, we want to find the pair with the kth largest possible sum. (A pair is one element from the first array and one element from the second array). For example, with arrays
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
The pairs with largest sums are
13 + 16 = 29
13 + 12 = 25
8 + 16 = 24
13 + 8 = 21
8 + 12 = 20
So the pair with the 4th largest sum is (13, 8). How to find the pair with the kth largest possible sum?
Also, what is the fastest algorithm? The arrays are already sorted and sizes M and N.
I am already aware of the O(K log K) solution using a max-heap given here.
It is also one of the favorite Google interview questions, and they demand an O(k) solution.
I've also read somewhere that there exists an O(k) solution, which I am unable to figure out.
Can someone explain the correct solution with pseudocode?
P.S.
Please DON'T post this link as an answer/comment. It DOESN'T contain the answer.
I start with a simple but not quite linear-time algorithm. We choose some value between array1[0]+array2[0] and array1[N-1]+array2[N-1]. Then we determine how many pair sums are greater than this value and how many of them are less. This may be done by iterating the arrays with two pointers: pointer to the first array incremented when sum is too large and pointer to the second array decremented when sum is too small. Repeating this procedure for different values and using binary search (or one-sided binary search) we could find Kth largest sum in O(N log R) time, where N is size of the largest array and R is number of possible values between array1[N-1]+array2[N-1] and array1[0]+array2[0]. This algorithm has linear time complexity only when the array elements are integers bounded by small constant.
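A minimal sketch of this first algorithm, assuming integer elements (the names are mine): count_at_least(v) is the two-pointer counting step, and the outer loop is the binary search over the value range.

def kth_largest_sum_by_value(a, b, k):
    a, b = sorted(a), sorted(b)

    def count_at_least(v):
        # Number of pairs (x, y), x from a and y from b, with x + y >= v.
        count, j = 0, 0                  # j scans b upward from its smallest element
        for x in reversed(a):            # x from largest to smallest
            while j < len(b) and x + b[j] < v:
                j += 1
            count += len(b) - j
        return count

    lo, hi = a[0] + b[0], a[-1] + b[-1]
    while lo < hi:                       # find the largest v with at least k pair sums >= v
        mid = (lo + hi + 1) // 2
        if count_at_least(mid) >= k:
            lo = mid
        else:
            hi = mid - 1
    return lo

print(kth_largest_sum_by_value([2, 3, 5, 8, 13], [4, 8, 12, 16], 4))  # 21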
Previous algorithm may be improved if we stop binary search as soon as number of pair sums in binary search range decreases from O(N^2) to O(N). Then we fill auxiliary array with these pair sums (this may be done with slightly modified two-pointers algorithm). And then we use quickselect algorithm to find Kth largest sum in this auxiliary array. All this does not improve worst-case complexity because we still need O(log R) binary search steps. What if we keep the quickselect part of this algorithm but (to get proper value range) we use something better than binary search?
We could estimate value range with the following trick: get every second element from each array and try to find the pair sum with rank k/4 for these half-arrays (using the same algorithm recursively). Obviously this should give some approximation for needed value range. And in fact slightly improved variant of this trick gives range containing only O(N) elements. This is proven in following paper: "Selection in X + Y and matrices with sorted rows and columns" by A. Mirzaian and E. Arjomandi. This paper contains detailed explanation of the algorithm, proof, complexity analysis, and pseudo-code for all parts of the algorithm except Quickselect. If linear worst-case complexity is required, Quickselect may be augmented with Median of medians algorithm.
This algorithm has complexity O(N). If one of the arrays is shorter than other array (M < N) we could assume that this shorter array is extended to size N with some very small elements so that all calculations in the algorithm use size of the largest array. We don't actually need to extract pairs with these "added" elements and feed them to quickselect, which makes algorithm a little bit faster but does not improve asymptotic complexity.
If k < N we could ignore all the array elements with index greater than k. In this case complexity is equal to O(k). If N < k < N(N-1) we just have better complexity than requested in OP. If k > N(N-1), we'd better solve the opposite problem: k'th smallest sum.
I uploaded simple C++11 implementation to ideone. Code is not optimized and not thoroughly tested. I tried to make it as close as possible to pseudo-code in linked paper. This implementation uses std::nth_element, which allows linear complexity only on average (not worst-case).
A completely different approach to find K'th sum in linear time is based on priority queue (PQ). One variation is to insert largest pair to PQ, then repeatedly remove top of PQ and instead insert up to two pairs (one with decremented index in one array, other with decremented index in other array). And take some measures to prevent inserting duplicate pairs. Other variation is to insert all possible pairs containing largest element of first array, then repeatedly remove top of PQ and instead insert pair with decremented index in first array and same index in second array. In this case there is no need to bother about duplicates.
OP mentions O(K log K) solution where PQ is implemented as max-heap. But in some cases (when array elements are evenly distributed integers with limited range and linear complexity is needed only on average, not worst-case) we could use O(1) time priority queue, for example, as described in this paper: "A Complexity O(1) Priority Queue for Event Driven Molecular Dynamics Simulations" by Gerald Paul. This allows O(K) expected time complexity.
An advantage of this approach is the possibility of providing the first K elements in sorted order. Disadvantages are the limited choice of array element type, a more complex and slower algorithm, and worse asymptotic complexity: O(K) > O(N).
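For reference, here is a minimal sketch of the second PQ variation using an ordinary binary heap, so O(K log N) rather than O(K), but it shows the bookkeeping (the function name is mine): the heap is seeded with every pair that uses the largest element of the first array, and each pop re-inserts the same pair with the first-array index decremented.

import heapq

def kth_largest_sum_heap(a, b, k):
    a, b = sorted(a), sorted(b)
    # Entries are (-sum, i, j); negating the sums makes Python's min-heap act as a max-heap.
    heap = [(-(a[-1] + y), len(a) - 1, j) for j, y in enumerate(b)]
    heapq.heapify(heap)
    for _ in range(k - 1):
        _, i, j = heapq.heappop(heap)
        if i > 0:                        # replace the popped pair (i, j) by (i-1, j)
            heapq.heappush(heap, (-(a[i - 1] + b[j]), i - 1, j))
    return -heap[0][0]

print(kth_largest_sum_heap([2, 3, 5, 8, 13], [4, 8, 12, 16], 4))  # 21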
EDIT: This does not work. I leave the answer, since apparently I am not the only one who could have this kind of idea; see the discussion below.
A counter-example is x = (2, 3, 6), y = (1, 4, 5) and k=3, where the algorithm gives 7 (3+4) instead of 8 (3+5).
Let x and y be the two arrays, sorted in decreasing order; we want to construct the K-th largest sum.
The variables are: i, the index in the first array (element x[i]); j, the index in the second array (element y[j]); and k, the "order" of the sum (k in 1..K), in the sense that S(k) = x[i] + y[j] will be the k-th greatest sum satisfying your conditions (this is the loop invariant).
Start from (i, j) equal to (0, 0): clearly, S(1) = x[0]+y[0].
for k from 1 to K-1, do:
if x[i+1]+ y[j] > x[i] + y[j+1], then i := i+1 (and j does not change) ; else j:=j+1
To see that it works, consider that you have S(k) = x[i] + y[j]. Then S(k+1) is the greatest sum which is lower than (or equal to) S(k) and such that at least one index (i or j) changes. It is not difficult to see that exactly one of i or j should change.
If i changes, the greatest sum you can construct which is lower than S(k) is obtained by setting i = i+1, because x is decreasing and all the x[i'] + y[j] with i' < i are greater than S(k). The same holds for j, showing that S(k+1) is either x[i+1] + y[j] or x[i] + y[j+1].
Therefore, at the end of the loop you have found the K-th greatest sum.
tl;dr: If you look ahead and look behind at each iteration, you can start with the end (which is highest) and work back in O(K) time.
Although the insight underlying this approach is, I believe, sound, the code below is not quite correct at present (see comments).
Let's see: first of all, the arrays are sorted. So, if the arrays are a and b with lengths M and N, and as you have arranged them, the largest items are in slots M and N respectively, the largest pair will always be a[M]+b[N].
Now, what's the second largest pair? It's going to have perhaps one of {a[M],b[N]} (it can't have both, because that's just the largest pair again), and at least one of {a[M-1],b[N-1]}. BUT, we also know that if we choose a[M-1]+b[N-1], we can make one of the operands larger by choosing the higher number from the same list, so it will have exactly one number from the last column, and one from the penultimate column.
Consider the following two arrays: a = [1, 2, 53]; b = [66, 67, 68]. Our highest pair is 53+68. If we lose the smaller of those two, our pair is 68+2; if we lose the larger, it's 53+67. So, we have to look ahead to decide what our next pair will be. The simplest lookahead strategy is simply to calculate the sum of both possible pairs. That will always cost two additions, and two comparisons for each transition (three, because we need to deal with the case where the sums are equal); let's call that cost Q.
At first, I was tempted to repeat that K-1 times. BUT there's a hitch: the next largest pair might actually be the other pair we can validly make from {a[M],b[N]} and {a[M-1],b[N-1]}. So, we also need to look behind.
So, let's code (python, should be 2/3 compatible):
def kth(a, b, k):
    M = len(a)
    N = len(b)
    if k > M*N:
        raise ValueError("There are only %s possible pairs; you asked for the %sth largest, which is impossible" % (M*N, k))
    (ia, ib) = M-1, N-1  # 0 based arrays
    # we need this for lookback
    nottakenindices = (0, 0)  # could be any value
    nottakensum = float('-inf')
    for i in range(k-1):
        optionone = a[ia] + b[ib-1]
        optiontwo = a[ia-1] + b[ib]
        biggest = max((optionone, optiontwo))
        # first deal with look behind
        if nottakensum > biggest:
            if optionone == biggest:
                newnottakenindices = (ia, ib-1)
            else:
                newnottakenindices = (ia-1, ib)
            ia, ib = nottakenindices
            nottakensum = biggest
            nottakenindices = newnottakenindices
        # deal with case where indices hit 0
        elif ia <= 0 and ib <= 0:
            ia = ib = 0
        elif ia <= 0:
            ib -= 1
            ia = 0
            nottakensum = float('-inf')
        elif ib <= 0:
            ia -= 1
            ib = 0
            nottakensum = float('-inf')
        # lookahead cases
        elif optionone > optiontwo:
            # then choose the first option as our next pair
            nottakensum, nottakenindices = optiontwo, (ia-1, ib)
            ib -= 1
        elif optionone < optiontwo:  # choose the second
            nottakensum, nottakenindices = optionone, (ia, ib-1)
            ia -= 1
        # next two cases apply if options are equal
        elif a[ia] > b[ib]:  # drop the smallest
            nottakensum, nottakenindices = optiontwo, (ia-1, ib)
            ib -= 1
        else:  # might be equal or not - we can choose arbitrarily if equal
            nottakensum, nottakenindices = optionone, (ia, ib-1)
            ia -= 1
        # +2 - one for zero-based, one for skipping the 1st largest
        data = (i+2, a[ia], b[ib], a[ia]+b[ib], ia, ib)
        narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
        print(narrative)  # this will work in both versions of python
        if ia <= 0 and ib <= 0:
            raise ValueError("Both arrays exhausted before Kth (%sth) pair reached" % data[0])
    return data, narrative
For those without python, here's an ideone: http://ideone.com/tfm2MA
At worst, we have 5 comparisons in each iteration, and K-1 iterations, which means that this is an O(K) algorithm.
Now, it might be possible to exploit information about differences between values to optimise this a little bit, but this accomplishes the goal.
Here's a reference implementation (not O(K), but will always work, unless there's a corner case with pairs that have equal sums):

import itertools

def refkth(a, b, k):
    (rightia, righta), (rightib, rightb) = sorted(
        itertools.product(enumerate(a), enumerate(b)),
        key=lambda pair: pair[0][1] + pair[1][1],
        reverse=True)[k-1]
    data = k, righta, rightb, righta+rightb, rightia, rightib
    narrative = "%sth largest pair is %s+%s=%s, with indices (%s,%s)" % data
    print(narrative)  # this will work in both versions of python
    return data, narrative
This calculates the cartesian product of the two arrays (i.e. all possible pairs), sorts them by sum, and takes the kth element. The enumerate function decorates each item with its index.
The max-heap algorithm in the other question is simple, fast and correct. Don't knock it. It's really well explained too. https://stackoverflow.com/a/5212618/284795
Might be there isn't any O(k) algorithm. That's okay, O(k log k) is almost as fast.
If the last two solutions were at (a1, b1), (a2, b2), then it seems to me there are only four candidate solutions (a1-1, b1) (a1, b1-1) (a2-1, b2) (a2, b2-1). This intuition could be wrong. Surely there are at most four candidates for each coordinate, and the next highest is among the 16 pairs (a in {a1,a2,a1-1,a2-1}, b in {b1,b2,b1-1,b2-1}). That's O(k).
(No it's not, still not sure whether that's possible.)
[2, 3, 5, 8, 13]
[4, 8, 12, 16]
Merge the 2 arrays and note down the indexes in the sorted array. Here is what the index array looks like (starting from 1, not 0):
[1, 2, 4, 6, 8]
[3, 5, 7, 9]
Now start from the end and make tuples; sum the elements in each tuple and pick the kth largest sum.
public static List<List<Integer>> optimization(int[] nums1, int[] nums2, int k) {
    // 2 * O(n log(n))
    Arrays.sort(nums1);
    Arrays.sort(nums2);

    List<List<Integer>> results = new ArrayList<>(k);
    int endIndex = 0;
    // Find the number whose square is the first one bigger than k
    for (int i = 1; i <= k; i++) {
        if (i * i >= k) {
            endIndex = i;
            break;
        }
    }

    // The following iteration provides at most endIndex^2 elements, and both arrays are in
    // ascending order, so the k smallest pairs can be found in this iteration. To flatten the
    // nested loop, refer to
    // 'https://stackoverflow.com/questions/7457879/algorithm-to-optimize-nested-loops'
    for (int i = 0; i < endIndex * endIndex; i++) {
        int m = i / endIndex;
        int n = i % endIndex;
        List<Integer> item = new ArrayList<>(2);
        item.add(nums1[m]);
        item.add(nums2[n]);
        results.add(item);
    }

    results.sort(Comparator.comparing(pair -> pair.get(0) + pair.get(1)));
    return results.stream().limit(k).collect(Collectors.toList());
}
Key to eliminate O(n^2):
Avoid cartesian product(or 'cross join' like operation) of both arrays, which means flattening the nested loop.
Downsize iteration over the 2 arrays.
So:
Sort both arrays (Arrays.sort offers O(n log(n)) performance according to Java doc)
Limit the iteration range to the size which is just big enough to support k smallest pairs searching.

Find subset with elements that are furthest apart from each other

I have an interview question that I can't seem to figure out. Given an array of size N, find the subset of size k such that the elements in the subset are the furthest apart from each other. In other words, maximize the minimum pairwise distance between the elements.
Example:
Array = [1,2,6,10]
k = 3
answer = [1,6,10]
The bruteforce way requires finding all subsets of size k which is exponential in runtime.
One idea I had was to take values evenly spaced from the array. What I mean by this is
Take the 1st and last element
find the difference between them (in this case 10-1) and divide that by k ((10-1)/3=3)
move 2 pointers inward from both ends, picking out elements that are +/- 3 from your previous pick. So in this case, you start from 1 and 10 and find the closest elements to 4 and 7. That would be 6.
This is based on the intuition that the elements should be as evenly spread as possible. I have no idea how to prove it works/doesn't work. If anyone knows how or has a better algorithm please do share. Thanks!
This can be solved in polynomial time using DP.
The first step is, as you mentioned, to sort the list A. Let X[i,j] be the solution for selecting j elements from the first i elements of A.
Now, X[i+1, j+1] = max( min( X[k,j], A[i+1]-A[k] ) ) over k <= i.
I will leave the initialization step and the bookkeeping needed to recover the chosen subset for you to work on.
In your example (1,2,6,10) it works the following way (rows are j, the number of chosen elements; columns are the prefixes of the sorted array):

      1    2    6    10
1     -    -    -    -
2     -    1    5    9
3     -    -    1    4
4     -    -    -    1
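A small sketch of this DP in Python (O(n^2 * k) time; best[i][j] below plays the role of X[·,·] restricted to subsets whose largest chosen element is A[i]; the names are mine):

def max_min_gap(A, k):
    A = sorted(A)
    n = len(A)
    INF = float('inf')
    # best[i][j]: best achievable minimum pairwise gap over subsets of A[0..i]
    # that contain A[i] and have exactly j elements (INF when j == 1).
    best = [[-INF] * (k + 1) for _ in range(n)]
    for i in range(n):
        best[i][1] = INF
    for j in range(2, k + 1):
        for i in range(n):
            for p in range(i):
                best[i][j] = max(best[i][j], min(best[p][j - 1], A[i] - A[p]))
    return max(best[i][k] for i in range(n))

print(max_min_gap([1, 2, 6, 10], 3))  # 4, achieved by the subset [1, 6, 10]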
The basic idea is right, I think. You should start by sorting the array, then take the first and the last elements, then determine the rest.
I cannot think of a polynomial algorithm to solve this, so I would suggest one of the two options.
One is to use a search algorithm, branch-and-bound style, since you have a nice heuristic at hand: the upper bound for any solution is the minimum size of the gap between the elements picked so far, so the first guess (evenly spaced cells, as you suggested) can give you a good baseline, which will help prune most of the branches right away. This will work fine for smaller values of k, although the worst case performance is O(N^k).
The other option is to start with the same baseline, calculate the minimum pairwise distance for it and then try to improve it. Say you have a subset with minimum distance of 10, now try to get one with 11. This can be easily done by a greedy algorithm -- pick the first item in the sorted sequence such that the distance between it and the previous item is bigger-or-equal to the distance you want. If you succeed, try increasing further, if you fail -- there is no such subset.
The latter solution can be faster when the array is large and k is relatively large as well, but the elements in the array are relatively small. If they are bounded by some value M, this algorithm will take O(N*M) time, or, with a small improvement, O(N*log(M)), where N is the size of the array.
As Evgeny Kluev suggests in his answer, there is also a good upper bound on the maximum pairwise distance, which can be used in either one of these algorithms. So the complexity of the latter is actually O(N*log(M/k)).
You can do this in O(n*(log n) + n*log(M)), where M is max(A) - min(A).
The idea is to use binary search to find the maximum separation possible.
First, sort the array. Then, we just need a helper function that takes in a distance d, and greedily builds the longest subarray possible with consecutive elements separated by at least d. We can do this in O(n) time.
If the generated array has length at least k, then the maximum separation possible is >=d. Otherwise, it's strictly less than d. This means we can use binary search to find the maximum value. With some cleverness, you can shrink the 'low' and 'high' bounds of the binary search, but it's already so fast that sorting would become the bottleneck.
Python code:
from typing import List

def maximize_distance(nums: List[int], k: int) -> List[int]:
    """Given an array of numbers and size k, uses binary search
    to find a subset of size k with maximum min-pairwise-distance"""
    assert len(nums) >= k
    if k == 1:
        return [nums[0]]
    nums.sort()

    def longest_separated_array(desired_distance: int) -> List[int]:
        """Given a distance, returns a subarray of nums
        of length k with pairwise differences at least that distance (if
        one exists)."""
        answer = [nums[0]]
        for x in nums[1:]:
            if x - answer[-1] >= desired_distance:
                answer.append(x)
                if len(answer) == k:
                    break
        return answer

    low, high = 0, (nums[-1] - nums[0])
    while low < high:
        mid = (low + high + 1) // 2
        if len(longest_separated_array(mid)) == k:
            low = mid
        else:
            high = mid - 1
    return longest_separated_array(low)
I suppose your set is ordered. If not, my answer will be changed slightly.
Let's suppose you have an array X = (X1, X2, ..., Xn)
Energy(Xi) = min(|X(i-1) - Xi|, |X(i+1) - Xi|),  1 < i < n

j <- 1
while j < n - k do
    X.Exclude(min(Energy(Xi)), 1 < i < n)
    j <- j + 1
    n <- n - 1
end while
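A Python version of this idea, with the loop simplified to remove interior elements until k remain (note that, like the evenly-spaced heuristic in the question, this greedy is not guaranteed to be optimal; the function name is mine):

def greedy_furthest_apart(X, k):
    X = sorted(X)
    while len(X) > k:
        # Energy of an interior element: distance to its nearest remaining neighbour.
        def energy(i):
            return min(X[i] - X[i - 1], X[i + 1] - X[i])
        # Exclude an interior element with the smallest energy.
        worst = min(range(1, len(X) - 1), key=energy)
        del X[worst]
    return X

print(greedy_furthest_apart([1, 2, 6, 10], 3))  # [1, 6, 10]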
$length = length($array);
sort($array);                                 // sorts the list in ascending order
$differences = ($array << 1) - $array;        // gets the difference between each value and the next largest value
sort($differences);                           // sorts the list in ascending order
$max = ($array[$length-1] - $array[0]) / $M;  // this is the theoretical max of how large the result can be
$result = array();
for ($i = 0; $i < $length - 1; $i++) {
    $count += $differences[$i];
    // if there are either no more elements that can be taken, or we have gone
    // above or equal to the theoretical max, add a point
    if ($length - $i == $M - 1 || $count >= $max) {
        $result.push_back($count);
        $count = 0;
        $M--;
    }
}
return min($result);
For the non-code people: sort the list, find the differences between each two sequential elements, and sort that list of differences (in ascending order). Then loop through it, summing up sequential values until you either pass the theoretical max or there aren't enough elements remaining; add that sum to a new array and continue until you hit the end. Then return the minimum of the newly created array.
This is just a quick draft though. At a quick glance any operation here can be done in linear time (radix sort for the sorts).
For example, with 1, 4, 7, 100, and 200 and M=3, we get:
$differences = 3, 3, 93, 100
$max = (200-1)/3 ~ 67
then we loop:
$count = 3, 3+3=6, 6+93=99 > 67 so we push 99
$count = 100 > 67 so we push 100
min(99,100) = 99
It is a simple exercise to convert this to the set solution that I leave to the reader (P.S. after all the times reading that in a book, I've always wanted to say it :P)
