Max subset of arrays whose mean is larger than threshold - algorithm

I recently came across the following problem, and so far got no insight on how to solve it.
Let S = {v1, v2, v3, ..., vn} be a set of n arrays in ℝ^6. That is, each array has 6 entries.
For a given set of arrays, let the mean of a dimension be the average of the coordinates corresponding to that dimension over all elements in the set.
Also, let us define a certain property P of a set of arrays as the lowest value amongst all means of a set (there is a total of 6 means, one for each dimension). For instance, if a certain set has {10, 4, 1, 5, 6, 3} as means for its dimensions, then P for this set is 1.
Now to the definition of the problem: Return the maximum cardinality amongst all the subsets S' of S such that P(S') ≥ T, T a known threshold value, or 0 if no such subset exists. Additionally, output any S' of maximum cardinality (such that P(S') ≥ T).
Summarising: Inputs: the set S and the threshold value T. Output: A certain subset S' (|S'| is evidently immediate).
I first began trying to come up with a greedy solution, but got no success. Then, I moved on to a dynamic programming approach, but could not establish a recursion that solved the problem. I could expand a little more on my thoughts on the solution, but I don't think they would be of much use, given how far I got (or didn't get).
Any help would be greatly appreciated!

Brute-force evaluation through recursion would have a time complexity of O(2^n) because each array can either be present in the subset or not.
One (still inefficient but slightly better) way to solve this problem is to use Integer Linear Programming.
Define Xi = { 1 if array Vi is present in the maximal subset, 0 otherwise }
Hence the cardinality is k = summation(Xi) {i = 1 to n }
Also, since the mean of every dimension must be >= T, writing dji for the j-th coordinate of array Vi, this means:
d11X1 + d12X2 + ... + d1nXn >= T*k
d21X1 + d22X2 + ... + d2nXn >= T*k
d31X1 + d32X2 + ... + d3nXn >= T*k
d41X1 + d42X2 + ... + d4nXn >= T*k
d51X1 + d52X2 + ... + d5nXn >= T*k
d61X1 + d62X2 + ... + d6nXn >= T*k
Objective function: Maximize( k )
Strictly speaking, you should eliminate k by substituting the cardinality equation into the constraints, but I have kept it here for clarity.
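To see the elimination of k concretely: substituting k = X1 + ... + Xn into each constraint gives (dj1 - T)X1 + ... + (djn - T)Xn >= 0, which involves the Xi alone. The sketch below is not an ILP solver. It is a brute-force reference checker (exponential, so only usable for small n) that tests exactly those shifted constraints, trying the largest subset sizes first. The name `largest_subset` is made up for illustration, and the arrays may have any common length, not just 6.

```python
from itertools import combinations

def largest_subset(arrays, threshold):
    """Return a largest subset whose per-dimension means are all >= threshold.

    Uses the shifted form of the constraints: sum_i (d_ji - T) * X_i >= 0.
    Returns [] when no non-empty subset qualifies.
    """
    if not arrays:
        return []
    dims = len(arrays[0])
    # Subtract the threshold once so each check is a plain sum >= 0.
    shifted = [[x - threshold for x in v] for v in arrays]
    for size in range(len(arrays), 0, -1):          # largest sizes first
        for idx in combinations(range(len(arrays)), size):
            if all(sum(shifted[i][d] for i in idx) >= 0 for d in range(dims)):
                return [arrays[i] for i in idx]
    return []
```

A checker like this is mainly useful for validating the output of a real ILP solver on small instances.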

Related

How to generate a k-combination in a n-element set in TLA+?

In math, the k-combinations of an n-element set are all the subsets that contain exactly k elements of the n-element set.
However, how can I compute this in TLA+?
I don't know how to compute (n, k), due to my poor algorithm knowledge.
However, I find an ugly way that can compute (n, 2) by using cartesian product.
Suppose the n-element set is X, so the following CombinationSeq2(X) computes the cartesian product of X and X. If X is {1, 2}, then the result is {<<1,1>>, <<1,2>>, <<2,1>>, <<2,2>>}, so we must use s[1] < s[2] to filter repeated sets, thus yielding the final result {<<1,2>>}.
CombinationSeq2(X) == {s \in X \X X: s[1] < s[2]}
Then I convert inner tuple to set by the following
Combination2(X) == { { s[1], s[2] } : s \in CombinationSeq2(X) }
However, the above solution is ugly:
it does not support arbitrary k.
it requires the elements of the set to be ordered. However, we don't need an order here; being able to tell whether two elements are equal is enough.
I wonder whether there is any solution to this. I added the algorithm tag to this question because I believe that if TLA+ doesn't support this directly, there should be some algorithmic way to do it. If so, I need an idea that I can translate into TLA+.
In the Community Modules, kSubset is defined as
kSubset(k, S) ==
{ s \in SUBSET S : Cardinality(s) = k }
If done purely in TLA+, this will generate all 2^|S| subsets of S before filtering out the ones of size k. The community module also has a Java override to implement the calculation more efficiently. See the readme for instructions on how to use the override.
If efficiency is not a concern, then something like {s \in SUBSET X : Cardinality(s) = k} should be correct. (I say "something like" because I'm kind of guessing at the syntax. Cardinality is defined in the standard FiniteSets module.)
All reasonably efficient algorithms that I'm aware of do process the set in some kind of implicit order.
We can generate k-combinations of the given set by writing a recursive function that takes two parameters: i (the current position in the set) and k (how many elements are still to be added to the current combination). The function returns the set of all combinations of k elements drawn from positions i, ..., n.
f(i, k) = {prepend X[i] to each set returned by f(i + 1, k - 1)} union f(i + 1, k)
f(_, 0) = {{}}
f(n + 1, k) = {} for k > 0
Here, '_' (underscore) stands for any possible value. Note that f(_, 0) must be the set containing only the empty combination, not the empty set, otherwise nothing is ever produced.
We can get the answer for k-combinations of an n-element set from f(1, k).
Note: the function walks through the elements in some fixed order, but it never compares them, so the elements themselves do not need to be ordered.
I don't know TLA+, but here's an implementation in C++.
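The recurrence can also be sketched in Python. This is only an illustration of the idea under the assumption that f(_, 0) yields the set containing the empty combination; the name `k_combinations` and the use of `frozenset` (so that combinations can be stored inside a set) are choices made for this sketch.

```python
def k_combinations(items, k):
    """Return the set of all k-element subsets of items, as frozensets."""
    items = list(items)   # fix an arbitrary processing order
    n = len(items)

    def f(i, remaining):
        if remaining == 0:
            return {frozenset()}          # exactly one way: pick nothing more
        if i == n:
            return set()                  # ran out of elements
        # Either take items[i] and pick remaining-1 from the rest ...
        with_item = {s | {items[i]} for s in f(i + 1, remaining - 1)}
        # ... or skip items[i] and pick all of them from the rest.
        return with_item | f(i + 1, remaining)

    return f(0, k)
```

The elements are only tested for equality (through set membership), never compared with `<`, which matches the requirement in the question.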

arrangement with constraint on the sum

I'm looking to construct an algorithm which gives the arrangements with repetition of n sequences of a given step S (which can be a positive real number), under the constraint that the sum of all combinations is k, with k a positive integer.
My problem is thus to find the solutions to the equation:
x1 + x2 + ⋯ + xn = k
where
0 ≤ xi ≤ bi
and S (the step) a real number with finite decimal.
For instance, if 0≤xi≤50, and S=2.5 then xi = {0, 2.5 , 5,..., 47.5, 50}.
The point here is to look only through the combinations having a sum=k because if n is big it is not possible to generate all the arrangements, so I would like to bypass this to generate only the combinations that match the constraint.
I was thinking to start with n=2 for instance, and find all linear combinations that match the constraint.
ex: if xi = {0, 2.5 , 5,..., 47.5, 50} and k=100, then we only have one combination={50,50}
For n=3, we have the combination for n=2 times 3, i.e. {50,50,0},{50,0,50} and {0,50,50} plus the combinations {50,47.5,2.5} * 3! etc...
If xi = {0, 2.5 , 5,..., 37.5, 40} and k=100, then we have 0 combinations for n=2 because 2*40<100, and we have {40,40,20} times 3 for n=3... (if I'm not mistaken)
I'm a bit lost as I can't seem to find a proper way to start the algorithm, knowing that I should have the step S and b as inputs.
Do you have any suggestions?
Thanks
You can transform your problem into an integer problem by dividing everything by S: We want to find all integer sequences y1, ..., yn with:
(1) 0 ≤ yi ≤ ⌊b / S⌋
(2) y1 + ... + yn = k / S
We can see that there is no solution if k is not a multiple of S. Once we have reduced the problem, I would suggest using a pseudopolynomial dynamic programming algorithm to solve the subset sum problem and then reconstruct the solution from it. Let f(i, j) be the number of ways to make sum j with i elements. We have the following recurrence:
f(0,0) = 1
f(0,j) = 0 forall j > 0
f(i,j) = sum_{m = 0}^{min(floor(b / S), j)} f(i - 1, j - m)
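We can fill f row by row. With running prefix sums each entry costs O(1), which gives O(n * k / S) time overall (evaluating the sum directly costs an extra factor of b / S). Now we want to reconstruct the solution. I'm using Python-style pseudocode to illustrate the concept: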
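from math import floor

def reconstruct(i, j):
    # f(i, j) is the table defined by the recurrence above
    if f(i, j) == 0:
        return  # no way to reach sum j with i elements
    if i == 0:
        yield []
        return
    for m in range(min(floor(b / S), j) + 1):
        for rest in reconstruct(i - 1, j - m):
            yield [m] + rest

# k is assumed to be a multiple of S, so k / S is an integer
result = list(reconstruct(n, round(k / S)))
result will be a list of all possible combinations (in units of S; multiply each entry by S to get the xi).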
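For completeness, here is a self-contained sketch that builds the table f and then reconstructs the sequences from it. It assumes a single shared upper bound for all xi (as in the answer's recurrence); the function name `bounded_sequences` and the small tolerance used to cope with floating-point steps are assumptions of this sketch.

```python
def bounded_sequences(n, k, step, bound):
    """Yield every sequence x1..xn of multiples of step with
    0 <= xi <= bound and x1 + ... + xn == k."""
    target = round(k / step)
    if abs(target * step - k) > 1e-9:
        return                       # k is not a multiple of step: no solution
    max_y = int(bound / step + 1e-9)
    # ways[i][j] = number of sequences of i terms summing to j (in units of step)
    ways = [[0] * (target + 1) for _ in range(n + 1)]
    ways[0][0] = 1
    for i in range(1, n + 1):
        for j in range(target + 1):
            ways[i][j] = sum(ways[i - 1][j - m] for m in range(min(max_y, j) + 1))

    def rebuild(i, j):
        if ways[i][j] == 0:
            return                   # dead end: prune immediately
        if i == 0:
            yield []
            return
        for m in range(min(max_y, j) + 1):
            for rest in rebuild(i - 1, j - m):
                yield [m * step] + rest

    yield from rebuild(n, target)
```

Because the table prunes every dead end, the work spent in the reconstruction is proportional to the number of solutions actually produced rather than to the number of all arrangements.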
What you are describing sounds like a special case of the subset sum problem. Once you put it in those terms, you'll find that Pisinger apparently has a linear time algorithm for solving a more general version of your problem, since your weights are bounded. If you're interested in designing your own algorithm, you might start by reading Pisinger's thesis to get some ideas.
Since you are looking for all possible solutions and not just a single solution, the dynamic programming approach is probably your best bet.

Find subset with elements that are furthest apart from each other

I have an interview question that I can't seem to figure out. Given an array of size N, find the subset of size k such that the elements in the subset are the furthest apart from each other. In other words, maximize the minimum pairwise distance between the elements.
Example:
Array = [1,2,6,10]
k = 3
answer = [1,6,10]
The bruteforce way requires finding all subsets of size k which is exponential in runtime.
One idea I had was to take values evenly spaced from the array. What I mean by this is
Take the 1st and last element
find the difference between them (in this case 10-1) and divide that by k ((10-1)/3=3)
move 2 pointers inward from both ends, picking out elements that are +/- 3 from your previous pick. So in this case, you start from 1 and 10 and find the closest elements to 4 and 7. That would be 6.
This is based on the intuition that the elements should be as evenly spread as possible. I have no idea how to prove it works/doesn't work. If anyone knows how or has a better algorithm please do share. Thanks!
This can be solved in polynomial time using DP.
The first step is, as you mentioned, to sort the list A. Let X[i,j] be the solution for selecting j elements from the first i elements of A, with A[i] being the last element selected.
Now, X[i+1, j+1] = max( min( X[p,j], A[i+1]-A[p] ) ) over p<=i.
I will leave the initialization step and keeping track of the chosen subset for you to work on.
In your example (1,2,6,10) it works the following way:
j \ A[i]    1    2    6   10
1           -    -    -    -
2           -    1    5    9
3           -    -    1    4
4           -    -    -    1
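A minimal Python sketch of this DP follows. It assumes that the table entry for (j, i) means "best minimum gap when j elements are chosen and A[i] is the last one chosen", and it initializes the one-element row with infinity; both are choices of this sketch, as is the name `max_min_distance`. It returns only the optimal value, not the subset, and requires 2 <= k <= len(values).

```python
def max_min_distance(values, k):
    """Largest achievable minimum pairwise gap when choosing k of the values."""
    a = sorted(values)
    n = len(a)
    inf = float("inf")
    # best[j][i]: best min gap using j elements of a[0..i], with a[i] chosen last
    best = [[-inf] * n for _ in range(k + 1)]
    for i in range(n):
        best[1][i] = inf             # a single element has no gap yet
    for j in range(2, k + 1):
        for i in range(j - 1, n):
            best[j][i] = max(
                min(best[j - 1][p], a[i] - a[p]) for p in range(j - 2, i)
            )
    return max(best[k][i] for i in range(k - 1, n))
```

For the example array this reproduces the table rows: with k = 3 the entries for the last two elements are 1 and 4, so the answer is 4, achieved by [1, 6, 10]. The running time is O(n^2 * k).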
The basic idea is right, I think. You should start by sorting the array, then take the first and the last elements, then determine the rest.
I cannot think of a polynomial algorithm to solve this, so I would suggest one of the two options.
One is to use a search algorithm, branch-and-bound style, since you have a nice heuristic at hand: the upper bound for any solution is the minimum size of the gap between the elements picked so far, so the first guess (evenly spaced cells, as you suggested) can give you a good baseline, which will help prune most of the branches right away. This will work fine for smaller values of k, although the worst case performance is O(N^k).
The other option is to start with the same baseline, calculate the minimum pairwise distance for it and then try to improve it. Say you have a subset with minimum distance of 10, now try to get one with 11. This can be easily done by a greedy algorithm -- pick the first item in the sorted sequence such that the distance between it and the previous item is bigger-or-equal to the distance you want. If you succeed, try increasing further, if you fail -- there is no such subset.
The latter solution can be faster when the array is large and k is relatively large as well, but the elements in the array are relatively small. If they are bound by some value M, this algorithm will take O(N*M) time, or, with a small improvement, O(N*log(M)), where N is the size of the array.
As Evgeny Kluev suggests in his answer, there is also a good upper bound on the maximum pairwise distance, which can be used in either one of these algorithms. So the complexity of the latter is actually O(N*log(M/k)).
You can do this in O(n*(log n) + n*log(M)), where M is max(A) - min(A).
The idea is to use binary search to find the maximum separation possible.
First, sort the array. Then, we just need a helper function that takes in a distance d, and greedily builds the longest subarray possible with consecutive elements separated by at least d. We can do this in O(n) time.
If the generated array has length at least k, then the maximum separation possible is >=d. Otherwise, it's strictly less than d. This means we can use binary search to find the maximum value. With some cleverness, you can shrink the 'low' and 'high' bounds of the binary search, but it's already so fast that sorting would become the bottleneck.
Python code:
from typing import List

def maximize_distance(nums: List[int], k: int) -> List[int]:
    """Given an array of numbers and size k, uses binary search
    to find a subset of size k with maximum min-pairwise-distance"""
    assert len(nums) >= k
    if k == 1:
        return [nums[0]]
    nums.sort()

    def longest_separated_array(desired_distance: int) -> List[int]:
        """Given a distance, returns a subarray of nums
        of length k with pairwise differences at least that distance (if
        one exists)."""
        answer = [nums[0]]
        for x in nums[1:]:
            if x - answer[-1] >= desired_distance:
                answer.append(x)
                if len(answer) == k:
                    break
        return answer

    low, high = 0, (nums[-1] - nums[0])
    while low < high:
        mid = (low + high + 1) // 2
        if len(longest_separated_array(mid)) == k:
            low = mid
        else:
            high = mid - 1
    return longest_separated_array(low)
I suppose your set is ordered. If not, my answer will be changed slightly.
Let's suppose you have an array X = (X1, X2, ..., Xn)
Energy(Xi) = min(|X(i-1) - Xi|, |X(i+1) - Xi|), 1 < i <n
j <- 1
while j < n - k do
    X.Exclude(min(Energy(Xi)), 1 < i < n)
    j <- j + 1
    n <- n - 1
end while
$length = length($array);
sort($array); //sorts the list in ascending order
$differences = ($array << 1) - $array; //gets the difference between each value and the next largest value
sort($differences); //sorts the list in ascending order
$max = ($array[$length-1]-$array[0])/$M; //this is the theoretical max of how large the result can be
$result = array();
$count = 0;
for ($i = 0; $i < $length-1; $i++){
    $count += $differences[$i];
    if ($length-$i == $M - 1 || $count >= $max){ //if there are either no more coins that can be taken or we have gone above or equal to the theoretical max, add a point
        $result.push_back($count);
        $count = 0;
        $M--;
    }
}
return min($result);
For the non-code people: sort the list, find the differences between each 2 sequential elements, sort that list (in ascending order), then loop through it summing up sequential values until you either pass the theoretical max or there aren't enough elements remaining; then add that value to a new array and continue until you hit the end of the array. Then return the minimum of the newly created array.
This is just a quick draft though. At a quick glance any operation here can be done in linear time (radix sort for the sorts).
For example, with 1, 4, 7, 100, and 200 and M=3, we get:
$differences = 3, 3, 93, 100
$max = (200-1)/3 ~ 67
then we loop:
$count = 3, 3+3=6, 6+93=99 > 67 so we push 99
$count = 100 > 67 so we push 100
min(99,100) = 99
It is a simple exercise to convert this to the set solution that I leave to the reader (P.S. after all the times reading that in a book, I've always wanted to say it :P)

Interview question - Finding numbers

I just got this question on a SE position interview, and I'm not quite sure how to answer it, other than brute force:
Given a natural number N, find two numbers, A and P, such that:
N = A + (A+1) + (A+2) + ... + (A+P-1)
P should be the maximum possible.
Ex: For N=14, A = 2 and P = 4
N = 2 + (2+1) + (2+2) + (2+4-1)
N = 2 + 3 + 4 + 5
Any ideas?
If N is even/odd, we need an even/odd number of odd numbers in the sum. This already halves the number of possible solutions. E.g. for N=14, there is no point in checking any combinations where P is odd.
Rewriting the formula given, we get:
N = A + (A+1) + (A+2) + ... + (A+P-1)
= P*A + 1 + 2 + ... + (P-1)
= P*A + (P-1)P/2 *
= P*(A + (P-1)/2)
= P/2*(2*A + P-1)
The last line means that N must be divisible by P/2, this also rules out a number of possibilities. E.g. 14 only has these divisors: 1, 2, 7, 14. So possible values for P would be 2, 4, 14 and 28. 14 and 28 are ruled out for obvious reasons (in fact, any P above N/2 can be ignored).
This should be a lot faster than the brute-force approach.
(* The sum of the first n natural numbers is n(n+1)/2)
With interview questions, it is often wise to think about the probable purpose of the question. If I were asking you this question, it would not be because I think you know the solution, but because I want to see you find the solution. Reformulating the problem, drawing implications, working out what is known, ... this is what I would like to see.
If you just sit and tell me "I do not know how to solve it", you immediately fail the interview.
If you say: I know how to solve it by brute force, and I am aware it will probably be slow, I will give you some hints or help you get started. If that does not help, you most likely fail (unless you show some extraordinary skills to compensate for the fact that you are probably lacking something in the field of general problem analysis, e.g. you show how to implement a solution parallelized for many cores or implemented on a GPU).
If you bring me a ready solution, but you are unable to derive it, I will give you another similar problem, because I am not interested in the solution, I am interested in your thinking.
A + (A+1) + (A+2) + ... + (A+P-1) simplifies to P*A + (P*(P-1)/2), or equivalently P*(A+(P-1)/2).
Multiplying by 2 gives 2N = P*(2A + P - 1), so P must divide 2N (not necessarily N itself: for N=14 the answer is P=4). Thus, you could just enumerate all divisors of 2N, and then test each divisor P as follows:
Is A = (N-(P*(P-1)/2))/P (the first simplification solved for A) an integer? (I assume A should be an integer, otherwise the problem would be trivial.) If so, return it as a solution.
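As a rough sketch of this test, the code below tries candidate lengths P from the largest feasible one downward and returns the first P for which A is a positive integer. It assumes A >= 1 (so that P*(P+1)/2 <= N bounds P by roughly sqrt(2N)); the function name `longest_consecutive_run` is made up for illustration.

```python
from math import isqrt

def longest_consecutive_run(n):
    """Return (a, p) with n == a + (a+1) + ... + (a+p-1), a >= 1, p maximal."""
    # Largest p with p*(p+1)/2 <= n, i.e. the longest run that could start at 1.
    p = (isqrt(8 * n + 1) - 1) // 2
    while p >= 1:
        rest = n - p * (p - 1) // 2      # this must equal p * a
        if rest % p == 0:
            return rest // p, p
        p -= 1
    return None                           # unreachable for n >= 1 (p = 1 always works)
```

Since P is bounded by about sqrt(2N), this runs in O(sqrt(N)) steps. For N = 14 it returns A = 2 and P = 4, matching the example in the question.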
Can be solved using 0-1 Knapsack solution .
Observation : N/2 + N/2 + 1 > N
so our series is 1,2,...,N/2
Consider the constraints of W=N and vi =1 for all elements, I think this trivially maps to 0-1 knapsack, O(n^2)
Here is an O(n) solution.
It uses the formula for the sum of an arithmetic progression:
S = number_of_terms*(first_term + last_term)/2
Here our sum is N, the number of terms is P, the first term is A and the last term is A+P-1.
Rearranging the above equation gives 2N = 2AP + P^2 - P, so we can iterate P downwards from n and stop at the first P that gives a valid A.
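def solve(n, p):
    # A from 2N = 2AP + P^2 - P (integer division; condition() rejects inexact results)
    return (2*n - p**2 + p) // (2*p)

def condition(n, p, a):
    # valid when the equation holds exactly and A is positive
    return 2*n == 2*a*p + p**2 - p and a > 0

def find(n):
    # try the largest P first so that the first hit is the maximum; P = 0 is skipped
    for p in range(n, 0, -1):
        a = solve(n, p)
        if condition(n, p, a):
            return n, p, a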

Computing number of permutations of two values, with a restriction on runs

I was thinking about ways to solve this other question about counting the number of values whose digits sum to a target, and decided to try the case where the range was of the form [0, base^N). So essentially you get N independent digits to work with, which is a simpler problem.
The number of ways N natural numbers can sum to a target T is easy to compute. If you think of it as placing N-1 dividers among T sticks, you should see the answer is (T+N-1)!/(T!(N-1)!).
However, our N natural numbers are restricted to [0, base) and so there will be fewer possibilities. I want to find a simple formula for this case as well.
The first thing I considered was deducting the number of possibilities where 'base' of the sticks had been replaced with a 'big stick'. Unfortunately, some possibilities are double counted because they have multiple places a 'big stick' could be inserted.
Any ideas?
You can use generating functions.
Assuming that the order matters, then you are looking for the coefficient of x^T in
(1 + x + x^2 + ... + x^b)(1 + x + x^2 + ... + x^b) ... n times
= (x^(b+1) - 1)^n/(x-1)^n
Using the binomial theorem (which works even for the exponent -n), you should be able to write your answer as a sum of products of binomial coefficients.
Let b+1 = B.
Using binomial theorem we have
(x^(b+1) - 1)^n = Sum_{r=0}^{n} (-1)^(n-r) * (n choose r) x^(Br)
1/(x-1)^n = (-1)^n/(1-x)^n = (-1)^n * Sum_{s>=0} (n+s-1 choose s) x^s
The signs combine as (-1)^(n-r) * (-1)^n = (-1)^r, so the answer we need is:
Sum (-1)^r * (n choose r)*(n+s-1 choose s)
over all r >= 0 and s >= 0 subject to the condition that
Br + s = T.
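As a quick numerical check of this kind of formula, the sketch below evaluates the alternating sum with the sign (-1)^r, which is what the product of the two expansions gives once the factor (-1)^n from 1/(x-1)^n is taken into account. It can be compared against direct enumeration for small inputs. The function name is made up for illustration, and n >= 1 is assumed.

```python
from math import comb

def count_bounded_compositions(n, b, t):
    """Number of ordered n-tuples with entries in 0..b that sum to t (n >= 1)."""
    big = b + 1                          # B in the text
    total = 0
    # r counts how many of the n factors contribute x^B; s = t - B*r is the rest.
    for r in range(min(n, t // big) + 1):
        s = t - big * r
        total += (-1) ** r * comb(n, r) * comb(n + s - 1, s)
    return total
```

For instance, with n = 3 decimal digits (b = 9) and T = 10, the r = 0 term is 66 and the r = 1 term is -3, so the count is 63.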

Resources