How to divide a set of values into two sets of fixed size, such that their sums approach a particular value - algorithm

I have a set of 18 values (it will always be 18) which I need to distribute into two sets, one of 10 items, and one of 8 items.
The rule for distribution is that the sum of each set must be equal (or as close as possible) to a particular known value: in the first set the sum of the values must be as close as possible to 1500000, and in the second set the sum of the values must be as close as possible to 1000000.
What is the best (and that may mean simplest) algorithm to do this?
Further clarification: the values all range between 110000 and 200000. The values are always multiples of 100, are all positive integers, and there can be duplicates.

There are only 43758 such selections. Go through each of them and find the best.

It is an optimization problem. Here you have two optimization criteria, which should be combined into a single one, for example like this:
F(A, B) = w1*abs(sum(A) - 1500000) + w2*abs(sum(B) - 1000000)
where A and B are your sets, sum() is the sum of the elements in a set, and w1 and w2 are weights.
Then you need a strategy for iterating over the possible combinations. The simplest strategy is to generate all 10-combinations of the 18 values and select the one that minimizes F(A, B). There are C(18,10) = 43758 combinations.
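A minimal brute-force sketch of this in Python (function and parameter names are mine; it just enumerates the C(18,10) splits and minimizes F):

    from itertools import combinations

    def best_split(values, w1=1.0, w2=1.0, t1=1500000, t2=1000000):
        # enumerate all C(18, 10) = 43758 choices for set A; set B is the remainder
        best = None
        for a_idx in combinations(range(len(values)), 10):
            chosen = set(a_idx)
            a = [values[i] for i in a_idx]
            b = [values[i] for i in range(len(values)) if i not in chosen]
            f = w1 * abs(sum(a) - t1) + w2 * abs(sum(b) - t2)
            if best is None or f < best[0]:
                best = (f, a, b)
        return best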

While brute force is probably best for this problem size, there are other tricks you can play if you're willing to accept an approximate solution or if the brute force method is still too expensive. The basic idea is to snap the values to a small grid, and then do brute force on the (much smaller) set of grid entries.
In your case (pretending I've already divided by 100), all numbers are between 1100 and 2000, so you can "snap" them to the 10 integers 1100, 1200 and so on. The maximum error in doing so is at most 50/1100, which is less than 5%. Now the values take at most 10 distinct sizes, which makes the brute force run a bit faster.
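The snapping step itself is tiny; a sketch, assuming values holds the original numbers from the question:

    scaled = [v // 100 for v in values]               # exact, since the originals are multiples of 100
    snapped = [100 * round(s / 100) for s in scaled]  # grid points 1100, 1200, ..., 2000; error at most 50/1100 per value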
Again, I wouldn't recommend this unless (a) brute force is really slow right now or (b) the problem size increases beyond 18.
P.S. The problem is called SUBSET SUM (or sometimes KNAPSACK, depending on the formulation) and is NP-complete. Here's a reference for the approximation idea.

Your problem, as stated, is NP-hard unless there is a pattern to the data.
The only way to guarantee the best answer is to enumerate all ways of splitting the 18 values into a set of 10 and a set of 8 and compute the associated sums. Weight them according to your preference.

Looks like an optimization problem to me. Randomly separate the values into the two sets, and then start swapping values (use a good heuristic), and accept the change if the result is better.
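A rough sketch of that idea (random start, then keep improving swaps; the swap-only move and the iteration count are my assumptions):

    import random

    def local_search(values, targets=(1500000, 1000000), iters=20000):
        vals = values[:]
        random.shuffle(vals)
        a, b = vals[:10], vals[10:]                       # random initial 10/8 split

        def cost():
            return abs(sum(a) - targets[0]) + abs(sum(b) - targets[1])

        best = cost()
        for _ in range(iters):
            i, j = random.randrange(len(a)), random.randrange(len(b))
            a[i], b[j] = b[j], a[i]                       # a swap keeps both set sizes fixed
            c = cost()
            if c < best:
                best = c                                  # keep the improving swap
            else:
                a[i], b[j] = b[j], a[i]                   # revert
        return a, b, best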

Related

Finding combination of subsets with largest intersection

Suppose I have a list of n sets, is there an efficient way of computing the combination of r sets whose intersection is larger than any other combination of r sets in the list?
In particular, I have a list of about 6000 sets of strings and I want to choose about 9 sets from this list such that they share the most strings out of all combinations of 9 sets. The problem is that if I were to brute-force it and compute all of the combinations and look for the max intersection it would take (6000 choose 9) ~ 3e28 computations, so I need either a more efficient algorithm or some reliable heuristic.
In addition, I would like to extend this question, if possible, to choosing a variable r such that the total element size of any combination is less than some arbitrary threshold, as opposed to choosing a constant r. That is, instead of just choosing 9 sets from the list of 6000 to make a combination, the algorithm would add sets of strings until the total number of strings in the combination exceeds some threshold, say 40 strings.
This is a lot closer to what I originally wanted to do, but I realized that taking constant size combinations of 9 sets works somewhat decently for the list I'm working with and would probably be a lot easier to implement, although an algorithm that could accomplish this would be preferable. From what I can gather, this problem is similar to the knapsack problem, although the only efficient way I know of solving that is with dynamic programming and I'm not sure how I would implement dynamic programming in this case given that I have to compute the running intersection to get the weights instead of having a pre-computed list of weights as with regular knapsack.
Here's an idea for you:
Choose 9 sets at random from the 6000. For the remaining 5991 sets, determine which of the 9 sets to replace to maximize the intersection (or don't use the new set if it provides no improvement). That's roughly 6000 * 9 operations.
Do this K times, keeping track of the answers. Then find the set that occurs in most answers (the winning set). Repeat the whole process, but now the randomly chosen starting set must include the winning set, and the winning set may not be replaced.
Repeat until 8 of the 9 sets are winners. Then the 9th set is whatever set maximizes the intersection. You could also terminate early if all the answers are identical.
The run time is roughly 6000 * 9 * K * 8 operations, where one operation computes the intersection of 9 sets.
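A sketch of one such pass in Python (all_sets is assumed to be a list of Python sets; the repetition K times and the locking-in of winning sets are omitted, and the names are mine):

    import random

    def one_pass(all_sets, k=9):
        # start from k random sets, then try each remaining set as a replacement
        current = random.sample(range(len(all_sets)), k)
        for cand in range(len(all_sets)):
            if cand in current:
                continue
            best_size = len(set.intersection(*(all_sets[i] for i in current)))
            best_pos = None
            for pos in range(k):                      # which of the k members to replace?
                trial = current[:pos] + [cand] + current[pos + 1:]
                size = len(set.intersection(*(all_sets[i] for i in trial)))
                if size > best_size:
                    best_size, best_pos = size, pos
            if best_pos is not None:                  # keep the new set only if it improves the intersection
                current[best_pos] = cand
        return current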

a dynamic program about subarray sum

I've seen a problem about a dynamic program,
like this:
Let's say there is an array like this: [600, 500, 300, 220, 210]
I want to find a subarray whose sum is the closest to 1000 and at least 1000 (>= 1000).
How can I write the code? I already understand the 0/1 knapsack problem but still cannot work out this one.
A few things:
First, I think you are referring to "dynamic programming", not "a dynamic program"; read up here if you want to know the difference: https://en.wikipedia.org/wiki/Dynamic_programming
Second, I think you mean "closest to 1000 but NOT bigger than it (< 1000)", since that is the general constraint. If you were allowed to go over 1000, then the problem doesn't make sense because there is no constraint.
Like the knapsack problem, this is an NP-hard problem: no polynomial-time algorithm is known, and in the worst case you would have to check every possible combination of numbers, which can take a long time for seemingly small set sizes.
I believe that the correct answer from the 5 you provided is 500+220+210, which sums to 930, the largest that you can make without going over 1000.
The basic idea of dynamic programming is to break the problem into smaller, similar problems that are more easily computable; for example, if you had a million numbers and wanted to find the subset closest to 100000 but not over, you might divide the million numbers into 100,000 subsets of 10 elements, find the combination in each subset closest to a proportionally smaller target, then use the resulting 100,000 sums to repeat the process with 10,000 sets, and so on, until you reduce it to a close-but-not-perfect solution.
Used this way, the method only builds a close approximation, since the solution isn't guaranteed to be optimal.
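For what it's worth, a tiny brute-force check of the 930 claim on the example array:

    from itertools import combinations

    values = [600, 500, 300, 220, 210]
    candidates = [c for r in range(1, len(values) + 1)
                  for c in combinations(values, r) if sum(c) <= 1000]
    print(max(candidates, key=sum))   # (500, 220, 210), which sums to 930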
You can use the transaction optimizer from the EmerCoin wallet.
It does exactly what you're looking for.
An approach to solve this problem can be done in two steps:
define a function which takes a subarray and gives you an evaluation or a score of this subarray so that you can actually compare subarrays and take the best. A function could be simply
    if (sum(subarray) < 1000) return INFINITY
    else return sum(subarray) - 1000
note that you can also use dynamic programming to compute the sum of subarrays
Assuming that the length of your goal array is N, you will need to solve the problems of size 1 to N. If the array's length is 1, then there is obviously only one possibility and it's the best. If size > 1, take the solution of the problem of length size - 1, compare it with every subarray containing the last element of the array, and take the best subarray as the solution of the problem of length size.
I hope my explanation makes sense
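A minimal sketch of the approach above (my rendering, treating "subarray" as a contiguous slice and using the scoring function as given):

    def score(subarray, target=1000):
        s = sum(subarray)
        return float("inf") if s < target else s - target

    def best_subarray(arr, target=1000):
        best = None
        for i in range(len(arr)):
            for j in range(i, len(arr)):
                sub = arr[i:j + 1]                           # contiguous subarray arr[i..j]
                if best is None or score(sub, target) < score(best, target):
                    best = sub
        return best

    print(best_subarray([600, 500, 300, 220, 210]))          # [500, 300, 220], sum 1020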

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
1. Sort the n integers (for millions of integers, i.e. 10s to 100s of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total offset.
2. Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
3. On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
4. Restrict your set of n integers to those less than or equal to this bound.
5. Repeat (go to step 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
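A rough Python sketch of those steps (P is the condition as a callable; using a linear filter in step 4 instead of an explicit binary search is a simplification):

    import random

    def probabilistic_search(values, P, k=5, max_misses=10000):
        vals = sorted(values)
        offset = sum(vals[:k - 1])                        # step 1: sum of the k-1 smallest
        best = None
        misses = 0
        while misses < max_misses and len(vals) >= k:
            subset = random.sample(vals, k)               # step 2: random k-subset
            if P(subset):
                total = sum(subset)
                if best is None or total < sum(best):
                    best = subset
                bound = total - offset                    # step 3: bound on the largest usable element
                vals = [v for v in vals if v <= bound]    # step 4: restrict the candidate pool
                misses = 0
            else:
                misses += 1                               # step 5: give up after max_misses consecutive misses
        return best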
Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe the runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds, or roughly 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of the (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take on the order of billions of years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
    minimize     sum_i w_i * x_i
    subject to   sum_i x_i = k
                 (some constraints on the w_i * x_i)
                 x_i binary
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
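If the condition really can be written as linear constraints, a sketch of this MIP using PuLP (my choice of modeling library; any MIP interface would do, and the constraint hook is a placeholder) might look like:

    import pulp

    def solve_k_subset(w, k, extra_constraints=None):
        prob = pulp.LpProblem("min_k_subset", pulp.LpMinimize)
        x = [pulp.LpVariable("x%d" % i, cat="Binary") for i in range(len(w))]
        prob += pulp.lpSum(w[i] * x[i] for i in range(len(w)))    # objective: minimize sum of chosen w_i
        prob += pulp.lpSum(x) == k                                # pick exactly k elements
        if extra_constraints is not None:
            for c in extra_constraints(x):                        # caller supplies the linear constraints on x
                prob += c
        prob.solve()
        return [i for i, xi in enumerate(x) if xi.value() == 1]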
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
    sort arr
    for i = 1 to N
        for v = 5-element subsets of (1, ..., i)
            set = arr{v}
            if condition(set) is satisfied
                break_loop = true
                compute sum(set), keep set if it is the best so far
        break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
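A Python rendering of that loop (condition is assumed to be a callable; only subsets containing the newest prefix element are checked, since the rest were already covered at smaller prefixes):

    from itertools import combinations

    def first_satisfying_subset(arr, condition, k=5):
        arr = sorted(arr)
        best = None
        for i in range(k, len(arr) + 1):                  # prefix arr[0..i-1]; arr[i-1] is the newest element
            found = False
            # subsets not containing arr[i-1] were already tried at smaller prefixes
            for idx in combinations(range(i - 1), k - 1):
                subset = [arr[j] for j in idx] + [arr[i - 1]]
                if condition(subset):
                    found = True
                    if best is None or sum(subset) < sum(best):
                        best = subset
            if found:
                break                                     # approximate stopping rule, as discussed above
        return best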
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

Algorithm to optimally group list of values

I have several numbers. I need to group them into several groups, so that the sum of the numbers in each group is between a predefined min and max. The point is to leave as few numbers ungrouped as possible.
Input:
min, max: range for sum of numbers
N1, N2, N3 ... Ni: numbers to group
Output:
[N1,N3,N5],[Ni,Nj,Nk,Nm...]...: groups where sum of numbers is between min and max
Na,Nb,Nc...: numbers left ungrouped.
This problem could be viewed as bin packing into bins of size max, with a funny objective: minimize the number of items not packed into bins holding at least min. One idea from the bin-packing literature is that the "small" items (in this case, items that are small relative to max - min) are easy to pack but are accountable for most of the combinatorial explosion of possibilities. Thus some approximation algorithms for bin packing do something clever for big items and then fill in with the small. Another way to reduce the number of possibilities is to round the numbers to belong to a smaller set. It's somewhat obvious how to do that for bin packing (round up), but it's not clear what to do for this problem.
Okay, I'll give an example of how these ideas could be instantiated. Suppose that max = 1 and min = 1/2. Let's try to find a solution that's competitive with the optimum for when max = 2 and min = 1/2. (That may sound terrible, but this sort of approximation guarantee where OPT is held to higher standards is sometimes used in the literature.)
First round every item's size up to a power of 2. Very large items, of size 4 or greater, can't be packed. Large items, of size 2 or 1 or 1/2, are given their own bins. Small items, of size 1/4 or less, are dealt with as follows. Whenever two items of size 1/4 or less have the same size, combine them into one super-item. Pack all of the new items of size 1/2 into their own bins. The remainder has total size less than 1/2. If there is space in another bin, put them there. Otherwise, give them their own bin.
The quality of the resulting solution for max = 2 is at least as good as the quality of OPT for max = 1. Take the optimal solution for max = 1 and round the item sizes. The set of bad bins remains the same, because no item is smaller, and each bin stores less than 2 because each item is less than twice as large as it used to be. Now it suffices to show that the packing algorithm I gave for powers of 2 is optimal. I'll leave that as an exercise.
I don't expect this instantly to generalize into a full algorithm. I have to get back to work, but the approach I would take would be to force OPT to deal with max = 1 while ALG gets to use max = 1 + epsilon, substitute powers of (1 + epsilon) for powers of two in the rounding step, and then figure out how to pack the small items, probably using a dynamic program since greed likely won't work.
If you're not worried about efficiency, simply generate each possible grouping and choose the one that is correct and optimal in the sense you describe. Clearly, this works for any finite list of numbers (and is, by definition, optimal).
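A sketch of that exhaustive approach (exponential, as expected; function and variable names are mine). Each number is either added to an existing group, starts a new group, or is left ungrouped; assignments where some group ends up below min are simply rejected, since the same numbers can be left ungrouped in another branch:

    def best_grouping(nums, lo, hi):
        best = {"groups": [], "ungrouped": list(nums)}     # baseline: everything left ungrouped

        def assign(i, groups, ungrouped):
            if i == len(nums):
                if all(lo <= sum(g) <= hi for g in groups) and len(ungrouped) < len(best["ungrouped"]):
                    best["groups"] = [list(g) for g in groups]
                    best["ungrouped"] = list(ungrouped)
                return
            x = nums[i]
            for g in groups:                               # option 1: add x to an existing group
                if sum(g) + x <= hi:
                    g.append(x)
                    assign(i + 1, groups, ungrouped)
                    g.pop()
            groups.append([x])                             # option 2: start a new group with x
            assign(i + 1, groups, ungrouped)
            groups.pop()
            ungrouped.append(x)                            # option 3: leave x ungrouped
            assign(i + 1, groups, ungrouped)
            ungrouped.pop()

        assign(0, [], [])
        return best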
If efficiency is desired, the problem seems to become somewhat more difficult. :D I'll keep thinking.
EDIT: Come to think of it, this problem seems at least as hard as "subset sum" and, as such, I don't think there is a solution significantly better than the one I give (i.e., no known polynomial-time algorithm can solve it, if it is NP-hard).

How to choose group of numbers in the vector

I have an application with some probabilities of measured features. I want to select the n best features from a vector. I have a vector of real numbers. The vector is normalized; the sum of all numbers is 1 (it represents the probabilities of some features).
I want to select a group of the n largest numbers, where n is less than N (assume approx. 8). The numbers have to be close together without gaps, and they should also have a large sum (the sum of the remaining numbers should be several times lower).
Any ideas how to accomplish that?
I tried using the 80% quantile (but it is not sensitive to relatively large gaps, like [0.2, 0.2, 0.01, 0.01, 0.001, 0.001 ... len ~ 100]), and I tried a threshold between two successive numbers, but nothing worked too well.
I have some partial solution at this moment but I am just wondering if there is some simple solution that I have overlooked.
John's answer is good. Also you might try
sort the probabilities
find the largest gap between successive probabilities
work up from there
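A minimal sketch of those three steps (probabilities stands for the input vector, sorted largest first):

    probs = sorted(probabilities, reverse=True)               # largest first
    gaps = [probs[i] - probs[i + 1] for i in range(len(probs) - 1)]
    cut = max(range(len(gaps)), key=lambda i: gaps[i])        # position of the largest drop
    selected = probs[:cut + 1]                                # everything above the largest gap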
From there, it's starting to sound like a pattern-recognition problem. My favorite method is Markov chain Monte Carlo (MCMC).
Edit: Since you clarified your question, my first thought is, since you only have 8 possible answers, develop a score for each one, based on how much probability it contains and whether or not it splits at a gap, and make a heuristic judgement.
Further edit: This sounds a bit like logistic regression. You want to find a value of P that effectively divides your set into members and non-members. For a given value of P, you can compute a log-likelihood for the ensemble, and choose P that maximizes that.
It sounds like you're wanting to select the n largest probabilities but the number n is flexible. If n were fixed, say n=10, you could just sort your vector and pull out the top 10 items. But from your example it sounds like you'd like to use a smaller value of n if there's a natural break in the data. Maybe you want to start with the largest probability and go down the list selecting items until the sum of the probabilities you pick crosses some threshold.
Maybe you have an implicit optimization problem where you want to maximize some probability with some penalty for large n. Try stating your problem that way. You might find your own answer, or you might be able to rephrase your question here in a way that helps other people give you a better answer.
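A sketch of that greedy idea (the threshold value and the cap on n are assumptions):

    def greedy_select(probs, threshold=0.8, n_max=8):
        picked = []
        for p in sorted(probs, reverse=True):             # go down the list from the largest
            picked.append(p)
            if sum(picked) >= threshold or len(picked) == n_max:
                break
        return picked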
I'm not really sure if this is what you want, but it seems you want to do the following.
Let's assume that the probabilities are x_1, ..., x_N in increasing order. Then you should try to find 1 <= i < j <= N such that the function
f(i,j) = (x_i + x_(i+1) + ... + x_j)/(x_j - x_i)
is maximized. This can be done naively in quadratic time.
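A naive quadratic sketch using prefix sums (equal endpoints are skipped to avoid dividing by zero):

    def best_dense_range(x):
        # x: the probabilities sorted in increasing order
        prefix = [0]
        for v in x:
            prefix.append(prefix[-1] + v)                 # prefix[j] = x[0] + ... + x[j-1]
        best_score, best_ij = float("-inf"), None
        for i in range(len(x)):
            for j in range(i + 1, len(x)):
                if x[j] == x[i]:
                    continue                              # skip zero denominators
                score = (prefix[j + 1] - prefix[i]) / (x[j] - x[i])
                if score > best_score:
                    best_score, best_ij = score, (i, j)
        return best_ij, best_score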
