Algorithm to optimally group list of values - algorithm

I have several numbers. I need to group them in several groups, so that sums of all numbers in one group are between predefined min and max. The point is to left as few numbers ungrouped as possible.
Input:
min, max: range for sum of numbers
N1, N2, N3 ... Ni: numbers to group
Output:
[N1,N3,N5],[Ni,Nj,Nk,Nm...]...: groups where sum of numbers is between min and max
Na,Nb,Nc...: numbers, left ingrouped.

This problem could be viewed as bin packing into bins of size max, with a funny objective: minimize the number of items not packed into bins holding at least min. One idea from the bin-packing literature is that the "small" items (in this case, items that are small relative to max - min) are easy to pack but are accountable for most of the combinatorial explosion of possibilities. Thus some approximation algorithms for bin packing do something clever for big items and then fill in with the small. Another way to reduce the number of possibilities is to round the numbers to belong to a smaller set. It's somewhat obvious how to do that for bin packing (round up), but it's not clear what to do for this problem.
Okay, I'll give an example of how these ideas could be instantiated. Suppose that max = 1 and min = 1/2. Let's try to find a solution that's competitive with the optimum for when max = 2 and min = 1/2. (That may sound terrible, but this sort of approximation guarantee where OPT is held to higher standards is sometimes used in the literature.)
First round every item's size up to a power of 2. Very large items, of size 4 or greater, can't be packed. Large items, of size 2 or 1 or 1/2, are given their own bins. Small items, of size 1/4 or less, are dealt with as follows. Whenever two items of size 1/4 or less have the same size, combine them into one super-item. Pack all of the new items of size 1/2 into their own bins. The remainder has total size less than 1/2. If there is space in another bin, put them there. Otherwise, give them their own bin.
The quality of the resulting solution for max = 2 is at least as good as the quality of OPT for max = 1. Take the optimal solution for max = 1 and round the item sizes. The set of bad bins remains the same, because no item is smaller, and each bin stores less than 2 because each item is less than twice as large as it used to be. Now it suffices to show that the packing algorithm I gave for powers of 2 is optimal. I'll leave that as an exercise.
I don't expect this instantly to generalize into a full algorithm. I have to get back to work, but the approach I would take would be to force OPT to deal with max = 1 while ALG gets to use max = 1 + epsilon, substitute powers of (1 + epsilon) for powers of two in the rounding step, and then figure out how to pack the small items, probably using a dynamic program since greed likely won't work.

If you're not worried about efficiency, simply generate each possible grouping and choose the one that is correct and optimal in the sense you describe. Clearly, this works for any finite list of numbers (and is, by definition, optimal).
If efficiency is desired, the problem seems to become somewhat more difficult. :D I'll keep thinking.
EDIT: Come to think of it, this problem seems at least as hard as "subset sum" and, as such, I don't think there is a solution significantly better than the one I give (i.e., no known polynomial-time algorithm can solve it, if it is NP-Hard.

Related

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
Sort the n integers (for millions of integers i.e. 10s to 100s of MB of ram, this should not be a problem), and sum the k-1 smallest. Call this total offset.
Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
Restrict your set of n integers to those less than or equal to this bound.
Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
Even if only 1 in 1000 of the k-sized sets meets your condition, That's still far too many combinations to test. I believe runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds. =10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant amount of (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k MIPS CPU, assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take 3 quadrillion years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min sum(i) w_i*x_i
x_i binary
sum over x_i = k
subject to (some constraints on w_i*x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
for v = 5-element subsets of (1, ..., i)
set = arr{v}
if condition(set) is satisfied
break_loop = true
compute sum(set), keep set if it is the best so far
break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

Algorithm for categorizing values

What would be the best algorithm to solve this problem? I spent a couple of hours on this problem. But couldn't sort it out.
A guy purchased a necklace and planned to make it into two pieces in such a way that the average brightness of each piece should be either greater than or equal to the original piece.
The criteria for dividing the necklaces are
1.The difference in number of pearls between the two pearls sets should not be greater than 10% of the number of pearls in the original necklace or 3 whichever is higher.
2.The difference between number of pearls in 2 necklaces should be minimum.
3.In case if the average brightness of any one of the necklace is less than the average brightness of the original set return 0 as output.
4.Two necklaces should have their average brightness greater than the original one and the difference between the average brightness of the two pieces is minimum.
5.The average brightness of each piece should be either greater than or equal to the original piece.
This problem is rather hard to do efficiently (in NP somewhere).
Say you had a set that averaged to X. That is, X = (x1 + x2 + ... + xn) / n.
Suppose you break it up into sets that average to S and T with s and t items in each set, respectively.
You can mathematically prove that if one of the averages, S or T, is greater than X, the other of the two must be less than X.
Hence, the two sets must have exactly the same brightness because that's the only way your conditions are satisfiable.
Knowing this, you're ending up with the sumset sum problem -- you want to find a subset that sums to exactly half of the sum of the entire set. That's a problem that's known to be hard. (It's been classified NP. And alright, it's not exactly the same as the subset sum problem, but if subtract the average of the full set from each of the brightness values, solving the subset sum problem will give you your answer. (Do the reverse to see how you can solve the subset sum problem from your problem.)
Hence, there's no fast way of doing this -- only approximations or exponential running times... However, maybe this will help. It mentions better running times if your weights (in your case, brightness levels) are bounded.

Mean and std dev. of large set of data using recursion

Lets say I have a large set of data .
Then I can divide it into two find mean of those two and calculate the mean of the last 2 values I get.
a) Is this the mean of the original big quantity ?
b) Can I do this sort of method for calculating standard deviation ??
a) only if the sets you divide into are always the same size, meaning that the original set size must be a power of 2.
For example, the mean of {6} is 6, and the mean of {3,6} is 4.5, but the mean of {3,6,6} is not 5.25, it's 5.
Certainly you could recursively divide into parts to calculate the sum, though, and divide by the total size at the end. Not sure if that does you any good.
b) no
For example, the s.d of {2} is 0, and the s.d. of {1} is 0, but the s.d of {1,2} is not 0.
Once you've calculated the mean of the whole set, you can recursively divide to calculate the sum square deviation from the mean, and as with the mean calculation, divide by the total size and take square root at the end. [Edit: in fact all you need to calculate s.d is the sumsquare, the sum, and the count. Forgot about that. So you don't have to calculate the mean first]
It is incorrect, but if you can express the mean and standard deviation of a set from the means, standard deviations, and size of the sets which that set is divided into.
Specifically, if m_x, s_x and n_x are the means, standard deviations, and sizes of x, and X is partitioned into many x's, then
n_X = sum_x(n_x)
m_X = sum_x(n_x m_x)/n_X
s_X^2 = (sum_x(n_x(s_x^2 + m_x^2)) - m_X)/n_X
assuming the standard deviation is of the form sum(x - mean(x))/n; if it is the sample unbiased estimator, just adjust the weights accordingly.
Sure you can. No need for equal sets, power of two. Pseudo code:
N1,mean1,s1;
N2,mean2,s2;
N12,mean12,s12;
N12 = N1+N2;
mean12 = ((mean1*N1) + (mean2*N2)) / N12;
s12 = sqrt( (s1*s1*N1 + s2*s2*N2) / N12 + N1*N2/(N12*N12)*(s1-s2)*(s1-s2) );
http://en.wikipedia.org/wiki/Weighted_mean
http://en.wikipedia.org/wiki/Standard_deviation#Combining_standard_deviations
On (a) - it's only precisely correct if you precisely divided the set into two. If there were an odd number of items, for instance, there is a slight weighting toward the smaller "half". The larger the set, the less significant the problem. However, the problem recurs for the smaller sets as you subdivide. You get very large error when dividing a set of three items into a single item and a pair - each item in the pair is only half as significant to the final result as the single item.
I don't see the gain, though. You still do as many additions. You even end up doing more divisions. More importantly, you access memory in a non-sequential order, leading to poor cache performance.
The usual approach for a mean and standard deviation is to first calculate the sum of all items, and the sum of the squares - both in the same loop. Old calculators used to handle this with running totals, also keeping count of the number of items as they went. At the end, those three values (n, sum-of-x and sum-of-x-squared) are all you need - the rest is just substitution into the standard formulae for the mean and standard deviation.
EDIT
If you're dead set on using recursion for this, look up "tail recursion". Mathematically, tail recursion and iteration are equivalent - different representations of the same thing. In implementation terms tail recursion might cause a stack overflow where iteration would work, but (1) some languages guarantee this will not happen (e.g. Scheme, Haskell), and (2) many compilers will handle this as an optimisation anyway (e.g. GCC for C or C++).

How to divide a set of values into two sets of fixed size, such that their sums approach a particular value

I have a set of 18 values (it will always be 18) which I need to distribute into two sets, one of 10 items, and one of 8 items.
The rule for distribution is that the values of each set must be equal (or as close as possible) to a particular known value - so in the first set the sum of the values must be as close as possible to 1500000 and in the second set the sum iof the values must be as close as possible to 1000000.
What is the best (and that may mean simplest) algorithm to do this?
Further clarification, the values all range between 110000 and 200000. The values are always multiples of a 100 and are all positive integers, and there can be duplicates.
There are only 43758 such selections. Go through each of them and find the best.
It is an optimization problem. Here you have two optimization criteria, which should be combine to single one. For example like this:
F(A, B) = w1*abs(sum(A) - 1500000) + w2*abs(sum(B) - 1000000)
where A and B your sets, sum() is a sum of elements in a set, and w1 and w2 is weights.
Then you should find a strategy for iteration over possible combinations. The simpliest strategy is to find all 10-combinations of 18, and select that one which minimize F(A,B). There are C(18,10) = 43758 combinations.
While brute force is probably best for this problem size, there are other tricks you can play if you're willing to get an approximate solution or if the brute force method is still too expensive. the basic idea is to snap the values to a small grid, and then do brute force on the (much smaller) set of entries in the grid.
in your case, (pretending I've already divided by 100), all numbers are between 1100 and 2000, so you can "snap" them to the 10 integers 1100, 1200 and so on. The maximum error in doing is at most 50/1100 which is less than 5%. Now you've halved the input size, which makes the brute force run a bit faster.
Again, I wouldn't recommend this unless (a) brute force is really slow right now or (a) the problem size increases beyond 18.
p.s the problem is called SUBSET SUM (or sometimes KNAPSACK depending on the formulation) and is NP-complete. Here's a reference for the approximation idea.
Your problem, as stated, is np unless there is a pattern to the data.
The only way to achieve the best answer is find all permutations of 18 into 10 and 8 and associated sums. Weight according to your preference.
Looks like an optimization problem to me. Randomly separate the values into the two sets, and then start swapping values (use a good heuristic), and accept the change if the result is better.

How to choose group of numbers in the vector

I have an application with some probabilities of measured features. I want to select n-best features from vector. I have a vector of real numbers. Vector is normalized, sum of all numbers is 1 (it is probability of some features).
I want to select group of n less than N (assume approx. 8) largest numbers. Numbers has to be close together without gaps and they're also should have large sum (sum of remaining numbers should be several times lower).
Any ideas how to accomplish that?
I tried to use 80% quantile (but it is not sensitive to relative large gaps like [0.2, 0.2, 0.01, 0.01, 0.001, 0.001 ... len ~ 100] ), I tried a some treshold between two successive numbers, but nothing work too good.
I have some partial solution at this moment but I am just wondering if there is some simple solution that I have overlooked.
John's answer is good. Also you might try
sort the probabilities
find the largest gap between successive probabilities
work up from there
From there, it's starting to sound like a pattern-recognition problem.My favorite method is markov-chain-monte-carlo(MCMC).
Edit: Since you clarified your question, my first thought is, since you only have 8 possible answers, develop a score for each one, based on how much probability it contains and whether or not it splits at a gap, and make a heuristic judgement.
Further edit: This sounds a bit like logistic regression. You want to find a value of P that effectively divides your set into members and non-members. For a given value of P, you can compute a log-likelihood for the ensemble, and choose P that maximizes that.
It sounds like you're wanting to select the n largest probabilities but the number n is flexible. If n were fixed, say n=10, you could just sort your vector and pull out the top 10 items. But from your example it sounds like you'd like to use a smaller value of n if there's a natural break in the data. Maybe you want to start with the largest probability and go down the list selecting items until the sum of the probabilities you pick crosses some threshold.
Maybe you have an implicit optimization problem where you want to maximize some probability with some penalty for large n. Try stating your problem that way. You might find your own answer, or you might be able to rephrase your question here in a way that helps other people give you a better answer.
I'm not really sure if this is what you want, but it seems you want to do the following.
Lets assume that the probabilities are x_1,...,x_N in increasing order. Then you should try to find 1<= i < j <= N such that the function
f(i,j) = (x_i + x_(i+1) + ... + x_j)/(x_j - x_i)
is maximized. This can be done naively in quadratic time.

Resources