Fewest subsets with sum less than N - algorithm

I have a specific sub-problem for which I am having trouble coming up with an optimal solution. This problem is similar to the subset sum group of problems as well as space filling problems, but I have not seen this specific problem posed anywhere. I don't necessarily need the optimal solution (as I am relatively certain it is NP-hard), but an effective and fast approximation would certainly suffice.
Problem: Given a list of positive valued integers find the fewest number of disjoint subsets containing the entire list of integers where each subset sums to less than N. Obviously no integer in the original list can be greater than N.
In my application I have many lists and I can concatenate them into columns of a matrix as long as they fit in the matrix together. For downstream purposes I would like to have as little "wasted" space in the resulting ragged matrix, hence the space filling similarity.
Thus far I am employing a greedy-like approach, processing from the largest integers down and finding the largest integer that fits into the current subset under the limit N. Once the smallest integer no longer fits into the current subset I proceed to the next subset similarly until all numbers are exhausted. This almost certainly does not find the optimal solution, but was the best I could come up with quickly.
BONUS: My application actually requires batches, where there is a limit on the number of subsets in each batch (M). Thus the larger problem is to find the fewest batches where each batch contains M subsets and each subset sums to less than N.

Straight from Wikipedia (with some bold amendments):
In the bin packing problem, objects [Integers] of different volumes [values] must be
packed into a finite number of bins [sets] or containers each of volume V [summation of the subset < V] in
a way that minimizes the number of bins [sets] used. In computational
complexity theory, it is a combinatorial NP-hard problem.
https://en.wikipedia.org/wiki/Bin_packing_problem
As far as I can tell, this is exactly what you are looking for.

Related

Fixed radius nearest neighbours, with sets

I need to efficiently solve the following problem, a variant of the Fixed radius nearest neighbours problem:
Given a list of n sets S, where each set S[i] consists of (2-dimensional) input points, and a query point q: List the indices of all sets in S such that at least one point of the set is within distance 'r' of q.
Approaches involving range trees, k-d trees and similar data structures storing all the points solve this in running times similar to O(log(n) + k), where n is the total number of points, and k is the number of results (points) returned. My problem is that each set is quite large, and while I can deal with large values of n, large values of k make my algorithm run very slowly and consume prohibitive amounts of space, when I actually only need the indices of the valid sets rather than all of the individual points or nearest point in each set.
If I make randomized k-d trees of each set, and then query for q in each set, (correct me if I'm wrong), I can solve the problem in O(m*log(n/m)) amortized time, where m is the number of the sets, which is a significant improvement over the first approach, but before implementing it, I wonder if there are other better practical ways of solving the problem, especially as m and n can become 10x or more of what they are now, and I am also concerned about the space/memory used in this approach. Elements could also be added to the sets, which may make the k-d trees unbalanced, and may require frequent reconstructions.
Other approaches I've tried involve partitioning the 2-d space into grids, and then using bloom filters, (and taking their union), but that that takes a prohibitive amount of space, and I still need to query for m sets. I also can't use a disjoint set to compute unions, because the points in each partition are not disjoint, and cannot be made disjoint.
Current values I am working with:
Total number of points: 250 million (could become 10x larger)
Number of sets: 50,000
The number of points in a set are thus, on average, ~5,000, but there are sets having 200,000+ points.
Values of k (number of matching points), for radii of interest: up to 40 million when there are 250 million points. The points are very densely clustered in some places. Even for such a large value of k, the number of matching sets is only 30,000 or so.
I'd welcome an approach which involves "once you've found any point in the set within the radius, don't bother about processing other points in the set." Any other approach that solves this problem efficiently, is of course, equally welcome.
I don't have to store the entire data structure in memory, I can store the structure in a database and retrieve parts that are needed.
On a side note, I'd also appreciate if someone could point me to a well-tested k-d tree implementation in Java, which at least works well for 2 dimensions, and serializes and deserializes properly.

Algorithms for bucketizing integers into buckets with zero sums

Suppose we have an array of integers (both negative and positive) A[1 ... n] such that all the elements sum to zero. Now, whenever I have a bunch of integers that sum to zero, I will call them a group and I want to split A in as many disjoint groups as possible. Can you suggest any paper discussing this very same problem?
It sounds like your problem consists of two NP-Complete problems.
The first would be finding all subsets that solve the Subset Sum problem. This problem does have an exponential time complexity (as implied by amit in the comments), but it is a very reasonable extension of the Subset Sum problem from a theoretical standpoint. For example, if you can solve the Subset Sum problem by dynamic programming and generate the canonical 2D array as a result, this array will contain enough information to generate all possible solutions using a traceback.
The second NP-Complete problem embedded within your problem is the Integer Linear Programming problem. Given all possible subsets solving the Subset Sum problem, N total, we want to select select 0<=n<=N, such that the value of n is maximized and no element of A is repeated.
I doubt there is a publication devoted to describing this problem because it seems to involve a straightforward application of known theory.

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is way to do it (from what i have understood of problem) : -
select the point k from dataset and calculate sorted list of points in ascending order of their distance from k in O(NlogN).
Keeping k as mean add the points from sorted list into set till variance < C and then stop.
Do this for all points
Keep track of set which is largest.
Time Complexity:- O(N^2*logN) where N is size of dataset
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.

Algorithms with superexponential runtime?

I was talking with a student the other day about the common complexity classes of algorithms, like O(n), O(nk), O(n lg n), O(2n), O(n!), etc. I was trying to come up with an example of a problem for which solutions whose best known runtime is super-exponential, such as O(22n), but still decidable (e.g. not the halting problem!) The only example I know of is satisfiability of Presburger arithmetic, which I don't think any intro CS students would really understand or be able to relate to.
My question is whether there is a well-known problem whose best known solution has runtime that is superexponential; at least ω(n!) or ω(nn). I would really hope that there is some "reasonable" problem meeting this description, but I'm not aware of any.
Maximum Parsimony is the problem of finding an evolutionary tree connecting n DNA sequences (representing species) that requires the fewest single-nucleotide mutations. The n given sequences are constrained to appear at the leaves; the tree topology and the sequences at internal nodes are what we get to choose.
In more CS terms: We are given a bunch of length-k strings that must appear at the leaves of some tree, and we have to choose a tree, plus a length-k string for each internal node in the tree, so as to minimise the sum of Hamming distances across all edges.
When a fixed tree is also given, the optimal assignment of sequences to internal nodes can be determined very efficiently using the Fitch algorithm. But in the usual case, a tree is not given (i.e. we are asked to find the optimal tree), and this makes the problem NP-hard, meaning that every tree must in principle be tried. Even though an evolutionary tree has a root (representing the hypothetical ancestor), we only need to consider distinct unrooted trees, since the minimum number of mutations required is not affected by the position of the root. For n species there are 3 * 5 * 7 * ... * (2n-5) leaf-labelled unrooted binary trees. (There is just one such tree with 3 species, which has a single internal vertex and 3 edges; the 4th species can be inserted at any of the 3 edges to produce a distinct 5-edge tree; the 5th species can be inserted at any of these 5 edges, and so on -- this process generates all trees exactly once.) This is sometimes written (2n-5)!!, with !! meaning "double factorial".
In practice, branch and bound is used, and on most real datasets this manages to avoid evaluating most trees. But highly "non-treelike" random data requires all, or almost all (2n-5)!! trees to be examined -- since in this case many trees have nearly equal minimum mutation counts.
Showing all permutation of string of length n is n!, finding Hamiltonian cycle is n!, minimum graph coloring, ....
Edit: even faster Ackerman functions. In fact they seems without bound function.
A(x,y) = y+1 (if x = 0)
A(x,y) = A(x-1,1) (if y=0)
A(x,y) = A(x-1, A(x,y-1)) otherwise.
from wiki:
A(4,3) = 2^2^65536,...
Do algorithms to compute real numbers to a certain precision count? The formula for the area of the Mandelbrot set converges extremely slowly; 10118 terms for two digits, 101181 terms for three.
This is not a practical everyday problem, but it's a way to construct relatively straightforward problems of increasing complexity.
The Kolmogorov complexity K(x) is the size of the smallest program that outputs the string $x$ on a pre-determined universal computer U. It's easy to show that most strings cannot be compressed at all (since there are more strings of length n than programs of length n).
If we give U a maximum running time (say some polynomial function P), we get a time-bounded Kolmogorov complexity. The same counting argument holds: there are some strings that are incompressible under this time bounded Kolmogorov complexity. Let's call the first such string (of some length n) xP
Since the time-bounded Kolmogorov complexity is computable, we can test all strings, and find xP
Finding xP can't be done in polynomial time, or we could use this algorithm to compress it, so finding it must be a super-polynomial problem. We do know we can find it in exp(P) time, though. (Jumping over some technical details here)
So now we have a time-bound E = exp(P). We can repeat the procedure to find xE, and so on.
This approach gives us a decidable super-F problem for every time-constructible function F: find the first string of length n (some large constant) that is incompressible under time-bound F.

Algorithm for generating a size k error-correcting code on n bits

I want to generate a code on n bits for k different inputs that I want to classify. The main requirement of this code is the error-correcting criteria: that the minimum pairwise distance between any two encodings of different inputs is maximized. I don't need it to be exact - approximate will do, and ease of use and speed of computational implementation is a priority too.
In general, n will be in the hundreds, k in the dozens.
Also, is there a reasonably tight bound on the minimum hamming distance between k different n-bit binary encodings?
The problem of finding the exact best error-correcting code for given parameters is very hard, even approximately best codes are hard. On top of that, some codes don't have any decent decoding algorithms, while for others the decoding problem is quite tricky.
However, you're asking about a particular range of parameters where n ≫ k, where if I understand correctly you want a k-dimensional code of length n. (So that k bits are encoded in n bits.) In this range, first, a random code is likely to have very good minimum distance. The only problem is that decoding is anywhere from impractical to intractible, and actually calculating the minimum distance is not that easy either.
Second, if you want an explicit code for the case n ≫ k, then you can do reasonably well with a BCH code with q=2. As the Wikipedia page explains, there is a good decoding algorithm for BCH codes.
Concerning upper bounds for the minimum Hamming distance, in the range n ≫ k you should start with the Hamming bound, also known as the volume bound or the sphere packing bound. The idea of the bound is simple and beautiful: If the minimum distance is t, then the code can correct errors up to distance floor((t-1)/2). If you can correct errors out to some radius, it means that the Hamming balls of that radius don't overlap. On the other hand, the total number of possible words is 2n, so if you divide that by the number of points in one Hamming ball (which in the binary case is a sum of binomial coefficients), you get an upper bound on the number of error-free code words. It is possible to beat this bound, but for large minimum distance it's not easy. In this regime it's a very good bound.

Resources