Grouping an ordered dataset into a minimal number of clusters - algorithm

I have an ordered list of weighted items; the weight of each item is at most N.
I need to convert it into a list of clusters.
Each cluster should span a run of consecutive items, and the total weight of a cluster must not exceed N.
Is there an algorithm that does this while minimizing the total number of clusters and keeping their weights as even as possible?
E.g. the list [(a,5),(b,1),(c,2),(d,5)] with N=6 should be converted into [([a],5),([b,c],3),([d],5)]

Since the dataset is ordered, one possible approach is to assign a "badness" score to each possible cluster and use a dynamic program reminiscent of Knuth's word wrapping ( http://en.wikipedia.org/wiki/Word_wrap ) to minimize the sum of the badness scores. The badness function will let you explore tradeoffs between minimizing the number of clusters (larger constant term) and balancing them (larger penalty for deviating from the average number of items).
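For concreteness, here is a minimal sketch of that DP, assuming a badness made of a fixed per-cluster cost plus a quadratic penalty for deviating from a rough target weight (both the constant and the penalty shape are illustrative knobs, not part of the question):

```python
# Word-wrap-style DP over an ordered list of (name, weight) pairs.
# badness() is an assumption: a fixed cost per cluster plus a quadratic
# penalty for deviating from a rough target cluster weight.

def cluster(items, N, cluster_cost=1.0):
    weights = [w for _, w in items]
    n = len(weights)
    prefix = [0] * (n + 1)                      # prefix sums of the weights
    for i, w in enumerate(weights):
        prefix[i + 1] = prefix[i] + w

    min_clusters = -(-prefix[n] // N)           # lower bound on cluster count
    target = prefix[n] / max(1, min_clusters)   # rough target weight per cluster

    def badness(i, j):                          # cluster covering items[i:j]
        w = prefix[j] - prefix[i]
        if w > N:
            return float("inf")                 # infeasible cluster
        return cluster_cost + (w - target) ** 2

    best = [float("inf")] * (n + 1)             # best[j]: min total badness of items[:j]
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            b = best[i] + badness(i, j)
            if b < best[j]:
                best[j], back[j] = b, i

    clusters, j = [], n                         # reconstruct via back-pointers
    while j > 0:
        i = back[j]
        clusters.append(([name for name, _ in items[i:j]], prefix[j] - prefix[i]))
        j = i
    return clusters[::-1]

print(cluster([("a", 5), ("b", 1), ("c", 2), ("d", 5)], N=6))
# [(['a'], 5), (['b', 'c'], 3), (['d'], 5)]
```

Raising cluster_cost pushes the optimum toward fewer clusters; shrinking it (or sharpening the quadratic term) favours evenness. The DP is O(n^2) as written, which Knuth-style speedups can sometimes reduce.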

Your problem is under-specified.
The issue is that you are trying to optimize two different properties of the resulting data, and these properties may be in opposition to one another. For a given set of data, it may be that the most even distribution has many clusters, and that the smallest number of clusters has a very uneven distribution.
For example, consider: [(a,1),(b,1),(c,1),(d,1),(e,1)], N=2
The most even distribution is [([a],1),([b],1),([c],1),([d],1),([e],1)]
But the smallest number of clusters is [([a,b],2),([c,d],2),([e],1)]
How is an algorithm supposed to know which of these (or which clustering in between them) you want? You need to find some way to quantify the tradeoff that you are willing to accept between the number of clusters and the evenness of the distribution.
You can create an example with an arbitrarily large discrepancy between the two possibilities by creating any set with 2k + 1 elements, and assigning them all the value N/2. This will lead to the smallest number of clusters being k+1 clusters (k of 2 elements and 1 of 1) with a weight difference of N/2 between the largest and smallest clusters. And then the most even distribution for this set will be 2k + 1 clusters of 1 element each, with no weight difference.
Edit: Also, "evenness" itself is not a well-defined idea. Are you looking to minimize the largest absolute difference in weights between clusters, or the mean difference in weights, or the median difference in weights, or the standard deviation in weights?

Related

Fixed radius nearest neighbours, with sets

I need to efficiently solve the following problem, a variant of the Fixed radius nearest neighbours problem:
Given a list of n sets S, where each set S[i] consists of (2-dimensional) input points, and a query point q: List the indices of all sets in S such that at least one point of the set is within distance 'r' of q.
Approaches involving range trees, k-d trees and similar data structures storing all the points solve this in running times similar to O(log(n) + k), where n is the total number of points, and k is the number of results (points) returned. My problem is that each set is quite large, and while I can deal with large values of n, large values of k make my algorithm run very slowly and consume prohibitive amounts of space, when I actually only need the indices of the valid sets rather than all of the individual points or nearest point in each set.
If I build a randomized k-d tree for each set and then query each one with q, I believe (correct me if I'm wrong) I can solve the problem in O(m*log(n/m)) amortized time, where m is the number of sets. That is a significant improvement over the first approach, but before implementing it I wonder whether there are better practical ways to solve the problem, especially as m and n could grow to 10x or more of their current values, and I am also concerned about the space/memory this approach uses. Elements can also be added to the sets, which may make the k-d trees unbalanced and require frequent reconstruction.
Other approaches I've tried involve partitioning the 2-d space into grids and then using Bloom filters (and taking their union), but that takes a prohibitive amount of space, and I still need to query m sets. I also can't use a disjoint-set structure to compute unions, because the points in each partition are not disjoint and cannot be made disjoint.
Current values I am working with:
Total number of points: 250 million (could become 10x larger)
Number of sets: 50,000
The number of points in a set is thus, on average, ~5,000, but there are sets with 200,000+ points.
Values of k (number of matching points), for radii of interest: up to 40 million when there are 250 million points. The points are very densely clustered in some places. Even for such a large value of k, the number of matching sets is only 30,000 or so.
I'd welcome an approach along the lines of "once you've found any point in the set within the radius, don't bother processing the set's other points." Any other approach that solves this problem efficiently is, of course, equally welcome.
I don't have to store the entire data structure in memory, I can store the structure in a database and retrieve parts that are needed.
On a side note, I'd also appreciate if someone could point me to a well-tested k-d tree implementation in Java, which at least works well for 2 dimensions, and serializes and deserializes properly.
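A minimal sketch of the per-set early-exit query described above, using SciPy's cKDTree for illustration (whether it is fast and compact enough at the stated scale is untested, and it is Python rather than the requested Java):

```python
import numpy as np
from scipy.spatial import cKDTree

def build_trees(sets_of_points):
    """One k-d tree per set; rebuild a set's tree when it changes too much."""
    return [cKDTree(pts) for pts in sets_of_points]

def matching_set_indices(trees, q, r):
    """Indices of sets that have at least one point within distance r of q."""
    hits = []
    for idx, tree in enumerate(trees):
        # Nearest-neighbour query with a distance cap: it returns inf when no
        # point lies within r, so a single query per set answers "is any point
        # within r?" without enumerating all matching points.
        dist, _ = tree.query(q, k=1, distance_upper_bound=r)
        if np.isfinite(dist):
            hits.append(idx)
    return hits

# Tiny example (the real sets would be far larger).
rng = np.random.default_rng(0)
sets_of_points = [rng.uniform(0, 100, size=(1000, 2)) for _ in range(5)]
trees = build_trees(sets_of_points)
print(matching_set_indices(trees, np.array([50.0, 50.0]), r=2.0))
```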

Speed up computation of a sum over several subsets

Let's say I have a huge array of doubles w[] indexed from 0 to n-1.
I also have a list of m subsets of [0;n-1]. For each subset S, I am trying to compute the sum of w[i] for i in S.
Obviously I can compute this separately for each subset, which is going to be in O(m * n).
However is there any faster way to do this? I'm talking from a practical standpoint, as I think you can't have a lower asymptotic bound. Is it possible to pre-process all the subsets and store them in such a way that computing all the sums is faster?
Thanks!
Edit:
To give some order of magnitude, my n would be around 20 million, and m around 200.
For subsets that are dense (or nearly dense) you may be able to speed up the computation by computing a running sum of the elements. That is, create another array in parallel with w, where each element in the parallel array contains the sum of the elements of w up to that point.
To compute the sum for a dense subset, you take the starting and ending positions of the subset in the parallel array and subtract the running sum at the start from the running sum at the end. The difference between the two is (ignoring rounding errors) the sum for that subset.
For a nearly dense subset, you start by doing the same, then subtract off the values of the (relatively few) items in that range that aren't part of the set.
These may not produce exactly the same result as you'd get by naively summing the subset though. If you need better accuracy, you'd probably want to use Kahan summation for your array of running sums, and possibly preserve its error residual at each point, to be taken into account when doing the subtraction.
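A small sketch of the running-sum idea with numpy.cumsum (plain floating-point summation; the Kahan-style compensation mentioned above is not included):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(20_000_000)

# prefix[i] holds w[0] + ... + w[i-1], so a contiguous range [start, end)
# sums to prefix[end] - prefix[start].
prefix = np.concatenate(([0.0], np.cumsum(w)))

def range_sum(start, end):
    return prefix[end] - prefix[start]

def nearly_dense_sum(start, end, missing):
    """Sum over [start, end) minus the few excluded indices in `missing`."""
    return range_sum(start, end) - w[np.asarray(missing)].sum()

# Example: the subset is all of [1_000, 2_000_000) except three indices.
print(nearly_dense_sum(1_000, 2_000_000, missing=[5_000, 6_000, 7_000]))
```

The parallel array costs one extra double per element, roughly 160 MB at n of about 20 million.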

Fewest subsets with sum less than N

I have a specific sub-problem for which I am having trouble coming up with an optimal solution. This problem is similar to the subset sum group of problems as well as space filling problems, but I have not seen this specific problem posed anywhere. I don't necessarily need the optimal solution (as I am relatively certain it is NP-hard), but an effective and fast approximation would certainly suffice.
Problem: Given a list of positive integers, find the fewest disjoint subsets that together contain the entire list, where each subset sums to less than N. Obviously no integer in the original list can be greater than N.
In my application I have many lists and I can concatenate them into columns of a matrix as long as they fit in the matrix together. For downstream purposes I would like to have as little "wasted" space in the resulting ragged matrix, hence the space filling similarity.
Thus far I am employing a greedy-like approach, processing from the largest integers down and finding the largest integer that fits into the current subset under the limit N. Once the smallest integer no longer fits into the current subset I proceed to the next subset similarly until all numbers are exhausted. This almost certainly does not find the optimal solution, but was the best I could come up with quickly.
BONUS: My application actually requires batches, where there is a limit on the number of subsets in each batch (M). Thus the larger problem is to find the fewest batches where each batch contains M subsets and each subset sums to less than N.
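For reference, a sketch of the greedy described in the question (using the strict "sums to less than N" constraint as stated; the batching bonus is not handled):

```python
def greedy_subsets(values, N):
    """Greedy from the question: repeatedly take the largest remaining value
    that keeps the current subset's sum strictly below N; open a new subset
    once nothing more fits."""
    remaining = sorted(values, reverse=True)
    subsets = []
    while remaining:
        current, total, i = [], 0, 0
        while i < len(remaining):
            if total + remaining[i] < N:
                v = remaining.pop(i)
                current.append(v)
                total += v
            else:
                i += 1
        subsets.append(current)
    return subsets

print(greedy_subsets([5, 4, 4, 3, 2, 2, 1], N=8))
# e.g. [[5, 2], [4, 3], [4, 2, 1]]
```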
Straight from Wikipedia (with some amendments, shown in brackets):
In the bin packing problem, objects [Integers] of different volumes [values] must be packed into a finite number of bins [sets] or containers each of volume V [summation of the subset < V] in a way that minimizes the number of bins [sets] used. In computational complexity theory, it is a combinatorial NP-hard problem.
https://en.wikipedia.org/wiki/Bin_packing_problem
As far as I can tell, this is exactly what you are looking for.
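For what it's worth, the standard first-fit decreasing heuristic is a close cousin of the greedy described in the question and is known to use at most roughly 11/9 of the optimal number of bins (plus a small additive constant). A sketch, again with the strict "< N" constraint:

```python
def first_fit_decreasing(values, N):
    """First-fit decreasing: sort descending, put each value into the first
    open bin where it still fits (sum strictly below N), else open a new bin."""
    bins, sums = [], []
    for v in sorted(values, reverse=True):
        for i, s in enumerate(sums):
            if s + v < N:
                bins[i].append(v)
                sums[i] += v
                break
        else:
            bins.append([v])
            sums.append(v)
    return bins

print(first_fit_decreasing([5, 4, 4, 3, 2, 2, 1], N=8))
# e.g. [[5, 2], [4, 3], [4, 2, 1]]
```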

NumPy: Uniformly distributed N-dimensional samples

Suppose I have a list of ranges (in the form of lower and upper bounds, inclusive) ranges = [(lb1, ub1), (lb2, ub2)...] and a positive number k. Is there some way to sample k N-dimensional vectors (N is given by len(ranges)) from the N-dimensional interval given by ranges such that the samples cover the interval as evenly as possible?
I have no definition of "evenly"; it's just intuitive (perhaps that the distances between "neighboring" points are similar). I'm not looking for a precise algorithm (which is not possible without the definition) but rather for ideas on how to do this that are nice in python/numpy.
I'm (probably) not looking for just random sampling which could very easily create unwanted clusters of samples, but the algorithm can definitely be stochastic.
If the points are sampled independently, there will be clusters. So you want the points not to be independent. You want something like a low-discrepancy sequence in N dimensions. One type of low-discrepancy sequence in N dimensions is a Sobol sequence. Sobol sequences were designed for high-dimensional numerical integration and are suitable for many, but not all, purposes.
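A short sketch of drawing such a sample over the given ranges with SciPy's quasi-Monte Carlo module (scipy.stats.qmc, available in SciPy >= 1.7):

```python
import numpy as np
from scipy.stats import qmc

def sobol_samples(ranges, k, seed=0):
    """k points from the box defined by `ranges` (list of (lb, ub) pairs),
    drawn from a scrambled Sobol low-discrepancy sequence."""
    lb, ub = map(np.asarray, zip(*ranges))
    sampler = qmc.Sobol(d=len(ranges), scramble=True, seed=seed)
    unit = sampler.random(k)          # k points in the unit hypercube
    return qmc.scale(unit, lb, ub)    # stretch to [lb, ub] in each dimension

print(sobol_samples([(0, 1), (-5, 5), (100, 200)], k=8))
```

Sobol points balance best when k is a power of two (sampler.random_base2(m) draws 2**m points); for other k SciPy emits a warning but still works.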

Finding a single cluster of points with low variance

Given a collection of points in the complex plane, I want to find a "typical value", something like mean or mode. However, I expect that there will be a lot of outliers, and that only a minority of the points will be close to the typical value. Here is the exact measure that I would like to use:
Find the mean of the largest set of points with variance less than some programmer-defined constant C
The closest thing I have found is the article Finding k points with minimum diameter and related problems, which gives an efficient algorithm for finding a set of k points with minimum variance, for some programmer-defined constant k. This is not useful to me because the number of points close to the typical value could vary a lot and there may be other small clusters. However, incorporating the article's result into a binary search algorithm shows that my problem can be solved in polynomial time. I'm asking here in the hope of finding a more efficient solution.
Here is a way to do it (from what I have understood of the problem):
Select a point k from the dataset and compute the list of all points sorted in ascending order of their distance from k, in O(N log N).
Treating k as the mean, add points from the sorted list into a set for as long as the variance stays below C, then stop.
Do this for every point k.
Keep track of the largest set found.
Time complexity: O(N^2 log N), where N is the size of the dataset.
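A direct NumPy sketch of that procedure, taking "variance" to mean the mean squared distance from the candidate centre k (one reading of the second step):

```python
import numpy as np

def largest_low_variance_cluster(points, C):
    """For each candidate centre k, take points in order of distance to k for
    as long as the mean squared distance from k stays below C; return the
    largest such set.  O(N^2 log N) overall; `points` is complex-valued."""
    points = np.asarray(points)
    best = np.empty(0, dtype=points.dtype)
    for k in points:
        sq_dist = np.abs(points - k) ** 2
        order = np.argsort(sq_dist)                    # O(N log N) per candidate
        mean_sq = np.cumsum(sq_dist[order]) / np.arange(1, len(points) + 1)
        # mean_sq is non-decreasing, so the feasible prefix is contiguous
        size = int(np.searchsorted(mean_sq, C))
        if size > len(best):
            best = points[order[:size]]
    return best

rng = np.random.default_rng(1)
tight = rng.normal(0, 0.1, 50) + 1j * rng.normal(0, 0.1, 50)
outliers = rng.uniform(-10, 10, 30) + 1j * rng.uniform(-10, 10, 30)
members = largest_low_variance_cluster(np.concatenate([tight, outliers]), C=0.1)
print(len(members), members.mean())
```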
Mode-seeking algorithms such as Mean-Shift clustering may still be a good choice.
You could then just keep the mode with the largest set of points that has variance below the threshold C.
Another approach would be to run k-means with a fairly large k. Then remove all points that contribute too much to variance, decrease k and repeat. Even though k-means does not handle noise very well, it can be used (in particular with a large k) to identify such objects.
Or you might first run some simple outlier detection methods to remove these outliers, then identify the mode within the reduced set only. A good candidate method is 1NN outlier detection, which should run in O(n log n) if you have an R-tree for acceleration.
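As an illustration of the mean-shift suggestion, a hedged sketch with scikit-learn's MeanShift that keeps the largest mode whose members have variance below C (the bandwidth choice here is arbitrary):

```python
import numpy as np
from sklearn.cluster import MeanShift

def typical_value(points, C, bandwidth=0.5):
    """Mean-shift the (complex) points, then return the mean of the largest
    cluster whose variance about its own mean is below C, or None."""
    xy = np.column_stack([points.real, points.imag])
    labels = MeanShift(bandwidth=bandwidth).fit(xy).labels_
    best = None
    for lab in np.unique(labels):
        members = points[labels == lab]
        variance = np.mean(np.abs(members - members.mean()) ** 2)
        if variance < C and (best is None or len(members) > len(best)):
            best = members
    return None if best is None else best.mean()

rng = np.random.default_rng(2)
tight = rng.normal(1.0, 0.05, 100) + 1j * rng.normal(-2.0, 0.05, 100)
outliers = rng.uniform(-10, 10, 40) + 1j * rng.uniform(-10, 10, 40)
print(typical_value(np.concatenate([tight, outliers]), C=0.05))
```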

Resources