I have a dataset consisting of 2D points, where each point has an associated key k that influences its y coordinate. The key is an integer k ∈ {0, 1, ..., K}.
I would like to partition the keys into intervals, and cluster the associated 2D points with a fixed budget of N clusters per interval such that some criterion is optimized, e.g. maximize likelihood, minimize SSE, etc. Ultimately, I want to use the clusters to estimate the density of out-of-sample points, i.e. to model P(X, Y | k).
To illustrate, I've generated a simple dataset below. Each color represents a different group, and since there are more groups than the cluster budget allows, each group will need to be associated with a cluster.
Here the cluster budget per key interval is N=3. One solution is to partition the key k into intervals {[0, 2], [3, 5], [6, 8]}. The black x markers indicate a cluster center.
Example of how such a model may be used:
Observe key k
Look up interval containing k
Sample points from resulting clusters, or estimate density at some point p
I'm familiar with methods such as K-means and GMMs, but it's not clear how to incorporate my key input and cluster budget constraint. I've experimented with kernel mixture networks, but the training time and results were not satisfactory.
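To make the intended usage concrete, here is a minimal sketch of the kind of model I have in mind, assuming the key partition is already fixed by hand and using scikit-learn's GaussianMixture as a stand-in for whatever clustering method ends up being appropriate (the data and partition below are placeholders):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
K, N = 8, 3                              # max key value, cluster budget per interval
intervals = [(0, 2), (3, 5), (6, 8)]     # one possible partition of {0, ..., K}

# Toy stand-in data: y depends on the key k.
keys = rng.integers(0, K + 1, size=2000)
points = np.column_stack([rng.normal(size=keys.size),
                          keys + rng.normal(scale=0.3, size=keys.size)])

# Fit one GMM with N components per key interval.
models = {}
for lo, hi in intervals:
    mask = (keys >= lo) & (keys <= hi)
    models[(lo, hi)] = GaussianMixture(n_components=N, random_state=0).fit(points[mask])

def log_density(p, k):
    """Estimate log P(x, y | k): look up the interval containing k, query its GMM."""
    for (lo, hi), gmm in models.items():
        if lo <= k <= hi:
            return gmm.score_samples(np.atleast_2d(p))[0]
    raise ValueError("key outside all intervals")

print(log_density([0.0, 4.0], k=4))
```

The open question is how to choose the interval partition and the per-interval clusters jointly, rather than fixing the partition by hand as above.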
Related
I need to (roughly) evenly distribute points in space, but the dimensionality isn't fixed.
I've seen the Fibonacci Sphere algorithm, but as it uses sin+cos for x,z it seems it's only suited for 3D space. I've also seen the sunflower spiral algorithm, but it similarly is limited to 2D.
Is there a general algorithm that takes
a number of points
a number of dimensions
and spreads points throughout?
We can fill your space with k^n n-dimensional hypercubes by dividing each dimension into k equally-sized regions.
Given r points and n dimensions, we want r = k^n, so k = r^(1/n).
E.g., for 1000 points and 2 dimensions we'd want k = 1000^(1/2) = 31.6 regions per dimension, but for 3 dimensions we'd want k = 1000^(1/3) = 10 regions per dimension.
For non-integer values, I'd recommend rounding up (so 31.6 becomes 32). This will give you a few more cells than points. You can either select which cells don't get points at random, or distribute them towards the edges or however you like.
Once you have the cells that should have points, assign 1 point randomly to a location within each cell, choosing a float between 0 and 1 per dimension as the point's location along that dimension's axis segment within the cell.
Since the cells are perfectly distributed (except possibly a few extra empty cells) and there is one point per cell, the points are reasonably distributed in space while still being random.
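Here is a minimal NumPy sketch of that recipe (rounding k up and dropping the surplus cells at random; the function name is just for illustration):

```python
import numpy as np

def spread_points(r, n, rng=None):
    """Place r points in the n-dimensional unit cube, one per grid cell."""
    rng = np.random.default_rng() if rng is None else rng
    k = int(np.ceil(r ** (1.0 / n)))         # cells per dimension
    # Enumerate all k^n cell indices, then keep a random subset of r cells.
    cells = np.stack(np.meshgrid(*[np.arange(k)] * n, indexing="ij"),
                     axis=-1).reshape(-1, n)
    chosen = cells[rng.choice(len(cells), size=r, replace=False)]
    # One uniformly random point inside each chosen cell.
    return (chosen + rng.random((r, n))) / k

print(spread_points(1000, 3).shape)          # (1000, 3)
```

Note that this enumerates all k^n cells, which is fine when k^n is close to r but would need a sparser cell-selection scheme for very high dimensions.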
I'm currently writing a script that is supposed to remove redundant data points from my graph. My data includes overlaps from adjacent data sets and I only want the data that is generally higher.
(Imagine two Gaussians with an x offset that overlap slightly. I'm only interested in the higher values in the overlap region, so that my final graph doesn't get all noisy when I combine the data in order to make a single spectrum.)
Here are my problems:
1) The x values aren't the same between the two data sets, so I can't just say "at x, take max y value". They're close together, but not equal.
2) The distances between x values aren't equal.
3) The data is noisy, so there can be multiple points where the data sets intersect. And while Gaussian A is generally higher after the intersection than Gaussian B, the noise means Gaussian B might still have SOME values which are higher. Meaning I can't just say "always take the highest values in this x area", because then I'd wildly combine the noise of both data sets.
4) I have n overlaps of this type, so I need an efficient algorithm, and the best I can come up with is somewhere around O(n^3), which would be something like "for each overlap, store data sets into two arrays and for each combination of data points (x0,y0) and (x1,y1) cycle through until you find the lowest combination of abs(x1-x0) AND abs(y1-y0)".
As I'm not a programmer, I'm completely lost. I also wasn't able to find an algorithm for this problem anywhere - most algorithms assume that the entries in the arrays I'm comparing are equal integers, but I'm working with almost-equal floats.
I'm using IDL, but I'd also be grateful for a general algorithm or at least a tip what I could try. Thanks!
One way you can do this is to fit Gaussians to your data and then take the max, treating each data point as if it were equal to the fitted Gaussian at that point.
This can be done as follows:
Fit some gaussian G1 to dataset X1 and some gaussian G2 to dataset X2, where the mean of G1 is less than the mean of G2.
Then, find their intersection point with some arithmetic.
Then, for all values of x less than the intersection take X1, and for all values of x greater than the intersection take X2.
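Here is a rough Python sketch of that recipe (you mentioned IDL, so treat it as pseudocode to translate; the intersection is found numerically here rather than with closed-form arithmetic, and the curve_fit starting guesses would need tuning for real data):

```python
import numpy as np
from scipy.optimize import curve_fit, brentq

def gauss(x, a, mu, sigma):
    return a * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def combine(x1, y1, x2, y2):
    """Fit a Gaussian to each dataset, find where the fits cross, and split there."""
    p1, _ = curve_fit(gauss, x1, y1, p0=[y1.max(), x1[np.argmax(y1)], 1.0])
    p2, _ = curve_fit(gauss, x2, y2, p0=[y2.max(), x2[np.argmax(y2)], 1.0])
    if p1[1] > p2[1]:                        # make dataset 1 the one with the smaller mean
        (x1, y1, p1), (x2, y2, p2) = (x2, y2, p2), (x1, y1, p1)
    # Numerical intersection of the two fits, assumed to lie between the two means.
    xc = brentq(lambda x: gauss(x, *p1) - gauss(x, *p2), p1[1], p2[1])
    keep1, keep2 = x1 < xc, x2 >= xc         # left of the crossing: set 1; right: set 2
    x = np.concatenate([x1[keep1], x2[keep2]])
    y = np.concatenate([y1[keep1], y2[keep2]])
    order = np.argsort(x)
    return x[order], y[order]
```

This sidesteps problems 1-3 above because the split point comes from the smooth fits rather than from matching individual noisy samples.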
I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between the two clusters. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2 where P1 is taken from cluster C1 and P2 is taken from cluster C2. If C1 has X points and C2 has Y points then this requires X*Y distance measurements.
I currently estimate this distance in a way that requires X+Y measurements:
Find the centroid Ctr1 of cluster C1.
Find the point P2 in cluster C2 that is closest to Ctr1. (Y comparisons.)
Find the point P1 in C1 that is closest to P2. (X comparisons.)
The distance from P1 to P2 is an approximate measure of the distance between the clusters C1 and C2. It is an upper-bound on the true value.
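For reference, here is a NumPy sketch of this approximation (my actual code is in C#; the names are just illustrative):

```python
import numpy as np

def approx_cluster_distance(c1, c2):
    """Approximate the min inter-cluster distance with X + Y distance computations.

    c1, c2: NumPy arrays of shape (X, D) and (Y, D).
    Returns an upper bound on the true minimum distance.
    """
    ctr1 = c1.mean(axis=0)                                  # centroid of C1
    p2 = c2[np.argmin(np.linalg.norm(c2 - ctr1, axis=1))]   # Y comparisons
    p1 = c1[np.argmin(np.linalg.norm(c1 - p2, axis=1))]     # X comparisons
    return np.linalg.norm(p1 - p2)
```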
If clusters are roughly spherical, this works very well. My test data is composed of ellipsoidal Gaussian clusters, so it works very well. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:
Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?
OR
Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?
(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)
NOTE: I just found this related question, which discusses a similar problem, though not necessarily for high dimensions. Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?
NOTE: I just discovered that this is called the bichromatic closest-pair problem.
For context, here is an overview of the overall clustering algorithm:
The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.
The second pass performs single-linkage agglomeration by combining clusters together if their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based. All clusters whose minimum distance from one another is less than this maximum linkage-distance will be combined. This step has no predefined number of clusters as its goal.
The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and only combining clusters until the number of clusters equals a predefined target number of clusters. It handles some outliers, preferring to only merge outliers with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.
The fourth pass combines all remaining outliers with the nearest large cluster, but causes no large clusters to merge with other large clusters. (This prevents two adjacent clusters from accidentally being merged due to their outliers forming a thin chain between them.)
You could use an index. That is the very classic solution.
A spatial index can help you find the nearest neighbor of any point in roughly O(log n) time. So if your clusters have n and m objects, choose the smaller cluster and index the larger cluster, to find the closest pair in O(n log m) or O(m log n).
A simpler heuristic approach is to iterate your idea multiple times, shrinking your set of candidates. So you find a good pair of objects a, b from the two clusters. Then you discard all objects from each cluster that must (by triangle inequality) be further apart (using the upper bound!).
Then you repeat this, but not choosing the same a, b again. Once your candidate sets stop improving, do the pairwise comparisons on the remaining objects only. The worst case of this approach should remain O(n*m).
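As a sketch of the index idea, using scipy's cKDTree purely as a stand-in (beware that k-d trees lose much of their advantage in very high dimensions, so with 50-1000 dimensional data you may prefer a different index or the candidate-shrinking heuristic above):

```python
import numpy as np
from scipy.spatial import cKDTree

def min_cluster_distance(c1, c2):
    """Exact minimum inter-cluster distance, indexing the larger cluster."""
    small, large = (c1, c2) if len(c1) <= len(c2) else (c2, c1)
    tree = cKDTree(large)              # build the index on the larger cluster
    dists, _ = tree.query(small)       # nearest neighbour in 'large' for each point of 'small'
    return dists.min()
```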
I have found a paper that describes a linear time, randomized, epsilon-approximate algorithm for the closest bichromatic point problem:
http://www.cs.umd.edu/~samir/grant/cp.pdf
I will attempt to implement it and see if it works.
UPDATE - After further study, it is apparent that the runtime is proportional to 3^D, where D is the number of dimensions. This is unacceptable. After trying several other approaches, I hit on the following.
Perform a rough clustering into K clusters using an efficient but incomplete method. This method will properly cluster some points, but yield too many clusters. These small clusters remain to be further consolidated to form larger clusters. This method will determine an upper bound distance DMAX between points that are considered to be in the same cluster.
Sort the points in Hilbert curve order.
Throw out all points immediately preceded and succeeded by a neighbor from the same cluster. More often than not, these are interior points of a cluster, not surface points.
For each point P1, search forward, but no farther than the next point from the same cluster.
Compute the distance from the point P1 from cluster C1 to each visited point P2 from cluster C2 and record the distance if it is smaller than any prior distance measured between points in C1 and C2.
However, if P1 has already been compared to a point in C2, do not do so again. Only make a single comparison between P1 and any point in C2.
After all comparisons have been made, there will be at most K(K-1) distances recorded, and many discarded because they are larger than DMAX. These are estimated closest point distances.
Perform merges between clusters if they are nearer than DMAX.
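A rough Python sketch of the forward scan (assuming the points are already sorted into Hilbert-curve order and carry a cluster label from the rough first pass; my real implementation is in C#):

```python
import numpy as np
from collections import defaultdict

def closest_pair_estimates(points, labels, dmax):
    """Estimate closest inter-cluster distances by scanning points in Hilbert order.

    points : (N, D) NumPy array, already sorted in Hilbert-curve order.
    labels : cluster id of each point from the rough first-pass clustering.
    Returns {(c1, c2): estimated closest distance} for pairs nearer than dmax.
    """
    n = len(points)
    # Drop points whose predecessor and successor share their cluster (likely interior points).
    keep = [i for i in range(n)
            if not (0 < i < n - 1 and labels[i - 1] == labels[i] == labels[i + 1])]

    best = defaultdict(lambda: np.inf)
    for a, i in enumerate(keep):
        seen = set()
        for j in keep[a + 1:]:
            if labels[j] == labels[i]:     # scan no farther than the next point of the same cluster
                break
            if labels[j] in seen:          # compare P1 to at most one point of each other cluster
                continue
            seen.add(labels[j])
            d = np.linalg.norm(points[i] - points[j])
            pair = tuple(sorted((labels[i], labels[j])))
            best[pair] = min(best[pair], d)
    # Keep only pairs close enough to be merge candidates.
    return {pair: d for pair, d in best.items() if d < dmax}
```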
It is hard to reason about how the Hilbert curve wiggles among the clusters, so my initial estimate was that the cost of finding closest pairs this way would be proportional to K^2. However, my testing shows that it is closer to K. It may be around K*log(K). Further research is necessary.
As for accuracy:
Comparing every point to every other point is 100% accurate.
Using the centroid method outlined in my question has distances that are about 0.1% too high.
Using this method finds distances that are at worst 10% too high, and on average 5% too high. However, the true closest cluster almost always turns up as among the first through third closest clusters, so qualitatively it is good. The final clustering results using this method are excellent. My final clustering algorithm seems to be proportional to D*N*K or D*N*K*Log(K).
I have a large number of points in 3D space (x,y,z) represented as an array of 3 float structs. I also have access to a strong graphics card with CUDA capability. I want the following:
Divide the points in the array into clusters so that every point within a cluster has a maximum euclidean distance of X to at least one other point within the cluster.
Example in 2D:
The "brute force" way of doing this is of course to calculate the distance between every point and every other point, to see if any of the distances is below the threshold X, and if so mark those points as belonging to the same cluster. This is an O(n²) algorithm.
This can of course be done in parallel in CUDA with n² threads, but is there a better way?
The algorithm can be reduced to O(n) by using binning:
impose a 3D grid with spacing X, i.e. a 3D lattice (each cell of the lattice is a cubic bin);
assign each point in space to the corresponding bin (the bin that geometrically contains that point);
every time you need to evaluate the distances from one point, use only the points in that point's own bin and in the 26 neighbouring bins (3x3x3 = 27)
The points in the other bins are further than X, so you don't need to evaluate the distances at all.
In this way, assuming a roughly constant point density, you only have to compute distances for a constant number of pairs per point, so the total work is linear in the number of points.
Assigning the points to the bins is O(n) as well.
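A minimal Python sketch of the cell-list idea (a plain dict of bins; a CUDA version would assign points to bins and scan the neighbouring bins in parallel, but the structure is the same; `points` is assumed to be an (n, 3) NumPy array):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def neighbour_pairs(points, x):
    """Yield all index pairs (i, j) with distance <= x, using cubic bins of side x."""
    bins = defaultdict(list)
    for i, p in enumerate(points):
        bins[tuple((p // x).astype(int))].append(i)     # O(n) bin assignment

    for cell, members in bins.items():
        # The bin itself plus its 26 neighbours (3x3x3 = 27 cells).
        for offset in product((-1, 0, 1), repeat=3):
            other = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in bins.get(other, ()):
                    if i < j and np.linalg.norm(points[i] - points[j]) <= x:
                        yield i, j
```

The resulting pairs can then be merged into clusters with a union-find structure.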
If the points are not uniformly distributed, the bins can be smaller (and you must consider more than 26 neighbours to evaluate the distances) and possibly sparse.
This is a typical trick used in molecular dynamics, ray tracing, meshing, ... I know the term binning from molecular dynamics simulation: the name may vary (link-cell; kd-trees use the same principle, even if in a more elaborate form), but the algorithm remains the same!
And, good news, the algorithm is well suited for parallel implementation.
refs:
https://en.wikipedia.org/wiki/Cell_lists
I am reading a bit about the mean shift clustering algorithm (http://en.wikipedia.org/wiki/Mean_shift) and this is what I've got so far. For each point in your data set: select all points within a certain distance of it (including the original point), calculate the mean for all these points, and repeat until these means stabilize.
What I'm confused about is how one goes from here to deciding what the final clusters are, and under what conditions these means merge. Also, does the distance used to select the points fluctuate through the iterations, or does it remain constant?
Thanks in advance
The mean shift cluster finding is a simple iterative process which is actually guaranteed to converge. The iteration starts from a starting point x, and the iteration steps are (note that x may have several components, as the algorithm will work in higher dimensions, as well):
calculate the weighted mean position x' of all points around x - maybe the simplest form is to calculate the average position of all points within distance d of x, but a Gaussian weighting function is also commonly used and mathematically beneficial.
set x <- x'
repeat until the difference between x and x' is very small
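For a single starting point, the iteration might look like this minimal sketch (flat kernel of radius d, i.e. a plain average of the points within distance d; it assumes the starting point is itself a data point, so the neighbourhood is never empty):

```python
import numpy as np

def mean_shift(points, x, d, tol=1e-5, max_iter=500):
    """Iterate a single point x toward the nearest mode using a flat kernel of radius d."""
    x = np.asarray(x, dtype=float)
    for _ in range(max_iter):
        near = points[np.linalg.norm(points - x, axis=1) <= d]  # all points within distance d
        x_new = near.mean(axis=0)                               # plain average (uniform weights)
        if np.linalg.norm(x_new - x) < tol:                     # stop once the shift is tiny
            return x_new
        x = x_new
    return x
```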
This can be used in cluster analysis by starting with different values of x. The final values will end up at different cluster centers. The number of clusters cannot be known (other than it is <= number of points).
The upper level algorithm is:
go through a selection of starting values
for each value, calculate the convergence value as shown above
if the value is not already in the list of convergence values, add it to the list (allow some reasonable tolerance for numerical imprecision)
And then you have the list of clusters. The only difficult thing is finding a reasonable selection of starting values. It is easy with one or two dimensions, but with higher dimensionalities exhaustive searches are not quite possible.
All starting points which end up at the same mode (point of convergence) belong to the same cluster.
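Continuing the sketch above, the upper-level loop can group points by the mode they converge to, with a tolerance for numerical imprecision:

```python
import numpy as np

def mean_shift_clusters(points, d, tol=1e-3):
    """Run the iteration from every data point and group points by the mode they reach."""
    modes, labels = [], []
    for p in points:
        m = mean_shift(points, p, d)                 # from the sketch above
        for idx, known in enumerate(modes):
            if np.linalg.norm(m - known) < tol:      # same mode within tolerance
                labels.append(idx)
                break
        else:
            modes.append(m)                          # a new cluster centre
            labels.append(len(modes) - 1)
    return np.array(modes), np.array(labels)
```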
It may be of interest that if you are doing this on a 2D image, it should be sufficient to calculate the gradient (i.e. the first iteration) for each pixel. This is a fast operation with common convolution techniques, and then it is relatively easy to group the pixels into clusters.