There are about 10 points in 3D space. The goal is to find the number of clusters and their positions. What is the best clustering method for this?
I have looked at complete-linkage and DBSCAN clustering. Which one is more efficient?
You can try a space-filling curve: for example, translate the coordinates to binary, interleave the bits, and treat the result as a base-4 number. Then sort the numbers.
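For 3D points the interleaving groups three bits per level (a Morton / Z-order key). A minimal sketch of that idea, where the 10-bit quantization and the random points are assumptions for illustration:

import numpy as np

def morton_key_3d(x, y, z, bits=10):
    # Interleave the bits of the three quantized coordinates into one key.
    key = 0
    for i in range(bits):
        key |= ((int(x) >> i) & 1) << (3 * i)
        key |= ((int(y) >> i) & 1) << (3 * i + 1)
        key |= ((int(z) >> i) & 1) << (3 * i + 2)
    return key

points = np.random.rand(10, 3)                     # stand-in for the ~10 points
grid = (points * ((1 << 10) - 1)).astype(int)      # quantize to a 10-bit grid (assumption)
order = sorted(range(len(points)), key=lambda i: morton_key_3d(*grid[i]))
sorted_points = points[order]                      # nearby points tend to end up adjacent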
With 10 points, efficiency is your least concern.
DBSCAN is meant for larger data sets. Hierarchical clustering is the way to go with such tiny data.
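For example, a minimal sketch with SciPy's complete-linkage clustering, cutting the dendrogram at a distance threshold to get both the number of clusters and their centers (the threshold and the random points are placeholders to tune or replace):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(10, 3)                      # stand-in for your ~10 points in 3D

Z = linkage(points, method="complete")              # complete-linkage dendrogram
labels = fcluster(Z, t=0.5, criterion="distance")   # cut at distance 0.5 (tune for your data)

n_clusters = labels.max()
centers = np.array([points[labels == k].mean(axis=0) for k in range(1, n_clusters + 1)])
print(n_clusters, centers)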
I'm looking for an algorithm that takes the unstructured 2D point set as illustrated above and gives me a decomposition into bounding boxes as shown below. The bounding boxes may overlap, but the algorithm should nevertheless try to find a tight fit (it does not need to be the best one possible, but a good one).
I already tried k-means, but that doesn't give me useful results because I'd need to know the number of clusters in advance.
There are several approaches to achieve this. The one I'd use is to apply the RANSAC algorithm in order to sequentially fit Oriented Bounding Boxes (OBB) to the data. By adopting this approach you can even improve the sample selection by computing a constrained Delaunay triangulation on your data, thus discarding bigger edges when fitting the OBB.
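As one building block of that pipeline, here is a minimal sketch (my own assumption, not the full RANSAC loop) of fitting an oriented bounding box to a 2D point subset via PCA; a RANSAC-style outer loop would repeatedly sample subsets, fit a box like this, and keep boxes with enough inliers:

import numpy as np

def fit_obb(points):
    """PCA-aligned oriented bounding box: returns (center, axes, half_extents)."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    centered = points - center
    _, _, axes = np.linalg.svd(centered, full_matrices=False)  # rows are the principal axes
    proj = centered @ axes.T                   # point coordinates in the PCA frame
    mins, maxs = proj.min(axis=0), proj.max(axis=0)
    box_center = center + ((mins + maxs) / 2) @ axes
    return box_center, axes, (maxs - mins) / 2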
Alternatively, you can apply the Alpha-shape algorithm to the entire set of points and start decomposing it into smaller shapes, but this requires choosing a suitable alpha value, which is not trivial.
I currently have 2 clusters which essentially lie along 2 lines on a 3D surface:
I've tried a simple k-means algorithm, which yielded the above separation. (The big red dots are the means.)
I have tried clustering with soft k-means with different variances for each mean along the 3 dimensions. However, it also failed to model this, presumably as it cannot fit the shape of a diagonal Gaussian.
Is there another algorithm that can take into account that the data is diagonal? Alternatively is there a way to essentially "rotate" the data such that soft k-means could be made to work?
K-means is not well suited to correlated clusters.
But these clusters look reasonably Gaussian to me; you should try Gaussian Mixture Modeling instead.
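A minimal sketch with scikit-learn, assuming the points sit in an (n, 3) array X (the input file name is a placeholder); full covariance matrices let each component tilt along the data's diagonal direction, which is exactly what k-means and axis-aligned soft k-means cannot do:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.loadtxt("points.txt")                       # hypothetical input, one 3D point per row
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

labels = gmm.predict(X)                            # hard cluster assignments
means, covariances = gmm.means_, gmm.covariances_  # one mean and full 3x3 covariance per cluster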
I have a set of 300,000 or so vectors which I would like to compare in some way, and given one vector I want to be able to find the closest one. I have thought of the following methods:
Simple Euclidian distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to choose one rather than another. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear: are you looking for a distance metric between vectors, or for an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a k-d tree, since you are just looking at points in d-dimensional space. See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
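A minimal sketch with SciPy's cKDTree, assuming the vectors are rows of a dense numeric array (the shapes here are placeholders); note that k-d trees lose their advantage in very high dimensions:

import numpy as np
from scipy.spatial import cKDTree

vectors = np.random.rand(300_000, 10)   # stand-in for your ~300,000 vectors
tree = cKDTree(vectors)                 # build the k-d tree once

query = np.random.rand(10)
dist, idx = tree.query(query, k=1)      # Euclidean distance to and index of the closest vector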
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
If your vectors are very sparse and binary, you can use the Hamming or Hellinger distance. When your vectors have many dimensions, avoid Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper restricts itself to pairs of probability distributions.
I'm working on an app that lets users select regions by finger painting on top of a map. The points then get converted to latitude/longitude pairs and uploaded to a server.
The touch screen is delivering way too many points to be uploaded over 3G. Even small regions can accumulate up to ~500 points.
I would like to smooth this touch data (approximate it within some tolerance). The accuracy of drawing does not really matter much as long as the general area of the region is the same.
Are there any well-known algorithms to do this? Is this a job for a Kalman filter?
There is the Ramer–Douglas–Peucker algorithm (wikipedia).
The purpose of the algorithm is, given a curve composed of line segments, to find a similar curve with fewer points. The algorithm defines 'dissimilar' based on the maximum distance between the original curve and the simplified curve. The simplified curve consists of a subset of the points that defined the original curve.
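A minimal sketch of the algorithm in 2D (the tolerance is in the same units as the coordinates and would need tuning): keep the point farthest from the chord between the endpoints if it exceeds the tolerance, and recurse on both halves.

import numpy as np

def rdp(points, tolerance):
    """Ramer-Douglas-Peucker simplification of a 2D polyline."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0:                          # degenerate case: the endpoints coincide
        dists = np.linalg.norm(points - start, axis=1)
    else:                                  # perpendicular distance to the line through the endpoints
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    i = int(np.argmax(dists))
    if dists[i] > tolerance:
        left = rdp(points[: i + 1], tolerance)
        right = rdp(points[i:], tolerance)
        return np.vstack([left[:-1], right])   # drop the duplicated split point
    return np.vstack([start, end])

Run on the ~500 raw touch points with a modest tolerance, this can easily cut them down to a few dozen while preserving the overall shape of the region.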
You probably don't need anything too exotic to dramatically cut down your data.
Consider something as simple as this:
Construct some sort of error metric. An easy one would be a normalized sum of the distances from the omitted points to the line that approximates them. Decide what counts as a tolerable error under this metric.
Then, starting from the first point, construct the longest line segment that stays within the tolerable error. Repeat this process until you have converted the entire path into a polyline.
This will not give you the globally optimal approximation but it will probably be good enough.
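A minimal sketch of that greedy pass, assuming 2D points and using the mean perpendicular distance of the skipped points as the error metric (both are assumptions; any metric of the same flavour works):

import numpy as np

def point_to_line(p, a, b):
    """Perpendicular distance from point p to the infinite line through a and b."""
    d = b - a
    n = np.linalg.norm(d)
    if n == 0:
        return np.linalg.norm(p - a)
    return abs(d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])) / n

def greedy_polyline(points, tolerance):
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    kept, start = [0], 0
    for end in range(2, len(points)):
        skipped = points[start + 1 : end]
        if len(skipped) == 0:
            continue
        err = np.mean([point_to_line(p, points[start], points[end]) for p in skipped])
        if err > tolerance:            # the segment became too inaccurate: cut at the previous point
            kept.append(end - 1)
            start = end - 1
    kept.append(len(points) - 1)
    return points[kept]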
If you want the approximation to be more curvy, you might consider using splines or Bézier curves rather than straight line segments.
You want to subdivide the surface into a grid with a quadtree or a space-filling curve. A space-filling curve reduces the 2D problem to a 1D one. Have a look at Nick's Hilbert curve quadtree spatial index blog.
I was going to do something like this in an app, but intended to generate a path from the points on the fly. I was going to use a technique mentioned in this Point Sequence Interpolation thread.
What kind of data structure could be used for an efficient nearest neighbor search in a large set of geo coordinates? With "regular" spatial index structures like R-Trees that assume planar coordinates, I see two problems (Are there others I have overlooked?):
Wraparound at the poles and the International Date Line
Distortion of distances near the poles
How can these factors be allowed for? I guess the second one could be compensated for by transforming the coordinates. Can an R-tree be modified to take wraparound into account? Or are there specialized geo-spatial index structures?
Could you use a locality-sensitive hashing (LSH) algorithm in 3 dimensions? That would quickly give you an approximate neighboring group which you could then sanity-check by calculating great-circle distances.
Here's a paper describing an algorithm for efficient LSH on the surface of a unit d-dimensional hypersphere. Presumably it works for d=3.
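For intuition, here is a minimal sketch of the simpler random-hyperplane variant of LSH (not the exact scheme from the paper, which I'm only assuming carries over): map lat/lon to unit vectors on the sphere, hash them by the sign pattern of dot products with random hyperplanes, and verify the candidates that share a bucket with great-circle distances.

import numpy as np
from collections import defaultdict

def to_unit_vector(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to a point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

rng = np.random.default_rng(0)
hyperplanes = rng.normal(size=(16, 3))          # 16 random hyperplanes -> a 16-bit bucket key

def lsh_key(vec):
    return tuple((hyperplanes @ vec > 0).tolist())

# Hypothetical input: a list of (lat, lon) pairs.
coords = [(48.86, 2.35), (48.85, 2.34), (40.71, -74.01)]
buckets = defaultdict(list)
for i, (lat, lon) in enumerate(coords):
    buckets[lsh_key(to_unit_vector(lat, lon))].append(i)
# Candidate neighbours of a query are the points in its bucket; check them with
# great-circle distance before accepting.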
Take a look at Geohash.
Also, to compensate for wraparound, simply use not one but three orthogonal R-trees, so that there is no point on the Earth's surface at which all three trees wrap around. Then, two points are close if they are close according to at least one of these trees.