Clustering [assessment] algorithm with distance matrix as an input

Can anyone suggest a clustering algorithm which can work with a distance matrix as an input? Or an algorithm which can assess the "goodness" of the clustering, also based on the distance matrix?
At this moment I'm using a modification of Kruskal's algorithm (http://en.wikipedia.org/wiki/Kruskal%27s_algorithm) to split data into two clusters. It has a problem though. When the data has no distinct clusters the algorithm will still create two clusters with one cluster containing one element and the other containing all the rest. In this case I would rather have one cluster containing all the elements and another one which is empty.
Are there any algorithms which are capable of doing this type of clustering?
Are there any algorithms which can estimate how well the clustering was done or even better how many clusters are there in the data?
The algorithms should work only with distance (similarity) matrices as input.

Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:
Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):
Test the cluster label of each point against the cluster labels implied from k-nearest neighbors classification
If the k-nearest neighbors imply an alternative cluster, that classified point lowers the overall “goodness” rating of the cluster
Sum up the “goodness rating” contributions from each one of your points to get a total “goodness rating” for the whole clustering
Unlike k-means cluster analysis, your algorithm will return information about poorly categorized points. You can use that information to reassign certain points to a new cluster thereby improving the overall "goodness" of your clustering.
Since the algorithm knows nothing about the placement of the centroids of the clusters, and hence nothing about the global cluster density, the only way to ensure clusters that are both locally and globally dense would be to run the algorithm for a range of k values and find the arrangement that maximizes the goodness across that range.
For a significant number of points, you'll probably need to optimize this algorithm, possibly with a hash table to keep track of the nearest points relative to each point. Otherwise this algorithm will take quite a while to compute.
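If it helps, here is one way that idea could look in Python (my own sketch, not from the answer above); it assumes an NxN distance matrix, non-negative integer cluster labels, and a simple majority vote among the k nearest neighbours:
import numpy as np

def knn_goodness(dist, labels, k=5):
    """Fraction of points whose k nearest neighbours (taken from the
    distance matrix) vote for the cluster label the point already has.
    dist: (N, N) distance matrix; labels: length-N non-negative ints."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    agree = 0
    for i in range(n):
        order = np.argsort(dist[i])
        neighbours = [j for j in order if j != i][:k]         # k nearest, excluding the point itself
        predicted = np.bincount(labels[neighbours]).argmax()  # majority vote of the neighbours
        agree += int(predicted == labels[i])
    return agree / n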

Some approaches that can be used to estimate the number of clusters are:
Minimum Description Length
Bayesian Information Criterion
The gap statistic

scipy.cluster.hierarchy runs 3 steps, just like Matlab(TM) clusterdata:
import scipy.spatial.distance
import scipy.cluster.hierarchy as hier

Y = scipy.spatial.distance.pdist( pts )               # condensed distance matrix -- you have this already
Z = hier.linkage( Y, method )                         # N-1 merges
T = hier.fcluster( Z, ncluster, criterion=criterion )
Here linkage might be a modified Kruskal, dunno.
This SO answer (ahem) uses the above.
As a measure of clustering, radius = rms distance to cluster centre is fast and reasonable for 2d/3d points.
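For what it's worth, that radius measure is only a few lines of numpy (my own sketch); it assumes pts is an (N, d) array and labels holds the flat cluster ids, e.g. the T returned by fcluster above:
import numpy as np

def rms_radius(pts, labels):
    """RMS distance from each point to its cluster centre, per cluster."""
    pts, labels = np.asarray(pts, float), np.asarray(labels)
    radii = {}
    for c in np.unique(labels):
        members = pts[labels == c]
        centre = members.mean(axis=0)
        radii[c] = np.sqrt(((members - centre) ** 2).sum(axis=1).mean())
    return radii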
Tell us about your Npt, ndim, ncluster, hier/flat ?
Clustering is a largish area, one size does not fit all.

Related

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I take it you are a developer without much experience in data science?
There are a number of clustering algorithms and they have their advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than one might think.
I assume that N is small enough that you can store a matrix with N^2 float values in RAM? If this is the case, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it needs to be stored). Make sure that your similarity measure assigns a special value to pairs of images whose similarity is less than some X%, such as 0 or infinity (depending on whether you treat the function as a similarity measure or as a distance). I think the cleanest solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph-walking routine. This is your first cluster. After that, take the first unvisited vertex and repeat the graph walk. Of course you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not really care how similar the images are or which pairs are more similar than others, only whether they are similar enough (i.e. the similarity measure is greater than a given threshold).
Unfortunately, in cluster analysis it is common that 100% of the possible pairs have to be computed. It is possible to save some distance calls using fancy data structures for k-nearest-neighbor search, but you have to make sure that your similarity measure satisfies the triangle inequality.
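For what it's worth, here is a minimal sketch of the thresholded-graph idea in Python (my own illustration, not from any library); similarity(a, b) and the threshold X stand in for whatever measure you implement:
def cluster_by_threshold(images, similarity, X):
    """Build an implicit graph with an edge whenever similarity > X,
    then return its connected components as clusters (iterative DFS)."""
    n = len(images)
    # adjacency list of the thresholded similarity graph
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(images[i], images[j]) > X:
                adj[i].append(j)
                adj[j].append(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append([images[i] for i in component])
    return clusters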
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify the number of clusters)
Hierarchical clustering (slow computation time; at the top all images end up in one cluster, and you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram from top to bottom until I find a cluster where the distance is below a threshold. In the worst case there is no such cluster, but then I'll end up in a leaf of the dendrogram, which means that element is in a cluster of its own.

Estimate the minimum Distance between two Clusters

I am designing an agglomerative, bottom-up clustering algorithm for millions of 50-1000 dimensional points. In two parts of my algorithm, I need to compare two clusters of points and decide the separation between the two clusters. The exact distance is the minimum Euclidean distance taken over all pairs of points P1-P2 where P1 is taken from cluster C1 and P2 is taken from cluster C2. If C1 has X points and C2 has Y points then this requires X*Y distance measurements.
I currently estimate this distance in a way that requires X+Y measurements:
Find the centroid Ctr1 of cluster C1.
Find the point P2 in cluster C2 that is closest to Ctr1. (Y comparisons.)
Find the point P1 in C1 that is closest to P2. (X comparisons.)
The distance from P1 to P2 is an approximate measure of the distance between the clusters C1 and C2. It is an upper-bound on the true value.
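For reference, here is a small numpy sketch of that three-step approximation (my own rendering of the steps above; the real code is in C#), assuming C1 and C2 are numpy arrays of points:
import numpy as np

def approx_cluster_distance(C1, C2):
    """Upper bound on the minimum inter-cluster distance using X + Y
    distance evaluations, following the three steps above."""
    C1, C2 = np.asarray(C1, float), np.asarray(C2, float)
    ctr1 = C1.mean(axis=0)                                     # centroid of C1
    p2 = C2[np.argmin(np.linalg.norm(C2 - ctr1, axis=1))]      # closest point of C2 to ctr1 (Y distances)
    p1 = C1[np.argmin(np.linalg.norm(C1 - p2, axis=1))]        # closest point of C1 to p2   (X distances)
    return np.linalg.norm(p1 - p2)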
If clusters are roughly spherical, this works very well. My test data is composed of ellipsoidal Gaussian clusters, so it works very well. However, if clusters have weird, folded, bendy shapes, it may yield poor results. My questions are:
Is there an algorithm that uses even fewer than X+Y distance measurements and in the average case yields good accuracy?
OR
Is there an algorithm that (like mine) uses X+Y distance measurements but delivers better accuracy than mine?
(I am programming this in C#, but a description of an algorithm in pseudo-code or any other language is fine. Please avoid references to specialized library functions from R or Matlab. An algorithm with probabilistic guarantees like "95% chance that the distance is within 5% of the minimum value" is acceptable.)
NOTE: I just found this related question, which discusses a similar problem, though not necessarily for high dimensions. Given two (large) sets of points, how can I efficiently find pairs that are nearest to each other?
NOTE: I just discovered that this is called the bichromatic closest-pair problem.
For context, here is an overview of the overall clustering algorithm:
The first pass consolidates the densest regions into small clusters using a space-filling curve (Hilbert Curve). It misses outliers and often fails to merge adjacent clusters that are very close to one another. However, it does discover a characteristic maximum linkage-distance. All points separated by less than this characteristic distance must be clustered together. This step has no predefined number of clusters as its goal.
The second pass performs single-linkage agglomeration by combining clusters together if their minimum distance is less than the maximum linkage-distance. This is not hierarchical clustering; it is partition-based. All clusters whose minimum distance from one another is less than this maximum linkage-distance will be combined. This step has no predefined number of clusters as its goal.
The third pass performs additional single-linkage agglomeration, sorting all inter-cluster distances and only combining clusters until the number of clusters equals a predefined target number of clusters. It handles some outliers, preferring to only merge outliers with large clusters. If there are many outliers (and there usually are), this may fail to reduce the number of clusters to the target.
The fourth pass combines all remaining outliers with the nearest large cluster, but causes no large clusters to merge with other large clusters. (This prevents two adjacent clusters from accidentally being merged due to their outliers forming a thin chain between them.)
You could use an index. That is the very classic solution.
A spatial index can help you find the nearest neighbor of any point in roughly O(log n) time. So if your clusters have n and m objects, choose the smaller cluster and index the larger cluster, to find the closest pair in O(n log m) or O(m log n).
A simpler heuristic approach is to iterate your idea multiple times, shrinking your set of candidates. So you find a good pair of objects a, b from the two clusters. Then you discard all objects from each cluster that must (by triangle inequality) be further apart (using the upper bound!).
Then you repeat this, but not choosing the same a, b again. Once your candidate sets stop improving, do the pairwise comparisons on the remaining objects only. The worst case of this approach should remain O(n*m).
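To make the pruning concrete, here is one possible sketch (my own; the anchor choice and the fixed number of rounds are arbitrary). It uses the bound d(p, q) >= d(a, q) - d(a, p): if d(a, q) exceeds the current upper bound U plus the radius of C1 around a, then q can never be part of a closer pair and is discarded:
from itertools import product

def approx_min_distance(C1, C2, dist, rounds=3):
    """Keep an upper bound U on the minimum inter-cluster distance and use
    the triangle inequality to discard points that cannot beat it."""
    C1, C2 = list(C1), list(C2)
    U = float("inf")
    for r in range(rounds):
        a, b = C1[r % len(C1)], C2[r % len(C2)]   # anchors; the current best pair would be a tighter choice
        r1 = max(dist(a, p) for p in C1)          # radius of C1 around a
        r2 = max(dist(b, q) for q in C2)          # radius of C2 around b
        d_a = [dist(a, q) for q in C2]            # cross distances from a to C2
        d_b = [dist(b, p) for p in C1]            # cross distances from b to C1
        U = min(U, min(d_a), min(d_b))            # genuine inter-cluster distances tighten U
        # d(p, q) >= d(a, q) - r1, so q cannot beat U if d(a, q) > U + r1 (and symmetrically for p)
        C2 = [q for q, d in zip(C2, d_a) if d <= U + r1]
        C1 = [p for p, d in zip(C1, d_b) if d <= U + r2]
    # brute force on whatever candidates survive
    return min(U, min(dist(p, q) for p, q in product(C1, C2)))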
I have found a paper that describes a linear time, randomized, epsilon-approximate algorithm for the closest bichromatic point problem:
http://www.cs.umd.edu/~samir/grant/cp.pdf
I will attempt to implement it and see if it works.
UPDATE - After further study, it is apparent that the runtime is proportional to 3^D, where D is the number of dimensions. This is unacceptable. After trying several other approaches, I hit on the following.
Perform a rough clustering into K clusters using an efficient but incomplete method. This method will properly cluster some points, but yield too many clusters. These small clusters remain to be further consolidated to form larger clusters. This method will determine an upper bound distance DMAX between points that are considered to be in the same cluster.
Sort the points in Hilbert curve order.
Throw out all points immediately preceded and succeeded by a neighbor from the same cluster. More often than not, these are interior points of a cluster, not surface points.
For each point P1, search forward, but no farther than the next point from the same cluster.
Compute the distance from the point P1 from cluster C1 to each visited point P2 from cluster C2 and record the distance if it is smaller than any prior distance measured between points in C1 and C2.
However, if P1 has already been compared to a point in C2, do not do so again. Only make a single comparison between P1 and any point in C2.
After all comparisons have been made, there will be at most K(K-1) distances recorded, and many discarded because they are larger than DMAX. These are estimated closest point distances.
Perform merges between clusters if they are nearer than DMAX.
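For concreteness, here is a rough Python sketch of steps 2-6 (my own rendering; hilbert_index is a placeholder for whatever Hilbert-curve mapping is used, and labels[i] holds the integer cluster id of points[i] from step 1):
from collections import defaultdict
from math import dist as euclidean

def estimate_closest_pairs(points, labels, hilbert_index, dmax):
    """Forward scan in Hilbert order; returns estimated closest distances
    for cluster pairs that come in under dmax."""
    order = sorted(range(len(points)), key=lambda i: hilbert_index(points[i]))
    # Step 3: drop points whose predecessor and successor share their cluster
    # (these are usually interior points that contribute little here).
    keep = [order[i] for i in range(len(order))
            if not (0 < i < len(order) - 1
                    and labels[order[i - 1]] == labels[order[i]] == labels[order[i + 1]])]
    best = defaultdict(lambda: float("inf"))      # (c1, c2) -> estimated closest distance
    for s, i in enumerate(keep):
        seen_clusters = set()                     # compare i to at most one point per other cluster
        for j in keep[s + 1:]:
            if labels[j] == labels[i]:
                break                             # stop at the next point of the same cluster
            if labels[j] in seen_clusters:
                continue
            seen_clusters.add(labels[j])
            d = euclidean(points[i], points[j])
            key = tuple(sorted((labels[i], labels[j])))
            best[key] = min(best[key], d)
    return {pair: d for pair, d in best.items() if d <= dmax}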
It is hard to reason about how the Hilbert curve wiggles among the clusters, so my initial estimate was that this approach to finding closest pairs would be proportional to K^2. However, my testing shows that it is closer to K. It may be around K*log(K). Further research is necessary.
As for accuracy:
Comparing every point to every other point is 100% accurate.
Using the centroid method outlined in my question yields distances that are about 0.1% too high.
Using this method finds distances that are at worst 10% too high, and on average 5% too high. However, the true closest cluster almost always turns up among the first through third closest clusters, so qualitatively it is good. The final clustering results using this method are excellent. My final clustering algorithm seems to be proportional to D*N*K or D*N*K*log(K).

Trajectory Clustering: Which Clustering Method?

I am a newbie in Machine Learning. I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise.
In addition, not all of them are of the same length. So although Trajectory A is not the same as Trajectory B, it may be part of Trajectory B. I wish to represent this property after the clustering as well.
I have only a bit of knowledge of K-means Clustering and Fuzzy N-means Clustering. How may I choose between the two? Or should I adopt other methods?
Any method that takes the "belongness" into consideration?
(e.g. After the clustering, I have 3 clusters A, B and C. One particular trajectory X belongs to cluster A. And a shorter trajectory Y, although is not clustered in A, is identified as part of trajectory B.)
=================== UPDATE ======================
The aforementioned trajectories are the pedestrians' trajectories. They can be either presented as a series of (x, y) points or a series of step vectors (length, direction). The presentation form is under my control.
It might be a little late but I am also working on the same problem.
I suggest you take a look at TRACLUS, an algorithm created by Jae-Gil Lee, Jiawei Han and Kyu-Young Whang, published at SIGMOD’07.
http://web.engr.illinois.edu/~hanj/pdf/sigmod07_jglee.pdf
This is so far the best approach I have seen for clustering trajectories because:
It can discover common sub-trajectories.
It focuses on segments instead of points (so it filters out noise and outliers).
It works over trajectories of different lengths.
Basically it is a 2-phase approach:
Phase one - Partition: Divide trajectories into segments. This is done using MDL optimization with complexity O(n), where n is the number of points in a given trajectory. Here the input is a set of trajectories and the output is a set of segments.
Complexity: O(n) where n is number of points on a trajectory
Input: Set of trajectories.
Output: Set D of segments
Phase two - Group: This phase discovers the clusters using some version of density-based clustering, as in DBSCAN. The input in this phase is the set of segments obtained from phase one plus some parameters defining what constitutes a neighborhood and the minimum number of lines that can constitute a cluster. The output is a set of clusters. Clustering is done over segments. They define their own distance measure made of 3 components: parallel distance, perpendicular distance and angular distance. This phase has a complexity of O(n log n), where n is the number of segments.
Complexity: O(n log n) where n is number of segments on set D
Input: Set D of segments, parameter E that sets the neighborhood threshold and parameter MinLns that is the minimum number of lines.
Output: Set C of clusters, i.e. clusters of segments (the clustered trajectories).
Finally, they calculate a representative trajectory for each cluster, which is nothing other than a discovered common sub-trajectory of that cluster.
They have pretty cool examples and the paper is very well explained. Once again this is not my algorithm, so don't forget to cite them if you are doing research.
PS: I made some slides based on their work, just for educational purposes:
http://www.slideshare.net/ivansanchez1988/trajectory-clustering-traclus-algorithm
Every clustering algorithm needs a metric. You need to define a distance between your samples. In your case simple Euclidean distance is not a good idea, especially if the trajectories can have different lengths.
If you define a metric, then you can use any clustering algorithm that allows for a custom metric. You probably do not know the correct number of clusters beforehand, in which case hierarchical clustering is a good option. K-means doesn't allow for a custom metric, but there are modifications of K-means that do (like K-medoids).
The hard part is defining the distance between two trajectories (time series). The common approach is DTW (Dynamic Time Warping). To improve performance you can approximate your trajectory with a smaller number of points (there are many algorithms for that).
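For illustration, here is a bare-bones DTW in Python (my own sketch: no window constraint, not optimized). The pairwise matrix it produces can be fed to any clusterer that accepts a precomputed distance matrix; for example, its condensed form goes straight into scipy.cluster.hierarchy.linkage:
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance between two
    trajectories given as sequences of (x, y) points."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]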
Neither will work, because what would a proper mean be here?
Have a look at distance based clustering methods, such as hierarchical clustering (for small data sets, but you probably don't have thousands of trajectories) and DBSCAN.
Then you only need to choose an appropriate distance function that allows e.g. differences in time and spatial resolution of trajectories.
Distance functions such as dynamic time warping (DTW) distance can accommodate this.
This is a good concept with potential for real-time applications. In my view, one can adopt any clustering method, but you need to select an appropriate dissimilarity measure and then think about the computational complexity.
This paper (http://link.springer.com/chapter/10.1007/978-81-8489-203-1_15) used the Hausdorff distance and suggests a technique for reducing complexity, and this paper (http://www.cit.iit.bas.bg/CIT_2015/v-15-2/4-5-TCMVS%20A-edited-md-Gotovop.pdf) describes a "Trajectory Clustering Technique Based on Multi-View Similarity".

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points which have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform. It is very common that certain regions of the space contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an Octree. For 3D data with a limited value domain it scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points, this should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
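To make the octree idea concrete, here is a toy counting octree in Python (my own sketch, nothing production-grade; leaf_size and max_depth are arbitrary). It stores only counts per cell, so memory stays small, and the densest leaves fall out of a simple traversal:
class Octree:
    """Toy counting octree: each node keeps a point count and splits its
    cubic cell into up to 8 children once it gets crowded."""
    def __init__(self, centre, half_size, depth=0, max_depth=10, leaf_size=64):
        self.centre = centre          # (cx, cy, cz) of the cell
        self.half = half_size         # half the cell edge length
        self.depth = depth
        self.max_depth = max_depth
        self.leaf_size = leaf_size
        self.count = 0                # points in this subtree
        self.children = None          # octant index -> Octree, once split
        self.buffer = []              # points held until the node splits

    def insert(self, p):
        self.count += 1
        if self.children is None:
            self.buffer.append(p)
            if len(self.buffer) > self.leaf_size and self.depth < self.max_depth:
                self.children = {}
                pts, self.buffer = self.buffer, []
                for q in pts:
                    self._child(q).insert(q)
        else:
            self._child(p).insert(p)

    def _child(self, p):
        octant = sum(1 << i for i in range(3) if p[i] >= self.centre[i])
        if octant not in self.children:
            child_centre = tuple(c + (self.half / 2 if p[i] >= c else -self.half / 2)
                                 for i, c in enumerate(self.centre))
            self.children[octant] = Octree(child_centre, self.half / 2, self.depth + 1,
                                           self.max_depth, self.leaf_size)
        return self.children[octant]

    def densest_leaves(self, top=10):
        """The leaf cells with the most points -- candidate dense regions."""
        leaves, stack = [], [self]
        while stack:
            node = stack.pop()
            if node.children is None:
                leaves.append((node.count, node.centre, node.half))
            else:
                stack.extend(node.children.values())
        return sorted(leaves, key=lambda t: t[0], reverse=True)[:top]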
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
2) box: count the number of points in each 3d box,
[xj .. xj+1, yj .. yj+1, zj .. zj+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwise just take box midpoints.
3)
K-means clustering
on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a
heap
of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a
Minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
See also
Color quantization .
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.
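For concreteness, here is a compressed sketch of the quantize / box / K-means flow above (my own reading of the steps, assuming numpy and scikit-learn are available; the percentile edges and the weighted K-means call are implementation choices, not part of the original answer):
import numpy as np
from sklearn.cluster import KMeans

def quantize_box_kmeans(pts, ncluster, nbit=8):
    """pts: (N, 3) float array. Returns K-means fitted on occupied box
    centres, weighted by how many points fall in each box."""
    # 1) quantize: per-axis percentile edges -> nbit-bit codes
    qs = np.linspace(0, 100, 2 ** nbit + 1)
    edges = [np.percentile(pts[:, d], qs) for d in range(3)]
    codes = np.stack([np.clip(np.searchsorted(edges[d], pts[:, d]) - 1, 0, 2 ** nbit - 1)
                      for d in range(3)], axis=1)
    # 2) box: count points per occupied 3d box and take box centres
    boxes, counts = np.unique(codes, axis=0, return_counts=True)
    centres = np.stack([(edges[d][boxes[:, d]] + edges[d][boxes[:, d] + 1]) / 2
                        for d in range(3)], axis=1)
    # 3) K-means on the (far fewer) box centres, weighted by their counts
    return KMeans(n_clusters=ncluster, n_init=10).fit(centres, sample_weight=counts)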
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea. Create a graph with given points and edges between points when distance < R.
Creating this kind of graph is similar to spatial decomposition. Your questions can be answered with a local search in the graph: the first by finding the vertices with maximum degree, the second by finding a maximal unconnected (independent) set of max-degree vertices.
I think the creation of the graph and the search can be made parallel. This approach can have a large memory requirement. Splitting the domain and working with graphs for smaller volumes can reduce the memory needed.

How to cluster objects (without coordinates)

I have a list of opaque objects. I am only able to calculate the distance between them (not true, just setting the conditions for the problem):
class Thing {
    public double DistanceTo(Thing other);
}
I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:
List<Cluster> cluster(int numClusters, List<Thing> things);
Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?
Clarification Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.
Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells out" on a 2D plane, but you do know whether two smells are similar or not.
I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.
Instead, every cluster has a representative medoid, which is the member of the cluster closest to the middle. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.
Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:
PAM is the basic O(n^2) implementation of K-medoids.
CLARA is a much faster, sampled version of PAM. It works by clustering a randomly sampled subset of objects with PAM and grouping the entire set of objects based on the subset. You should still be able to get very good clusterings fast with this.
If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
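If you want something self-contained rather than the R packages, here is a tiny Voronoi-iteration k-medoids sketch in Python (my own, not full PAM, and far less robust than the implementations above); it only ever calls the pairwise distance function:
import random

def k_medoids(things, k, distance, max_iter=100, seed=0):
    """Minimal k-medoids driven only by a pairwise distance function
    (e.g. lambda a, b: a.DistanceTo(b) in the spirit of the question)."""
    rng = random.Random(seed)
    n = len(things)
    D = [[distance(things[i], things[j]) for j in range(n)] for i in range(n)]
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        # assign every object to its nearest medoid
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: D[i][m])].append(i)
        # move each medoid to the member minimising total distance within its cluster
        new_medoids = [min(members, key=lambda c: sum(D[c][o] for o in members))
                       for members in clusters.values()]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [[things[i] for i in members] for members in clusters.values()]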
Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.
Determine the distance between all objects. Record the n most separate objects. [finds roots of our clusters, time O(n^2)]
Assign each of these n random points to n new distinct clusters.
For every other object: [assign objects to clusters, time O(n^2)]
For each cluster:
Calculate the average distance from a cluster to that object by averaging the distance of each object in the cluster to the object.
Assign the object to the closest cluster.
This algorithm will certainly cluster the objects. But its runtime is O(n^2). Plus it is guided by those first n points chosen.
Can anyone improve upon this (better runtime perf, less dependent upon initial choices)? I would love to see your ideas.
Here's a quick algorithm.
While (points_left > 0) {
    Select a random point that is not already clustered
    Add point and all points within x distance
    that aren't already clustered to a new cluster.
}
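In Python, that loop is roughly the following (my own sketch; dist and the threshold x are whatever you use):
import random

def quick_cluster(points, x, dist, seed=0):
    """Repeatedly seed a new cluster with a random unclustered point and
    sweep in everything within distance x of it."""
    rng = random.Random(seed)
    unclustered = set(range(len(points)))
    clusters = []
    while unclustered:
        p = rng.choice(tuple(unclustered))
        members = {q for q in unclustered if dist(points[p], points[q]) <= x}
        members.add(p)
        unclustered -= members
        clusters.append([points[i] for i in members])
    return clusters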
Alternatively, read the Wikipedia page. K-means clustering is a good choice:
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
The algorithm steps are:
* Choose the number of clusters, k.
* Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable, which is not always the case. For such datasets the k-medoids variant is appropriate.
How about this approach:
1. Assign all objects to one cluster.
2. Find the two objects, a and b, that are within the same cluster, k, and that are maximum distance apart. To clarify, there should be one a and b for the whole set, not one a and b for each cluster.
3. Split cluster k into two clusters, k1 and k2, one with object a and one with object b.
4. For all other objects in cluster k, add them to either k1 or k2 by determining the minimum average distance to all other objects in that cluster.
5. Repeat steps 2-4 until N clusters are formed.
I think this algorithm should give you a fairly good clustering, although the efficiency might be pretty bad. To improve the efficiency you could alter step 4 so that you find the minimum distance to only the original object that started the cluster, rather than the average distance to all objects already in the cluster.
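Here is a small Python sketch of this splitting scheme (my own reading of the steps; distance(a, b) stands in for the DistanceTo call from the question):
def divisive_cluster(things, n_clusters, distance):
    """Repeatedly split the cluster containing the overall most distant
    pair until n_clusters clusters exist."""
    clusters = [list(things)]
    while len(clusters) < n_clusters:
        # step 2: the most distant pair (a, b) over the whole set, and its cluster
        best = None
        for ci, c in enumerate(clusters):
            for i in range(len(c)):
                for j in range(i + 1, len(c)):
                    d = distance(c[i], c[j])
                    if best is None or d > best[0]:
                        best = (d, ci, i, j)
        if best is None:                 # every cluster is a singleton; nothing left to split
            break
        _, ci, i, j = best
        k = clusters.pop(ci)
        a, b = k[i], k[j]
        k1, k2 = [a], [b]                # step 3: seed the two new clusters
        for idx, obj in enumerate(k):    # step 4: assign by minimum average distance
            if idx in (i, j):
                continue
            avg1 = sum(distance(obj, o) for o in k1) / len(k1)
            avg2 = sum(distance(obj, o) for o in k2) / len(k2)
            (k1 if avg1 <= avg2 else k2).append(obj)
        clusters += [k1, k2]
    return clusters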
Phylogenetic DNA sequence analysis regularly uses hierarchical clustering on text strings, with [alignment] distance matrices. Here's a nice R tutorial for clustering:
http://www.statmethods.net/advstats/cluster.html
(Shortcut: Go straight to the "Hierarchical Agglomerative" section...)
Here are some other [language] libraries:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
http://code.google.com/p/scipy-cluster/
This approach could help determine how many [k] "natural" clusters there are and which objects to use as roots for the k-means approaches above.
