Create N-Clusters out of Min spanning tree? - algorithm

Let say I created a Minimum Spanning Tree out of Graph with M nodes. Is there an algorithm to create N number of clusters.
I'm looking to cut some of the links such as that I end up with N clusters and label them i.e. given a node X I can query in which cluster it belongs.
What I think is once I have the MST, I cut the top/max M-N edges of the MST and I will get N clusters ?
Is my logic correct ?

That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.

A tree is a (very sparse) subset of edges of a graph, if you cut based on them you are not taking into consideration a (possible) vast majority of edges in your graph.
Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy with a graph with very similar weight edges will produce likely terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in coments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate sparse matrix/graph representation while implementing something like spectral clustering which is probably going to give much more robust results than single-linkage clustering
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.

Related

How to build a Minimum Spanning Tree given a list of 200 000 nodes?

Problem
I have a list of approximatly 200000 nodes that represent lat/lon position in a city and I have to compute the Minimum Spanning Tree. I know that I need to use Prim algorithm but first of all I need a connected graph. (We can assume that those nodes are in a Euclidian plan)
To build this connected graph I thought firstly to compute the complete graph but (205000*(205000-1)/2 is around 19 billions edges and I can't handle that.
Options
Then I came across to Delaunay triangulation: with the fact that if I build this "Delauney graph", it contains a sub graph that is the Minimum Spanning Tree according and I have a total of around 600000 edges according to Wikipedia [..]it has at most 3n-6 edges. So it may be a good starting point for a Minimum Spanning Tree algorithm.
Another options is to build an approximately connected graph but with that I will maybe miss important edges that will influence my Minimum Spanning Tree.
My question
Is Delaunay a reliable solution in this case? If so, is there any other reliable solution than delaunay triangulation to this problem ?
Further information: this problem has to be solved in C.
The Delaunay triangulation of a point set is always a superset of the EMST of these points. So it is absolutely "reliable"*. And recommended, as it has a size linear in the number of points and can be efficiently built.
*When there are cocircular point quadruples, neither the triangulation nor the EMST are uniquely defined, but this is usually harmless.
There's a big question here of what libraries you have access to and how much you trust yourself as a coder. (I'm assuming the fact that you're new on SO should not be taken as a measure of your overall experience as a programmer - if it is, well, RIP.)
If we assume you don't have access to Delaunay and can't implement it yourself, minimum spanning trees algorithms that pre-suppose a graph aren't necessarily off limits to you. You can have the complete graph conceptually but not actually. Kruskal's algorithm, for instance, assumes you have a sorted list of all edges in your graph; most of your edges will not be near the minimum, and you do not have to compare all n^2 to find the minimum.
You can find minimum edges quickly by estimations that give you a reduced set, then refinement. For instance, if you divide your graph into a 100*100 grid, for any point p in the graph, points in the same grid square as p are guaranteed to be closer than points three or more squares away. This gives a much smaller set of points you have to compare to safely know you've found the closest.
It still won't be easy, but that might be easier than Delaunay.

Partitioning a graph into k similar subgraphs

I have a graph which is connected and the edges have weights on them. Less the weight between an edge, the more closer the adjoining vertices are. I want to divide the graph into k smaller subgraphs such that nodes in all the subgraphs are very similar.
In other words, I need to cluster the graph. Can somebody suggest clustering algorithms that are suitable for graphs and have less time comlexity(lesser than O(n^2))?
Clustering is difficult problem that in general could be solved exaclty by brute force only. So in case of efficient algorithms your have to rely on some heuristic approach. In case if you can solve your problem as a vertex clustering task, you can try something like k-means, but it could be slow on larger data sets.
For graph specifically, I would suggest using MCL algorithm on your problem. MCL seems to be efficient in a lot of cases. It uses randomized flow simulation to detect clusters in weighted/unweighted graphs. Basic idea is that flow gathers within a cluster, while links between clusters tend to be less saturated.

What clustering algorithms can I consider for graph?

Suppose I have an undirected weighted connected graph. I want to group vertices that have highest edges' value all together (vertices degree). Using clustering algorithms is one way. What clustering algorithms can I consider for this task? I hope it is clear; any question for clarification, please ask. Thanks.
There are two main approach - giving your graph as an input to existing tool, or using expert knowledge you have on this graph (and its domain) in order to create a representation, and then apply machine learning methods on it.
I'll start with the second approach:
If you have only the nodes and edges (no farther data for each node), you first need to think of a representation for each node\edge. I going to explain about nodes, but it should should be similar for edges' case.
The simplest approach is to represent each node n as a connectivity vector:
Every node will be represented as n=(Ia(n),Ib(n),Ic(n),Id(n),Ie(n)), where Ii(n)=1 in case node n is a 'friend' (neighbor) of node i, and 0 otherwise. (e.g.a=(0,1,1,0,1))
Note that you can decide if a node is a friend of itself.
Second approach, which is quite similar to the first one, is to use edges' weights vector:
n=(W(a,n),W(b,n),W(c,n),W(d,n),W(e,n)) , where W(i,n) is the weight of the edge (i,n).
There are a few more ways to represent nodes, but this is enough in order to run some calculations on it.
After you have this presentation, you can start applying some clustering algorithms on it.
kmeans is considered great for this task, and sklearn has a great implementation. It has some parameters you can (and should) configure (i.e. the distance measure).
The product of kmeans, is k different non-intersecting groups of nodes.
If you want to pass you graph to an algorithm and get some measures, there are more advanced algorithms you can apply. community detection is used to find communities in a graph. Again, there is a nice python implementation in the networkxpackage.

Why Kruskal clustering generates suboptimal classes?

I was trying to develop a clustering algorithm tasked with finding k classes on a set of 2D points, (with k given as input) using use the Kruskal algorithm lightly modified to find k spanning trees instead of one.
I compared my output to a proposed optimum (1) using the rand index, which for k = 7 resulted on 95.5%. The comparison can be seen on the link below.
Problem:
The set have 5 clearly spaced clusters that are easily classified by the algorithm, but the results are rather disappointing for k > 5, which is when things start to get tricky. I believe that my algorithm is correct, and maybe the data is particularly bad for a Kruskal approach. Single Linkage Agglomerative Clustering, such as Kruskal's, are known to perform badly at some problems since it reduces the assessment of cluster quality to a single similarity between a pair of points.
The idea of the algorithm is very simple:
Make a complete graph with the data set, with the weight of the edges
being the euclidean distance between the pair.
Sort the edge list by weight.
For each edge (in order), add it to the spanning forest if it doesn't form a cycle. Stop when all the edges have been traversed or when the remaining forest has k trees.
Bottomline:
Why is the algorithm failing like that? Is it Kruskal's fault? If so, why precisely? Any suggestions to improve the results without abandoning Kruskal?
(1): Gionis, A., H. Mannila, and P. Tsaparas, Clustering aggregation. ACM Transactions on
Knowledge Discovery from Data(TKDD),2007.1(1):p.1-30.
This is known as single-link effect.
Kruskal seems to be a semi-clever way of computing single-linkage clustering. The naive approach for "hierarchical clustering" is O(n^3), and the Kruskal approach should be O(n^2 log n) due to having to sort the n^2 edges.
Note that SLINK can do single-linkage clustering in O(n^2) runtime and O(n) memory.
Have you tried loading your data set e.g. into ELKI, and compare your result to single-link clustering.
To get bette results, try other linkages (usually in O(n^3) runtime) or density-based clustering such as DBSCAN (in O(n^2) without index, and O(n log n) with index). On this toy data set, epsilon=2 and minPts=5 should work good.
The bridges between clusters that should be different are a classic example of Kruskal getting things wrong. You might try, for each point, overwriting the shortest distance from that point with the second shortest distance from that point - this might increase the lengths in the bridges without increasing other lengths.
By eye, this looks like something K-means might do well - except for the top left, the clusters are nearly circular.
You can try the manhattan distance but to get better you can try a classic line and circle detection algorithm.

Efficient minimal spanning tree in metric space

I have a large set of points (n > 10000 in number) in some metric space (e.g. equipped with Jaccard Distance). I want to connect them with a minimal spanning tree, using the metric as the weight on the edges.
Is there an algorithm that runs in less than O(n2) time?
If not, is there an algorithm that runs in less than O(n2) average time (possibly using randomization)?
If not, is there an algorithm that runs in less than O(n2) time and gives a good approximation of the minimum spanning tree?
If not, is there a reason why such algorithm can't exist?
Thank you in advance!
Edit for the posters below:
Classical algorithms for finding minimal spanning tree don't work here. They have an E factor in their running time, but in my case E = n2 since I actually consider the complete graph. I also don't have enough memory to store all the >49995000 possible edges.
Apparently, according to this: Estimating the weight of metric minimum spanning trees in sublinear time there is no deterministic o(n^2) (note: smallOh, which is probably what you meant by less than O(n^2), I suppose) algorithm. That paper also gives a sub-linear randomized algorithm for the metric minimum weight spanning tree.
Also look at this paper: An optimal minimum spanning tree algorithm which gives an optimal algorithm. The paper also claims that the complexity of the optimal algorithm is not yet known!
The references in the first paper should be helpful and that paper is probably the most relevant to your question.
Hope that helps.
When I was looking at a very similar problem 3-4 years ago, I could not find an ideal solution in the literature I looked at.
The trick I think is to find a "small" subset of "likely good" edges, which you can then run plain old Kruskal on. In general, it's likely that many MST edges can be found among the set of edges that join each vertex to its k nearest neighbours, for some small k. These edges might not span the graph, but when they don't, each component can be collapsed to a single vertex (chosen randomly) and the process repeated. (For better accuracy, instead of picking a single representative to become the new "supervertex", pick some small number r of representatives and in the next round examine all r^2 distances between 2 supervertices, choosing the minimum.)
k-nearest-neighbour algorithms are quite well-studied for the case where objects can be represented as vectors in a finite-dimensional Euclidean space, so if you can find a way to map your objects down to that (e.g. with multidimensional scaling) then you may have luck there. In particular, mapping down to 2D allows you to compute a Voronoi diagram, and MST edges will always be between adjacent faces. But from what little I've read, this approach doesn't always produce good-quality results.
Otherwise, you may find clustering approaches useful: Clustering large datasets in arbitrary metric spaces is one of the few papers I found that explicitly deals with objects that are not necessarily finite-dimensional vectors in a Euclidean space, and which gives consideration to the possibility of computationally expensive distance functions.

Resources