I'm looking for an algorithm to partition a graph into groups of vertices (each of which would be connected if taken as its own graph) of maximum size n while keeping the number of groups minimized. I need this algorithm to partition a Delaunay triangulation into regions with an equal number of vertices in each region. If anyone has a better idea for tackling this problem, let me know!
It seems you're looking for a solution to the uniform k-way graph partitioning problem, where, given a graph G(V,E), the goal is to partition the vertex-set V into a series of k disjoint subsets V1, V2, ..., Vk such that the size of each subset Vi is approximately |V|/k. Additionally, it's typical to require "nice" partitions, where the sum of the edge weights between any two subsets Vi and Vj is minimised.
Firstly, it's well known that this problem is NP-complete, so efficient exact algorithms are not to be expected. On the upside, a number of effective heuristics have been developed that perform well on many practical problems.
Specifically, schemes based on an iterative multi-level approach have been very successful in practice. In such methods, a hierarchy of graphs is created by incrementally merging adjacent vertices to form a smaller "coarse" graph at each level of the hierarchy. An initial partition is formed when the coarse graph becomes small enough, with this partition then "mapped" back down the hierarchy onto the original graph. An iterative refinement of the partition is typically performed as part of this mapping process, generally leading to pseudo-optimal partitions.
The implementation of such algorithms is non-trivial, but a number of existing packages provide these routines; in particular, the METIS package is used extensively.
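For example, assuming the pymetis Python bindings for METIS are available, a k-way call might look like the following sketch. The small adjacency list is just a stand-in for your triangulation's connectivity, and the part count k is something you would derive from your size bound n.

```python
# Sketch: k-way partitioning with METIS via the pymetis bindings
# (assumes `pip install pymetis`; the graph below is a toy stand-in
# for the adjacency structure of a Delaunay triangulation).
import pymetis

# Adjacency list: adjacency[i] lists the neighbours of vertex i.
adjacency = [
    [1, 2],        # vertex 0
    [0, 2, 3],     # vertex 1
    [0, 1, 3],     # vertex 2
    [1, 2, 4],     # vertex 3
    [3],           # vertex 4
]

k = 2  # number of parts, roughly ceil(|V| / n) for a size bound of n
edge_cuts, membership = pymetis.part_graph(k, adjacency=adjacency)

print("edges cut:", edge_cuts)
print("vertex -> part:", membership)  # e.g. [0, 0, 0, 1, 1]
```

Note that METIS balances part sizes but by default does not guarantee that each part is connected, so for your connectivity requirement you may need to check the parts afterwards (METIS itself offers a contiguity option for this).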
Related
Let's say I created a minimum spanning tree out of a graph with M nodes. Is there an algorithm to create N clusters from it?
I'm looking to cut some of the links such that I end up with N clusters and can label them, i.e. given a node X I can query which cluster it belongs to.
What I think is: once I have the MST, I cut the N-1 heaviest (top/max weight) edges of the MST and I will get N clusters?
Is my logic correct?
That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
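A minimal sketch of that early-stopping idea (a hypothetical helper name and a plain union-find; edges are given as (weight, u, v) tuples):

```python
# Sketch: Kruskal's algorithm stopped early, leaving exactly n_clusters
# components. Vertices are numbered 0 .. num_vertices-1.

def cluster_by_early_kruskal(num_vertices, edges, n_clusters):
    parent = list(range(num_vertices))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = num_vertices
    for weight, u, v in sorted(edges):
        if components == n_clusters:  # stop early: n_clusters components remain
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1

    # Label each vertex by the representative of its component.
    return [find(v) for v in range(num_vertices)]

# Example: 5 vertices; given a node X, labels[X] tells you its cluster.
edges = [(1.0, 0, 1), (2.0, 1, 2), (9.0, 2, 3), (1.5, 3, 4)]
labels = cluster_by_early_kruskal(5, edges, n_clusters=2)
print(labels)  # vertices 0-2 share one label, vertices 3-4 another
```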
A tree is a (very sparse) subset of the edges of a graph; if you cut based on it, you are not taking into consideration the (possibly vast) majority of the edges in your graph.
Based on the fact that you want to use an M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy on a graph whose edge weights are all very similar will likely produce terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in comments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate a sparse matrix/graph representation while implementing something like spectral clustering, which will probably give much more robust results than single-linkage clustering (a sketch follows after this list).
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.
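As a concrete sketch of both points, assuming scikit-learn and SciPy are available: spectral clustering run on a precomputed, sparse (already pruned) similarity matrix. The similarity values below are made up purely for illustration.

```python
# Sketch: spectral clustering on a sparse, precomputed similarity matrix.
# Only the "important" edges are stored; everything else is implicitly zero,
# i.e. already pruned away.
from scipy.sparse import csr_matrix
from sklearn.cluster import SpectralClustering

rows = [0, 1, 1, 2, 0, 2, 3, 4, 4, 5, 2, 3]
cols = [1, 0, 2, 1, 2, 0, 4, 3, 5, 4, 3, 2]
sims = [0.9, 0.9, 0.8, 0.8, 0.7, 0.7, 0.9, 0.9, 0.6, 0.6, 0.1, 0.1]
affinity = csr_matrix((sims, (rows, cols)), shape=(6, 6))

model = SpectralClustering(n_clusters=2, affinity="precomputed",
                           assign_labels="kmeans", random_state=0)
labels = model.fit_predict(affinity)
print(labels)  # one cluster label per vertex
```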
There is a specific problem in operations research that can be converted into a graph:
I would like to partition the graph into n sub-graphs such that each sub-graph is connected and the sum of its node weights is between two numbers, a and b.
Since I'm coming from an operations research background, I first thought about proposing a MIP model to solve the problem, but I quickly ran into trouble modelling the constraints. I understand the usual edge-cut algorithms and have used them to tackle the problem.
I'll describe my thought process so far, hoping you have some ideas on this problem:
I generate a spanning tree, then remove some edges (cutting n-1 edges of the tree yields n partitions) so the number of partitions is right; from there I need to add or delete nodes from each partition to reach a feasible solution. At this stage I feel like there should be a logical procedure, but when I tried to describe one on paper I couldn't get it right.
I would also like to know whether, for a given graph, the problem of partitioning it into n subgraphs with weights between a and b is even feasible.
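A very rough sketch of the spanning-tree idea described above, assuming networkx, with made-up node weights and bounds; the repair step of moving nodes between partitions is left open, as in the question.

```python
# Sketch: cut n-1 edges of a spanning tree and check the weight bound [a, b]
# on each resulting component (networkx assumed; weights and bounds are made up).
import random
import networkx as nx

def tree_cut_partition(G, n_parts, a, b, weight_attr="weight"):
    T = nx.minimum_spanning_tree(G)               # any spanning tree works as a start
    edges = list(T.edges())
    for e in random.sample(edges, n_parts - 1):   # removing n-1 tree edges -> n components
        T.remove_edge(*e)

    parts = [set(c) for c in nx.connected_components(T)]
    feasible = all(a <= sum(G.nodes[v][weight_attr] for v in part) <= b
                   for part in parts)
    return parts, feasible

# Toy example: a path of 6 nodes with weights 1..6, target 2 parts with sums in [9, 12].
G = nx.path_graph(6)
for v in G.nodes:
    G.nodes[v]["weight"] = v + 1
parts, ok = tree_cut_partition(G, n_parts=2, a=9, b=12)
print(parts, ok)   # if not feasible, nodes would have to be shifted between parts
```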
I have a graph which is connected, and the edges have weights on them. The smaller the weight of an edge, the closer its two endpoints are. I want to divide the graph into k smaller subgraphs such that the nodes within each subgraph are very similar.
In other words, I need to cluster the graph. Can somebody suggest clustering algorithms that are suitable for graphs and have low time complexity (less than O(n^2))?
Clustering is a difficult problem that, in general, can only be solved exactly by brute force. So for efficient algorithms you have to rely on some heuristic approach. If you can cast your problem as a vertex clustering task, you can try something like k-means, but it can be slow on larger data sets.
For graphs specifically, I would suggest using the MCL algorithm on your problem. MCL seems to be efficient in a lot of cases. It uses randomized flow simulation to detect clusters in weighted/unweighted graphs. The basic idea is that flow gathers within a cluster, while links between clusters tend to be less saturated.
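A minimal NumPy sketch of the MCL idea follows (dense matrices and typical default parameters, so it is illustrative rather than tuned for large graphs). Note that MCL expects similarities, so edge weights where smaller means closer would need to be inverted first.

```python
# Sketch: Markov Cluster (MCL) algorithm on a dense adjacency matrix.
# Expansion spreads flow across the graph, inflation strengthens flow
# within clusters and weakens flow between them.
import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, max_iter=100, tol=1e-6):
    A = np.asarray(adjacency, dtype=float)
    A = A + np.eye(len(A))                 # self-loops stabilise the iteration
    M = A / A.sum(axis=0)                  # column-stochastic transition matrix

    for _ in range(max_iter):
        prev = M
        M = np.linalg.matrix_power(M, expansion)   # expansion step
        M = M ** inflation                         # inflation step
        M = M / M.sum(axis=0)                      # re-normalise columns
        if np.abs(M - prev).max() < tol:
            break

    # Rows of "attractor" vertices (non-zero diagonal) describe the clusters.
    clusters = []
    for i in np.nonzero(np.diag(M) > 1e-8)[0]:
        members = frozenset(int(j) for j in np.nonzero(M[i] > 1e-8)[0])
        if members not in clusters:
            clusters.append(members)
    return clusters

# Two triangles joined by a single edge should come out as two clusters.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(mcl(A))   # e.g. [frozenset({0, 1, 2}), frozenset({3, 4, 5})]
```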
As an exercise I have to build a satnav system which plans the shortest and fastest routes from location to location. It has to be as fast as I can possibly make it without using too much memory.
I am having trouble deciding which structure to use to represent the graph. I understand that a matrix is better for dense graphs and that a list would be better for sparse graphs. I'm leaning more towards using a list, as I'm assuming that adding vertices will be the most taxing part of this program.
I just want to get some opinions. If I were to look at a typical road map as a graph, with the various locations being nodes and the roads being edges, would you consider it to be sparse or dense? Which structure seems better in this scenario?
I would go for lists because it's only a one-time investment. The good thing about them is that you can iterate over all the adjacent vertices of a node faster than with a matrix, which is an important and frequent step in most shortest-path algorithms.
So where the matrix takes O(N) to enumerate a node's neighbours, an adjacency list takes only O(k), where k is the number of adjacent vertices.
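For example, a minimal adjacency-list representation in Python (with made-up road segments) makes the O(k) neighbour iteration obvious:

```python
# Sketch: adjacency-list representation of a small road network.
# Each node maps to a list of (neighbour, distance) pairs, so iterating
# over the k neighbours of a node is O(k) rather than O(N).
road_graph = {
    "A": [("B", 4.0), ("C", 2.5)],
    "B": [("A", 4.0), ("D", 1.0)],
    "C": [("A", 2.5), ("D", 7.0)],
    "D": [("B", 1.0), ("C", 7.0)],
}

# The inner loop of a Dijkstra-style algorithm just walks this list:
for neighbour, distance in road_graph["A"]:
    print(neighbour, distance)
```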
I have a large set of points (n > 10000 in number) in some metric space (e.g. equipped with Jaccard Distance). I want to connect them with a minimal spanning tree, using the metric as the weight on the edges.
Is there an algorithm that runs in less than O(n^2) time?
If not, is there an algorithm that runs in less than O(n^2) average time (possibly using randomization)?
If not, is there an algorithm that runs in less than O(n^2) time and gives a good approximation of the minimum spanning tree?
If not, is there a reason why such an algorithm can't exist?
Thank you in advance!
Edit for the posters below:
Classical algorithms for finding a minimum spanning tree don't work here. They have a factor of E in their running time, but in my case E = n^2, since I actually consider the complete graph. I also don't have enough memory to store all of the > 49,995,000 possible edges.
Apparently, according to this paper: Estimating the weight of metric minimum spanning trees in sublinear time, there is no deterministic o(n^2) algorithm (note: little-o, which I suppose is what you meant by "less than O(n^2)"). That paper also gives a sublinear randomized algorithm for estimating the weight of a metric minimum spanning tree.
Also look at this paper: An optimal minimum spanning tree algorithm, which gives a provably optimal algorithm. Interestingly, the paper also notes that the complexity of this optimal algorithm is not yet known!
The references in the first paper should be helpful and that paper is probably the most relevant to your question.
Hope that helps.
When I was looking at a very similar problem 3-4 years ago, I could not find an ideal solution in the literature I looked at.
The trick I think is to find a "small" subset of "likely good" edges, which you can then run plain old Kruskal on. In general, it's likely that many MST edges can be found among the set of edges that join each vertex to its k nearest neighbours, for some small k. These edges might not span the graph, but when they don't, each component can be collapsed to a single vertex (chosen randomly) and the process repeated. (For better accuracy, instead of picking a single representative to become the new "supervertex", pick some small number r of representatives and in the next round examine all r^2 distances between 2 supervertices, choosing the minimum.)
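A rough sketch of that two-step idea: build candidate edges from each vertex's k nearest neighbours, then run plain Kruskal on just those. The brute-force kNN below is only a placeholder so the example runs; to actually beat O(n^2) you would replace it with a sub-quadratic neighbour search suited to your metric.

```python
# Sketch: approximate MST from k-nearest-neighbour candidate edges + Kruskal.
# The brute-force kNN here is O(n^2) and serves only as a stand-in for a real
# index structure (ball tree, LSH, ...) appropriate to the metric.
import heapq

def knn_candidate_edges(points, dist, k):
    edges = []
    for i, p in enumerate(points):
        nearest = heapq.nsmallest(
            k, ((dist(p, q), j) for j, q in enumerate(points) if j != i))
        edges.extend((d, i, j) for d, j in nearest)
    return edges

def kruskal_forest(num_points, edges):
    parent = list(range(num_points))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for d, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v, d))
    return forest   # may be a forest, not a tree: collapse components and repeat

# Toy example with 1-D points and absolute difference as the metric.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.3]
edges = knn_candidate_edges(points, dist=lambda a, b: abs(a - b), k=2)
print(kruskal_forest(len(points), edges))
```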
k-nearest-neighbour algorithms are quite well-studied for the case where objects can be represented as vectors in a finite-dimensional Euclidean space, so if you can find a way to map your objects down to that (e.g. with multidimensional scaling) then you may have luck there. In particular, mapping down to 2D allows you to compute a Voronoi diagram, and MST edges will always be between adjacent faces. But from what little I've read, this approach doesn't always produce good-quality results.
Otherwise, you may find clustering approaches useful: Clustering large datasets in arbitrary metric spaces is one of the few papers I found that explicitly deals with objects that are not necessarily finite-dimensional vectors in a Euclidean space, and which gives consideration to the possibility of computationally expensive distance functions.