Partitioning a graph into k similar subgraphs - algorithm

I have a graph which is connected and the edges have weights on them. Less the weight between an edge, the more closer the adjoining vertices are. I want to divide the graph into k smaller subgraphs such that nodes in all the subgraphs are very similar.
In other words, I need to cluster the graph. Can somebody suggest clustering algorithms that are suitable for graphs and have less time comlexity(lesser than O(n^2))?

Clustering is difficult problem that in general could be solved exaclty by brute force only. So in case of efficient algorithms your have to rely on some heuristic approach. In case if you can solve your problem as a vertex clustering task, you can try something like k-means, but it could be slow on larger data sets.
For graph specifically, I would suggest using MCL algorithm on your problem. MCL seems to be efficient in a lot of cases. It uses randomized flow simulation to detect clusters in weighted/unweighted graphs. Basic idea is that flow gathers within a cluster, while links between clusters tend to be less saturated.

Related

Create N-Clusters out of Min spanning tree?

Let say I created a Minimum Spanning Tree out of Graph with M nodes. Is there an algorithm to create N number of clusters.
I'm looking to cut some of the links such as that I end up with N clusters and label them i.e. given a node X I can query in which cluster it belongs.
What I think is once I have the MST, I cut the top/max M-N edges of the MST and I will get N clusters ?
Is my logic correct ?
That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
A tree is a (very sparse) subset of edges of a graph, if you cut based on them you are not taking into consideration a (possible) vast majority of edges in your graph.
Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy with a graph with very similar weight edges will produce likely terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in coments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate sparse matrix/graph representation while implementing something like spectral clustering which is probably going to give much more robust results than single-linkage clustering
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.

What clustering algorithms can I consider for graph?

Suppose I have an undirected weighted connected graph. I want to group vertices that have highest edges' value all together (vertices degree). Using clustering algorithms is one way. What clustering algorithms can I consider for this task? I hope it is clear; any question for clarification, please ask. Thanks.
There are two main approach - giving your graph as an input to existing tool, or using expert knowledge you have on this graph (and its domain) in order to create a representation, and then apply machine learning methods on it.
I'll start with the second approach:
If you have only the nodes and edges (no farther data for each node), you first need to think of a representation for each node\edge. I going to explain about nodes, but it should should be similar for edges' case.
The simplest approach is to represent each node n as a connectivity vector:
Every node will be represented as n=(Ia(n),Ib(n),Ic(n),Id(n),Ie(n)), where Ii(n)=1 in case node n is a 'friend' (neighbor) of node i, and 0 otherwise. (e.g.a=(0,1,1,0,1))
Note that you can decide if a node is a friend of itself.
Second approach, which is quite similar to the first one, is to use edges' weights vector:
n=(W(a,n),W(b,n),W(c,n),W(d,n),W(e,n)) , where W(i,n) is the weight of the edge (i,n).
There are a few more ways to represent nodes, but this is enough in order to run some calculations on it.
After you have this presentation, you can start applying some clustering algorithms on it.
kmeans is considered great for this task, and sklearn has a great implementation. It has some parameters you can (and should) configure (i.e. the distance measure).
The product of kmeans, is k different non-intersecting groups of nodes.
If you want to pass you graph to an algorithm and get some measures, there are more advanced algorithms you can apply. community detection is used to find communities in a graph. Again, there is a nice python implementation in the networkxpackage.

Topological sort of cyclic graph with minimum number of violated edges

I am looking for a way to perform a topological sorting on a given directed unweighted graph, that contains cycles. The result should not only contain the ordering of vertices, but also the set of edges, that are violated by the given ordering. This set of edges shall be minimal.
As my input graph is potentially large, I cannot use an exponential time algorithm. If it's impossible to compute an optimal solution in polynomial time, what heuristic would be reasonable for the given problem?
Eades, Lin, and Smyth proposed A fast and effective heuristic for the feedback arc set problem. The original article is behind a paywall, but a free copy is available from here.
There’s an algorithm for topological sorting that builds the vertex order by selecting a vertex with no incoming arcs, recursing on the graph minus the vertex, and prepending that vertex to the order. (I’m describing the algorithm recursively, but you don’t have to implement it that way.) The Eades–Lin–Smyth algorithm looks also for vertices with no outgoing arcs and appends them. Of course, it can happen that all vertices have incoming and outgoing arcs. In this case, select the vertex with the highest differential between incoming and outgoing. There is undoubtedly room for experimentation here.
The algorithms with provable worst-case behavior are based on linear programming and graph cuts. These are neat, but the guarantees are less than ideal (log^2 n or log n log log n times as many arcs as needed), and I suspect that efficient implementations would be quite a project.
Inspired by Arnauds answer and other interesting topological sorting algorithms have I created the cyclic-toposort project and published it on github. cyclic_toposort does exactly what you desire in that it quickly sorts a directed cyclic graph providing the minimal amount of violating edges. It optionally also provides the maximum groupings of nodes that are on the same topological level (and can therefore be activated at the same time) if desired.
If the problem is still relevant to you then I would be happy if you check out my project and let me know what you think!
This project was useful to my own neural network topology research, so I had to create something like this anyway. I am answering your question this late in case anyone else stumbles upon this thread in search for the same question.

computing vertex connectivity of graph

Is there an algorithm that, when given a graph, computes the vertex connectivity of that graph (the minimum number of vertices to remove in order to separate the graph into two connected graphs). (Note that the graph may be already be disconnected). Thanks!
See:
Determining if a graph is K-vertex-connected
k-vertex connectivity of a graph
When you combine this with binary search you are done.
This book chapter should have everything you need to get started; it is basically a survey over algorithms to determine the edge connectivity and vertex connectivity of graphs, with pseudo code for the algorithms described in it. Page 12 has an overview over the available algorithms alongside complexity analyses and references. Most of the solutions for the general case are flow-based, with the exception of one randomized algorithm. The different general solutions optimize for different properties of the graph, so you can choose the most asymptotically efficient one beforehand. Also, for some classes of graphs, there exist specialized algorithms with better complexity than the general solutions provide.

Finding equal subgraphs

Given:
a directed Graph
Nodes have labels
the same label can appear more than once
edges don't have labels
I want to find the set of largest (connected) subgraphs which are equal taking the labels of the nodes into account.
The graph could be huge (millions of nodes) does anyone know an efficient solution for this?
I'm looking for algorithm and ideally a Java implementation.
Update: Since this problem is most likely NP-complete. I would also be interested in an algorithm that produces an approximated solution.
This seems to be close at least:
Frequent Subgraphs
I strongly suspect that's NP-hard.
Even if all the labels are the same that's at least as hard as graph isomorphism. (Join the two graphs together as a single disconnected graph; are the largest equal subgraphs the two original graphs?)
If identical labels are relatively rare it might be tractable.

Resources