What clustering algorithms can I consider for graph? - algorithm

Suppose I have an undirected weighted connected graph. I want to group vertices that have highest edges' value all together (vertices degree). Using clustering algorithms is one way. What clustering algorithms can I consider for this task? I hope it is clear; any question for clarification, please ask. Thanks.

There are two main approach - giving your graph as an input to existing tool, or using expert knowledge you have on this graph (and its domain) in order to create a representation, and then apply machine learning methods on it.
I'll start with the second approach:
If you have only the nodes and edges (no farther data for each node), you first need to think of a representation for each node\edge. I going to explain about nodes, but it should should be similar for edges' case.
The simplest approach is to represent each node n as a connectivity vector:
Every node will be represented as n=(Ia(n),Ib(n),Ic(n),Id(n),Ie(n)), where Ii(n)=1 in case node n is a 'friend' (neighbor) of node i, and 0 otherwise. (e.g.a=(0,1,1,0,1))
Note that you can decide if a node is a friend of itself.
Second approach, which is quite similar to the first one, is to use edges' weights vector:
n=(W(a,n),W(b,n),W(c,n),W(d,n),W(e,n)) , where W(i,n) is the weight of the edge (i,n).
There are a few more ways to represent nodes, but this is enough in order to run some calculations on it.
After you have this presentation, you can start applying some clustering algorithms on it.
kmeans is considered great for this task, and sklearn has a great implementation. It has some parameters you can (and should) configure (i.e. the distance measure).
The product of kmeans, is k different non-intersecting groups of nodes.
If you want to pass you graph to an algorithm and get some measures, there are more advanced algorithms you can apply. community detection is used to find communities in a graph. Again, there is a nice python implementation in the networkxpackage.

Related

Create N-Clusters out of Min spanning tree?

Let say I created a Minimum Spanning Tree out of Graph with M nodes. Is there an algorithm to create N number of clusters.
I'm looking to cut some of the links such as that I end up with N clusters and label them i.e. given a node X I can query in which cluster it belongs.
What I think is once I have the MST, I cut the top/max M-N edges of the MST and I will get N clusters ?
Is my logic correct ?
That seems a good way to me. You ask whether it's "correct" -- that I can't say, since I don't know what other unstated criteria you have in mind. All you have actually stated that you want is to create N clusters -- which you could also achieve by throwing away the MST, putting vertex 1 in the first cluster, vertex 2 in the second, ..., vertex N-1 in the (N-1)th, and all remaining vertices in the Nth.
If you're using Kruskal's algorithm to build the MST, you can achieve what you're suggesting by simply stopping the algorithm early, as soon as only N components remain.
A tree is a (very sparse) subset of edges of a graph, if you cut based on them you are not taking into consideration a (possible) vast majority of edges in your graph.
Based on the fact that you want to use a M(inimum)ST algorithm to create clusters, it would seem you want to minimize the set of edges that lie in the n-way cut induced by your clustering. Using an MST as a proxy with a graph with very similar weight edges will produce likely terrible results.
Graph clustering is a heavily studied topic, have you considered using an existing library to accomplish this? If you insist on implementing your own algorithm, I would recommend spectral clustering as a starting point as it will produce decent results without much effort.
Edit based on feedback in coments:
If your main bottleneck is the similarity matrix then the following should be considered:
Investigate sparse matrix/graph representation while implementing something like spectral clustering which is probably going to give much more robust results than single-linkage clustering
Investigate pruning edges from the similarity matrix which you think are unimportant. If pruning is combined with a sparse representation of the similarity matrix, this should yield comparable performance to the MST approach while giving a smooth continuum to tune performance vs quality.

Partitioning a graph into k similar subgraphs

I have a graph which is connected and the edges have weights on them. Less the weight between an edge, the more closer the adjoining vertices are. I want to divide the graph into k smaller subgraphs such that nodes in all the subgraphs are very similar.
In other words, I need to cluster the graph. Can somebody suggest clustering algorithms that are suitable for graphs and have less time comlexity(lesser than O(n^2))?
Clustering is difficult problem that in general could be solved exaclty by brute force only. So in case of efficient algorithms your have to rely on some heuristic approach. In case if you can solve your problem as a vertex clustering task, you can try something like k-means, but it could be slow on larger data sets.
For graph specifically, I would suggest using MCL algorithm on your problem. MCL seems to be efficient in a lot of cases. It uses randomized flow simulation to detect clusters in weighted/unweighted graphs. Basic idea is that flow gathers within a cluster, while links between clusters tend to be less saturated.

Graph Topology Profiling

Can anyone suggest me some algorithms that can be used to analyze the graph topology classification?
Input: Adjacency list with raw graph information.
Output : What kind of graph is it? Currently I want to focus only on Pure Types - Daisy chain, Mesh, Ring, Star, Tree.
Which area of algorithm study is responsible for such algorithm? Is it Computational Geometry?
Edit - The size of graph will not exceed 32 nodes. However, there will be redundant links between nodes.
Edit - I understand that my question might be too broad, but at least give me the clue of what is wrong with the question before down-voting it. Or is it because of my reputation :-(
Start by checking that your graph is fully connected.
Then, check the distribution of the nodes' degree:
Ring: All nodes would have degree 2
Daisy chain: all nodes would have degree 2 except for 2 nodes with degree 1 (there are alternative definitions for what a daisy chain is).
Star: Each node would have degree 1, except for one node with degree n-1
Tree: The sum of the degrees is 2*(number of nodes-1). Also, if the highest degree is k, then there are at least k nodes with degree 1.
Mesh: Anything goes...
I don't think there is a 'area' of algorithms that deals with such problems, but the term 'graph classes' is quite common (See for example here), though it is not a formal term.
To classify a new instance, you need a classification system in the first place!
Putting it another way, your graph (the item to classify) fits somewhere in some kind of data structure of graph topologies (the classification system). The system could be as simple as a list; in which case, you carry out the simple algorithm outlined in this other post where the list of topologies is keyed by degree distribution.
A more complex system could be a hierarchical one, similar to biological classification systems. This would only really be necessary for very large numbers of graph topologies, where it would make it faster to classify based on a series of decisions. Essentially a decision tree.
It may be difficult to find much research in this area (for pure graphs) as it's a little hard to think of applications. There are applications for protein fold topologies, but that may not be of interest.

What is a good measure of strength of a link and influence of a node?

In the context of social networks, what is a good measure of strength of a link between two nodes? I am currently thinking that the following should give me what I want:
For two nodes A and B:
Strength(A,B) = (neighbors(A) intersection neighbors(B))/neighbors(A)
where neighbors(X) gives the total number of nodes directly connected to X and the intersection operation above gives the number of nodes that are connected to both A and B.
Of course, Strength(A,B) != Strength(B,A).
Now knowing this, is there a good way to determine the influence of a node? I was initially using the Degree Centrality of a node to determine its "influence" but I somehow think its not a good idea because just because a node has a lot of outgoing links does not mean anything. Those links should be powerful as well. In that case, maybe using an aggregate of the strengths of each node connected to this node is a good idea to estimate its influence? Am I in the right direction? Does anyone have any suggestions?
My Philosophy (and understanding of the terms):
Strength indicates how far A is
willing to do what B has already done
Influence indicates how far A can make B do something (persuasion perhaps?)
Constraints:
Access to only a subgraph. I mean, I am trying to be realistic here because social networks are huge and having a complete view is not so practical.
you might want to check out some more sophisticated notions of distance.
A really cool one is "resistance distance", which lets you view distance as how likely a random path from one node will lead you to another
there are several days of lecture notes plus references to further reading at http://www.cs.yale.edu/homes/spielman/462/.
Few thoughts on this:
When you talk about influence of a node in a graph one centrality measurement that comes to mind it closeness centrality. Closeness centrality looks at the number of shortest paths in a graph the node is on. From an influence point of view, the node that is on the most shortest paths is the node that can share information the easiest, ie its nearer to more nodes than any other.
You also mention using the strengths of each node connected to a node. Maybe you should look at eigenvector centrality which ranks a node highly if its connected to other high degree nodes. This is an undirected version of PageRank.
Some questions that might affect you choice here are:
Is you graph directed?
Do you edges have weight? You mention strength... do you mean weights of some kind?
If you do have weights maybe the next step from a simple degree centrality would be to try a weighted degree centrality approach. Thus, just having a high number of connections doesn't automatically make you the most influential.

Graph Isomorphism

Is there an algorithm or heuristics for graph isomorphism?
Corollary: A graph can be represented in different different drawings.
What s the best approach to find different drawing of a graph?
It is a hell of a problem.
In general, the basic idea is to simplify the graph into a canonical form, and then perform comparison of canonical forms. Spanning trees are generated with this objective, but spanning trees are not unique, so you need to have a canonical way to create them.
After you have canonical forms, you can perform isomorphism comparison (relatively) easy, but that's just the start, since non-isomorphic graphs can have the same spanning tree. (e.g. think about a spanning tree T and a single addition of an edge to it to create T'. These two graphs are not isomorph, but they have the same spanning tree).
Other techniques involve comparing descriptors (e.g. number of nodes, number of edges), which can produce false positive in general.
I suggest you to start with the wiki page about the graph isomorphism problem. I also have a book to suggest: "Graph Theory and its applications". It's a tome, but worth every page.
As from you corollary, every possible spatial distribution of a given graph's vertexes is an isomorph. So two isomorph graphs have the same topology and they are, in the end, the same graph, from the topological point of view. Another matter is, for example, to find those isomorph structures enjoying particular properties (e.g. with non crossing edges, if exists), and that depends on the properties you want.
One of the best algorithms out there for finding graph isomorphisms is VF2.
I've written a high-level overview of VF2 as applied to chemistry - where it is used extensively. The post touches on the differences between VF2 and Ullmann. There is also a from-scratch implementation of VF2 written in Java that might be helpful.
A very similar problem - graph automorphism - can be solved by saucy, which is available in source code. This finds all symmetries of a graph. If you have two graphs, join them into one and any isomorphism can be discovered as an automorphism of the join.
Disclaimer: I am one of co-authors of saucy.
There are algorithms to do this -- however, I have not had cause to seriously investigate them as of yet. I believe Donald Knuth is either writing or has written on this subject in his Art of Computing series during his second pass at (re)writing them.
As for a simple way to do something that might work in practice on small graphs, I would recommend counting degrees, then for each vertex, also note the set of degrees for those vertexes that are adjacent. This will then give you a set of potential vertex isomorphisms for each point. Then just try all those (via brute force, but choosing the vertexes in increasing order of potential vertex isomorphism sets) from this restricted set. Intuitively, most graph isomorphism can be practically computed this way, though clearly there would be degenerate cases that might take a long time.
I recently came across the following paper : http://arxiv.org/abs/0711.2010
This paper proposes "A Polynomial Time Algorithm for Graph Isomorphism"
My project - Griso - at sf.net: http://sourceforge.net/projects/griso/ with this description:
Griso is a graph isomorphism testing utility written in C++. It is based on my own POLYNOMIAL-TIME (in this point the salt of the project) algorithm. See Griso's sample input/output on http://funkybee.narod.ru/graphs.htm page.
nauty and Traces
nauty and Traces are programs for computing automorphism groups of graphs and digraphs [*]. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.
AutGroupGraph command in GRAPE's package of GAP.
bliss: another symmetry and canonical labeling program.
conauto: a graph ismorphism package.
As for heuristics: i've been fantasising about a modified Ullmann's algorithm, where you don't only use breadth first search but mix it with depth first search the way, that first you use breadth first search intensively, than you set a limit for breadth analysis and go deeper after checking a few neighbours, and you lower the breadh every step at some amount. This is practically how i find my way on a map: first locate myself with breadth first search, then search the route with depth first search - largely, and this is the best evolution of my brain has ever invented. :) On the long term some intelligence may be added for increasing breadth first search neighbour count at critical vertexes - for example where there are a large number of neighbouring vertexes with the same edge count. Like checking your actual route sometimes with the car (without a gps).
I've found out that the algorithm belongs in the category of k-dimension Weisfeiler-Lehman algorithms, and it fails with regular graphs. For more here:
http://dabacon.org/pontiff/?p=4148
Original post follows:
I've worked on the problem to find isomorphic graphs in a database of graphs (containing chemical compositions).
In brief, the algorithm creates a hash of a graph using the power iteration method. There might be false positive hash collisions but the probability of that is exceedingly small (i didn't had any such collisions with tens of thousands of graphs).
The way the algorithm works is this:
Do N (where N is the radius of the graph) iterations. On each iteration and for each node:
Sort the hashes (from the previous step) of the node's neighbors
Hash the concatenated sorted hashes
Replace node's hash with newly computed hash
On the first step, a node's hash is affected by the direct neighbors of it. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them (on the first step) in the internal hashes that you calculate for each node.
There is more background here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
You can find the source code of it here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py

Resources