Efficient Algorithm for subgraph enumeration - algorithm

I have searched for related questions about subgraph enumeration, but none of them met my requirement (*). (If I have misunderstood something, please tell me.)
Is there an efficient algorithm or tool for enumerating all connected, unlabelled subgraphs of an undirected parent graph?
In my case, the parent graph is an Internet topology, so the number of nodes could be large. I would like to enumerate all of the connected unlabelled patterns (i.e. subgraphs) of the parent graph.
(*) I have looked at "Efficiently find all connected subgraphs" and "Subgraph enumeration", but they target vertex-labelled induced subgraphs and complete subgraphs respectively. All I want is the connected unlabelled subgraphs.

A topic name that might be helpful is "frequent subgraph mining", which seems to be one name for this problem. There are various tools and algorithms in this area, although of course they may not do exactly what you want.
As others point out in the answers to the two questions you linked, the number of subgraphs of a large graph can be very large. Assuming you actually want to list them, not just count them, it might take a long time.
Edit: the OP has pointed out that the input here is ONE large graph, not a set of smaller ones, so standard graph mining will not work directly.
I still think the general approach can work here. The input set of graphs for mining is some subset of the subgraphs of your data graph. But that subgraph-set is what you want in the first place!
So let's say you pick a size of subgraph that you want (say 6 vertices); then you randomly pick starting vertices in your parent graph (the Internet topology) and 'grow' these seeds, weeding out at each growth step those that don't match. Then repeat for different sizes of subgraph.
Of course, this is a probabilistic algorithm, but it could give you some idea.
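A minimal sketch of the seed-growing idea, under several assumptions that are not in the original answer: the parent graph is a dict mapping each vertex to a set of neighbours, only induced subgraphs are sampled, and patterns are identified by a brute-force canonical form that is only practical for small sizes (it tries all k! relabellings). The function names are hypothetical, the sampling is not uniform over subgraphs, and no weeding-out step is included.

```python
import random
from itertools import permutations

def grow_connected_subset(adj, k, rng):
    """Grow a random connected vertex set of size k from a random seed vertex."""
    seed = rng.choice(sorted(adj))
    chosen = {seed}
    frontier = set(adj[seed])
    while len(chosen) < k:
        if not frontier:
            return None  # the seed's component has fewer than k vertices
        v = rng.choice(sorted(frontier))
        chosen.add(v)
        frontier |= adj[v]
        frontier -= chosen
    return chosen

def canonical_form(adj, nodes):
    """Smallest relabelled edge list of the induced subgraph over all relabellings."""
    nodes = sorted(nodes)
    edges = [(u, v) for u in nodes for v in nodes if u < v and v in adj[u]]
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v])))
                            for u, v in edges))
        if best is None or form < best:
            best = form
    return best

def sample_patterns(adj, k, trials, seed=0):
    """Count how often each unlabelled pattern of size k is hit by random growth."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        nodes = grow_connected_subset(adj, k, rng)
        if nodes is not None:
            key = canonical_form(adj, nodes)
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Two vertex sets that induce isomorphic subgraphs get the same key, so the keys of the returned dict are the distinct unlabelled patterns seen so far. More trials make it more likely, but never certain, that rare patterns have been found.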

Related

Significance of various graph types

There are a lot of named graph types. I am wondering what the criteria behind this categorization are. Are different types applicable in different contexts? Moreover, can a business application (from a design and programming perspective) benefit from these categorizations? Is this analogous to design patterns?
We've given names to common families of graphs for several reasons:
Certain families of graphs have nice, simple properties. For example, trees have numerous useful properties (there's exactly one path between any pair of nodes, they're maximally acyclic, they're minimally connected, etc.) that don't hold of arbitrary graphs. Directed acyclic graphs can be topologically sorted, which normal graphs cannot. If you can model a problem in terms of one of these types of graphs, you can use specialized algorithms on them to extract properties that can't necessarily be obtained from an arbitrary graph.
Certain algorithms run faster on certain types of graphs. Many NP-hard problems on graphs, which as of now don't have any polynomial-time algorithms, can be solved very easily on certain types of graphs. For example, the maximum independent set problem (choose the largest collection of nodes where no two nodes are connected by an edge) is NP-hard, but can be solved in polynomial time for trees and bipartite graphs. The 4-coloring problem (determine whether the nodes of a graph can be colored one of four different colors without assigning the same color to adjacent nodes) is NP-hard in general, but is immediately true for planar graphs (this is the famous four-color theorem).
Certain algorithms are easier on certain types of graphs. A matching in a graph is a collection of edges in the graph where no two edges share an endpoint. Maximum matchings can be used to represent ways of pairing people up into groups. In a bipartite graph, a maximum matching can be used to represent a way of assigning people to tasks such that no person is assigned two tasks and no task is assigned to two people. There are many fast algorithms for finding maximum matchings in bipartite graphs that work quickly and are easy to understand. The corresponding algorithms for general graphs are significantly more complicated and slightly less efficient.
Certain graphs are historically significant. Many named graphs are named after someone who used the graph to disprove a conjecture about properties of arbitrary graphs. The Petersen graph, for example, is a counterexample to many conjectures that seem true about graphs but are actually not.
Certain graphs are useful in theoretical computer science. An expander graph is a graph where, intuitively, any collection of nodes must be connected to a proportionally larger collection of nodes in the graph. Not all graphs are expander graphs. Expander graphs are used in many results in theoretical computer science, such as one proof of the PCP theorem and in the proof that SL = L.
This is not an exhaustive list of why we care about different graph families, but hopefully it helps motivate their usage and study.
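As one illustration of the point about specialized algorithms, here is a minimal sketch of maximum independent set on a tree, which is NP-hard on general graphs but takes linear time here. It assumes the tree is given as a dict mapping each node to a collection of its neighbours; the function name and representation are illustrative choices, not taken from any particular library.

```python
def max_independent_set_size(tree_adj, root):
    """Size of a maximum independent set in a tree (adjacency dict)."""
    # Iterative DFS to get a parent-before-child ordering.
    parent = {root: None}
    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in tree_adj[v]:
            if w != parent[v]:
                parent[w] = v
                stack.append(w)

    # take[v]: best size within v's subtree if v is chosen.
    # skip[v]: best size within v's subtree if v is not chosen.
    take, skip = {}, {}
    for v in reversed(order):  # children are processed before parents
        children = [w for w in tree_adj[v] if w != parent[v]]
        take[v] = 1 + sum(skip[w] for w in children)
        skip[v] = sum(max(take[w], skip[w]) for w in children)
    return max(take[root], skip[root])
```

The dynamic program works because removing a node from a tree splits it into independent subtrees, a property arbitrary graphs do not have.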
Hope this helps!

Is it possible to develop an algorithm to solve a graph isomorphism?

Or will I need to develop an algorithm for every unique graph? The user is given a type of graph, and they are then supposed to use the interface to add nodes and edges to an initial graph. Then they submit the graph and the algorithm is supposed to confirm whether the user's graph matches the given graph.
The algorithm needs to confirm not only the neighbours of each node, but also that each node and each edge has the correct value. The initial graphs will always have a root node, which is where the algorithm can start from.
I am wondering if I can develop the logic for such an algorithm in the general sense, or will I need to actually code a unique algorithm for each unique graph. It isn't a big deal if it's the latter case, since I only have about 20 unique graphs.
Thanks. I hope I was clear.
The graph isomorphism problem might not be hard, but it is very hard to prove that it is not hard.
There are three possibilities for this problem.
1. The graph isomorphism problem is NP-hard.
2. The graph isomorphism problem has a polynomial-time solution.
3. The graph isomorphism problem is neither NP-hard nor in P.
If two graphs are isomorphic, then there exists a permutation of the vertices that realizes the isomorphism. Taking this permutation as a certificate, we can verify in polynomial time that the two graphs are isomorphic to each other. Thus, graph isomorphism is in NP. However, for more than 30 years no one has been able to prove whether this problem is NP-hard or in P. In that sense, the problem is hard to pin down despite its simple description.
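To make the certificate argument concrete, here is a minimal sketch of the polynomial-time check. It assumes undirected simple graphs given as dicts mapping each vertex to a set of neighbours, and a candidate `mapping` from vertices of the first graph to vertices of the second; these names and the representation are illustrative assumptions.

```python
def verify_isomorphism(adj_g, adj_h, mapping):
    """Check that `mapping` (vertex of G -> vertex of H) is an isomorphism."""
    # The mapping must cover exactly the vertices of G and hit exactly those of H.
    if set(mapping) != set(adj_g) or set(mapping.values()) != set(adj_h):
        return False
    # It must be one-to-one.
    if len(set(mapping.values())) != len(mapping):
        return False
    # Every neighbourhood in G must map exactly onto the neighbourhood in H.
    for u in adj_g:
        if {mapping[v] for v in adj_g[u]} != set(adj_h[mapping[u]]):
            return False
    return True
```

Verifying a given mapping touches each vertex and edge a constant number of times. Finding such a mapping is the part for which no polynomial-time algorithm is known.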
If I understand the question properly, you can have ONE single algorithm, which takes one of several reference graphs as an input (in addition to the unknown graph whose isomorphism with the reference graph is to be asserted).
It appears that you want to assert whether a given graph is exactly identical to another graph, rather than whether the graphs are isomorphic relative to a particular set of operations or characteristics. This implies that the algorithm is supplied some specific reference graph, rather than working off a set of "abstract" rules, such as whether neither graph has loops or both graphs are fully connected, even though the graphs may differ in some other fashion.
Edit, following confirmation that:
Yeah, the algorithm would be supplied a reference graph (which is the answer), and will then check the user's graph to see if it is isomorphic (including the values of edges and nodes) to the reference
In that case, yes, it is quite possible to develop a relatively simple algorithm that asserts isomorphism of these two graphs. Note that the considerations mentioned in other remarks and answers, about the problem possibly being NP-hard, merely indicate that a simple algorithm [or any algorithm for that matter] may not solve the problem in a reasonable amount of time for graphs whose size and complexity are too big. However, assuming relatively small graphs and taking advantage (!) of the requirement that the weights of edges and nodes also need to match, the following algorithm should generally be applicable.
General idea:
For each sub-graph that is disconnected from the rest of the graph, identify one (or possibly several) node(s) in the user graph which must match a particular node of the reference graph. By following the paths from this node [in an orderly fashion, more on this below], assert the identity of other nodes and/or determine that there are some nodes which cannot be matched (and hence that the two structures are not isomorphic).
Rough pseudo code:
1. For both the reference and the user-supplied graph, make the list of their Connected Components, i.e. the list of sub-graphs therein which are disconnected from the rest of the graph. Finding these connected components is done by following either a breadth-first or a depth-first path starting at a given node and "marking" all nodes on that path with an arbitrary [typically incremental] element ID number. Once a given path has been fully visited, repeat the operation from any other non-marked node, and do so until there are no more non-marked nodes.
2. Build a "database" of the characteristics of each graph.
This will be useful to identify matching candidates and also to determine, early on, instances of non-isomorphism.
Each "database" would have two kinds of "records", node and edge, with the following fields, respectively:
- node: node_id, Connected_element_Id, node weight, number of outgoing edges, number of incoming edges, sum of outgoing edge weights, sum of incoming edge weights
- edge: edge_id, Connected_element_Id, edge weight, node_id_of_start, node_id_of_end, weight_of_start_node, weight_of_end_node
3. Build a database of the Connected elements of each graph
Each record should have the following fields: Connected_element_id, number of nodes, number of edges, sum of node weights, sum of edge weights.
4. [optionally] Dispatch the easy cases of non-isomorphism:
4.a mismatch of the number of connected elements
4.b mismatch of the number of connected elements, grouped by all fields but the id (number of nodes, number of edges, sum of node weights, sum of edge weights)
5. For each connected element in the reference graph
5.1 Identify candidates for the matching connected element in the user-supplied graph. The candidates must have the same connected element characteristics (number of nodes, number of edges, sum of nodes weights, sum of edges weights) and contain the same list of nodes and edges, again, counted by grouping by all characteristics but the id.
5.2 For each candidate, finalize its confirmation as an isomorphic graph relative to the corresponding connected element in the reference graph. This is done by starting at a candidate node-match, i.e. a node, hopefully unique, which has the exact same characteristics on both graphs. If there is no such node, one needs to try each possible candidate in turn and disqualify it, until isomorphism can be confirmed (or all candidates are exhausted). From the candidate node match, walk the graph in, say, breadth-first order, finding matches for the other nodes on the basis of the direction and weight of the edges and the weight of the nodes.
The main tricks with this algorithm are to keep proper accounting of the candidates (whether a candidate connected element at the higher level or a candidate node at the lower level), and to also remember and mark other identified items as such (and un-mark them if the hypothetical candidate eventually proves not to be feasible).
I realize the above falls short of a formal algorithm description, but it should give you an idea of what is required and possibly a starting point, should you decide to implement it.
Note that the requirement of matching node and edge weights, which may appear to be an added difficulty for asserting isomorphism, effectively simplifies the algorithm: the underlying node/edge characteristics make these more unique and hence make it more likely that the algorithm will a) find unique node candidates and b) either quickly find other candidates on the path and/or quickly assert non-isomorphism.
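The candidate-matching and un-marking steps can be sketched as a small backtracking search. This is a simplified illustration, not the full procedure above: it skips the connected-component bookkeeping and relies on per-node signatures (node weight plus sorted outgoing and incoming edge weights) to prune candidates. The representation is an assumption: `*_nodes` maps node id to weight, and `*_edges` maps a directed `(start, end)` pair to its weight, with at most one edge per ordered pair.

```python
from collections import Counter

def signature(node_w, edges, v):
    """Local fingerprint of v: its weight and sorted out/in edge weights."""
    out_w = sorted(w for (a, _), w in edges.items() if a == v)
    in_w = sorted(w for (_, b), w in edges.items() if b == v)
    return (node_w[v], tuple(out_w), tuple(in_w))

def weighted_isomorphic(ref_nodes, ref_edges, usr_nodes, usr_edges):
    # Cheap rejections first (compare step 4 of the outline).
    if len(ref_nodes) != len(usr_nodes) or len(ref_edges) != len(usr_edges):
        return False
    ref_sig = {v: signature(ref_nodes, ref_edges, v) for v in ref_nodes}
    usr_sig = {v: signature(usr_nodes, usr_edges, v) for v in usr_nodes}
    sig_count = Counter(usr_sig.values())
    if Counter(ref_sig.values()) != sig_count:
        return False

    # Match the reference nodes with the fewest candidates first.
    order = sorted(ref_nodes, key=lambda v: sig_count[ref_sig[v]])
    mapping, used = {}, set()

    def edges_agree(r):
        """All reference edges at r with both ends mapped must exist, same weight."""
        for (a, b), w in ref_edges.items():
            if r in (a, b) and a in mapping and b in mapping:
                key = (mapping[a], mapping[b])
                if key not in usr_edges or usr_edges[key] != w:
                    return False
        return True

    def extend(i):
        if i == len(order):
            return True
        r = order[i]
        for u in usr_nodes:
            if u in used or usr_sig[u] != ref_sig[r]:
                continue
            mapping[r] = u           # mark the tentative match
            used.add(u)
            if edges_agree(r) and extend(i + 1):
                return True
            del mapping[r]           # un-mark: this candidate did not work out
            used.discard(u)
        return False

    return extend(0)
```

When weights make most signatures unique, almost every node has a single candidate and the search rarely backtracks. With many identical weights it can degrade to exponential time, which is acceptable only for small graphs.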

Odd generalization of trees?

When dealing with directed graphs, a tree is a graph in which every node except one (the root) has a single incoming edge. Are there any examples of treelike structures in which every node has at most some constant number of incoming edges; say, at most two, or at most three? I haven't come across any graphs specifically described this way; is there a particular application in which they are used?
In graph theory, a tree is a connected acyclic graph. There is no requirement that every node have one incoming edge. In computer science, we often deal with rooted trees that agree with your definition.
Here is one description of a tree where some of the nodes have a constant number of incoming edges: an assignment of projects to employees, where each employee can be assigned at most three projects.
The most common generalization of a tree is a "DAG" (Directed Acyclic Graph), which is tangentially related but sets no maximum on the size of in-neighborhoods (the arcs leading into a vertex) and does not require a single source (a vertex with an empty in-neighborhood).
From what I know, there's no neat term for what you're looking for. You'll need to find a true mathematician with a deep interest in graph theory to know with any certainty!
Lattices (partially ordered sets) have that property.

Graph Algorithm To Find All Paths Between N Arbitrary Vertices

I have a graph with the following attributes:
Undirected
Not weighted
Each vertex has a minimum of 2 and maximum of 6 edges connected to it.
Vertex count will be < 100
Graph is static and no vertices/edges can be added/removed or edited.
I'm looking for paths between a random subset of the vertices (at least 2). The paths should be simple paths that only go through any vertex once.
My end goal is to have a set of routes so that you can start at one of the subset vertices and reach any of the other subset vertices. It's not necessary to pass through all the subset nodes when following a route.
All of the algorithms I've found (Dijkstra, depth-first search, etc.) seem to deal with paths between two vertices and with shortest paths.
Is there a known algorithm that will give me all the paths (I suppose these are subgraphs) that connect these subset of vertices?
edit:
I've created a (warning! programmer art) animated gif to illustrate what I'm trying to achieve: http://imgur.com/mGVlX.gif
There are two stages: pre-process and runtime.
pre-process
I have a graph and a subset of the vertices (blue nodes)
I generate all the possible routes that connect all the blue nodes
runtime
I can start at any blue node select any of the generated routes and travel along it to reach my destination blue node.
So my task is more about creating all of the subgraphs (routes) that connect all blue nodes, rather than creating a path from A->B.
There are so many ways to approach this that, in order not to confuse things, here's a separate answer addressing the description of your core problem:
Finding ALL possible subgraphs that connect your blue vertices is probably overkill if you're only going to use one at a time anyway. I would rather use an algorithm that finds a single one, but randomly (so not any shortest path algorithm or such, since it will always be the same).
If you want to save one of these subgraphs, you simply have to save the seed you used for the random number generator and you'll be able to produce the same subgraph again.
Also, if you really want to find a bunch of subgraphs, a randomized algorithm is still a good choice since you can run it several times with different seeds.
The only real downside is that you will never know if you've found every single one of the possible subgraphs, but it doesn't really sound like that's a requirement for your application.
So, on to the algorithm: depending on the properties of your graph(s), the optimal algorithm might vary, but you could always start off with a simple random walk, starting from one blue node and walking to another blue one (while making sure you're not walking in your own old footsteps). Then choose a random node on that path and start walking to the next blue node from there, and so on.
For certain graphs, this has very bad worst-case complexity but might suffice for your case. There are of course more intelligent ways to find random paths, but I'd start out easy and see if it's good enough. As they say, premature optimization is evil ;)
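A minimal sketch of this random-walk idea, under some assumptions not stated in the answer: the graph is a dict mapping each vertex to a set of neighbours, a walk that paints itself into a corner causes the whole attempt to be restarted, and `max_restarts` is an arbitrary illustrative cap. The function names are hypothetical.

```python
import random

def _try_route(adj, blue, rng):
    """One attempt: returns a set of edges connecting all blue nodes, or None."""
    targets = set(blue)
    start = rng.choice(blue)
    targets.discard(start)
    visited = {start}
    edges = set()
    while targets:
        # Branch off from a random node already on the route.
        current = rng.choice(sorted(visited))
        seen = set(visited)
        walk = []
        while current not in targets:
            options = [n for n in sorted(adj[current]) if n not in seen]
            if not options:
                return None  # dead end: give up on this attempt
            nxt = rng.choice(options)
            walk.append((current, nxt))
            seen.add(nxt)
            current = nxt
        visited = seen
        edges.update(walk)
        targets.discard(current)
    return edges

def random_route(adj, blue, seed, max_restarts=1000):
    """Random subgraph (a tree) connecting all blue nodes; reproducible per seed."""
    rng = random.Random(seed)
    blue = sorted(blue)
    for _ in range(max_restarts):
        route = _try_route(adj, blue, rng)
        if route is not None:
            return route
    return None
```

Because all randomness comes from `random.Random(seed)` and candidates are sorted before each choice, calling `random_route` again with the same seed reproduces the same route, which is what makes storing only the seed sufficient.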
A simple breadth-first search will give you the shortest paths from one source vertex to all other vertices. So you can perform a BFS starting from each vertex in the subset you're interested in, to get the distances to all other vertices.
Note that in some places, BFS will be described as giving the path between a pair of vertices, but this is not necessary: You can keep running it until it has visited all nodes in the graph.
This algorithm is similar to Johnson's algorithm, but greatly simplified thanks to the fact that your graph is unweighted.
Time complexity: Since there is a constant number of edges per vertex, each BFS will take O(n), and the total will take O(kn), where n is the number of vertices and k is the size of the subset. As a comparison, the Floyd-Warshall algorithm will take O(n^3).
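A minimal sketch of running one BFS per subset vertex, assuming the graph is a dict mapping each vertex to an iterable of neighbours (the function names are illustrative):

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop counts from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:       # first visit is along a shortest path
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def subset_distances(adj, subset):
    """One BFS per vertex of the subset: k searches of O(n) each here."""
    return {s: bfs_distances(adj, s) for s in subset}
```

To recover the actual paths rather than just the distances, one could also record the predecessor of each vertex when it is first visited and follow those links back to the source.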
What you're searching for is (if I understand it correctly) not really all paths, but rather all spanning trees. Read the Wikipedia article about spanning trees to determine if those are what you're looking for. If they are, there is a paper you would probably want to read:
Gabow, Harold N.; Myers, Eugene W. (1978). "Finding All Spanning Trees of Directed and Undirected Graphs". SIAM J. Comput. 7 (3): 280–287.

Finding equal subgraphs

Given:
a directed Graph
Nodes have labels
the same label can appear more than once
edges don't have labels
I want to find the set of largest (connected) subgraphs which are equal taking the labels of the nodes into account.
The graph could be huge (millions of nodes). Does anyone know an efficient solution for this?
I'm looking for an algorithm and ideally a Java implementation.
Update: Since this problem is most likely NP-complete, I would also be interested in an algorithm that produces an approximate solution.
This seems to be close at least:
Frequent Subgraphs
I strongly suspect that's NP-hard.
Even if all the labels are the same that's at least as hard as graph isomorphism. (Join the two graphs together as a single disconnected graph; are the largest equal subgraphs the two original graphs?)
If identical labels are relatively rare it might be tractable.
