How can I find Maximum Common Subgraph of two graphs?

Hi, I need help finding a graph algorithm.
I'm working on the following equation related to distance functions:
$$d(g_1, g_2) = 1 - \frac{|mcs(g_1, g_2)|}{|g_1| + |g_2| - |mcs(g_1, g_2)|}$$
where
d(g1, g2) is a distance function based on the maximum common subgraph,
g1, g2 are two graphs,
mcs(g1, g2) is the maximum common subgraph of g1 and g2, i.e. the largest graph (by some measure involving the number of nodes and edges) contained in both graphs,
|g1| is the cardinality of graph g1,
|g2| is the cardinality of graph g2.
My question: how can I calculate the MCS?
I searched the internet, but most of the algorithms are complicated. Does anyone know where I can get a simple algorithm to program this equation in MATLAB?

The problem is NP-hard (1).
The reduction is from the Clique Problem (2). Given an instance of the Clique Problem, a graph G=(V,E), create the complete graph G'=(V,E') with E' = {(u,v) | u,v in V, u != v}.
The solution to the maximum clique problem on G is the same as the solution to the maximum common subgraph problem for G and G'. Since the clique problem is NP-hard, so is this problem.
Thus, there is no known polynomial-time solution to this problem.
If you are looking for an exact algorithm, you could try an exhaustive search and/or a branch & bound approach to solve it. Sorry for the bad news, but at least you know not to look for something that (probably) doesn't exist (unless P=NP, of course).
EDIT: exponential brute force solution to the problem:
You can just check all possible subsets, and check if it is a feasible solution.
Pseudo Code:
findMCS(vertices,G1,G2,currentSubset):
    if vertices is empty: //base clause, no more candidates to check
        if isCommonSubgraph(G1,G2,currentSubset):
            return clone(currentSubset)
        else:
            return {}
    v <- vertices.pop() //take a look at the first element
    cand1 <- findMCS(vertices,G1,G2,currentSubset) //find MCS if v is NOT in the subset
    cand2 <- {} //stays empty if adding v makes the subset infeasible
    currentSubset.append(v)
    if isCommonSubgraph(G1,G2,currentSubset): //find MCS if v is in the subset
        cand2 <- findMCS(vertices,G1,G2,currentSubset)
    currentSubset.remove(v) //clean up environment before getting back from recursive call
    vertices.push(v) //restore the candidate list for the caller
    return (|cand1| > |cand2| ? cand1 : cand2) //return the maximal subset from all candidates
The complexity of the above is O(2^n) calls in the worst case (it checks all possible subsets), each doing a polynomial-time feasibility check. Invoke it with findMCS(G1.vertices, G1, G2, []) (where [] is an empty list).
Note:
isCommonSubgraph(G1,G2,currentSubset) is an easy-to-calculate method that returns true if and only if currentSubset is a common subgraph of G1 and G2.
|cand1| and |cand2| are the sizes of these lists.
(1) Assuming that the maximum common subgraph is a subset U of V such that for each u1,u2 in U, (u1,u2) is in E1 if and only if (u1,u2) is in E2 (intuitively, a maximum subset of the vertices that share the exact same edges in the two graphs).
(2) Clique Problem: given an instance G=(V,E), find a maximum subset U of V such that for each u1,u2 in U: u1 = u2 or (u1,u2) is in E.
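For reference, the brute-force search above could be written in Python roughly as follows. This is only a sketch under the assumption of footnote (1): both graphs share the same vertex set and edges are stored as sets of frozenset pairs. The names find_mcs, is_common_subgraph and mcs_distance are illustrative, and taking |g1| and |g2| to be vertex counts is also an assumption.

```python
from itertools import combinations


def is_common_subgraph(edges1, edges2, subset):
    """True iff every vertex pair in `subset` is adjacent in both graphs or in neither."""
    return all(
        (frozenset(pair) in edges1) == (frozenset(pair) in edges2)
        for pair in combinations(subset, 2)
    )


def find_mcs(vertices, edges1, edges2, current=()):
    """Largest vertex subset inducing the same edges in both graphs (exponential time)."""
    if not vertices:
        return list(current)  # `current` was verified before we recursed into it
    v, rest = vertices[0], vertices[1:]
    best = find_mcs(rest, edges1, edges2, current)  # best subset without v
    extended = current + (v,)
    if is_common_subgraph(edges1, edges2, extended):  # prune infeasible branches
        with_v = find_mcs(rest, edges1, edges2, extended)  # best subset with v
        if len(with_v) > len(best):
            best = with_v
    return best


def mcs_distance(size_g1, size_g2, size_mcs):
    """d(g1, g2) = 1 - |mcs| / (|g1| + |g2| - |mcs|)."""
    return 1 - size_mcs / (size_g1 + size_g2 - size_mcs)


vertices = [1, 2, 3, 4]
edges1 = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
edges2 = {frozenset(e) for e in [(1, 2), (2, 3), (1, 4)]}
mcs = find_mcs(vertices, edges1, edges2)
distance = mcs_distance(len(vertices), len(vertices), len(mcs))
```

Passing slices and tuples instead of mutating shared lists avoids the clean-up steps of the pseudocode, at the cost of some copying.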

The backtracking search algorithm proposed by James J. McGregor may be utilized to identify the MCS between two graphs.

You can't even check efficiently whether one graph is a subgraph of the other one: that is the subgraph isomorphism problem, which is known to be NP-complete. Hence you can't find the maximum common subgraph in polynomial time (unless P=NP), because you can't check the isomorphism property in polynomial time.

The main problem is finding a correspondence between nodes in the original graphs (essentially a renumbering of the vertices). For instance, if we have node p in graph g1 and node q in graph g2 where p and q are equivalent, we'd like to map them to a node s in the common subgraph, c.
The reason that the Clique Problem is so difficult is that, without any way of checking whether two nodes in different graphs actually refer to the same node, we have to try all possible combinations of pairs of nodes and check if each pair is consistent and represents the "best" correspondence.
Since the nodes in these graphs represent geographic locations, we should be able to come up with a reasonable distance metric that tells us how likely it is that a node in one graph is the same as any node in the other graph. Since the GPS coordinates of the two nodes are probably not identical, we need to make some assumptions based on the problem.
If we have a map of the region in which the data points occur, represented as a graph m, we can renumber or rename the nodes in g1 and g2 to correspond to their closest equivalent in m.
Distance (either between the original graphs and m or between points in g1 and g2) can either be the Euclidean distance or the Manhattan distance, depending on what makes more sense for your graphs.
You'll have to be careful in deciding how far apart two nodes can be and still be considered equivalent. Too small and you won't get any matches; too large and your entire graph could be condensed into one node.
Two or more nodes in an original graph could possibly all map to the same node in c, for instance if the location data is updated frequently in relation to the distance between nodes.
Conversely, an edge between a pair of successive nodes in an original graph could also map to a path containing multiple edges if the update frequency is low in relation to the distances. So you'll have to figure out whether it makes sense to introduce these intermediate nodes into the common graph or treat the whole path as a single edge.
Once you've got the renumbering of the nodes you can use the method that Jens suggests to find the intersection of the renumbered graphs. This is all very general since I don't have a lot of details about your specific problem, but hopefully it's enough to get you started.
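A rough Python sketch of the renumbering idea, assuming every node carries (x, y) coordinates, Euclidean distance is appropriate, and a reference map m is available. The function names, the max_dist tolerance and the sample data are made up for illustration.

```python
import math


def snap_to_reference(nodes, reference, max_dist):
    """Map each node id to its nearest reference node id, if it is close enough."""
    mapping = {}
    for node_id, xy in nodes.items():
        ref_id = min(reference, key=lambda r: math.dist(xy, reference[r]))
        if math.dist(xy, reference[ref_id]) <= max_dist:
            mapping[node_id] = ref_id
    return mapping


def renamed_edges(edges, mapping):
    """Rewrite edges in terms of reference ids, dropping unmatched or collapsed ones."""
    renamed = set()
    for u, v in edges:
        if u in mapping and v in mapping and mapping[u] != mapping[v]:
            renamed.add(frozenset((mapping[u], mapping[v])))
    return renamed


# Hypothetical data: a reference map m and two observed graphs g1, g2.
m = {"A": (0, 0), "B": (10, 0), "C": (10, 10)}
g1_nodes = {1: (0.2, 0.1), 2: (9.8, 0.3), 3: (10.1, 9.9)}
g1_edges = [(1, 2), (2, 3)]
g2_nodes = {"p": (0.1, -0.2), "q": (10.2, 0.1), "r": (50, 50)}
g2_edges = [("p", "q"), ("q", "r")]

map1 = snap_to_reference(g1_nodes, m, max_dist=1.0)
map2 = snap_to_reference(g2_nodes, m, max_dist=1.0)
common = renamed_edges(g1_edges, map1) & renamed_edges(g2_edges, map2)
```

Here node "r" is too far from every reference node and is left unmatched, so only the edge between A and B survives the intersection.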

Related

Find Minimum Vertex Connected Sub-graph

First of all, I have to admit I'm not good at graph theory.
I have a weakly connected directed graph G=(V,E) where |V| is about 16 million and |E| is about 180 million.
For a given set S, which is a subset of V (the size of S will be around 30), is it possible to find a weakly connected sub-graph G'=(V',E') where S is a subset of V', while keeping |V'| and |E'| as small as possible?
The graph G may change and I hope there's a way to find the sub-graph in real time. (When a process is writing into G, G will be locked, so don't worry about G changing while your sub-graph calculation is still running.)
My current solution is to find the shortest path for each pair of vertices in S and merge those paths to get the sub-graph. The result is OK but the running time is pretty expensive.
Is there a better way to solve this problem?
If you're happy with the results from your current approach, then it's certainly possible to do at least as well a lot faster:
Assign each vertex in S to a set in a disjoint set data structure: https://en.wikipedia.org/wiki/Disjoint-set_data_structure. Then:
Do a breadth-first-search of the graph, starting with S as the root set.
When the search discovers a new vertex, remember its predecessor and assign it to the same set as its predecessor.
When you discover an edge that connects two sets, merge the sets and follow the predecessor links to add the connecting path to G'.
Another way to think about doing exactly the same thing:
Sort all the edges in E according to their distance from S. You can use BFS discovery order for this.
Use Kruskal's algorithm to generate a spanning tree for G, processing the edges in that order (https://en.wikipedia.org/wiki/Kruskal%27s_algorithm).
Pick a root in S, and remove any subtrees that don't contain a member of S. When you're done, every leaf will be in S.
This will not necessarily find the smallest possible subgraph, but it will minimize its maximum distance from S.
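A minimal Python sketch of the BFS-plus-disjoint-set variant, assuming the graph fits in memory as an adjacency dict that lists neighbours in both directions (so weak connectivity is treated as undirected connectivity). The name connect_terminals is illustrative.

```python
from collections import deque


def connect_terminals(adj, terminals):
    """Return (vertices, edges) of a connected subgraph containing all terminals."""
    parent = {}  # disjoint-set forest over discovered vertices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pred = {s: None for s in terminals}  # BFS predecessor links
    for s in pred:
        parent[s] = s
    queue = deque(pred)  # the BFS starts from all terminals at once
    kept, edges = set(pred), set()
    components = len(pred)

    def add_path(v):
        # Walk predecessor links until reaching a vertex that is already kept.
        while v not in kept:
            kept.add(v)
            edges.add(frozenset((v, pred[v])))
            v = pred[v]

    while queue and components > 1:
        u = queue.popleft()
        for v in adj[u]:
            if v not in pred:  # newly discovered: join the predecessor's set
                pred[v] = u
                parent[v] = find(u)
                queue.append(v)
            elif find(u) != find(v):  # this edge connects two different sets
                parent[find(u)] = find(v)
                components -= 1
                add_path(u)
                add_path(v)
                edges.add(frozenset((u, v)))
    return kept, edges


adj = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
sub_vertices, sub_edges = connect_terminals(adj, [0, 4])
```

The search stops as soon as all members of S are in one set, so on a large graph it typically touches only the region around S.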

Minimizing the number of connected-checks in finding a shortest path in an implicit graph

I'm quite surprised I couldn't find anything on this anywhere, it seems to be a problem that should be quite well known:
Consider the Euclidean shortest path problem, in two dimensions. Given a set of obstacle polygons P and two points a and b, we want to find the shortest path from a to b not intersecting the (interior of) any p in P.
To solve this, one can create the visibility graph for this problem, the graph whose nodes are the vertices of the elements of P, and where two nodes are connected if the straight line between them does not intersect any element of P. The edge weight for any such edge is simply the Euclidean distance between such two points. To solve this, one can then determine the shortest path from a to b in the graph, let's say with A*.
However, this is not a good approach. Creating the visibility graph in advance requires checking whether every pair of vertices is connected, a check that is more expensive than computing the distance between two nodes. So working with a modified version of A* that does everything it can before checking whether two nodes are actually connected speeds up the search.
Still, A* and all other shortest path algorithms always start with an explicitly given graph for which adjacent vertices can be traversed cheaply. So my question is, is there a good (optimal?) algorithm for finding a shortest path between two nodes a and b in an "implicit graph" that minimizes checking if two nodes are connected?
Edit:
To clarify what I mean, this is an example of what I'm looking for:
Let V be a set, a, b elements of V. Suppose w: V x V -> D is a weight function (to some linearly ordered set D) and c: V x V -> {true, false} returns true iff two elements of V are considered to be connected. Then the following algorithm finds the shortest path from a to b in V, i.e., returns a list [x_i | i < n] such that x_0 = a, x_{n-1} = b, and c(x_i, x_{i+1}) = true for all i < n - 1.
Let (V, E) be the complete graph with vertex set V.
do
    Compute shortest path from a to b in (V, E) and put it in P = [p_0, ..., p_{n-1}]
    if P = empty (there is no shortest path), return NoShortestPath
    Let all_good = true
    for i = 0 ... n - 2 do
        if c(p_i, p_{i+1}) == false, remove edge (p_i, p_{i+1}) from E, set all_good = false and exit for loop
while all_good = false
For computing the shortest paths in the loop, one could use A* if an appropriate heuristic exists. Obviously this algorithm produces a shortest path from a to b.
Also, I suppose this algorithm is somehow optimal in calling c as rarely as possible. For its found shortest path, it must have ruled out all shorter paths that the function w would have allowed for.
But surely there is a better way?
Edit 2:
So I found a solution that works relatively well for what I'm trying to do: Using A*, when relaxing a node, instead of going through the neighbors and adding them to / updating them in the priority queue, I put all vertices into the priority queue, marked as hypothetical, together with hypothetical f and g values and the hypothetical parent. Then, when picking the next element from the priority queue, I check if the node's connection to its parent is actually given. If so, the node is processed as normal; if not, it is discarded.
This greatly reduces the number of connectivity checks and improves performance for me a lot. But I'm sure there's still a more elegant way, in particular one where the "hypothetical new path" doesn't just extend by length one (parents are always actual, not hypothetical).
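The approach from Edit 2 might look roughly like this in Python. It is a sketch, not the asker's actual code: it assumes non-negative weights and a consistent heuristic h (the default h = 0 turns it into Dijkstra's algorithm), and the example points and blocked edge are invented.

```python
import heapq
import math
from itertools import count


def lazy_shortest_path(vertices, a, b, w, c, h=lambda v: 0.0):
    """Shortest a-b path that calls c(p, v) only when a queue entry is popped."""
    tie = count()  # tie-breaker so the heap never has to compare vertices
    frontier = [(h(a), next(tie), 0.0, a, None)]
    parent = {}  # settled vertex -> verified parent
    while frontier:
        _, _, g, v, p = heapq.heappop(frontier)
        if v in parent:
            continue  # already settled through a cheaper verified edge
        if p is not None and not c(p, v):
            continue  # the hypothetical edge does not exist: discard
        parent[v] = p
        if v == b:
            path = []
            while v is not None:
                path.append(v)
                v = parent[v]
            return path[::-1]
        for u in vertices:  # push every unsettled vertex as a hypothetical child
            if u not in parent:
                g_u = g + w(v, u)
                heapq.heappush(frontier, (g_u + h(u), next(tie), g_u, u, v))
    return None


points = {"a": (0, 0), "m": (1, 1), "b": (2, 0)}
blocked = {frozenset(("a", "b"))}  # pretend an obstacle hides b from a
calls = []


def connected(p, q):
    calls.append((p, q))
    return frozenset((p, q)) not in blocked


path = lazy_shortest_path(
    points, "a", "b",
    w=lambda p, q: math.dist(points[p], points[q]),
    c=connected,
)
```

In this tiny example the connectivity test runs three times, once for each queue entry that is actually popped.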
A* and Dijkstra's algorithm do not need an explicit graph to work; they actually only need:
A source vertex s
A function next:V->2^V such that next(v)={u | there is an edge from v to u }
A function isGoal:V->{0,1} such that isGoal(v) = 1 iff v is a target node.
A weight function w:E->R such that w(u,v)= cost to move from u to v
And, of course, in addition A* is going to need a heuristic function h:V->R such that h(v) is the cost approximation.
With these functions, you can generate only the portion of the graph that is needed to find shortest path, on the fly.
In fact, the A* algorithm is often used on infinite graphs (or huge graphs that do not fit in any existing storage) in artificial intelligence problems using this approach.
The idea is, you only look at the edges leaving a given node in A* (all (u,v) in E for some given u). You don't need the entire edge set E in order to do it; you can just use your next(u) function instead.
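To make this concrete, here is a small Python sketch of A* driven purely by those functions. It assumes non-negative weights and a consistent heuristic; the parameter named successors plays the role of next, and the integer example graph is invented to show that the graph may be infinite.

```python
import heapq
from itertools import count


def a_star(source, is_goal, successors, weight, heuristic):
    """A* over an implicit graph; only the vertices actually reached are stored."""
    tie = count()  # tie-breaker so the heap never has to compare vertices
    g = {source: 0}
    parent = {source: None}
    frontier = [(heuristic(source), next(tie), source)]
    closed = set()
    while frontier:
        _, _, u = heapq.heappop(frontier)
        if u in closed:
            continue  # stale queue entry
        if is_goal(u):
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        closed.add(u)
        for v in successors(u):  # edges are generated on demand
            cand = g[u] + weight(u, v)
            if v not in closed and cand < g.get(v, float("inf")):
                g[v] = cand
                parent[v] = u
                heapq.heappush(frontier, (cand + heuristic(v), next(tie), v))
    return None


# Infinite implicit graph on the positive integers: n -> n + 1 and n -> 2 * n.
path = a_star(
    source=1,
    is_goal=lambda n: n == 10,
    successors=lambda n: [n + 1, 2 * n],
    weight=lambda u, v: 1,
    heuristic=lambda n: 0,
)
```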

Graph Has Two / Three Different Minimal Spanning Trees ?

I'm trying to find an efficient method of detecting whether a given graph G has two different minimal spanning trees. I'm also trying to find a method to check whether it has 3 different minimal spanning trees. The naive solution that I've thought about is running Kruskal's algorithm once and finding the total weight of the minimal spanning tree, then, for each edge in the graph, removing that edge, running Kruskal's algorithm again, and checking whether the weight of the new tree equals the weight of the original minimal spanning tree. The runtime is O(|V||E|log|V|), which is not good at all, and I think there's a better way to do it.
Any suggestion would be helpful,
thanks in advance
You can modify Kruskal's algorithm to do this.
First, sort the edges by weight. Then, for each weight in ascending order, filter out all irrelevant edges. The relevant edges form a graph on the connected components of the minimum-spanning-forest-so-far. You can count the number of spanning trees in this graph. Take the product over all weights and you've counted the total number of minimum spanning trees in the graph.
You recover the same running time as Kruskal's algorithm if you only care about the one-tree, two-trees, and three-or-more-trees cases. I think you wind up doing a determinant calculation or something to enumerate spanning trees in general, so you likely wind up with an O(MM(n)) worst-case in general.
Suppose you have an MST T0 of a graph. If there is another MST T1, it must contain at least one edge E that is not in T0. Remove E from T1, and the tree is separated into two components. However, in T0 these two components must be connected, so there is another edge across the two components that has exactly the same weight as E (otherwise we could substitute the heavier one with the lighter one and get a smaller spanning tree). This means that substituting this other edge with E gives you another MST.
What this implies is that if there is more than one MST, we can always change just a single edge of an MST and get another MST. So, for each edge, try to substitute it with an edge of the same weight; if the result is another spanning tree, it is an MST. This gives you a faster algorithm.
Suppose G is a graph with n vertices and m edges; that the weight of any edge e is W(e); and that P is a minimal-weight spanning tree on G, weighing Cost(W,P).
Let δ = minimal positive difference between any two edge weights. (If all the edge weights are the same, then δ is indeterminate; but in this case, any ST is an MST so it doesn't matter.) Take ε such that δ > n·ε > 0.
Create a new weight function U() with U(e)=W(e)+ε when e is in P, else U(e)=W(e). Compute Q, an MST of G under U. If Cost(U,Q) < Cost(U,P) then Q≠P. But Cost(W,Q) = Cost(W,P) by construction of δ and ε. Hence P and Q are distinct MSTs of G under W. If Cost(U,Q) ≥ Cost(U,P) then Q=P and distinct MSTs of G under W do not exist.
The method above determines if there are at least two distinct MSTs, in time O(h(n,m)) if O(h(n,m)) bounds the time to find an MST of G.
I don't know if a similar method can treat whether three (or more) distinct MSTs exist; simple extensions of it fall to simple counterexamples.
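A sketch of the perturbation test in Python, under the extra assumption that edge weights are integers. Then δ ≥ 1, and multiplying every weight by n + 1 while taking ε = 1 satisfies δ > n·ε without floating point. The graph is assumed connected, vertices are 0..n-1, edges are (u, v, weight) triples, and the function names are illustrative.

```python
def kruskal(n, edges, cost):
    """Return the edge indices of a minimum spanning tree under `cost`."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for i in sorted(range(len(edges)), key=cost):
        u, v, _ = edges[i]
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append(i)
    return tree


def has_two_msts(n, edges):
    """True iff the connected graph has at least two distinct MSTs (integer weights)."""
    def w(i):
        return edges[i][2]

    p = set(kruskal(n, edges, w))  # P: some MST under the original weights W
    scale = n + 1  # scaled delta >= n + 1 > n * eps, with eps = 1

    def u(i):  # U(e) = W(e) + eps for e in P, else W(e), in scaled units
        return scale * w(i) + (1 if i in p else 0)

    q = kruskal(n, edges, u)  # Q: an MST under the perturbed weights U
    return sum(map(u, q)) < sum(map(u, p))


equal_triangle = [(0, 1, 1), (1, 2, 1), (0, 2, 1)]
distinct_triangle = [(0, 1, 1), (1, 2, 2), (0, 2, 3)]
```

Both Kruskal runs have the usual sorting cost, so the whole check stays within a constant factor of one MST computation.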

Consensus on multiple graphs

Let G = (V,E) be a Directed Acyclic Graph (DAG). V is the set of vertexes, while E is the set of edges.
Now, suppose that G is corrupted by some annotators in a crowd, according to the crowdsourcing paradigm:
Some of them may decide to remove some edge e belonging to E
Some of them may decide to add an edge e which was not existing
The result of the work of an annotator i is a graph whose set of vertexes V is the same as the original one and whose set of edges Ei may differ from the original one. If n is the number of annotators, we come up with n different graphs, having the same set of vertexes V, but a different set of edges E. Let G1 = (V,E1), ..., Gn = (V,En) be the set of graphs.
I would like to know whether there is a way of merging these graphs, so as to find a consensus on the presence/absence of each possible edge e between two vertexes v1,v2 in V. The purpose of this operation is to fuse the opinions of the annotators about the construction of the set of edges E in the graph G. The final graph has to be a DAG.
Let...
U be the distinct union of all Ei sets plus the original set E
T be some arbitrary threshold value
H(x) be some heuristic function
F be the final consensus set of edges
Pseudocode:
for each Edge e in U
    if H(e) >= T then F.Add(e)
The question is then of course how to define your heuristic function. A naive approach would be set based voting. Count the number of E sets containing the edge, and if enough people agree that it's in the graph, include it. This is a simple and efficient function to implement. Some weaknesses of this heuristic are its inability to detect and compensate for bad annotators or small crowd sizes.
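The set-based voting heuristic could be sketched in Python as below, taking H(e) to be the fraction of annotators whose graph contains e (an assumption; the answer leaves H open). Note that this sketch does not check that the result is acyclic, so a separate step would still be needed to guarantee a DAG.

```python
from collections import Counter


def consensus_edges(edge_sets, threshold):
    """Keep every edge that appears in at least `threshold` (a fraction) of the sets."""
    votes = Counter(e for edges in edge_sets for e in set(edges))
    n = len(edge_sets)
    return {e for e, k in votes.items() if k / n >= threshold}


# Three hypothetical annotators working on the same vertex set.
annotated = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("a", "c")},
    {("a", "b"), ("b", "c"), ("c", "d")},
]
final_edges = consensus_edges(annotated, threshold=0.5)
```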
For each edge count the number of graphs that contains it. If it is greater than some threshold, assume it was an original edge.
You may face some problems if some of the actions are biased. That is, each user does not randomly choose a particular edge to act upon.

minimum connected subgraph containing a given set of nodes

I have an unweighted, connected graph. I want to find a connected subgraph that definitely includes a certain set of nodes, and as few extras as possible. How could this be accomplished?
Just in case, I'll restate the question using more precise language. Let G(V,E) be an unweighted, undirected, connected graph. Let N be some subset of V. What's the best way to find the smallest connected subgraph G'(V',E') of G(V,E) such that N is a subset of V'?
Approximations are fine.
This is exactly the well-known NP-hard Steiner Tree problem. Without more details on what your instances look like, it's hard to give advice on an appropriate algorithm.
I can't think of an efficient algorithm to find the optimal solution, but assuming that your input graph is dense, the following might work well enough:
Convert your input graph G(V, E) to a weighted graph G'(N, D), where N is the subset of vertices you want to cover and D is distances (path lengths) between corresponding vertices in the original graph. This will "collapse" all vertices you don't need into edges.
Compute the minimum spanning tree for G'.
"Expand" the minimum spanning tree by the following procedure: for every edge d in the minimum spanning tree, take the corresponding path in graph G and add all vertices (including endpoints) on the path to the result set V' and all edges in the path to the result set E'.
This algorithm is easy to trip up to give suboptimal solutions. Example case: equilateral triangle where there are vertices at the corners, in midpoints of sides and in the middle of the triangle, and edges along the sides and from the corners to the middle of the triangle. To cover the corners it's enough to pick the single middle point of the triangle, but this algorithm might choose the sides. Nonetheless, if the graph is dense, it should work OK.
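The three steps can be sketched in Python as follows, assuming an unweighted, connected graph stored as an adjacency dict and a non-empty set of required nodes. Distances come from one BFS per required node, and the minimum spanning tree of G' is built with Prim's algorithm; the function names are illustrative.

```python
from collections import deque


def bfs(adj, source):
    """Distances and predecessor links from `source` in an unweighted graph."""
    dist, pred = {source: 0}, {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], pred[v] = dist[u] + 1, u
                queue.append(v)
    return dist, pred


def steiner_via_mst(adj, terminals):
    """Collapse to distances between terminals, build an MST, expand it back to paths."""
    terminals = list(terminals)
    search = {t: bfs(adj, t) for t in terminals}
    connected = [terminals[0]]
    vertices, edges = {terminals[0]}, set()
    while len(connected) < len(terminals):  # Prim's algorithm on G'(N, D)
        s, t = min(
            ((s, t) for s in connected for t in terminals if t not in connected),
            key=lambda st: search[st[0]][0][st[1]],
        )
        connected.append(t)
        pred = search[s][1]
        v = t
        while v != s:  # expand the MST edge (s, t) into its path in G
            vertices.update((v, pred[v]))
            edges.add(frozenset((v, pred[v])))
            v = pred[v]
    return vertices, edges


adj = {"c": ["a", "b", "d", "e"], "a": ["c"], "b": ["c"], "d": ["c"], "e": ["c"]}
sub_vertices, sub_edges = steiner_via_mst(adj, ["a", "b", "d"])
```

On this star-shaped example the expansion picks up the hub c and leaves out the unneeded leaf e.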
The easiest solutions will be the following:
a) based on mst:
- initially, all nodes of V are in V'
- build a minimum spanning tree of the graph G(V,E) - call it T.
- loop: for every leaf v in T that is not in N, delete v from V'.
- repeat loop until all leaves in T are in N.
b) another solution is the following - based on shortest paths tree.
- pick any node in N, call it v, let v be a root of a tree T = {v}.
- remove v from N.
loop:
1) select the shortest path between any node in T and any node in N: the shortest path p: {v, ... , u} where v is in T and u is in N.
2) every node in p is added to V'.
3) every node in p and in N is deleted from N.
--- repeat loop until N is empty.
At the beginning of the algorithm: compute all shortest paths in G using any known efficient algorithm.
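Solution b) could be sketched in Python like this. As a simplification (an assumption, not what the answer prescribes), it does not precompute all shortest paths; instead it runs one multi-source BFS from the current tree per round, which is enough for unweighted graphs. The name grow_steiner_tree is illustrative.

```python
from collections import deque


def grow_steiner_tree(adj, required):
    """Repeatedly attach the required node closest to the current tree."""
    remaining = set(required)
    tree = {remaining.pop()}  # arbitrary root taken from N
    while remaining:
        pred = {v: None for v in tree}  # every tree node is a BFS source
        queue = deque(tree)
        hit = None
        while queue and hit is None:
            u = queue.popleft()
            for v in adj[u]:
                if v not in pred:
                    pred[v] = u
                    if v in remaining:  # nearest remaining required node
                        hit = v
                        break
                    queue.append(v)
        if hit is None:
            raise ValueError("some required nodes are unreachable")
        while hit is not None:  # add the whole path back to the tree
            tree.add(hit)
            remaining.discard(hit)
            hit = pred[hit]
    return tree


adj = {"c": ["a", "b", "d", "e"], "a": ["c"], "b": ["c"], "d": ["c"], "e": ["c"]}
selected = grow_steiner_tree(adj, ["a", "b", "d"])
```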
Personally, I used this algorithm in one of my papers, but it is more suitable for distributed environments.
Let N be the set of nodes that we need to interconnect. We want to build a minimum connected dominating set of the graph G, and we want to give priority to nodes in N.
We give each node u a unique identifier id(u). We let w(u) = 0 if u is in N, otherwise w(u) = 1.
We create a pair (w(u), id(u)) for each node u.
Each node u builds a multiset relay node. That is, a set M(u) of 1-hop neighbors such that each 2-hop neighbor is a neighbor of at least one node in M(u). [The smaller M(u), the better the solution.]
u is in V' if and only if:
u has the smallest pair (w(u), id(u)) among all its neighbors,
or u is selected in M(v), where v is the 1-hop neighbor of u with the smallest pair (w(v), id(v)).
-- The trick when you execute this algorithm in a centralized manner is to be efficient in computing 2-hop neighbors. The best I could do was to go from O(n^3) to O(n^2.37) by matrix multiplication.
-- I really wish I knew the approximation ratio of this last solution.
I like this reference for heuristics of steiner tree:
The Steiner Tree Problem, by Frank K. Hwang, Dana S. Richards, and Pawel Winter.
You could try to do the following:
Create a minimal vertex-cover for the desired nodes N.
Collapse these, possibly unconnected, sub-graphs into "large" nodes. That is, for each sub-graph, remove it from the graph, and replace it with a new node. Call this set of nodes N'.
Do a minimal vertex-cover of the nodes in N'.
"Unpack" the nodes in N'.
Not sure whether or not it gives you an approximation within some specific bound or so. You could perhaps even trick the algorithm to make some really stupid decisions.
As already pointed out, this is the Steiner tree problem in graphs. However, an important detail is that all edges should have weight 1. Because |V'| = |E'| + 1 for any Steiner tree (V',E'), this achieves exactly what you want.
For solving it, I would suggest the following Steiner tree solver (to be transparent: I am one of the developers):
https://scipjack.zib.de/
For graphs with a few thousand edges, you will usually get an optimal solution in less than 0.1 seconds.

Resources