Subgraph enumeration - algorithm

What is an efficient algorithm for the enumeration of all subgraphs of a parent graph? In my particular case, the parent graph is a molecular graph, and so it will be connected and typically contain fewer than 100 vertices.
Edit: I am only interested in the connected subgraphs.

This question has a better answer in the accepted answer to a related question. It avoids the computationally complex step marked "you fill in above function" in @ninjagecko's answer, and it deals efficiently with compounds that contain several rings.
See the linked question for the full details, but here's the summary. (N(v) denotes the set of neighbors of vertex v. In the "choose a vertex" step, you can choose any arbitrary vertex.)
GenerateConnectedSubgraphs(verticesNotYetConsidered, subsetSoFar, neighbors):
    if subsetSoFar is empty:
        let candidates = verticesNotYetConsidered
    else:
        let candidates = verticesNotYetConsidered intersect neighbors
    if candidates is empty:
        yield subsetSoFar
    else:
        choose a vertex v from candidates
        GenerateConnectedSubgraphs(verticesNotYetConsidered - {v},
                                   subsetSoFar,
                                   neighbors)
        GenerateConnectedSubgraphs(verticesNotYetConsidered - {v},
                                   subsetSoFar union {v},
                                   neighbors union N(v))
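The pseudocode above might translate to Python roughly as follows. This is a minimal sketch, not the linked answer's code: the function name connected_subgraphs and the adjacency-dict input format are assumptions, and min() is used for the "choose a vertex" step only to make the output order deterministic. As in the pseudocode, the empty set is yielded once as well.

```python
def connected_subgraphs(adj):
    """Yield every connected vertex subset of the graph exactly once.

    adj is assumed to map each vertex to an iterable of its neighbors
    (undirected graph, so each edge appears in both directions).
    """
    def rec(remaining, subset, neighbors):
        # Before the first vertex is picked, any vertex is a candidate;
        # afterwards only vertices adjacent to the current subset are.
        candidates = remaining if not subset else remaining & neighbors
        if not candidates:
            yield subset
            return
        v = min(candidates)  # any vertex works; min() keeps runs repeatable
        # Branch 1: v is excluded from the subgraph.
        yield from rec(remaining - {v}, subset, neighbors)
        # Branch 2: v is included, and its neighbors become reachable.
        yield from rec(remaining - {v}, subset | {v},
                       neighbors | frozenset(adj[v]))

    return rec(frozenset(adj), frozenset(), frozenset())


if __name__ == "__main__":
    # Path graph 1 - 2 - 3
    path = {1: [2], 2: [1, 3], 3: [2]}
    for s in connected_subgraphs(path):
        print(sorted(s))
```

The recursion depth is at most the number of vertices, so graphs with fewer than 100 vertices stay well within Python's default recursion limit.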

Comparison with mathematical subgraphs:
You could give each element a number from 0 to N-1, then enumerate each subgraph as any binary number of length N. You wouldn't need to scan the graph at all.
If what you really want is subgraphs with a certain property (connected, etc.), that is different, and you'd need to update your question. As a commenter noted, 2^100 is very large, so you definitely don't want to (as above) enumerate the mathematically-correct-but-physically-boring disconnected subgraphs. It would literally take you, assuming a billion enumerations per second, at least 40 trillion years to enumerate them all.
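The 40 trillion year figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes 100 vertices, a year of 365.25 days, and the rate of a billion subsets per second mentioned above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year

subsets = 2 ** 100   # every vertex is either in or out of the subset
rate = 1e9           # subsets enumerated per second
years = subsets / rate / SECONDS_PER_YEAR

print(f"{years:.2e} years")
```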
Connected-subgraph-generator:
If you want some kind of enumeration that retains the DAG property of subgraphs under some metric, e.g. (1,2,3)->(2,3)->(2), (1,2,3)->(1,2)->(2), you'd just want an algorithm that could generate all CONNECTED subgraphs as an iterator (yielding each element). This can be accomplished by recursively removing a single element at a time (optionally from the "boundary"), checking if the remaining set of elements is in a cache (else adding it), yielding it, and recursing. This works fine if your molecule is very chain-like with very few cycles. For example, if your molecule were a 5-pointed star of N = 100 elements, it would only have about (100/5)^5 = 3.2 million results (less than a second). But if you start adding in more than a single ring, e.g. aromatic compounds and others, you might be in for a rough ride.
e.g. in python
class Graph(object):
    def __init__(self, vertices):
        self.vertices = frozenset(vertices)
        # add edge logic here and to methods, etc. etc.
    def would_disconnect(self, element):
        # Hypothetical stub standing in for the placeholder
        # {{REMOVING ELEMENT WOULD DISCONNECT GRAPH}}: with no edge
        # logic yet, no removal is ever treated as disconnecting.
        return False
    def subgraphs(self):
        cache = set()
        def helper(graph):
            yield graph
            for element in graph:
                if graph.would_disconnect(element):
                    # you fill in above function; easy if
                    # there is 0 or 1 ring in molecule
                    # (keep track if molecule has ring, e.g.
                    # self.numRings, maybe even more data)
                    # if you know there are 0 rings the operation
                    # takes O(1) time
                    continue
                subgraph = Graph(graph.vertices - {element})
                if subgraph not in cache:
                    cache.add(subgraph)
                    yield from helper(subgraph)
        yield from helper(self)
    def __eq__(self, other):
        return self.vertices == other.vertices
    def __hash__(self):
        return hash(self.vertices)
    def __iter__(self):
        return iter(self.vertices)
    def __repr__(self):
        return 'Graph(%s)' % repr(set(self.vertices))
Demonstration:
G = Graph({1,2,3,4,5})
for subgraph in G.subgraphs():
    print(subgraph)
Result:
Graph({1, 2, 3, 4, 5})
Graph({2, 3, 4, 5})
Graph({3, 4, 5})
Graph({4, 5})
Graph({5})
Graph(set())
Graph({4})
Graph({3, 5})
Graph({3})
Graph({3, 4})
Graph({2, 4, 5})
Graph({2, 5})
Graph({2})
Graph({2, 4})
Graph({2, 3, 5})
Graph({2, 3})
Graph({2, 3, 4})
Graph({1, 3, 4, 5})
Graph({1, 4, 5})
Graph({1, 5})
Graph({1})
Graph({1, 4})
Graph({1, 3, 5})
Graph({1, 3})
Graph({1, 3, 4})
Graph({1, 2, 4, 5})
Graph({1, 2, 5})
Graph({1, 2})
Graph({1, 2, 4})
Graph({1, 2, 3, 5})
Graph({1, 2, 3})
Graph({1, 2, 3, 4})
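For molecules with rings, the placeholder check {{REMOVING ELEMENT WOULD DISCONNECT GRAPH}} could be filled in generically with a breadth-first search. The sketch below is only one possible approach: the function names and the adjacency-dict format are assumptions, and each check costs time linear in the size of the subgraph rather than the O(1) mentioned for ring-free molecules.

```python
from collections import deque


def is_connected(vertices, adj):
    """Return True if the subgraph induced by `vertices` is connected.

    adj is assumed to map every vertex to an iterable of its neighbors.
    The empty vertex set is treated as connected.
    """
    vertices = set(vertices)
    if not vertices:
        return True
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            # Only follow edges that stay inside the induced subgraph.
            if nb in vertices and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen == vertices


def removal_disconnects(vertices, adj, element):
    """Return True if removing `element` leaves a disconnected subgraph."""
    return not is_connected(set(vertices) - {element}, adj)


if __name__ == "__main__":
    chain = {1: [2], 2: [1, 3], 3: [2]}  # path 1 - 2 - 3
    print(removal_disconnects({1, 2, 3}, chain, 2))
    print(removal_disconnects({1, 2, 3}, chain, 1))
```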

There is an algorithm called gSpan [1] that has been used to count frequent subgraphs; it can also be used to enumerate all subgraphs. You can find an implementation of it here [2].
The idea is the following: graphs are represented by so-called DFS codes. A DFS code corresponds to a depth-first search on a graph G and has an entry of the form (i, j, l(v_i), l(v_i, v_j), l(v_j)) for each edge (v_i, v_j) of the graph, where the vertex subscripts correspond to the order in which the vertices are discovered by the DFS. It is possible to define a total order on the set of all DFS codes (as is done in [1]) and, as a consequence, to obtain a canonical graph label for a given graph by computing the minimum over all DFS codes representing this graph. This means that if two graphs have the same minimum DFS code, they are isomorphic.
Now, starting from all possible DFS codes of length 1 (one per edge), all subgraphs of a graph can be enumerated by successively adding one edge at a time to the codes, which gives rise to an enumeration tree where each node corresponds to a graph. If the enumeration is done carefully (i.e., compatible with the order on the DFS codes), minimal DFS codes are encountered first. Therefore, whenever a DFS code is encountered that is not minimal, its corresponding subtree can be pruned. Please consult [1] for further details.
[1] https://sites.cs.ucsb.edu/~xyan/papers/gSpan.pdf
[2] http://www.nowozin.net/sebastian/gboost/

Related

Find guaranteed ancestors in directed graph

I'm trying to implement an algorithm to find what I call 'guaranteed ancestors' in a directed graph. I have a list of nodes which each can point to zero, one or multiple child nodes.
Below you see an example of a simple graph. I've marked all circles with a unique number.
Let's imagine we're trying to determine which nodes I'm guaranteed to have visited before reaching node 13 starting at node 0.
My approach when solving this simple example by hand is to start at node 13 and work my way back, asking which nodes I am guaranteed to visit no matter which direction I go. The first node I notice with this property is node 10, since no matter whether I choose to visit node 11 or node 12, I'm guaranteed to eventually reach node 13. Similarly, I can conclude that I have to visit node 9 if I want to reach node 13. Working all the way up the graph, I conclude that node 13 has nodes 0, 1, 9 and 10 as its guaranteed ancestors.
I'm not sure what such an algorithm is called, but I'm sure there is a name for this specific search.
Here are the constraints you can assume about my graph.
There is a single defined "head/root" node, which is the only node without any other nodes pointing to it.
The graph is acyclic. (Ideally the algorithm would be able to handle cycles too, but I have a different check verifying that the graph is acyclic, so this is not a must.)
There are no "dead" nodes, i.e. nodes which can't be reached from the head/root node.
This has to run on more complicated graphs with up to 500 nodes and many nodes with multiple "parents", which could be connected back and forth. Runtime is a priority as well. I assume we should be able to solve this problem in linear time complexity.
I've tried simplifying the problem to the point where I tried making an algorithm which could determine whether a single node is a guaranteed ancestor of another node, which I believe is pretty simple to determine in O(n). However, if I want a complete list of all guaranteed ancestors, I assume I'd have to run this algorithm for every node, leaving me with O(n^2).
Does anyone know the correct name of the algorithm I'm describing?
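The quadratic approach described in the question can be sketched as follows: a node is a guaranteed ancestor of the target if the target becomes unreachable from the root once that node is removed. This is only an illustration of that brute-force idea, not one of the answers below. The function names are hypothetical, and the edge directions of the example graph are assumed to run from the first to the second node of each pair in the edge list used further down.

```python
from collections import deque


def reachable(adj, root, target, removed):
    """BFS from root to target that is not allowed to pass through `removed`."""
    if root == removed:
        return False
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for child in adj.get(node, ()):
            if child != removed and child not in seen:
                seen.add(child)
                queue.append(child)
    return False


def guaranteed_ancestors(adj, root, target):
    """Nodes (other than target) whose removal cuts every root-to-target path."""
    nodes = set(adj) | {c for children in adj.values() for c in children}
    return sorted(n for n in nodes
                  if n != target and not reachable(adj, root, target, n))


# Example graph; edge directions are an assumption (parent -> child).
example = {
    0: [1], 1: [2, 3], 2: [4, 5], 3: [7], 4: [5, 6], 6: [9],
    7: [8], 8: [9], 9: [10], 10: [11, 12], 11: [13], 12: [13],
}

if __name__ == "__main__":
    print(guaranteed_ancestors(example, 0, 13))
```

Each reachability check is linear in the size of the graph, so the whole computation is quadratic, which is easily fast enough for 500 nodes.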
1. Assign a weight of 1 to every edge.
2. Run Dijkstra to find the shortest path between the head/root node and the target node.
3. Assign a weight of 2 * (edge count of graph) to every edge in that path.
4. Run Dijkstra again to find the cheapest path.
5. Identify the edges that are present in both paths (they could not be avoided although they are very expensive).
6. The nodes at both ends of every edge identified in step 5 will be critical, i.e. they must ALL be visited by any route between the head/root node and the target node.
Consider an example:
The first Dijkstra run would return a path containing node 1 or node 2 (they both belong to 5-hop paths). The second run would return a path containing the other of those two nodes.
This is almost what the definition of an articulation or cut vertex is in an undirected graph. See Biconnected component:
a cut vertex is any vertex whose removal increases the number of connected components.
The difference is that your graph is directed, and that you consider the root also as such a vertex.
So my suggestion is to temporarily consider the graph to be undirected, and to apply a depth-first algorithm to identify such cut vertices, and include the root.
The algorithm is given as pseudo code in the same Wikipedia article. I have rewritten it in JavaScript, so it can be run here for the graph that you have given as example:
function buildAdjacencyList(n, edges) {
    // Indexes in adj represent node identifiers.
    // Values in adj are lists of neighbors: start out with empty lists
    let adj = [];
    for (let i = 0; i < n; i++) adj.push([]);
    for (let [start, end] of edges) {
        adj[start].push(end);
        adj[end].push(start); // make edge bidirectional
    }
    return adj;
}
function markArticulationPoints(nodes, node, depth) {
    node.visited = true;
    node.depth = depth;
    node.low = depth;
    for (let neighborId of node.neighbors) {
        let neighbor = nodes[neighborId];
        if (!neighbor.visited) {
            neighbor.parent = node;
            markArticulationPoints(nodes, neighbor, depth + 1);
            if (neighbor.low >= node.depth) node.isArticulation = true;
            if (neighbor.low < node.low) node.low = neighbor.low;
        } else if (neighbor != node.parent && neighbor.depth < node.low) {
            node.low = neighbor.depth;
        }
    }
}
function getArticulationPoints(adj, root) {
    // Create object for each node, having meta data for algorithm
    let nodes = [];
    for (let i = 0; i < adj.length; i++) {
        nodes.push({
            neighbors: adj[i],
            visited: false,
            depth: Infinity,
            low: Infinity,
            parent: -1,
            isArticulation: i == root // root is considered articulation point
        });
    }
    markArticulationPoints(nodes, nodes[root], 0); // start DFS algorithm
    // Collect articulation points from meta data
    let result = [];
    for (let i = 0; i < adj.length; i++) {
        if (nodes[i].isArticulation) result.push(i);
    }
    return result;
}
// Build adjacency list for example graph, but with undirected edges
let adj = buildAdjacencyList(14, [
    [0, 1],
    [1, 2],
    [1, 3],
    [2, 4],
    [2, 5],
    [4, 5],
    [4, 6],
    [3, 7],
    [7, 8],
    [6, 9],
    [8, 9],
    [9, 10],
    [10, 11],
    [10, 12],
    [11, 13],
    [12, 13]
]);
let result = getArticulationPoints(adj, 0);
console.log("Articulation points:", ...result);

Mathematica - Reflect/invert undirected graph edges

Given is a graph with data in mathematica-11. The graph includes undirected edges between nodes and standalone nodes (not connected).
Graph[{1 <-> 2, 2<-> 3, 3<-> 1, 4<-> 5, 5<-> 6, 6<-> 2, 2<-> 4}, VertexLabels -> "Name", VertexShapeFunction -> "Diamond", VertexSize -> Small]
Question:
How do I reflect (invert) the edges between the nodes?
Since they are undirected, by reflecting/inverting I mean that nodes should be linked where there was no edge before (and the former edges are gone).
Mathematica-11 provides the ReverseGraph function; however, it only reverses the direction of directed edges. Any ideas?
Idea:
Convert the graph into an AdjacencyMatrix and invert it. Then, the inverted matrix could be used to create the reflected/inverted graph.
However, I am stuck with inverting the AdjacencyMatrix since the result behaves strangely:
The actual values are simply replaced with the term Inverse when using Inverse (AdjacencyMatrix[data]) // MatrixForm
Related:
This article covers how to reflect edge weights, but not the edges themselves.
I'm not sure I completely understand your question so if this isn't the answer tell us why not:
opts = {VertexLabels -> "Name", VertexShapeFunction -> "Diamond", VertexSize -> Small}
g1 = Graph[{1 <-> 2, 2 <-> 3, 3 <-> 1, 4 <-> 5, 5 <-> 6, 6 <-> 2, 2 <-> 4}];
Then you might like
GraphComplement[g1, opts]
or, if you want edges from each node to itself
AdjacencyGraph[Table[1, {VertexCount[g1]}, {VertexCount[g1]}] - AdjacencyMatrix[g1], opts]
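To make explicit what the complement computes, here is a language-neutral sketch in Python (not Mathematica). The helper name complement_edges is hypothetical. It lists every unordered pair of distinct vertices that is not already joined by an edge, using the edge list from the question.

```python
from itertools import combinations


def complement_edges(vertices, edges):
    """Return the edges of the complement of a simple undirected graph."""
    # frozenset makes (3, 1) and (1, 3) compare equal.
    present = {frozenset(edge) for edge in edges}
    return [(u, v) for u, v in combinations(sorted(vertices), 2)
            if frozenset((u, v)) not in present]


edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 6), (6, 2), (2, 4)]

if __name__ == "__main__":
    print(complement_edges(range(1, 7), edges))
```

Six vertices give 15 possible pairs, 7 of which are edges of the original graph, so the complement has 8 edges.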

Function of this graph pseudocode

procedure explore(G, v)
Input: G = (V, E) is a graph; v ∈ V
Output: visited(u) is set to true for all nodes u reachable from v
    visited(v) = true
    previsit(v)
    for each edge (v, u) ∈ E:
        if not visited(u): explore(u)
    postvisit(v)
All this pseudocode does is find one path right? It does nothing while backtracking if I'm not wrong?
It just explores the graph (it doesn't return a path) - everything that's reachable from the starting vertex will be explored and have the corresponding value in visited set (not just the vertices corresponding to one of the paths).
It moves on to the next edge while backtracking ... and it does postvisit.
So if we're at a, which has edges to b, c and d, we'll start by going to b, then, when we eventually return to a, we'll go to c (if it hasn't been visited already), and then we will similarly go to d after returning to a for the 2nd time.
It's called depth-first search, in case you were wondering. Wikipedia also gives an example of the order in which vertices will get explored in a tree: (the numbers correspond to the visit order, we start at 1)
In the above, you're not just exploring the vertices going down the left (1-4), but after 4 you go back to 3 to visit 5, then back to 2 to visit 6, and so on, until all 12 are visited.
With regard to previsit and postvisit: previsit will happen when we first get to a vertex, and postvisit will happen after we've explored all of its children (and their descendants in the corresponding DFS tree). So, in the above example, for 1, previsit will happen right at the start, but postvisit will happen only at the very end, because all the other vertices are children of 1 or descendants of those children. The order will go something like:
pre 1, pre 2, pre 3, pre 4, post 4, pre 5, post 5, post 3, pre 6, post 6, post 2, ...
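The same ordering can be reproduced with a small Python sketch of explore. The adjacency dict below is an assumption: it encodes a 12-vertex tree whose visit order matches the numbering described above, and the previsit/postvisit hooks simply record events in a list.

```python
def explore(graph, v, visited, events):
    """Depth-first exploration that records previsit and postvisit events."""
    visited.add(v)
    events.append(("pre", v))          # previsit: first arrival at v
    for u in graph.get(v, ()):
        if u not in visited:
            explore(graph, u, visited, events)
    events.append(("post", v))         # postvisit: all descendants are done


# Assumed example tree: vertex -> list of children, in visiting order.
tree = {1: [2, 7, 8], 2: [3, 6], 3: [4, 5], 8: [9, 12], 9: [10, 11]}

events = []
explore(tree, 1, set(), events)

if __name__ == "__main__":
    print(", ".join(f"{kind} {v}" for kind, v in events))
```

Note that the backtracking steps are where every "post" event is recorded, which is why the postvisit for 1 comes last.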

Finding size of max independent set in binary tree - why faulty "solution" doesn't work?

Here is a link to a similar question with a good answer: Java Algorithm for finding the largest set of independent nodes in a binary tree.
I came up with a different answer, but my professor says it won't work and I'd like to know why (he doesn't answer email).
The question:
Given an array A with n integers, its indexes start with 0 (i.e, A[0],
A[1], …, A[n-1]). We can interpret A as a binary tree in which the two
children of A[i] are A[2i+1] and A[2i+2], and the value of each
element is the node weight of the tree. In this tree, we say that a
set of vertices is "independent" if it does not contain any
parent-child pair. The weight of an independent set is just the
summation of all weights of its elements. Develop an algorithm to
calculate the maximum weight of any independent set.
The answer I came up with used the following two assumptions about independent sets in a binary tree:
All nodes on the same level are independent from each other.
All nodes on alternating levels are independent from each other (there are no parent/child relations)
Warning: I came up with this during my exam, and it isn't pretty, but I just want to see if I can argue for at least partial credit.
So, why can't you just build two independent sets (one for odd levels, one for even levels)?
If any of the weights in each set are non-negative, sum them (discarding the negative elements because that won't contribute to a largest weight set) to find the independent set with the largest weight.
If the weights in the set are all negative (or equal to 0), sort it and return the negative number closest to 0 for the weight.
Compare the weights of the largest independent set from each of the two sets and return it as the final solution.
My professor claims it won't work, but I don't see why. Why won't it work?
Interjay has noted why your answer is incorrect. The problem can be solved with a recursive algorithm find-max-independent which, given a binary tree, considers two cases:
1. What is the max-independent set given that the root node is included?
2. What is the max-independent set given that the root node is not included?
In case 1, since the root node is included, neither of its children can. Thus we sum the value of find-max-independent of the grandchildren of root, plus the value of root (which must be included), and return that.
In case 2, since the root node is excluded, its children are unconstrained, so we return the sum of the values of find-max-independent of the children nodes, if any.
The algorithm may look something like this (in python):
def find_max_independent(A):
    N = len(A)
    def children(i):
        for n in (2*i + 1, 2*i + 2):
            if n < N:
                yield n
    def gchildren(i):
        for child in children(i):
            for gchild in children(child):
                yield gchild
    memo = [None] * N
    def rec(root):
        "finds the max independent set weight in the subtree rooted at root. memoizes results"
        assert root < N
        if memo[root] is not None:
            return memo[root]
        # option 'root not included': sum the max independent set values of the children
        without_root = sum(rec(child) for child in children(root))
        # option 'root included': possibly pick the root
        # and the sum of the max value for the grandchildren
        with_root = max(0, A[root]) + sum(rec(gchild) for gchild in gchildren(root))
        val = max(with_root, without_root)
        assert val >= 0
        memo[root] = val
        return val
    return rec(0) if N > 0 else 0
Some test cases illustrated:
tests = [
    [[1,2,3,4,5,6], 16],  #1
    [[-100,2,3,4,5,6], 15],  #2
    [[1,200,3,4,5,6], 206],  #3
    [[1,2,3,-4,5,-6], 8],  #4
    [[], 0],
    [[-1], 0],
]
for A, expected in tests:
    actual = find_max_independent(A)
    print("test: {}, expected: {}, actual: {} ({})".format(A, expected, actual, expected==actual))
Sample output:
test: [1, 2, 3, 4, 5, 6], expected: 16, actual: 16 (True)
test: [-100, 2, 3, 4, 5, 6], expected: 15, actual: 15 (True)
test: [1, 200, 3, 4, 5, 6], expected: 206, actual: 206 (True)
test: [1, 2, 3, -4, 5, -6], expected: 8, actual: 8 (True)
test: [], expected: 0, actual: 0 (True)
test: [-1], expected: 0, actual: 0 (True)
[Illustrations of test cases 1 to 4]
The complexity of the memoized algorithm is O(n), since the body of rec is evaluated only once for each node and any later call for the same node returns the memoized value. This is a top-down dynamic programming solution using depth-first search.
(Test case illustrations courtesy of leetcode's interactive binary tree editor)
Your algorithm doesn't work because the set of nodes it returns will be either all from odd levels, or all from even levels. But the optimal solution can have nodes from both.
For example, consider a tree where all weights are 0 except for two nodes which have weight 1. One of these nodes is at level 1 and the other is at level 4. The optimal solution will contain both these nodes and have weight 2. But your algorithm will only give one of these nodes and have weight 1.
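This counterexample can be checked with a short sketch. Both functions below are illustrative assumptions rather than code from the question or the other answer: level_based_weight is a simplified version of the odd/even level idea (it ignores the all-negative special case), and optimal_weight is a bottom-up variant of the dynamic programme that allows the empty set. The array places the two weight-1 nodes at the root and at index 7, three levels further down.

```python
def level_based_weight(A):
    """Simplified odd/even level heuristic: best of the two level classes."""
    sums = [0, 0]
    for i, w in enumerate(A):
        level = (i + 1).bit_length() - 1   # 0-based depth of index i
        if w > 0:                          # negative weights never help
            sums[level % 2] += w
    return max(sums)


def optimal_weight(A):
    """Bottom-up DP over the array-encoded binary tree."""
    n = len(A)
    incl = [0] * n   # best weight in subtree i when node i is taken
    excl = [0] * n   # best weight in subtree i when node i is skipped
    for i in range(n - 1, -1, -1):         # children have larger indexes
        kids = [c for c in (2 * i + 1, 2 * i + 2) if c < n]
        incl[i] = A[i] + sum(excl[c] for c in kids)
        excl[i] = sum(max(incl[c], excl[c]) for c in kids)
    return max(incl[0], excl[0]) if n else 0


A = [1, 0, 0, 0, 0, 0, 0, 1]

if __name__ == "__main__":
    print(level_based_weight(A), optimal_weight(A))
```

The two weight-1 nodes sit on levels of different parity, so the level-based approach can only ever collect one of them, while the optimum takes both.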

For every vertex in a graph, find all vertices within a distance d

In my particular case, the graph is represented as an adjacency list and is undirected and sparse, n can be in the millions, and d is 3. Calculating A^d (where A is the adjacency matrix) and picking out the non-zero entries works, but I'd like something that doesn't involve matrix multiplication. A breadth-first search on every vertex is also an option, but it is slow.
def find_d(graph, start, st, d=0):
    # start is always within distance d of itself
    st.add(start)
    if d > 0:
        for edge in graph[start]:
            find_d(graph, edge, st, d-1)
    return st
graph = { 1 : [2, 3],
          2 : [1, 4, 5, 6],
          3 : [1, 4],
          4 : [2, 3, 5],
          5 : [2, 4, 6],
          6 : [2, 5]
        }
print(find_d(graph, 1, set(), 2))
Let's say that we have a function verticesWithin(d,x) that finds all vertices within distance d of vertex x.
One good strategy for a problem such as this, to expose caching/memoisation opportunities, is to ask the question: How are the subproblems of this problem related to each other?
In this case, we can see that verticesWithin(d,x) if d >= 1 is the union of verticesWithin(d-1,y[i]) for all i within range, where y=verticesWithin(1,x). If d == 0 then it's simply {x}. (I'm assuming that a vertex is deemed to be of distance 0 from itself.)
In practice you'll want to look at the adjacency list for the case d == 1, rather than using that relation, to avoid an infinite loop. You'll also want to avoid the redundancy of considering x itself as a member of y.
Also, if the return type of verticesWithin(d,x) is changed from a simple list or set, to a list of d sets representing increasing distance from x, then
verticesWithin(d,x) = init(verticesWithin(d+1,x))
where init is the function that yields all elements of a list except the last one. Obviously this would be a non-terminating recursive relation if transcribed literally into code, so you have to be a little bit clever about how you implement it.
Equipped with these relations between the subproblems, we can now cache the results of verticesWithin, and use these cached results to avoid performing redundant traversals (albeit at the cost of performing some set operations - I'm not entirely sure that this is a win). I'll leave it as an exercise to fill in the implementation details.
You already mention the option of calculating A^d, but this is much, much more than you need (as you already remark).
There is, however, a much cheaper way of using this idea. Suppose you have a (column) vector v of zeros and ones, representing a set of vertices. The vector w := A v now has a positive entry at every node that can be reached from that set in exactly one step. Iterating, u := A w has a positive entry for every node you can reach from the set in exactly two steps, etc.
For d=3, you could do the following (MATLAB pseudo-code):
v = j'th unit vector
w = v
for i = (1:d)
    v = A*v
    w = w + v
end
The vector w now has a positive entry for each node that can be accessed from the jth node in at most d steps.
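The same iteration can be sketched in plain Python. This is an illustration only: the function name is hypothetical, the adjacency matrix is a dense list of lists, and a real implementation for millions of vertices would use a sparse representation instead.

```python
def within_d_by_matvec(A, j, d):
    """Vertices reachable from vertex j in at most d steps via repeated A*v."""
    n = len(A)
    v = [1 if i == j else 0 for i in range(n)]   # j'th unit vector
    w = v[:]
    for _ in range(d):
        # One matrix-vector product: v[r] counts walks that end at r.
        v = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return {i for i, wi in enumerate(w) if wi > 0}


# Path graph 0 - 1 - 2 - 3 as an adjacency matrix.
path_matrix = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

if __name__ == "__main__":
    print(within_d_by_matvec(path_matrix, 0, 2))
```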
Breadth-first search starting with the given vertex is an optimal solution in this case. You will find all the vertices that are within distance d, and you will never even visit any vertices with distance >= d + 2.
Here is recursive code, although recursion can be easily done away with if so desired by using a queue.
// Returns the set of nodes within distance d of x
Set<Node> getNodesWithinDist(Node x, int d)
{
    Set<Node> s = new HashSet<Node>(); // our return value
    s.add(x); // x is at distance 0 from itself, so it always belongs to the result
    if (d > 0) {
        for (Node y: adjList(x)) {
            s.addAll(getNodesWithinDist(y, d-1));
        }
    }
    return s;
}
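The queue-based variant mentioned above might look like the following Python sketch. The function name vertices_within is an assumption and the graph is a plain adjacency dict. Because every vertex is recorded the first time it is seen, no vertex is expanded twice, and vertices at distance d are never expanded at all.

```python
from collections import deque


def vertices_within(graph, start, d):
    """Return all vertices at distance <= d from start (BFS with depth cutoff)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue                      # do not expand beyond distance d
        for nb in graph[node]:
            if nb not in dist:            # first visit gives the shortest distance
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return set(dist)


graph = {1: [2, 3],
         2: [1, 4, 5, 6],
         3: [1, 4],
         4: [2, 3, 5],
         5: [2, 4, 6],
         6: [2, 5]}

if __name__ == "__main__":
    print(vertices_within(graph, 1, 1))
    print(vertices_within(graph, 1, 2))
```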

Resources