In a recent algorithms course we had to form a condensation graph and compute its reflexive-transitive closure to get a partial order. But it was never really explained why we would want to do that in a graph. I understand the gist of a condensation graph in that it highlights the strongly connected components, but what does the partial order give us that the original graph did not?
The algorithm implemented went like this:
Find strongly connected components (I used Tarjan's algorithm)
Create condensation graph for the SCCs
Form reflexive-transitive closure of adjacency matrix (I used Warshall's algorithm)
Doing that forms the partial order, but.... what advantage does finding the partial order give us?
Like any other data structure or algorithm, it has advantages only if its properties are needed :-)
The result of the procedure you described is a structure that can be used to (easily) answer questions like:
For two nodes x and y, is x <= y and/or y <= x, or neither?
For a node x, find all nodes a such that a <= x, or such that x <= a.
These properties can be used to answer other questions about the initial graph (DAG), for example whether adding an edge x->y will produce a cycle. That can be checked by intersecting the set A of nodes a with a <= x and the set B of nodes b with y <= b. If the intersection of A and B is not empty, then the edge x->y creates a cycle.
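Roughly, in Python: a minimal sketch assuming the closure has been materialised as per-node ancestor and descendant sets (the names ancestors and descendants are just illustrative):

    def creates_cycle(ancestors, descendants, x, y):
        # ancestors[v]: all a with a <= v (a reaches v), including v itself
        # descendants[v]: all b with v <= b (v reaches b), including v itself
        # The new edge x -> y closes a cycle exactly when some node is both
        # an ancestor of x and a descendant of y, i.e. y already reaches x.
        return bool(ancestors[x] & descendants[y])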
The structure can also be used to simplify the implementation of algorithms that use the graph to describe other dependencies. E.g. x->y means that the result of calculation x is used for calculation y. If calculation x is changed, then all calculations a where x <= a should be re-evaluated, flagged 'dirty', or have their results removed from a cache.
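For the cache-invalidation case, a sketch along the same lines (mark_dirty is a hypothetical helper; descendants again comes from the closure):

    def mark_dirty(descendants, dirty, x):
        # With a reflexive closure, descendants[x] already contains x itself,
        # so this flags x and every calculation that transitively depends on it.
        dirty.update(descendants[x])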
I have an application that uses a directed acyclic graph (DAG) to represent events ordered by time. My goal is to create or find an algorithm to simplify the graph by removing certain edges with specific properties. I'll try to define what I mean:
In the example below, a is the first node and f is the last. In the first picture, there are four unique paths from a to f. If we isolate the paths between b and e, we have two alternative paths. The path that is a single edge, namely the edge between b and e, is the type of path that I want to remove, leaving the graph in the second picture as a result.
Therefore, all the edges I want to remove are defined as: single edges between two nodes that have at least one other path with more than one edge.
I realize this might be a very specific kind of graph operation, but hoping this algorithm already exists out there, my question to Stack Overflow is: Is this a known graph operation, or should I get my hiney to the algorithm drawing board?
Like Matt Timmermans said in the comment: that operation is called a transitive reduction.
Thanks Matt!
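For anyone landing here later, a naive sketch of transitive reduction for a DAG (the adjacency-dict representation and helper names are mine; if you already use networkx, its transitive_reduction function does this for DAGs):

    def transitive_reduction(adj):
        # adj: dict mapping each node to a set of direct successors.
        # Returns a copy in which every edge u -> v is dropped when v is still
        # reachable from u via some longer path. Quadratic-ish, fine for small graphs.
        def reachable(src, dst, skip_edge):
            stack, seen = [src], set()
            while stack:
                node = stack.pop()
                if node == dst:
                    return True
                if node in seen:
                    continue
                seen.add(node)
                for nxt in adj.get(node, ()):
                    if (node, nxt) != skip_edge:
                        stack.append(nxt)
            return False

        reduced = {u: set(vs) for u, vs in adj.items()}
        for u in adj:
            for v in adj[u]:
                if reachable(u, v, skip_edge=(u, v)):
                    reduced[u].discard(v)
        return reduced

    # toy graph loosely matching the question's picture: the b -> e shortcut goes away
    g = {'a': {'b'}, 'b': {'c', 'e'}, 'c': {'d'}, 'd': {'e'}, 'e': {'f'}, 'f': set()}
    assert 'e' not in transitive_reduction(g)['b']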
I want to find the connected components in an undirected graph. However, I don't have an adjacency matrix. Instead I have a set of vertices as well as a function telling me whether two vertices are adjacent. What is the most efficient way to find all connected components?
I know I could just calculate the entire adjacency matrix and use depth-first search to find all components. But that would not be very efficient as I'd need to check every pair of vertices.
What I'm currently doing is the following procedure:
Pick any unassigned vertex which is now its own component
Find all neighbors of that vertex and add them to the component
Find all neighbors of the just added vertices (amongst those vertices not yet assigned to any component) and add them too
Repeat previous step until no new neighbors can be found
The component is now complete, repeat from the first step to find other components until all vertices are assigned
This is the pseudocode:
    def connected_components(vertices):
        # vertices belonging to the current component and whose neighbors have not yet been added
        vertices_to_check = [vertices.pop()]
        # vertices belonging to the current component
        current_component = []
        components = []
        while vertices or vertices_to_check:
            if not vertices_to_check:  # all vertices in the current component have been found
                components.append(current_component)
                current_component = []
                vertices_to_check = [vertices.pop()]
            next_vertex = vertices_to_check.pop()
            current_component.append(next_vertex)
            # find all still-unassigned neighbors of next_vertex
            # (collected first, so we don't remove from vertices while iterating over it)
            neighbors = [v for v in vertices if vertices_adjacent(v, next_vertex)]
            for neighbor in neighbors:
                vertices.remove(neighbor)
                vertices_to_check.append(neighbor)
        components.append(current_component)
        return components
I understand that this method is faster than calculating the adjacency matrix in most cases, as I don't need to check whether two vertices are adjacent if it is already known that they belong to the same component. But is there a way to improve this algorithm?
Ultimately, any algorithm will have to call vertices_adjacent for every single pair of vertices that turn out to belong to separate components, because otherwise it will never be able to verify that there's no link between those components.
Now, if a majority of vertices all belong to a single component, then there may not be too many such pairs; but unless you expect that a majority of vertices will all belong to a single component, there's little point optimizing specifically for that case. So, discarding that case, the very best-case scenario is:
There turn out to be exactly two components, each with the same number of vertices (½|V| each). So there are ¼|V|² pairs of vertices that belong to separate components, and you need to call vertices_adjacent for each of those pairs.
These two components turn out to be complete, or you turn out to be exceptionally lucky in your choice of edges to check for first, such that you can detect the connected parts by checking just |V| − 2 pairs.
. . . which still involves making ¼|V|² + |V| − 2 calls to vertices_adjacent. By comparison, the build-an-adjacency-list approach makes ½|V|² − ½|V| calls — which is more than the best-case scenario, but by a factor of less than 2. (And the worst-case scenario is simply equivalent to the build-an-adjacency-list approach. That would happen if no component contains more than two vertices, or if the graph is acyclic and you get unlucky in your choice of edges to check first. Most graphs will be somewhere in between.)
So it's probably not worth trying to optimize too closely for the exact minimum number of calls to vertices_adjacent.
That said, your approach seems pretty reasonable to me; it doesn't make any calls to vertices_adjacent that are clearly unnecessary, so the only improvement would be a probabilistic one, if it could do a better job guessing which calls will turn out to be useful for eliminating later calls.
One possibility: in many graphs, there are some vertices that have a lot of neighbors and some vertices that have relatively few, according to a power-law distribution. So if you prioritize vertices based on how many neighbors they're already known to have, you may be able to take advantage of that pattern. (I think this is especially likely to be useful if the majority of vertices really do all belong to a single component, which is the only case where a better-than-factor-of-2 improvement is even conceivable.) But, you'll have to test to see if it actually makes a difference for the graphs you're interested in.
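One way to fold that into the pseudocode from the question, as a sketch: instead of popping an arbitrary frontier vertex, pop the one with the most neighbours discovered so far (known_degree is a hypothetical counter that the main loop would increment each time an adjacency test succeeds):

    def pop_most_connected(vertices_to_check, known_degree):
        # prefer the frontier vertex with the most already-discovered neighbors;
        # on power-law graphs this tends to pull in the big hubs early
        best = max(vertices_to_check, key=lambda v: known_degree.get(v, 0))
        vertices_to_check.remove(best)
        return best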
I'm trying to represent a transitive relation (in a database) and having a hard time working out the best data structure.
Basically, the data structure is a series of pairs A → B such that if A → B and B → C, then implicitly A → C. It's important to me to be able to identify which entries are original input and which entries exist implicitly. Asking if A → C is equivalent to me having a digraph and asking if there exists a path from A to C in that digraph.
I could just represent the original entries, but if I do that, then it takes a lot of time to determine whether two items are related, since I need to search for all possible paths, and this is rather slow.
Alternatively, I can store the original edges, as well as a listing of all paths. This makes adding a new edge easy, because when I add A → B I can just take the Cartesian product of the paths ending in A and the paths starting at B and put them together. This has a significant space overhead of O(n²) in the worst case, but has the nice property that lookups, by far the most common operation, will be constant time. The issue is deleting, where I cannot think of anything better than recalculating all paths that may or may not run through the deleted edge, and this can be really nasty.
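To make that second option concrete, here is roughly what I have in mind (just a sketch: the ancestor/descendant sets stand in for the listing of all paths, and deletion is exactly the part it doesn't handle):

    from collections import defaultdict

    original_edges = set()
    ancestors = defaultdict(set)    # ancestors[v]: all x with a stored path x -> v
    descendants = defaultdict(set)  # descendants[v]: all y with a stored path v -> y

    def add_edge(a, b):
        original_edges.add((a, b))
        # every node that reaches a (plus a itself) now reaches
        # every node that b reaches (plus b itself)
        for x in ancestors[a] | {a}:
            for y in descendants[b] | {b}:
                descendants[x].add(y)
                ancestors[y].add(x)

    def related(a, c):
        # constant-time lookup; reflexivity is left implicit
        return a == c or c in descendants[a]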
Does anyone have any better ideas?
Technical notes: the digraph may be cyclic, but the relation is reflexive so I don't need to represent the reflexivity or store anything about it.
This is called the Reachability problem.
It would seem that you want an efficient online algorithm, which is an open problem, and an area of much research.
See my similar question on cs.SE: An incrementally-condensed transitive-reduction of a DAG, with efficient reachability queries, where I reference several related questions across Stack Exchange:
Related:
What is the fastest deterministic algorithm for dynamic digraph reachability with no edge deletion?
What is the fastest deterministic algorithm for incremental DAG reachability?
Does an algorithm exist to efficiently maintain connectedness information for a DAG in presence of inserts/deletes?
Is there an online-algorithm to keep track of components in a changing undirected graph?
Dynamic shortest path data structure for DAG
Note that even though some algorithm might be for a DAG only, if it supports condensation (that is, collapsing strongly connected components into one node, since they are considered equal, i.e. they relate back and forth), it is equivalent; after condensation, you can query the graph for the representative node in place of any of the condensed nodes (because they were all reachable from each other, and thus related to the rest of the graph in exactly the same way).
My conclusion is that, as of yet, there does not seem to be an efficient way to do this (on the order of O(log n) queries for a dynamic graph, with output-sensitive update times on the condensed graph). For less efficient ways, see the related links above.
The closest practical algorithm I found was here (source), which is an interesting read. I am not sure how easy or practical it would be to adapt this data structure, or any data structure you will find in a paper, to a database.
PS. Consider asking CS-related questions on cs.stackexchange.com in the future.
Here is the problem:
Assuming two persons are registered on a social networking website, how do we decide whether they are connected or not?
My analysis (after reading more): actually, the question is looking for the shortest path from A to B in a graph. I think both BFS and Dijkstra's algorithm work here and the time complexity is exactly the same (O(V+E)) because it is an unweighted graph, so we can't take advantage of the priority queue. So, a simple queue could solve the problem. But neither of them solves the problem of actually finding the path between them.
Bidirectional search should be a better solution at this point.
To find a path between the two, you should begin with a breadth first search. First find all neighbors of A, then find all neighbors of all neighbors of A, etc. Once B is hit, not only do you have a path from A to B, but you also have a shortest such path.
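In rough Python (neighbors is an assumed accessor that yields a person's friends):

    from collections import deque

    def shortest_path(neighbors, a, b):
        # plain BFS from a, remembering each vertex's parent so the
        # shortest path can be read back once b is reached
        parent = {a: None}
        queue = deque([a])
        while queue:
            v = queue.popleft()
            if v == b:
                path = []
                while v is not None:
                    path.append(v)
                    v = parent[v]
                return path[::-1]   # the path from a to b
            for w in neighbors(v):
                if w not in parent:
                    parent[w] = v
                    queue.append(w)
        return None   # a and b are not connected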
Dijkstra's algorithm rocks, and you may be able to speed this up by working from both ends, i.e. find neighbors of A and neighbors of B, and compare.
If you do a depth first search, then you're following one path at a time. This will be much much slower.
If you do dfs for finding whether two people are connected on a social network, then it will take too long!
You already know the two persons, so you should use Bidirectional Search. But simple bidirectional search won't be enough for a graph as big as a social networking site. You will have to use some heuristics. The Wikipedia page has some links to them.
You may also be able to use A* search. From Wikipedia: "A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals)."
Edit: I suggest A* because "The additional complexity of performing a bidirectional search means that the A* search algorithm is often a better choice if we have a reasonable heuristic." So, if you can't form a reasonable heuristic, then use Bidirectional search. (Forming a good heuristic is never easy ;).)
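A sketch of the plain (heuristic-free) bidirectional variant for the connectivity question, again with neighbors as an assumed accessor:

    def bidirectional_connected(neighbors, a, b):
        # expand from both ends, always growing the smaller frontier,
        # and stop as soon as the two searches meet
        if a == b:
            return True
        seen_a, seen_b = {a}, {b}
        frontier_a, frontier_b = {a}, {b}
        while frontier_a and frontier_b:
            if len(frontier_a) > len(frontier_b):
                frontier_a, frontier_b = frontier_b, frontier_a
                seen_a, seen_b = seen_b, seen_a
            next_frontier = set()
            for v in frontier_a:
                for w in neighbors(v):
                    if w in seen_b:
                        return True          # the two searches met
                    if w not in seen_a:
                        seen_a.add(w)
                        next_frontier.add(w)
            frontier_a = next_frontier
        return False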
One way is to use Union-Find: add all links with union(from, to), and if find(A) == find(B) then A and B are connected. This avoids the recursive search, but it actually computes the connectivity of all pairs and doesn't give you the path that connects A and B.
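A minimal sketch of that (the friendship_links edge list is hypothetical):

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            self.parent[self.find(a)] = self.find(b)

    # uf = UnionFind()
    # for src, dst in friendship_links:
    #     uf.union(src, dst)
    # uf.find(A) == uf.find(B)   # True iff A and B are connected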
I think that the true criterion is: there are at least N paths between A and B shorter than K, or A and B are connected directly. I would go with K = 3 and N near 5, i.e. having 5 common friends.
Any method might end up being very slow. If you need to do this repeatedly, it's best to find the connected components of the graph, after which the task becomes a trivial O(1) operation: if two people are in the same component, they are connected.
Note that finding connected components for the first time might be slow, but keeping them updated as new edges/nodes are added to the graph is fast.
There are several methods for finding connected components.
One method is to construct the Laplacian of the graph and look at its eigenvalues/eigenvectors. The number of zero eigenvalues gives you the number of connected components. The non-zero elements of the corresponding eigenvectors give the nodes belonging to the respective components.
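A small dense-matrix sketch of that (fine for small graphs; you would not build a dense matrix for a social-network-sized graph):

    import numpy as np

    def count_components(adjacency):
        # adjacency: symmetric 0/1 matrix of an undirected graph
        degrees = np.diag(adjacency.sum(axis=1))
        laplacian = degrees - adjacency
        eigenvalues = np.linalg.eigvalsh(laplacian)
        # the Laplacian is positive semi-definite; count the (numerically) zero eigenvalues
        return int(np.sum(eigenvalues < 1e-9))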
Another way is along the following lines:
Create a transformation table of nodes. Element n of the array contains the index of the node that node n transforms to.
Loop through all edges (i,j) in the graph (denoting a connection between i and j):
Compute recursively which nodes i and j transform to, based on the current table. Let us denote the results by k and l. Update entry k to make it transform to l. Update entries i and j to point to l as well.
Loop through the table again, and update each entry to point directly to the node it recursively transforms to.
Now nodes in the same connected component will have the same entry in the transformation table. So to check if two nodes are connected, just check if they transform to the same value.
Every time a new node or edge is added to the graph, the transformation table needs to be updated, but this update will be much faster than the original calculation of the table.
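A sketch of the transformation-table approach in Python (the names are mine):

    def build_transform_table(nodes, edges):
        transform = {n: n for n in nodes}

        def resolve(n):
            # follow the table until a node transforms to itself
            while transform[n] != n:
                n = transform[n]
            return n

        for i, j in edges:
            k, l = resolve(i), resolve(j)
            transform[k] = l
            transform[i] = l
            transform[j] = l

        # final pass: make every entry point directly at its representative
        for n in nodes:
            transform[n] = resolve(n)
        return transform

    # two nodes u, v are connected iff table[u] == table[v]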
Is there an algorithm or heuristics for graph isomorphism?
Corollary: a graph can be represented by different drawings.
What is the best approach to find the different drawings of a graph?
It is a hell of a problem.
In general, the basic idea is to simplify the graph into a canonical form, and then perform comparison of canonical forms. Spanning trees are generated with this objective, but spanning trees are not unique, so you need to have a canonical way to create them.
After you have canonical forms, you can perform the isomorphism comparison (relatively) easily, but that's just the start, since non-isomorphic graphs can have the same spanning tree. (E.g. think about a spanning tree T and a single addition of an edge to it to create T'. These two graphs are not isomorphic, but they have the same spanning tree.)
Other techniques involve comparing descriptors (e.g. number of nodes, number of edges), which can produce false positives in general.
I suggest you start with the wiki page about the graph isomorphism problem. I also have a book to suggest: "Graph Theory and its Applications". It's a tome, but worth every page.
As per your corollary, every possible spatial arrangement of a given graph's vertices is an isomorph. So two isomorphic graphs have the same topology and they are, in the end, the same graph from the topological point of view. Another matter is, for example, to find those isomorphic drawings enjoying particular properties (e.g. with non-crossing edges, if they exist), and that depends on the properties you want.
One of the best algorithms out there for finding graph isomorphisms is VF2.
I've written a high-level overview of VF2 as applied to chemistry - where it is used extensively. The post touches on the differences between VF2 and Ullmann. There is also a from-scratch implementation of VF2 written in Java that might be helpful.
A very similar problem - graph automorphism - can be solved by saucy, which is available in source code. This finds all symmetries of a graph. If you have two graphs, join them into one and any isomorphism can be discovered as an automorphism of the join.
Disclaimer: I am one of co-authors of saucy.
There are algorithms to do this -- however, I have not had cause to seriously investigate them as of yet. I believe Donald Knuth is either writing or has written on this subject in his Art of Computer Programming series during his second pass at (re)writing it.
As for a simple way to do something that might work in practice on small graphs, I would recommend counting degrees, and then, for each vertex, also noting the set of degrees of its adjacent vertices. This gives you a set of potential matches for each vertex. Then just try all of those (via brute force, but choosing the vertices in increasing order of the size of their potential match sets). Intuitively, most graph isomorphisms can be practically computed this way, though clearly there would be degenerate cases that might take a long time.
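A sketch of the candidate-restriction step (the dict-of-sets graph representation is just for illustration):

    def candidate_map(g1, g2):
        # g1, g2: dicts mapping each vertex to the set of adjacent vertices.
        # A vertex of g1 can only map to vertices of g2 with the same
        # (degree, sorted neighbor degrees) signature; the brute-force search
        # then only has to try these restricted sets, smallest sets first.
        def signature(g, v):
            return (len(g[v]), tuple(sorted(len(g[w]) for w in g[v])))

        sig_to_vertices = {}
        for v in g2:
            sig_to_vertices.setdefault(signature(g2, v), set()).add(v)

        return {v: sig_to_vertices.get(signature(g1, v), set()) for v in g1}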
I recently came across the following paper: http://arxiv.org/abs/0711.2010
This paper proposes "A Polynomial Time Algorithm for Graph Isomorphism"
My project - Griso - at sf.net: http://sourceforge.net/projects/griso/ with this description:
Griso is a graph isomorphism testing utility written in C++. It is based on my own POLYNOMIAL-TIME (at this point, the salt of the project) algorithm. See Griso's sample input/output on the http://funkybee.narod.ru/graphs.htm page.
nauty and Traces
nauty and Traces are programs for computing automorphism groups of graphs and digraphs [*]. They can also produce a canonical label. They are written in a portable subset of C, and run on a considerable number of different systems.
AutGroupGraph command in the GRAPE package of GAP.
bliss: another symmetry and canonical labeling program.
conauto: a graph isomorphism package.
As for heuristics: I've been fantasising about a modified Ullmann's algorithm, where you don't only use breadth-first search but mix it with depth-first search, in such a way that first you use breadth-first search intensively, then you set a limit for the breadth analysis and go deeper after checking a few neighbours, lowering the breadth by some amount at every step. This is practically how I find my way on a map: first locate myself with breadth-first search, then search the route with depth-first search - largely, and this is the best strategy the evolution of my brain has ever invented. :) In the long term some intelligence may be added to increase the breadth-first search neighbour count at critical vertices - for example where there are a large number of neighbouring vertices with the same edge count. Like checking your actual route sometimes with the car (without a GPS).
I've found out that the algorithm belongs to the category of k-dimensional Weisfeiler-Lehman algorithms, and it fails with regular graphs. For more, see here:
http://dabacon.org/pontiff/?p=4148
Original post follows:
I've worked on the problem to find isomorphic graphs in a database of graphs (containing chemical compositions).
In brief, the algorithm creates a hash of a graph using the power iteration method. There might be false positive hash collisions, but the probability of that is exceedingly small (I didn't have any such collisions with tens of thousands of graphs).
The way the algorithm works is this:
Do N (where N is the radius of the graph) iterations. On each iteration and for each node:
Sort the hashes (from the previous step) of the node's neighbors
Hash the concatenated sorted hashes
Replace node's hash with newly computed hash
On the first step, a node's hash is affected by the direct neighbors of it. On the second step, a node's hash is affected by the neighborhood 2-hops away from it. On the Nth step a node's hash will be affected by the neighborhood N-hops around it. So you only need to continue running the Powerhash for N = graph_radius steps. In the end, the graph center node's hash will have been affected by the whole graph.
To produce the final hash, sort the final step's node hashes and concatenate them together. After that, you can compare the final hashes to find if two graphs are isomorphic. If you have labels, then add them (on the first step) in the internal hashes that you calculate for each node.
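Schematically, in Python (this is not the actual implementation, which is linked below, and sha256 is an arbitrary choice of hash):

    import hashlib

    def graph_fingerprint(neighbors, labels, iterations):
        # neighbors: dict vertex -> iterable of adjacent vertices
        # labels: dict vertex -> initial label (use a constant if unlabeled)
        # iterations: roughly the radius of the graph
        def h(text):
            return hashlib.sha256(text.encode()).hexdigest()

        # step 0: a node's hash comes from its own label
        node_hash = {v: h(str(labels[v])) for v in neighbors}
        for _ in range(iterations):
            node_hash = {
                # concatenate the sorted hashes of v's neighbors and re-hash
                v: h("".join(sorted(node_hash[w] for w in neighbors[v])))
                for v in neighbors
            }
        # final fingerprint: sort and concatenate the per-node hashes
        return h("".join(sorted(node_hash.values())))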
There is more background here:
https://plus.google.com/114866592715069940152/posts/fmBFhjhQcZF
You can find the source code of it here:
https://github.com/madgik/madis/blob/master/src/functions/aggregate/graph.py