What algorithm can I apply to this DAG?

I have a DAG representing a list of properties. These properties are such that if a>b, then a has a directed edge to b. It is transitive as well, so that if a>b and b>c, then a has a directed edge to c.
However, the directed edge from a to c is superfluous because a has a directed edge to b and b has a directed edge to c. How can I prune all these superfluous edges? I was thinking of using a minimum spanning tree algorithm, but I'm not really sure which algorithm is appropriate in this situation.
I suppose I could do a depth-first search from each node along each of its outgoing edges and check whether it can reach certain nodes without using certain edges, but this seems horribly inefficient and slow.
After the algorithm is complete, the output would be a linear list of all the nodes in an order that is consistent with the graph. So if a has three directed edges, to b, c, and d, and b and c each have a directed edge to d, the output could be either abcd or acbd.

This is called the transitive reduction problem. Formally speaking, you are looking for a minimal (fewest edges) directed graph, the transitive closure of which is equal to the transitive closure of the input graph. (The diagram on the above Wikipedia link makes it clear.)
Apparently an efficient algorithm exists for solving this problem that takes the same time as producing a transitive closure (i.e. the more common inverse problem of adding transitive links instead of removing them); however, the 1972 paper by Aho, Garey, and Ullman is behind a $25 paywall, and some quick googling didn't turn up any nice descriptions.
EDIT: Scott Cotton's graphlib contains a Java implementation! This Java library looks to be very well organised.
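For a DAG, a workable (if not asymptotically optimal) approach follows straight from the definition: an edge u -> v is superfluous exactly when v is reachable from some other successor of u. Here is a minimal Python sketch of that idea, not the Aho, Garey, and Ullman algorithm, assuming the graph is given as an adjacency dict of sets:

```python
def transitive_reduction(adj):
    """Return a copy of the DAG `adj` (node -> set of successors) with
    every edge u -> v removed when v is also reachable from u via a
    longer path. One DFS per edge, so roughly O(E * (V + E))."""
    reduced = {u: set(vs) for u, vs in adj.items()}
    for u in adj:
        for w in adj[u]:
            # Collect every strict descendant of w.
            stack, seen = [w], set()
            while stack:
                for y in adj.get(stack.pop(), ()):
                    if y not in seen:
                        seen.add(y)
                        stack.append(y)
            # Direct edges from u into w's descendants are redundant.
            reduced[u] -= seen
    return reduced

# The example from the question: a -> b, c, d; b -> d; c -> d.
dag = {"a": {"b", "c", "d"}, "b": {"d"}, "c": {"d"}, "d": set()}
print(transitive_reduction(dag))  # a keeps only its edges to b and c
```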

Actually, after looking around a little more, I think a topological sort is what I'm really after here.

If there are already n nodes with directed edges:
1. Starting from any node M, loop over all its child edges and select the biggest child (say N), removing the other edges; the complexity should be O(n). If no such N exists (no child edges), go to step 3.
2. Start from N and repeat step 1.
3. Starting from node M, select the smallest parent node (say T) and remove the other parent edges.
4. Start from T and repeat step 3, and so on.
Actually it's just an ordering algorithm, and the total complexity should be O(n^2/2).
One problem is that if we want to loop over a node's parent nodes, we need extra memory to log the edges so that we can trace back from child to parent. This can be improved in step 3 by choosing, from the remaining nodes, one that is bigger than M, which means we need to keep a list of the remaining nodes.

Related

Could Kruskal’s algorithm be implemented in this way instead of using a disjoint-set forest?

I am studying Kruskal's MST from this geeksforgeeks article. The steps given are:
Sort all the edges in non-decreasing order of their weight.
Pick the smallest edge. Check if it forms a cycle with the spanning tree formed so far. If cycle is not formed, include this edge. Else, discard it.
Repeat step (2) until there are (V-1) edges in the spanning tree.
I really don't feel any need to use a disjoint set. Instead, to check for a cycle, we can just store vertices in a visited array and mark them as true whenever an edge is selected. Looping through the edges, if we find an edge both of whose vertices are in the visited array, we ignore that edge.
In other words, instead of storing a disjoint-set forest, can’t we just store an array of bits indicating which vertices have been linked to another edge in some previous step?
The approach you’re describing will not work properly in all cases. As an example, consider this line graph:
A - - B - - C - - D
Let’s assume A - B has weight 1, C - D has weight 2, and B - C has weight 3. What will Kruskal’s algorithm do here? First, it’ll add in A - B, then C - D, and then B - C.
Now imagine what your implementation will do. When we add A - B, you’ll mark A and B as having been visited. When we then add C - D, you’ll mark C and D as having been visited. But then when we try to add B - C, since both B and C are visited, you’ll decide not to add the edge, leaving a result that isn’t connected.
The issue here is that when building up an MST you may add edges linking nodes that have already been linked to other nodes in the past. The criterion for adding an edge is therefore less “have these nodes been linked before?” and more “is there already a path between these nodes?” That’s where the disjoint-set forest comes in.
It’s great that you’re poking and prodding conventional implementations and trying to find ways to improve them. You’ll learn a lot about those algorithms if you do! In this case, it just so happens that what you’re proposing doesn’t quite work, and seeing why it doesn’t work helps shed light on why the existing approach is what it is.
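To make the contrast concrete, here is a minimal Kruskal sketch in Python with a disjoint-set forest; `find` (with path compression, union by rank omitted for brevity) is what answers "is there already a path between these nodes?" cheaply:

```python
def kruskal(n, edges):
    """edges: list of (weight, u, v) with vertices 0..n-1.
    Returns the MST edges, assuming the graph is connected."""
    parent = list(range(n))

    def find(x):                      # root of x's tree, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):     # non-decreasing weight
        ru, rv = find(u), find(v)
        if ru != rv:                  # no existing path between u and v
            parent[ru] = rv           # union the two trees
            mst.append((u, v, w))
    return mst

# The line graph from above (A=0, B=1, C=2, D=3): B - C is kept even
# though both B and C were already touched by earlier edges.
print(kruskal(4, [(1, 0, 1), (2, 2, 3), (3, 1, 2)]))
```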
I really don't feel any need to use a disjoint set. Instead, to check for a cycle, we can just store vertices in a visited array and mark them as true whenever an edge is selected. Looping through the edges, if we find an edge both of whose vertices are in the visited array, we ignore that edge.
Yes, of course you can do that. The point of using a disjoint set in this algorithm is performance: a suitable disjoint-set implementation yields better asymptotic performance than a list-based approach can.

Cycle detection that handles a series of directed edges? [duplicate]

I came upon wait-for graphs, and I wonder: are there any efficient algorithms for detecting whether adding an edge to a directed graph results in a cycle?
The graphs in question are mutable (they can have nodes and edges added or removed). And we're not interested in actually knowing an offending cycle; just knowing there is one is enough (to prevent adding an offending edge).
Of course it'd be possible to use an algorithm for computing strongly connected components (such as Tarjan's) to check if the new graph is acyclic or not, but running it again every time an edge is added seems quite inefficient.
If I understood your question correctly, then a new edge (u,v) is only inserted if there was no path from v to u before (i.e., if (u,v) does not create a cycle). Thus, your graph is always a DAG (directed acyclic graph). Using Tarjan's algorithm to detect strongly connected components (http://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm) sounds like overkill in this case. Before inserting (u,v), all you have to check is whether there is a directed path from v to u, which can be done with a simple BFS/DFS.
So the simplest way of doing it is the following (n = |V|, m = |E|):
Inserting (u,v): Check whether there is a path from v to u (BFS/DFS). Time complexity: O(m)
Deleting edges: Simply remove them from the graph. Time complexity: O(1)
Although inserting (u,v) takes O(m) time in the worst case, it is probably pretty fast in your situation. When doing the BFS/DFS starting from v to check whether u is reachable, you only visit vertices that are reachable from v. I would guess that in your setting the graph is pretty sparse and that the number of vertices reachable from any given vertex is not that high.
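A minimal sketch of that insert-time check, assuming adjacency sets (`adj` and `try_add_edge` are my names, not an established API):

```python
def try_add_edge(adj, u, v):
    """Add u -> v to the DAG `adj` (node -> set of successors) unless a
    directed path v -> ... -> u already exists, which would close a
    cycle. One DFS, so O(n + m) per insertion."""
    stack, seen = [v], {v}
    while stack:
        x = stack.pop()
        if x == u:
            return False              # u reachable from v: reject the edge
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    adj.setdefault(u, set()).add(v)
    return True
```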
However, if you want to improve the theoretical running time, here are some hints (mostly showing that this will not be very easy). Assume we aim for testing in O(1) time whether there exists a directed path from v to u. The keyword in this context is the transitive closure of a DAG (i.e., a graph that contains an edge (u, v) if and only if there is a directed path from u to v in the DAG). Unfortunately, maintaining the transitive closure in a dynamic setting seems to be not that simple. There are several papers considering this problem and all papers I found were STOC or FOCS papers, which indicates that they are very involved. The newest (and fastest) result I found is in the paper Dynamic Transitive Closure via Dynamic Matrix Inverse by Sankowski (http://dl.acm.org/citation.cfm?id=1033207).
Even if you are willing to understand one of those dynamic transitive closure algorithms (or even want to implement one), they will not give you any speed-up, for the following reason. These algorithms are designed for the situation where you have a lot of connectivity queries (which can then be performed in O(1) time) and only a few changes in the graph. The goal then is to make these changes cheaper than recomputing the transitive closure. However, such an update is still slower than a single check for connectivity. Thus, if you need to do an update on every connectivity query, it is better to use the simple approach mentioned above.
So why do I mention this approach of maintaining the transitive closure if it does not fit your needs? Well, it shows that searching for an algorithm with only O(1) query time will probably not lead you to a solution faster than the simple one using BFS/DFS. What you could try is to get a query time that is faster than O(m) but worse than O(1), while updates are also faster than O(m). This is a very interesting problem, but it sounds to me like a very ambitious goal (so maybe do not spend too much time trying to achieve it).
As Mark suggested, it is possible to use a data structure that stores the connected nodes. The best choice is a boolean |V|x|V| matrix, whose values can be initialized with the Floyd–Warshall algorithm in O(|V|^3).
Let T(i) be the set of vertices that have a path to vertex i, and F(j) the set of vertices reachable from vertex j. The first is the set of true entries in the i-th row and the second the true entries in the j-th column.
Adding an edge (i,j) is a simple operation: if i and j weren't connected before, then for each a from T(i) and each b from F(j), set matrix element (a,b) to true. But the operation isn't cheap; in the worst case it is O(|V|^2). That is the case for a directed line graph, where adding an edge from the end back to the start vertex makes every vertex connected to every other vertex.
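A sketch of that insertion, assuming `reach` is the closure matrix and that `reach[x][x]` is kept true so that T(i) and F(j) effectively include i and j themselves:

```python
def add_edge(reach, i, j):
    """Update reachability matrix `reach` after inserting edge (i, j);
    reach[a][b] is True iff there is a path from a to b, and reach[x][x]
    is True by convention. Worst case O(|V|^2) per insertion."""
    n = len(reach)
    if reach[i][j]:
        return                        # nothing new becomes reachable
    for a in range(n):
        if reach[a][i]:               # a has a path to i (a in T(i), or a == i)
            for b in range(n):
                if reach[j][b]:       # j has a path to b (b in F(j), or b == j)
                    reach[a][b] = True
```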
Removing an edge (i,j) is not so simple, but it is no more expensive an operation in the worst case :-) If there is still a path from i to j after removing the edge, then nothing changes. That is checked with Dijkstra, in less than O(|V|^2). The vertices that are no longer connected are the pairs (a,b) with:
a in T(i) - i - T(j),
b in F(j) + j
Only T(j) is changed by removing edge (i,j), so it has to be recalculated. That is done by any kind of graph traversal (BFS, DFS), going in the opposite edge direction from vertex j, which takes less than O(|V|^2). Since clearing the matrix elements is again O(|V|^2) in the worst case, this operation has the same worst-case complexity as adding an edge.
This is a problem which I recently faced in a slightly different situation (optimal ordering of interdependent compiler instructions).
While I can't improve on O(n*n) theoretical bounds, after a fair amount of experimentation and assuming heuristics for my case (for example, assuming that the initial ordering wasn't created maliciously) the following was the best compromise algorithm in terms of performance.
(In my case I had an acceptable "right side failure": after the initial nodes and arcs were added (which was guaranteed to be possible), it was acceptable for the optimiser to occasionally reject the addition of further arcs where one could actually be added. The approximation isn't necessary when the algorithm is carried to completion, but the algorithm does admit such an approximation if you wish, further limiting its runtime.)
While a graph is topologically sorted, it is guaranteed to be cycle-free. In the first phase when I had a static bulk of nodes and arcs to add, I added the nodes and then topologically sorted them.
During the second phase, adding additional arcs, there are two situations when considering an arc from A to B. If A already lies to the left of B in the sort, an arc can simply be added and no cycle can be generated, as the list is still topologically sorted.
If B is to the left of A, we consider the sub-sequence between B and A and partition it into two disjoint sequences, X and Y, where X is those nodes which can reach A (and Y the others). If A is not reachable from B, i.e. there are no direct arcs from B into X or to A, then the sequence can be reordered XABY before adding the A to B arc, showing it is still cycle-free and maintaining the topological sort. The efficiency over the naive algorithm here is that we only need to consider the subsequence between B and A, because the list is topologically sorted: A is not reachable from any node to the right of A. For my situation, where localised reorderings are the most frequent and important, this is an important gain.
As we don't reorder within the sequences X, A, B, Y, clearly any arcs which start and end within the same sequence are still ordered correctly, as are those within each flank, and any "fly-over" arcs from the left flank to the right flank. Any arcs between the flanks and X, A, B, Y are also still ordered correctly, as our reordering is restricted to this local region. So we only need to consider arcs between our four sequences. Consider each possible "problematic" arc for our final ordering XABY in turn: YB, YA, YX, BA, BX, AX. Our initial order was B[XY]A, so AX and YB cannot occur. X reaches A but Y does not, therefore YX and YA do not occur, or else A could be reached from the source of the arc in Y (potentially via X), a contradiction. Our criterion for acceptability was that there are no arcs BX or BA. So there are no problematic arcs, and we are still topologically sorted.
Conversely, if our only acceptability criterion is violated (i.e. A is reachable from B), then adding the arc A->B clearly creates a cycle B -(X)-> A -> B, so rejecting the arc is also correct.
This can be implemented reasonably efficiently if we can add a flag to each node. Consider the nodes [BXY] going right-to-left from the node immediately to the left of A. If a node has a direct arc to A, set its flag. At an arbitrary such node, we need only consider direct outgoing arcs: the nodes to its right are either after A (and so irrelevant) or have already been flagged if reachable from A, so a node's flag is set when it has a direct arc to any flagged node or to A. If B is not flagged at the end of the process, the reordering is acceptable and the flagged nodes comprise X.
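A Python sketch of that flagging pass and local reorder, assuming the sort is kept in a list `order` with a position index `pos` and adjacency sets `adj` (the names are mine):

```python
def add_arc(order, pos, adj, a, b):
    """Try to add arc a -> b while keeping `order` topologically sorted.
    Returns True if the arc was added, False if it would create a cycle."""
    if pos[a] < pos[b]:                   # a already left of b: just add
        adj[a].add(b)
        return True
    lo, hi = pos[b], pos[a]               # region order[lo..hi] is b ... a
    flagged = {a}                         # region nodes that can reach a
    for i in range(hi - 1, lo - 1, -1):   # right-to-left flagging pass
        n = order[i]
        if adj[n] & flagged:              # direct arc to a flagged node or a
            flagged.add(n)
    if b in flagged:
        return False                      # a is reachable from b: cycle
    # Partition the strict interior into X (reaches a) and Y (doesn't),
    # then reorder the region to X, a, b, Y: still a topological sort.
    mid = order[lo + 1:hi]
    x = [n for n in mid if n in flagged]
    y = [n for n in mid if n not in flagged]
    order[lo:hi + 1] = x + [a, b] + y
    for i in range(lo, hi + 1):
        pos[order[i]] = i
    adj[a].add(b)
    return True
```

For example, with order b, c, a and an existing arc c -> a, adding a -> b flags c (it reaches a) and reorders the region to c, a, b.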
Though this always yields a correct ordering if carried to completion (as far as I can tell), as I mentioned in the introduction it is particularly efficient if your initial build is approximately correct (in the sense of accommodating likely additional arcs without reordering).
There also exists an effective approximation if your context is such that "outrageous" arcs (those which would massively reorder) can be rejected, by limiting the A-to-B distance you are prepared to scan. If you have an initial list of the additional arcs you wish to add, they can be ordered by increasing distance in the initial ordering and processed until you run out of some scanning "credit", at which point you call your optimisation a day.
If the graph is directed, you would only have to check the parent nodes (navigate up until you reach the root) of the node where the new edge should start. If one of the parent nodes is equal to the end of the edge, adding the edge would create a cycle.
If all previous jobs are in topologically sorted order, then if you add an edge that appears to break the sort and cannot be fixed, you have a cycle.
https://stackoverflow.com/a/261621/831850
So if we have a sorted list of nodes:
1, 2, 3, ..., x, ..., z, ...
Such that each node is waiting for nodes to its left.
Say we want to add an edge from x->z. Well, that appears to break the sort. So we can move the node at x to position z+1, which will fix the sort iff none of the nodes in (x, z] have an edge to the node at x.
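A small sketch of that repair, assuming edge x -> z means x waits on z (so z must end up before x), with `order`, `pos`, and adjacency sets as hypothetical bookkeeping:

```python
def try_add_wait_edge(order, pos, adj, x, z):
    """Add wait-for edge x -> z. Returns False when the simple repair of
    moving x just after z is not applicable, i.e. some node in (x, z]
    has an edge to x."""
    if pos[z] < pos[x]:
        adj[x].add(z)                  # sort already satisfied
        return True
    i, j = pos[x], pos[z]
    if any(x in adj[order[k]] for k in range(i + 1, j + 1)):
        return False                   # simple move fails; may be a cycle
    order.insert(j, order.pop(i))      # after the pop, slot j is just past z
    for k in range(i, j + 1):          # re-index the shifted region
        pos[order[k]] = k
    adj[x].add(z)
    return True
```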

Identify Redundant Dependence in Graph

I have a directed acyclic graph (DAG) where each node stands for a task and each directed edge A -> B means task A should be done before task B starts.
A simple illustration could be like this (a diagram of the workflow appeared here in the original post):
So this is actually a workflow. In this graph, edge A -> B is considered redundant because task B needs task C done first, and task C needs task A done first (not to mention another path A -> D -> E -> B, which makes A -> B unnecessary).
The problem is: I want to identify (say, just output) all the redundant dependences (edges) in the graph. My friend and I have an idea: iterate through all edges of the graph, and for each edge, say X -> Y, remove it and check the connectivity from X to Y (for example, by running DFS/BFS); if there still exists a path (other than the removed edge), then edge X -> Y is redundant and can be physically removed, otherwise put it back. In this case, the complexity in the worst case could be O(n^2) (the DFS/BFS passes through approximately all edges every time), where n stands for the number of edges in the graph.
I wonder if there is any way to optimize this?
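For reference, a direct Python sketch of the remove-and-check approach described above (adjacency as a dict of sets; the names are mine). For a DAG this yields the unique transitive reduction regardless of the order in which edges are tried.

```python
def prune_redundant(adj):
    """For each edge X -> Y, tentatively drop it and test whether Y is
    still reachable from X; if so the edge is redundant and stays
    removed. One DFS per edge, as estimated above."""
    for x in list(adj):
        for y in list(adj[x]):
            adj[x].discard(y)          # tentatively remove X -> Y
            stack, seen = [x], {x}
            found = False
            while stack and not found:
                for nxt in adj.get(stack.pop(), ()):
                    if nxt == y:
                        found = True   # another X -> ... -> Y path exists
                        break
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            if not found:
                adj[x].add(y)          # no alternative path: keep the edge
    return adj
```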
Have you heard of transitive reduction? From Wikipedia:
A transitive reduction of a directed graph is a graph with as few edges as possible that has the same reachability relation as the given graph. Equivalently, the given graph and its transitive reduction should have the same transitive closure as each other, and its transitive reduction should have as few edges as possible among all graphs with this property. Transitive reductions were introduced by Aho, Garey & Ullman (1972), who provided tight bounds on the computational complexity of constructing them.
You can get details from Transitive Reduction. In a directed acyclic graph with n vertices and m edges, the transitive reduction can be found in time O(nm).
Topological sorting using DFS with a stack can yield the result in linear time. It can be done by starting from a vertex, marking it visited, then recursively performing the topological sort on all its unvisited adjacent vertices; once all of them are explored, push the vertex onto the stack.
Then simply print from the stack; it will generate the result in linear time. For more, you can refer to the algorithm explained at the following link.
Topological sorting
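A compact sketch of that DFS-with-a-stack formulation, assuming the DAG is given as an adjacency dict of sets:

```python
def topo_sort(adj):
    """DFS-based topological sort of a DAG (node -> set of successors).
    Each vertex is pushed only after all of its descendants, so reading
    the stack top-down yields a valid order. O(V + E)."""
    visited, stack = set(), []

    def visit(u):
        visited.add(u)
        for v in adj.get(u, ()):
            if v not in visited:
                visit(v)
        stack.append(u)                # all descendants already pushed

    for u in adj:
        if u not in visited:
            visit(u)
    return stack[::-1]                 # reversed stack: sources first

print(topo_sort({"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}))
```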

Sum of Vertices in Induced Graph - Dynamic Programming

This is a homework question so I'll be glad to get a hint.
I have a graph G, where each vertex v has a weight w(v).
S(G) is the sum of the weights of all the vertices in the graph.
I need to find an algorithm that determines whether there is a group of vertices A such that G[A] (the subgraph of G induced by A) is a tree and S(G[A]) = S(G[V\A]).
I know that I should go over all vertices, sum their weights, and then try to find a tree that reaches half of that sum, but I'm not sure how exactly. I'm pretty sure it involves dynamic programming.
Thank you very much,
Yaron.
This is not really a dynamic programming problem, it is a search problem, the key being that you are trying to find a tree.
If you think about it, you already know an algorithm or two that will tell you the minimum spanning tree. By the same logic, you can make a maximum spanning tree. For example, if you find the maximum spanning tree and the sum of its weights is less than 50% (or whatever the target value is), then you know the problem is impossible.
So, following this logic, you can go along as though you were making a spanning tree and reject any path that goes over the target amount. This strategy is known as "branch and bound". Let's imagine how we could do this with Kruskal's algorithm:
(1) you will have a set of trees; start with each vertex as a separate "tree"
(2) maintain a queue of edges you have not used yet, sorted from least to greatest
(3) maintain a stack of edges that you have used
(4) look for an edge that (a) connects two different trees and (b) keeps the sum of the two trees below the target value (or exactly equal to it, i.e. a solution)
(4a) if no such edge exists, then pop a value from the stack (remove the edge and separate the trees) and try the next value in the queue
(4b) if such an edge does exist, then add the edge (combine two of the trees), push it onto the stack and go back to step 4
Obviously there are different ways to do the same process. For example, you could use a variant of Prim's algorithm as well.
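Since this is homework, here is only a rough illustration of the search space: a vertex-growing backtracking sketch rather than the forest-and-stack formulation above, with the same bound (never let the growing tree overshoot the target; weights are assumed non-negative). The invariant that keeps G[A] an induced tree is that a vertex may join only if it touches exactly one vertex already in the tree.

```python
def half_weight_induced_tree(n, edges, weight):
    """Search for a vertex set A whose induced subgraph is a tree and
    whose weight is exactly half the total. Exponential backtracking
    sketch, not an efficient solution. Vertices are 0..n-1."""
    total = sum(weight)
    if total % 2:
        return None                     # odd total: no exact half exists
    target = total // 2
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def grow(tree, s):
        if s == target:
            return set(tree)
        for v in range(n):
            # v may join only if it has exactly one neighbour in the
            # tree; two or more would close an induced cycle.
            if v not in tree and sum(u in tree for u in adj[v]) == 1 \
                    and s + weight[v] <= target:    # bound: never overshoot
                tree.add(v)
                found = grow(tree, s + weight[v])
                if found:
                    return found
                tree.remove(v)          # backtrack
        return None

    for start in range(n):              # a single vertex is a trivial tree
        if weight[start] <= target:
            found = grow({start}, weight[start])
            if found:
                return found
    return None
```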

Is this minimum spanning tree algorithm correct?

The minimum spanning tree problem is to take a connected weighted graph and find the subset of its edges with the lowest total weight while keeping the graph connected (and as a consequence resulting in an acyclic graph).
The algorithm I am considering is:
Find all cycles.
Remove the largest edge from each cycle.
The impetus for this version is an environment that is restricted to "rule satisfaction" without any iterative constructs. It might also be applicable to insanely parallel hardware (i.e. a system where you expect to have several times more degrees of parallelism than cycles).
Edits:
The above is done in a stateless manner (all edges that are not the largest edge in any cycle are kept; all others are removed).
What happens if two cycles overlap? Which one has its longest edge removed first? Does it matter if the longest edge of each is shared between the two cycles or not?
For example:
V = { a, b, c, d }
E = { (a,b,1), (b,c,2), (c,a,4), (b,d,9), (d,a,3) }
There's an a -> b -> c -> a cycle, and an a -> b -> d -> a cycle.
@shrughes.blogspot.com:
I don't know about removing all but two - I've been sketching out various runs of the algorithm and, assuming that parallel runs may remove an edge more than once, I can't find a situation where I'm left without a spanning tree. Whether or not it's minimal, I don't know.
For this to work, you'd have to detail how you would want to find all cycles, apparently without any iterative constructs, because that is a non-trivial task. I'm not sure that's possible. If you really want to find a MST algorithm that doesn't use iterative constructs, take a look at Prim's or Kruskal's algorithm and see if you could modify those to suit your needs.
Also, is recursion barred in this theoretical architecture? If so, it might actually be impossible to find a MST on a graph, because you'd have no means whatsoever of inspecting every vertex/edge on the graph.
I dunno if it works, but no matter what, your algorithm is not even worth implementing. Finding all cycles will be the freaking huge bottleneck that kills it. Also, doing that without iterations is impossible. Why don't you implement some standard algorithm, let's say Prim's?
Your algorithm isn't quite clearly defined. If you have a complete graph, your algorithm would seem to entail, in the first step, removing all but the two minimum elements. Also, listing all the cycles in a graph can take exponential time.
Elaboration:
In a graph with n nodes and an edge between every pair of nodes, there are, if I have my math right, n!/(2k(n-k)!) cycles of size k, if you're counting a cycle as some subgraph of k nodes and k edges with each node having degree 2.
@Tynan The system can be described (somewhat oversimplified) as a system of rules describing categorizations. "Things are in category A if they are in B but not in C", "Nodes connected to nodes in Z are also in Z", "Every category in M is connected to a node N and has 'child' categories, also in M, for every node connected to N". It's slightly more complicated than this. (I have shown that by creating unstable rules you can model a Turing machine, but that's beside the point.) It can't explicitly define iteration or recursion but can operate on recursive data with rules like the 2nd and 3rd ones.
@Marcin, assume that there is an unlimited number of processors. It is trivial to show that the program can be run in O(n^2) for n being the longest cycle. With better data structures, this can be reduced to O(n * (cost of a set lookup)). I can envision hardware (quantum computers?) that can evaluate all cycles in constant time, giving an O(1) solution to the MST problem.
The Reverse-delete algorithm seems to provide a partial proof of correctness (that the proposed algorithm will not produce a non-minimal spanning tree); this is derived by arguing that my algorithm will remove every edge that the Reverse-delete algorithm will. However, I'm not sure how to show that my algorithm won't delete more than that algorithm.
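For comparison, a minimal sketch of the Reverse-delete algorithm itself (sequential, unlike the proposal above; assumes a connected graph with distinct edge tuples): delete edges in decreasing weight order, skipping any whose removal would disconnect the graph.

```python
def reverse_delete(n, edges):
    """Reverse-delete MST: edges given as (weight, u, v), vertices
    0..n-1. An edge is kept only if removing it would disconnect the
    graph; everything else is dropped, heaviest first."""
    kept = set(edges)

    def connected_without(skip):
        adj = [[] for _ in range(n)]
        for w, u, v in kept:
            if (w, u, v) != skip:
                adj[u].append(v)
                adj[v].append(u)
        seen, stack = {0}, [0]          # flood fill from vertex 0
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen) == n

    for e in sorted(edges, reverse=True):   # heaviest first
        if connected_without(e):            # e lies on a cycle: safe to drop
            kept.discard(e)
    return kept
```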
Hhmm....
OK, this is an attempt to finish the proof of correctness. By analogy to the Reverse-delete algorithm, we know that enough edges will be removed. What remains is to show that not too many edges will be removed.
Removing too many edges can be described as removing all the edges between the sides of a binary partition of the graph nodes. However, only edges in a cycle are ever removed; therefore, for all edges between the partitions to be removed, there needs to be a return path to complete the cycle. If we only consider edges between the partitions, then the algorithm can at most remove the larger of each pair of edges, and this can never remove the smallest bridging edge. Therefore, for any arbitrary binary partitioning, the algorithm can't sever all links between the sides.
What remains is to show that this extends to partitions with more than two sides.
