Topological sort variant algorithm

I have a set of data on which I need to perform a topological sort, with a few assumptions and constraints, and I was wondering if anyone knew an existing, efficient algorithm that would be appropriate for this.
The data relationships are known to form a DAG (so no cycles to worry about).
An edge from A to B indicates that A depends on B, so B must appear before A in the topological ordering.
The graph is not necessarily connected; that is, for any two nodes N and M there may be no way to get from N to M by following edges (even if you ignore edge direction).
The data relationships are singly linked. This means that when there is an edge directed from A to B, only the A node contains information about the existence of the edge.
The problem can be formulated as follows:
Given a set of nodes S in graph G which may or may not have incoming edges, find a topological ordering of the subgraph G' consisting of all of the nodes in G that are reachable from any node in set S (obeying edge direction).
This confounds the usual approaches to topological sorting because they require that the nodes in set S do not have any incoming edges, which is something that is not true in my case. The pathological case is:
A --> B --> D
|     ^     ^
|     |     |
\---> C ----/
Where S = {B, C}. An appropriate ordering would be D, B, C, but if a normal topological sort algorithm happened to consider B before C, it would end up with C, D, B, which is completely wrong. (Note that A does not appear in the resulting ordering since it is not reachable from S; it's there to give an example where all of the nodes in S might have incoming edges)
Now, I have to imagine that this is a long-solved problem, since this is essentially what programs like apt and yum have to do when you specify multiple packages in one install command. However, when I search for keyphrases like "dependency resolution algorithm", I get results describing normal topological sorting, which does not handle this particular case.
I can think of a couple of ways to do this, but none of them seem particularly elegant. I was wondering if anyone had some pointers to an appropriate algorithm, preferably one that can operate in a single pass over the data.

I don't think you'll find an algorithm that can do this with a single pass over the data. I would perform a breadth-first search, starting with the nodes in S, and then do a topological sort on the resulting subgraph.
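A sketch of that two-pass approach in Python, assuming the dependencies are stored as a dict from each node to the nodes it depends on (illustrative names; Kahn's algorithm does the sorting, and the result is reversed so that dependencies come first, matching the A-depends-on-B convention above):

```python
from collections import deque

def topo_sort_reachable(edges, sources):
    """edges: dict node -> list of dependencies (A -> B means A depends on B).
    Returns an order in which dependencies appear before dependents,
    restricted to the nodes reachable from `sources`."""
    # Pass 1: BFS from the source set to collect the reachable subgraph.
    reachable = set(sources)
    queue = deque(sources)
    while queue:
        n = queue.popleft()
        for m in edges.get(n, []):
            if m not in reachable:
                reachable.add(m)
                queue.append(m)

    # Pass 2: Kahn's algorithm, counting only edges between reachable nodes.
    indegree = {n: 0 for n in reachable}
    for n in reachable:
        for m in edges.get(n, []):
            indegree[m] += 1
    ready = deque(n for n in reachable if indegree[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in edges.get(n, []):
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    order.reverse()  # dependencies first
    return order

# The pathological case above: A is not reachable from S = {B, C}
edges = {"A": ["B", "C"], "B": ["D"], "C": ["B", "D"]}
print(topo_sort_reachable(edges, ["B", "C"]))  # ['D', 'B', 'C']
```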

I think you can do a topological sort of the entire graph and then select only the nodes that are reachable from the set. You can run depth-first searches from the nodes in the set, in the order produced by the sort, skipping any node that was already visited.
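A minimal sketch of that selection step, assuming a topological order of the full graph has already been computed (names and sample data are illustrative):

```python
def filter_reachable(order, edges, sources):
    """Keep only the nodes of a full-graph topological order that are
    reachable from `sources` (a single DFS marks the reachable set)."""
    seen = set()
    stack = list(sources)
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(edges.get(n, []))
    return [n for n in order if n in seen]

edges = {"A": ["B", "C"], "B": ["D"], "C": ["B", "D"]}
# a dependencies-first topological order of the whole graph:
full_order = ["D", "B", "C", "A"]
print(filter_reachable(full_order, edges, ["B", "C"]))  # ['D', 'B', 'C']
```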

Related

Efficient Graph Traversal for Node Editor Evaluation

I have a directed acyclic graph created by users, where each node (vertex) of the graph represents an operation to perform on some data. The outputs of a node depend on its inputs (obviously), and that input is provided by its parents. The outputs are then passed on to its children. Cycles are guaranteed to not be present, so can be ignored.
This graph works on the same principle as the Shader Editor in Blender. Each node performs some operation on its input, and this operation can be arbitrarily expensive. For this reason, I only want to evaluate these operations when strictly required.
When a node is updated, via user input or otherwise, I need to reevaluate every node which depends on the output of the updated node. However, given that I can't justify evaluating the same node multiple times, I need a way to determine the correct order to update the nodes. A basic breadth-first traversal doesn't solve the problem. To see why, consider this graph:
A traditional breadth-first traversal would result in D being evaluated prior to B, despite D depending on B.
I've tried doing a breadth-first traversal in reverse (that is, starting with the O1 and O2 nodes, and traversing up the graph), but I seem to run into the same problem. A reversed breadth-first traversal will visit D before B, thus I2 before A, resulting in I2 being ordered after A, despite A depending on I2.
I'm sure I'm missing something relatively simple here, and I feel as though the reverse traversal is key, but I can't seem to wrap my head around it and get all the pieces to fit. I suppose one potential solution is to use the reverse traversal as intended, but rather than avoiding visiting each node more than once, just visiting each node each time it comes up, ensuring that it has a definitely correct ordering. But visiting each node multiple times and the exponential scaling that comes with that is a very unattractive solution.
Is there a well-known efficient algorithm for this type of problem?
Yes, there is a well known efficient algorithm. It's topological sorting.
Create a dictionary mapping each node to its in-degree (the number of parents, i.e. incoming edges); call it indegree_dic. Also keep a set S of the nodes whose in-degree is zero.
Taken from the Wikipedia page with some modification:
L ← Empty list that will contain the nodes sorted topologically
S ← Set of all nodes with no incoming edge that haven't been added to L yet
while S is not empty do
    remove a node n from S
    add n to L
    for each child node m of n do
        decrement m's indegree
        if indegree_dic[m] equals zero then
            delete m from indegree_dic
            insert m into S
if indegree_dic has length > 0 then
    return error (graph is not a DAG)
else
    return L (a topologically sorted order)
This sort is not unique. I mention that because it has some impact on your algorithm.
Now, whenever a change happens to any of the nodes, you can safely avoid recalculating the nodes that come before the changed node in your topologically sorted list, but you need to recalculate the nodes that come after it. You can be sure all of a node's parents are processed before it if you follow the sorted list in your calculations.
This algorithm is not optimal, as there can be nodes after the changed node that are not descendants of it. Like in the following scenario:
  A
 / \
B   C
One correct topological sort would be [A, B, C]. Now, suppose B changes. You skip A because nothing has changed for it, but recalculate C because it comes after B. But you actually don't need to, because B has no effect on C whatsoever.
If the impact of this isn't big, you could use this algorithm and keep the implementation easier and less prone to bugs. But if efficiency is key, here are some ideas that may help:
You can redo the topological sort each time and take the changed node into account: when choosing nodes from S in the above algorithm, pick every other node you can before picking the changed node. In other words, take the changed node from S only when S has length 1. This guarantees that every node not below the changed node in the hierarchy is processed before it. This approach helps when the sorting is much cheaper than processing the nodes.
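That first idea can be sketched as Kahn's algorithm in which the changed node is deferred for as long as the ready set offers any alternative (a rough sketch; the names and the children/in-degree inputs are illustrative):

```python
def order_for_change(children, indegree, changed):
    """Kahn's algorithm, but the changed node is drawn from the ready
    set only when it is the sole candidate, so every node emitted
    before it is known not to depend on it."""
    indeg = dict(indegree)  # don't mutate the caller's dict
    ready = {n for n, d in indeg.items() if d == 0}
    order = []
    while ready:
        others = ready - {changed}
        n = others.pop() if others else changed
        ready.discard(n)
        order.append(n)
        for m in children.get(n, []):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.add(m)
    return order

# The A / B C example above: B changes, C is unrelated to B
children = {"A": ["B", "C"]}
indegree = {"A": 0, "B": 1, "C": 1}
order = order_for_change(children, indegree, "B")
print(order)                      # ['A', 'C', 'B']
print(order[order.index("B"):])   # only B needs recomputing
```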
Another approach, which I'm not entirely sure is correct, is to look after the changed node in the topological sorted list and start processing only when you reach the first child of the changed node.
Another way builds on idea 1 but helps if you can do some pre-processing. You can create a topological sort for each case of a single node changing, putting the changed node as late in the ordering as possible. Save all these orderings in a node-to-ordering dictionary, and pick the ordering based on which node changed.

Edge direction in a dependency graph for topological sort?

Wikipedia explains a dependency graph in a very intuitive way (IMO), citing that an edge goes from a => b when a depends on b. In other words, we can find any given node's direct dependencies (if any) immediately by looking at its neighbors, available in its adjacency list.
This seems to be a sensible way to realize dependencies; it allows us to perform topological sort basically as easily as doing a Depth-First-Traversal (DFS from every node in the graph). If the nodes represent "tasks", then we can execute/visit a task only when all of its transitive dependencies have been executed/visited. Leaf nodes are the first to be visited, and so on.
The Wikipedia page for topological sorting explains the definition as:
In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.
This is the opposite of what I'd expect given a "dependency graph". We just explained that if a depends on b, there is a directed edge a => b, and we must visit/execute b before a. However, with the graph explained above, since we execute/visit task u before v, it stands to reason that v depends on u. So if I'm not mistaken, it seems that the input graph that Wiki's topo sorting page expects is a "dependency graph" with its edges reversed. The algorithms on the page corroborate this; for example, their DFS approach would start at a node n, recurse to the nodes that depend on n (not n's dependencies), and then prepend n to the head of a list so it appears earlier than its dependents. The result is the same as the DFT I explained. To be clear, I'm not saying anything on either page is wrong; it just demonstrates several ways of doing the same thing.
It does feel weird though that Wiki has this definition of a dependency graph, yet seems to use its inverse on the topological sort page, by recursing through reverse dependencies, and essentially reversing the output list.
Question
My only question is: is there some glaringly obvious reason that I'm missing, that the expected graph on the topological sorting page is basically the opposite of the "dependency graph" dfn? It feels unintuitive that we traverse from n to n's dependents, and effectively reverse the output by recording to something like a stack.
More generally, the graph that the topological sorting page seems to expect doesn't seem to be a good dependency graph anyways. If we considered this graph to be the canonical "dependency graph", then in order to find n's dependencies, we'd have to iterate through the entire graph asking "Does this node point to n?", which seems odd.
A topological sort produces a total ordering that is consistent with a partial ordering.
A partial ordering is the same thing as a DAG.
Very often we topologically sort items according to a dependency graph...
But the partial ordering we usually use is the "must come before" graph, not the "depends on" graph. This is the same graph, but with the edges reversed.
The two things I think you're missing are:
1) The graph is an interpretation of the data structure. The data structure is not the graph. In most real-life situations graph algorithms are applied over data structures that do not literally or explicitly represent the graph itself. In this case where there's a pointer from a to b, the DAG we're sorting has an edge from b to a.
2) Reversing the edges in the DAG just means reversing the final topological ordering, or starting from the other end. It hardly matters, so in colloquial speech, it's natural to talk about topologically sorting the dependency graph instead of topologically sorting the edge-reversed dependency graph. Sorting in descending order is still sorting, and reverse topological sorting is still topological sorting.
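Python's standard library illustrates this nicely, if graphlib (Python 3.9+) is available: TopologicalSorter takes a mapping of each node to its predecessors, which coincides with the "depends on" adjacency list, and emits dependencies first. A small sketch with made-up data:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# "depends on" edges: a -> b means a depends on b
deps = {"a": {"b", "c"}, "b": {"d"}, "c": {"d"}, "d": set()}

# TopologicalSorter interprets the mapping as node -> predecessors,
# which is exactly the "depends on" adjacency above, so dependencies
# come out first without any edge reversal.
order = list(TopologicalSorter(deps).static_order())
print(order)  # 'd' first, 'a' last

# Reversing every edge would simply reverse a valid ordering, which is
# why it hardly matters which convention a given page uses.
```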

Reason why all DAG have more than one topological sort order

I am wondering as to why all Directed Acyclic Graph have more than one topological sort order.
I have searched on Google, and most results just breeze past the fact that every DAG has at least one topological sort. But I am thinking along the lines of how a singly linked list is implemented:
A -> B -> C -> D
This seems to mean that there is only one way the toposort can technically go: D, C, B, A...
However, it may be that this is not a directed acyclic graph, but I am not sure how to refute that, since it is directed (A to B, etc.), acyclic (there are no cycles back to any start), and a graph (it is technically a tree).
Thank you so much for any clarifications provided!
It's not true that all DAGs have more than one topological sort. Remember that we can construct a topological sort by removing vertices with no incoming edges in order.
Consider a DAG that contains a path visiting all of its vertices (note that this path does not form a cycle, otherwise it wouldn't be a DAG). We can start by removing a vertex with no incoming edge and repeat. We'll find that the resulting topological sort has an edge between each consecutive pair of vertices. If we wanted to form a different topological sort, we would have had to start by removing some other vertex with no incoming edge, but that would mean there are at least two vertices with no incoming edges, in which case no single path starting from one vertex could visit all the others.
Since we started with a DAG having a path connecting all the vertices, we are met with a contradiction. Hence, it is proven that a DAG with a path connecting all the vertices will have a unique topological sort.
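That criterion is easy to check mechanically with Kahn's algorithm: the order is unique exactly when the ready set never holds more than one node at a time. A sketch, assuming edges are stored parent to child (the names are illustrative):

```python
def unique_topo_order(children, nodes):
    """Return the topological order if it is unique, else None.
    A DAG's order is unique iff at every step exactly one node has
    in-degree zero, i.e. a path runs through all the vertices."""
    indeg = {n: 0 for n in nodes}
    for n in nodes:
        for m in children.get(n, []):
            indeg[m] += 1
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        if len(ready) != 1:
            return None  # a tie means at least two valid orders exist
        n = ready.pop()
        order.append(n)
        for m in children.get(n, []):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order if len(order) == len(nodes) else None

# a simple chain has exactly one order...
print(unique_topo_order({"A": ["B"], "B": ["C"], "C": ["D"]}, ["A", "B", "C", "D"]))
# ...while a fork does not
print(unique_topo_order({"A": ["B", "C"]}, ["A", "B", "C"]))  # None
```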

Could Kruskal’s algorithm be implemented in this way instead of using a disjoint-set forest?

I am studying Kruskal's MST from this geeksforgeeks article. The steps given are:
Sort all the edges in non-decreasing order of their weight.
Pick the smallest edge. Check if it forms a cycle with the spanning tree formed so far. If cycle is not formed, include this edge. Else, discard it.
Repeat step (2) until there are (V-1) edges in the spanning tree.
I really don't feel any need to use a disjoint set. Instead, to check for a cycle, we can just store vertices in a visited array and mark them as true whenever an edge is selected. While looping through the edges, if we find one whose vertices are both already in the visited array, we ignore that edge.
In other words, instead of storing a disjoint-set forest, can’t we just store an array of bits indicating which vertices have been linked to another edge in some previous step?
The approach you’re describing will not work properly in all cases. As an example, consider this line graph:
A - - B - - C - - D
Let’s assume A-B has weight 1, C-D has weight 2, and B - C has weight 3. What will Kruskal’s algorithm do here? First, it’ll add in A - B, then C - D, and then B - C.
Now imagine what your implementation will do. When we add A - B, you’ll mark A and B as having been visited. When we then add C - D, you’ll mark C and D as having been visited. But then when we try to add B - C, since both B and C are visited, you’ll decide not to add the edge, leaving a result that isn’t connected.
The issue here is that when building up an MST you may add edges linking nodes that have already been linked to other nodes in the past. The criterion for adding an edge is therefore less “have these nodes been linked before?” and more “is there already a path between these nodes?” That’s where the disjoint-set forest comes in.
It’s great that you’re poking and prodding conventional implementations and trying to find ways to improve them. You’ll learn a lot about those algorithms if you do! In this case, it just so happens that what you’re proposing doesn’t quite work, and seeing why it doesn’t work helps shed light on why the existing approach is what it is.
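For reference, here is a compact sketch of Kruskal's with a disjoint-set forest (union by size plus path compression), run on the line graph from the example with A..D mapped to indices 0..3; the function name and edge encoding are illustrative:

```python
def kruskal(n, edges):
    """Kruskal's MST with a disjoint-set forest.
    edges: list of (weight, u, v) tuples with 0 <= u, v < n."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # already a path between u and v: edge would form a cycle
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru  # union by size
        size[ru] += size[rv]
        mst.append((u, v, w))
    return mst

# The line graph from the answer: A-B (weight 1), C-D (2), B-C (3)
edges = [(1, 0, 1), (2, 2, 3), (3, 1, 2)]
print(kruskal(4, edges))  # all three edges are kept, including B-C
```

Note that B-C is accepted even though all four vertices were "visited" by earlier edges, because find() reports B and C in different components.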
I really don't feel any need to use a disjoint set. Instead, to check for a cycle, we can just store vertices in a visited array and mark them as true whenever an edge is selected. While looping through the edges, if we find one whose vertices are both already in the visited array, we ignore that edge.
Yes, of course you can do that. The point of using a disjoint set in this algorithm is performance: a suitable disjoint-set implementation yields better asymptotic performance than a visited list can.

What algorithm can I apply to this DAG?

I have a DAG representing a list of properties. These properties are such that if a>b, then a has a directed edge to b. It is transitive as well, so that if a>b and b>c, then a has a directed edge to c.
However, the directed edge from a to c is superfluous because a has a directed edge to b and b has a directed edge to c. How can I prune all these superfluous edges? I was thinking of using a minimum spanning tree algorithm, but I'm not really sure which algorithm is appropriate in this situation.
I suppose I could do a depth first search from each node and all its outgoing edges and compare if it can reach certain nodes without using certain edges, but this seems horribly inefficient and slow.
After the algorithm is complete, the output would be a linear list of all the nodes in an order consistent with the graph. So if a has three directed edges, to b, c, and d, and b and c each also have a directed edge to d, the output could be either abcd or acbd.
This is called the transitive reduction problem. Formally speaking, you are looking for a minimal (fewest edges) directed graph, the transitive closure of which is equal to the transitive closure of the input graph. (The diagram on the above Wikipedia link makes it clear.)
Apparently there exists an efficient algorithm for solving this problem that takes the same time as for producing a transitive closure (i.e. the more common inverse problem of adding transitive links instead of removing them), however the link to the 1972 paper by Aho, Garey, and Ullman costs $25 to download, and some quick googling didn't turn up any nice descriptions.
EDIT: Scott Cotton's graphlib contains a Java implementation! This Java library looks to be very well organised.
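A naive transitive reduction can also be written straight from the definition: drop an edge u -> v whenever v is still reachable from u without using that edge. For a DAG this produces the unique reduction, though far less efficiently than the Aho, Garey, and Ullman approach; a rough Python sketch:

```python
def transitive_reduction(adj):
    """Naive transitive reduction of a DAG: drop edge u -> v whenever v
    remains reachable from u through some other path. O(V * E)
    reachability checks, so only suitable for small graphs."""
    def reachable(src, dst):
        # DFS from src, skipping the direct edge src -> dst
        stack = [src]
        seen = set()
        while stack:
            n = stack.pop()
            for m in adj.get(n, []):
                if n == src and m == dst:
                    continue  # ignore the edge under test
                if m == dst:
                    return True
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return False

    return {u: [v for v in vs if not reachable(u, v)]
            for u, vs in adj.items()}

# a > b > c, plus the superfluous edge a -> c implied by transitivity
adj = {"a": ["b", "c"], "b": ["c"], "c": []}
print(transitive_reduction(adj))  # {'a': ['b'], 'b': ['c'], 'c': []}
```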
Actually, after looking around a little more, I think a topological sort is what I'm really after here.
Given n nodes with directed edges:
1. Starting from any node M, loop over all of its child edges and select the biggest child (call it N), removing the other edges; the complexity should be O(n). If no such N exists (M has no child edges), go to step 3.
2. Start from N and repeat step 1.
3. Start from node M and select the smallest parent node (call it T), removing the other edges.
4. Start from T and repeat step 3...
Really it's just an ordering algorithm, and the total complexity should be O(0.5n^2).
One problem is that if we want to loop over a node's parent nodes, we need extra memory to record edges so we can trace back from child to parent. This can be improved in step 3, where we choose a node from the remaining nodes bigger than M; this means we need to keep a list of nodes so we know which are left.
