I have a large (100,000+ nodes) Directed Acyclic Graph (DAG) and would like to run a "visitor" type function on each node in order, where order is defined by the arrows in the graph, i.e. all parents of a node are guaranteed to be visited before the node itself.
If two nodes do not refer to each other directly or indirectly, then I don't care which order they are visited in.
What's the most efficient algorithm to do this?
You would have to perform a topological sort on the nodes, and visit the nodes in the resulting order.
The complexity of such an algorithm is O(|V|+|E|), which is quite good. You have to visit every node anyway, so a faster algorithm would have to avoid looking at some of the edges, which would be dangerous: a single ignored edge could invalidate the order completely.
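For illustration, here is a minimal sketch of a DFS-based topological visit in Python; the graph is assumed to be a dict mapping each node to a list of its children, and the DFS is iterative to avoid recursion limits on 100,000+ nodes:

def topological_order(children):
    # children: dict mapping each node to a list of its child nodes (assumed format).
    # Returns a list in which every parent appears before all of its children.
    visited = set()
    postorder = []
    for start in children:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(children.get(start, ())))]
        while stack:
            node, it = stack[-1]
            child = next(it, None)
            if child is None:
                postorder.append(node)   # all descendants are finished
                stack.pop()
            elif child not in visited:
                visited.add(child)
                stack.append((child, iter(children.get(child, ()))))
    postorder.reverse()                  # reverse post-order is a topological order
    return postorder

Calling your visitor on each node of topological_order(graph), in order, then satisfies the parents-before-children guarantee.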
There are some answers here:
Good graph traversal algorithm
and here:
http://en.wikipedia.org/wiki/Topological_sorting
In general, after visiting a node, you should visit its related nodes, but only the nodes that are not already visited. In order to keep track of the visited nodes, you need to keep the IDs of the nodes in a set (or map), or you can mark the node as visited (somehow).
If you care about the topological order, you must first build a count of the un-traversed incoming links ("remaining links") for each node, keyed by node ID (typically: map(node-ID -> link-count)). If you haven't got that, you may need to build it with a pass over the edges, similar to the approach above. Then start by visiting a node whose remaining incoming link count is zero. For each link out of that node, decrement the remaining link count of the target node, adding the target to the set of nodes-to-visit (or just visiting it) once its count reaches zero.
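For concreteness, here is a minimal Python sketch of that remaining-links bookkeeping, assuming the graph is available as a dict from node ID to the IDs of its children and visit is your visitor function:

from collections import deque

def visit_in_dependency_order(children, visit):
    # children: dict node-ID -> list of child node-IDs (assumed input format).
    # visit: callback applied to each node; parents are always visited first.
    remaining = {node: 0 for node in children}       # node-ID -> remaining incoming links
    for node, kids in children.items():
        for child in kids:
            remaining[child] = remaining.get(child, 0) + 1
    to_visit = deque(n for n, count in remaining.items() if count == 0)
    while to_visit:
        node = to_visit.popleft()
        visit(node)
        for child in children.get(node, ()):
            remaining[child] -= 1
            if remaining[child] == 0:                # all parents visited
                to_visit.append(child)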
As mentioned in the other answers, this problem can be solved by Topological Sorting.
A very simple algorithm for that (not the most efficient):
Keep an array (or map) indegree[] where indegree[node]=number of incoming edges of node
while there is at least one node n with indegree[n] = 0:
    for each node n in nodes where indegree[n] = 0:
        visit(n)
        indegree[n] = -1  # mark n as visited
        for each node x adjacent to n:
            indegree[x] = indegree[x] - 1  # one of its parents has been visited, so one less edge coming into it
You can traverse a DAG in O(N) (without an explicit topsort) by just running your DFS from every node with zero indegree, because those are the valid "starting points". This works because the graph has no cycles: those zero-indegree nodes must exist, and the traversal from them must cover the whole graph.
Related
I have a directed acyclic graph created by users, where each node (vertex) of the graph represents an operation to perform on some data. The outputs of a node depend on its inputs (obviously), and that input is provided by its parents. The outputs are then passed on to its children. Cycles are guaranteed to not be present, so can be ignored.
This graph works on the same principle as the Shader Editor in Blender. Each node performs some operation on its input, and this operation can be arbitrarily expensive. For this reason, I only want to evaluate these operations when strictly required.
When a node is updated, via user input or otherwise, I need to reevaluate every node which depends on the output of the updated node. However, given that I can't justify evaluating the same node multiple times, I need a way to determine the correct order to update the nodes. A basic breadth-first traversal doesn't solve the problem. To see why, consider this graph:
A traditional breadth-first traversal would result in D being evaluated prior to B, despite D depending on B.
I've tried doing a breadth-first traversal in reverse (that is, starting with the O1 and O2 nodes, and traversing up the graph), but I seem to run into the same problem. A reversed breadth-first traversal will visit D before B, thus I2 before A, resulting in I2 being ordered after A, despite A depending on I2.
I'm sure I'm missing something relatively simple here, and I feel as though the reverse traversal is key, but I can't seem to wrap my head around it and get all the pieces to fit. I suppose one potential solution is to use the reverse traversal as intended, but rather than avoiding visiting each node more than once, just visit each node every time it comes up, which would guarantee a correct ordering. But visiting each node multiple times, and the exponential blow-up that can come with it, is a very unattractive solution.
Is there a well-known efficient algorithm for this type of problem?
Yes, there is a well-known efficient algorithm: topological sorting.
Create a dictionary with all nodes and their corresponding in-degrees; let's call it indegree_dic. The in-degree is the number of parents (incoming edges) of a node. Also keep a set S of the nodes whose in-degree is zero.
Taken from the Wikipedia page with some modification:
L ← Empty list that will contain the nodes sorted topologically
S ← Set of all nodes with no incoming edge that haven't been added to L yet (remove them from indegree_dic)
while S is not empty do
    remove a node n from S
    add n to L
    for each child node m of n do
        decrement indegree_dic[m]
        if indegree_dic[m] equals zero then
            delete m from indegree_dic
            insert m into S
if indegree_dic is not empty then
    return error (graph is not a DAG)
else
    return L (a topologically sorted order)
This sort is not unique. I mention that because it has some impact on your algorithm.
Now, whenever a change happens to any of the nodes, you can safely avoid recalculating the nodes that come before the changed node in your topologically sorted list, but you do need to recalculate the nodes that come after it. You can be sure that all parents are processed before their children if you follow the sorted list in your calculations.
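A rough sketch of that update step, assuming you keep the sorted list around and have some reevaluate() callback for a node (both names are placeholders):

def update_after_change(topo_order, changed_node, reevaluate):
    # topo_order: the precomputed topologically sorted list of nodes.
    # Reevaluate the changed node and every node that comes after it in the order.
    start = topo_order.index(changed_node)
    for node in topo_order[start:]:
        reevaluate(node)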
This algorithm is not optimal, as there could be nodes after the changed node that are not descendants of it, like in the following scenario:
  A
 / \
B   C
One correct topological sort would be [A, B, C]. Now, suppose B changes. You skip A because nothing has changed for it, but recalculate C because it comes after B. But you actually don't need to, because B has no effect on C whatsoever.
If the impact of this isn't big, you could use this algorithm and keep the implementation easier and less prone to bugs. But if efficiency is key, here are some ideas that may help:
You can do a topological sort each time and take into account which node has changed. When choosing nodes from S in the above algorithm, choose every other node that you can before you choose the changed node; in other words, take the changed node from S only when S has length 1. This guarantees that you process every node that isn't below the changed node in the hierarchy before the changed node itself. This approach helps when the sorting is much cheaper than processing the nodes. (A rough sketch of this idea follows these suggestions.)
Another approach, which I'm not entirely sure is correct, is to scan the topologically sorted list after the changed node and start processing only when you reach the first child of the changed node.
Another way relies on idea 1 but is helpful if you can do some pre-processing. You can create a topological sort for each case of one node being changed: when a node is changed, you try to put it in the ordering as late as possible. You save all these orderings in a node-to-ordering dictionary and, based on which node has changed, pick the corresponding ordering.
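For idea 1, here is a rough sketch, assuming you keep the children lists and in-degrees in dictionaries (names are illustrative):

def topo_order_deferring(children, indegree, changed_node):
    # children: dict node -> list of child nodes; indegree: dict node -> in-degree.
    # Produces an order in which every node not depending on changed_node
    # comes before it, so recalculation can start at changed_node's position.
    indegree = dict(indegree)                 # work on a copy
    ready = {n for n, d in indegree.items() if d == 0}
    order = []
    while ready:
        # prefer any ready node other than the changed one
        node = next((n for n in ready if n != changed_node), changed_node)
        ready.remove(node)
        order.append(node)
        for child in children.get(node, ()):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.add(child)
    return order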
In a graph with a bunch of normal nodes and a few special marked nodes, is there a common algorithm to find the closest marked node from a given starting position in the graph?
Or is the best way to do a BFS to find the marked nodes and then run Dijkstra's from each of the discovered marked nodes to see which one is the closest?
This depends on the graph, and your definition of "closest".
If you compute "closest" ignoring edge weights, or your graph has no edge weights, a simple breadth-first search (BFS) will suffice. The first node reached vía BFS is, by definition of BFS, the closest (or, if there are several closest nodes, tied for closeness). If you keep track of the number of expanded BFS levels, you can locate all closest nodes by reaching the end of the level instead of stopping as soon as you find the first marked node.
If you have edge weights, and need to use them in your computation, use Dijkstra instead. If the edges can have negative weights, Dijkstra is no longer guaranteed to be correct and you will need to use Bellman-Ford instead.
As mentioned by SaiBot, if the start node is always the same, and you will perform several queries with changing "marked" nodes, there are faster ways to do things. In particular, you can store in each node the "parent" found in a first full traversal, and the node's distance to the start node. When adding a new batch of k marked nodes, you would immediately know the closest to the start by looking at this distance for each marked node.
The fastest way would be to perform Dijkstra right away from your starting position (starting node). When "closeness" is defined as the number of edges that have to be traversed, you can just assign a weight of 1 to each edge. If precomputation is allowed, there are faster ways to do it.
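A rough Dijkstra sketch along those lines, assuming an adjacency dict of (neighbour, weight) pairs with non-negative weights; it stops at the first marked node settled, which is the closest one:

import heapq

def closest_marked_weighted(adj, start, marked):
    # adj: dict node -> list of (neighbour, weight) pairs, weights >= 0.
    # Returns (node, distance) of the nearest marked node, or None.
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry
        if node in marked:
            return node, d                # first marked node settled is the closest
        for nxt, w in adj.get(node, ()):
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None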
I'd like to construct a binary tree from a quite unusual input. The input contains:
Total number of nodes.
The integer label of the root.
A list of all edges (vertices/nodes that are connected to each other). The edges in the list are UNSORTED; there is only one rule for determining left/right children: the child in the edge that appears first in the list is always on the left. The order of child/parent in each vertex pair is also random.
I've come up with some straightforward solutions, but they require multiple searches through the list of all edges (I'd basically find the 2 edges that contain the labeled root and repeat this process for all the subtrees).
I imagine this straightforward approach would be VERY inefficient for trees with a large number of nodes, but I can't come up with anything else.
Any ideas for more efficient algorithms to solve this?
Here's an example for better visualization:
INPUT: 5 NODES, ROOT LABELED 2, LIST OF EDGES: [(1,0),(1,2),(2,3),(1,4)]
The tree would look like this:
      2
     / \
    1   3
   / \
  0   4
It is important to clarify whether the given edge list is stated to be directed or not.
If the edges are given in a directed fashion (i.e. it is stated that any edge A-B also includes the information that A is a parent of B), storing the edges in an adjacency list while recording the number of incoming edges for each vertex in an array should be sufficient. Once you go through the array of incoming-edge counts, the vertex with 0 incoming edges (i.e. no parents) should be the root. Then you can run a DFS in linear time to traverse the graph and put it in whatever data structure best fits your needs.
If the edges are stated to be undirected, the scheme changes a bit. In that case you don't have the concept of incoming and outgoing edges, and since no structure is specified for the tree (e.g. BST, etc.), you can basically consider any node with fewer than 3 edges as the root (all the leaves, plus intermediary nodes with a single child) and run a DFS as described above.
A simple solution is: "Link all the edges in the tree, and that's it!"
Start by preparing a dictionary of nodes. If the nodes for an edge's start and end points don't exist yet, create them. As the input is random in nature, set their left and right pointers to NULL initially.
You have the rule "the child in the edge that appears first in the list is always on the left", so assign the children accordingly.
Also, you already know the root of the tree, so you can iterate over the nodes you have constructed so far.
This way you can generate the tree in one pass.
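A rough Python sketch of this dictionary approach, using the rules from the question (the Node class and build_tree name are just illustrative):

from collections import deque

class Node:
    def __init__(self, label):
        self.label = label
        self.left = None
        self.right = None

def build_tree(num_nodes, root_label, edges):
    # edges: list of (u, v) pairs; order of appearance decides left before right.
    nodes = {}                                 # label -> Node, created on demand
    neighbours = {}                            # label -> neighbours in order of appearance
    for u, v in edges:
        for label in (u, v):
            nodes.setdefault(label, Node(label))
            neighbours.setdefault(label, [])
        neighbours[u].append(v)
        neighbours[v].append(u)
    # Walk down from the known root; the parent is skipped, the remaining
    # neighbours are the children, earliest edge first, so it becomes the left child.
    queue = deque([(root_label, None)])
    while queue:
        label, parent = queue.popleft()
        children = [n for n in neighbours.get(label, []) if n != parent]
        node = nodes[label]
        if len(children) > 0:
            node.left = nodes[children[0]]
            queue.append((children[0], label))
        if len(children) > 1:                  # extra neighbours beyond two are ignored
            node.right = nodes[children[1]]
            queue.append((children[1], label))
    return nodes[root_label]

Running it on the example input:

root = build_tree(5, 2, [(1, 0), (1, 2), (2, 3), (1, 4)])
# root is node 2, root.left is 1, root.right is 3,
# root.left.left is 0 and root.left.right is 4, matching the tree above.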
Hope this helps!
Can someone provide me with information on how to check whether the edges of a graph form a loop or not?
Any information would be highly helpful.
Many thanks in advance.
The Kruskal algorithm (which you tagged the question with) uses a disjoint-set data structure, initialized with a singleton set for each vertex. Then, for each edge, the two sets that the edge's vertices belong to are merged. If the two vertices are already in the same set, you've found a loop. If you remove the edge every time that happens, you will get a spanning tree. If you also sort the edges in order of ascending weight, it will be a minimum spanning tree.
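A minimal union-find sketch of that check, assuming an undirected graph with vertices numbered 0..n-1 and edges given as (u, v) pairs:

def has_cycle(num_vertices, edges):
    # Union-find with path compression; vertices are 0..num_vertices-1.
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return True                     # u and v already in the same set: loop found
        parent[ru] = rv                     # merge the two sets
    return False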
If you only need to know whether a graph contains loops or not, use something simpler such as DFS: if any node has an adjacent node (other than its parent) which was already visited, you've found a cycle.
Do a complete DFS on the graph. Maintain two boolean variables, 'visited' and 'completed' for each node in the graph. 'visited' to indicate whether the vertex has been visited or not and 'completed' to indicate whether the DFS starting from that particular node has completed or not. If while doing DFS you hit a node which has already been visited but its DFS has not yet completed, then there exists a cycle in the graph.
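A small sketch of that visited/completed bookkeeping, assuming a directed graph given as an adjacency dict (recursive, so very deep graphs may need an iterative version):

def has_cycle_directed(adj):
    # adj: dict node -> list of successor nodes.
    visited = set()      # DFS has started on this node
    completed = set()    # DFS starting at this node has finished

    def dfs(node):
        visited.add(node)
        for nxt in adj.get(node, ()):
            if nxt not in visited:
                if dfs(nxt):
                    return True
            elif nxt not in completed:
                return True          # reached a node whose DFS hasn't completed: cycle
        completed.add(node)
        return False

    return any(dfs(node) for node in list(adj) if node not in visited)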
The idea is to use a fast union-find data structure and check that the edge to be added does not connect two vertices that are already in the same cluster.
Union-Find Datastructures
I have a directed cyclic graph consisting of nodes a, b, c, d, e, f, g, where every node is connected to every other node. The edges may be unidirectional or bidirectional. I need to print out a valid order, for example f->a->c->b->e->d->g, such that I can reach the end node from the start node. Note that all the nodes must be present in the output list.
Also note that there may be cycles in the graph.
What I came up with:
Basically, first we can try to find a start node: a node such that there is no incoming edge to it (there could be at most one such node). I may find a start node or may not. I will also do some preprocessing to find the total number of nodes (let's call it n). Now I will start a DFS from the start node, marking nodes as visited when I reach them and counting how many nodes I have visited. If I can reach n nodes by this method, I am done. If I hit a node from which there are no outgoing edges to any unvisited node, I have hit a dead end, so I mark that node as unvisited again, step back to its previous node and try a different route.
This was the case when I find a start node. If I don't find a start node, I will just have to try this with various nodes.
I have no idea if I am even close to the solution. Can anyone help me in this regard?
In my opinion, if there is no incoming edge to a node, that node is a start node. You can traverse the graph from this start node, and if it cannot reach all n nodes, then there is no solution (as you said that all the nodes must be present in the output list). This is because if you start with some other node, you won't be able to reach this start node.
The problem with your solution is that if you enter a loop you don't know if and when to exit.
A DFS in these conditions can easily become a non-polynomial task!
Let me introduce a polynomial algorithm for your problem.
It looks complicated, but I hope there's room for simplification.
Here is my suggested solution (a rough code sketch of step 1 follows the steps below).
1) For each node, construct the table of the nodes it can reach (if a can reach b and c, b can reach d, and c can reach e, then a can reach b, c, d, e, even though there is not a single path from a passing through all of them).
If no node can reach all the other ones, you're done: the path you're looking for does not exist.
2) Find loops. That's easy: if a node can reach itself, there is a loop. This should be part of the construction of the table at the previous point.
Once you have found a loop you can shrink it (and its nodes) to a representative node whose ingoing (outgoing) connections are the union of the ingoing (outgoing) connections of the nodes in the loop.
You keep reducing loops until none remain.
3) At this point you are left with an acyclic graph. If there is a path connecting all nodes, there is a single node that can reach all the others, and starting from it you can perform a depth-first search.
4) Write down the path, replacing the traversal of each representative node with a walk through its loop from the entry point to the exit point.
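As promised, a rough sketch of step 1's reachability table, assuming the graph is a dict of successor lists (this is O(|V|·(|V|+|E|)), which is fine for small graphs like the one described):

def reachability(adj):
    # adj: dict node -> list of successor nodes.
    # Returns, for each node, the set of nodes reachable from it.
    # A node that can reach itself lies on a loop (used in step 2).
    reach = {}
    for start in adj:
        seen = set()
        stack = list(adj.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj.get(node, ()))
        reach[start] = seen
    return reach

def can_reach_all(adj):
    # Nodes that can reach every other node; if this list is empty, no valid order exists.
    reach = reachability(adj)
    everything = set(adj) | {m for targets in adj.values() for m in targets}
    return [n for n, seen in reach.items() if everything - {n} <= seen]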