Is there an efficient way of determining whether a leaf node is reachable from another arbitrary node in a Directed Acyclic Graph?

Not sure if leaf node is still proper terminology since it's not really a tree (each node can have multiple children and also multiple parents), and I'm actually trying to find all the root nodes (which is really just a matter of semantics: if you reverse the direction of all the edges, they'd be leaf nodes).
Right now we're just traversing the entire graph (that's reachable from the specified node), but that's turning out to be somewhat expensive, so I'm wondering if there's a better algorithm for doing this. One thing I'm thinking is that we keep track of nodes that have been visited already (while traversing a different path) and don't recheck those.
Are there any other algorithmic optimizations?
We also thought about keeping a list of root nodes that this node is a descendant of, but it seems like maintaining such a list would be fairly expensive as well if we need to check if it changes every time a node is added, moved, or removed.
This is more than just finding a single node, but rather finding ALL nodes that are endpoints.
Also there is no master list of nodes. Each node has a list of its children and its parents. (Well, that's not completely true, but pulling millions of nodes from the DB ahead of time is prohibitively expensive and would likely cause an OutOfMemory exception.)
May or may not change possible solutions, but the graph is bottom-heavy in that there's at most a few dozen root nodes (what I'm trying to find) and some millions (possibly tens or hundreds of millions) leaf nodes (where I'm starting from).

There are a few methods that each may be faster depending on your structure, but in general what you're going to want is a traversal.
A depth first search goes through each possible route, keeping track of nodes that have already been visited. It's naturally a recursive function, because at each node you have to branch and try each of its child nodes. There's no faster method if you don't know which way to look for the object; you just have to try each way! You definitely need to keep track of where you have already been, because otherwise you would re-explore shared parts of the graph. A full traversal should take time on the order of the number of nodes plus the number of edges.
A breadth first search is similar but visits each child of the node before "moving on" and as such builds up layers of distance from the chosen root. This can be faster if the destination is expected to be close to the root node. It would be slower if it is expected to be all the way down a path, because it forces you to traverse every possible edge.
You're right about maybe keeping a list of known root nodes; the tradeoff there is that you basically have to do the search whenever you alter the graph. If you are altering the graph rarely this is acceptable, but if you alter the graph more frequently than you need to generate this information, then of course it is too costly.
It sounds like we are actually looking for a path between two arbitrary nodes, the root/leaf semantic keeps getting switched. The DepthFirstSearch (DFS) starts at one node, and then for each unvisited child, recurse. Break if you find the target node. Due to the way recursion evaluates, this will traverse all the way down the 'left' path, then enumerate nodes at this distance before ever getting to the 'right' path. This is time costly and inefficient if the target node is potentially the first child on the right. BreadthFirst walks in steps, covering all children before moving forward. Because your graph is bottom heavy like a tree, both will be approximately the same execution time.
When the graph is bottom heavy you might be interested in a reverse traversal. Start at the target node and walk upwards, because there are relatively fewer nodes in this direction. So long as the nodes in general have more children than parents, this direction will be much faster. You can also combine the approaches, stepping one up and one down, then comparing lists of nodes, and meeting somewhere in the middle. (This combination might seem the fastest if you ignore that twice as much work is done at each step.)
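A minimal Python sketch of that upward walk, applied to the original goal of collecting every root above a starting node. The names `find_roots` and `parents_of` are hypothetical; `parents_of` stands in for whatever lookup (for example a database query) returns a node's parents, so only the ancestors of the start node are ever loaded.

```python
def find_roots(start, parents_of):
    """Return every ancestor of `start` (or `start` itself) that has no parents."""
    visited = {start}          # nodes already queued, so shared ancestors are expanded once
    stack = [start]
    roots = set()
    while stack:
        node = stack.pop()
        parents = parents_of(node)
        if not parents:        # no parents: this is a root
            roots.add(node)
        for p in parents:
            if p not in visited:
                visited.add(p)
                stack.append(p)
    return roots


# Hypothetical in-memory stand-in for the real parent lookup.
PARENTS = {"leaf": ["m1", "m2"], "m1": ["r1"], "m2": ["r1", "r2"], "r1": [], "r2": []}
print(sorted(find_roots("leaf", lambda n: PARENTS[n])))
```

The work is bounded by the number of ancestors of the start node and the edges between them, not by the size of the whole graph.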
However, since you said that your graph is stored as a list of lists of children, you have no real way of traversing the graph backwards. A node does not know what its parents are. This is a problem. To fix it you have to get a node to know what its parents are by adding that data on graph update, or by creating a duplicate of the whole structure (which you have said is too large). It will need the whole structure to be rewritten, which sounds probably out of the question due to it being a large database at this point.
Just color (keep track of) visited nodes.
Sample in Python:
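def reachable(nodes, edges, start, end):
    # color[n] is True once n has been visited
    color = {}
    for n in nodes:
        color[n] = False
    q = [start]
    while q:
        n = q.pop()
        if not color[n]:
            color[n] = True
            # edges[n] is the list of nodes adjacent to n
            for adj in edges[n]:
                q.append(adj)
    return color[end]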

For a vertex x you want to compute a bit array f(x), where each bit corresponds to a root vertex Ri, and 1 (resp. 0) means "x can (resp. can't) be reached from root vertex Ri".
You could partition the graph into one "upper" set U containing all your target roots R and such that if x in U then all parents of x are in U. For example the set of all vertices at distance <=D from the closest Ri.
Keep U not too big, and precompute f for each vertex x of U.
Then, for a query vertex y: if y is in U, you already have the result. Otherwise recursively perform the query for all parents of y, caching the value f(x) for each visited vertex x (in a map for example), so you won't compute a value twice. The value of f(y) is the bitwise OR of the value of its parents.
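A rough Python sketch of that query, with f stored as an integer bitmask (bit i set means root Ri reaches the vertex). The names `root_mask`, `parents_of`, `precomputed` and `cache` are illustrative assumptions: `precomputed` plays the role of the table for U, and `parents_of` is whatever lookup returns a vertex's parents. The recursion is written with an explicit stack, on the assumption that paths may be too long for Python's recursion limit.

```python
def root_mask(y, parents_of, precomputed, cache=None):
    """Return f(y) as an int bitmask, reusing precomputed values for the upper set U."""
    cache = {} if cache is None else cache
    known = lambda v: v in precomputed or v in cache
    stack = [y]
    while stack:
        x = stack[-1]
        if known(x):
            stack.pop()
            continue
        pending = [p for p in parents_of(x) if not known(p)]
        if pending:
            stack.extend(pending)   # resolve the parents first, then revisit x
            continue
        mask = 0
        for p in parents_of(x):     # f(x) is the bitwise OR of the parents' values
            mask |= precomputed[p] if p in precomputed else cache[p]
        cache[x] = mask
        stack.pop()
    return precomputed[y] if y in precomputed else cache[y]


# Hypothetical data: two roots R0 (bit 0) and R1 (bit 1), and one more vertex u in U.
U = {"R0": 0b01, "R1": 0b10, "u": 0b11}
PARENTS = {"a": ["R0"], "b": ["u", "a"], "y": ["a", "b"]}
print(bin(root_mask("y", lambda v: PARENTS[v], U)))
```

Passing the same `cache` dictionary to successive queries lets later queries reuse values computed by earlier ones.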


Efficient Graph Traversal for Node Editor Evaluation

I have a directed acyclic graph created by users, where each node (vertex) of the graph represents an operation to perform on some data. The outputs of a node depend on its inputs (obviously), and that input is provided by its parents. The outputs are then passed on to its children. Cycles are guaranteed to not be present, so can be ignored.
This graph works on the same principle as the Shader Editor in Blender. Each node performs some operation on its input, and this operation can be arbitrarily expensive. For this reason, I only want to evaluate these operations when strictly required.
When a node is updated, via user input or otherwise, I need to reevaluate every node which depends on the output of the updated node. However, given that I can't justify evaluating the same node multiple times, I need a way to determine the correct order to update the nodes. A basic breadth-first traversal doesn't solve the problem. To see why, consider this graph:
A traditional breadth-first traversal would result in D being evaluated prior to B, despite D depending on B.
I've tried doing a breadth-first traversal in reverse (that is, starting with the O1 and O2 nodes, and traversing up the graph), but I seem to run into the same problem. A reversed breadth-first traversal will visit D before B, thus I2 before A, resulting in I2 being ordered after A, despite A depending on I2.
I'm sure I'm missing something relatively simple here, and I feel as though the reverse traversal is key, but I can't seem to wrap my head around it and get all the pieces to fit. I suppose one potential solution is to use the reverse traversal as intended, but rather than avoiding visiting each node more than once, just visiting each node each time it comes up, ensuring that it has a definitely correct ordering. But visiting each node multiple times and the exponential scaling that comes with that is a very unattractive solution.
Is there a well-known efficient algorithm for this type of problem?
Yes, there is a well known efficient algorithm. It's topological sorting.
Create a dictionary with all nodes and their corresponding in-degree; let's call it indegree_dic. The in-degree is the number of parents (incoming edges) of that node. Have a set S of the nodes with in-degree equal to zero.
Taken from the Wikipedia page with some modification:
L ← Empty list that will contain the nodes sorted topologically
S ← Set of all nodes with no incoming edge that haven't been added to L yet
while S is not empty do
    remove a node n from S
    delete n from indegree_dic
    add n to L
    for each child node m of n do
        decrement m's indegree
        if indegree_dic[m] equals zero then
            insert m into S
if indegree_dic has length > 0 then
    return error (graph is not a DAG)
return L (a topologically sorted order)
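The pseudocode above is Kahn's algorithm. A minimal Python version might look like the following, assuming the graph is given as a dictionary `children` that maps every node to the list of its children (that representation and the function name are assumptions, not something from the question).

```python
from collections import deque


def topological_sort(children):
    """Return the nodes of `children` ordered so that parents precede children."""
    indegree = {n: 0 for n in children}
    for n in children:
        for m in children[n]:
            indegree[m] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)   # the set S
    order = []                                                # the list L
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in children[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    if len(order) != len(children):
        raise ValueError("graph is not a DAG")
    return order


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(topological_sort(graph))
```

Each node and each edge is handled once, so the sort runs in time proportional to the number of nodes plus edges.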
This sort is not unique. I mention that because it has some impact on your algorithm.
Now, whenever a change happens to any of the nodes, you can safely avoid recalculating any nodes that come before the changed node in your topologically sorted list, but you need to recalculate the nodes that come after it. You can be sure that all the parents are processed before their children if you follow the sorted list in your calculation.
This algorithm is not optimal, as there could be nodes after the changed node, that are not children of that node. Like in the following scenario:
  A
 / \
B   C
One correct topological sort would be [A, B, C]. Now, suppose B changes. You skip A because nothing has changed for it, but recalculate C because it comes after B. But you actually don't need to, because B has no effect on C whatsoever.
If the impact of this isn't big, you could use this algorithm and keep the implementation easier and less prone to bugs. But if efficiency is key, here are some ideas that may help:
You can do a topological sort each time and include which node has changed as a factor. When choosing nodes from S in the above algorithm, choose every other node that you can before you choose the changed node. In other words, you choose the changed node from S only when S has length 1. This guarantees that every node that isn't below the changed node in the hierarchy is processed before it. This approach helps when the sorting is much cheaper than processing the nodes.
Another approach, which I'm not entirely sure is correct, is to look after the changed node in the topological sorted list and start processing only when you reach the first child of the changed node.
Another way relies on idea 1 but is helpful if you can do some pre-processing. You can create topological sorts for each case of one node being changed. When a node is changed, you try to put it in the ordering as late as possible. You save all these ordering in a node to ordering dictionary and based on which node has changed you choose that ordering.
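The first idea (deferring the changed node until nothing else is available) can be sketched in Python as follows. This is an illustration under assumptions: `children` maps every node to its list of children, and the names `dirty_order` and `changed` are made up for the example. Everything emitted from the changed node onward still had an unprocessed ancestor at that point, and the only unprocessed source left was the changed node, so the returned suffix contains exactly the changed node and its descendants.

```python
from collections import deque


def dirty_order(children, changed):
    """Return `changed` and its descendants in a valid evaluation order."""
    indegree = {n: 0 for n in children}
    for n in children:
        for m in children[n]:
            indegree[m] += 1
    ready = deque(n for n in children if indegree[n] == 0 and n != changed)
    changed_ready = indegree[changed] == 0
    order, suffix_start = [], None
    while ready or changed_ready:
        if ready:                      # prefer any node other than the changed one
            n = ready.popleft()
        else:                          # the changed node is the only choice left
            n, changed_ready = changed, False
            suffix_start = len(order)
        order.append(n)
        for m in children[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                if m == changed:
                    changed_ready = True
                else:
                    ready.append(m)
    if len(order) != len(children):
        raise ValueError("graph is not a DAG")
    return order[suffix_start:]


graph = {"A": ["B", "C"], "E": ["C"], "B": ["D"], "C": ["D"], "D": ["O1"], "O1": []}
print(dirty_order(graph, "B"))
```

Only the nodes in the returned list need to be re-evaluated, in that order; C is skipped here because B has no effect on it.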

counting leaf nodes in a tree

Assume we have a tree where every node has pre-decided set of outgoing nodes. Is it possible to come up with a fast way/optimizations to count the number of leaf nodes given a level value? Would be great if someone could suggest any ideas/links/resources to do the same.
No, you'd still have to traverse the entire tree. There's no way of predicting the precise structure, or approximating it, from only the number of child nodes of each node of the tree.
Apart from that: just keep a counter and update it on each insertion. Far simpler and wouldn't change time-complexity of any operation, except for counting leaves, which would be reduced to O(1).
This can actually get pretty tough, as it depends on the programming language, the input data structure, whether the tree is binary or general (arbitrary number of children), and the size of the tree.
The most general idea is to run a DFS or BFS, starting from the root, to get every node level and then make a list of sets where each set contains the nodes of a single level. The set can be any structure, standard list is fine.
Let's say you are working in C++ which is good, if not the best practical choice if you need performance (even better than C).
Let's say we have a general tree and the input structure is adjacency list as you mentioned.
//nodes are numbered from zero to N-1
vector<vector<int>> adjList;
Then you run either a BFS or DFS, either will do for a tree, keeping a level for each node. The level of the next node is the level of its parent plus one.
Once you discover the level, you put the node in like this.
vector<vector<int>> nodesPartitionedByLevels(nodeCount);
//run bfs here
//inside it you call
nodesPartitionedByLevels[level].push_back(node);
That's about it.
Then when you have the levels, you iterate over all the nodes on that level and check whether the adjacency list of each one contains any nodes.
Basically you call adjList[node].empty(). If true, then that is a leaf node.
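For comparison, here is a compact Python sketch of the same idea. It is a variant rather than a translation: instead of storing every level, it runs a BFS that stops descending at the requested level and counts the nodes there whose child list is empty. It assumes `adj_list[node]` holds only the children of `node`, that the root is at level 0, and that the function name is free to choose.

```python
from collections import deque


def leaves_at_level(adj_list, root, level):
    """Count the nodes at depth `level` (root is depth 0) that have no children."""
    depth = {root: 0}
    queue = deque([root])
    count = 0
    while queue:
        node = queue.popleft()
        if depth[node] == level:
            if not adj_list[node]:     # empty child list means leaf
                count += 1
            continue                   # nothing below this level is needed
        for child in adj_list[node]:
            depth[child] = depth[node] + 1
            queue.append(child)
    return count


# Nodes are numbered from zero to N-1; node 0 is the root.
adj_list = [[1, 2], [3], [], []]
print(leaves_at_level(adj_list, 0, 1))
```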

Constructing a binary tree from a list of its edges (node pairs)

I'd like to construct a binary tree from a quite unusual input. The input contains:
Total number of nodes.
The integer label of the root.
A list of all edges (vertices/nodes that are connected to each other). The edges in the list are UNSORTED; there is only one rule for determining left/right children: the child in the edge that appears first in the list is always on the left. The order of child/parent in each vertex pair is also random.
I've come up with some straightforward solutions but they require multiple searches through the list of all edges (I'd basically find the 2 edges that have the labeled root in them and repeat this process for all the subtrees).
I imagine this straightforward approach would be VERY inefficient for trees with a big amount of nodes, but I can't come up with anything else.
Any ideas for more efficient algorithms to solve this?
Here's an example for better visualization:
INPUT: 5 NODES, ROOT LABELED 2, LIST OF EDGES: [(1,0),(1,2),(2,3),(1,4)]
The tree would look like this:
      2
     / \
    1   3
   / \
  0   4
It is important to clarify whether the given edge list is stated to be directed or not.
If edges are given in a directed fashion (i.e. it is stated that any given edge A-B also includes the information that A is a parent of B) storing the edges in an adjacency list while recording number of incoming edges for each vertex in an array should be sufficient. Once you go through the array for the incoming edges, the vertex with 0 incoming edges(i.e. parents) should be the root. Then you can run a DFS in linear time complexity to traverse the graph and put it in any data structure that is best for your needs.
If the edges given are stated to be undirected, the scheme changes a bit. In that case, you don't have the concept of incoming and outgoing edges. As no structure for the tree is specified (e.g. BST, etc.), you can basically consider any node with fewer than 3 edges as the root (all the leaves and intermediary nodes with a single child) and run DFS as mentioned above.
A simple solution is: "Link all the edges in the tree, that's it!"
Start by preparing a dictionary. If the nodes for the start and end points of an edge don't exist yet, create those nodes. As the order is random in nature, you can set their left and right pointers to NULL initially.
You have rule - " the child in the edge that appears first in list is always on the left.". So create child accordingly.
Also, you already know the root of the tree so you can iterate across the nodes you have constructed so far.
Through this you can generate tree in one shot.
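One way to turn this into code, sketched in Python under a few assumptions: the edges are undirected pairs, each node's neighbours are recorded in the order its edges appear in the list, and a node with a single child gets it as the left child (the question does not say what should happen in that case). The function name `build_tree` and the `{label: (left, right)}` result shape are made up for the example.

```python
from collections import defaultdict, deque


def build_tree(root, edges):
    """Return {label: (left, right)}, using None for a missing child."""
    neighbours = defaultdict(list)
    for a, b in edges:                 # one pass; per-node edge order is preserved
        neighbours[a].append(b)
        neighbours[b].append(a)
    tree = {}
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Every neighbour except the parent is a child, earliest edge first.
        kids = [n for n in neighbours[node] if n != parent[node]]
        if len(kids) > 2:
            raise ValueError(f"node {node} has more than two children")
        for k in kids:
            parent[k] = node
            queue.append(k)
        kids += [None] * (2 - len(kids))
        tree[node] = (kids[0], kids[1])
    return tree


print(build_tree(2, [(1, 0), (1, 2), (2, 3), (1, 4)]))
```

The edge list is scanned once and every node is visited once, so the construction is linear in the number of edges rather than requiring a search of the whole list per node.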
Hope this helps!

Tree root finding

How could I get a rooted tree from a set of nodes and edges?
(I'm working with a connectivity matrix; each edge has a weight, graph[i][j], and there are no negative edges.) Later I need to do DFS and find LCAs in that tree, so it would be good to optimize for that.
I suppose that your matrix represents the child relationship (i.e. M[i][j] tells that j is the child of i), for a directed graph G(V,E).
You have 2 different strategies:
use a bit vector: go through each cell of your matrix, and mark the child index in the vector if the cell's weight is not null; the root is the vertex not set in the vector,
look for the columns (or rows, if your matrix is column first) whose cells are all null (no ancestors).
The second solution is better for dense matrices. Its worst running time would be when the root is the last entry (O(V²)). In this case you can stop at the first hit, or run til the end to get all the roots, if your graph has many.
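The first one is better suited for sparse matrices. Its running time is O(E) if the matrix is stored sparsely, so that only the non-null cells are visited; on a plain two-dimensional array it has to go through all the cells, which is O(V²). You also get all the roots with this algorithm.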
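A small Python sketch of the first strategy on a dense matrix. It assumes `matrix[i][j]` is the weight of the edge from i to its child j and that 0 means "no edge"; both conventions are assumptions about the asker's data, and `roots_from_matrix` is just an illustrative name.

```python
def roots_from_matrix(matrix):
    """Return the indices of all vertices that are nobody's child."""
    n = len(matrix)
    has_parent = [False] * n           # plays the role of the bit vector
    for i in range(n):
        for j in range(n):
            if matrix[i][j]:           # non-zero weight: j is a child of i
                has_parent[j] = True
    return [v for v in range(n) if not has_parent[v]]


# Edges 0 -> 1 (weight 2), 0 -> 2 (weight 5), 1 -> 3 (weight 1).
matrix = [
    [0, 2, 5, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(roots_from_matrix(matrix))
```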
If you are certain that your graph has only one root, you can use the walk the edges up technique, as described in other answers.
Here is a computationally MUCH SLOWER version that is also much easier to code. For small graphs, it is just fine.
Find the node with in-degree zero!
You have to compute all node degrees, O(n), but depending on the setting, this is often much easier to code and thus less prone to error.
Pick one node in the tree and walk up, that is, against the orientation of the edges. When you find a node without an ancestor you have the root.
If you need to do something like this often, just remember the parent node for each node.
A DFS from any node of a graph gives you a tree (assuming the graph is connected, of course).
You can iterate, starting a DFS from each node as a possible root; you will eventually get a spanning tree this way, if there is one. The complexity will be O(V^2+VE).
EDIT: it works because for any finite graph, if there is a route from node a to node b, there will be a path from a to b in the tree that a DFS started at a creates. So, assuming there is a possible spanning tree, there is a root r from which you can get to each v in V. When r is chosen as the root during the iteration, there is a path from r to each v in V, so there will be a path from r to it in the spanning tree.

An algorithm to check if a vertex is reachable

Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the data in the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing the graph at the time of the query. I.e., it's basically a greedy algorithm that assumes you're going to eventually use enough of the combinations that it's easiest to just traverse them all and store the result. Depending on the size of the graph, the pre-processing step may be slow, but once it's done executing a query becomes quite fast (constant time, and usually a pretty small constant at that).
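The answer does not say how to build that bit matrix. One standard option is Warshall's algorithm, sketched below in Python with vertices numbered 0 to n-1 (the numbering and the function name are assumptions for the example). Preprocessing costs O(n³) time and O(n²) space, which is why it only pays off for relatively small graphs with many queries.

```python
def reachability_matrix(n, edges):
    """Return reach, where reach[u][v] is True if v can be reached from u by at least one edge."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):                 # allow k as an intermediate vertex
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


reach = reachability_matrix(4, [(0, 1), (1, 2), (3, 0)])
print(reach[3][2], reach[2][0])
```

After the preprocessing, each query is a single lookup in `reach`.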
Depth first search or breadth first search. Stop when you find one. But there's no way to tell there's none without going through every one, no. You can sometimes improve the performance with heuristics, if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and most of the time you know that there's going to be a mostly direct path, then you can have the depth-first search look along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other, but with no vector in between, and to find the path you have to go way out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited and mark it as visited
While there are nodes in the queue:
    Remove the next node from the queue
    If the node is V2, return true
    For every node at the end of an outgoing edge which is not yet visited:
        Mark this node as visited
        Add this node to the queue
    End for
End while
Return false
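The same breadth-first search as a short Python sketch. The graph representation (a dictionary `successors` mapping each vertex to the vertices at the end of its outgoing edges) and the function name are assumptions made for the example.

```python
from collections import deque


def is_reachable(successors, v1, v2):
    """Breadth-first search from v1 that stops as soon as v2 is dequeued."""
    visited = {v1}
    queue = deque([v1])
    while queue:
        node = queue.popleft()
        if node == v2:
            return True
        for nxt in successors.get(node, ()):
            if nxt not in visited:     # mark on enqueue so nothing is queued twice
                visited.add(nxt)
                queue.append(nxt)
    return False


graph = {"V1": ["A", "B"], "A": ["V2"], "B": [], "V2": []}
print(is_reachable(graph, "V1", "V2"), is_reachable(graph, "B", "V2"))
```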
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
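A small pure-Python sketch of the memory-saving variant described in the note: only the base adjacency matrix is stored, and the powers are produced one at a time until the (u, v) entry becomes positive. The helper names are made up, and the naive cubic multiplication is only meant for small matrices.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def shortest_walk_length(adjacency, u, v):
    """Return the smallest k in 1..n with (M^k)[u][v] > 0, or None if there is none."""
    n = len(adjacency)
    power = adjacency                  # M^1
    for k in range(1, n + 1):
        if power[u][v] > 0:            # at least one walk of exactly k steps
            return k
        power = mat_mul(power, adjacency)
    return None


# Edges 0 -> 1 and 1 -> 2.
M = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
print(shortest_walk_length(M, 0, 2), shortest_walk_length(M, 2, 0))
```

A result of `None` means v is not reachable from u in one or more steps.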
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
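A hedged Python sketch of the alternating variant: one breadth-first layer forward from the start, then one layer backward from the destination, stopping as soon as the two searches touch. It assumes both directions are available as dictionaries (`successors` and `predecessors`, hypothetical names); without predecessor information the backward half is not possible.

```python
from collections import deque


def bidirectional_reachable(successors, predecessors, start, goal):
    """Alternate forward and backward BFS layers until they meet or one side runs dry."""
    if start == goal:
        return True
    seen_fwd, seen_bwd = {start}, {goal}
    frontier_fwd, frontier_bwd = deque([start]), deque([goal])
    while frontier_fwd and frontier_bwd:
        for _ in range(len(frontier_fwd)):        # one forward layer
            node = frontier_fwd.popleft()
            for nxt in successors.get(node, ()):
                if nxt in seen_bwd:
                    return True
                if nxt not in seen_fwd:
                    seen_fwd.add(nxt)
                    frontier_fwd.append(nxt)
        for _ in range(len(frontier_bwd)):        # one backward layer
            node = frontier_bwd.popleft()
            for prev in predecessors.get(node, ()):
                if prev in seen_fwd:
                    return True
                if prev not in seen_bwd:
                    seen_bwd.add(prev)
                    frontier_bwd.append(prev)
    return False


succ = {"a": ["b"], "b": ["c"], "c": ["d"]}
pred = {"b": ["a"], "c": ["b"], "d": ["c"]}
print(bidirectional_reachable(succ, pred, "a", "d"))
```

If either frontier empties, that side has explored everything it can reach without touching the other, so no path exists.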
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's Algorithm which creates an O(nlogn) space data structure that takes O(nlogn) time to build and answers queries in O(1) time.
Another approach to this problem would allow you to ignore all of the vertices. If you were to only look at the edges, you can produce a transitive closure array that will show you each vertex that is reachable from any other vertex.
Start with your list of edges:
Va -> Vc
Va -> Vd
Create an array with start location as the rows and end location as the columns. Fill the arrays with 0. For each edge in the list of edges, place a one in the start,end coordinate of the edge.
Now you iterate a few times until either V1,V2 is 1 or there are no changes.
For each row:
    NextRowN = RowN
    For each column that is true for RowN
        Use boolean OR to OR in the results of that row of that number with the current NextRowN.
    Set RowN to NextRowN
If you run this algorithm until the end, you will quickly have a complete list of all reachable vertices without looking at any of them. The runtime is proportional to the number of edges. This would work well with a reasonable implementation and a reasonable number of edges.
A slightly more complex version of this algorithm would be to only calculate the vertices reachable by V1. To do this, you would focus your scope on the ones that are currently reachable at any given time. You can also limit adding rows to only one time, since the other rows are never changing.
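The row-OR iteration can be sketched in Python by storing each row as an integer used as a bitset, so that "OR in the results of that row" is a single `|=`. Vertices are assumed to be numbered 0 to n-1, and the function names are illustrative.

```python
def transitive_closure(n, edges):
    """Return rows, where bit j of rows[i] is set if vertex j is reachable from vertex i."""
    rows = [0] * n
    for start, end in edges:
        rows[start] |= 1 << end
    changed = True
    while changed:                     # repeat until a full pass changes nothing
        changed = False
        for i in range(n):
            next_row = rows[i]
            for j in range(n):
                if (rows[i] >> j) & 1: # column j is true for row i
                    next_row |= rows[j]
            if next_row != rows[i]:
                rows[i] = next_row
                changed = True
    return rows


def reaches(rows, v1, v2):
    """Check whether v2 is reachable from v1 in the closure."""
    return bool((rows[v1] >> v2) & 1)


rows = transitive_closure(4, [(0, 2), (0, 3), (2, 1)])
print(reaches(rows, 0, 1), reaches(rows, 1, 0))
```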
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth first or breadth first search that stops when it encounters a vertex that it has already seen. The vertex will be processed on the first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.
