Data structure for directed graphs, allowing fast node deletion?

Data structure for directed graphs, allowing fast node deletion? - algorithm

I need to store a directed graph (not necessarily acyclic), so that node deletion is as fast as possible. I wouldn't mind storing additional data in order to know exactly which edges have to go when a node is deleted.
If I store a list of edges (as pairs of node indexes), then when killing some node n I have to search the whole list for edges whose source or target is n. This is too costly for my application. Can this search be avoided by storing some additional data in the nodes?
One idea would be to have each node store its own sources and targets, as two separate lists. When node n is killed, its lists are killed too. But then, how would all the targets/sources linked to node n know to update their own lists (i.e., to eliminate the defunct node from their lists)? This would require some costly searching...
Can it be avoided?
Thx.

You have two choices without getting too fancy are Adjacency List and Adjacency Matrix. The former is probably best for what you're doing. To remove a node, simply eliminate the list for that node for all of its out edges. For the in-edges, you might consider keeping a hash-table for each list for O(1) lookups.
This is a good overview
http://www.algorithmist.com/index.php/Graph_data_structures

I solved it! This is the solution for undirected graphs, adding direction is easy afterwards.
In each vertex I keep a special adjacency list. It is a list (double linked, for easy insertion/deletion) whose elements are "slots":
class Slot {
Slot prev, next; // pointers to the other slots in the list
Slot other_end; // the other end of the edge: not a vertex, but a Slot!
Vertex other_vertex; // the actual vertex at the other end
void kill() {
if (next!=null) next.kill(); // recursion
other_end.pop_out();
}
void pop_out() {
if (next!=null) next.prev = prev;
if (prev!=null) prev.next = next;
else other_end.other_vertex.slot_list = next; // in case this slot is the
// first in its list, I need
// to adjust the vertex's
// slot_list pointer.
// other_end.other_vertex is actually the vertex to which this slot belongs;
// but this slot doesn't know it, so I have to go around like this.
}
}
So basically each edge is represented by two slots, cross-pointing one to each other. And each vertex has a list of such slots.
When a vertex is killed, it sends recursively a "kill" signal up its slot list. Each slot responds by destroying its other_end (which graciously pops out from the neighbor's list, mending the prev/next pointers behind).
This way a vertex plus all its edges are deleted without any searching. The price I have to pay is memory: instead of 3 pointers (prev, next and vertex for a regular double linked adjacency list), I have to keep 4 pointers (prev, next, vertex and other_end).
This is the basic idea. For directed graphs, I only have to distinguish somehow between IN slots and OUT slots. Probably by dividing each vertex's adjacency list in two separate lists: IN_slot_list and OUT_slot_list.

Related

Efficient Graph Traversal for Node Editor Evaluation

I have a directed acyclic graph created by users, where each node (vertex) of the graph represents an operation to perform on some data. The outputs of a node depend on its inputs (obviously), and that input is provided by its parents. The outputs are then passed on to its children. Cycles are guaranteed to not be present, so can be ignored.
This graph works on the same principle as the Shader Editor in Blender. Each node performs some operation on its input, and this operation can be arbitrarily expensive. For this reason, I only want to evaluate these operations when strictly required.
When a node is updated, via user input or otherwise, I need to reevaluate every node which depends on the output of the updated node. However, given that I can't justify evaluating the same node multiple times, I need a way to determine the correct order to update the nodes. A basic breadth-first traversal doesn't solve the problem. To see why, consider this graph:
A traditional breadth-first traversal would result in D being evaluated prior to B, despite D depending on B.
I've tried doing a breadth-first traversal in reverse (that is, starting with the O1 and O2 nodes, and traversing up the graph), but I seem to run into the same problem. A reversed breadth-first traversal will visit D before B, thus I2 before A, resulting in I2 being ordered after A, despite A depending on I2.
I'm sure I'm missing something relatively simple here, and I feel as though the reverse traversal is key, but I can't seem to wrap my head around it and get all the pieces to fit. I suppose one potential solution is to use the reverse traversal as intended, but rather than avoiding visiting each node more than once, just visiting each node each time it comes up, ensuring that it has a definitely correct ordering. But visiting each node multiple times and the exponential scaling that comes with that is a very unattractive solution.
Is there a well-known efficient algorithm for this type of problem?

Yes, there is a well known efficient algorithm. It's topological sorting.
Create a dictionary with all nodes and their corresponding in-degree, let's call it indegree_dic. in-degree is the number of parents/or incoming edges to that node. Have a set S of the nodes with in-degree equal to zero.
Taken from the Wikipedia page with some modification:
L ← Empty list that will contain the nodes sorted topologically
S ← Set of all nodes with no incoming edge that haven't been added to L yet
while S is not empty do
remove a node n from S
add n to L
for each child node m of n do
decrement m's indegree
if indegree_dic[m] equals zero then
delete m from indegree_dic
insert m into S
if indegree_dic has length > 0 then
return error (graph is not a DAG)
else
return L (a topologically sorted order)
This sort is not unique. I mention that because it has some impact on your algorithm.
Now, whenever a change happens to any of the nodes, you can safely avoid recalculation of any nodes that come before the changed node in your topologically sorted list, but need to nodes that come after it. You can be sure that all the parents are processed before their children if you follow the sorted list in your calculation.
This algorithm is not optimal, as there could be nodes after the changed node, that are not children of that node. Like in the following scenario:
A
/ \
B C
One correct topological sort would be [A, B, C]. Now, suppose B changes. You skip A because nothing has changed for it, but recalculate C because it comes after B. But you actually don't need to, because B has no effect on C whatsoever.
If the impact of this isn't big, you could use this algorithm and keep the implementation easier and less prone to bugs. But if efficiency is key, here are some ideas that may help:
You can do a topological sort each time and include the which node has change as a factor. When choosing nodes from S in the above algorithm, choose every other node that you can before you choose the changed node. In other words, you choose the changed node from S only when S has length 1. This guarantees that you process every node that isn't below the hierarchy of the changed node before it. This approach helps when the sorting is much cheaper then processing the nodes.
Another approach, which I'm not entirely sure is correct, is to look after the changed node in the topological sorted list and start processing only when you reach the first child of the changed node.
Another way relies on idea 1 but is helpful if you can do some pre-processing. You can create topological sorts for each case of one node being changed. When a node is changed, you try to put it in the ordering as late as possible. You save all these ordering in a node to ordering dictionary and based on which node has changed you choose that ordering.

Constructing a binary tree from a list of its edges (node pairs)

I'd like to construct a binary tree from a quite unusual input. The input contains:
Total number of nodes.
The integer label of the root.
A list of all edges (vertices/nodes that are connected to each
other). The edges in the list are UNSORTED, there is only one rule for
determining left/right children - the child in the edge that appears
first in list is always on the left. The order of child/parent in the vertices pair is also random.
I've come up with some straighforward solutions but they require multiple searches through the list of all edges (I'd basically find the 2 edges that have the labeled root in them and repeat this process for all the subtrees.)
I imagine this straightforward approach would be VERY inefficient for trees with a big amount of nodes, but I can't come up with anything else.
Any ideas for more efficient algorithms to solve this?
Here's an example for better visualization:
INPUT: 5 NODES, ROOT LABELED 2, LIST OF EDGES: [(1,0),(1,2),(2,3),(1,4)]
The tree would look like this:
2
1 3
0 4

It is important to clarify whether the given edge list is stated to be directed or not.
If edges are given in a directed fashion (i.e. it is stated that any given edge A-B also includes the information that A is a parent of B) storing the edges in an adjacency list while recording number of incoming edges for each vertex in an array should be sufficient. Once you go through the array for the incoming edges, the vertex with 0 incoming edges(i.e. parents) should be the root. Then you can run a DFS in linear time complexity to traverse the graph and put it in any data structure that is best for your needs.
If the edges given are stated to be undirected, the scheme changes a bit. In that case, you don't have the concept of incoming and outgoing edge. In that case, as no structure for the array is specified(e.g. BST, etc.) you can basically consider any node with less than 3 edges as root and run DFS as mentioned above. (all the leaves and intermediary nodes with single child nodes)

A simple solution is: "Link all the edges in the tree that it!"
Start preparing a dictionary. If nodes don't exist by the start and end point, create them nodes. As it is random in nature, you can set their left and right pointers to NULL initially.
You have rule - " the child in the edge that appears first in list is always on the left.". So create child accordingly.
Also, you already know the root of the tree so you can iterate across the nodes you have constructed so far.
Through this you can generate tree in one shot.
Hope this helps!

Object and Pointer Graph representations

I keep seeing everywhere that there are 3 ways to represent graphs:
Objects and pointers
Adjacency matrix
Adjacency lists
However, I just plain don't understand what these Object and pointer representations are - yet every recruiter, and many blogs cite Steve Yegge's blog that they are indeed a separate representation.
This widely accepted answer to a very similar question seems to suggest that the vertex structures themselves have no internal pointers to other vertices, and instead all edges are represented by edge structures which contain pointers to the adjacent vertices.
How does this representation offer any discernible analytical advantage in any scenario?

From the top of my head, I hope I have the facts correct.
Conceptually, graph tries to represent how a set of nodes (or vertices) are related (connected) to each other (via edges).
However, in actual physical device (memory), we have a continuous array of memory cell.
So, in order to represent the graph, we can choose to use a matrix.
In this case, we use the vertex index as the row and column and the entry has value 1 if the vertices are adjacent to each other, 0 otherwise.
Alternatively, you can also represent a graph by allocating an object to represent the node/vertex which points to a list of all the nodes that are adjacent to it.
The matrix representation gives the advantage when the graph is dense, meaning when most of the nodes/vertices are connected to each other. This is because in such cases, by using the entry of matrix, it saves us from having to allocate an extra pointer (which need a word size memory) for each connection.
For sparse graph, the list approach is better because you don't need to account for the 0 entries when there is no connection between the vertices.
Hope it helps.

For now I have a hard time finding a pro w.r.t typical "graph algorithms". But it sure is possible to represent a graph with objects and pointers and a very natural thing to do if you think of it as a representation of something you just drew on a whiteboard.
Think of a scenario where you want to combine nodes of a graph in a certain order.
Nodes have payloads that contain domain data, the graph structure itself is not a core aspect of your program.
Sure, you can update your lists / matrix for every operation, but given an "objects and pointers" structure, you can do the merging locally. Further, if nodes have payloads, it means that lists/matrix will feature node id's that identify the actual node objects. A combination would mean you update your graph representation, follow the node identifiers and do the actual processing. It may feel more intuitively to work on your actual node objects and simply remove pointerswhen collapsing a neighbor (and delete that node) .
Besides, there are more ways to represent a graph:
E.g. just as triples, like Turle does
Or as offset
representation (offsets per node into an edge array), e.g. this
Boost data structure (disclaimer: I have not tested the linked
implementation myself)
etc

Here a way i have been using to create Graph with this concept :
#include <vector>
class Node
{
public:
Node();
void setLink(Node *n); // *n as argument to pass the address of the node
virtual ~Node(void);
private:
vector<Node*> m_links;
};
And the function responsible for creating the link between vertices is :
void Node::setLink(Node *n)
{
m_links.push_back(n);
}

Objects and pointers representation reduces space complexity to exactly V+E, where V is the number of vertices, E - the number of edges (down from V+2E in Adjacency List or even 2V+2E if you store index->Vertex mapping in a separate hash map), sacrificing time complexity: particular edge lookup will take O(E), which equals O(V^2) in a Dense graph (up from O(V) in Adjacency List). The space saving is achieved by removing duplicated edges that appear in the Adjacency List.

Finding the list of common children (descendants) for any two nodes in a cyclic graph

I have a cyclic directed graph and I was wondering if there is any algorithm (preferably an optimum one) to make a list of common descendants between any two nodes? Something almost opposite of what Lowest Common Ancestor (LCA) does.

As user1990169 suggests, you can compute the set of vertices reachable from each of the starting vertices using DFS and then return the intersection.
If you're planning to do this repeatedly on the same graph, then it might be worthwhile first to compute and contract the strong components to supervertices representing a set of vertices. As a side effect, you can get a topological order on supervertices. This allows a data-parallel algorithm to compute reachability from multiple starting vertices at the same time. Initialize all vertex labels to {}. For each start vertex v, set the label to {v}. Now, sweep all vertices w in topological order, updating the label of w's out-neighbors x by setting it to the union of x's label and w's label. Use bitsets for a compact, efficient representation of the sets. The downside is that we cannot prune as with single reachability computations.

I would recommend using a DFS (depth first search).
For each input node
Create a collection to store reachable nodes
Perform a DFS to find reachable nodes
When a node is reached
If it's already stored stop searching that path // Prevent cycles
Else store it and continue
Find the intersection between all collections of nodes
Note: You could easily use BFS (breadth first search) instead with the same logic if you wanted.
When you implement this keep in mind there will be a few special cases you can look for to further optimize your search such as:
If an input node doesn't have any vertices then there are no common nodes
If one input node (A) reaches another input node (B), then A can reach everything B can. This means the algorithm wouldn't have to be ran on B.
etc.

Why not just reverse the direction of the edge and use LCA?

Is there an effient way of determining whether a leaf node is reachable from another arbitrary node in a Directed Acyclic Graph?

Wikipedia: Directed Acyclic Graph
Not sure if leaf node is still proper terminology since it's not really a tree (each node can have multiple children and also multiple parents) and also I'm actually trying to find all the root nodes (which is really just a matter of semantics, if you reverse the direction of all the edges it'd they'd be leaf nodes).
Right now we're just traversing the entire graph (that's reachable from the specified node), but that's turning out to be somewhat expensive, so I'm wondering if there's a better algorithm for doing this. One thing I'm thinking is that we keep track of nodes that have been visited already (while traversing a different path) and don't recheck those.
Are there any other algorithmic optimizations?
We also thought about keeping a list of root nodes that this node is a descendant of, but it seems like maintaining such a list would be fairly expensive as well if we need to check if it changes every time a node is added, moved, or removed.
Edit:
This is more than just finding a single node, but rather finding ALL nodes that are endpoints.
Also there is no master list of nodes. Each node has a list of it's children and it's parents. (Well, that's not completely true, but pulling millions of nodes from the DB ahead of time is prohibitively expensive and would likely cause an OutOfMemory exception)
Edit2:
May or may not change possible solutions, but the graph is bottom-heavy in that there's at most a few dozen root nodes (what I'm trying to find) and some millions (possibly tens or hundreds of millions) leaf nodes (where I'm starting from).

There are a few methods that each may be faster depending on your structure, but in general what youre going to want is a traversal.
A depth first search, goes through each possible route, keeping track of nodes that have already been visited. It's a recursive function, because at each node you have to branch and try each child node of it. There's no faster method if you dont know which way to look for the object you just have to try each way! You definitely need to keep track of where you have already been because it would be wasteful otherwise. It should require on the order of the number of nodes to do a full traversal.
A breadth first search is similar but visits each child of the node before "moving on" and as such builds up layers of distance from the chosen root. This can be faster if the destination is expected to be close to the root node. It would be slower if it is expected to be all the way down a path, because it forces you to traverse every possible edge.
Youre right about maybe keeping a list of known root nodes, the tradeoff there is that you basically have to do the search whenever you alter the graph. If you are altering the graph rarely this is acceptable, but if you alter the graph more frequently than you need to generate this information, then of course it is too costly.
EDIT: Info Update.
It sounds like we are actually looking for a path between two arbitrary nodes, the root/leaf semantic keeps getting switched. The DepthFirstSearch (DFS) starts at one node, and then for each unvisited child, recurse. Break if you find the target node. Due to the way recursion evaluates, this will traverse all the way down the 'left' path, then enumerate nodes at this distance before ever getting to the 'right' path. This is time costly and inefficient if the target node is potentially the first child on the right. BreadthFirst walks in steps, covering all children before moving forward. Because your graph is bottom heavy like a tree, both will be approximately the same execution time.
When the graph is bottom heavy you might be interested in a reverse traversal. Start at the target node and walk upwards, because there are relatively fewer nodes in this direction. So long as the nodes in general have more parents than children, this direction will be much faster. You can also combine the approaches, stepping one up and one down , then comparing lists of nodes, and meeting somewhere in the middle. (this combination might seem the fastest if you ignore that twice as much work is done at each step).
However, since you said that your graph is stored as a list of lists of children, you have no real way of traversing the graph backwards. A node does not know what its parents are. This is a problem. To fix it you have to get a node to know what its parents are by adding that data on graph update, or by creating a duplicate of the whole structure (which you have said is too large). It will need the whole structure to be rewritten, which sounds probably out of the question due to it being a large database at this point.
There's a lot of work to do.
http://en.wikipedia.org/wiki/Graph_(data_structure)

Just color (keep track of) visited nodes.
Sample in Python:
def reachable(nodes, edges, start, end):
color = {}
for n in nodes:
color[n] = False
q = [start]
while q:
n = q.pop()
if color[n]:
continue
color[n] = True
for adj in edges[n]:
q.append(adj)
return color[end]

For a vertex x you want to compute a bit array f(x), each bit corresponds to a root vertex Ri, and 1 (resp 0) means "x can (resp can't) be reached from root vertex Ri.
You could partition the graph into one "upper" set U containing all your target roots R and such that if x in U then all parents of x are in U. For example the set of all vertices at distance <=D from the closest Ri.
Keep U not too big, and precompute f for each vertex x of U.
Then, for a query vertex y: if y is in U, you already have the result. Otherwise recursively perform the query for all parents of y, caching the value f(x) for each visited vertex x (in a map for example), so you won't compute a value twice. The value of f(y) is the bitwise OR of the value of its parents.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio