Defining Invalid Paths on a Graph - data-structures

We are working on a project for which a regular directed graph is suitable in most cases. However, we want to invalidate some paths on our graph. For example, if our graph is:
A->B
A->D
B->C
D->C
Then A->B->C is a valid path but A->D->C is not. We could define the invalid paths somewhere and do a validation check every time, but this would cause a significant performance problem, since our application depends heavily on the graph.
So, is there a special data structure or algorithm for this type of situation?
Thanks

You could store a mapping of from:list(to) at each node. In your example, node D would then know that a path coming from A cannot go on to C, because C isn't in the list of nodes that a path arriving from A is allowed to continue to. If you have a depth greater than 1, the tuple of nodes leading up to the current node can be the key instead of that one node. This is a lot like how eBGP works for internet routing.
On a different note, you will need to do a check no matter what if you decide to use a data structure like you're describing. Either that, or store multiple graphs for each situation.
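A minimal sketch of the from:list(to) idea, using the example graph from the question. The class and method names (`RestrictedGraph`, `restrict`, `next_nodes`, `is_valid_path`) are made up for illustration, and the sketch only handles a depth of 1 (the key is the single previous node). When no rule is stored for a predecessor, every outgoing edge is assumed to be allowed.

```python
from collections import defaultdict


class RestrictedGraph:
    """Directed graph where a node can limit onward moves by predecessor."""

    def __init__(self):
        self.succ = defaultdict(set)      # node -> set of successors
        # allowed[node][prev] = successors permitted when arriving from prev
        self.allowed = defaultdict(dict)

    def add_edge(self, u, v):
        self.succ[u].add(v)

    def restrict(self, prev, node, allowed_next):
        """Paths arriving at `node` from `prev` may only continue to `allowed_next`."""
        self.allowed[node][prev] = set(allowed_next)

    def next_nodes(self, prev, node):
        out = self.succ.get(node, set())
        rules = self.allowed.get(node, {})
        if prev in rules:
            return out & rules[prev]
        return set(out)  # no rule stored: every outgoing edge is allowed

    def is_valid_path(self, path):
        for i in range(len(path) - 1):
            prev = path[i - 1] if i > 0 else None
            if path[i + 1] not in self.next_nodes(prev, path[i]):
                return False
        return True


g = RestrictedGraph()
for u, v in [("A", "B"), ("A", "D"), ("B", "C"), ("D", "C")]:
    g.add_edge(u, v)
g.restrict("A", "D", [])  # arriving at D from A, nothing may follow
```

The check still happens during traversal, but it is a dictionary lookup at the current node rather than a scan over a separate list of invalid paths.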

Related

Graph implementation adjacency list vs set

After reading about how to implement a graph it seems I have basically two options:
Matrix
Adjacency list
In order to decide which implementation to use this post can be useful.
When an adjacency list is used to implement a graph, checking whether there is an edge between two nodes may take linear time (for nodes that are connected to all other nodes).
That makes me wonder: why not use a HashSet instead of a linked list to keep the neighbors of a node?
This will give us constant time to know if there is an edge between two nodes.
I'm sure there must be a disadvantage to using a Set instead of a linked list, but I can't see it.
I think "list" is just a generic name. I've used a Set and it works perfectly well.
There is no specific reason to use a list instead of a set. Have a look at this link: Graph using set
Hope this helps!
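As a rough illustration of the set-based variant discussed above, here is a small sketch. The `Graph` class and its method names are hypothetical, not taken from the linked post. The trade-offs worth keeping in mind are that a set cannot hold parallel edges between the same pair of nodes and that it typically uses somewhat more memory per neighbor than a plain list.

```python
from collections import defaultdict


class Graph:
    """Directed graph that stores each node's neighbors in a set."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v]  # touch v so that it exists as a node even without out-edges

    def has_edge(self, u, v):
        # Average-case constant time, instead of a linear scan of a list.
        return u in self.adj and v in self.adj[u]

    def neighbors(self, u):
        return self.adj.get(u, set())


g = Graph()
g.add_edge(1, 2)
g.add_edge(1, 3)
```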

Implementing shortcuts (reach) pruning while using A*

I am working on a project for shortest path finding. I have looked at a lot of resources online to come up with a good algorithm.
I am working with OpenStreetMap data and it's clear to me that I have to use the A* algorithm.
While looking for different solutions, I have found that because a way is made of different nodes, one can prune away the intermediate nodes that are not junctions.
How can I do this in a programming language? If anyone has an idea or a further article that can help me, I would be really grateful.
The exact information I found about this pruning that's relevant to OSM is this:
parse all ways a second time; a way will normally become one edge,
but if any nodes apart from the first and the last have a link counter
greater than one, then split the way into two edges at that point.
Nodes with a link counter of one and which are neither first nor last
can be thrown away unless you need to compute the length of the edge.
Have a look at the GraphHopper project (of which I'm the author) or other routing projects for OSM that already do this. The idea is to count the number of ways each node is a member of and mark nodes as junctions if they have a count of three or more (or just one if it is a dead-end 'junction').
Still, the nodes in between should remain accessible, as you need them to plot the route for the end result after calculating it. In GraphHopper we call them pillar nodes (nodes between junctions) and tower nodes (junctions). Here is more detailed information.
Another problem is that you have to calculate GPS-precise routes and not just routes from junction to junction. Look into this change to see how we fixed this via virtual nodes and edges.
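The two-pass rule quoted in the question can be sketched roughly as follows. This is not GraphHopper's actual code: the function name `split_ways` and the input format (each way as a plain list of node ids) are assumptions for illustration. It follows the quoted rule of splitting at any interior node with a link counter greater than one, and it keeps the in-between nodes on each edge so the geometry can still be plotted.

```python
from collections import Counter


def split_ways(ways):
    """Turn OSM-style ways into routing edges.

    ways: list of ways, each a list of node ids in order.
    Returns a list of (start, end, intermediate_nodes) tuples.
    """
    # First pass: count how often each node is referenced by any way.
    link_count = Counter(node for way in ways for node in way)

    # Second pass: cut each way at interior nodes referenced more than once.
    edges = []
    for way in ways:
        if len(way) < 2:
            continue
        start = 0
        for i in range(1, len(way)):
            is_last = i == len(way) - 1
            if is_last or link_count[way[i]] > 1:
                # Keep the skipped nodes so the edge geometry is not lost.
                edges.append((way[start], way[i], way[start + 1:i]))
                start = i
    return edges


ways = [[1, 2, 3, 4, 5], [6, 3, 7]]
edges = split_ways(ways)
```

Here node 3 is shared by both ways, so each way is split there, while nodes 2 and 4 survive only as intermediate points on their edges.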

What data structure to use for digraph paths?

I'm trying to represent a transitive relation (in a database) and having a hard time working out the best data structure.
Basically, the data structure is a series of pairs A → B such that if A → B and B → C, then implicitly A → C. It's important to me to be able to identify which entries are original input and which entries exist implicitly. Asking if A → C is equivalent to me having a digraph and asking if there exists a path from A to C in that digraph.
I could just represent the original entries, but if I do that, then it takes a lot of time to determine whether two items are related, since I need to search all possible paths and this is rather slow.
Alternatively, I can store the original edges, as well as a listing of all paths. This makes adding a new edge easy, because when I add A → B I can just take the Cartesian product of the paths ending in A and the paths starting in B and join them. This has a significant space overhead of O(n^2) in the worst case, but has the nice property that lookups, by far the most common operation, will be constant time. The issue is deleting, where I cannot think of anything other than recalculating all paths that may or may not run through the deleted edge, and this can be really nasty.
Does anyone have any better ideas?
Technical notes: the digraph may be cyclic, but the relation is reflexive so I don't need to represent the reflexivity or store anything about it.
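A rough sketch of the second approach from the question, insert only. It stores reachable pairs rather than full paths, which is an assumption about what the "listing of all paths" needs to contain, and all names (`Closure`, `related`, `is_original`) are made up for illustration. Adding A → B links everything that reaches A with everything reachable from B. Deletion is deliberately left out, since that is the hard part.

```python
from collections import defaultdict


class Closure:
    """Transitive closure maintained incrementally under edge insertion."""

    def __init__(self):
        self.edges = set()             # original input pairs only
        self.desc = defaultdict(set)   # desc[x] = nodes reachable from x
        self.anc = defaultdict(set)    # anc[x]  = nodes that can reach x

    def add_edge(self, a, b):
        self.edges.add((a, b))
        sources = self.anc[a] | {a}    # everything that reaches a, plus a
        targets = self.desc[b] | {b}   # everything reachable from b, plus b
        for s in sources:
            self.desc[s] |= targets
        for t in targets:
            self.anc[t] |= sources

    def related(self, a, c):
        # The relation is treated as reflexive, so a is related to itself.
        return a == c or c in self.desc.get(a, ())

    def is_original(self, a, c):
        return (a, c) in self.edges


c = Closure()
c.add_edge("A", "B")
c.add_edge("B", "C")
```

Lookups are a set membership test, while each insertion costs time proportional to the number of (source, target) pairs it connects.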
This is called the Reachability problem.
It would seem that you want an efficient online algorithm, which is an open problem, and an area of much research.
See my similar question on cs.SE: An incrementally-condensed transitive-reduction of a DAG, with efficient reachability queries, where I reference several related questions across Stack Exchange:
Related:
What is the fastest deterministic algorithm for dynamic digraph reachability with no edge deletion?
What is the fastest deterministic algorithm for incremental DAG reachability?
Does an algorithm exist to efficiently maintain connectedness information for a DAG in presence of inserts/deletes?
Is there an online-algorithm to keep track of components in a changing undirected graph?
Dynamic shortest path data structure for DAG
Note that even though some algorithms might be for DAGs only, one that supports condensation (that is, collapsing strongly connected components into one node, since their members are considered equal, i.e. they relate back and forth) is equivalent: after condensation, you can query the graph for the representative node in place of any of the condensed nodes (because they were all reachable from each other, and thus related to the rest of the graph in exactly the same way).
My conclusion is that, as of yet, there does not seem to be an efficient way to do this (on the order of O(log n) queries for a dynamic graph, with output-sensitive update times on the condensed graph). For less efficient ways, see the related links above.
The closest practical algorithm I found was here (source), which is an interesting read. I am not sure how easy or practical it would be to adapt this data structure, or any data structure from the papers you will find, to a database.
PS. Consider asking CS-related questions on cs.stackexchange.com in the future.

What are good ways of organizing directed graph data?

Here's my situation. I have a graph that has different sets of data being added at different times. For example, set1 might have a few thousand nodes, and then set2 comes in later and we apply business logic to create edges from set1 to set2 (and discard any vertices from set1 that do not have edges to set2). Then at a later point we get set3, set4, and so on, and the same process applies between each set and its previous set.
Question: what's the best way to organize this? What I did before was name the nodes set1-xx, set2-xx, etc. The problem I faced was that when I was trying to run analytics between the current set and the previous set, I would have to loop through the entire graph and look for all the nodes whose names started with 'setx'. It took a long time as the graph grew, so I thought of another solution, which was to create a node called 'set1' and have it connected to all nodes of that particular set. I am testing it, but I was wondering if there was a more efficient way or a built-in way of handling data structures like this. Is there a way to somehow segment data like this?
I think a general solution would be applicable, but if it helps, I'm using Neo4j (so any solution specific to that database would be good as well).
You have a very special type of a directed graph, called a layered graph.
The choice of the data structure depends primarily on the expected graph density (how many nodes from a previous set/layer are typically connected to a node in the current set/layer) and on the operations that you need to perform on it most of the time. It is definitely a good idea to have each layer directly represented by a numeric index (that is, the outermost structure will be an array of sets/layers), and presumably you can also use one array of vertices per layer. However, the list of edges per vertex (out only, or in and out sets of edges depending on whether you ever traverse the layers backward) may be any of the following:
Linked list of vertex identifiers; this is good if the graph is very sparse and edges are often added/removed.
Sorted array of vertex identifiers; this is good if the graph is quite sparse and immutable.
Array of booleans, indexed by vertex identifiers, determining whether a given vertex is or is not linked by an edge from the current vertex; this is good if the graph is dense.
The "vertex identifier" can take many forms. For example, it can be an index into the array of vertices on the next layer.
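One possible shape for this, as a sketch only: the outer structure is a list of layers, each layer is a list of vertices, and each vertex keeps a sorted list of indices into the next layer (the second option in the list above). The class `LayeredGraph` and its method names are invented for illustration and only outgoing edges are stored.

```python
from bisect import bisect_left


class LayeredGraph:
    """Layered digraph: edges only go from layer k to layer k + 1."""

    def __init__(self):
        self.layers = []  # layers[k] = list of vertex payloads
        self.out = []     # out[k][i] = sorted indices into layer k + 1

    def add_layer(self, vertices):
        verts = list(vertices)
        self.layers.append(verts)
        self.out.append([[] for _ in verts])
        return len(self.layers) - 1  # numeric index of the new layer

    def add_edge(self, layer, i, j):
        """Connect vertex i of `layer` to vertex j of the next layer."""
        if not 0 <= j < len(self.layers[layer + 1]):
            raise IndexError("target vertex not in next layer")
        targets = self.out[layer][i]
        pos = bisect_left(targets, j)
        if pos == len(targets) or targets[pos] != j:
            targets.insert(pos, j)  # keep the list sorted, skip duplicates

    def has_edge(self, layer, i, j):
        targets = self.out[layer][i]
        pos = bisect_left(targets, j)  # binary search in the sorted list
        return pos < len(targets) and targets[pos] == j

    def vertices_in(self, layer):
        return self.layers[layer]  # direct lookup, no scan over the whole graph


g = LayeredGraph()
g.add_layer(["a", "b"])
g.add_layer(["x", "y", "z"])
g.add_edge(0, 0, 2)
g.add_edge(0, 0, 1)
```

Fetching all vertices of one set is then a single index lookup rather than a loop over every node name.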
Your second solution is what I would do: create a setX node and connect all nodes belonging to that set to setX. That way your data is partitioned and it is easier to query.

How do I find all paths through a set of given nodes in a DAG?

I have a list of items (blue nodes below) which are categorized by the users of my application. The categories themselves can be grouped and categorized themselves.
The resulting structure can be represented as a Directed Acyclic Graph (DAG) where the items are sinks at the bottom of the graph's topology and the top categories are sources. Note that while some of the categories might be well defined, a lot is going to be user defined and might be very messy.
Example:
(source: theuprightape.net)
On that structure, I want to perform the following operations:
find all items (sinks) below a particular node (all items in Europe)
find all paths (if any) that pass through all of a set of n nodes (all items sent via SMTP from example.com)
find all nodes that lie below all of a set of nodes (intersection: goyish brown foods)
The first seems quite straightforward: start at the node, follow all possible paths to the bottom and collect the items there. However, is there a faster approach? Remembering the nodes I have already passed through probably helps avoid unnecessary repetition, but are there more optimizations?
How do I go about the second one? It seems that the first step would be to determine the height of each node in the set, so as to determine at which one(s) to start, and then find all paths below that which include the rest of the set. But is this the best (or even a good) approach?
The graph traversal algorithms listed at Wikipedia all seem to be concerned with either finding a particular node or the shortest or otherwise most effective route between two nodes. I think neither is what I want, or did I just fail to see how this applies to my problem? Where else should I read?
It seems to me that it's essentially the same operation for all 3 questions. You're always asking "Find all X below node(s) Y, where X is of type Z". All you need is a generic mechanism for 'locate all nodes below node' (solves Q3), and then you can filter the results for 'nodetype=sink' (solves Q1). For Q2, you have the starting point (your node set) and your ending point (any sink below the starting point), so your solution set is all paths from the specified starting node to the sink. So I would suggest that what you basically have is a tree, and basic tree-traversal algorithms would be the way to go.
Despite the fact that your graph is acyclic, the operations you cite remind me of similar aspects of control flow graph analysis. There is a rich set of algorithms based on dominance that may be applicable. For example, your third operation reminds me of computing dominance frontiers; I believe that algorithm would work directly if you temporarily introduce "entry" and "exit" nodes. The entry node connects to the "given set of nodes" and the exit node connects the sinks.
Also see Robert Tarjan's basic algorithms.
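The generic 'locate all nodes below node' mechanism from the first answer can be sketched as a traversal with a visited set, which also covers the question's point about remembering nodes already passed through. The function names and the sample category names are hypothetical, and the graph is assumed to be a dict mapping each node to its children.

```python
def descendants(graph, start):
    """Return every node strictly below `start` in a DAG.

    graph: dict mapping node -> iterable of child nodes.
    """
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded via another path
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def sinks_below(graph, start):
    """Operation 1: all items (nodes without children) below `start`."""
    return {n for n in descendants(graph, start) if not graph.get(n)}


def below_all(graph, nodes):
    """Operation 3: nodes that lie below every node in `nodes`."""
    sets = [descendants(graph, n) for n in nodes]
    return set.intersection(*sets) if sets else set()


dag = {
    "Food": ["Cheese", "brie", "pretzel"],
    "Cheese": ["brie"],
    "Europe": ["France", "Germany"],
    "France": ["brie"],
    "Germany": ["pretzel"],
}
```

Each node is expanded at most once per call, so one query costs time linear in the size of the subgraph below the start node.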
