Detecting unusual path patterns in a gigantic directed graph - algorithm

I have a gigantic directed graph (100M+ nodes), with multiple path instance records between sets of nodes. The path taken between any two nodes may vary, but what I'd like to find are paths that share multiple intermediary nodes except for a major deviation.
For example, I have 10 instances of a path between node A and node H. Nine of those ten path instances travel through nodes c,d,e,f - but one of the instances travels through c,d,z,e,f - I want to find that "odd" instance.
Any ideas how I would even begin to approach such a problem? Existing analytical frameworks that might be suited to the task?
Details based on comments:
A PIR (path instance record) is a list of nodes traveled through with associated edge traversal times per edge.
Currently, raw PIR records are in a plain string format - obviously, I would want to store it differently based on how I eventually choose to analyze it.
This is not a route solving problem - I never need to find all possible paths; I only need to analyze taken paths (each of which is a PIR).
The list of subpaths needs to be generated from the PIRs.
An example of a PIR would be something like:
nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF
This translates to the path A->B->C->D->F; each number is the cost/time of the edge it sits on - for instance, it cost 300 to go from A->B, 600 to go from B->C, and 100 to go from D->F. The cost/time of each traversal will differ each time the traversal is made. So, for instance, in one PIR, it may cost 100 to go from A->B, but in the next it may cost 150 to go from A->B.
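As a rough illustration of the record format described above, a PIR string could be split into a node list and a per-edge cost list like this. The function name parse_pir and the choice of integer costs are assumptions made for the sketch, not part of the question.

```python
def parse_pir(record):
    """Split 'nodeA;300;nodeB;...' into (nodes, costs).

    Assumes fields alternate node, cost, node, ... and that costs are integers.
    """
    fields = record.split(";")
    nodes = fields[0::2]                      # even positions hold node names
    costs = [int(c) for c in fields[1::2]]    # odd positions hold edge costs
    if len(nodes) != len(costs) + 1:
        raise ValueError("malformed PIR: " + record)
    return nodes, costs


nodes, costs = parse_pir("nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF")
# One (from, to, cost) triple per traversed edge.
edges = list(zip(nodes, nodes[1:], costs))
```

Keeping the node sequence separate from the costs makes it easy to compare paths by node sequence alone, which is what the deviation search needs.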

Go through the list of paths and break them up into sets based on the start and end node, so that, for example, all paths that start at node A and end at node B are in the same set. Then do the same thing with subsequences of those paths, so that, for example, every path with the subsequence a,b,c,d, the start node y and the end node k is in the same set. Reverse paths as required so that you don't end up with one set for paths from k to y and another for paths from y to k.
You can then check whether a subsequence is common enough. If it is, look at the paths that don't contain it and check whether each has a subsequence that is sufficiently close to the common one, based on edit distance. If you are only interested in the path as a whole, you can simply calculate the edit distance between the path and the subsequence, subtract the difference in length, and check whether the result is low enough. It's probably best to compare against a subsequence of the path that starts and ends with the same nodes as the desired subsequence.
For your example, the algorithm would eventually reach the set of paths containing the subsequence c,d,e,f and find that there are 9 of them. This exceeds the amount required for the subsequence to be common enough (and long enough; you probably want sequences of at least length k), so it would then check the paths that are not included. In this case, there is only one. It would then note, either directly or indirectly, that removing z alone turns the sequence c,d,z,e,f into c,d,e,f. This passes the (currently vague) requirements for "odd", and thus the path containing c,d,z,e,f is added to the list of paths to be returned.
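A simplified sketch of this idea follows. It groups whole paths by start and end node, takes the most frequent node sequence in each group as the "usual" one, and flags the remaining paths that are within a small edit distance of it. It compares whole paths rather than all subsequences and does not reverse paths, and the names find_odd_paths, min_share and max_distance are made up for the example.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two node sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute
        prev = cur
    return prev[-1]


def find_odd_paths(paths, min_share=0.8, max_distance=2):
    """Return paths that deviate slightly from a dominant path in their group."""
    groups = defaultdict(list)
    for p in paths:
        groups[(p[0], p[-1])].append(tuple(p))   # group by (start, end)
    odd = []
    for group in groups.values():
        common, count = Counter(group).most_common(1)[0]
        if count / len(group) < min_share:
            continue                              # no dominant path here
        for p in group:
            if p != common and edit_distance(p, common) <= max_distance:
                odd.append(p)
    return odd


usual = ["A", "c", "d", "e", "f", "H"]
paths = [usual] * 9 + [["A", "c", "d", "z", "e", "f", "H"]]
odd = find_odd_paths(paths)
```

For the ten instances from the question, odd holds only the path through z, since it is one insertion away from the sequence that nine of the ten instances share. At 100M+ nodes the grouping step would have to run in a distributed or out-of-core fashion; the sketch only shows the comparison logic.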

Related

Good algorithm for finding shortest path for specific vertices

I'm solving the problem described below and can't think of a better algorithm than trying every permutation of every vertex of every group with every.
I'm given a graph of vertices, along with a list of groups of specific vertices, the goal is to find the shortest path from a specific starting vertex to a specific ending vertex, and the path must pass through at least one vertex from each specified group of vertices.
There are also vertices in the graph that are not part of any given group.
Re-visiting vertices and edges is possible.
The graph data is specified as follows:
Vertex list - each vertex is identified by a sequence number (0 to the number of vertices -1 )
Edge list - list of vertex pairs (by vertex number)
Vertex group list - list of lists of vertex numbers
A specific starting and ending vertex.
I would be grateful for any ideas for a better solution, thank you.
Summary:
We can use bitmasks to efficiently check which groups we have visited so far, and combine this with a traditional BFS/ Dijkstra's shortest-path algorithm.
If we assume E edges, V vertices, and K vertex-groups that have to be included, the below algorithm has a time complexity of O((V + E) * 2^K) and a space complexity of O(V * 2^K). The exponential 2^K term means it will only work for a relatively small K, say up to 10 or 20.
Details:
First, are the edges weighted?
If yes then a "shortest path" algorithm will usually be a variation of Dijkstra's algorithm, in which we keep a (min) priority queue of the shortest paths. We only visit a node once it's at the top of the queue, meaning that this must be the shortest path to this node. Any other shorter path to this node would already have been added to the priority queue and would come before the current iteration. (Note: this doesn't work with negative edge weights).
If no, meaning all edges have the same weight, then there is no need to maintain a priority queue with the shortest edges. We can instead just run a regular Breadth-first search (BFS), in which we maintain a deque with all nodes at the current depth. At each step we iterate over all nodes at the current depth (popping them from the left of the deque), and for each node we add all its not-yet-visited neighbors to the right side of the deque, forming the next level.
The below algorithm works for both BFS and Dijkstra's, but for simplicity's sake for the rest of the answer I'll pretend that the edges have positive weights and we will use Dijkstra's. What is important to take away though is that for either algorithm we will only "visit" or "explore" a node for a path that must be the shortest path to that node. This property is essential for the algorithm to be efficient, since we know that we will at most visit each of the V nodes and E edges only one time, giving us a time complexity of O(V + E). If we use Dijkstra's we have to multiply this with log(V) for the priority queue usage (this also applies to the time complexity mentioned in the summary).
Our Problem
In our case we have the additional complexity that we have K vertex-groups, and our shortest path has to contain at least one node from each of them. This is a big problem, since it destroys our ability to simply go along with the "shortest current path".
See for example this simple graph. Notation: -- means an edge, start is the start node, and end is the end node. A vertex with value 0 does not have a vertex-group, and a vertex with value >= 1 belongs to the vertex-group of that index.
end -- 0 -- 2 -- start -- 1 -- 2
It is clear that the optimal path will first move right to the node in group 1, and then move left until the end. But this is impossible to do for the BFS and Dijkstra's algorithm we introduced above! After we move from the start to the right to capture the node in group 1, we would never ever move back left to the start, since we have already been there with a shorter path.
The Trick
In the above example, if the right-hand side had looked like start -- 0 -- 0, where 0 means the vertex does not belong to a group, then it would be of no use to go there and back to the start.
The decisive reason of why it makes sense to go there and come back, although the path will get longer, is that it includes a group that we have not seen before.
How can we keep track of whether or not at a current position a group is included or not? The most efficient solution is a bit mask. So if we for example have already visited a node of group 2 and 4, then the bitmask would have a bit set at the position 2 and 4, and it would have the value of 2^2 + 2^4 = 4 + 16 = 20 (here ^ means "to the power of").
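A tiny sketch of the bitmask bookkeeping, using the same groups 2 and 4 as in the paragraph above. The variable names are only illustrative. Note that in Python (as in C-like languages) the power of two is written as a left shift, 1 << g, since the ^ operator there means XOR.

```python
mask = 0
for group_id in (2, 4):
    mask |= 1 << group_id      # set the bit for each visited group

# Membership tests are single AND operations.
has_group_4 = bool(mask & (1 << 4))
has_group_3 = bool(mask & (1 << 3))
```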
In the regular Dijkstra's we would just keep a one-dimensional array of size V to keep track of what the shortest path to each vertex is, initialized to a very high MAX value. array[start] begins with value 0.
We can modify this method to instead have a two-dimensional array of dimensions [2 ^ K][V], where K is the number of groups. Every value is initialized to MAX, only array[mask_value_of_start][start] begins with 0.
The value we store at array[mask][node] means Given the already visited groups with bit-mask value of mask, what is the length of the shortest path to reach this node?
Suddenly, Dijkstra's resurrected
Once we have this structure, we can suddenly use Dijkstra's again (it's the same for BFS). We simply change the rules a bit:
In regular Dijkstra's we never re-visit a node
--> in our modification we differentiate by mask and never re-visit a node if it's already been visited for that particular mask.
In regular Dijkstra's, when exploring a node, we look at all neighbors and only add them to the priority queue if we managed to decrease the shortest path to them.
--> in our modification we look at all neighbors, and update the mask we use to check for this neighbor like: neighbor_mask = mask | (1 << neighbor_group_id). We only add a {neighbor_mask, neighbor} pair to the priority queue, if for that particular array[neighbor_mask][neighbor] we managed to decrease the minimal path length.
In regular Dijkstra's we only visit unexplored nodes with the current shortest path to it, guaranteeing it to be the shortest path to this node
--> In our modification we only visit nodes that for their respective mask values are not explored yet. We also only visit the current shortest path among all masks, meaning that for any given mask it must be the shortest path.
In regular Dijkstra's we can return once we visit the end node, since we are sure we got the shortest path to it.
--> In our modification we can return once we visit the end node for the full mask, meaning the mask containing all groups, since it must be the shortest path for the full mask. This is the answer to our problem.
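The modified rules can be sketched as a Dijkstra search over (mask, node) states. This is a minimal sketch under a few assumptions that are not in the question: edges are undirected with positive weights, groups are numbered from 0, and only the length of the walk is returned. The function name shortest_path_through_groups is made up for the example.

```python
import heapq


def shortest_path_through_groups(n, edges, groups, start, end):
    """Length of the shortest walk from start to end that touches every group.

    edges: (u, v, w) triples, assumed undirected with w > 0.
    groups: list of vertex lists; group g corresponds to bit g of the mask.
    Returns None if no such walk exists.
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    node_bits = [0] * n                      # groups each vertex belongs to
    for g, members in enumerate(groups):
        for v in members:
            node_bits[v] |= 1 << g
    full = (1 << len(groups)) - 1
    inf = float("inf")
    dist = [[inf] * n for _ in range(full + 1)]   # dist[mask][node]
    start_mask = node_bits[start]
    dist[start_mask][start] = 0
    heap = [(0, start_mask, start)]
    while heap:
        d, mask, u = heapq.heappop(heap)
        if d > dist[mask][u]:
            continue                          # stale queue entry
        if mask == full and u == end:
            return d                          # first full-mask visit is optimal
        for v, w in adj[u]:
            new_mask = mask | node_bits[v]
            if d + w < dist[new_mask][v]:
                dist[new_mask][v] = d + w
                heapq.heappush(heap, (d + w, new_mask, v))
    return None


# The example graph: end -- 0 -- 2 -- start -- 1 -- 2, with unit weights.
# Vertices 0..5 from left to right; group "1" is bit 0, group "2" is bit 1.
edges = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 1)]
result = shortest_path_through_groups(6, edges, [[4], [2, 5]], start=3, end=0)
```

On the example graph the search steps right to the group-1 vertex, returns through start, and then walks left to end, for a total length of 5. A plain Dijkstra would never revisit start, which is exactly what the extra mask dimension allows.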
If this is too slow...
That's it! Because time and space complexity are exponentially dependent on the number of groups K, this will only work for very small K (of course depending on the number of nodes and edges).
If this is too slow for your requirements then there might be a more sophisticated algorithm for this that someone smarter can come up with, it will probably involve dynamic programming.
It is very possible that this is still too slow, in which case you will probably want to switch to some heuristic, that sacrifices accuracy in order to gain more speed.

What algorithm should I use to get all possible paths in a directed weighted graph, with positive weights?

I have a directed weighted graph, with positive weights, which looks something like this :-
What I am trying to do is:-
Find all possible paths between two nodes.
Arrange the paths in ascending order, based on their path length (as given by the edge weights), say top 5 at least.
Use an optimal way to do so, so that even in cases of larger number of nodes, the program won't take much time computing.
E.g.:- Say my initial node is d, and final node is c.
So the output should be something like
d to c = 11
d to e to c = 17
d to b to c = 25
d to b to a to c = 31
d to b to a to f to c = 38
How can I achieve this?
The best approach would be to use Dijkstra's shortest path algorithm, with which we can get a shortest path in O(E + V log V) time.
Take this basic approach to help you find the shortest path possible:
Look at all nodes directly adjacent to the starting node. The values carried by the edges connecting the start and these adjacent nodes are the shortest distances to each respective node.
Record these distances on the node - overwriting infinity - and also cross off the nodes, meaning that their shortest path has been found.
Select one of the nodes which has had its shortest path calculated, we’ll call this our pivot. Look at the nodes adjacent to it (we’ll call these our destination nodes) and the distances separating them.
For every ending (destination node):
If the value in the pivot plus the edge value connecting it totals less than the destination node’s value, then update its value, as a new shorter path has been found.
If all routes to this destination node have been explored, it can be crossed off.
Repeat the pivot-selection step until all nodes have been crossed off. We now have a graph where the values held in any node will be the shortest distance to it from the start node.
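The steps above correspond roughly to the following heap-based sketch of Dijkstra's algorithm. The adjacency-dict format and the example weights are invented for illustration, since the graph from the question's figure is not available here.

```python
import heapq


def dijkstra(adj, source):
    """Shortest distance from source to every reachable node.

    adj maps a node to a list of (neighbor, weight) pairs; weights must be positive.
    """
    dist = {source: 0}
    done = set()                     # nodes that have been "crossed off"
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)   # next pivot: closest unfinished node
        if u in done:
            continue
        done.add(u)
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd         # a shorter route to v was found
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical weights, not taken from the question's figure.
graph = {"d": [("c", 11), ("e", 5)], "e": [("c", 12)], "c": []}
dist = dijkstra(graph, "d")
```

With a binary heap as used here, the running time is O((V + E) log V); the O(E + V log V) bound quoted above needs a Fibonacci heap.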
Find all possible paths between two nodes
You could use brute force here, but you may get a lot of paths, and it can take years for bigger graphs (>100 nodes, depending on a lot of factors).
Arrange the paths in ascending order, based on their path length (as given by the edge weights), say top 5 at least.
Simply sort them, and take the 5 first. (You could use a combination of a list of edges and an integer/double for the length of the path).
Use an optimal way to do so, so that even in cases of larger number of nodes, the program won't take much time computing.
Even finding all possible paths between two nodes is NP-Hard (Source, it's for undirected graphs, but is still valid). You will have to use heuristics.
What do you mean with a larger number of nodes? Do you mean 100 or 100 million? It depends on your context.
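If only the first few cheapest paths are needed, one alternative to enumerating and sorting everything is a best-first search over partial paths. This is a different technique from the sort-everything approach above, offered only as a sketch: with positive weights, partial paths come off the heap in order of cost, so complete paths are produced cheapest first. It is restricted to simple paths (no repeated nodes) and can still take exponential time in the worst case. The graph and its weights are made up.

```python
import heapq


def k_cheapest_simple_paths(adj, source, target, k):
    """Return up to k (cost, path) pairs in ascending order of cost.

    adj maps a node to a list of (neighbor, weight) pairs with positive weights.
    """
    heap = [(0, [source])]
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            found.append((cost, path))
            continue
        for nxt, w in adj.get(node, []):
            if nxt not in path:              # keep paths simple
                heapq.heappush(heap, (cost + w, path + [nxt]))
    return found


# Hypothetical graph, not the one from the question.
graph = {
    "s": [("t", 10), ("a", 2)],
    "a": [("t", 3), ("b", 1)],
    "b": [("t", 1)],
}
top2 = k_cheapest_simple_paths(graph, "s", "t", 2)
```

Here top2 is [(4, ['s', 'a', 'b', 't']), (5, ['s', 'a', 't'])]; asking for k=3 would add the direct edge of cost 10.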

How to find widest paths collection on a directed weighted graph

Consider the following graph:
nodes 1 to 6 are connected with a transition edge that have a direction and a volume property (red numbers). I'm looking for the right algorithm to find paths with a high volume. In the above example the output should be:
Path: [4,5,6] with a minimal volume of 17
Path: [1,2,3] with a minimal volume of 15
I've looked at Floyd–Warshall algorithm but I'm not sure it's the right approach.
Any resources, comments or ideas would be appreciated.
Finding a beaten graph:
In the comments, you clarify that you are looking for "beaten" paths. I assume this means that you are trying to contrast the paths with the average; for instance, looking for paths which can support weight at least e*w, where 0<e and w is the average edge weight. (You could have any number of contrast functions here, but the function you choose does not affect the algorithm.)
Then the algorithm to find all paths that meet this condition is incredibly simple and only takes O(m) time:
Loop over all edges to find the average weight. (Takes O(m) time.)
Calculate the threshold based on the average. (Takes O(1) time.)
Remove all edges which do not support the threshold weight. (Takes O(m) time.)
Any path in the resulting graph will be a member of the "widest path collection."
Example:
Consider that e=1.5. That is, you require that a beaten path support at least 1.5x the average edge weight. Then in graph you provided, you will loop over all the edges to find their average weight, and multiply this by e:
((20+4)+15+3+(2+20)+(1+1+17))/9 = 9.2
9.2*1.5 = 13.8
Then you loop over all edges, removing any that have weight less than 13.8. Any remaining paths in the graph are "beaten" paths.
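A minimal sketch of the averaging and filtering steps follows. The nine weights are the ones from the arithmetic above, but the endpoints assigned to them are invented, because the figure is not reproduced here. The function name beaten_edges is also made up.

```python
def beaten_edges(edges, factor):
    """Keep the edges whose weight is at least factor * average weight.

    edges: list of (u, v, weight) triples. Returns (threshold, kept_edges).
    """
    average = sum(w for _, _, w in edges) / len(edges)    # O(m)
    threshold = factor * average                          # O(1)
    kept = [(u, v, w) for u, v, w in edges if w >= threshold]  # O(m)
    return threshold, kept


# Weights from the example; the node pairs are hypothetical.
edges = [(1, 2, 20), (1, 4, 4), (2, 3, 15), (2, 5, 3), (3, 6, 2),
         (4, 5, 20), (5, 1, 1), (6, 4, 1), (5, 6, 17)]
threshold, kept = beaten_edges(edges, 1.5)
```

Without the intermediate rounding to 9.2, the threshold comes out at about 13.83, and the four edges with weights 20, 15, 20 and 17 survive.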
Enumerating all beaten paths:
If you then want to find the set of beaten paths with maximal length (that is, they are not "parts" of paths), the modified graph must be a DAG (because a cycle can be repeated infinitely many times). If it is a DAG, you can find the set of all maximal paths by:
In your modified graph, select the set of all source nodes (no incoming edges).
From each of these source nodes, perform a DFS (allowing repeated visits to the same node).
Every time you get to a sink node (no outgoing edges), write down the path that you took to get here.
This will take up to O(IncompleteGamma[n,1]) time (super exponential), depending on your graph. That is, it is not very feasible.
Finding the widest paths:
An actually much simpler task is to find the widest paths between every pair of nodes. To do this:
Start from the modified graph.
Run Floyd-Warshall's, using pathWeight(i,j,k+1) = max[pathWeight(i,j,k), min[pathWeight(i,k+1,k), pathWeight(k+1,j,k)]] (that is, instead of adding the weights of two paths, you take the minimum volume they can support).
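The recurrence can be sketched as the usual in-place Floyd-Warshall triple loop with addition replaced by min and minimisation replaced by max. Node numbering from 0, the use of 0 for "no path", and the sample edges are assumptions for the example.

```python
def widest_paths(n, edges):
    """width[i][j] = largest bottleneck volume over all paths from i to j.

    edges: (u, v, volume) triples on nodes 0..n-1; 0 means "no path".
    """
    width = [[0] * n for _ in range(n)]
    for u, v, w in edges:
        width[u][v] = max(width[u][v], w)
    for k in range(n):                       # allow k as an intermediate node
        for i in range(n):
            for j in range(n):
                # A path through k supports the smaller of its two halves.
                via_k = min(width[i][k], width[k][j])
                if via_k > width[i][j]:
                    width[i][j] = via_k
    return width


# Hypothetical volumes: the detour 0 -> 1 -> 2 is wider than the direct edge.
width = widest_paths(3, [(0, 1, 20), (1, 2, 15), (0, 2, 4)])
```

In this small example width[0][2] is 15, because the two-hop route can carry min(20, 15) while the direct edge only carries 4. The run time is O(n^3), as with the standard algorithm.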

Counting the number of shortest paths through a node in a DAG

I'm looking for an algorithm to count the number of paths crossing a specific node in a DAG (similar to the concept of 'betweenness'), with the following conditions and constraints:
I need to do the counting for a set of source/destination nodes in the graph, and not all nodes, i.e. for a middle node n, I want to know how many distinct shortest paths from set of nodes S to set of nodes D pass through n (and by distinct, I mean every two paths that have at least one non-common node)
What are the algorithms you may suggest to do this, considering that the DAG may be very large but sparse in edges, and hence preference is not given to deep nested loops on nodes.
You could use a breadth first search for each pair of Src/Dest nodes and see which of those have your given node in the path. You would have to modify the search slightly such that once you've found your shortest path, you continue to empty the queue until you reach a path that causes you to increase the size. In this way you're not bound by random chance if there are multiple shortest paths. This is only an option with non-weighted graphs, of course.
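One way to avoid enumerating the paths themselves is to let the BFS count them. This is a variation on the answer above rather than the exact procedure it describes: a BFS from each source records the distance and the number of shortest paths to every node, and a shortest path from s to d passes through n exactly when dist(s,n) + dist(n,d) = dist(s,d), in which case there are count(s,n) * count(n,d) such paths. The sketch assumes an unweighted graph given as an adjacency dict, and all names are illustrative.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, count): BFS distance and number of shortest paths from source."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:                # first time v is reached
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:       # u is on a shortest path to v
                count[v] += count[u]
    return dist, count


def shortest_paths_through(adj, sources, dests, n):
    """Number of shortest source-to-destination paths that pass through n."""
    dist_n, count_n = bfs_counts(adj, n)
    total = 0
    for s in sources:
        dist_s, count_s = bfs_counts(adj, s)
        if n not in dist_s:
            continue
        for d in dests:
            if d in dist_n and dist_s[n] + dist_n[d] == dist_s[d]:
                total += count_s[n] * count_n[d]
    return total


# Hypothetical DAG: three shortest s -> d paths, two of which go through n.
dag = {"s": ["a", "b", "c"], "a": ["n"], "b": ["n"], "c": ["e"],
       "n": ["d"], "e": ["d"]}
through_n = shortest_paths_through(dag, ["s"], ["d"], "n")
```

This needs one BFS per source plus one from n, which stays cheap on a large sparse graph.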

Algorithm for finding all critical paths

I need to create an algorithm that could find all the critical paths in a graph.
I have found the topological order of the nodes, calculated earliest end times and latest start times for each node.
Also I have found all critical nodes (i.e. the ones that are on a critical path).
The problem is putting it all together and actually printing out all these paths. If there is only 1 critical path in a graph, then I can deal with it but the issues start if there are multiple paths.
For example one node being part of several critical paths, multiple start nodes, multiple end nodes and so on. I have not managed to come up with an algorithm that could take all of these factors into account.
The output I am looking for is something like this (if a,b,c etc are all nodes):
a->e
a->c->f->i->j->k
a->c->g
l->e
It would be nice if someone could write a description of an algorithm that could find the paths from knowing critical nodes, topological order, etc. Or maybe also in C or java code.
EDIT:
Here is an example, which should provide the output I posted earlier. Critical paths are red, the values of each node are marked above or near it.
The computation of latest start times nearly provides the critical path as well. You need to construct the results from the terminal nodes, and go backwards:
find all nodes that have the maximum earliest end time (t=11, nodes = {e,g,k})
for each of them, find all predecessors that have the locally maximum earliest end time.
For e, both l and a have t=2, so you get l->e, a->e. For g, t=6, which is for node c, so you get c->g. For k, t=10, but only j has this as its end time, so you get j->k.
repeat until you reach the start node. l->e and a->e already begin at start nodes. c->g has the only predecessor a, so you get a->c->g. j->k has t=9, which is i's end time, so you get i->j->k.
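The backward walk described above could be sketched as follows, assuming the earliest end times have already been computed and the predecessors of each node are known. The small graph and its times are invented (the original figure is not available), and the function name critical_paths is made up.

```python
def critical_paths(preds, earliest_end):
    """Walk back from the latest-finishing nodes along the latest predecessors.

    preds maps a node to its list of predecessors (missing or empty for start nodes).
    earliest_end maps a node to its earliest end time.
    """
    latest = max(earliest_end.values())
    terminals = [v for v, t in earliest_end.items() if t == latest]
    paths = []

    def extend(path):
        candidates = preds.get(path[0], [])
        if not candidates:                    # reached a start node
            paths.append(path)
            return
        best = max(earliest_end[p] for p in candidates)
        for p in candidates:
            if earliest_end[p] == best:       # branch on every tied predecessor
                extend([p] + path)

    for t in terminals:
        extend([t])
    return paths


# Hypothetical data: g and e both finish last, and e has two tied predecessors.
preds = {"b": ["a"], "c": ["a"], "g": ["b", "c"], "e": ["l", "a"]}
earliest_end = {"a": 2, "l": 2, "b": 3, "c": 6, "g": 11, "e": 11}
paths = critical_paths(preds, earliest_end)
```

For this data the result contains a->c->g, l->e and a->e. Branching on ties is what lets one node appear on several critical paths and handles multiple start and end nodes.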
I don't understand your question entirely: what is a critical path? You might be helped by an algorithm to calculate a Minimum Spanning Tree or by Dijkstra's shortest path algorithm, although you probably already know these algorithms.

Resources