We have a system where customers interact, trigger jobs, and perform many actions. We have thousands of such users. Each job has a name, and our backend database holds all the data about the customer interactions.
These jobs fail often. We know why a particular job failed based on its inputs, but now we want to find the path (journey) a user took before reaching the failing job. We want to see whether we can improve the experience earlier in the journey so that the failure is avoided.
Example (hypothetical): login -> create file -> save file -> download file. Download file is failing with some error. Say this usually happens when a save has just completed; if you perform some other operation between save file and download file, the download does not fail. That is possibly a hidden bug.
The question is: given the traversal history of 3000 users (take paths of size 5, as a moving window), build a system that, when asked
"what are the most probable paths to reach node X",
gives the top 5 most probable paths to reach X.
I have created the nodes as [jobName][State], for example loginSuccess->createFileSuccess->SaveFileSuccess->DownloadFailed. X will typically be a [jobName]Failed node that we query.
We have about 50 jobs and 3 states: success, failed, and cancelled.
Any idea how to build this model, which algorithm to use, and how to generate the probabilities in reverse when a node is queried?
Adding some more clarity:
Given a target node, can I list the most probable paths of length 5 that reach it? I don't know the starting points from which to run Dijkstra's algorithm. Also, a direct low-probability path might exist from a given starting node straight to the target node, but I need to find paths of length 5.
The first step I would take is to construct a list of records of length 5, where each record contains the 5 steps a particular customer took leading up to node X. Then you can simply sort this list and count how many times each distinct record occurs, to work out the most popular records.
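A minimal sketch of this counting approach in Python, assuming each user journey is already available as an ordered list of "[jobName][State]" node names (the journeys list and the top_paths_to helper are illustrative, not part of the original system):

from collections import Counter

def top_paths_to(journeys, target, length=5, k=5):
    # Count every window of `length` consecutive steps that ends at `target`
    # and return the k most frequent windows.
    counts = Counter()
    for journey in journeys:
        for i in range(len(journey) - length + 1):
            window = tuple(journey[i:i + length])
            if window[-1] == target:
                counts[window] += 1
    return counts.most_common(k)

# Hypothetical journeys; length=4 here only because the sample journey has 4 steps.
journeys = [
    ["loginSuccess", "createFileSuccess", "SaveFileSuccess", "DownloadFailed"],
    # ... one list per user session
]
print(top_paths_to(journeys, "DownloadFailed", length=4, k=5))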
Another approach would be to assign each edge leaving a node a score equal to the fraction of the paths exiting that node that left via that edge. Then compute the overall score for a path by multiplying together the scores of its edges, and again take the observed paths with the highest scores.
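A matching sketch of this edge-score variant, reusing the same journey lists as above; here the score of an observed window is the product of the estimated transition probabilities along it:

from collections import Counter, defaultdict

def edge_probabilities(journeys):
    # Estimate P(next | current) from all observed transitions.
    out_counts = defaultdict(Counter)
    for journey in journeys:
        for a, b in zip(journey, journey[1:]):
            out_counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in out_counts.items()}

def path_score(path, probs):
    # Product of the edge probabilities along the path.
    score = 1.0
    for a, b in zip(path, path[1:]):
        score *= probs.get(a, {}).get(b, 0.0)
    return score

Ranking the observed windows by path_score instead of by raw count gives the second variant.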
From what I have understood, you need to find the path most likely followed by users. You can make a node for each process, and two processes are connected if a customer goes from one to the other.
STEP 1. Construct a weighted graph over all 3000 users (the weight of an edge is the number of times a user goes from one process to another, so each time you encounter an edge that already exists, increment its weight by 1; otherwise create a new edge with weight = 1).
Now, to find the most probable path from one node to another:
STEP 2. Apply Dijkstra's algorithm, but with a small change. Dijkstra's algorithm finds the smallest-weight path from one node to every other node, so here you need to find the maximum-weight path from one node to another.
I think it should work, since all the edges have positive weight; it will give you the most probable path taken from one node to another by all users, and you can easily get the data for all nodes from the source to the destination node.
But it will only give you the single most probable path, not the top 5.
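One standard way to make this work with Dijkstra's algorithm (not spelled out above) is to normalize the STEP 1 edge counts into transition probabilities and use -log(p) as the edge weight, so that the minimum-weight path is the maximum-probability path. A rough sketch with illustrative names:

import heapq
import math

def most_probable_path(counts, source, target):
    # counts[a][b] = number of observed a -> b transitions (the STEP 1 weights).
    # Convert counts to -log probabilities: minimizing the sum of weights then
    # maximizes the product of transition probabilities.
    weight = {a: {b: -math.log(n / sum(nbrs.values())) for b, n in nbrs.items()}
              for a, nbrs in counts.items()}
    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue
        for nbr, w in weight.get(node, {}).items():
            nd = d + w
            if nd < dist.get(nbr, math.inf):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None                      # target not reachable from source
    path, node = [target], target
    while node != source:
        node = prev[node]
        path.append(node)
    return path[::-1]

Getting the top 5 paths rather than just one would additionally need a k-shortest-paths method (for example Yen's algorithm) over the same -log weights.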
What I want to solve:
I want to detect a newly created unreachable subgraph (marking the nodes that become unreachable) in a given directed graph if I cut a specific edge.
The restrictions of this problem:
The given graph is a directed graph (see useful information below).
The number of nodes is more than 100,000.
The number of edges is around 1.5x the number of nodes.
Running time of the solution should be less than a second.
Information that might be useful:
The given graph was made by connecting numerous cycles, and there is at least one route from any node to any other node.
A few (~10%) of the nodes have a branch. No node in the graph has more than 3 edges.
The meaning of "unreachable area" includes "not connected", but you can ignore this if you think it mixes two different problems into one.
My trials
When I met this problem, I tried four approaches, but had no luck with any of them.
Find another path that can replace the cut edge.
This method was rejected because of its running time. Currently we use Dijkstra's algorithm for path finding, and when I tried this method by putting it into the job queue, the queue was flooded in less than an hour.
Check the level of edges (like a packet's Time-To-Live in networking).
Search from the edge's node with a given threshold level.
If I meet a branch, keep the previous level. Otherwise, decrease the level.
If the level is 0, do nothing.
The current temporary solution is this one, but obviously it ignores a lot of corner cases.
Simulate a flow network on the graph.
It's simple:
Give a threshold (like 100) to every node and simulate its flow.
If I meet a branch, split the number across each branch.
Check for values that are lower than 1.
But this method was also rejected because of its time complexity.
SCC and Topological sorting.
Lastly, I checked the strongly connected components together with topological orders. (Of course I know I used the wrong word; see below.)
The idea is: topological sorting is used for DAGs (Directed Acyclic Graphs), but if I add some rules (like "if I detect a cycle, treat that cycle as a virtual node, recursively", i.e. using SCCs), I can compute a "topological order" for a general directed graph. Given that order, I can tell whether there is an area that is unreachable. (It's hard to explain; think about it together with method 3, the flow-network simulation.)
I think this approach is the best one and might solve the problem, but I have no idea which keywords I should search for and learn about, and the same goes for the implementation.
EDIT
I forgot to explain what unreachable means. If there is no route from a node (node 'A') to any other node, node 'A' is "unreachable". Initially, in the given digraph, no unreachable nodes exist.
In this problem, let's assume that node 1 is the master node. If there is no route from node 1 to node 2, then node 2 is unreachable.
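For what it's worth, under this definition a single breadth-first search from the master node, skipping the cut edge, already marks every node that becomes unreachable in O(V + E) time; with ~100,000 nodes and ~150,000 edges that should finish well under a second. A rough sketch, assuming an adjacency-list representation (the names are illustrative):

from collections import deque

def unreachable_after_cut(adj, master, cut_edge):
    # adj: dict mapping node -> list of successor nodes; cut_edge: (u, v) to remove.
    cu, cv = cut_edge
    seen = {master}
    queue = deque([master])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if (node, nxt) == (cu, cv):   # pretend the cut edge is gone
                continue
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return [n for n in adj if n not in seen]   # nodes no longer reachable from master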
I got a question asking to find the minimum cost from the lowest-numbered node (1) to the highest-numbered node (7).
The cost is on the edge between nodes; I labeled them.
This problem got me thinking of Dijkstra's algorithm, which gives a time complexity of O((V + E) log V).
Is there any better approach to solving this question efficiently?
The other requirement is to keep the path information; any thoughts on how to keep the path?
As others pointed out, the complexity is as you say and cannot be better. As @nico-schertler commented, searching from both sides in parallel (or taking turns) and stopping as soon as the two searches touch is faster than searching from one side only, but it has the same complexity. This is possible in this case (with fixed costs for the bidirectional edges), but it need not be in the general case (e.g. cost depending on the path already taken), where Dijkstra is still applicable.
Concerning the keeping of the path: Of course, the whole thing often only makes sense if you get the path to be taken as an answer. There are two main approaches to get the path as a result.
One is to store the path already taken to a certain node along with the node in the lists (white/grey in the classical implementation). Each time you add a new node, you extend the path of its predecessor node by one step. If you find the target node, you can directly return the path as a result (along with the cost sum). Of course, this approach uses a lot of memory.
The other is to store only the origin node along with each newly found node, so each node points to the node it was first visited from. Think of it as putting up signposts in each node showing how to go back. This way, if you find the target node, you have to walk backwards from node to node, building the path in reverse order as you go. Then you can return this path.
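A small sketch of this signpost idea, assuming the search has already filled a predecessor map prev that records, for each visited node, the node it was first reached from (all names here are illustrative):

def reconstruct_path(prev, source, target):
    # Walk the signposts backwards from target to source, then reverse.
    path, node = [target], target
    while node != source:
        node = prev[node]        # follow the signpost one step back
        path.append(node)
    return path[::-1]            # source ... target

# Example: a prev map built during a search on the 1 -> 7 question above.
prev = {2: 1, 4: 2, 7: 4}
print(reconstruct_path(prev, 1, 7))   # [1, 2, 4, 7]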
I'm wondering if there is a more elegant solution to this problem. The brute-force approach (depth-first search) is too computationally intensive.
You are given a network of nodes interconnected with paths. Each path has a distance and zero or more elements along the path that can only be collected once every five minutes. Collecting those elements increases your score.
The goal is to plan out the next five minutes of path traversal, keeping in mind the paths that have been traversed already in the last five minutes, so as to maximize the score increase.
The brute-force algorithm is to try every possible route from the current location, avoiding places we have already been, stopping when we have traveled our max planning distance or time, and keeping a virtual tally of rewards collected. Then all we have to do is choose the route with the highest score.
Unfortunately, the number of nodes and paths in the graph is high enough that planning out even just five minutes worth of travel requires too much computation.
Is there a known algorithm that solves this problem more efficiently than the brute-force method? Even if it only finds an approximate solution, and not an optimal one.
EDIT
Thank you @SaiBot, here is my final solution, in case anyone should ever find themselves asking this same question:
I assigned every path, going from node A to node B, a unique ID. The path from B to A had its own ID. Outside the DFS search function, but accessible to it, I kept a hash keyed by the ID, whose value holds both the distance traveled prior to taking this path and the size of the reward received so far. To minimize extra work, I made sure that at each node the outgoing paths were sorted shortest to longest. Then, when the DFS algorithm was asked to evaluate a path it had evaluated before, the first thing it inspects is that cached result. If the new traversal arrives at the path with:
( reward <= previous_reward && distance >= previous_distance )
|| reward / distance <= previous_score
Then it is reasoned that there will be no benefit to recursing down this path again, so it returns immediately with a score of 0 to disqualify it from consideration. Otherwise, it records the new incoming reward, distance, and score in the cache, and proceeds normally.
In addition, I did one other thing. I reasoned that I wanted a certain amount of novelty in the path, meaning I didn't want it to just find one tiny little path that gets maximum reward, I wanted it to explore the map. So I added a filter to outgoing nodes, saying that if the node has been visited in the past X minutes, remove it from consideration. This had the side-effect of allowing the algorithm to route itself into a corner, so I added a fall-back, where if there were no available options, it would sort the outgoing paths by last visited, oldest first, and try in that order.
The result was decent, but I'm going to do some more experiments to see if I can get even better results.
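For concreteness, here is a rough sketch of the cache-pruned DFS described above. The data layout, the names, and the omission of the five-minute respawn timing and the novelty filter are all simplifying assumptions, not the actual implementation:

def plan(graph, start, max_distance):
    # graph[node] = list of (path_id, neighbor, path_distance, path_reward),
    # already sorted shortest path first; path distances are assumed positive.
    cache = {}  # path_id -> (distance_so_far, reward_so_far, score_so_far)

    def dfs(node, distance, reward, route):
        best = (reward / distance if distance else 0.0, route)
        for path_id, nbr, d, r in graph.get(node, []):
            nd, nr = distance + d, reward + r
            if nd > max_distance:
                continue
            if path_id in cache:
                pd, pr, ps = cache[path_id]
                # The pruning rule from above: no benefit in recursing again.
                if (nr <= pr and nd >= pd) or nr / nd <= ps:
                    continue
            cache[path_id] = (nd, nr, nr / nd)
            candidate = dfs(nbr, nd, nr, route + [path_id])
            if candidate[0] > best[0]:
                best = candidate
        return best

    return dfs(start, 0.0, 0.0, [])   # (best score, list of path IDs to follow)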
Your problem is closely related to Pareto-optimal path computation in multi-criteria networks, e.g., as described in this paper.
If you had just one criterion (like distance) associated with each edge, then Dijkstra would let you quickly find the optimal path. This is possible since you can "discard" a path that arrives at a node if another path reaching that node already has a lower distance.
The problem arises when you have two or more criteria (e.g., distance and reward) associated with each edge. Now, if two paths (starting from your start node) lead to the same node, and path_1 has a lower distance than path_2 but path_2 has a higher reward than path_1, you cannot discard either. However, if both criteria of a path are worse than those of another path, you are able to discard it.
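A small sketch of this dominance test, keeping per node only the non-dominated (distance, reward) labels; the label layout and names are assumptions:

def dominates(a, b):
    # Label a = (distance, reward) dominates b if it is no worse in both
    # criteria (lower-or-equal distance, higher-or-equal reward) and differs.
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def add_label(labels, node, new):
    # Insert `new` into labels[node] unless an existing label dominates it;
    # drop any existing labels that `new` dominates. Returns True if kept.
    kept = labels.setdefault(node, [])
    if any(dominates(old, new) for old in kept):
        return False
    labels[node] = [old for old in kept if not dominates(new, old)] + [new]
    return True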
One possible algorithm to do the complete search is described in the above paper.
Edit
My answer above does not consider elements reappearing during the route. If you want to include this, you would have to know when and where elements reappear during route planning. This, however, will make things a lot more complicated, since you could achieve a higher reward by "waiting" for elements to respawn.
There is one thing about the A* pathfinding algorithm that I do not understand. In the pseudocode: if the current node's (the node being analysed) g cost is less than the adjacent node's g cost, then recalculate the adjacent node's g, h and f costs and reassign the parent node.
Why do you do this?
Why do you need to re-evaluate the adjacent node's costs and parent if its gCost is greater than the current node's gCost? In what instance would you need to do this?
Edit: I am watching this video
https://www.youtube.com/watch?v=C0qCR18gXdU
At 8:19 he says: When you come across blocks (nodes) that have already been analysed, the question is: should we change the properties of the block?
First a tip. You can actually add the time you want as a bookmark to get a video that starts right where you want. In this case https://www.youtube.com/watch?v=C0qCR18gXdU#t=08m19s is the bookmarked time link.
Now the quick answer to your question. We fill in a node the first time we find a path to it. But the first path we found to it might not be the cheapest one. We want the cheapest one, and if we find a cheaper one second, we want it.
Here is a visual metaphor. Imagine a running path with a fence next to it. The spot we want is on the other side of the fence. Actually draw this out, it will help.
The first path that our algorithm finds to it is run down the path, jump over the fence. The second path that we find is run part way down the path, go through the gate, then get to that spot. We don't want to throw away the idea of using the gate just because we already figured out that we could get there by jumping the fence!
Now in your picture put costs of moving from one spot to another that are reasonable for moving along a running path, an open field, through a gate, and jumping a fence. Run the algorithm by hand and you'll see that you figure out first that you can jump the fence, and then later that you really wanted to use the gate.
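For reference, a minimal sketch of the relaxation step being discussed; the names (g, parent, open_set, heuristic) are illustrative and not taken from the video's pseudocode:

import heapq

def relax(current, neighbor, edge_cost, g, parent, open_set, heuristic):
    # If going through `current` is cheaper than the best route known so far,
    # update the neighbor's g cost and f cost and re-point its parent.
    tentative_g = g[current] + edge_cost
    if tentative_g < g.get(neighbor, float("inf")):
        g[neighbor] = tentative_g          # cheaper path found: keep it
        parent[neighbor] = current         # the "gate" replaces the "fence"
        f = tentative_g + heuristic(neighbor)
        heapq.heappush(open_set, (f, neighbor))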
This guy is totally wrong, because he says to change the parent node; however, your successors are based on your parent node, and if you change the parent node then you can't have a valid path, because the path is built simply by moving from parent to child.
Instead of changing the parent, use the pathmax correction: if a parent node A has a child node whose f cost (accumulated cost plus heuristic) is less than A's f cost, then set the cost of the child equal to the cost of the parent.
Pathmax to ensure monotonicity:
Monotonicity: every child node has a cost greater than or equal to the cost of its parent node, i.e. the cost never decreases along a path.
A* has a property: if the cost is monotonically increasing, then the first (sub)path that A* finds to a node is always part of the final path. More precisely: under monotonicity, each node is first reached through the best path.
Do you see why?
Suppose you have a graph: (A,B) (B,C) (A,E) (E,D), where every tuple means the two nodes are connected. Suppose the cost is monotonically increasing and your algorithm chooses (A,B), (B,C); at this point you know your algorithm has chosen the best path so far, and every other path that can reach this node must have a higher cost. But if the cost is not monotonically increasing, then it can be the case that (A,E) has a cost greater than your current cost while (E,D) costs zero, so a better path may exist there.
This algorithm relies on its heuristic function: if the heuristic underestimates, the error gets corrected by the accumulated cost, but if it overestimates, A* can return a suboptimal path, and I leave it to you to work out why this happens.
Why do you need to re-evaluate an adjacent node that's already in the open list if it has a lower g cost than the path through the current node?
Don't do this because it's just extra work.
Corollary: if you later reach the same node from some node p with the same cost, simply remove that entry from the queue. Do not expand it.
I have a gigantic directed graph (100M+ nodes), with multiple path instance records between sets of nodes. The path taken between any two nodes may vary, but what I'd like to find are paths that share multiple intermediary nodes except for a major deviation.
For example, I have 10 instances of a path between node A and node H. Nine of those ten path instances travel through nodes c,d,e,f - but one of the instances travels through c,d,z,e,f - I want to find that "odd" instance.
Any ideas how I would even begin to approach such a problem? Existing analytical frameworks that might be suited to the task?
Details based on comments:
A PIR (path instance record) is a list of nodes traveled through with associated edge traversal times per edge.
Currently, raw PIR records are in a plain string format - obviously, I would want to store them differently depending on how I eventually choose to analyze them.
This is not a route solving problem - I never need to find all possible paths; I only need to analyze taken paths (each of which is a PIR).
The list of subpaths needs to be generated from the PIRs.
An example of a PIR would be something like:
nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF
This translates to the path A->B->C->D->F; the cost/time of each edge is the number between the nodes - for instance, it cost 300 to go from A->B, 600 to go from B->C, and 100 to go from D->F. The cost/time of each traversal will differ each time the traversal is made. So, for instance, in one PIR it may cost 100 to go from A->B, but in the next it may cost 150.
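A tiny sketch of parsing such a PIR string into (from, cost, to) edges; the function name and tuple layout are just one possible choice:

def parse_pir(record):
    parts = record.split(";")
    nodes = parts[0::2]                     # every even position is a node
    costs = [int(c) for c in parts[1::2]]   # every odd position is an edge cost
    return list(zip(nodes, costs, nodes[1:]))

print(parse_pir("nodeA;300;nodeB;600;nodeC;100;nodeD;100;nodeF"))
# [('nodeA', 300, 'nodeB'), ('nodeB', 600, 'nodeC'),
#  ('nodeC', 100, 'nodeD'), ('nodeD', 100, 'nodeF')]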
Go through the list of paths and break them into sets based on the start and end node, so that, for example, all paths that start at node A and end at node B are in the same set. Then do the same thing with subsequences of those paths, so that, for example, every path with the subsequence a,b,c,d, start node y, and end node k is in the same set. Also reverse paths as required, so that you don't end up with both a set for paths from k to y and a set for paths from y to k. You can then check whether a subsequence is common enough, and for the path(s) that don't contain that subsequence, check whether they contain a subsequence that is sufficiently close to the original one based on edit distance. If you are just interested in the path, you can simply calculate the edit distance between the path and the subsequence, subtract the difference in length, and check whether the result is low enough. It's probably best to use a subsequence of the path that starts and ends with the same nodes as the desired subsequence.
For your example, the algorithm would eventually reach the set of paths containing the subsequence c,d,e,f and find that there are 9 of them. This exceeds the count required for the subsequence to be common enough (and long enough; you probably want sequences of at least length k). It would then check the paths that are not included; in this case there is only one. It would then note, either directly or indirectly, that only the removal of z is needed to turn the sequence c,d,z,e,f into c,d,e,f. This passes the (currently vague) requirements for "odd", and thus the path containing c,d,z,e,f is added to the list of paths to be returned.
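A rough sketch of this grouping-plus-edit-distance idea, operating on the node sequences taken from the PIRs. The thresholds (min_support, max_edits) and the choice of the most frequent sequence in each group as the "common" one are simplifying assumptions:

from collections import defaultdict

def edit_distance(a, b):
    # Classic Levenshtein distance between two node sequences.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def find_odd_paths(paths, min_support=5, max_edits=1):
    # Group node sequences by (start, end); within each group, treat the most
    # frequent sequence as the "common" one and report sequences within
    # max_edits edits of it.
    groups = defaultdict(list)
    for p in paths:
        groups[(p[0], p[-1])].append(tuple(p))
    odd = []
    for seqs in groups.values():
        counts = defaultdict(int)
        for s in seqs:
            counts[s] += 1
        common, support = max(counts.items(), key=lambda kv: kv[1])
        if support < min_support:
            continue
        for s in seqs:
            if s != common and 0 < edit_distance(s, common) <= max_edits:
                odd.append(s)
    return odd

paths = [["A", "c", "d", "e", "f", "H"]] * 9 + [["A", "c", "d", "z", "e", "f", "H"]]
print(find_odd_paths(paths))   # [('A', 'c', 'd', 'z', 'e', 'f', 'H')]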