Searching for shortest path - algorithm

When searching for a shortest path, isn't it always better to use lists of connected nodes (adjacency lists) instead of a grid (an adjacency matrix)?
With a matrix, you have to iterate over an entire row every time you want a node's neighbors, whereas using lists saves a lot of time.

With an adjacency matrix, enumerating the neighbors of a node costs you O(n) time, which may be a bit slower than walking a list of connected nodes. However, you can do some fancy stuff with a matrix. For example, if you want to delete a lot of edges, each deletion is O(1) with an adjacency matrix (it may take a lot longer with a list of nodes, depending on what data structure you use for it). An adjacency matrix is also, literally, a matrix. What do I mean by that? If you want to check in how many ways you can get from node A to node B in exactly k steps, you can raise the matrix to the power of k, something that has no counterpart with a list.
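As a rough sketch of that last trick (using numpy and a made-up 3-node cycle; the graph here is purely illustrative):

    import numpy as np

    # Adjacency matrix of a small directed graph: A[i][j] = 1 iff there is an edge i -> j.
    # The graph here is just the cycle 0 -> 1 -> 2 -> 0.
    A = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])

    # Entry (i, j) of A^k counts the walks of length exactly k from i to j.
    walks = np.linalg.matrix_power(A, 3)
    print(walks[0][0])  # 1: the only 3-step walk from 0 back to 0 goes around the cycle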

Related

What's the best pathfinding algorithm in terms of complexity?

I need to implement a pathfinding algorithm in one of my programs. The goal is to know whether a path exists or not. As a consequence, knowing the path itself isn't important.
I have already done some research and I am not sure which one to pick. This post suggests that a DFS or a BFS would be more suitable for this kind of program, but I'd rather have confirmation given the exact situation. I would also be interested in knowing the complexity of the program itself, but I guess I can work that out, so it's fine if it's not shared.
Here's the graph I am using: let's say I have an x*y grid with zones the path can and cannot take.
I want to know if there is an existing path that starts from the top of the graph and ends on the bottom of the graph. Here's an example with the path in red:
I believe DFS is the best in complexity, but I am not sure exactly how to implement it given the different start points the path can take. I don't know whether it's better to launch the DFS from each of the possible starting points, or to add an extra row of traversable zones on top so that a single search suffices.
Thank you for your help!
There are a number of different approaches that you can take here. Assuming that the grids you're working with are of roughly the size that you're showing above, and assuming you aren't, say, processing millions of grids at once, chances are that both breadth-first search and depth-first search would work equally well. The advantage of breadth-first search is that it will find the shortest path from anywhere in the top to anywhere in the bottom; the disadvantage is that it typically requires more memory than depth-first search. But again, if you're working with grids on the order of, say, hundreds or thousands of cells each, chances are that this memory overhead isn't going to be too much of a problem. I'd say to pick whichever algorithm you feel most comfortable working with and go with it.
As for how to implement a search from "anywhere in the top" to "anywhere in the bottom," you can achieve this in a few different ways.
If you're using a depth-first search, you can run one depth-first search from each of the cells in the top row and search for a path down to the bottom row. DFS requires you to maintain some information about which cells have and have not been visited. If you recycle this same information across all the calls to DFS, you'll ensure that no two calls do any duplicated work, and so the resulting solution should be very efficient, running in time O(mn) for an m × n grid.
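A minimal sketch of this shared-visited idea in Python (assuming cells are 0 for open and 1 for blocked, which is an encoding of my choosing, not the asker's):

    def path_exists_dfs(grid):
        m, n = len(grid), len(grid[0])
        visited = [[False] * n for _ in range(m)]

        def dfs(r, c):
            if not (0 <= r < m and 0 <= c < n) or grid[r][c] == 1 or visited[r][c]:
                return False
            visited[r][c] = True
            if r == m - 1:  # reached the bottom row
                return True
            return (dfs(r + 1, c) or dfs(r - 1, c) or
                    dfs(r, c + 1) or dfs(r, c - 1))

        # One DFS per top-row cell, all sharing the same visited array, so each
        # cell is explored at most once across all calls: O(mn) total.
        return any(dfs(0, c) for c in range(n))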
If you're using a breadth-first search, the modification is pretty straightforward: instead of just enqueuing a single start point in the queue at the beginning of the search, enqueue every cell in the top row at the beginning of the search. The BFS will then naturally explore all possible paths starting anywhere in the top row.
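And the corresponding multi-source BFS sketch, under the same assumed 0/1 grid encoding:

    from collections import deque

    def path_exists_bfs(grid):
        m, n = len(grid), len(grid[0])
        visited = [[False] * n for _ in range(m)]
        queue = deque()
        # Seed the queue with every open cell in the top row, not just one start point.
        for c in range(n):
            if grid[0][c] == 0:
                visited[0][c] = True
                queue.append((0, c))
        while queue:
            r, c = queue.popleft()
            if r == m - 1:
                return True  # reached the bottom row
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < m and 0 <= nc < n and grid[nr][nc] == 0 and not visited[nr][nc]:
                    visited[nr][nc] = True
                    queue.append((nr, nc))
        return False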
Both of these ideas can be thought of in a different way. Imagine your grid is a graph where each cell is a node and edges correspond to pairs of adjacent cells. You can then add in a new node that sits above the top row of the grid and is connected to each of the nodes in the top row. You then add in a new node that sits just below the bottom row and is connected to each of the nodes in the bottom row. Now, if there's a path from the new top node to the new bottom node, it means that there's a path from some node in the top row to some node in the bottom row, so doing a single search in this graph will be sufficient to check if a path exists. (Fun fact: the two above modifications to DFS and BFS can each be thought of as implicitly doing a search in this new graph.)
There's another option you might want to consider that's fairly easy to implement and imperceptibly less efficient than DFS or BFS, and that's to use a disjoint-set forest data structure to determine what's connected. This data structure supports two kinds of queries:
Given two cells, mark that there's a way to get from the first cell to the second. ("Union")
Given two cells, determine whether there's a path between them, which can be a direct path or could be formed by chaining together multiple other paths. ("Find")
You could implement your connectivity query by building a disjoint-set forest, unioning together all pairs of adjacent cells, and then unioning together all nodes in the top row and unioning all nodes in the bottom row. Doing a "find" query to see if any one of the top nodes is connected to any of the bottom nodes will then solve your problem. This will take time O(mn α(mn)) for a function α(mn) that grows so slowly that it's essentially three or four, so it's effectively as efficient as BFS or DFS.
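A compact sketch of the disjoint-set approach, which also reuses the virtual top/bottom node idea from above (same assumed 0/1 grid encoding; path halving stands in for the full union-by-rank machinery, so this is illustrative rather than tuned):

    def connected_top_to_bottom(grid):
        m, n = len(grid), len(grid[0])
        # Cell (r, c) becomes index r * n + c; two extra virtual nodes stand in
        # for "anywhere in the top row" and "anywhere in the bottom row".
        TOP, BOTTOM = m * n, m * n + 1
        parent = list(range(m * n + 2))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for r in range(m):
            for c in range(n):
                if grid[r][c] == 1:
                    continue
                if r == 0:
                    union(r * n + c, TOP)
                if r == m - 1:
                    union(r * n + c, BOTTOM)
                # Union with the open neighbors below and to the right.
                if r + 1 < m and grid[r + 1][c] == 0:
                    union(r * n + c, (r + 1) * n + c)
                if c + 1 < n and grid[r][c + 1] == 0:
                    union(r * n + c, r * n + c + 1)

        return find(TOP) == find(BOTTOM)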

What is the time complexity of finding a universal sink, given the adjacency list representation?

There are many variants of this question asking for a solution in O(|V|) time.
But what is the worst-case bound if I want to determine whether there is a universal sink in the graph, given that the graph is represented with adjacency lists? This matters because all the other algorithms seem to work better with adjacency lists, so if finding a universal sink is not an operation I need frequently, I will definitely go with lists rather than a matrix.
In my opinion, the time complexity would be the size of the graph, that is, O(|V| + |E|). The algorithm for finding a universal sink of a graph is as follows. Assuming an in-neighbor list, start from index 1 of the graph. Check the length of the adjacency list at index 1; if it is |V| - 1, traverse the list to check whether there is a self-loop. If the list has no self-loop and all other vertices appear in it, store the list index. Then we must go through the other lists to check whether this vertex appears in any of them. If it does, the stored vertex cannot be a universal sink, and the search continues from the next index. Even if the list is an out-neighbor list, we still have to find the vertices whose lists have length 0, then search all other lists to check whether each such vertex appears in them.
As can be concluded from the above explanation, no matter which form of adjacency list is used, in the worst case finding the universal sink must traverse all the vertices and edges once; hence the complexity is the size of the graph, i.e. O(|V| + |E|).
But my friend, who has recently joined a university as an assistant professor, maintains it has to be O(|V| * |V|). I am reviewing his notes before he starts teaching the course in the spring, and before correcting them I want to be one hundred percent sure.
You're quite correct. We can build the structures we need to track all of the intermediate results, but the basic complexity is still straightforward: we go through all of our edges once, marking and counting references. We can even build a full transition matrix in O(E) time.
Depending on the data structures, we may find an improvement by a second pass over all edges, but 2 * O(E) is still O(E).
Then we traverse each node once, looking for in/out counts and a self-loop.
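A sketch of those two passes in Python (assuming the graph is a dict from each vertex to its out-neighbor list, which is my representation choice, not the asker's):

    def universal_sink(graph):
        # Pass 1: one sweep over all edges to count in-degrees -- O(|V| + |E|).
        in_degree = {v: 0 for v in graph}
        for v in graph:
            for w in graph[v]:
                in_degree[w] += 1
        # Pass 2: one sweep over all vertices -- O(|V|). A universal sink has no
        # outgoing edges (and thus no self-loop) and an incoming edge from every
        # other vertex.
        n = len(graph)
        for v in graph:
            if not graph[v] and in_degree[v] == n - 1:
                return v
        return None

    print(universal_sink({1: [3], 2: [3], 3: []}))  # -> 3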

Neighbor Join Algorithm Output

I'm trying to implement the Neighbor Joining algorithm. At the moment I have it working as it should, calculating the correct lengths at each step and outputting the correct values.
However, I am struggling with obtaining the final output of the algorithm. I need it to output the overall calculated matrix representation, because I would like to represent it visually as a graph. Through each iteration of the main loop of the algorithm I get a subgroup of nodes that goes back into the start of the algorithm, but I don't believe this subgroup can be used as-is, because it contains redundant information and I can't really tell which parts will be needed in the final representation.
I am using this algorithm here: http://en.wikipedia.org/wiki/Neighbor_joining#The_algorithm
Any help would be fantastic and I can provide more information if needed, thanks.
I have read the link you provided and it seems to me that you do need the information.
Each step of the algorithm merges 2 nodes into 1, making your distance matrix smaller until everything is merged. You need to remember the distances of the nodes you merge to their resulting node. If you merge A and B, then the columns/rows of your distance matrix are replaced by a column/row belonging to a new node, u. You need to remember the distances of A and B to u.
After everything is merged, you should have all distances of all nodes that have to be connected and you can start visualizing.
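For the bookkeeping, here is a rough sketch of one merge step following the formulas on the linked Wikipedia page (the dict-of-dicts distance matrix, the edges list, and all the names are my own choices):

    def merge_step(dist, edges, u):
        # dist: dict-of-dicts distance matrix; edges collects (node, parent, length).
        # Assumes more than two nodes remain; the final pair is joined directly.
        nodes = list(dist)
        n = len(nodes)
        r = {i: sum(dist[i][j] for j in nodes if j != i) for i in nodes}
        # Pick the pair (f, g) minimizing the Q-criterion.
        f, g = min(((i, j) for i in nodes for j in nodes if i != j),
                   key=lambda p: (n - 2) * dist[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # Branch lengths from f and g to the new node u -- exactly the
        # information you need to remember for the final visualization.
        limb_f = dist[f][g] / 2 + (r[f] - r[g]) / (2 * (n - 2))
        edges.append((f, u, limb_f))
        edges.append((g, u, dist[f][g] - limb_f))
        # Distances from u to every remaining node, then drop f and g.
        dist[u] = {k: (dist[f][k] + dist[g][k] - dist[f][g]) / 2
                   for k in nodes if k not in (f, g)}
        for k in dist[u]:
            dist[k][u] = dist[u][k]
        del dist[f], dist[g]
        for k in dist:
            dist[k].pop(f, None)
            dist[k].pop(g, None)

Running this until two nodes remain, then connecting that last pair with their remaining distance, leaves edges holding every branch with its length, which is all a drawing routine needs.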

How to decide whether two persons are connected

Here is the problem:
Assuming two people are registered on a social networking website, how do you decide whether they are connected or not?
My analysis (after reading more): actually, the question is asking for the shortest path from A to B in a graph. I think both BFS and Dijkstra's algorithm work here, and the time complexity is exactly the same (O(V + E)), because the graph is unweighted, so we can't take advantage of a priority queue and a simple queue resolves the problem. But neither of them, as stated, addresses the other part of the problem: finding the actual path between them.
Bidirectional search should be a better solution at this point.
To find a path between the two, you should begin with a breadth first search. First find all neighbors of A, then find all neighbors of all neighbors of A, etc. Once B is hit, not only do you have a path from A to B, but you also have a shortest such path.
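A minimal sketch of that search with path reconstruction (the adjacency-dict representation is my assumption):

    from collections import deque

    def shortest_path(graph, a, b):
        # Standard BFS from a, remembering each node's predecessor so the
        # path itself can be read back once b is found.
        parent = {a: None}
        queue = deque([a])
        while queue:
            u = queue.popleft()
            if u == b:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]  # a shortest path from a to b
            for v in graph.get(u, ()):
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None  # a and b are not connected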
Dijkstra's algorithm rocks, and you may be able to speed this up by working from both ends, i.e. find neighbors of A and neighbors of B, and compare.
If you do a depth first search, then you're following one path at a time. This will be much much slower.
If you use DFS to find out whether two people are connected on a social network, it will take far too long!
You already know the two people, so you should use bidirectional search. But simple bidirectional search won't be enough for a graph as big as a social networking site; you will have to use some heuristics. The Wikipedia page has some links to them.
You may also be able to use A* search. From Wikipedia: "A* uses a best-first search and finds the least-cost path from a given initial node to one goal node (out of one or more possible goals)."
Edit: I suggest A* because "The additional complexity of performing a bidirectional search means that the A* search algorithm is often a better choice if we have a reasonable heuristic." So, if you can't form a reasonable heuristic, then use Bidirectional search. (Forming a good heuristic is never easy ;).)
One way is to use union-find: add all links with union(from, to), and if find(A) == find(B), then A and B are connected. This avoids the recursive search, but it actually computes the connectivity of all pairs and doesn't give you the path that connects A and B.
I think the true criterion is: there are at least N paths between A and B shorter than K, or A and B are connected directly. I would go with K = 3 and N near 5, i.e. having 5 common friends.
Any method might end up being very slow. If you need to do this repeatedly, it's best to find the connected components of the graph, after which the task becomes a trivial O(1) operation: if two people are in the same component, they are connected.
Note that finding connected components for the first time might be slow, but keeping them updated as new edges/nodes are added to the graph is fast.
There are several methods for finding connected components.
One method is to construct the Laplacian of the graph and look at its eigenvalues/eigenvectors. The number of zero eigenvalues gives you the number of connected components, and the non-zero elements of the corresponding eigenvectors give the nodes belonging to the respective components.
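A sketch of this with numpy, for an undirected graph given as a 0/1 adjacency matrix (note the eigenvalues come back as floats, so counting the "zero" ones needs a tolerance):

    import numpy as np

    def count_components(adj, tol=1e-9):
        # Laplacian L = D - A, where D is the diagonal matrix of degrees.
        laplacian = np.diag(adj.sum(axis=1)) - adj
        # The Laplacian is symmetric, so eigvalsh applies; the multiplicity of
        # the zero eigenvalue is the number of connected components.
        eigenvalues = np.linalg.eigvalsh(laplacian)
        return int(np.sum(np.abs(eigenvalues) < tol))

    # Two disjoint edges -> two components.
    adj = np.array([[0, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
    print(count_components(adj))  # -> 2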
Another way is along the following lines:
Create a transformation table of nodes. Element n of the array contains the index of the node that node n transforms to.
Loop through all edges (i,j) in the graph (denoting a connection between i and j):
Compute recursively which nodes i and j transform to based on the current table. Let us denote the results by k and l. Update entry k to make it transform to l, and update entries i and j to point to l as well.
Loop through the table again, and update each entry to point directly to the node it recursively transforms to.
Now nodes in the same connected component will have the same entry in the transformation table. So to check if two nodes are connected, just check if they transform to the same value.
Every time a new node or edge is added to the graph, the transformation table needs to be updated, but this update will be much faster than the original calculation of the table.
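A small sketch of this table-based method (nodes numbered 0..n-1; the edge list at the bottom is a made-up example):

    def build_transform_table(n, edges):
        table = list(range(n))  # element i starts out transforming to itself

        def resolve(x):
            # Follow the chain of transformations to its end.
            while table[x] != x:
                x = table[x]
            return x

        for i, j in edges:
            k, l = resolve(i), resolve(j)
            table[k] = l  # merge the two chains
            table[i] = l  # and shortcut i and j directly, as described above
            table[j] = l
        # Second pass: make every entry point straight at its final node.
        for i in range(n):
            table[i] = resolve(i)
        return table

    table = build_transform_table(5, [(0, 1), (2, 3), (1, 2)])
    print(table[0] == table[3])  # True: nodes 0 and 3 are in the same component
    print(table[0] == table[4])  # False: node 4 is isolated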

An algorithm to check if a vertex is reachable

Is there an algorithm that can check, in a directed graph, whether a vertex, say V2, is reachable from a vertex V1 without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing it at query time. I.e., it's basically an eager approach that assumes you're going to use enough of the combinations eventually that it's easiest to just traverse them all up front and store the results. Depending on the size of the graph, the pre-processing step may be slow, but once it's done, executing a query becomes quite fast (constant time, and usually a pretty small constant at that).
Depth-first search or breadth-first search: stop when you find one. But there's no way to tell there's no path without going through every edge, no. You can sometimes improve the performance with heuristics, for instance if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map and you know that most of the time there's a mostly direct path, you can have the depth-first search explore along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other with no direct connection between them, and to find the path you have to go way out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
Add V1 to a queue of nodes to be visited
While there are nodes in the queue:
    Take a node off the queue
    If the node is V2, return true
    Mark the node as visited
    For every node at the end of an outgoing edge which is not yet visited:
        Add this node to the queue
    End for
End while
Return false
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's algorithm, which creates an O(n log n) space data structure that takes O(n log n) time to build and answers queries in O(1) time.
Another approach to this problem lets you ignore the vertices entirely. If you only look at the edges, you can produce a transitive closure array that shows, for every vertex, which vertices are reachable from it.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start locations as the rows and end locations as the columns, and fill it with 0s. For each edge in the list of edges, place a 1 at the (start, end) coordinate of the edge.
Now you iterate a few times until either the (V1, V2) entry is 1 or there are no more changes:
For each row:
    NextRowN = RowN
    For each column that is true in RowN:
        OR that column's row into NextRowN
    Set RowN to NextRowN
If you run this algorithm until nothing changes, you will have a complete list of all reachable vertices without ever looking at the vertices themselves. Each pass is proportional to the number of edges, and the number of passes is bounded by the length of the longest path, so this works well with a reasonable implementation and a reasonable number of edges.
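A sketch of that iteration using Python integers as bit rows (the bitmask representation is my choice; any boolean matrix would do):

    def transitive_closure(n, edges):
        # row[i] is a bitmask of the vertices reachable from i so far.
        row = [0] * n
        for i, j in edges:
            row[i] |= 1 << j
        changed = True
        while changed:  # repeat until no row gains any new vertices
            changed = False
            for i in range(n):
                new_row, mask, j = row[i], row[i], 0
                while mask:  # OR in the row of every vertex reachable from i
                    if mask & 1:
                        new_row |= row[j]
                    mask >>= 1
                    j += 1
                if new_row != row[i]:
                    row[i] = new_row
                    changed = True
        return row

    rows = transitive_closure(3, [(0, 1), (1, 2)])
    print(bool(rows[0] >> 2 & 1))  # True: vertex 2 is reachable from vertex 0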
A slightly more complex version of this algorithm would be to calculate only the vertices reachable from V1. To do this, you would restrict your attention, at any given time, to the rows that are currently reachable. You can also limit each row to being added only once, since the other rows never change.
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth first or breadth first search that stops when it encounters a vertex that it has already seen. The vertex will be processed on the first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.
