I need to implement a pathfinding algorithm in one of my programs. The goal is only to know whether a path exists or not; the path itself doesn't matter.
I already did some research and I am not sure which algorithm to pick. This post suggests that a DFS or a BFS would be more suitable for this kind of program, but I'd rather have confirmation given my exact situation. I would also be interested in the complexity of the resulting program, but I can probably work that out myself, so it's fine if it's not shared.
Here's the graph I am using: let's say I have an x*y grid with cells the path can and cannot pass through.
I want to know if there is an existing path that starts from the top of the graph and ends on the bottom of the graph. Here's an example with the path in red:
I believe DFS is the best in complexity, but I am not sure exactly how to implement it given the different points the path can start from. Is it better to launch the DFS from each of the possible starting points, or to add an extra row of passable cells on top so that a single search is enough?
Thank you for your help!
There are a number of different approaches that you can take here. Assuming that the grids you're working with are of roughly the size that you're showing above, and assuming you aren't, say, processing millions of grids at once, chances are that both breadth-first search and depth-first search would work equally well. The advantage of breadth-first search is that it will find the shortest path from anywhere in the top to anywhere in the bottom; the disadvantage is that it typically requires more memory than depth-first search. But again, if you're working with grids on the order of, say, hundreds or thousands of cells each, chances are that this memory overhead isn't going to be too much of a problem. I'd say to pick whichever algorithm you feel most comfortable working with and go with it.
As for how to implement a search from "anywhere in the top" to "anywhere in the bottom," you can achieve this in a few different ways.
If you're using a depth-first search, you can run one depth-first search from each of the cells in the top row and search for a path down to the bottom row. DFS requires you to maintain some information about which cells have and have not been visited. If you recycle this same information across all the calls to DFS, you'll ensure that no two calls do any duplicated work, and so the resulting solution should be very efficient, running in time O(mn) for an m × n grid.
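For concreteness, here's a minimal sketch of that idea in Python. The grid encoding is an assumption on my part (True marks a passable cell):

    def path_exists_dfs(grid):
        # grid[r][c] is True if the cell is passable (assumed encoding).
        m, n = len(grid), len(grid[0])
        visited = [[False] * n for _ in range(m)]  # shared across all starts

        def dfs(r, c):
            if not (0 <= r < m and 0 <= c < n):
                return False
            if not grid[r][c] or visited[r][c]:
                return False
            visited[r][c] = True
            if r == m - 1:  # reached the bottom row
                return True
            return (dfs(r + 1, c) or dfs(r - 1, c) or
                    dfs(r, c + 1) or dfs(r, c - 1))

        # One DFS per top-row cell; the shared visited array ensures each
        # cell is explored at most once, giving O(mn) overall.
        return any(dfs(0, c) for c in range(n))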
If you're using a breadth-first search, the modification is pretty straightforward: instead of just enqueuing a single start point in the queue at the beginning of the search, enqueue every cell in the top row at the beginning of the search. The BFS will then naturally explore all possible paths starting anywhere in the top row.
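Here's a sketch of that multi-source variant, under the same assumed grid encoding:

    from collections import deque

    def path_exists_bfs(grid):
        m, n = len(grid), len(grid[0])
        visited = [[False] * n for _ in range(m)]
        queue = deque()
        for c in range(n):                 # enqueue the whole top row
            if grid[0][c]:
                visited[0][c] = True
                queue.append((0, c))
        while queue:
            r, c = queue.popleft()
            if r == m - 1:                 # reached the bottom row
                return True
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < m and 0 <= nc < n and grid[nr][nc] and not visited[nr][nc]:
                    visited[nr][nc] = True
                    queue.append((nr, nc))
        return False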
Both of these ideas can be thought of in a different way. Imagine your grid is a graph where each cell is a node and edges correspond to pairs of adjacent cells. You can then add in a new node that sits above the top row of the grid and is connected to each of the nodes in the top row. You then add in a new node that sits just below the bottom row and is connected to each of the nodes in the bottom row. Now, if there's a path from the new top node to the new bottom node, it means that there's a path from some node in the top row to some node in the bottom row, so doing a single search in this graph will be sufficient to check if a path exists. (Fun fact: the two above modifications to DFS and BFS can each be thought of as implicitly doing a search in this new graph.)
There's another option you might want to consider that's fairly easy to implement and imperceptibly less efficient than DFS or BFS, and that's to use a disjoint-set forest data structure to determine what's connected. This data structure supports two kinds of queries:
Given two cells, mark that there's a way to get from the first cell to the second. ("Union")
Given two cells, determine whether there's a path between them, which can be a direct path or could be formed by chaining together multiple other paths. ("Find")
You could implement your connectivity query by building a disjoint-set forest, unioning together all pairs of adjacent cells, and then unioning together all nodes in the top row and unioning all nodes in the bottom row. Doing a "find" query to see if any one of the top nodes is connected to any of the bottom nodes will then solve your problem. This will take time O(mn α(mn)) for a function α(mn) that grows so slowly that it's essentially three or four, so it's effectively as efficient as BFS or DFS.
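Here's a minimal union-find sketch under the same assumed grid encoding as the earlier snippets. Cell (r, c) is numbered r * n + c, and two extra indices stand in for the virtual top and bottom nodes from the previous idea:

    def path_exists_dsu(grid):
        m, n = len(grid), len(grid[0])
        TOP, BOTTOM = m * n, m * n + 1         # virtual nodes
        parent = list(range(m * n + 2))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for r in range(m):
            for c in range(n):
                if not grid[r][c]:
                    continue
                if r == 0:
                    union(r * n + c, TOP)
                if r == m - 1:
                    union(r * n + c, BOTTOM)
                if c + 1 < n and grid[r][c + 1]:     # right neighbor
                    union(r * n + c, r * n + c + 1)
                if r + 1 < m and grid[r + 1][c]:     # down neighbor
                    union(r * n + c, (r + 1) * n + c)

        return find(TOP) == find(BOTTOM)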
Related
Isn't it always better, when searching for the shortest path, to use lists of connected nodes instead of a grid?
When using a grid, you have to iterate over the grid every time, whereas using lists saves a lot of time.
With an adjacency matrix, enumerating a node's neighbors usually costs you O(n) time, so it may be a bit slower than a list of connected nodes. However, you can do some fancy stuff with it. For example, if you want to delete a lot of edges, you can delete each one in O(1) using an adjacency matrix (it may take a lot longer using a list of nodes, depending on what data structure you use for it). An adjacency matrix is also a matrix. What do I mean by that? If you want to check in how many ways you can get from node A to node B in k steps, you can raise the matrix to the power of k, which is impossible to do with a list.
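To illustrate that last point, here's a quick hypothetical example using numpy (the 4-node cycle graph is made up for the demonstration):

    import numpy as np

    # Adjacency matrix of a 4-node cycle: 0-1-2-3-0.
    A = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]])

    k = 3
    walks = np.linalg.matrix_power(A, k)
    # walks[i][j] = number of walks of length exactly k from i to j.
    print(walks[0][1])  # 4 three-step walks from node 0 to node 1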
I'm having trouble finding an algorithm for my problem.
I have a grid of 8x8 blocks, each block has a value ranging from 0 to 9, and I want to find collections of connected blocks that match a total value of, for example, 15. My first approach was to start off at the border; that worked fine. But when starting in the middle of the grid my algorithm gets lost.
Would anyone know a simple algorithm to use or can you point me in the right direction?
Thanks!
As far as I know, no simple algorithm exists for this. As for pointing you in the right direction, an 8x8 grid is really just a special case of a graph, so I'd start with graph traversal algorithms. I find that in cases like this, it sometimes helps to think how you would solve the problem for a smaller grid (say, 3x3 or 4x4) and then see if your algorithm scales up to "full size."
EDIT:
My proposed algorithm is a modified depth-first traversal. To use it, you'll have to convert your grid into a graph. The graph should be undirected, since connected blocks are connected equally in both directions.
Each graph node represents a single block, containing the block's value and a visited variable. Edge weights represent their edges' resistance to being followed. Set them by summing the values of the nodes they connect. Depending on the sum you're looking for, you may be able to optimize this by removing edges that are guaranteed to fail. For example, if you're looking for 15, you can delete all edges with weight of 16 or greater.
The rest of the algorithm will be performed as many times as there are blocks, with each block serving as the starting block once. Traverse the graph by following the lowest-weighted edge from the current node, unless that takes you to a visited node. Push each visited node onto a stack and set its visited variable to true. Keep a running sum for every path followed.
Whenever the desired sum is reached, save the current path as one of your answers. Do not stop traversal, because the current node could be connected to a zero.
Whenever the total exceeds the desired sum, backtrack by setting visited to false and popping the current node off the stack.
Whenever all edges for a given node have been explored, backtrack.
After every possible path from a given starting node is analyzed, every answer that includes that node has been found. So, remove all edges touching the starting node and choose a new starting node.
I haven't fully analyzed the efficiency/running time of this algorithm yet, but... it's not good. (Consider the number of paths to be searched in a graph containing all zeroes.) That said, it's far better than pure brute force.
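For what it's worth, here's a minimal sketch of a backtracking search in this spirit, done directly on the grid (4-directional adjacency and non-negative block values are assumed) rather than on an explicit edge-weighted graph:

    def find_paths(grid, target):
        m, n = len(grid), len(grid[0])
        answers = []
        on_path = [[False] * n for _ in range(m)]

        def dfs(r, c, total, path):
            total += grid[r][c]
            if total > target:               # values are non-negative, so prune
                return
            on_path[r][c] = True
            path.append((r, c))
            if total == target:
                answers.append(list(path))   # keep going: a 0 may extend it
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < m and 0 <= nc < n and not on_path[nr][nc]:
                    dfs(nr, nc, total, path)
            path.pop()                       # backtrack
            on_path[r][c] = False

        for r in range(m):
            for c in range(n):
                dfs(r, c, 0, [])
        return answers

Note that this sketch reports each path once per direction; the refinement above of removing the starting block after all its paths are analyzed would avoid that duplication.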
I have a pyramid of numbers. Each number represents the number of points associated with it. I need to use a greedy algorithm to find the path with the lowest cost to get from the top of the pyramid to the bottom. I've read about uninformed & informed search algorithms, but I still don't know what to choose. What do you think is best suited for this type of problem? Greedy best-first search, A* search, or something else? It's such a simple issue, but I'm not familiar enough with all these algorithms to know the best option. And as I said, it has to be a greedy algorithm.
If I am understanding you correctly, in your pyramid you always have the option of descending to the left or to the right, and the cost of getting to the bottom is the sum of all the nodes you pass through.
In this case, simply work your way up from the bottom. Start at the 2nd row from the bottom. For each node in the row, look at its left and right children in the row below. Add the cost of the cheaper child node to the node you are on. Move up a row and repeat, until you are at the root/peak. Each node will now contain the cost of the cheapest path from there to the bottom. Just greedily descend by choosing the child node with the cheaper cost.
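A minimal sketch of this bottom-up pass, assuming the pyramid is given as a list of rows of increasing length:

    def min_path_cost(pyramid):
        cost = list(pyramid[-1])          # start from the bottom row
        for row in reversed(pyramid[:-1]):
            # Each node's cost = its value + the cheaper of its two children.
            cost = [v + min(cost[i], cost[i + 1]) for i, v in enumerate(row)]
        return cost[0]                    # cheapest top-to-bottom total

    print(min_path_cost([[1], [2, 3], [4, 5, 6]]))  # 1 + 2 + 4 = 7

Once the table is filled in, the greedy descent itself is trivial: at each node, step to the child with the smaller stored cost.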
If you're not required to use a greedy algorithm (which isn't the correct choice here anyway): for this kind of problem you would naturally use a technique called "dynamic programming".
You initialize all squares of your pyramid (in a backup copy) with infinity, except the starting point, which gets its own value.
Then you process the pyramid from top to bottom, row by row.
From the first row (which contains only the top node) you try every move you can make, updating the nodes in the second row by giving them the value of the top plus their own value. Then you move on to the second row and update the nodes in the third row the same way.
It is possible that you've already found a better route to a node (leading from the node one place to the left), so you only update if the newly created route is "faster". (That's what the infinity initialization is for: at the beginning you don't know whether any route exists at all.) After you finish processing a level of the pyramid that way, you know that you have the best possible routes to the nodes in the level just below.
Even if it sounds a bit complicated, it's quite easy to implement; I hope it won't give you any trouble.
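A sketch of this top-down version, again assuming the pyramid is a list of rows:

    INF = float('inf')

    def min_path_cost_topdown(pyramid):
        # Backup table initialized to infinity, except the apex.
        best = [[INF] * len(row) for row in pyramid]
        best[0][0] = pyramid[0][0]
        for r in range(len(pyramid) - 1):
            for c, cur in enumerate(best[r]):
                for nc in (c, c + 1):        # relax both children
                    cand = cur + pyramid[r + 1][nc]
                    if cand < best[r + 1][nc]:
                        best[r + 1][nc] = cand
        return min(best[-1])

    print(min_path_cost_topdown([[1], [2, 3], [4, 5, 6]]))  # also 7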
What you want is Dijkstra's algorithm. It is simpler than A* search, but I guess a DFS would do it too. I'm not sure what you really want.
Is there an algorithm that can check, in a directed graph, if a vertex, let's say V2, is reachable from a vertex V1, without traversing all the vertices?
You might find a route to that node without traversing all the edges, and if so you can give a yes answer as soon as you do. Nothing short of traversing all the edges can confirm that the node isn't reachable (unless there's some other constraint you haven't stated that could be used to eliminate the possibility earlier).
Edit: I should add that it depends on how often you need to do queries versus how large (and dense) your graph is. If you need to do a huge number of queries on a relatively small graph, it may make sense to pre-process the data in the graph to produce a matrix with a bit at the intersection of any V1 and V2 to indicate whether there's a connection from V1 to V2. This doesn't avoid traversing the graph, but it can avoid traversing the graph at the time of the query. I.e., it's basically a greedy algorithm that assumes you're going to eventually use enough of the combinations that it's easiest to just traverse them all and store the result. Depending on the size of the graph, the pre-processing step may be slow, but once it's done executing a query becomes quite fast (constant time, and usually a pretty small constant at that).
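A sketch of that pre-processing step, here done with one traversal per vertex over an assumed adjacency-list representation:

    def build_reachability(adj):
        # adj[v] lists the vertices at the end of v's outgoing edges.
        n = len(adj)
        reach = [[False] * n for _ in range(n)]
        for start in range(n):
            stack = [start]
            while stack:
                v = stack.pop()
                if reach[start][v]:
                    continue
                reach[start][v] = True
                stack.extend(adj[v])
        return reach

    # Afterwards each query is a constant-time lookup: reach[v1][v2]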
Depth-first search or breadth-first search. Stop when you find one. But there's no way to tell there's no path without going through every reachable vertex, no. You can sometimes improve the performance with heuristics if you have additional information about the graph. For example, if the graph represents a coordinate space like a real map, and most of the time you know that there's going to be a mostly direct path, then you can attempt to have the depth-first search look along lines that "aim towards the target". However, imagine the case where the start and end points are right next to each other, but with no direct route in between, so that to find the path you have to go far out of the way. You have to check every case in order to be exhaustive.
I doubt it has a name, but a breadth-first search might go like this:
    from collections import deque

    def reachable(adj, v1, v2):
        # adj[v] lists the vertices at the end of v's outgoing edges.
        visited = {v1}
        queue = deque([v1])
        while queue:
            node = queue.popleft()
            if node == v2:
                return True
            for nxt in adj[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)
        return False
Create an adjacency matrix when the graph is created. At the same time you do this, create matrices consisting of the powers of the adjacency matrix up to the number of nodes in the graph. To find if there is a path from node u to node v, check the matrices (starting from M^1 and going to M^n) and examine the value at (u, v) in each matrix. If, for any of the matrices checked, that value is greater than zero, you can stop the check because there is indeed a connection. (This gives you even more information as well: the power tells you the number of steps between nodes, and the value tells you how many paths there are between nodes for that step number.)
(Note that if you know the number of steps in the longest path in your graph, for whatever reason, you only need to create a number of matrices up to that power. As well, if you want to save memory, you could just store the base adjacency matrix and create the others as you go along, but for large matrices that may take a fair amount of time if you aren't using an efficient method of doing the multiplications, whether from a library or written on your own.)
It would probably be easiest to just do a depth- or breadth-first search, though, as others have suggested, not only because they're comparatively easy to implement but also because you can generate the path between nodes as you go along. (Technically you'd be generating multiple paths and discarding loops/dead-end ones along the way, but whatever.)
In principle, you can't determine that a path exists without traversing some part of the graph, because the failure case (a path does not exist) cannot be determined without traversing the entire graph.
You MAY be able to improve your performance by searching backwards (search from destination to starting point), or by alternating between forward and backward search steps.
Any good AI textbook will talk at length about search techniques. Elaine Rich's book was good in this area. Amazon is your FRIEND.
You mentioned here that the graph represents a road network. If the graph is planar, you could use Thorup's Algorithm, which creates an O(n log n) space data structure that takes O(n log n) time to build and answers queries in O(1) time.
Another approach to this problem would allow you to ignore all of the vertices. If you were to only look at the edges, you can produce a transitive closure array that will show you each vertex that is reachable from any other vertex.
Start with your list of edges:
Va -> Vc
Va -> Vd
....
Create an array with start locations as the rows and end locations as the columns. Fill the array with 0. For each edge in the list of edges, place a 1 at the (start, end) coordinate of the edge.
Now you iterate until either the (V1, V2) entry is 1 or a full pass makes no changes. In Python-style code:
    # reach is the 0/1 matrix built above, n its dimension.
    changed = True
    while changed and not reach[v1][v2]:
        changed = False
        for i in range(n):
            for j in range(n):
                if reach[i][j]:                   # i can reach j, so OR row j into row i
                    for k in range(n):
                        if reach[j][k] and not reach[i][k]:
                            reach[i][k] = True; changed = True
If you run this algorithm until the end, you will have a complete list of all reachable vertices without ever examining the vertices themselves, only the edges. The runtime depends on the size of the matrix and on how many passes it takes to converge, but with a reasonable implementation and a reasonable number of edges this works well.
A slightly more complex version of this algorithm would be to only calculate the vertices reachable by V1. To do this, you would focus your scope on the ones that are currently reachable at any given time. You can also limit adding rows to only one time, since the other rows are never changing.
In order to be sure, you either have to find a path, or traverse all vertices that are reachable from V1 once.
I would recommend an implementation of depth first or breadth first search that stops when it encounters a vertex that it has already seen. The vertex will be processed on the first occurrence only. You need to make sure that the search starts at V1 and stops when it runs out of vertices or encounters V2.
Suppose that I have a very large undirected, unweighted graph (starting at hundreds of millions of vertices, ~10 edges per vertex), non-distributed and processed by a single thread only, and that I want to do breadth-first searches on it. I expect them to be I/O-bound, thus I need a good-for-BFS disk page layout; disk space is not an issue. The searches can start on every vertex with equal probability. Intuitively that means minimizing the number of edges between vertices on different disk pages, which is a graph partitioning problem.
The graph itself looks like spaghetti: think of a random set of points randomly interconnected, with some bias towards shorter edges.
The problem is, how does one partition a graph this large? The available graph partitioners I have found only work with graphs that fit into memory. I could not find descriptions or implementations of any streaming graph partitioning algorithms.
OR, maybe there is an alternative to partitioning the graph for getting a disk layout that works well with BFS?
Right now, as an approximation, I use the fact that the vertices have spatial coordinates attached to them and put the vertices on disk in Hilbert sort order. That way spatially close vertices land on the same page, but the presence or absence of an edge between them is completely ignored. Can I do better?
As an alternative, I can split the graph into pieces using the Hilbert sort order for vertices, partition the subgraphs, stitch them back together, and accept poor partitioning at the seams.
Some things I have looked into already:
How to store a large directed unweighted graph with billions of nodes and vertices
http://neo4j.org/ - I found zero information on how it does graph layout on disk
Partitioning implementations (unless I'm mistaken, all of them need the graph to fit into memory):
http://glaros.dtc.umn.edu/gkhome/views/metis
http://www.sandia.gov/~bahendr/chaco.html
http://staffweb.cms.gre.ac.uk/~c.walshaw/jostle/
http://www.cerfacs.fr/algor/Softs/MESHPART/
EDIT: added info on what the graph looks like and the fact that BFS can start anywhere.
EDIT: added the idea of partitioning subgraphs.
No algorithm truly needs to "fit into memory"--you can always page things in and out as needed. But you do want to avoid having the computation take unreasonably long--and global graph partitioning in the generic case is an NP-complete problem, which is "unreasonably long" for most graphs that do not even fit in memory.
Fortunately, you want to do breadth-first searches, which means that you want a format where breadth-first is the easy computation. I don't know of any algorithms offhand that do this, but you can construct your own breadth-first layout if you're willing to allow a bit of extra disk space.
If the edges are not biased towards local interactions, then disentangling the graph will be difficult. If they are biased towards local interactions, then I suggest an algorithm like the following:
Pick a random set of vertices as starting points from throughout the entire data set.
For each vertex, collect all neighboring vertices (takes one sweep through the data set).
For each set of neighboring vertices, collect the set of neighbors-of-neighbors and rank them according to how many edges connect to them. If you don't have space in a page to store them all, keep the most-connected vertices. If you do have space to save them all, you may wish to throw away the least useful ones (e.g. if the ratio of the fraction of edges kept within a page to the fraction of vertices needing storage drops "too low"--where "too low" will depend on how much breadth your searches really need, and whether you can do any pruning and so on--then don't include those in the neighborhood).
Repeat the process of collecting and ranking neighbors until your neighborhood is full (e.g. fills some page size that suits you). Then check for repeats among the randomly chosen starts. If you have a small number of vertices appearing in both, remove them from one or the other, whichever breaks fewer edges. If you have a large number of vertices appearing in both, keep the neighborhood with the best (vertices in neighborhood/broken edge) ratio and throw the other away.
Now you have some local neighborhoods that are approximately locally optimal in that breadth-first searches tend to fall inside. If your breadth-first search prunes off unproductive branches pretty effectively, then this is probably good enough. If not, you probably want adjacent neighborhoods to cluster.
If you don't need adjacent neighborhoods to cluster too much, you set aside the vertices you've grouped into neighborhoods, and repeat the process on the remaining data until all vertices are accounted for. You change each vertex identifier to (vertex,neighborhood), and you're done: when following edges, you know exactly which page to grab, and most of them will be close by given the construction.
If you do need adjacent neighborhoods, then you'll need to keep track of your growing neighborhoods. You repeat the previous process (pick at random, grow neighborhoods), but now rank neighbors by both how many edges they satisfy within the neighborhood and what fraction of their edges that leave the neighborhood are in an existing group. You might need weighting factors, but something like
score = (# edges within) - (# neighborhoods outside) - (# neighborhoodless edges outside)
would probably do the trick.
Now, this is not globally or even locally optimal, but this or something very much like it should give a nicely locally-connected structure, and should let you produce a covering set of neighborhoods that have relatively high interconnectivity.
Again, it depends whether your breadth-first search prunes branches or not. If it does, the inexpensive thing to do is to maximize local interconnectivity. If it doesn't the thing to do is to minimize external connectivity--and in that case, I'd suggest just collecting breadth-first sets up to some size and saving those (with duplication at the edges of the sets--you're not badly limited by hard drive space, are you?).
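To make the ranking concrete, here's a rough sketch of a single growth step under that scoring scheme. Everything here is an assumption on my part: adjacency sets, a map from already-placed vertices to their neighborhood, and the reading of "# neighborhoods outside" as edges into already-assigned neighborhoods:

    def grow_step(adj, neighborhood, assigned, page_size):
        # adj: dict vertex -> set of neighbor vertices (assumed representation)
        # neighborhood: set of vertices grown so far (not yet in `assigned`)
        # assigned: dict vertex -> neighborhood id for finished neighborhoods
        if len(neighborhood) >= page_size:
            return None                       # page is full
        frontier = {u for v in neighborhood for u in adj[v]} - neighborhood
        best_v, best_score = None, float('-inf')
        for u in frontier:
            inside = sum(1 for w in adj[u] if w in neighborhood)
            in_other = sum(1 for w in adj[u]
                           if w not in neighborhood and w in assigned)
            loose = len(adj[u]) - inside - in_other
            score = inside - in_other - loose
            if score > best_score:
                best_v, best_score = u, score
        if best_v is not None:
            neighborhood.add(best_v)
        return best_v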
You might want to look at HDF5. Despite the H standing for Hierarchical, it can store graphs--check the documentation under the keyword 'Groups'--and it is designed for very large datasets. If I understand correctly, HDF5 'files' can be spread across multiple O/S files. Now, HDF5 is only a data structure, plus a set of libraries for low- and high-level manipulations of the data structure. Off the top of my head I haven't a clue about streaming graph-partitioning algorithms, but I stick to the notion that if you get the data structure right, algorithms become easier to implement.
What do you already know about the mega-graph? Does it naturally partition into dense subgraphs which themselves are sparsely connected? Would a topological sort of the graph be a better basis for storage on disk than the existing spatial sort?
Failing crisp answers to such questions, maybe you just have to bite the bullet and read the graph multiple times to build the partitions, in which case you just want the fastest I/O you can manage, and sophisticated layout of partitions on nodes is nice but not as important. If you can partition the graph into sub-graphs which themselves have only single edges to the other sub-graphs, you may be able to make the problem more tractable.
You want a good-for-BFS layout, but BFS is usually applied to trees. Does your graph have a unique root from which to start all BFSes? If not, then a layout tuned for BFS from one vertex will be suboptimal for BFS from another vertex.