A* or Bidirectional Breadth First Search? - performance

Not sure if this is the right place to ask but,
I have programmed a bidirectional breadth-first search algorithm in Java which searches from both the start node and the goal node of a graph at the same time. In a graph with 3000000 (3 million) nodes, each connected to an average of 4 other nodes by two-way/bidirectional edges, it takes on average only 0.5 seconds on a mediocre CPU to find the shortest path between two random nodes, with about 10 seconds being the worst-case time that I found over 30 test runs.
Assuming that only a search for one path is required (for example, when plotting a route between a starting point and a destination), what would be the advantages of using an A* algorithm with a decent heuristic in this case? Yes, it might be slightly faster at finding a path, but A* most likely won't find the shortest path.
Can someone elaborate on why A* is often chosen over bidirectional breadth-first search and what advantages it yields in terms of performance (computation time, memory usage, the chance of finding the optimal solution etc.)?
Also,
Happy easter everyone!

Bidirectional search is great when it’s feasible, but it relies on being able to produce goal nodes as easily as start nodes (how would you select a goal node in a chess game, for example?), and on the branching factor being similar in both directions. It relies on being able to go backwards at all, come to that. Even if you have a good heuristic, you’re unlikely to be able to take much advantage of it in the reverse part of the search.

Related

Is there an efficient way to find shortest paths in a functional graph?

My task is to process Q shortest path queries in a functional graph with V nodes. Q and V are integers that can be up to 100000.
My first idea was to use the Floyd-Warshall algorithm to answer queries efficiently, but this algorithm takes O(V^3) time to calculate the shortest paths, which is way too slow.
My second idea runs in O(QV) time, because for every query I start at the starting node and traverse through the graph until I discover a cycle or reach the destination node.
However, this solution is still too slow; it has no chance of quickly processing queries when V and Q become large. I think that there is some pre-processing or another technique that I could use to solve this, but I haven't been able to find any online resources to help guide me. Can somebody please help me out?
A functional graph means that each node has only a single out-edge, so a walk from A either reaches B or runs into a cycle within at most V steps. Each query should therefore be O(V).
Since there are no choices, you could readily build a CostMap[V][V] which recorded the distance between two nodes, and lazily fill it as you encounter queries; thus successive queries would approach constant time.
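A minimal sketch of that idea, under some assumptions of mine: the graph is given as a list `succ` where `succ[i]` is the single out-neighbour of node `i`, and a dictionary keyed by `(a, b)` stands in for the `CostMap[V][V]` table (a full table for V = 100000 would have 10^10 entries, so a sparse cache is used here instead). The function name `distance` is hypothetical.

```python
def distance(succ, a, b, cache=None):
    """Number of steps from a to b following succ[], or -1 if b is unreachable."""
    if cache is not None and (a, b) in cache:
        return cache[(a, b)]
    steps, node, seen = 0, a, set()
    result = -1
    # Walk the single outgoing edge until b is found or a node repeats (a cycle).
    while node not in seen:
        if node == b:
            result = steps
            break
        seen.add(node)
        node = succ[node]
        steps += 1
    if cache is not None:
        cache[(a, b)] = result  # lazily remember the answer for repeated queries
    return result
```

Each uncached query visits at most V nodes; repeated queries for the same pair become dictionary lookups.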
There are a lot of algorithms designed for this purpose. You can look up Depth First Search (DFS) or Breadth First Search (BFS), as well as Dijkstra's algorithm and the A* (A star) algorithm; this last one is often used in pathfinding for video games. They all have their pros and cons, and the best choice depends on the architecture of your network, but one of them should suit your needs.

Using Dijkstra's algorithm with unique distance between nodes

For a school project, my friends and I are learning what pathfinding is and how to use it through a simple exercise:
For a group of ants going from A to B, they need to travel through multiple nodes, one at a time.
I've read some explanations of Dijkstra's algorithm, and I wanted to know if I could use it in a graph where the distance between every pair of connected nodes is 1 unit. Is it optimal? Or is A* more appropriate for my case?
EDIT:
Since I had knowledge of the graph, BFS was preferred because the distance between nodes was pre-computed, whereas Dijkstra is preferred when you know absolutely nothing about the graph itself. See this post for reference.
If every edge has the same cost, then Dijkstra's algorithm is the same as a breadth-first search. In this case you might as well just implement BFS. It's easier.
If you have a way to estimate how far a point is from the target, then you could use A* instead. This will find the best path faster in most cases.
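To make the A* suggestion concrete, here is a small sketch. The setting is my own assumption, not the asker's: nodes are cells of a 4-connected grid (0 = free, 1 = wall), every step costs 1, and the estimate is the Manhattan distance to the target. That estimate never overestimates on such a grid, so the first time the goal is popped its cost is the shortest one. The name `astar` is hypothetical.

```python
import heapq

def astar(grid, start, goal):
    """Length of a shortest 4-connected path from start to goal, or None."""
    def h(cell):
        # Manhattan distance: a lower bound on the remaining number of steps.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    best = {start: 0}
    heap = [(h(start), 0, start)]  # entries are (g + h, g, cell)
    while heap:
        _, g, cell = heapq.heappop(heap)
        if cell == goal:
            return g
        if g > best.get(cell, float("inf")):
            continue  # stale heap entry, a cheaper route was found later
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1  # every edge costs one unit
                if ng < best.get(nxt, float("inf")):
                    best[nxt] = ng
                    heapq.heappush(heap, (ng + h(nxt), ng, nxt))
    return None
```

If `h` always returned 0, the same loop would behave like Dijkstra's algorithm, which on unit costs explores nodes in the same distance order as BFS.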

Travelling by bus

If you have the full bus schedule for a country, how can you find the furthest anyone can travel in one day without visiting the same stop twice?
I assume a bus schedule gives you the full list of leaving and arriving times for every bus stop.
A slow and naive method would be as follows.
You can of course make a graph from the bus schedule with multiple directed edges between bus stops. You could then do a depth first search remembering the arrival time of the edge you took to get to each node and only taking edges from that stop that leave after the one that you took to get there. If you go to a node you have been to before you would only carry on from there if the current time in your traversal is before the earliest time you had ever visited that node before. You could record the furthest you can get from each node and then you could check each node to find the furthest you can travel overall.
This seems very inefficient however and it really isn't a normal graph problem. The problem is that in a normal directed graph if you can get from A to B and from B to C then you can get from A to C. This isn't true here.
What is the fastest you can solve this problem?
I think your original algorithm is pretty good.
You can think of your approach as being a version of Dijkstra's algorithm, in attempting to find the shortest path to each node.
Note that it is best at this stage to weight edges in the graph in terms of time. The idea is to use your Dijkstra-like algorithm to compute all nodes reachable within one day's worth of time, and then pick whichever of these nodes is furthest in space from the start point.
Implementations of Dijkstra can use a heap to retrieve the next node to explore in O(log n), and I think this would be a good enhancement to your approach as well. If you always choose the node that you can reach earliest, you never need to repeat the calculation for that node.
Overall the approach is:
For each starting point
Use a modified Dijkstra to compute all nodes reachable in 1 day
Find the furthest in space of all these nodes.
So for n starting points and e bus routes, the complexity is about O(n(n+e)log(n)) to get the optimal answer.
You should be able to get improved performance by using an appropriate heuristic in an A* search. Because the goal here is to maximize distance, the heuristic must never underestimate the maximum distance still possible from a point (it has to be an upper bound), so you could use the maximum speed of a bus multiplied by the remaining time.
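The "modified Dijkstra" step could look roughly like the sketch below. The data layout is an assumption of mine: each scheduled trip is a tuple `(from_stop, depart, to_stop, arrive)` with times as plain numbers, and no trip arrives before it departs. The function name `earliest_arrivals` is hypothetical. It returns the earliest arrival time for every stop reachable before the deadline; picking the furthest of those stops in space is left out because it needs coordinates that the question does not describe.

```python
import heapq

def earliest_arrivals(trips, start, start_time, deadline):
    """Map each stop reachable by the deadline to its earliest arrival time."""
    by_stop = {}
    for frm, dep, to, arr in trips:
        by_stop.setdefault(frm, []).append((dep, to, arr))

    best = {start: start_time}
    heap = [(start_time, start)]
    while heap:
        t, stop = heapq.heappop(heap)
        if t > best[stop]:
            continue  # stale entry, this stop was already reached earlier
        for dep, to, arr in by_stop.get(stop, []):
            # Only board buses that leave at or after we arrive at this stop.
            if dep >= t and arr <= deadline and arr < best.get(to, float("inf")):
                best[to] = arr
                heapq.heappush(heap, (arr, to))
    return best
```

Because the stop with the earliest arrival time is always processed first, each stop is expanded only once with its final time, as described above.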
Instead of making multiple edges for each departure from a location, you can make multiple nodes per location / time.
Create one node per location per departure time.
Create one node per location per arrival time.
Create edges to connect departures to arrivals.
Create edges to connect a given node to the node belonging to the same location at the nearest future time.
By doing this, any path you can traverse through the graph is "valid" (meaning a traveler would be able to achieve this by a combination of bus trips or choosing to sit at a location and wait for a future bus).
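The construction steps above can be sketched as follows, again assuming (my choice, not the answer's) that trips are tuples `(from_stop, depart, to_stop, arrive)`. Nodes are `(stop, time)` pairs, and the hypothetical `build_time_expanded` returns a plain adjacency dictionary.

```python
def build_time_expanded(trips):
    """Build {(stop, time): [(stop, time), ...]} with ride edges and wait edges."""
    graph = {}
    events = {}  # stop -> set of all departure and arrival times at that stop
    for frm, dep, to, arr in trips:
        graph.setdefault((frm, dep), []).append((to, arr))  # ride edge
        graph.setdefault((to, arr), [])
        events.setdefault(frm, set()).add(dep)
        events.setdefault(to, set()).add(arr)
    for stop, times in events.items():
        ordered = sorted(times)
        # Wait edge: link each node to the next later node at the same stop.
        for earlier, later in zip(ordered, ordered[1:]):
            graph[(stop, earlier)].append((stop, later))
    return graph
```

Every edge points forward in time (or stays at the same time), so any path through this graph corresponds to a feasible sequence of rides and waits.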
I originally misread the problem and thought it was NP-hard, but it is not. It does, however, have a pretty high complexity as described, which I personally would not want to deal with. The algorithm below is a pretty good approximation that gives considerable complexity savings, which I personally think is worth it.
However, if all you want is an answer that is "pretty good", there are a lot of fairly efficient algorithms out there that will get close very quickly.
Personally I would suggest using a simple greedy algorithm here.
I've done this on a few (granted, small and contrived) examples and it's worked pretty well and has an O(n log n) running time.
Associate a velocity with each node, velocity being the fastest you can move away from a given node. In my examples this velocity was distance_travelled/(wait_time + travel_time). I used the maximum velocity of all trips leaving a node as the velocity score for that node.
From your node/time calculate the velocities of all neighboring nodes and travel to the "fastest" node.
This algorithm is pretty good for the complexity as it basically transforms the problem into a static search, but there are a couple potential pitfalls that could be adjusted for depending on your data set.
The biggest issue with this algorithm is the possibility of a really fast bus going into the middle of nowhere. You could get around that by adding a "popularity" term to the velocity calculation (make more popular stops effectively faster) but depending on your data set that could easily make things either better or worse.
The simplistic graph representation, i.e. each city is a node and the edges represent time, will not work. That's because an "edge" is not always active: it is only active at certain times of the day.
The second thing that comes to mind is Edward Tufte's Paris Train Schedule, which is a different kind of graph. But that does not quite fit the problem either. With the train schedule, the stations have a sequential relationship, but that's not the case in general with cities and bus schedules.
But Tufte motivates the following way to model it as a graph. You could write code only to construct the graph and use a standard graph library that includes the shortest path algorithm.
Each bus trip is an edge with weight = distance covered
Each (city, departure) and (city, arrival) is a node
All nodes for a given city are connected by zero-weight edges in a time-ordered sequence, ignoring whether it is an arrival or a departure. This subgraph will look like a chain.
(it is a directed graph)
Linear-time solution: note that the graph will be a directed acyclic graph, because every edge moves forward in time. Finding the longest path in such a graph takes time linear in the number of nodes and edges. "A longest path between two given vertices s and t in a weighted graph G is the same thing as a shortest path in a graph −G derived from G by changing every weight to its negation. Therefore, if shortest paths can be found in −G, then longest paths can also be found in G."
Hope this helps! If somebody can post a visualization of the graph, it would be nice. If I can do so myself, I will do 1 more edit.
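A rough sketch of the longest-path step on a directed acyclic graph, processing nodes in topological order rather than negating weights (a swapped-in but equivalent technique). The input format `{node: [(successor, weight), ...]}` and the name `longest_path_length` are my assumptions. Waiting edges would carry weight 0 and ride edges the distance covered.

```python
from collections import deque

def longest_path_length(graph):
    """Maximum total edge weight over all paths in a DAG (0 for an empty graph)."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v, _ in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1

    best = {u: 0 for u in indegree}  # best[u] = longest path ending at u
    queue = deque(u for u, d in indegree.items() if d == 0)
    while queue:
        u = queue.popleft()
        for v, w in graph.get(u, []):
            best[v] = max(best[v], best[u] + w)
            indegree[v] -= 1
            if indegree[v] == 0:  # all predecessors of v are finished
                queue.append(v)
    return max(best.values(), default=0)
```

Each node and edge is handled once, which is where the linear running time comes from. Note that this only works because the graph has no cycles.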
Naive is the best you'll get -- http://en.wikipedia.org/wiki/Longest_path_problem
EDIT:
So the problem is twofold.
Create a list of graphs where it's possible to travel from pointA to pointB. Possible is in terms of times available for busA to travel from pointA to pointB.
Find the longest path among all the possible paths generated above.
Another approach would be to reevaluate the graph upon each node traversal and find the longest path.
It still reduces to finding longest possible path, which is NP-Hard.

When is backward search better than forward?

I'm studying graph search algorithms (for the sake of this question, let's limit the algorithms to DFS, breadth-first search, and iterative deepening (ID)).
All these algorithms can be implemented as either forward search (from start node to end node) or backward search (from end node to start node).
My question is, when will backward search perform better than forward? Is there a general rule for that?
With a breadth-first search or iterative deepening, I think the mathematical answer to your question involves the notion of a "ball" around a vertex. Define Ball(v, n) to be the set of nodes at distance at most n from node v, and let the distance from the start node s to the destination node t be d. Then in the worst case a forward search will perform better than a backward search if |Ball(s, d)| < |Ball(t, d)|. This is true because breadth-first search always (and ID in the worst case) expands all nodes at some distance k from the start node before ever visiting any nodes at depth k + 1. Consequently, if there's a smaller number of nodes around the start than around the target, a forward search should be faster, whereas if there's a smaller number of nodes around the target than around the start, a backward search should be faster. Unfortunately, it's hard to know this number a priori; you usually have to run the search to determine which is the case. You could potentially use the branching factor around the two nodes as a heuristic for this value, but it wouldn't guarantee that one search would be faster.
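For illustration, |Ball(v, n)| can be measured with a depth-limited BFS. This is only a sketch: the adjacency-dictionary format and the name `ball_size` are my assumptions, and computing it costs about as much as the search itself, which is exactly the "hard to know a priori" problem.

```python
from collections import deque

def ball_size(graph, v, n):
    """Number of nodes at distance at most n from v, counting v itself."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue  # do not expand beyond radius n
        for w in graph.get(u, []):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return len(dist)
```

For example, in a star-shaped graph the ball around the hub grows much faster than the ball around a leaf, so a search started at the leaf touches fewer nodes in its first levels.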
One interesting algorithm you might want to consider exploring is bidirectional breadth-first search, which does a search simultaneously from the source and target nodes. It tends to be much faster than the standard breadth-first search (in particular, with a branching factor b and distance d between the nodes, BFS takes roughly O(b^d) time while bidirectional BFS takes O(b^(d/2))). It's also not that hard to code up once you have a good BFS implementation.
As for depth-first search, I actually don't know of a good way to determine which will be faster because in the worst-case both searches could explore the entire graph before finding a path. If someone has a good explanation about how to determine which will be better, it would be great if they could post it.
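Here is one way the bidirectional breadth-first search mentioned in this answer might be coded. It is a sketch under my own assumptions: the graph is undirected and unweighted, stored as `{node: [neighbours]}`, and the name `bidirectional_bfs` is hypothetical. Each round expands one whole level of the smaller frontier; because both sides only ever hold complete levels, the first edge that touches the other side's visited set gives the shortest length.

```python
def bidirectional_bfs(graph, s, t):
    """Shortest path length between s and t, or None if they are not connected."""
    if s == t:
        return 0
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = [s], [t]
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        forward = len(frontier_s) <= len(frontier_t)
        frontier = frontier_s if forward else frontier_t
        dist_here, dist_there = (dist_s, dist_t) if forward else (dist_t, dist_s)
        next_frontier = []
        for u in frontier:
            for w in graph.get(u, []):
                if w in dist_there:
                    # The two searches meet across the edge (u, w).
                    return dist_here[u] + 1 + dist_there[w]
                if w not in dist_here:
                    dist_here[w] = dist_here[u] + 1
                    next_frontier.append(w)
        if forward:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None
```

For a directed graph the backward side would need the reversed edges, which this sketch does not handle.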

Calculating "Kevin Bacon" Numbers

I've been playing around with some things and thought up the idea of trying to figure out Kevin Bacon numbers. I have data for a site that for this purpose we can consider a social network. Let's pretend that it's Facebook (for simplification of discussion). I have people and I have a list of their friends, so I have the connections between them. How can I calculate the distance from one person to another (basically, a Kevin Bacon number)?
My best idea is a Bidirectional search, with a depth limit (to limit computational complexity and avoid the problem of people who simply can't be connected in the graph), but I realize this is rather brute force.
Could it be better to make little sub-graphs (say something equivalent to groups on Facebook), calculate the shortest distances between them (ahead of time, perhaps) and then try to use THOSE to find a link? While this requires pre-calculation, it could make it possible to search many fewer nodes (nodes could be groups instead of individuals, making the graph much smaller). This would still be a bidirectional search though.
I could also pre-calculate the number of people an individual is connected to, searching the nodes for "popular" people first since they could have the best chance of connecting to the given destination individual. I realize this would be a trade-off of speed for possible shortest path. I'd think I'd also want to use a depth-first search instead of the breadth-first search I'd plan to use in the other cases.
Can someone think of a simpler/faster way of doing this? I'd like to be able to find the shortest length between two people, so it's not as easy as always having the same end point (such as in the Kevin Bacon problem).
I realize that there are problems, such as chains of 200 people, but that can be solved by having a limit on the depth I'm willing to search.
This is a standard shortest path problem. There are lots of solutions, including Dijkstra's algorithm and Bellman-Ford. You may be particularly interested in looking at the A* algorithm and seeing how it would perform with the cost function relative to the inverse of any particular node's degree. The idea would be to visit more popular nodes (those with higher degree) first.
Sounds like a job for Dijkstra's algorithm.
ED: Eh, I shouldn't have pulled the trigger so fast. Dijkstra's (and Bellman-Ford) reduces to a breadth-first search when the weights are 1, so this isn't too useful. Oh well.
The A* algorithm, mentioned by tvanfosson, may be ideal for this. The idea is that instead of searching and recursing in whatever order the elements are in each level of the tree (rooted on your start- or end-point), you use some heuristic to determine which element you are going to try first. In your case a good bet would probably be the degree of a node (number of "friends"), but you could possibly want to use the number of people within some arbitrary number of degrees of a given person (i.e., the guy who has three friends who each have 100 friends is likely to be a better node than the guy who has 20 friends in a clique that shuns outsiders). There are all sorts of other things you could use as a heuristic (friends get 2 points, friends-of-friends get 1 point; whatever, experiment).
Combine this with a depth limit (cut off after 6 degrees of separation, or whatever), and you can vastly improve your average case (worst case is still the same as basic BFS).
Run a breadth-first search in both directions (from each endpoint) and stop when you have a connection or reach your depth limit.
This one might be better overall: Floyd-Warshall, which computes the all-pairs shortest distances.

Resources