A* vs trees in longest path

Let T be a tree in which each node represents a state. The root represents the initial state. An edge going from a parent to a child specifies an action that can be performed on the parent in order to change state (the new state will be the child). Each edge is associated with a gain, i.e., I gain something by transitioning from the parent state to the child state.
Moreover, suppose that each path from the root to a leaf node has length Q.
My objective is to find the most promising path of length Q, i.e., the path that guarantees the largest gain (where the gain of a path is defined as the sum of the gains attached to its edges).
Obviously, I would like to do this without exploring the entire tree, since T could be very large.
Thus, I thought about applying A*. I know that A* can be used to find the shortest path in a graph, but:
I don't have costs, I have gains
I want to find the longest path (actually not the longest in distance from the start node, but the one whose weights, if summed up, give the highest value)
I have a tree and not a graph (no cycles!)
Eventually, I came up with a set of questions that I would like to pose to you:
Is A* suitable for this type of problem? Will I find the optimal solution by applying it?
Since A* requires an (under)estimate of the cost from the current node to the goal when looking for a shortest path, am I required to look for an (over)estimate of the gain from the current node to the goal and use it as a heuristic?
Given a node n in T, my idea was to compute the heuristic h(n) as the summation of the gains achieved by the children of n, which might not be so tight. Do you think there could be a better solution?
EDIT: given a node n in the tree, the gain attached to an edge outgoing from n cannot be greater than a quantity U(n). Moreover, U(n) becomes smaller and smaller as the depth of n increases.

Analysis
In general, no: without additional assumptions about the gains, no algorithm can certify an optimal path before it has examined every edge. The reason is as follows. Suppose you assert that a path P is optimal without having examined some edge e. I can, without loss of generality, set the gain of e so large that any root-to-leaf path through e outgains P (every edge lies on some root-to-leaf path, so such a path exists). Then your path P is not optimal.
So any assertion of optimality made before examining all edges' gains is false.
Conclusion
If no additional information is given about the gains on edges, you cannot find the optimal path without exploring the entire tree.
If you had, for example, an upper bound on gain values, you could use A* to find the optimal path more efficiently, without examining every edge.
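For instance, with the per-edge bound U(n) from the edit, a best-first search in the spirit of A* will do: order the frontier by gain so far plus the optimistic estimate U(n) * (edges still to go), and stop as soon as a completed length-Q path is popped, because its real gain then beats every remaining optimistic bound. A minimal sketch, assuming callbacks children(node) (yielding (child, gain) pairs) and U(node), and assuming U(node) bounds the gain of every edge at or below node (which the edit suggests, since U shrinks with depth); none of these names come from any library.

import heapq
from itertools import count

def best_gain_path(root, children, U, Q):
    # Best-first search for the maximum-gain root-to-leaf path of length Q.
    # children(node): iterable of (child, edge_gain) pairs (assumed callback).
    # U(node): upper bound on the gain of any edge at or below node (assumed).
    # Heap key = -(gain so far + U(node) * edges remaining), an optimistic bound.
    tie = count()                        # tie-breaker so the heap never compares nodes
    heap = [(-(U(root) * Q), next(tie), 0.0, 0, root, [root])]
    while heap:
        _, _, gain, depth, node, path = heapq.heappop(heap)
        if depth == Q:
            # Its bound equals its real gain and beats every other frontier bound,
            # so no unexplored path can do better.
            return gain, path
        for child, edge_gain in children(node):
            g = gain + edge_gain
            bound = g + U(child) * (Q - depth - 1)
            heapq.heappush(heap, (-bound, next(tie), g, depth + 1, child, path + [child]))
    return None                          # no root-to-leaf path of length Q exists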

To answer the question: A* is normally not the right approach for exploring trees; it is for weighted graphs, not trees. If you are exploring a tree, you use backtracking. You can make backtracking more intelligent by using heuristics or pruning.
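Here is a sketch of that kind of pruned backtracking for the gain-tree question above, under the same assumed callbacks as the best-first sketch (children(node) and an upper bound U(node)): it keeps the best complete path found so far and cuts any branch whose optimistic bound cannot beat it.

def backtrack_best(root, children, U, Q):
    # Depth-first backtracking with branch-and-bound pruning: a branch is cut
    # as soon as its optimistic bound cannot beat the best complete path so far.
    best = {'gain': float('-inf'), 'path': None}

    def dfs(node, depth, gain, path):
        if depth == Q:
            if gain > best['gain']:
                best['gain'], best['path'] = gain, list(path)
            return
        if gain + U(node) * (Q - depth) <= best['gain']:
            return                       # prune: even the optimistic bound loses
        for child, edge_gain in children(node):
            path.append(child)
            dfs(child, depth + 1, gain + edge_gain, path)
            path.pop()

    dfs(root, 0, 0.0, [root])
    return best['gain'], best['path']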

Related

How can the shortest path between two nodes in a balanced binary tree be affected by path 'weight'?

I am following an online introductory algorithms course with Udacity.
In the final assessment there is a question as follows:
In the shortest-path oracle described in Andrew Goldberg's interview,
each node has a label, which is a list of some other nodes in the
network and their distance to these nodes. These lists have the
property that:
(1) for any pair of nodes (x,y) in the network, their lists will have
at least one node z in common
(2) the shortest path from x to y will go through z. Given a graph G
that is a balanced binary tree, preprocess the graph to create such
labels for each node. Note that the size of the list in each label
should not be larger than log n for a graph of size n.
The full question can be found here.
Given the constraint of a balanced binary tree and the hint that the size should not be larger than log n, intuitively it seems that the label for a particular node would consist of all of its ancestors (and optionally itself, if it isn't a leaf).
However some additional instructor notes in the question adds:
Write your solution to work on weighted graphs. Note that in the test given, all the edges have a weight of one, which isn't particularly interesting.
So my question is:
How can the shortest path between two nodes in a binary tree be affected by whether the paths have weights or not?
Surely in a binary tree, the shortest path between two nodes is the unique simple path, and is unaffected by any weighting?
(unless weights can be negative and the path doesn't have to be simple in which case there is no shortest path?)
My basic solution works with the simple test provided in the question, but fails to pass the automatic grader which gives no feedback.
I'm obviously misunderstanding something, but what...
OK, so I think my initial reaction and the obvious answer are correct:
Positive weights cannot affect the shortest path between two nodes in a binary tree.
On the other hand, weights obviously do affect the shortest 'distance' between two nodes in a binary tree as compared to simply calculating 'distance' between nodes as the number of hops.
This is what the Udacity instruction was getting at.
It seems that the instruction to work with weighted binary trees was simply there to enable correct automatic grading of the code, which relied on using the labels to calculate the exact shortest 'distance' (which is affected by weights), as opposed to the shortest path (the list of nodes), which is not.
Once I modified my algorithm to take this into account and output the correct distance, it passed the grader.
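For later readers, here is a sketch of one way to implement the labels described above: each node's label is itself plus all of its ancestors, each paired with the cumulative weighted distance up to it, and a query takes the minimum of d(x,z) + d(z,y) over the entries the two labels share (in a tree this minimum is attained at the lowest common ancestor). The node attributes id, parent and edge_weight are assumptions about how the tree is stored, not part of the course's starter code.

def build_label(node):
    # Label = {ancestor_id: weighted distance from node up to that ancestor},
    # including the node itself at distance 0. In a balanced binary tree the
    # label has O(log n) entries. node.id / node.parent / node.edge_weight
    # (weight of the edge to the parent) are assumed attributes.
    label, dist, cur = {}, 0, node
    while cur is not None:
        label[cur.id] = dist
        if cur.parent is not None:
            dist += cur.edge_weight
        cur = cur.parent
    return label

def query_distance(label_x, label_y):
    # Shortest weighted distance between x and y: minimise d(x,z) + d(z,y)
    # over the nodes z that appear in both labels.
    common = label_x.keys() & label_y.keys()
    return min(label_x[z] + label_y[z] for z in common)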

Pathfinding - A path of less than or equal to n turns

Most of the time when implementing a pathfinding algorithm such as A*, we seek to minimize the travel cost along the path. We could also seek to find the optimal path with the fewest number of turns. This could be done by, instead of having a grid of location states, having a grid of location-direction states. For any given location in the old grid, we would have 4 states in that spot representing that location moving left, right, up, or down. That is, if you were expanding to a node above you, you would actually be adding the 'up' state of that node to the priority queue, since we've found the quickest route to this node when going UP. If you were going that direction anyway, we wouldn't add anything to the weight. However, if we had to turn from the current node to get to the expanded node, we would add a small epsilon to the weight such that two paths of equal distance would not be equal in cost if their number of turns differed. As long as epsilon is << the cost of moving between nodes, it's still the shortest path.
I now pose a similar problem, but with relaxed constraints. I no longer wish to find the shortest path, not even a path with the fewest turns. My only goal is to find a path of ANY length with numTurns <= n. To clarify, the goal of this algorithm would be to answer the question:
"Does there exist a path P from locations A to B such that there are fewer than or equal to n turns?"
I'm asking whether using some sort of greedy algorithm here would be helpful, since I do not require minimum distance nor turns. The problem is, if I'm NOT finding the minimum, the algorithm may search through more squares on the board. That is, normally a shortest path algorithm searches the least number of squares it has to, which is key for performance.
Are there any techniques that come to mind that would provide an efficient way (better or same as A*) to find such a path? Again, A* with fewest turns provides the "optimal" solution for distance and #turns. But for my problem, "optimal" is the fastest way the function can return whether there is a path of <=n turns between A and B. Note that there can be obstacles in the path, but other than that, moving from one square to another is the same cost (unless turning, as mentioned above).
I've been brainstorming, but I cannot think of anything other than A* with the turn states. It might not be possible to do better than this, but I thought there might be a clever exploitation of my relaxed conditions. I've even considered using just numTurns as the cost of moving on the board, but that could waste a lot of time searching dead paths. Thanks very much!
Edit: Final clarification - Path does not have to have least number of turns, just <= n. Path does not have to be a shortest path, it can be a huge path if it only has n turns. The goal is for this function to execute quickly, I don't even need to record the path. I just need to know whether there exists one. Thanks :)
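One concrete way to answer just the yes/no question is a breadth-first search over (cell, heading, turns used) states, reusing the location-direction idea from the first paragraph: a move that keeps the heading is free, a move that changes it consumes one of the n allowed turns, and any state that would exceed n is simply never enqueued. Below is a rough sketch on a grid of 0/1 cells (1 = obstacle); the grid layout and the convention that the very first move is not charged as a turn are assumptions on my part.

from collections import deque

def path_exists(grid, start, goal, n):
    # grid: list of lists, 0 = free, 1 = obstacle; start/goal: (row, col).
    # Returns True iff some path from start to goal makes at most n turns.
    rows, cols = len(grid), len(grid[0])
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # up, down, left, right
    # State = (cell, heading index, turns used); heading -1 means "not moved yet",
    # so the first move is not counted as a turn (assumed convention).
    start_state = (start, -1, 0)
    seen = {start_state}
    queue = deque([start_state])
    while queue:
        (r, c), heading, turns = queue.popleft()
        if (r, c) == goal:
            return True
        for i, (dr, dc) in enumerate(moves):
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            nturns = turns + (0 if heading in (-1, i) else 1)
            if nturns > n:
                continue                                 # would exceed the turn budget
            state = ((nr, nc), i, nturns)
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return False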

Using A* to solve Travelling Salesman

I've been tasked to write an implementation of the A* algorithm (heuristics provided) that will solve the travelling salesman problem. I understand the algorithm, it's simple enough, but I just can't see the code that implements it. I mean, I get it. Priority queue for the nodes, sorted by distance + heuristic(node), add the closest node on to the path. The question is, like, what happens if the closest node can't be reached from the previous closest node? How does one actually take a "graph" as a function argument? I just can't see how the algorithm actually functions, as code.
I read the Wikipedia page before posting the question. Repeatedly. It doesn't really answer the question- searching the graph is way, way different to solving the TSP. For example, you could construct a graph where the shortest node at any given time always results in a backtrack, since two paths of the same length aren't equal, whereas if you're just trying to go from A to B then two paths of the same length are equal.
You could construct a graph in which some nodes are never reached by always going closest-first.
I don't really see how A* applies to the TSP. I mean, finding a route from A to B, sure, I get that. But the TSP? I don't see the connection.
I found a solution here
Use minimum spanning tree as a heuristic.
Setup:
Initial State: Agent in the start city and has not visited any other city
Goal State: Agent has visited all the cities and reached the start city again
Successor Function: Generates all cities that have not yet been visited
Edge-cost: distance between the cities represented by the nodes, use this cost to calculate g(n).
h(n): distance to the nearest unvisited city from the current city + estimated distance to travel all the unvisited cities (MST heuristic used here) + nearest distance from an unvisited city to the start city. Note that this is an admissible heuristic function.
 
You may consider maintaining a list of visited cities and a list of unvisited cities to facilitate computations.
The confusion here is that the graph on which you are trying to solve the TSP is not the graph you are performing an A* search on.
See related: Sudoku solving algorithm C++
To solve this problem you need to:
Define your:
TSP states
TSP initial state
TSP goal state(s)
TSP state successor function
TSP state heuristic
Apply a generic A* solver to this TSP state graph
A quick example I can think up:
TSP states: list of nodes (cities) currently in the TSP cycle
TSP initial state: the list containing a single node, the travelling salesman's home town
TSP goal state(s): a state is a goal if it contains every node in the graph of cities
TSP successor function: can add any node (city) that isn't in the current cycle to the end of the list of nodes in the cycle to get a new state
The cost of the transition is equal to the cost of the edge you're adding to the cycle
TSP state heuristic: you decide
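Putting those pieces together, here is a rough sketch of a generic A* over such TSP states, using the MST-based bound from the earlier answer as the heuristic. The state is (current city, frozenset of visited cities), dist is assumed to be a complete symmetric distance matrix with a zero diagonal, and the tour is assumed to start and end at city 0; treat it as an illustration, not a reference implementation.

import heapq
from itertools import count

def mst_weight(nodes, dist):
    # Prim's algorithm: weight of a minimum spanning tree over `nodes`.
    nodes = list(nodes)
    if len(nodes) <= 1:
        return 0.0
    in_tree, total = {nodes[0]}, 0.0
    while len(in_tree) < len(nodes):
        w, v = min((dist[a][b], b) for a in in_tree for b in nodes if b not in in_tree)
        total += w
        in_tree.add(v)
    return total

def tsp_astar(dist):
    # dist: complete symmetric matrix with zero diagonal, at least 2 cities.
    # State = (current city, frozenset of visited cities); tour starts/ends at 0.
    n = len(dist)
    everyone = frozenset(range(n))

    def h(current, visited):
        unvisited = everyone - visited
        if not unvisited:
            return dist[current][0]                        # only the closing edge is left
        return (min(dist[current][u] for u in unvisited)   # reach the unvisited cities
                + mst_weight(unvisited, dist)              # span them (MST lower bound)
                + min(dist[u][0] for u in unvisited))      # and return to the start city

    tie = count()                                          # tie-breaker for the heap
    start_visited = frozenset([0])
    heap = [(h(0, start_visited), 0.0, next(tie), 0, start_visited, [0])]
    best_g = {(0, start_visited): 0.0}
    while heap:
        _, g, _, current, visited, path = heapq.heappop(heap)
        if current == 0 and visited == everyone and len(path) > 1:
            return g, path                                 # completed tour
        if visited == everyone:
            successors = [0] if current != 0 else []       # all visited: close the tour
        else:
            successors = [c for c in range(n) if c not in visited]
        for nxt in successors:
            g2 = g + dist[current][nxt]
            visited2 = visited | {nxt}
            if g2 < best_g.get((nxt, visited2), float('inf')):
                best_g[(nxt, visited2)] = g2
                heapq.heappush(heap, (g2 + h(nxt, visited2), g2, next(tie),
                                      nxt, visited2, path + [nxt]))
    return None

With an admissible heuristic like this, the first completed tour popped from the queue is optimal, although the state space is still exponential in the number of cities.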
If it's just a problem of understanding the algorithm and how it works you might want to consider drawing a graph on paper, assigning weights to it and drawing it out. Also you can probably find some animations that show Dijkstra's shortest path, Wikipedia has a good one. The only difference between Dijkstra and A* is the addition of the heuristic, and you stop the search as soon as you reach the target node. As far as using it to solve the TSP, good luck with that!
Think about this a little more abstractly. Forget about A* for a moment, it's just Dijkstra's with a heuristic anyway. Before, you wanted to get from A to B. What was your goal? To get to B. The goal was to get to B with the least cost. At any given point, what was your current "state"? Probably just your location on the graph.
Now, you want to start at A, then go to both B and C. What is your goal now? To pass over both B and C, maintaining least cost. You can generalize this with more nodes: D, E, F, ... or just N nodes. Now, at any given point, what is your current "state"? This is critical: it ISN'T just your location in the graph--it's also which of B or C or whatever nodes you have visited so far in the search.
Implement your original algorithm so that it calls some function asking if it has reached "the goal state" after making a move. Before, the function would have just said "yes, you're at state B, therefore you are at the goal". But now, let that function return "yes, you're at the goal state" if the search's path has passed over each of the points of interest. It'll know whether or not the search has passed over all points of interest because that's included in the current state.
After you get that, improve the search with some heuristic, and A* it up.
To answer one of your questions...
To pass a graph as a function argument, you have several options. You could pass a pointer to an array containing all the nodes. You could pass just the one starting node and work from there, if it's a fully connected graph. And finally, you could write a graph class with whatever data structures you need inside it, and pass a reference to an instance of that class.
As for your other question about closest nodes, isn't part of A* search that it will backtrack as needed? Or you could implement your own sort of backtracking to handle that kind of situation.
The question is, like, what happens if the closest node can't be reached from the previous closest node?
This step isn't necessary. As in, you aren't computing a path from the previous closest to the current closest, you are trying to get to your goal node, and the current closest is the only thing that matters (e.g. the algorithm doesn't care that last iteration you were 100km away, because this iteration you are only 96km away).
As a broad introduction, A* doesn't directly construct a path: it explores until it definitely knows that the path is contained within the region it has explored, and then constructs the path based on the information recorded during the exploration.
(I'm going to use the code in the Wikipedia article as a reference implementation to aid my explanation.)
You have two sets of nodes: closedset and openset.
closedset holds nodes that have been fully evaluated, that is, you know exactly how far they are from start and all their neighbours are in one of the two sets. Thus there is no more computation you can do with them, and so we can (sort of) ignore them. (Basically these are completely contained within the border.)
openset holds "border" nodes, you know how far these are from start, but you haven't touched their neighbours yet, so they are on the edge of your search so far.
(Implicitly, there is a third set: completely untouched nodes. But you don't really touch them until they are in openset so they don't matter.)
At a given iteration, if you've got nodes to explore (that is, nodes in openset), you need to work out which one to explore. This is the job of the heuristic, it basically gives you a hint about which point on the border will be the best to explore next by telling you which node it thinks will have the shortest path to goal.
The previous closest node is irrelevant, it just expanded the border a bit, adding new nodes to openset. These new nodes are now candidates for the closest node in this iteration.
At first, openset only contains start, but then you iterate and at each step the border is expanded a little (in the most promising direction), until you eventually reach goal.
When A* is actually doing the exploration, it doesn't worry about which nodes came from where. It doesn't need to, because it knows their distance from start and the heuristic function and that's all it needs.
However, to reconstruct the path later, you need to have some record of it; this is what camefrom is. For a given node, camefrom links it to the neighbour through which the cheapest known path from start reaches it, so you can reconstruct the shortest path by following the links backwards from goal.
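Pulling the description above into code: a compact A* where the graph is passed as a plain adjacency dict {node: [(neighbour, cost), ...]} (one answer to the "how do I pass a graph" question below), camefrom is recorded exactly as described, and h is whatever heuristic you supply (h = lambda n: 0 turns it into Dijkstra). Like the Wikipedia version with a closed set, it assumes a consistent heuristic.

import heapq

def a_star(graph, start, goal, h):
    # graph: adjacency dict {node: [(neighbour, edge_cost), ...]}
    # h(node): heuristic estimate of the remaining cost from node to goal.
    # Node labels are assumed orderable (e.g. strings or ints) so heap ties break cleanly.
    g = {start: 0.0}                     # best known cost from start
    camefrom = {}                        # predecessor on that best known path
    openset = [(h(start), start)]        # "border" nodes, keyed by g + h
    closed = set()                       # fully evaluated nodes
    while openset:
        _, current = heapq.heappop(openset)
        if current == goal:
            path = [current]             # walk the camefrom links backwards
            while current in camefrom:
                current = camefrom[current]
                path.append(current)
            return list(reversed(path))
        if current in closed:
            continue                     # stale queue entry
        closed.add(current)
        for neighbour, cost in graph.get(current, []):
            tentative = g[current] + cost
            if tentative < g.get(neighbour, float('inf')):
                g[neighbour] = tentative
                camefrom[neighbour] = current
                heapq.heappush(openset, (tentative + h(neighbour), neighbour))
    return None                          # goal unreachable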
How does one actually take a "graph" as a function argument?
By passing one of the representations of a graph.
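For example, two of the usual representations written out as plain Python literals (the node names and weights here are just illustrative):

# Adjacency list: {node: [(neighbour, weight), ...]}, what the A* sketch above expects.
graph = {
    'A': [('B', 1.0), ('C', 4.0)],
    'B': [('A', 1.0), ('C', 2.0)],
    'C': [('A', 4.0), ('B', 2.0)],
}

# Adjacency matrix: matrix[i][j] = weight of the edge i -> j (inf if there is none).
INF = float('inf')
matrix = [[0.0, 1.0, 4.0],
          [1.0, 0.0, 2.0],
          [4.0, 2.0, 0.0]]

You could also wrap either of these in a small graph class and pass an instance of that, as mentioned above.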
I don't really see how A* applies to the TSP. I mean, finding a route from A to B, sure, I get that. But the TSP? I don't see the connection.
You need a different heuristic and a different end condition: goal is no longer a single node any more, but the state of having everything connected; and your heuristic is some estimate of the length of the shortest path connecting the remaining nodes.

Find the shortest path in a graph which visits certain nodes

I have an undirected graph with about 100 nodes and about 200 edges. One node is labelled 'start', one is labelled 'end', and there are about a dozen labelled 'mustpass'.
I need to find the shortest path through this graph that starts at 'start', ends at 'end', and passes through all of the 'mustpass' nodes (in any order).
( http://3e.org/local/maize-graph.png / http://3e.org/local/maize-graph.dot.txt is the graph in question - it represents a corn maze in Lancaster, PA)
Everyone else comparing this to the Travelling Salesman Problem probably hasn't read your question carefully. In TSP, the objective is to find the shortest cycle that visits all the vertices (a Hamiltonian cycle) -- it corresponds to having every node labelled 'mustpass'.
In your case, given that you have only about a dozen labelled 'mustpass', and given that 12! is rather small (479001600), you can simply try all permutations of only the 'mustpass' nodes, and look at the shortest path from 'start' to 'end' that visits the 'mustpass' nodes in that order -- it will simply be the concatenation of the shortest paths between every two consecutive nodes in that list.
In other words, first find the shortest distance between each pair of vertices (you can use Dijkstra's algorithm or others, but with those small numbers (100 nodes), even the simplest-to-code Floyd-Warshall algorithm will run in time). Then, once you have this in a table, try all permutations of your 'mustpass' nodes, and the rest.
Something like this:
// Precomputation: Find all-pairs shortest paths, e.g. using Floyd-Warshall
n = number of nodes
for i = 1 to n: for j = 1 to n: d[i][j] = INF
for k = 1 to n:
    for i = 1 to n:
        for j = 1 to n:
            d[i][j] = min(d[i][j], d[i][k] + d[k][j])
// That *really* gives the shortest distance between every pair of nodes! :-)

// Now try all permutations
shortest = INF
for each permutation a[1], a[2], ... a[k] of the 'mustpass' nodes:
    shortest = min(shortest, d['start'][a[1]] + d[a[1]][a[2]] + ... + d[a[k]]['end'])
print shortest
(Of course that's not real code, and if you want the actual path you'll have to keep track of which permutation gives the shortest distance, and also what the all-pairs shortest paths are, but you get the idea.)
It will run in at most a few seconds on any reasonable language :)
[If you have n nodes and k 'mustpass' nodes, its running time is O(n^3) for the Floyd-Warshall part and O(k!·n) for the all-permutations part, and 100^3 + (12!)(100) is practically peanuts unless you have some really restrictive constraints.]
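For the curious, a runnable version of the pseudocode above; here graph is assumed to be a dict {node: [(neighbour, weight), ...]} with each undirected edge listed in both directions, and 'start', 'end' and the mustpass list are whatever labels your graph uses.

from itertools import permutations

def shortest_mustpass_distance(graph, start, end, mustpass):
    # graph: {node: [(neighbour, weight), ...]}, undirected edges listed both ways.
    nodes = list(graph)
    INF = float('inf')
    # All-pairs shortest distances via Floyd-Warshall.
    d = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for u in nodes:
        for v, w in graph[u]:
            d[u][v] = min(d[u][v], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # Try every visiting order of the mustpass nodes, summing precomputed legs.
    best = INF
    for order in permutations(mustpass):
        stops = [start, *order, end]
        best = min(best, sum(d[a][b] for a, b in zip(stops, stops[1:])))
    return best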
Run Dijkstra's algorithm to find the shortest paths between all of the critical nodes (start, end, and must-pass), then a depth-first traversal should tell you the shortest path through the resulting subgraph that touches all of the nodes start ... mustpasses ... end.
This is two problems... Steven Lowe pointed this out, but didn't give enough respect to the second half of the problem.
You should first discover the shortest paths between all of your critical nodes (start, end, mustpass). Once these paths are discovered, you can construct a simplified graph, where each edge in the new graph is a path from one critical node to another in the original graph. There are many pathfinding algorithms that you can use to find the shortest path here.
Once you have this new graph, though, you have exactly the Traveling Salesperson problem (well, almost... No need to return to your starting point). Any of the posts concerning this, mentioned above, will apply.
Actually, the problem you posted is similar to the traveling salesman, but I think closer to a simple pathfinding problem. Rather than needing to visit each and every node, you simply need to visit a particular set of nodes in the shortest time (distance) possible.
The reason for this is that, unlike the traveling salesman problem, a corn maze will not allow you to travel directly from any one point to any other point on the map without needing to pass through other nodes to get there.
I would actually recommend A* pathfinding as a technique to consider. You set this up by deciding which nodes have access to which other nodes directly, and what the "cost" of each hop from a particular node is. In this case, it looks like each "hop" could be of equal cost, since your nodes seem relatively closely spaced. A* can use this information to find the lowest cost path between any two points. Since you need to get from point A to point B and visit about 12 in between, even a brute force approach using pathfinding wouldn't hurt at all.
Just an alternative to consider. It does look remarkably like the traveling salesman problem, and those are good papers to read up on, but look closer and you'll see that it's only overcomplicating things. ^_^ This coming from the mind of a video game programmer who's dealt with these kinds of things before.
This is not a TSP problem and not NP-hard because the original question does not require that must-pass nodes are visited only once. This makes the answer much, much simpler to just brute-force after compiling a list of shortest paths between all must-pass nodes via Dijkstra's algorithm. There may be a better way to go but a simple one would be to simply work a binary tree backwards. Imagine a list of nodes [start,a,b,c,end]. Sum the simple distances [start->a->b->c->end] this is your new target distance to beat. Now try [start->a->c->b->end] and if that's better set that as the target (and remember that it came from that pattern of nodes). Work backwards over the permutations:
[start->a->b->c->end]
[start->a->c->b->end]
[start->b->a->c->end]
[start->b->c->a->end]
[start->c->a->b->end]
[start->c->b->a->end]
One of those will be shortest.
(where are the 'visited multiple times' nodes, if any? They're just hidden in the shortest-path initialization step. The shortest path between a and b may contain c or even the end point. You don't need to care)
Andrew Top has the right idea:
1) Dijkstra's algorithm
2) Some TSP heuristic.
I recommend the Lin-Kernighan heuristic: it's one of the best known for any NP-complete problem. The only other thing to remember is that after you expand the graph back out after step 2, you may have loops in your expanded path, so you should go around short-circuiting those (look at the degree of vertices along your path).
I'm actually not sure how good this solution will be relative to the optimum. There are probably some pathological cases to do with short circuiting. After all, this problem looks a LOT like Steiner Tree: http://en.wikipedia.org/wiki/Steiner_tree and you definitely can't approximate Steiner Tree by just contracting your graph and running Kruskal's for example.
Considering that the number of nodes and edges is relatively small, you can probably calculate every possible path and take the shortest one.
Generally this is known as the travelling salesman problem, and it has a non-deterministic polynomial runtime, no matter what algorithm you use.
http://en.wikipedia.org/wiki/Traveling_salesman_problem
The question talks about must-pass nodes in ANY order. I had been searching for a solution for a defined order of must-pass nodes. I found my answer, and since no question on Stack Overflow asked about this case, I'm posting it here so that as many people as possible can benefit from it.
If the order of the must-pass nodes is defined, then you could run Dijkstra's algorithm multiple times. For instance, let's assume you have to start from s, pass through k1, k2 and k3 (in that order) and stop at e. Then what you could do is run Dijkstra's algorithm between each consecutive pair of nodes. The cost and path would be given by:
dijkstras(s, k1) + dijkstras(k1, k2) + dijkstras(k2, k3) + dijkstras(k3, e)
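In code this is just a loop over consecutive pairs, concatenating the sub-paths. Here dijkstra(graph, u, v) is an assumed helper returning (cost, path); any shortest-path routine will do.

def ordered_mustpass(graph, stops, dijkstra):
    # stops: e.g. [s, k1, k2, k3, e] in the required visiting order.
    # dijkstra(graph, u, v) -> (cost, path) is an assumed helper.
    total, full_path = 0.0, [stops[0]]
    for u, v in zip(stops, stops[1:]):
        cost, path = dijkstra(graph, u, v)
        total += cost
        full_path += path[1:]            # drop the duplicated junction node
    return total, full_path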
How about using brute force on the dozen 'must visit' nodes? You can cover all the possible orderings of the 12 nodes easily enough, and this leaves you with an optimal circuit you can follow to cover them.
Now your problem is simplified to one of finding optimal routes from the start node to the circuit, which you then follow around until you've covered them, and then find the route from that to the end.
Final path is composed of :
start -> path to circuit* -> circuit of must visit nodes -> path to end* -> end
You find the paths I marked with * like this
Do an A* search from the start node to every point on the circuit
for each of these do an A* search from the next and previous node on the circuit to the end (because you can follow the circuit round in either direction)
What you end up with is a lot of search paths, and you can choose the one with the lowest cost.
There's lots of room for optimization by caching the searches, but I think this will generate good solutions.
It doesn't go anywhere near looking for an optimal solution though, because that could involve leaving the must visit circuit within the search.
One thing that is not mentioned anywhere, is whether it is ok for the same vertex to be visited more than once in the path. Most of the answers here assume that it's ok to visit the same edge multiple times, but my take given the question (a path should not visit the same vertex more than once!) is that it is not ok to visit the same vertex twice.
So a brute force approach would still apply, but you'd have to remove vertices already used when you attempt to calculate each subset of the path.

Suggestions for KSPA on undirected graph

There is a custom implementation of KSPA which needs to be re-written. The current implementation uses a modified Dijkstra's algorithm whose pseudocode is roughly explained below. I think it is commonly known as KSPA using an edge-deletion strategy (I am a novice in graph theory).
Step 1: Calculate the shortest path between the given pair of nodes using Dijkstra's algorithm. k = 0 here.
Step 2: Set k = 1.
Step 3: Extract all the edges from all the k-1 shortest-path trees and add them to a linked list Edge_List.
Step 4: Create a combination of k edges from Edge_List to be deleted at once, such that each edge belongs to a different SPT (shortest-path tree). This can be done by inspecting the k value for each edge of the combination considered; the k value has to be different for each edge of the chosen combination.
Step 5: Temporarily delete the combination of edges chosen in the above step from the graph in memory.
Step 6: Re-run Dijkstra for the same pair of nodes as in Step 1.
Step 7: Add the resulting path to a temporary list of paths, Paths_List.
Step 8: Restore the deleted edges to the graph.
Step 9: Go to Step 4 to get another combination of edges for deletion, until all unique combinations are exhausted. This is nothing but choosing r edges at a time among n edges => nCr.
Step 10: The (k+1)-th shortest path is Minimum(Paths_List).
Step 11: k = k + 1; go to Step 3, until k < N.
Step 12: STOP
As I understand the algorithm, to get the k-th shortest path, k-1 SPTs are to be found between the source-destination pair, and k-1 edges, one from each SPT, are to be deleted simultaneously for every combination.
Clearly this algorithm has combinatorial complexity and clogs the server on large graphs. People have suggested Eppstein's algorithm (http://www.ics.uci.edu/~eppstein/pubs/Epp-SJC-98.pdf), but that paper is written in terms of a 'digraph' and I did not see a mention that it works only for digraphs. I just wanted to ask folks here if anyone has used this algorithm on an undirected graph.
If not, are there good algorithms (in terms of time-complexity) to implement KSPA on an undirected graph?
Thanks in advance,
Time complexity: O(K*(E*log(K)+V*log(V)))
Memory complexity of O(K*V) (+O(E) for storing the input).
We perform a modified Dijkstra as follows:
For each node, instead of keeping the best currently-known cost of the route from the start node, we keep the K best routes from the start node.
When updating a node's neighbours, we don't check whether it improves the best currently known path (like Dijkstra does); we check whether it improves the worst of the K' best currently known paths.
After we have already processed the first of a node's K best routes, we don't need to find K best routes any more, but only have K-1 remaining, and after another one, K-2. That's what I called K'.
For each node we keep two priority queues for the K' best currently known path lengths.
In one priority queue the shortest path is on top. We use this priority queue to determine which of the K' is best and will be used as the node's representative in the regular Dijkstra priority queue.
In the other priority queue the longest path is on top. We use this one to compare candidate paths against the worst of the K' paths.
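A much simpler cousin of this idea, a single global priority queue where each node may be settled up to K times, already yields the K shortest path lengths from source to target, at the price of allowing paths that revisit nodes (so it is not loopless KSPA). The sketch below shows only that counting trick, not the two-queue bookkeeping described above; graph is assumed to be an adjacency dict with non-negative weights.

import heapq

def k_shortest_lengths(graph, source, target, K):
    # graph: {node: [(neighbour, weight), ...]} with non-negative weights.
    # Returns up to K path lengths from source to target, shortest first.
    # Paths may revisit nodes, unlike a loopless K-shortest-paths algorithm.
    counts = {}                          # how many times each node has been settled
    heap = [(0.0, source)]
    lengths = []
    while heap and len(lengths) < K:
        dist, u = heapq.heappop(heap)
        settled = counts.get(u, 0)
        if settled >= K:
            continue                     # u already produced its K best lengths
        counts[u] = settled + 1
        if u == target:
            lengths.append(dist)
        for v, w in graph.get(u, []):
            if counts.get(v, 0) < K:
                heapq.heappush(heap, (dist + w, v))
    return lengths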
