Proof of the A* algorithm's optimality when the heuristic always underestimates

I understand why the A* algorithm always gives the optimal path to a goal state when the heuristic always underestimates, but I can't construct a formal proof of it.
As far as I understand, as each path is explored deeper and deeper, the accuracy of f(n) increases until the goal state is reached, where it is 100% accurate. Also, no path is wrongly ignored, since the estimate never exceeds the actual cost; this leads to the optimal path. But how should I construct a proof of this?

The main idea of the proof is that when A* finds a path, it has found a path whose estimate is lower than the estimate of any other possible path. Since the estimates are optimistic, the other paths can be safely ignored.
Also, A* is only optimal if two conditions are met:
The heuristic is admissible: it never overestimates the cost to the goal.
The heuristic is consistent (monotonic): for every node n and every successor n' of n, h(n) ≤ cost(n, n') + h(n').
You can prove the optimality to be correct by assuming the opposite, and expanding the implications.
Assume that the path given by A* is not optimal even though the heuristic is admissible and monotonic, and think about what that implies (you'll soon find yourself reaching a contradiction); thus your original assumption is reduced to an absurdity.
From that you can conclude that your original assumption was false; that is, A* is optimal under the above conditions. Q.E.D.
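The contradiction can be made concrete in symbols. A sketch (my notation, not from the answer): let C* be the optimal path cost and h*(n) the true remaining cost from n, with h(n) ≤ h*(n) by admissibility.

```latex
\begin{align*}
&\text{Suppose A* returns a suboptimal goal } G_2, \text{ so } g(G_2) > C^*. \\
&\text{Some node } n \text{ on an optimal path is still on the open list, and} \\
&\qquad f(n) = g(n) + h(n) \le g(n) + h^*(n) = C^*. \\
&\text{Since } h(G_2) = 0, \text{ we get } f(G_2) = g(G_2) > C^* \ge f(n), \\
&\text{so A* would have expanded } n \text{ before selecting } G_2, \text{ a contradiction.}
\end{align*}
```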

Consider that last step, the one that completes the optimal path.
Why must A* choose that path? Or, put another way, why must A* avoid choosing a sub-optimal path that reaches the goal?
Hint: this is the reason the heuristic needs to be admissible. Note that it is ok to choose a sub-optimal path, so long as it doesn't complete the path (why?).
That should give you some idea of how to fashion a proof.

Related

Dynamic Programming: Why the need for optimal substructure

I was revisiting my notes on dynamic programming. It's basically a memoized recursion technique, which stores away the solutions to smaller subproblems for later reuse in computing solutions to relatively larger subproblems.
The question I have is that in order to apply DP to a recursive problem, it must have optimal substructure. This basically necessitates that an optimal solution to a problem contains optimal solutions to its subproblems.
Is it possible otherwise? I mean, have you ever seen a case where the optimal solution to a problem does not contain optimal solutions to its subproblems?
Please share some examples, if you know any, to deepen my understanding.
In dynamic programming, a given problem has the optimal substructure property if an optimal solution to it can be obtained by using optimal solutions of its subproblems.
For example, the shortest path problem has the following optimal substructure property: if a node X lies on the shortest path from a source node U to a destination node V, then the shortest path from U to V is the combination of the shortest path from U to X and the shortest path from X to V.
But the longest path problem doesn't have the optimal substructure property; i.e., the longest (simple) path between two nodes need not be built from the longest paths between the intermediate nodes.
For example, the longest path q->r->t is not a combination of the longest path from q to r and the longest path from r to t, because the longest path from q to r is q->s->t->r.
So here, the optimal solution to a problem does not contain optimal solutions to its subproblems.
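This failure can be checked by brute force. A small sketch on the four-node graph implied by the example's path names (the edge set is my reconstruction, so treat it as illustrative):

```python
# Undirected graph implied by the example: edges q-r, r-t, q-s, s-t.
graph = {"q": {"r", "s"}, "r": {"q", "t"}, "s": {"q", "t"}, "t": {"r", "s"}}

def simple_paths(u, v, visited=None):
    """Enumerate every simple path from u to v by depth-first search."""
    visited = visited or [u]
    if u == v:
        yield list(visited)
        return
    for w in graph[u]:
        if w not in visited:
            yield from simple_paths(w, v, visited + [w])

def longest(u, v):
    return max(simple_paths(u, v), key=len)

# The longest q->r path already visits t, and the longest r->t path
# already visits q, so gluing them together repeats nodes: a longest
# q->t path cannot be assembled from these sub-solutions.
print(longest("q", "r"))  # ['q', 's', 't', 'r']
print(longest("r", "t"))  # ['r', 'q', 's', 't']
```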
For more details you can read:
Longest path problem on Wikipedia
Optimal substructure on Wikipedia
You're perfectly right that the definitions are imprecise. DP is a technique for getting algorithmic speedups rather than an algorithm in itself. The term "optimal substructure" is a vague concept. (You're right again here!) To wit, every loop can be expressed as a recursive function: each iteration solves a subproblem for the successive one. Is every algorithm with a loop a DP? Clearly not.
What people actually mean by "optimal substructure" and "overlapping subproblems" is that subproblem results are used often enough to decrease the asymptotic complexity of the solution. In other words, the memoization is useful! In most cases the subtle implication is a decrease from exponential to polynomial time, e.g. from O(k^n) to O(n^p), or similar.
Ex: There is an exponential number of paths between two nodes in a dense graph. DP finds the shortest path looking at only a polynomial number of them because the memos are extremely useful in this case.
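Here is a minimal sketch of that effect on a made-up layered DAG (a toy construction of mine, not anything from the answer): the graph has 2^L source-to-sink paths, but the memoized recursion solves only O(L) subproblems.

```python
import random
from functools import lru_cache

L = 12  # layers; 2**12 = 4096 distinct source-to-sink paths
random.seed(0)
# weight[(layer, i, j)]: cost of the edge from node i in `layer`
# to node j in layer + 1 (two nodes per layer, fully connected).
weight = {(l, i, j): random.randint(1, 9)
          for l in range(L) for i in range(2) for j in range(2)}

@lru_cache(maxsize=None)
def to_end(layer, node):
    """Cheapest cost from `node` in `layer` to the final layer (memoized)."""
    if layer == L:
        return 0
    return min(weight[layer, node, j] + to_end(layer + 1, j) for j in range(2))

best = min(to_end(0, i) for i in range(2))
# Only 2 * (L + 1) subproblems were ever solved, despite 4096 paths.
print(best, to_end.cache_info().currsize)
```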
On the other hand, Traveling Salesman can be expressed as a memoized function (e.g. see this discussion), where the memos save a factor of O(2^n) of the work. But the number of TS paths through n cities is O(n!). This is so much bigger that the asymptotic run time is still super-exponential: O(n!)/O(2^n) = O(n!). Such an algorithm is generally not called a Dynamic Program even though it follows very much the same pattern as the DP for shortest paths. Apparently it's only a DP if it gives a nice result!
To my understanding, this 'optimal substructure' property is necessary not only for Dynamic Programming, but to obtain a recursive formulation of the solution in the first place. Note that in addition to the Wikipedia article on Dynamic Programming, there is a separate article on the optimal substructure property. To make things even more involved, there is also an article about the Bellman equation.
You could solve the Traveling Salesman Problem by choosing the nearest city at each step, but that is the wrong method.
The whole idea is to narrow the problem down to a relatively small set of candidates for the optimal solution and use "brute force" on that set.
So it had better be the case that solutions of the smaller subproblems are sufficient to solve the bigger problem.
This is expressed via a recursion as a function of the optimal solutions of smaller subproblems.
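A tiny concrete instance of that pattern (hypothetical coin denominations, purely illustrative): the optimum for an amount is written directly as a function of optima of strictly smaller amounts, and a greedy choice would get it wrong.

```python
from functools import lru_cache

coins = (1, 3, 4)  # hypothetical denominations

@lru_cache(maxsize=None)
def min_coins(amount):
    """Fewest coins summing to `amount`: the optimal answer is built
    from optimal answers to smaller subproblems (a Bellman-style recursion)."""
    if amount == 0:
        return 0
    return 1 + min(min_coins(amount - c) for c in coins if c <= amount)

print(min_coins(6))  # 2, via 3 + 3 (greedy largest-first gives 4 + 1 + 1 = 3 coins)
```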
Answering this question:

Is it possible otherwise? I mean, have you ever seen a case where the optimal solution to a problem does not contain optimal solutions to its subproblems?

No, it is not possible, and it can even be proven.
You can try to apply dynamic programming to any recursive problem, but you will not get any better result if it doesn't have the optimal substructure property. In other words, dynamic programming is not a useful methodology to apply to a problem which doesn't have the optimal substructure property.

Is there a case where a heuristic approach is guaranteed to provide an optimal solution?

As it is said (e.g., on Wikipedia), heuristics provide solutions which are not guaranteed to be optimal. I think this is true in many cases, but what if we use, for example, a heuristic cost estimate (like the one in the A* algorithm) to reach a solution which can be proven to be optimal? In that case, should we still refer to that algorithm as a heuristic?
Given a heuristic cost-estimation function that obeys certain laws, A* is an algorithm in the strict sense of a method of computation that always gives the right answer to a prespecified set of problems.(*) The fact that it uses a heuristic does not make A* itself a heuristic.
(*) There are cases where an optimal path between A and B does not exist and A* will run forever; for such problems, A* is a semi-algorithm.

Usage of admissible and consistent heuristics in A*

Does someone have an easy and/or intuitive explanation why you have to use an admissible heuristic in A* and why you "should" use a consistent heuristic?
Admissible
How much we think it will cost to get to the goal isn't more than it will actually cost.
Why do we need admissibility?
If every estimated cost is at most the actual cost, then the optimal path's estimated total cost is at most its true cost, which in turn is less than the true cost of any non-optimal path. Since we always explore the node with the lowest estimated total cost first, and on reaching the goal we only look at the true cost, we can never finish at the goal through a non-optimal path.
If we think it will cost more than it will actually cost, we could actually end up taking a more expensive path. The expected cost of path A could be more than the expected cost of path B, yet path A can have a lower actual cost. This would mean we'd explore the non-optimal path B first.
If a heuristic is not admissible, we might theoretically never even get to the goal (or at least we'd explore the entire possible space before getting there). While this is unlikely with a reasonable heuristic, it is possible to create a heuristic where we think it will cost less to get to the goal the further away we are, and the expected remaining cost decreases faster than the actual cost when moving away from the goal. As a simple (finite) example: heuristic = 100000000 - 2 * actual.
Consistent
No move we make should decrease the expected total cost. Put in another way: any move made should decrease the heuristic by no more than the cost of that move.
The expected total cost (f(n)) is the expected remaining cost (h(n)) plus the cost so far (g(n)). As an example, we may think the total time to get to the goal would be 10 minutes. After travelling 5 minutes, it's fine if we think the total time (including the 5 minutes travelled) will be 11 minutes, but we shouldn't think the total time is 9 minutes.
Note: for the remaining cost, we only consider how long we think it will take. How long it actually takes may be different.
In addition to the above, consistent heuristics also need to have an expected remaining cost of 0 when we're already at the goal.
A consistent heuristic is also admissible (but not necessarily the other way around). This follows from the above.
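The implication can be written out as a short induction along any path n_1, …, n_k ending at the goal:

```latex
\begin{align*}
h(n_k) &= 0 \quad \text{(at the goal)} \\
h(n_i) &\le c(n_i, n_{i+1}) + h(n_{i+1}) \quad \text{(consistency, for each } i\text{)} \\
\Rightarrow\; h(n_1) &\le \sum_{i=1}^{k-1} c(n_i, n_{i+1}) = \text{true cost of the path,}
\end{align*}
```

and since this holds for the cheapest such path, a consistent heuristic never overestimates the remaining cost, i.e. it is admissible.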
Why do we need consistency?
If we keep making moves (towards or away from the goal), we want the cost to increase. Otherwise we could end up moving away from the goal and exploring a whole bunch of unpromising paths before we finally find the optimal one.
Note: if a heuristic is admissible but not consistent, we won't find a non-optimal path to the goal, but finding the optimal path may take a while.
Examples
h(n) = heuristic, i.e. expected (remaining) cost from n to goal
g(n) = cost (so far) from the start to n
t(n) = true (remaining) cost from n to goal
h(n) = 10 (except h(goal) = 0): Not admissible if moves cost less than 10, since there would be some n where t(n) < 10. Not consistent, since moving to the goal would decrease the heuristic from 10 to 0, yet the move to do so would cost less than 10. However, if every move costs at least 10, this is both admissible and consistent.
h(n) = 0: Admissible since (for positive costs) it can't cost less than 0 to get to the goal. Consistent since the heuristic will never decrease. Not particularly useful though. This would be equivalent to Dijkstra's algorithm.
h(n) = t(n) / 2: Admissible since the estimated cost is lower than the true cost. Consistent since a move of cost c reduces the true remaining cost t(n) by at most c, so it reduces h(n) = t(n) / 2 by at most c / 2 (moving away from the goal even increases h(n)); thus no move decreases the total estimated cost.
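These definitions can also be checked mechanically. A small sketch with the Manhattan-distance heuristic on a 4-connected grid with unit move costs (a toy setup of mine, not from the answer):

```python
# Check consistency of Manhattan distance on a 5x5 grid, unit move costs.
N = 5
goal = (4, 4)

def h(p):
    """Manhattan distance to the goal: the classic admissible grid heuristic."""
    return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

def neighbors(p):
    x, y = p
    for q in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= q[0] < N and 0 <= q[1] < N:
            yield q

cells = [(x, y) for x in range(N) for y in range(N)]
# Consistency: h(n) <= cost(n, n') + h(n') for every edge (cost is 1 here),
# plus h(goal) == 0. Admissibility then follows automatically.
consistent = h(goal) == 0 and all(h(p) <= 1 + h(q)
                                  for p in cells for q in neighbors(p))
print(consistent)  # True
```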
This is simply what allows you to say that the result found is "optimal". You can use whatever heuristic you want; it will just be harder to prove that the result found is optimal.
For example, when you overestimate the distance to the target node, the actual distance along some unexplored route may be smaller than the estimated one. Therefore the result found might be marked as "optimal" while there is still a solution that is better.

Weighted A* on wikipedia

I guess I found a problem on this wiki page. I think the

have a cost of at most ε times

in the Weighted A* algorithm part should be

have a cost less than ε times

instead, because here it assumes ε > 1. But I am not sure about it; I just want to hear anybody's opinion on this.
Thank you for your help in advance :)
I believe the paragraph starting "Weighted A*. If ha(n) is" is correct, and a guarantee that the cost of the path found is at most ε times the cost of the best path is the sort of guarantee you want: since you are looking for the least-cost path and trying to reduce CPU time, you are settling for a sub-optimal (higher-cost) solution but obtaining a guarantee that the cost is not too bad, at most ε times the cost of the best path.
I do think that there is an inconsistency between the use of ε in this paragraph and that in the paragraph above; I don't know whether that is a mistake or whether it derives from an unfortunate difference of conventions between weighted A* and a more general definition of approximate solutions.
The paragraph is consistent with the notes at http://inst.eecs.berkeley.edu/~cs188/sp11/slides/SP11%20cs188%20lecture%204%20--%20CSPs%206PP.pdf (bottom of page 5 of the PDF) and with a rough proof. When weighted A* thinks it has a solution with cost g(x), all nodes still in play must have a predicted cost g(y) + ε·h(y) of at least this. To get the largest possible error, assume that g(y) is zero and that ε·h(y) = g(x) for the correct solution y, and we see that the solution A* thinks it has found is at most ε times as expensive as y, since we presume that the original h() is admissible and therefore a lower bound on the remaining cost.
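The rough proof can be stated in symbols. Here C* is the optimal cost and x the solution weighted A* returns (my notation):

```latex
\begin{align*}
&\text{When } x \text{ is selected, every open node } y \text{ satisfies } g(y) + \varepsilon\, h(y) \ge g(x). \\
&\text{Pick } y \text{ on an optimal path; admissibility gives } h(y) \le C^* - g(y), \text{ so} \\
&\qquad g(x) \le g(y) + \varepsilon\, h(y) \le g(y) + \varepsilon \left( C^* - g(y) \right) \le \varepsilon\, C^* , \\
&\text{the last step using } \varepsilon \ge 1 \text{ and } g(y) \ge 0.
\end{align*}
```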

Correct formulation of the A* algorithm

I'm looking at definitions of the A* path-finding algorithm, and it seems to be defined somewhat differently in different places.
The difference is in the action performed when going through the successors of a node, and finding that a successor is on the closed list.
One approach (suggested by Wikipedia, and this article) says: if the successor is on the closed list, just ignore it
Another approach (suggested here and here, for example) says: if the successor is on the closed list, examine its cost. If it's higher than the currently computed score, remove the item from the closed list for future examination.
I'm confused: which method is correct? Intuitively, the first makes more sense to me, but I wonder about the difference in definition. Is one of the definitions wrong, or are they somehow isomorphic?
The first approach is optimal only if the optimal path to any repeated state is always the first to be followed. This property holds if the heuristic function has the property of consistency (also called monotonicity). A heuristic function is consistent if, for every node n and every successor n' of n, the estimated cost of reaching the goal from n is no greater than the step cost of getting to n' from n plus the estimated cost of reaching the goal from n'.
The second approach is optimal if the heuristic function is merely admissible, that is, it never overestimates the cost to reach the goal.
Every consistent heuristic function is also admissible. Although consistency is a stricter requirement than admissibility, one has to work quite hard to concoct heuristic functions that are admissible but not consistent.
Thus, even though the second approach is more general as it works with a strictly larger subset of heuristic functions, the first approach is usually sufficient in practice.
Reference: the subsection A* search: Minimizing the total estimated solution cost in section 4.1 Informed (Heuristic) Search Strategies of the book Artificial Intelligence: A Modern Approach.
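Both formulations fit in one sketch. A minimal A* with a flag selecting the closed-list policy (illustrative code of mine, not from any of the linked articles):

```python
import heapq

def astar(start, goal, neighbors, h, reopen_closed=True):
    """A* graph search.

    reopen_closed=True is the second formulation: a closed node is
    reopened when a cheaper path to it is found, which preserves
    optimality for admissible-but-inconsistent heuristics. With a
    consistent heuristic the reopening branch never fires, so the
    first (simpler) formulation suffices."""
    g = {start: 0}
    parent = {start: None}
    open_heap = [(h(start), start)]
    closed = set()
    while open_heap:
        _, n = heapq.heappop(open_heap)
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], g[goal]
        if n in closed:
            continue  # stale heap entry
        closed.add(n)
        for m, cost in neighbors(n):
            tentative = g[n] + cost
            if m in g and tentative >= g[m]:
                continue  # not an improvement
            if m in closed:
                if not reopen_closed:
                    continue  # first formulation: ignore closed successors
                closed.remove(m)  # second formulation: reopen it
            g[m] = tentative
            parent[m] = n
            heapq.heappush(open_heap, (tentative + h(m), m))
    return None, float("inf")
```

With h(n) = 0 this degenerates to Dijkstra's algorithm, and with any consistent heuristic the `closed.remove(m)` line is never reached, matching the Wikipedia formulation.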
