I'm trying to minimize a discretized function using the method of steepest descent. This should be fairly straightforward, but I'm having trouble with the search 'climbing' out of any local minimum. Here's my code in Mathematica, but its syntax is easy to follow.
x = {some ordered pair as a beginning search point};
h = 0.0000001; (* something rather small *)
lambda = 1;
infiniteloop = True;
While[infiniteloop == True,
 x1 = x[[1]];
 x2 = x[[2]];
 x1Gradient = (f[x1 - h, x2] - f[x1 + h, x2])/(2 h);
 x2Gradient = (f[x1, x2 - h] - f[x1, x2 + h])/(2 h);
 gradient = {x1Gradient, x2Gradient};
 (* test if a minimum is found by checking the norm of the gradient *)
 If[Sqrt[x1Gradient^2 + x2Gradient^2] > 0.000001,
  xNew = x + lambda*gradient,
  Break[];
 ];
 (* either accept xNew or reduce lambda *)
 If[f[xNew[[1]], xNew[[2]]] < f[x1, x2],
  x = xNew,
  lambda = lambda/2;
 ];
];
Why would this ever climb a hill? I'm puzzled because I even test whether the new value is less than the old one, and I only accept it when it is! Thoughts?
From the Unconstrained Optimization Tutorial, p. 4 (available at http://www.wolfram.com/learningcenter/tutorialcollection/):
"Steepest descent is indeed a possible strategy for local minimization, but it often does not converge quickly. In subsequent steps in this example, you may notice that the search direction is not exactly perpendicular to the contours. The search is using information from past steps to try to get information about the curvature of the function, which typically gives it a better direction to go. Another strategy, which usually converges faster, but can be more expensive, is to use the second derivative of the function. This is usually referred to as Newton's" method."
To me, the idea seems to be that 'going the wrong way' helps the algorithm learn the 'right way to go': it provides useful information about the curvature of your function to guide subsequent steps.
HTH... If not, have a look at the Constrained and Unconstrained Optimization tutorials. Lots of interesting info.
Your gradient is negative: the central-difference formula is (f(x+h) - f(x-h))/(2h), but you have the terms reversed. Use
x1Gradient = (f[x1+h,x2]-f[x1-h,x2])/(2h);
x2Gradient = (f[x1,x2+h]-f[x1,x2-h])/(2h);
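For reference, here is the same backtracking loop in Python with the corrected differences; note that with the conventional gradient sign you must step against the gradient. The example f and starting point below are placeholders, not from the original post:

import math

def steepest_descent(f, x1, x2, h=1e-7, lam=1.0, tol=1e-6):
    while True:
        # central differences with the conventional sign: f(x+h) - f(x-h)
        g1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
        g2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
        if math.hypot(g1, g2) < tol:           # gradient norm small: done
            return x1, x2
        n1, n2 = x1 - lam * g1, x2 - lam * g2  # step *against* the gradient
        if f(n1, n2) < f(x1, x2):
            x1, x2 = n1, n2                    # accept the downhill step
        else:
            lam /= 2                           # otherwise shrink the step

# example: quadratic bowl with minimum at (1, 2)
print(steepest_descent(lambda x, y: (x - 1)**2 + (y - 2)**2, 0.0, 0.0))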
Steepest descent gets stuck in local optima; add a tabu-search component to it so it can escape them.
See this book for example algorithms of steepest ascent (the same method applied to maximization) and tabu search.
There are three components to this problem:
A three-dimensional vector A.
A "smooth" function F.
A desired vector B (also three-dimensional).
We want to find a vector A that, when put through F, will produce the vector B.
F(A) = B
F can be anything that transforms or distorts A in some manner. The point is that we want to iteratively try values of A until F(A) produces B.
The question is:
How can we do this with the fewest calls to F before finding a vector that equals B (within a reasonable threshold)?
I am assuming that what you call "smooth" is tantamount to being differentiable.
Since the concept of smoothness only makes sense over the real numbers, I will also assume that you are solving a floating-point problem.
In this case, I would formulate the problem as a nonlinear programming problem, i.e. minimizing the squared norm of the difference between F(A) and B, given by
(F(A)_1 - B_1)² + (F(A)_2 - B_2)² + (F(A)_3 - B_3)²
It should be clear that this expression is zero if and only if F(A) = B and positive otherwise. Therefore you would want to minimize it.
As an example, you could use the solvers built into the scipy optimization suite (available for python):
from scipy.optimize import minimize

# Example function and target vector
f = lambda x : [x[0] + 1, x[2], 2*x[1]]
b = [0, 0, 5]

# Optimization objective: squared norm of f(x) - b
fsq = lambda x : sum((v - w)**2 for v, w in zip(f(x), b))

# Initial guess
x0 = [0, 0, 0]

res = minimize(fsq, x0, tol=1e-6)
# res.x is the solution, in this case approximately
# array([-1.00000000e+00, 2.49999999e+00, -5.84117172e-09])
A binary search (as suggested in another answer) only works if the function is 1-d, which is not the case here. You can try out different optimization methods by adding method="name" to the call to minimize; see the API. It is not always clear which method works best for your problem without knowing more about the nature of your function. As a rule of thumb, the more information you give to the solver, the better. If you can compute the derivative of F explicitly, passing it to the solver will help reduce the number of required evaluations. If F has a Hessian (i.e., if it is twice differentiable), providing it will help as well.
As an alternative, you can use the least_squares function on the residual F(x) - B directly. This could be faster, since the solver can exploit the fact that you are solving a least-squares problem rather than a generic optimization problem.
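For concreteness, a minimal sketch of that variant, reusing the example f, b, and x0 from above:

from scipy.optimize import least_squares

# residual vector f(x) - b; the solver minimizes its squared norm
residual = lambda x : [v - w for v, w in zip(f(x), b)]

res = least_squares(residual, x0)
# res.x again approximates [-1, 2.5, 0]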
From a more general standpoint, the problem of recovering the function arguments that produce a given value is called an Inverse Problem. These problems have been studied extensively.
Provided that F(A) = B, where F and B are known and A remains unknown, you can start with a simple gradient-based search using the linearization
F(A) ≈ F(C) + F'(C)*(A - C) ≈ B
where F'(C) is the gradient (Jacobian) of F evaluated at the point C. I'm assuming you can calculate this gradient analytically for now.
So, you can choose a point C that you estimate is not very far from the solution and iterate:
C = C0;
While(true)
{
    Ai = inverse(F'(C))*(B - F(C)) + C;
    convergence = Abs(Ai - C);
    C = Ai;
    if (convergence < someThreshold)
        break;
}
If the gradient of F cannot be calculated analytically, you can estimate it. Let Ei, i = 1:3, be the orthonormal basis vectors; then
Fi'(C) = (F(C + Ei*d) - F(C - Ei*d))/(2*d);
F'(C) = [F1'(C) | F2'(C) | F3'(C)];
and d can be chosen as fixed or as a function of the convergence value.
These algorithms suffer from local maxima, regions of null gradient, etc., so for this to work, the starting point C0 must not be very far from the solution, in a region where F behaves monotonically.
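Here is a minimal NumPy sketch of this iteration with the finite-difference gradient; the example F, B, starting point, and threshold below are arbitrary choices for illustration:

import numpy as np

def invert(F, B, C0, d=1e-6, threshold=1e-10, max_iter=100):
    # iterate C <- inverse(F'(C)) * (B - F(C)) + C
    C = np.asarray(C0, dtype=float)
    for _ in range(max_iter):
        # central-difference estimate of F'(C), one column per coordinate
        J = np.empty((3, 3))
        for i in range(3):
            E = np.zeros(3); E[i] = 1.0
            J[:, i] = (F(C + E*d) - F(C - E*d)) / (2*d)
        Ai = np.linalg.solve(J, B - F(C)) + C
        convergence = np.linalg.norm(Ai - C)
        C = Ai
        if convergence < threshold:
            break
    return C

# hypothetical example: a mildly nonlinear F
F = lambda A: np.array([A[0] + A[1]**2, A[1] + A[2], A[2]**3 + A[0]])
B = np.array([1.0, 2.0, 3.0])
print(invert(F, B, C0=[1.0, 1.0, 1.0]))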
It seems like you can try a metaheuristic approach for this. A genetic algorithm (GA) might be the best fit.
You can initialize a number of A vectors to form a population and use the GA to evolve that population, so that each generation contains better vectors, i.e. vectors Ai for which F(Ai) is closer to B.
Your fitness function can be a simple function that compares F(Ai) to B.
You can choose how to mutate your population in each generation.
A simple example of a GA can be found here: link
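A tiny mutation-only sketch of the idea (the population size, mutation scale, and the example F and B are all arbitrary illustrative choices; a full GA would add crossover):

import random

# hypothetical target setup: find A such that F(A) = B
F = lambda A: [A[0] + A[1], A[1] * A[2], A[2] - A[0]]
B = [3.0, 2.0, 1.0]

# fitness: negative squared distance between F(A) and B (higher is better)
def fitness(A):
    return -sum((v - w)**2 for v, w in zip(F(A), B))

population = [[random.uniform(-5, 5) for _ in range(3)] for _ in range(50)]
for generation in range(200):
    # keep the fittest half, refill with mutated copies of survivors
    population.sort(key=fitness, reverse=True)
    survivors = population[:25]
    children = [[g + random.gauss(0, 0.1) for g in random.choice(survivors)]
                for _ in range(25)]
    population = survivors + children

population.sort(key=fitness, reverse=True)
print(population[0], fitness(population[0]))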
I want to write a function f⁻¹(a,b) = (x,y) that approximates the inverse of f, where f(x,y) = (a,b) is a bijective function (over a specific range).
Any suggestions on how to get an efficient numerical approximation?
The programming language used is not important.
Solving f(x,y) = (a,b) for x, y is equivalent to finding a root or minimum of f(x,y) - (a,b) (= 0), so you can use any of the standard root-finding or optimization algorithms. If you are implementing this yourself, I recommend Coordinate descent because it is probably the simplest such algorithm. You could also try Adaptive coordinate descent, although that may be a bit harder to analyze.
If you want to find the inverse over a range, you can either compute the inverse at various points and interpolate with something like a Cubic Spline or solve the above equation whenever you want to evaluate the inverse function. Even if you solve the equation for each evaluation, it may still be helpful to precompute some values so they can be used as initial values for a solver such as Coordinate descent.
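A minimal sketch of the coordinate-descent idea (a simple compass-search variant; the example f, target, and step schedule are illustrative):

def objective(x, y, a, b, f):
    # squared norm of f(x, y) - (a, b); zero exactly at the inverse
    fa, fb = f(x, y)
    return (fa - a)**2 + (fb - b)**2

def coordinate_descent(f, a, b, x=0.0, y=0.0, step=1.0, tol=1e-12):
    # move one coordinate at a time; halve the step when nothing helps
    while step > tol:
        improved = False
        for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
            if objective(x + dx, y + dy, a, b, f) < objective(x, y, a, b, f):
                x, y = x + dx, y + dy
                improved = True
        if not improved:
            step /= 2
    return x, y

# illustrative bijection f(x, y) = (x + y, x - y), inverted at (a, b) = (3, 1)
print(coordinate_descent(lambda x, y: (x + y, x - y), 3, 1))  # ~(2.0, 1.0)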
Also see Newton's method and the Bisection method.
There is no 'automatic' solution that will work for any general function. Even in the simpler case of y = f(x) it can be hard to find a suitable starting point. As an example:
y = x^2
has a nice algebraic inverse
x = sqrt(y)
but trying to approximate the sqrt function in the range [0..1] with a polynomial (for instance) sucks badly.
If your range is small enough, and your function well-behaved enough, then you might get a fit using 2D splines. If this is going to work, you should try using independent functions for x and y, i.e. use
y = Y⁻¹(a,b) and x = X⁻¹(a,b)
rather than the more complicated
(x,y) = F⁻¹(a,b)
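A sketch of that spline idea using scipy, sampling f on a grid and fitting the two inverse maps separately (the example f and ranges are illustrative, not from the question):

import numpy as np
from scipy.interpolate import SmoothBivariateSpline

# illustrative bijection on [0, 1]^2: f(x, y) = (a, b)
f = lambda x, y: (x + 0.1 * y**2, y + 0.1 * x**2)

# sample f on a grid, then fit x = X(a, b) and y = Y(a, b) independently
xs, ys = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))
xs, ys = xs.ravel(), ys.ravel()
a, b = f(xs, ys)

X_inv = SmoothBivariateSpline(a, b, xs)
Y_inv = SmoothBivariateSpline(a, b, ys)

# f(0.5, 0.5) = (0.525, 0.525), so both inverses should return ~0.5
print(X_inv(0.525, 0.525)[0, 0], Y_inv(0.525, 0.525)[0, 0])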
BigDecimal has some modules which are hardly documented, like Newton.
"Solves the nonlinear algebraic equation system f = 0 by Newton’s
method. This program is not dependent on BigDecimal.
To call:
n = nlsolve(f,x) where n is the number of iterations required,
x is the initial value vector
f is an Object which is used to compute the values of the equations to be solved. "
And that's it. Googling did not turn up anything I could understand. I'd like to see some sample code with a bit of not-too-math-heavy explanation, to get a better idea of what that weird thing at the bottom of the toolbox is.
Newton's Method is a way of approximating the root of an equation. It's pretty good, provided your function meets some continuity requirements.
The method is:
1. Take a starting point.
2. At that point, find the tangent line.
3. Figure out where that tangent line has a root; take that root as your new point.
4. If you've reached tolerance, return this point as the solution. If not, go back to #1 using this as your new point.
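As a generic illustration (in Python rather than Ruby, and not the BigDecimal nlsolve interface), the one-dimensional version looks like this:

def newton(f, df, x, tol=1e-12, max_iter=100):
    # find a root of f starting from x, given the derivative df
    for _ in range(max_iter):
        step = f(x) / df(x)    # where the tangent line at x crosses zero
        x -= step
        if abs(step) < tol:    # tolerance reached: x is the solution
            return x
    raise RuntimeError("did not converge")

# example: square root of 2 as the positive root of x^2 - 2
print(newton(lambda x: x*x - 2, lambda x: 2*x, 1.0))  # ~1.4142135623730951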
To the downvoters: this isn't a question about mathematics, it's a question about the programming language Mathematica.
One of the prime characteristics of Mathematica is that it can deal with many things symbolically. But if you come to think about it, many of the symbolic features are actually only halfway symbolic.
Take vectors, for instance. We can have a symbolic vector like {x,y,z}, do a matrix multiplication with a matrix full of symbols and end up with a symbolic result, so we might consider that symbolic vector algebra. But we all know that, right out of the box, Mathematica does not allow you to say that a symbol x is a vector and that, given a matrix A, A . x is a vector too. That's a higher level of abstraction, one that Mathematica (currently) does not deal with very well.
Similarly, Mathematica knows how to find the 5th derivative of a function that's defined in terms of nothing but symbols, but it's not well geared towards finding the r-th derivative (see the "How to find a function's rth derivative when r is symbolic in Mathematica?" question).
Furthermore, Mathematica has extensive Boolean algebra capabilities, some of them stone-age old, but many recently obtained in version 7. In version 8 we got Probability and friends (such as Conditioned), which allow us to reason about probabilities of random variables with given distributions. It's a really magnificent addition that helps me a lot in familiarizing myself with this domain, and I enjoy working with it tremendously. However,...
I was discussing with a colleague certain rules of probabilistic logic, like the familiar
P(C|A) = P(C ∧ A) / P(A),
i.e., the conditional probability of event/state/outcome C given that event/state/outcome A is true.
Specifically, we were looking at this one:
and although I had spoken highly of Mathematica's Probability just before, I realized that I wouldn't know how to solve this right away with Mathematica. Again, just as with abstract vectors and matrices and symbolic derivatives, this seems to be an abstraction level too high. Or is it? My question is:
Could you find a way to find the truth or falsehood in the above and similar equations using a Mathematica program?
>> Mathematica does not allow you to say that a symbol x is a vector
Sure it does... close enough, anyway: you can say that it's a collection of Reals. It's called assumptions or conditioning, depending on what you want to do.
Refine[Sqrt[x]*Sqrt[y]]
The above doesn't refine, because Mathematica assumes x and y can be any symbolic quantity; but if you narrow their scope, you get results:
Assuming[ x > 0 && y > 0, Refine[Sqrt[x]*Sqrt[y]]]
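(* now refines to Sqrt[x y] *)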
It would be very nice to have the ability to say: Element[x,Reals^2] (2-dimensional real vector), maybe in Mathematica 9. :-)
As for this problem:
>> Could you find a way to find the truth or falsehood in the above and similar equations using a Mathematica program?
Please refer to my answer (first one) on this question to see a symbolic approach to Bayes theorem:
https://stackoverflow.com/questions/8378336/how-do-you-work-out-conditional-probabilities-in-mathematica-is-it-possible
Just glanced at this and found an example from the documentation on Conditioned:
In[1]:= c = x^2 < 30; a = x > 1;
(Sorry for the formatting here...)
In[2]:= Probability[c \[Conditioned] a, x \[Distributed] PoissonDistribution[2]] ==
         Probability[c && a, x \[Distributed] PoissonDistribution[2]] /
          Probability[a, x \[Distributed] PoissonDistribution[2]]
Which evaluates to True and corresponds to a less general version of the first example you gave.
I'll revisit this later tonight if I have time.
I'm trying to perform some calculations on an undirected, cyclic, weighted graph, and I'm looking for a good function to calculate an aggregate weight.
Each edge has a distance value in the range [1,∞). The function should give greater importance to lower distances (it should be monotonically decreasing), and it should assign the value 0 to the distance ∞.
My first instinct was simply 1/d, which meets both of those requirements. (Well, technically 1/∞ is undefined, but programmers tend to let that one slide more easily than do mathematicians.) The problem with 1/d is that the function cares a lot more about the difference between 1/1 and 1/2 than the difference between 1/34 and 1/35. I'd like to even that out a bit more. I could use √(1/d) or ∛(1/d) or even ∜(1/d), but I feel like I'm missing out on a whole class of possibilities. Any suggestions?
(I thought of ln(1/d), but that goes to -∞ as d goes to ∞, and I can't think of a good way to push that up to 0.)
Later:
I forgot a requirement: w(1) must be 1. (This doesn't invalidate the existing answers; a multiplicative constant is fine.)
Perhaps:
exp(-d)
edit: something along the lines of
exp(k(1 - d)), for some k > 0,
will fit your extra requirement (I'm sure you knew that, but what the hey).
How about 1/ln(d + k)? (With k = e - 1, this also satisfies w(1) = 1.)
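A quick numeric check of the candidates so far, with constants chosen so that w(1) = 1 (the k value is an arbitrary choice):

import math

k = 2.0
candidates = {
    "1/d":          lambda d: 1.0 / d,
    "sqrt(1/d)":    lambda d: math.sqrt(1.0 / d),
    "exp(k(1-d))":  lambda d: math.exp(k * (1.0 - d)),
    "1/ln(d+e-1)":  lambda d: 1.0 / math.log(d + math.e - 1.0),
}

for name, w in candidates.items():
    # all satisfy w(1) = 1; they differ in how fast they fall toward 0
    print(name, [round(w(d), 4) for d in (1, 2, 10, 100)])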
Some of the above answers are versions of a Gaussian distribution, which I agree is a good choice. The Gaussian or normal distribution is found often in nature. It is a B-spline basis function of order infinity.
One drawback to using it as a blending function is that its infinite support requires more calculations than a finite blending function. A blend is computed as a sum of products. In practice, the summation may stop when the next term is less than a tolerance.
If possible form a static table to hold discrete Gaussian function values since calculating the values is computationally expensive. Interpolate table values if needed.
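A sketch of the lookup-table idea (the sigma, table range, and resolution are arbitrary illustrative choices):

import math

SIGMA = 1.0
STEP = 0.01
# precompute exp(-x^2 / (2 sigma^2)) on [0, 5 sigma]; beyond that it is ~0
TABLE = [math.exp(-(i * STEP)**2 / (2 * SIGMA**2))
         for i in range(int(5 * SIGMA / STEP) + 1)]

def gaussian(x):
    # table lookup with linear interpolation between adjacent entries
    t = abs(x) / STEP
    i = int(t)
    if i + 1 >= len(TABLE):
        return 0.0
    frac = t - i
    return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac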
How about this?
w(d) = (1 + k)/(d + k) for some large k
This satisfies w(1) = 1 for any k, and d = 2 + k is the place where w(d) = 1/2.
It seems you are in effect looking for a linear decrease, something along the lines of ∞ - d. Obviously this solution is garbage, but since you are probably not using an arbitrary-precision data type for the distance, you could use yourDatatype.MaxValue - d to get a linearly decreasing function.
In fact, if you are using doubles, you might consider using (yourDatatype.MaxValue - d) + 1, because you could then assign the weight 0 if your distance is "infinity" (since doubles actually have a value for that).
Of course you still have to consider implementation details like w(d) = double.infinity or w(d) = integer.MaxValue, but these should be easy to spot if you know the actual data types you are using ;)