I have a function that takes as input n-dimensional (say n=10) vectors whose components are real numbers ranging from 0 to a large positive number A, say 50,000, ends included. For any such vector the function outputs an integer from 1 to, say, B=100. I want to find its global minimum.
Broadly speaking, there are algorithmic, iterative, and heuristic-based approaches to tackle such optimization problems. Which techniques are best suited to this problem? I am looking for suggestions of algorithms or active research papers that I can implement from scratch to solve such problems. I have already given up hope on the existing optimization functions that ship with Matlab/Python. I am hoping to read about the experience of others working with approximation/heuristic algorithms to optimize such ill-defined functions.
I ran fmincon, fminsearch, fminunc in Matlab but they fail to optimize the function. The function is ill-defined according to their definitions. Matlab says this for fmincon:
Initial point is a local minimum that satisfies the constraints.
Optimization completed because at the initial point, the objective function is non-decreasing
in feasible directions to within the selected value of the optimality tolerance, and
constraints are satisfied to within the selected value of the constraint tolerance.
The problem arises because this function is piecewise constant. If a vector V is mapped to a number, say 65, changing its components very slightly may not change the output at all. Such behavior is to be expected because of the pigeonhole principle: the domain is uncountably infinite, whereas the range is just a finite set of integers.
I also wish to clarify one issue that may arise. Suppose I do gradient descent from a starting point x0, and the next x that I get from a GD iteration has some components outside the domain [0,50000]; what happens then? The domain is actually circular, so a vector of size 3 like [30;5432;50432] becomes [30;5432;432]. This is taken care of automatically, so there is no need to worry about iterations producing a vector outside the domain.
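Since the objective is flat almost everywhere, gradient-based solvers see a zero slope and stop immediately, which is what the fmincon message reports. One family of methods that does not need slopes is stochastic search. Below is a minimal sketch of simulated annealing with the wrap-around rule from the question. The objective `toy_objective` is a hypothetical stand-in (the real function is not shown in the question), and the cooling schedule, step size, and iteration count are arbitrary assumptions that would need tuning.

```python
import math
import random

A = 50_000.0   # upper end of the domain (from the question)
B = 100        # largest output value (from the question)

def toy_objective(x):
    """Hypothetical piecewise-constant stand-in: returns an integer in 1..B."""
    # Mean circular distance to an arbitrary target, bucketed into levels.
    d = sum(min(abs(xi - 12_345.0), A - abs(xi - 12_345.0)) for xi in x) / len(x)
    return 1 + min(B - 1, int(d // 250))

def wrap(x):
    """Circular domain: 50432 becomes 432, as described in the question."""
    return [xi % A for xi in x]

def anneal(f, n, iters=20_000, t0=10.0, step0=A / 4, seed=0):
    """Simulated annealing with a linearly shrinking temperature and step."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, A) for _ in range(n)]
    fx = f(x)
    best, fbest = x, fx
    for k in range(iters):
        frac = 1.0 - k / iters
        temp = max(t0 * frac, 1e-9)
        step = max(step0 * frac, 1.0)
        cand = wrap([xi + rng.gauss(0.0, step) for xi in x])
        fc = f(cand)
        # Always accept improvements and ties (ties let the search drift
        # across plateaus); accept worse points with Boltzmann probability.
        if fc <= fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
    return best, fbest

best, fbest = anneal(toy_objective, n=10)
```

Accepting ties matters here: on a piecewise-constant function most moves leave the value unchanged, and rejecting them would freeze the search on the first plateau. Like any heuristic of this kind, the sketch gives no guarantee of reaching the global minimum.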
Related
I recently got this question in an interview, and it's kind of making me mad thinking about it.
Suppose you have a family of functions that each take a fixed number of parameters (different functions can take different numbers of parameters), each with the following properties:
Each input is between 0 and 1
Each output is between 0 and 1
The function is continuous
The function is a black box (i.e. you cannot look at the equation for it)
He then asked me to create an algorithm to find the global minimum of this function.
To me, looking at this question was like trying to answer the basis of machine learning. Obviously if there was some way to guarantee to find the global minima of a function, then we'd have perfect machine learning algorithms. Obviously we don't, so this question seems kind of impossible.
Anyways, the answer I gave was a mixture of divide and conquer with stochastic gradient descent. Since all functions are continuous, you'll always be able to calculate the partial gradient with respect to a certain dimension. You split each dimension in half and once you've reached a certain granularity, you apply stochastic gradient descent. In gradient descent, you initialize a certain start point, and evaluate the left and right side of that point based on a small delta with respect to every dimension to get the slope at that point. Then you update your point based on a certain learning rate and recalculate your partial derivatives until you've reached a point where the distance between old and new point is below a certain threshold. Then you re-merge and return the minimum of the two sections until you return the minimum value from all your divisions. My hope was to get around the fact that SGD can get stuck in local minima, so I thought dividing the dimension space would reduce the chance of that happening.
He seemed pretty unimpressed with my algorithm in the end. Does anybody have a faster/more accurate way of solving this problem?
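For reference, the idea described above (split the domain into cells, run a local descent from each cell, keep the best result) can be sketched as below. This is only one possible reading of the description, with hypothetical names throughout. Two caveats apply: continuity alone does not guarantee that the finite-difference slopes are meaningful, and no finite number of black-box evaluations can guarantee the global minimum, so this remains a heuristic.

```python
import itertools
import math

def fd_gradient(f, x, h=1e-5):
    """Finite-difference estimate of the gradient of f at x, inside [0, 1]^n."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] = min(1.0, x[i] + h)
        lo[i] = max(0.0, x[i] - h)
        grad.append((f(hi) - f(lo)) / (hi[i] - lo[i]))
    return grad

def descend(f, x, lr=0.05, tol=1e-8, max_iter=2000):
    """Gradient descent, clamped to the unit box after every step."""
    for _ in range(max_iter):
        g = fd_gradient(f, x)
        new = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
        if math.dist(new, x) < tol:
            return new
        x = new
    return x

def grid_multistart(f, n, cells=4):
    """Start one descent from the centre of every grid cell; keep the best."""
    centres = [(k + 0.5) / cells for k in range(cells)]
    best = None
    for start in itertools.product(centres, repeat=n):
        x = descend(f, list(start))
        if best is None or f(x) < f(best):
            best = x
    return best

def bumpy(x):
    """Hypothetical test function on [0, 1]^n with values in [0, 1]."""
    return 0.5 + 0.5 * sum(math.sin(7 * xi) * (1 - 0.5 * xi) for xi in x) / len(x)

best = grid_multistart(bumpy, 2)
single = descend(bumpy, [0.125, 0.125])   # one start, ends in a worse minimum
```

The number of starts grows as `cells ** n`, so this grid only works for a handful of dimensions; random or quasi-random starting points are the usual substitute in higher dimensions.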
If the range is exactly [0, 1], then any x in [0, 1]^n with f(x) = 0 is a global minimum. Moreover, knowing only the domain, the range, and that the function is continuous does not guarantee that the function is convex.
E.g. f(x) = sqrt(x) maps [0, 1] onto [0, 1] and is continuous, but it is concave rather than convex; its minimum sits on the boundary at x = 0, not at a stationary point.
So I have this issue where I have to find the best distribution that, when passed through a function, matches a known surface. I have written a script that creates the distribution given some parameters and spits out a metric that compares the resulting surface to the known one, but this script takes a non-negligible time, so I can't just run through a very large set of parameters to find the optimal set. I looked into the simplex method, and it seems to be the right path, but it's not quite what I need, because I don't exactly have a set of linear equations, and I don't know the constraints for the parameters; I just have one method that gives a single output (and that's all). Can anyone point me in the right direction on how to solve this problem? Thanks!
To quickly go over my process / problem again, I have a set of parameters (at this point 2 but will be expanded to more later) that defines a distribution. This distribution is used to create a surface, which is compared to a known surface, and an error metric is produced. I want to find the optimal set of parameters, but cannot run through an arbitrarily large number of parameters due to the time constraint.
One situation consistent with what you have asked is a model in which you have a reasonably tractable probability distribution which generates an unknown value. This unknown value goes through a complex and not mathematically nice process and generates an observation. Your surface corresponds to the observed probability distribution on the observations. You would be happy finding the parameters that give a good least squares fit between the theoretical and real life surface distribution.
One approximation for the fitting process is that you compute a grid of values in the space output by the probability distribution. Each set of parameters gives you a probability for each point on this grid. The not nice process maps each grid point here to a nearest grid point in the space of the surface. The least squares fit is a quadratic in the probabilities calculated for the first grid, because the probabilities calculated for a grid point in the surface are the sums of the probabilities calculated for values in the first grid that map to something nearer to that point in the surface than any other point in the surface. This means that it has first (and even second) derivatives that you can calculate. If your probability distribution is nice enough you can use the chain rule to calculate derivatives for the least squares fit in the initial parameters. This means that you can use optimization methods to calculate the best fit parameters which require not just a means to calculate the function to be optimized but also its derivatives, and these are generally more efficient than optimization methods which require only function values, such as Nelder-Mead or Torczon Simplex. See e.g. http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math4/optim/package-summary.html.
Another possible approach is via something called the EM Algorithm. Here EM stands for Expectation-Maximization. It can be used for finding maximum likelihood fits in cases where the problem would be easy if you could see some hidden state that you cannot actually see. In this case the output produced by the initial distribution might be such a hidden state. One starting point is http://www-prima.imag.fr/jlc/Courses/2002/ENSI2.RNRF/EM-tutorial.pdf.
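To make the E-step/M-step structure concrete, here is a minimal sketch of EM on a much simpler hidden-state problem than the one in the question: a two-component 1-D Gaussian mixture, where the hidden state is which component produced each data point. The function names, the initial guess, and the synthetic data are illustrative assumptions only.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, iters=200):
    """EM for a two-component 1-D Gaussian mixture."""
    # Crude initial guess: means at the data extremes, a shared wide spread.
    mu = [min(data), max(data)]
    sigma = [statistics.pstdev(data)] * 2
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: probability that each point came from component 0.
        resp = []
        for x in data:
            p0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
            p1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
            resp.append(p0 / (p0 + p1))
        # M-step: weighted maximum-likelihood update of each component.
        for k, r in enumerate((resp, [1.0 - v for v in resp])):
            total = sum(r)
            weight[k] = total / len(data)
            mu[k] = sum(ri * x for ri, x in zip(r, data)) / total
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, data)) / total
            sigma[k] = max(math.sqrt(var), 1e-6)   # guard against collapse
    return weight, mu, sigma

# Synthetic data: two well-separated clusters (an assumption for the demo).
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(5.0, 1.0) for _ in range(300)]
weight, mu, sigma = em_two_gaussians(data)
```

As with any EM variant, this converges to a local optimum of the likelihood that depends on the starting guess.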
Problem
I have a formula for computing a 1D polynomial, joint function. I want to find all local maxima of that function within a given range.
My approach
My current solution is to evaluate the function at a certain number of points in the range, then go through these points and remember the points where the function changes from rising to falling. Of course I can change the number of samples within the interval, but I want to find all maxima with as few samples as possible.
Question
Can you suggest an effective algorithm?
Finding all the maxima of an unknown function is hard. You can never be sure that a maximum you found is really just one maximum or that you have not overlooked a maximum somewhere.
However, if something is known about the function, you can try to exploit that. The simplest case, of course, is if the function is known to be a polynomial of bounded degree. Up to a polynomial of degree five it is possible to derive all four extrema from a closed formula, because the derivative is a quartic; see http://en.wikipedia.org/wiki/Quartic_equation#General_formula_for_roots for details. Most likely, you don't want to implement that, but for linear, quadratic, and cubic equations the closed formula is feasible and can be used to find the maxima of a function up to quartic degree.
That is only the simplest information that might be known. Another interesting piece of information is whether you can give a bound on the second derivative. This would allow you to reduce the sampling density when you find a steep slope.
You may also be able to exploit information from how you intend to use the maxima you found. It can give you clues about how much precision you need. Is it sufficient to know that a point is near a maximum? Or that a point is flat? Is it really a problem if a saddle point is classified as a maximum? Or if a maximum right next to a turning point is overlooked? And how much is the allowable error margin?
If you cannot exploit information like this, you are thrown back to sampling your function in small steps and hoping you don't make too much of an error.
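A minimal sketch of that fallback: sample coarsely, look for a rise followed by a fall, and then refine each bracket with golden-section search so that the sample count does not have to carry the whole precision burden. The function name and defaults are hypothetical. It assumes at most one maximum between neighbouring samples, and it ignores maxima at the interval ends.

```python
import math

def local_maxima(f, a, b, samples=200, tol=1e-8):
    """Find interior local maxima of f on [a, b] by sampling plus refinement."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    ys = [f(x) for x in xs]
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    maxima = []
    for i in range(1, samples - 1):
        # Rising then falling: a maximum is bracketed by xs[i-1] and xs[i+1].
        if ys[i - 1] < ys[i] >= ys[i + 1]:
            lo, hi = xs[i - 1], xs[i + 1]
            # Golden-section search shrinks the bracket around the peak.
            while hi - lo > tol:
                c = hi - inv_phi * (hi - lo)
                d = lo + inv_phi * (hi - lo)
                if f(c) > f(d):
                    hi = d
                else:
                    lo = c
            maxima.append(0.5 * (lo + hi))
    return maxima

peaks = local_maxima(math.sin, 0.0, 10.0)   # peaks near pi/2 and 5*pi/2
```

If two maxima fall between the same pair of neighbouring samples, one of them is silently lost, which is exactly the risk described above.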
Edit:
You mention in the comments that your function is in fact a kernel density estimation. This gives you at least the following information:
Unless the kernel has unbounded support, your estimated function will be a piecewise function: any point on it will only be influenced by a precisely calculable number of measurement points.
If the kernel is based on a polynomial, the resulting estimated function will be piecewise polynomial. And it will be of the same degree as the kernel!
If the kernel is the uniform kernel, your estimated function will be a step function.
This case needs special handling because there won't be any maxima in the mathematical sense. However, it also makes your job really easy.
If the kernel is the triangular kernel, your estimated function will be a piecewise linear function.
If the kernel is the Epanechnikov kernel, your estimated function will be a piecewise quadratic function.
In all these cases it is next to trivial to produce the piecewise functions and to find their maxima.
If the kernel is of too high a degree or is transcendental, you still know the measurements that your estimation is based on, and you know the kernel properties. This allows you to derive a heuristic for how dense your maxima can get.
At the very least, you know the first and second derivative of the kernel.
In principle, this allows you to calculate the first and second derivative of the estimated function at any point.
In the case of a local kernel, it might be more prudent to calculate the first derivative and an upper bound to the second derivative of the estimated function at any point.
With this information, it should be possible to constrain the search to the regions where there are maxima and avoid oversampling of the slopes.
As you see, there is a lot of useful information that you can derive from the knowledge of your function, and which you can use to your advantage.
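As an illustration of the piecewise-linear case, here is a minimal sketch for the triangular kernel K(u) = max(0, 1 - |u|) with bandwidth h. The estimate only has kinks at x_i - h, x_i, and x_i + h, so every strict local maximum lies on one of those points and no sampling is needed. Function names are hypothetical, and flat tops (ties between neighbouring kinks) are deliberately skipped by the strict comparison.

```python
def triangular_kde(x, data, h):
    """Kernel density estimate with the triangular kernel and bandwidth h."""
    return sum(max(0.0, 1.0 - abs(x - xi) / h) for xi in data) / (len(data) * h)

def kde_maxima(data, h):
    """Strict local maxima of a triangular-kernel KDE as (position, value)."""
    # Candidate positions: every kink of the piecewise-linear estimate.
    knots = sorted({k for xi in data for k in (xi - h, xi, xi + h)})
    vals = [triangular_kde(k, data, h) for k in knots]
    peaks = []
    for i in range(1, len(knots) - 1):
        if vals[i - 1] < vals[i] > vals[i + 1]:
            peaks.append((knots[i], vals[i]))
    return peaks

# Hypothetical measurements forming two clusters.
peaks = kde_maxima([0.0, 0.2, 0.4, 5.0, 5.1, 5.2], h=1.0)
```

This evaluates the estimate at 3n points with n kernel terms each; for large data sets a sweep over the sorted kinks that updates the slope incrementally would avoid the quadratic cost.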
The local maxima are among the roots of the first derivative. To isolate those roots in your working interval you can use the Sturm theorem, and proceed by dichotomy. In theory (using exact arithmetic) it gives you all real roots.
An equivalent approach is to express your polynomial in the Bezier/Bernstein basis and look for changes of signs of the coefficients (hull property). Dichotomic search can be efficiently implemented by recursive subdivision of the Bezier.
There are several classical algorithms available for polynomials, such as Laguerre, that usually look for the complex roots as well.
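The Sturm-sequence approach can be sketched as follows, using exact rational arithmetic from the standard library. It is applied here to the derivative f', since the local maxima are among its roots (those where f' changes from positive to negative). The function names are hypothetical, coefficients are listed highest degree first, and the sketch assumes the polynomial has degree at least 1 and no repeated roots.

```python
from fractions import Fraction

def poly_eval(p, x):
    """Horner evaluation; p lists coefficients from highest degree down."""
    acc = Fraction(0)
    for c in p:
        acc = acc * x + c
    return acc

def poly_deriv(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]

def poly_rem(a, b):
    """Remainder of the polynomial division a / b."""
    a = list(a)
    while len(a) >= len(b):
        q = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= q * b[i]
        a.pop(0)                      # leading coefficient is now zero
    while a and a[0] == 0:            # strip leading zeros
        a.pop(0)
    return a

def sturm_chain(p):
    chain = [p, poly_deriv(p)]
    while True:
        r = poly_rem(chain[-2], chain[-1])
        if not r:
            return chain
        chain.append([-c for c in r])  # negated remainder

def sign_changes(chain, x):
    vals = [v for v in (poly_eval(q, x) for q in chain) if v != 0]
    return sum(1 for s, t in zip(vals, vals[1:]) if (s > 0) != (t > 0))

def isolate_roots(p, lo, hi, width=Fraction(1, 10**6)):
    """Intervals (a, b], each holding exactly one real root of p."""
    chain = sturm_chain([Fraction(c) for c in p])
    found = []
    stack = [(Fraction(lo), Fraction(hi))]
    while stack:
        a, b = stack.pop()
        count = sign_changes(chain, a) - sign_changes(chain, b)
        if count == 0:
            continue
        if count == 1 and b - a <= width:
            found.append((a, b))
            continue
        mid = (a + b) / 2             # dichotomy: split and recount
        stack += [(a, mid), (mid, b)]
    return sorted(found)

# Hypothetical example: f(x) = x**4 - 4*x**2, so f'(x) = 4*x**3 - 8*x.
brackets = isolate_roots([4, 0, -8, 0], -3, 3)
```

Exact fractions avoid the sign errors that floating-point remainders can introduce, at the price of coefficient growth for high-degree polynomials.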
In most cases, the Baum-Welch algorithm is used to train a Hidden Markov model.
In many papers, however, it is argued that the BW algorithm will optimize until it gets stuck in a local optimum.
Does there exist an exact algorithm that actually succeeds in finding the global optimum (apart from enumerating nearly all possible models and evaluating them)?
Of course for most applications, BW will work fine. We are however interested in finding lower bounds on the amount of information loss when reducing the number of states. Therefore we always need to generate the best model possible.
We are thus looking for an exact algorithm for this NP-hard problem, ideally one that only enumerates a (potentially exponential) number of extreme points rather than a discretized grid of floating-point values for each probability in the model.
A quick search finds in http://www.cs.tau.ac.il/~rshamir/algmb/98/scribe/html/lec06/node6.html "In this case, the problem of finding the optimal set of parameters $\Theta^{\ast}$ is known to be NP-complete. The Baum-Welch algorithm [2], which is a special case of the EM technique (Expectation and Maximization), can be used for heuristically finding a solution to the problem. " Therefore I suggest that an EM variant that was guaranteed to find a global optimum in polynomial time would prove P=NP and is unknown and in fact probably does not exist.
This problem almost certainly is not convex, because there will in fact be multiple globally optimal solutions with the same score. Given any proposed solution, which typically gives a probability distribution for observations given the underlying state, you can, for instance, rename hidden state 0 as hidden state 1, and vice versa, adjusting the probability distributions so that the observed behaviour generated by the two different solutions is identical. Thus if there are N hidden states there are at least N! equivalent optima produced by permuting the hidden states amongst themselves.
On another application of EM, https://www.math.ias.edu/csdm/files/11-12/amoitra_disentangling_gaussians.pdf provides an algorithm for finding a globally optimum gaussian mixture model. It observes that the EM algorithm is often used for this problem, but points out that it is not guaranteed to find the global optimum, and does not reference as related work any version of EM which might (it also says the EM algorithm is slow).
If you are trying to do some sort of likelihood ratio test between e.g. a 4-state model and a 5-state model, it would clearly be embarrassing if, due to local optima, the 5-state model fitted had a lower likelihood than the 4-state model. One way to avoid this or to recover from it would be to start a 5-state EM from a starting point very close to that of the best 4-state models found. For instance, you could create a 5th state with probability epsilon and with an output distribution reflecting an average of the 4-state output distributions, keeping the 4-state distributions as the other 4 distributions in the new 5-state model, multiplying in a factor of (1-epsilon) somewhere so that everything still added up to one.
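The construction in the last paragraph can be sketched as below for a discrete-output HMM. The data layout (lists of probabilities) and the function name are hypothetical, and the choice of outgoing transitions for the new state is an extra assumption, since the paragraph above only specifies its probability epsilon and its output distribution.

```python
def add_state(initial, transition, emission, eps=1e-3):
    """Extend an N-state HMM to N+1 states, staying close to the original.

    initial[i]       : P(first state = i)
    transition[i][j] : P(next state = j | current state = i)
    emission[i][k]   : P(symbol k | state i)
    """
    n = len(initial)
    # Existing probabilities are scaled by (1 - eps); the new state gets eps,
    # so every row still sums to one.
    new_initial = [(1 - eps) * p for p in initial] + [eps]
    new_transition = [[(1 - eps) * p for p in row] + [eps] for row in transition]
    # Assumption: the new state leaves the way an "average" old state does.
    avg_row = [sum(row[j] for row in transition) / n for j in range(n)]
    new_transition.append([(1 - eps) * p for p in avg_row] + [eps])
    # New emission row: the average of the existing output distributions.
    n_symbols = len(emission[0])
    avg_emission = [sum(row[k] for row in emission) / n for k in range(n_symbols)]
    new_emission = [list(row) for row in emission] + [avg_emission]
    return new_initial, new_transition, new_emission

# Hypothetical 2-state model, extended to 3 states.
pi2, A2, B2 = add_state([0.6, 0.4],
                        [[0.7, 0.3], [0.2, 0.8]],
                        [[0.9, 0.1], [0.3, 0.7]],
                        eps=0.01)
```

The returned model can then be handed to the usual EM/Baum-Welch training as its starting point.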
I think if you really want this, you can, given a local optimum, define a domain of convergence. If you can assume some reasonably weak conditions, then you can quickly show that either the whole field is in the domain of convergence, or that there is a second local minimum.
E.g., suppose I have two independent variables (x, y) and one dependent variable (z), a local minimum z_1 at (x_1, y_1), and a pair of start points p_1 = (x_2, y_1) and p_2 = (x_1, y_3) which both converge to z_1. Then I might be able to prove that the whole triangle z_1, p_1, p_2 is in the domain of convergence.
Of course, this is not an approach which works generally, but you can solve a subclass of problems efficiently. Some problems have no domain of convergence in this sense: it's possible to have a problem where a point converges to a different solution than all the points in its neighbourhood. But lots of problems have some reasonable smoothness in their convergence to a solution, so then you can do OK.
I was wondering if anyone knows which kind of algorithm could be used in my case. I have already run the optimizer on my multivariate function and found a solution to my problem, assuming that my function is regular enough. I now slightly perturb the problem and would like to find the optimal solution, which is close to my last solution. Is there any very fast algorithm for this case, or should I just fall back to a regular one?
We probably need a bit more information about your problem; but since you know you're near the right solution, and if derivatives are easy to calculate, Newton-Raphson is a sensible choice, and if not, Conjugate-Gradient may make sense.
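A minimal sketch of the warm-start idea with Newton-Raphson, on a hypothetical two-variable test function whose parameter `t` plays the role of the perturbation. The function, its derivatives, and all names are illustrative assumptions; the real problem would supply its own gradient and Hessian.

```python
import math

def grad_hess(x, y, t):
    """Gradient and Hessian of the hypothetical test function
    f_t(x, y) = x**4 + y**4 + (x - 1)**2 + (y - 1)**2 + t*x*y."""
    g = (4 * x**3 + 2 * (x - 1) + t * y,
         4 * y**3 + 2 * (y - 1) + t * x)
    h = ((12 * x**2 + 2, t),
         (t, 12 * y**2 + 2))
    return g, h

def newton(t, start, tol=1e-10, max_iter=50):
    """Newton-Raphson on the gradient; returns (point, iterations used)."""
    x, y = start
    for k in range(max_iter):
        (gx, gy), ((a, b), (c, d)) = grad_hess(x, y, t)
        if math.hypot(gx, gy) < tol:
            return (x, y), k
        det = a * d - b * c
        # Solve the 2x2 system H * step = -g by Cramer's rule.
        x, y = (x + (-gx * d + gy * b) / det,
                y + (-gy * a + gx * c) / det)
    return (x, y), max_iter

# Original problem from a generic start, then the perturbed problem
# started from the previous solution.
base, cold_iters = newton(0.5, (0.0, 0.0))
warm, warm_iters = newton(0.55, base)
```

Because Newton's method converges quadratically near the solution, the warm-started run typically needs only a few iterations when the perturbation is small.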
If you already have an iterative optimizer (for example, based on Powell's direction set method, or CG), why don't you use your initial solution as a starting point for the next run of your optimizer?
EDIT, in response to your comment: if calculating the Jacobian or the Hessian matrix gives you performance problems, try BFGS (http://en.wikipedia.org/wiki/BFGS_method), which avoids computing the Hessian explicitly by building up an approximation from gradients. At http://www.alglib.net/optimization/lbfgs.php you will find a (free for non-commercial use) implementation of BFGS. You will find a good description of the details here.
And don't expect to get anything from finding your initial solution with a less sophisticated algorithm.
So this is all about unconstrained optimization. If you need information about constrained optimization, I suggest you google for "SQP".
There are a bunch of algorithms for finding the roots of equations. If you know approximately where the root is, there are algorithms that will get you arbitrarily close very quickly: the number of iterations grows only logarithmically with the required precision, or better.
One is Newton's method
Another is the bisection method
Note that these algorithms are for single variable functions, but can be expanded to multivariate functions.
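Both methods fit in a few lines for the single-variable case. This is a minimal sketch with hypothetical function names; in an optimization setting, `f` would be the derivative of the function being minimized. Bisection needs a bracket with a sign change and halves it on every step, while Newton's method needs the derivative and a good starting guess.

```python
def bisect(f, lo, hi, tol=1e-12):
    """Bisection: requires f(lo) and f(hi) to have opposite signs."""
    flo = f(lo)
    if flo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        fmid = f(mid)
        if fmid == 0:
            return mid
        if flo * fmid < 0:
            hi = mid                 # sign change is in the left half
        else:
            lo, flo = mid, fmid      # sign change is in the right half
    return 0.5 * (lo + hi)

def newton_root(f, df, x, tol=1e-12, max_iter=50):
    """Newton's method: follows the tangent line to its zero crossing."""
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton's method did not converge")

# Example: the positive root of x**2 - 2.
root_b = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
root_n = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

Bisection cannot fail once the root is bracketed, whereas Newton's method is much faster near the root but can diverge from a poor starting point.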
Every minimization algorithm performs better (read: performs at all) if you have a good initial guess. In your case, the initial guess for the perturbed problem will be the minimum point of the unperturbed problem.
Then, you have to specify your requirements: you want speed. What accuracy do you want? Does space efficiency matter? Most importantly: what information do you have: only the value of the function, or do you also have the derivatives (possibly second derivatives)?
Some background on the problem would help too. Looking for a smooth function which has been discretized will be very different than looking for hundreds of unrelated parameters.
Global information (i.e. is the function convex, is there a guaranteed global minimum or many local ones, etc.) can be left aside for now. If you have trouble finding the minimum point of the perturbed problem, this is something you will have to investigate though.
Answering these questions will allow us to select a particular algorithm. There are many choices (and trade-offs) for multivariate optimization.
Also, which is quicker will very much depend on the problem (rather than on the algorithm), and should be determined by experimentation.
Though I don't know much about using computers in this capacity, I remember an article that used neuroevolutionary techniques to find "best-fit" equations relatively efficiently, given a known function complexity (linear, Nth-degree polynomial, exponential, logarithmic, etc.) and a set of data points. As I recall it was one of the earliest uses of what we now know as computational neuroevolution; because the functional complexity (and thus the number of terms) of the equation is known and fixed, a static neural net can be used and seeded with your closest values, then "mutated" and tested for fitness, with heuristics to make new nets closer to existing nets with high fitness. Using multithreading, many nets can be created, tested and evaluated in parallel.