Optimization of multivariate function with a initial solution close to the optimum - algorithm

I was wondering if anyone knows which kind of algorithm could be use in my case. I already have run the optimizer on my multivariate function and found a solution to my problem, assuming that my function is regular enough. I slightly perturbate the problem and would like to find the optimum solution which is close to my last solution. Is there any very fast algorithm in this case or should I just fallback to a regular one.

We probably need a bit more information about your problem; but since you know you're near the right solution, and if derivatives are easy to calculate, Newton-Raphson is a sensible choice, and if not, Conjugate-Gradient may make sense.

If you already have an iterative optimizer (for example, based on Powell's direction set method, or CG), why don't you use your initial solution as a starting point for the next run of your optimizer?
EDIT: due to your comment: if calculating the Jacobian or the Hessian matrix gives you performance problems, try BFGS (http://en.wikipedia.org/wiki/BFGS_method), it avoids calculation of the Hessian completely; here
http://www.alglib.net/optimization/lbfgs.php you find a (free-for-non-commercial) implementation of BFGS. A good description of the details you will here.
And don't expect to get anything from finding your initial solution with a less sophisticated algorithm.
So this is all about unconstrained optimization. If you need information about constrained optimization, I suggest you google for "SQP".

there are a bunch of algorithms for finding the roots of equations. If you know approximately where the root is, there are algorithms that will get you arbitrarily close very quickly, in ln n time or better.
One is Newton's method
another is the Bisection Method
Note that these algorithms are for single variable functions, but can be expanded to multivariate functions.

Every minimization algorithm performs better (read: perform at all) if you have a good initial guess. The initial guess for the perturbed problem will be in your case the minimum point of the non perturbed problem.
Then, you have to specify your requirements: you want speed. What accuracy do you want ? Does space efficiency matters ? Most importantly: what information do you have: only the value of the function, or do you also have the derivatives (possibly second derivatives) ?
Some background on the problem would help too. Looking for a smooth function which has been discretized will be very different than looking for hundreds of unrelated parameters.
Global information (ie. is the function convex, is there a guaranteed global minimum or many local ones, etc) can be left aside for now. If you have trouble finding the minimum point of the perturbed problem, this is something you will have to investigate though.
Answering these questions will allow us to select a particular algorithm. There are many choices (and trade-offs) for multivariate optimization.
Also, which is quicker will very much depend on the problem (rather than on the algorithm), and should be determined by experimentation.

Thought I don't know much about using computers in this capacity, I remember an article that used neuroevolutionary techniques to find "best-fit" equations relatively efficiently, given a known function complexity (linear, Nth-polynomial, exponential, logarithmic, etc) and a set of point plots. As I recall it was one of the earliest uses of what we now know as computational neuroevolution; because the functional complexity (and thus the number of terms) of the equation is known and fixed, a static neural net can be used and seeded with your closest values, then "mutated" and tested for fitness, with heuristics to make new nets closer to existing nets with high fitness. Using multithreading, many nets can be created, tested and evaluated in parallel.

Related

How to find neighboring solutions in simulated annealing?

I'm working on an optimization problem and attempting to use simulated annealing as a heuristic. My goal is to optimize placement of k objects given some cost function. Solutions take the form of a set of k ordered pairs representing points in an M*N grid. I'm not sure how to best find a neighboring solution given a current solution. I've considered shifting each point by 1 or 0 units in a random direction. What might be a good approach to finding a neighboring solution given a current set of points?
Since I'm also trying to learn more about SA, what makes a good neighbor-finding algorithm and how close to the current solution should the neighbor be? Also, if randomness is involved, why is choosing a "neighbor" better than generating a random solution?
I would split your question into several smaller:
Also, if randomness is involved, why is choosing a "neighbor" better than generating a random solution?
Usually, you pick multiple points from a neighborhood, and you can explore all of them. For example, you generate 10 points randomly and choose the best one. By doing so you can efficiently explore more possible solutions.
Why is it better than a random guess? Good solutions tend to have a lot in common (e.g. they are close to each other in a search space). So by introducing small incremental changes, you would be able to find a good solution, while random guess could send you to completely different part of a search space and you'll never find an appropriate solution. And because of the curse of dimensionality random jumps are not better than brute force - there will be too many places to jump.
What might be a good approach to finding a neighboring solution given a current set of points?
I regret to tell you, that this question seems to be unsolvable in general. :( It's a mix between art and science. Choosing a right way to explore a search space is too problem specific. Even for solving a placement problem under varying constraints different heuristics may lead to completely different results.
You can try following:
Random shifts by fixed amount of steps (1,2...). That's your approach
Swapping two points
You can memorize bad moves for some time (something similar to tabu search), so you will use only 'good' ones next 100 steps
Use a greedy approach to generate a suboptimal placement, then improve it with methods above.
Try random restarts. At some stage, drop all of your progress so far (except for the best solution so far), raise a temperature and start again from a random initial point. You can do this each 10000 steps or something similar
Fix some points. Put an object at point (x,y) and do not move it at all, try searching for the best possible solution under this constraint.
Prohibit some combinations of objects, e.g. "distance between p1 and p2 must be larger than D".
Mix all steps above in different ways
Try to understand your problem in all tiniest details. You can derive some useful information/constraints/insights from your problem description. Assume that you can't solve placement problem in general, so try to reduce it to a more specific (== simpler, == with smaller search space) problem.
I would say that the last bullet is the most important. Look closely to your problem, consider its practical aspects only. For example, a size of your problems might allow you to enumerate something, or, maybe, some placements are not possible for you and so on and so forth. THere is no way for SA to derive such domain-specific knowledge by itself, so help it!
How to understand that your heuristic is a good one? Only by practical evaluation. Prepare a decent set of tests with obvious/well-known answers and try different approaches. Use well-known benchmarks if there are any of them.
I hope that this is helpful. :)

Simplex Algorithm - why so complicated?

Recently I was introduced to the simplex algorithm for optimizing linear problems. I think I understand how it works, but I don't know, why it is that complicated.
Why do we have to use the pivot- row/column/element. If I see this correctly it is just a Gauss elimination which eliminates Elements in a specific order. Why does the order matter? Is it due to numerical noise? If everything is calculated precisely the ordering shouldn't be important, right?
In all examples I encountered so far (not too many to be fair) all slack variables are set to zero in the end. Why did we introduce them in the first place and didn't just rewrite all inequations directly as equations?
Pivoting is NOT Gaussian elimination. The tableau is likely to have about as many zeros after the pivot as before it. The order of pivoting is chosen to maintain the simplex constraint (always moving from one vertex of the feasible space to another), and to do so efficiently (by making maximum progress along the objective with each move).
Slack variables are zero for constraints which bind the solution. It sounds like the problems you looked at didn't have any non-binding constraints.

concrete examples of heuristics

What are concrete examples (e.g. Alpha-beta pruning, example:tic-tac-toe and how is it applicable there) of heuristics. I already saw an answered question about what heuristics is but I still don't get the thing where it uses estimation. Can you give me a concrete example of a heuristic and how it works?
Warnsdorff's rule is an heuristic, but the A* search algorithm isn't. It is, as its name implies, a search algorithm, which is not problem-dependent. The heuristic is. An example: you can use the A* (if correctly implemented) to solve the Fifteen puzzle and to find the shortest way out of a maze, but the heuristics used will be different. With the Fifteen puzzle your heuristic could be how many tiles are out of place: the number of moves needed to solve the puzzle will always be greater or equal to the heuristic.
To get out of the maze you could use the Manhattan Distance to a point you know is outside of the maze as your heuristic. Manhattan Distance is widely used in game-like problems as it is the number of "steps" in horizontal and in vertical needed to get to the goal.
Manhattan distance = abs(x2-x1) + abs(y2-y1)
It's easy to see that in the best case (there are no walls) that will be the exact distance to the goal, in the rest you will need more. This is important: your heuristic must be optimistic (admissible heuristic) so that your search algorithm is optimal. It must also be consistent. However, in some applications (such as games with very big maps) you use non-admissible heuristics because a suboptimal solution suffices.
A heuristic is just an approximation to the real cost, (always lower than the real cost if admissible). The better the approximation, the fewer states the search algorithm will have to explore. But better approximations usually mean more computing time, so you have to find a compromise solution.
Most demonstrative is the usage of heuristics in informed search algorithms, such as A-Star. For realistic problems you usually have large search space, making it infeasible to check every single part of it. To avoid this, i.e. to try the most promising parts of the search space first, you use a heuristic. A heuristic gives you an estimate of how good the available subsequent search steps are. You will choose the most promising next step, i.e. best-first. For example if you'd like to search the path between two cities (i.e. vertices, connected by a set of roads, i.e. edges, that form a graph) you may want to choose the straight-line distance to the goal as a heuristic to determine which city to visit first (and see if it's the target city).
Heuristics should have similar properties as metrics for the search space and they usually should be optimistic, but that's another story. The problem of providing a heuristic that works out to be effective and that is side-effect free is yet another problem...
For an application of different heuristics being used to find the path through a given maze also have a look at this answer.
Your question interests me as I've heard about heuristics too during my studies but never saw an application for it, I googled a bit and found this : http://www.predictia.es/blog/aco-search
This code simulate an "ant colony optimization" algorithm to search trough a website.
The "ants" are workers which will search through the site, some will search randomly, some others will follow the "best path" determined by the previous ones.
A concrete example: I've been doing a solver for the game JT's Block, which is roughly equivalent to the Same Game. The algorithm performs a breadth-first search on all possible hits, store the values, and performs to the next ply. Problem is the number of possible hits quickly grows out of control (10e30 estimated positions per game), so I need to prune the list of positions at each turn and only take the "best" of them.
Now, the definition of the "best" positions is quite fuzzy: they are the positions that are expected to lead to the best final scores, but nothing is sure. And here comes the heuristics. I've tried a few of them:
sort positions by score obtained so far
increase score by best score obtained with a x-depth search
increase score based on a complex formula using the number of tiles, their color and their proximity
improve the last heuristic by tweaking its parameters and seeing how they perform
etc...
The last of these heuristic could have lead to an ant-march optimization: there's half a dozen parameters that can be tweaked from 0 to 1, and an optimizer could find the optimal combination of these. For the moment I've just manually improved some of them.
The second of this heuristics is interesting: it could lead to the optimal score through a full depth-first search, but such a goal is impossible of course because it would take too much time. In general, increasing X leads to a better heuristic, but increases the computing time a lot.
So here it is, some examples of heuristics. Anything can be an heuristic as long as it helps your algorithm perform better, and it's what makes them so hard to grasp: they're not deterministic. Another point with heuristics: they're supposed to lead to quick and dirty results of the real stuff, so there's a trade-of between their execution time and their accuracy.
A couple of concrete examples: for solving the Knight's Tour problem, one can use Warnsdorff's rule - an heuristic. Or for solving the Fifteen puzzle, a possible heuristic is the A* search algorithm.
The original question asked for concrete examples for heuristics.
Some of these concrete examples were already given. Another one would be the number of misplaced tiles in the 15-puzzle or its improvement, the Manhattan distance, based on the misplaced tiles.
One of the previous answers also claimed that heuristics are always problem-dependent, whereas algorithms are problem-independent. While there are, of course, also problem-dependent algorithms (for instance, for every problem you can just give an algorithm that immediately solves that very problem, e.g. the optimal strategy for any tower-of-hanoi problem is known) there are also problem-independent heuristics!
Consequently, there are also different kinds of problem-independent heuristics. Thus, in a certain way, every such heuristic can be regarded a concrete heuristic example while not being tailored to a specific problem like 15-puzzle. (Examples for problem-independent heuristics taken from planning are the FF heuristic or the Add heuristic.)
These problem-independent heuristics base on a general description language and then they perform a problem relaxation. That is, the problem relaxation only bases on the syntax (and, of course, its underlying semantics) of the problem description without "knowing" what it represents. If you are interested in this, you should get familiar with "planning" and, more specifically, with "planning as heuristic search". I also want to mention that these heuristics, while being problem-independent, are dependent on the problem description language, of course. (E.g., my before-mentioned heuristics are specific to "planning problems" and even for planning there are various different sub problem classes with differing kinds of heuristics.)

Gradient descent implementation

I've implemented both the batch and stochastic gradient descent. I'm experiencing some issues though. This is the stochastic rule:
1 to m {
theta(j):=theta(j)-step*derivative (for all j)
}
The issue I have is that, even though the cost function is becoming smaller and smaller the testing says it's not good. If I change the step a bit and change the number of iterations, the cost function is a bit bigger in value but the results are ok. Is this an overfitting "symptom"? How do I know which one is the right one? :)
As I said, even though the cost function is more minimized the testing says it's not good.
Gradient descent is a local search method for minimizing a function. When it reaches a local minimum in the parameter space, it won't be able to go any further. This makes gradient descent (and other local methods) prone to getting stuck in local minima, rather than reaching the global minimum. The local minima may or may not be good solutions for what you're trying to achieve. What to expect will depend on the function that you're trying to minimize.
In particular, high-dimensional NP-complete problems can be tricky. They often have exponentially many local optima, with many of them nearly as good as the global optimum in terms of cost, but with parameter values orthogonal to those for the global optimum. These are hard problems: you don't generally expect to be able to find the global optimum, instead just looking for a local minimum that is good enough. These are also relevant problems: many interesting problems have just these properties.
I'd suggest first testing your gradient descent implementation with an easy problem. You might try finding the minimum in a polynomial. Since it's a one-parameter problem, you can plot the progress of the parameter values along the curve of the polynomial. You should be able to see if something is drastically wrong, and can also observe how the search gets stuck in local minima. You should also be able to see that the initial parameter choice can matter quite a lot.
For dealing with harder problems, you might modify your algorithm to help it escape the local minima. A few common approaches:
Add noise. This reduces the precision of the parameters you've found, which can "blur" out local minima. The search can then jump out of local minima that are small compared to the noise, while still being trapped in deeper minima. A well-known approach for adding noise is simulated annealing.
Add momentum. Along with using the current gradient to define the step, also continue in the same direction as the previous step. If you take a fraction of the previous step as the momentum term, there is a tendency to keep going, which can take the search past the local minimum. By using a fraction, the steps decay exponentially, so poor steps aren't a big problem. This was always a popular modification to gradient descent when used to train neural networks, where gradient descent is known as backpropagation.
Use a hybrid search. First use a global search (e.g., genetic algorithms, various Monte Carlo methods) to find some good starting points, then apply gradient descent to take advantage of the gradient information in the function.
I won't make a recommendation on which to use. Instead, I'll suggest doing a little research to see what others have done with problems related to what you're working on. If it's purely a learning experience, momentum is probably the easiest to get working.
There are lots of things that could be going on:
your step could be a bad choice
your derivative might be off
your "expected value" might be mistaken
your gradient descent could simply be slow to converge
I would try increasing the run length, and plot runs with a variety of step values. A smaller step will have a better chance of avoiding the problems of, er, steps that are too big.

Algorithm to optimize parameters based on imprecise fitness function

I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walk and markov chains as that is my best guesses about where to look, but so far have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for fall in a broad research area called metaheuristics (which include many non-linear optimization algorithms such as genetic algorithms, simulated annealing or tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. Although very efficient, you can easily melt your brain trying to follow and debug the convergence steps of a simplex crawler!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more agressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in less tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.

Resources