I'm looking for algorithms to find a "best" set of parameter values. The function in question has a lot of local minima and changes very quickly. To make matters even worse, testing a set of parameters is very slow - on the order of 1 minute - and I can't compute the gradient directly.
Are there any well-known algorithms for this kind of optimization?
I've had moderate success with just trying random values. I'm wondering if I can improve the performance by making the random parameter chooser have a lower chance of picking parameters close to ones that had produced bad results in the past. Is there a name for this approach so that I can search for specific advice?
More info:
Parameters are continuous
There are on the order of 5-10 parameters. Certainly not more than 10.
How many parameters are there -- e.g., how many dimensions in the search space? Are they continuous or discrete -- e.g., real numbers, integers, or just a few possible values?
Approaches I've seen used for these kinds of problems have a similar overall structure: take a large number of sample points, and adjust them all towards regions that have "good" answers somehow. Since you have a lot of points, their relative differences serve as a makeshift gradient.
Simulated Annealing: The classic approach. Take a bunch of points and probabilistically move some to a neighbouring point chosen at random, depending on how much better it is. (A minimal sketch follows this list.)
Particle Swarm Optimization: Take a "swarm" of particles with velocities in the search space, probabilistically move each particle; if a move is an improvement, let the whole swarm know.
Genetic Algorithms: This is a little different. Rather than using the neighbour information as above, you take the best results each time and "cross-breed" them, hoping to get the best characteristics of each.
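For concreteness, here is a minimal single-chain simulated annealing sketch in Python; the neighbour function, cooling schedule, and all names are illustrative assumptions rather than part of any answer here:

```python
import math
import random

def simulated_annealing(f, x0, neighbour, t0=1.0, cooling=0.995, steps=10_000):
    """Minimal SA sketch. `neighbour(x)` returns a random nearby point;
    the geometric cooling schedule is an arbitrary example."""
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        cand = neighbour(x)
        fc = f(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if fc < fx or random.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        t *= cooling
    return best, fbest
```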
The Wikipedia links have pseudocode for the first two; GA methods have so much variety that it's hard to list just one algorithm, but you can follow links from there. Note that there are implementations of all of the above out there that you can use or take as a starting point.
Note that all of these -- and really any approach to a search problem of this dimensionality -- are heuristics, which means they have parameters that have to be tuned to your particular problem. Which can be tedious.
By the way, the fact that the function evaluation is so expensive can be made to work for you a bit; since all the above methods involve lots of independent function evaluations, that piece of the algorithm can be trivially parallelized with OpenMP or something similar to make use of as many cores as you have on your machine.
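The OpenMP suggestion is for compiled code; as a sketch of the same idea in Python (assuming the objective is a picklable top-level function), the independent evaluations can be farmed out with multiprocessing:

```python
from multiprocessing import Pool
import random

def evaluate(params):
    # Stand-in for the expensive (~1 minute) objective function.
    return sum(p * p for p in params)

if __name__ == "__main__":
    # e.g. the current population, swarm, or batch of random samples
    candidates = [[random.uniform(-5, 5) for _ in range(8)] for _ in range(32)]

    # The evaluations are independent, so they parallelize trivially;
    # Pool defaults to one worker process per core.
    with Pool() as pool:
        scores = pool.map(evaluate, candidates)
    best_score, best_params = min(zip(scores, candidates))
    print(best_score, best_params)
```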
Your situation seems to be similar to that of the poster of Software to Tune/Calibrate Properties for Heuristic Algorithms, and I would give you the same advice I gave there: consider a Metropolis-Hastings-like approach with multiple walkers and a simulated annealing of the step sizes.
The difficulty in using Monte Carlo methods in your case is the expensive evaluation of each candidate. How expensive, compared to the time you have at hand? If you need a good answer in a few minutes this isn't going to be fast enough. If you can leave it running overnight, it'll work reasonably well.
Given a complicated search space, I'd recommend a random initial distribution. Your final answer may simply be the best individual result recorded during the whole run, or the mean position of the walker with the best result.
Don't be put off that I was discussing maximizing there and you want to minimize: the figure of merit can be negated or inverted.
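A rough sketch of what such a multi-walker search with an annealed step size might look like (the walker count, shrink schedule, and all names are my assumptions, not the linked answer's code):

```python
import math
import random

def mh_walkers(f, dim, n_walkers=10, steps=200, step0=1.0, shrink=0.99):
    """Metropolis-Hastings-style minimization with several independent
    walkers and a gradually annealed proposal step size."""
    walkers = [[random.uniform(-10, 10) for _ in range(dim)]
               for _ in range(n_walkers)]
    energies = [f(w) for w in walkers]
    best, fbest = min(zip(walkers, energies), key=lambda p: p[1])
    step = step0
    for _ in range(steps):
        for i, w in enumerate(walkers):
            cand = [x + random.gauss(0.0, step) for x in w]
            fc = f(cand)
            # Metropolis rule: always take improvements, sometimes take
            # worse moves so walkers can escape shallow local minima.
            if fc < energies[i] or random.random() < math.exp(energies[i] - fc):
                walkers[i], energies[i] = cand, fc
                if fc < fbest:
                    best, fbest = cand, fc
        step *= shrink  # simulated annealing of the step size
    return best, fbest
```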
I've tried Simulated Annealing and Particle Swarm Optimization. (As a reminder, I couldn't use gradient descent because the gradient cannot be computed).
I've also tried an algorithm that does the following:
Pick a random point and a random direction
Evaluate the function
Keep moving along the random direction for as long as the result keeps improving, speeding up on every successful iteration.
When the result stops improving, step back and instead attempt to move into an orthogonal direction by the same distance.
This "orthogonal direction" was generated by creating a random orthogonal matrix (adapted this code) with the necessary number of dimensions.
If moving in the orthogonal direction improved the result, the algorithm just continued with that direction. If none of the directions improved the result, the jump distance was halved and a new set of orthogonal directions would be attempted. Eventually the algorithm concluded it must be in a local minimum, remembered it and restarted the whole lot at a new random point.
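A condensed sketch of the scheme just described, with one simplification (it redraws a fresh orthogonal direction set each round rather than persisting with a winning direction); all names are mine, and NumPy's QR decomposition stands in for the linked orthogonal-matrix code:

```python
import numpy as np

def random_orthonormal_directions(dim, rng):
    # QR-decomposing a random Gaussian matrix yields an orthogonal
    # matrix; its rows are mutually orthogonal unit search directions.
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

def orthogonal_descent(f, x0, step=1.0, min_step=1e-3, rng=None):
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    s = step
    while s > min_step:
        improved = False
        for d in random_orthonormal_directions(len(x), rng):
            for sign in (1.0, -1.0):
                jump = s
                while True:  # accelerate while the result keeps improving
                    cand = x + sign * jump * d
                    fc = f(cand)
                    if fc >= fx:
                        break
                    x, fx, improved = cand, fc, True
                    jump *= 2.0
        if not improved:
            s *= 0.5  # no direction helped: halve the jump distance
    return x, fx  # caller restarts from a new random point from here
```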
This approach performed considerably better than Simulated Annealing and Particle Swarm: it required fewer evaluations of the (very slow) function to achieve a result of the same quality.
Of course my implementations of S.A. and P.S.O. could well be flawed - these are tricky algorithms with a lot of room for tweaking parameters. But I just thought I'd mention what ended up working best for me.
I can't really help you with finding an algorithm for your specific problem.
However, in regards to the random choosing of parameters, I think what you are looking for is genetic algorithms. Genetic algorithms are generally based on choosing some random inputs, selecting those which (so far) best fit the problem, and randomly mutating/combining them to generate the next generation, from which again the best are selected.
If the function is more or less continuous (that is, small mutations of good inputs generally won't produce bad inputs, where "small" is somewhat problem-specific), this will work reasonably well for your problem.
There is no general way to answer your question. There are lots of books/papers on the subject, but you'll have to choose your path according to your needs, which are not clearly stated here.
Some things to know, however: 1 min/test is a lot for any algorithm to handle. I guess that in your case, you must really do one of the following:
get 100 computers to cut your parameter-testing time down to something reasonable
really try to work out your parameters by hand. There must be some redundancy and at least some sanity check so you can test your case in <1 min
for possible result sets, try to figure out some 'operations' that modify them slightly instead of just randomizing them. For example, in TSP a basic operator swaps two nodes and thus creates a new route; yours could be shifting some parameter up or down by some amount (a minimal example follows this list)
then, find yourself some nice algorithm; your starting point can be somewhere here. The book is an invaluable resource for anyone who starts with problem-solving.
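As a hypothetical illustration of the 'operations' idea in the third item, a neighbourhood move for continuous parameters can be as simple as nudging one coordinate:

```python
import random

def nudge(params, scale=0.1):
    """Neighbourhood move: shift one randomly chosen parameter up or
    down by a small amount, analogous to a two-node swap in TSP."""
    neighbour = list(params)
    i = random.randrange(len(neighbour))
    neighbour[i] += random.uniform(-scale, scale)
    return neighbour
```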
Related
When is the right time to use the Elite\Elitist mode in a Genetic Algorithm? I have no idea when to use it. What kinds of problems can be solved using this?
All I know is that in an elitist model you choose the elite (the solutions with the highest fitness), reserve slots for them in the next generation, and they are the ones up for crossover.
You pretty much always use some form of elitism. What varies is the percentage (p) of best performers that you allow to survive to the next generation. So no elitism is basically saying p=0.
The higher p is, the more your algorithm will tend to find local peaks of fitness; i.e., once it finds a chromosome with good fitness, it'll tend to focus more on optimizing it than on trying to find completely new solutions. Conversely, if p is smaller, your GA will look for possible solutions all over the place and won't zero in as fast once it finds something close to the optimum solution.
So setting p correctly is going to have a direct impact on your algorithm's performance. But it depends on what you're after and your problem space. Play around with it a bit to adjust properly. I typically use 20% for the problems I work with, to give enough room for innovation. It works ok for me.
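A sketch of how that fraction p might plug into a generation step; the fitness and breed callables are placeholders, and breeding from the elite pool follows the description in the question above:

```python
import random

def next_generation(population, fitness, breed, p=0.2):
    """Elitism sketch: the top fraction p survives unchanged and also
    serves as the breeding pool. `breed` is a placeholder for your
    crossover + mutation of two parents."""
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(p * len(population)))
    elites = ranked[:n_elite]
    children = [breed(random.choice(elites), random.choice(elites))
                for _ in range(len(population) - n_elite)]
    return elites + children
```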
I understand that each particle is a solution to a specific function, and each particle and the swarm are constantly searching for the best solution. If the global best is found after the first iteration, and no new particles are being added to the mix, shouldn't the loop just quit and the first global best found be the most fitting solution? If this is the case, what makes PSO better than just iterating through a list?
Your terminology is a bit off. Simple PSO is a search for a vector x that minimizes some scalar objective function E(x). It does this by creating many candidate vectors. Call them x_i. These are the "particles". They are initialized randomly in both position and rate of change, also called velocity, which is consistent with the idea of a moving particle, even though that particle may have many more than 3 dimensions.
Simple rules describe how the position and velocity change over time. The rules are chosen so that each particle x_i tends randomly to move in directions that reduce E(x_i).
The rules usually involve tracking the "single best x_i value seen so far" and are tuned so that all particles tend to head generally toward that best value with random variations. So the particles swarm like buzzing bees, heading as a group toward a common goal, but with many deviations by individual bees that, over time, cause the common goal to change.
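In its common textbook form, those rules amount to a velocity update pulled toward both the particle's own best and the swarm's best. A per-particle sketch (the coefficient values here are typical defaults, not universal constants):

```python
import random

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO update for a single particle.
    x: position, v: velocity, pbest: this particle's best-seen position,
    gbest: best position seen by any particle in the swarm."""
    new_v = [
        w * vi
        + c1 * random.random() * (pb - xi)   # cognitive pull (own best)
        + c2 * random.random() * (gb - xi)   # social pull (swarm best)
        for xi, vi, pb, gb in zip(x, v, pbest, gbest)
    ]
    new_x = [xi + vi for xi, vi in zip(x, new_v)]
    return new_x, new_v
```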
It's unfortunate that some of the literature calls this goal or best particle value seen so far "the global minimum." In optimization, global minimum has a different meaning. A global minimum (there can be more than one when there are "ties" for best) is a value of x that - out of the entire domain of possible x values - produces the unique minimum possible value of E(x).
In no way is PSO guaranteed to find a global minimum. In fact, your question is a bit nonsensical in that one generally never knows when a global minimum has been found. How would you? In most problems you don't even know the gradient of E (which gives the direction taking E to smaller values, i.e. downhill). This is why you are using PSO in the first place. If you know the gradient, you can almost certainly use numerical techniques that will find an answer more quickly than PSO. Without a gradient, you can't even be sure you've found a local minimum, let alone a global one.
Rather, the best you can usually do is "guess" when a local minimum has been found. You do this by letting the system run while watching how often and by how much the "best particle seen so far" is being updated. When the changes become infrequent and/or small, you declare victory.
Another way of putting this is that PSO is used on problems where reducing E(x) is always good and "you'll take anything you can get" regardless of whether you have any confidence that what you got is the best possible. E.g. you're Walmart and any way of locating your stores that saves/makes more dollars is interesting.
With all this as background, let's recap your specific questions:
If the global best is found after the first iteration, and no new particles are being added to the mix, shouldn't the loop just quit and the first global best found be the most fitting solution?
There's no answer because there's no way to determine a global best has been found. The swarm of buzzing particles might find a new best in the next iteration or ten trillion iterations from now. You seldom know.
If this is the case what makes PSO better than just iterating through a list?
I don't exactly grok what you mean by this. The PSO is emulating the way swarms of biological entities like bugs and herd animals behave. In this manner it resembles genetic algorithms, simulated annealing, neural networks, and other families of solution finders that use the following logic: Nature, both physical and biological, has known-good optimization processes. Let's take advantage of them and do our best to emulate them in software. We are using nature to do better than any simple iteration we might devise ourselves.
Given a function, a particle swarm attempts to find the solution (a vector) that will minimize (or sometimes maximize, depending on the problem) the value of that function.
If you happen to know the minimum value of the function (suppose, for argument's sake, it is 0), AND
if you are lucky enough to generate a solution that gives you 0 on the first step, then you can exit the loop and stop the algorithm.
That said, the probability of randomly generating that solution at initialization is vanishingly small.
In most practical cases where you would want to use PSO, you will not know the minimum value, so you won't be able to use it as a stopping condition.
In particle swarm optimization, the optimization does not come from the random initial step, but from the modification of the initial solutions by a velocity with social and cognitive components.
The social component consists of the current global best solution evaluated by the swarm.
The cognitive component consists of the best location seen by the current particle.
This adjustment will move the particle along a line between the global best and its own best, in the hope that there is a better solution between them.
I hope that answers the question in some way
Just to add something to the answers above: your problem seems to be linked to the common issue of "when should I stop my PSO?", a question everyone faces when launching a swarm, since (as clearly explained above) you never know whether you have reached the globally best solution (except for very specific objective functions).
Usual tricks already present in most PSO implementations:
1- just limit the number of iterations, since there is always a limit on processing time (and you could implement different ways to convert the iteration count into a time limit by self-assessing the time spent evaluating the objective).
2- stop the algorithm when the progress in optimization starts to be insignificant.
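A sketch combining both tricks (the thresholds are arbitrary examples):

```python
def should_stop(iteration, best_history, max_iters=500,
                window=50, min_improvement=1e-6):
    """Stop on an iteration budget (trick 1) or when the best value has
    improved by less than min_improvement over the last `window`
    iterations (trick 2). best_history holds, for a minimization
    problem, the best score recorded at each iteration so far."""
    if iteration >= max_iters:
        return True
    if len(best_history) > window:
        return best_history[-window - 1] - best_history[-1] < min_improvement
    return False
```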
I am working on edge detection in images and would like to evaluate the performance of the algorithm. If anyone could give me a reference or method on how to proceed, it would be really helpful. :)
I do not have ground truth, and the data set includes color as well as gray images.
Thank you.
Create a synthetic data set with known edges, for example by 3D rendering, by compositing 2D images with precise masks (as may be obtained in royalty free photosets), or by introducing edges directly (thin/faint lines). Remember to add some confounding non-edges that look like edges, of a type appropriate for what you're tuning for.
Use your (non-synthetic) data set. Run the reference algorithms that you want to compare against. Also produce combinations of the reference algorithms, for example by voting (majority, at least K out of N, etc.). Calculate stats on your algorithm vs. the reference algorithms, in terms of (a) the number of points your algorithm classifies as edges which a reference algorithm, or the combination, does not classify as edges (false positives), and (b) the number of points which the reference algorithm classifies as edges that your algorithm does not (false negatives). You can also calculate a rank-correlation-type number across algorithms by looking at each point and seeing which algorithms do (or don't) classify it as an edge. (A small counting sketch follows these options.)
Create ground truth manually. Use reference edge-finding algos as a starting point, then fix up by hand. Probably valuable to do for a small number of images in any case.
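A small sketch of the false-positive/false-negative counting from the second option, assuming the edge maps are boolean NumPy arrays of equal shape:

```python
import numpy as np

def edge_agreement(candidate, reference):
    """Compare two boolean edge maps of the same shape.
    Returns (false_positives, false_negatives, precision, recall)."""
    fp = int(np.sum(candidate & ~reference))  # we say edge, reference doesn't
    fn = int(np.sum(~candidate & reference))  # reference says edge, we don't
    tp = int(np.sum(candidate & reference))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    return fp, fn, precision, recall

def majority_reference(maps, k):
    """Voting combination: a pixel counts as an edge if at least k of
    the reference maps mark it."""
    return np.sum(np.stack(maps), axis=0) >= k
```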
Good luck!
For comparisons, quantitative measures like what @Alex I explained are best. To do so, you need to define what is "correct" with a ground truth set, plus a way to consistently determine whether a given image is correct, or at a more granular level, how correct it is (some number like a percentage). @Alex I gave a way to do that.
Another option that is often used in graphics research where there is no ground truth is user studies. Usually less desirable as they are time consuming and often more costly. However, if it is a qualitative improvement that you are after or if a quantitative measurement is just too hard to do, a user study is an appropriate solution.
By a user study I mean polling people on how good a result is given the input image. You could give them a scale to rate things on and randomly give them samples from both your results and the results of another algorithm.
And of course, if you still want more ideas, be sure to check out edge detection papers to see how they measured their results (I'd actually look here first, as they've already gone through this same process and determined what was best for them: Google Scholar).
I did a little GP (note: very little) work in college and have been playing around with it recently. My question is in regards to the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work better when you have a noisy fitness function; I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy ones.
For the number of generations, it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out; if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about...). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation/recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements, so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
"Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection.)"
-- David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem and link given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size, etc.
Once I started getting into GA a bit more, I realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
Speaking from my (limited) experience: if you decide to simplify the problem, use a fixed way to implement crossover and selection, and just play with population size and mutation rate (implemented in a given way) while trying to come up with general results, you'll soon realize that too many variables are still in play. At the end of the day, the number of generations after which you will statistically get a decent result (however you want to define "decent") still depends primarily on the problem you're solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines, as the (rare but good) literature proves, but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somehow equivalent way (which more often than not means you're dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community: some regard populations in the (low) thousands as sufficient, whereas Koza and others often don't deem it worthwhile to start a GP run with less than a million individuals in the GP population ;-)
As mentioned before, it depends on your personal taste and experience, your resources, and probably the GP system used!
Cheers,
Jan
Let's imagine I have an unknown function that I want to approximate via Genetic Algorithms. For this case, I'll assume it is y = 2x.
I'd have a DNA composed of 5 elements, one y for each x, from x = 0 to x = 4, from which, after a lot of trials and computation, I'd arrive near something of the form:
best_adn = [ 0, 2, 4, 6, 8 ]
Keep in mind I don't know beforehand if it is a linear function, a polynomial, or something way more ugly. Also, my goal is not to infer from best_adn what the type of function is; I just want those points, so I can use them later.
This was just an example problem. In my case, instead of having only 5 points in the DNA, I have something like 50 or 100. What is the best approach with GA to find the best set of points?
1. Generate a population of 100 and discard the worst 20%?
2. Recombine the remaining 80%? How? By cutting them at a random point and putting together the first part of the father's DNA with the second part of the mother's?
3. Mutation: how should I define mutation for this kind of problem?
4. Is it worth using elitism?
5. Any other simple ideas worth using?
Thanks
Usually you only find these out by experimentation... perhaps writing a GA to tune your GA.
But that aside, I don't understand what you're asking. If you don't know what the function is, and you also don't know the points to begin with, how do you determine fitness?
From my current understanding of the problem, this is better fitted by a neural network.
edit:
2. Recombine the remaining 80%? How? By cutting them at a random point and putting together the first part of the father's DNA with the second part of the mother's?
This is called crossover. If you want to be saucy, do something like picking a random starting point and swapping a random-length segment. For instance, if you have 10 elements in an object, randomly choose a spot X between 1 and 10 and swap a randomly sized run starting at X... you get the picture... spice it up a little.
3. Mutation: how should I define mutation for this kind of problem?
Usually that depends more on what is defined as a legal solution than anything else. You can do mutation the same way you do crossover, except you fill the segment with random (legal) data rather than swapping with another specimen... and you do it at a MUCH lower rate. (A small sketch of both operators appears at the end of this answer.)
4. Is it worth using elitism?
Experiment and find out.
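As promised above, a sketch of segment-swap crossover plus low-rate random-reset mutation; the bounds and mutation rate are made-up illustrations:

```python
import random

def crossover(mother, father):
    """Swap a random-length segment starting at a random point."""
    start = random.randrange(len(mother))
    end = random.randrange(start, len(mother))
    child = list(mother)
    child[start:end + 1] = father[start:end + 1]
    return child

def mutate(genome, rate=0.01, low=-100.0, high=100.0):
    """Replace each gene with fresh legal random data, at a much lower
    rate than crossover is applied."""
    return [random.uniform(low, high) if random.random() < rate else g
            for g in genome]
```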
Gaussian adaptation usually outperforms standard genetic algorithms. If you don't want to write your own package from scratch, the Mathematica Global Optimization package is EXCELLENT -- I used it to fit a really nasty nonlinear function where standard fitters failed miserably.
Edit:
Wikipedia Article
If you hunt down the papers listed in the article, you can find whitepapers and implementations. In general, though, you should have some idea of what the solution space for maximizing the fitness function looks like. If the number of variables is small, or the number of local maxima is small, or they are connected/slope down to a global maximum, simple least squares works fine. If the area around each local maximum is small (i.e. you have to get a damned good solution to hit the best one, otherwise you hit a bad one), then fancier algorithms are needed.
Choosing variables for a genetic algorithm depends on what the solution space will look like.