I've implemented a number of genetic algorithms to solve a variety of problems. However, I'm still skeptical of the usefulness of crossover/recombination.
I usually implement mutation first, before implementing crossover. And after I implement crossover, I don't typically see a significant speed-up in the rate at which a good candidate solution is generated, compared to simply using mutation and introducing a few random individuals in each generation to ensure genetic diversity.
Of course, this may be attributed to poor choices of the crossover function and/or the probabilities, but I'd like some concrete explanation or evidence as to whether (and why) crossover improves GAs. Have there been any studies regarding this?
I understand the reasoning behind it: crossover allows the strengths of two individuals to be combined into one individual. But to me that's like saying we can mate a scientist and a jaguar to get a smart and fast hybrid.
EDIT: In mcdowella's answer, he mentioned how finding a case where cross-over can improve upon hill-climbing from multiple start points is non-trivial. Could someone elaborate upon this point?
It strongly depends on the smoothness of your search space. As a perverse example: if every "genome" were hashed before being used to generate "phenotypes", you would just be doing random search.
In a less extreme case, this is why we often Gray-code integers in GAs.
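For concreteness, here is a minimal sketch of the standard binary-reflected Gray code in Python (function names are my own): neighbouring integers differ in exactly one bit, so a single-bit mutation moves to an adjacent value instead of jumping across the range.

    def binary_to_gray(n: int) -> int:
        """Binary-reflected Gray code: consecutive integers differ in exactly one bit."""
        return n ^ (n >> 1)

    def gray_to_binary(g: int) -> int:
        """Inverse mapping, folding the higher bits back down."""
        n = g
        while g:
            g >>= 1
            n ^= g
        return n

    # Plain binary: 7 = 0b0111, 8 = 0b1000 -> four bits flip between neighbours.
    # Gray code: binary_to_gray(7) = 0b0100, binary_to_gray(8) = 0b1100 -> one bit flips.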
You need to tailor your crossover and mutation functions to the encoding. GAs decay quite easily if you throw unsympathetic calculations at them. If the crossover of A and B doesn't yield something that's both A-like and B-like then it's useless.
Example:
The genome is 3 bits long, bit 0 determines whether it's land-dwelling or sea-dwelling. Bits 1-2 describe digestive functions for land-dwelling creatures and visual capabilities for sea-dwelling creatures.
Consider two land-dwelling creatures.
    | bit 0 | bit 1 | bit 2
----+-------+-------+-------
Mum |   0   |   0   |   1
Dad |   0   |   1   |   0
They might crossover between bits 1 and 2 yielding a child whose digestive function is some compromise between Mum's and Dad's. Great.
This crossover seems sensible provided that bit 0 hasn't changed. If it has, then your crossover function has turned some kind of guts into some kind of eyes. Er... wut? It might as well have been a random mutation.
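As a purely illustrative sketch (hypothetical toy code, not from the answer itself), an encoding-aware crossover would refuse to recombine parents whose mode bit differs, falling back to cloning instead of producing noise:

    import random

    def crossover(mum, dad):
        """Toy one-point crossover for the 3-bit genome above.

        Bit 0 is the land/sea mode bit; bits 1-2 only mean the same thing when
        the mode bits agree, so incompatible parents are not recombined at all.
        """
        if mum[0] != dad[0]:
            # Mixing guts with eyes would just be noise; clone a parent instead.
            return list(random.choice((mum, dad)))
        cut = random.randint(1, 2)            # cut point inside bits 1-2
        return mum[:cut] + dad[cut:]

    child = crossover([0, 0, 1], [0, 1, 0])   # Mum and Dad from the table above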
This begs the question of how DNA gets around the problem. Well, it's both modal and hierarchical. There are large sections which can change a lot without much effect; in others, a single mutation can have drastic effects (like bit 0 above). Sometimes the value of X affects the behaviour triggered by Y, and all values of X are legal and can be explored, whereas a modification to Y makes the animal segfault.
Theoretical analyses of GAs often use extremely crude encodings and they suffer more from numerical issues than semantic ones.
You are correct to be skeptical about the crossover operation. There is a paper called "On the effectiveness of crossover in simulated evolutionary optimization" (Fogel and Stayton, Biosystems, 1994). It is available for free online.
By the way, if you haven't already, I recommend looking into a technique called "Differential Evolution". It can be very good at solving many optimization problems.
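For the curious, here is a minimal sketch of the classic DE/rand/1/bin scheme in Python; the population size, F, and CR values are common defaults, not anything this answer prescribes.

    import numpy as np

    def differential_evolution(f, bounds, pop_size=30, F=0.8, CR=0.9, gens=200):
        """Minimise f over a box; bounds is a list of (lo, hi) pairs, one per dimension."""
        rng = np.random.default_rng(0)
        lo, hi = np.array(bounds, dtype=float).T
        dim = len(bounds)
        pop = rng.uniform(lo, hi, size=(pop_size, dim))
        fit = np.array([f(x) for x in pop])
        for _ in range(gens):
            for i in range(pop_size):
                # DE/rand/1: three distinct individuals other than the target i.
                idx = [j for j in range(pop_size) if j != i]
                a, b, c = pop[rng.choice(idx, size=3, replace=False)]
                mutant = np.clip(a + F * (b - c), lo, hi)
                # Binomial crossover between the target and the mutant.
                cross = rng.random(dim) < CR
                cross[rng.integers(dim)] = True   # guarantee at least one mutant gene
                trial = np.where(cross, mutant, pop[i])
                trial_fit = f(trial)
                if trial_fit < fit[i]:            # greedy replacement
                    pop[i], fit[i] = trial, trial_fit
        best = fit.argmin()
        return pop[best], fit[best]

    # Usage (hypothetical): minimise the sphere function in 3 dimensions.
    # best_x, best_f = differential_evolution(lambda x: float((x ** 2).sum()), [(-5, 5)] * 3)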
My impression is that hill-climbing from multiple random starts is very effective, but that trying to find a case where cross-over can improve on this is non-trivial. One reference is "Crossover: The Divine Afflatus in Search" by David Iclănzan, which states
The traditional GA theory is pillared on the Building Block Hypothesis (BBH) which states that Genetic Algorithms (GAs) work by discovering, emphasizing and recombining low order schemata in high-quality strings, in a strongly parallel manner. Historically, attempts to capture the topological fitness landscape features which exemplify this intuitively straight-forward process, have been mostly unsuccessful. Population-based recombinative methods had been repeatedly outperformed on the special designed abstract test suites, by different variants of mutation-based algorithms.
A related paper is "Overcoming Hierarchical Difficulty by Hill-Climbing the Building Block Structure" by David Iclănzan and Dan Dumitrescu, which states
The Building Block Hypothesis suggests that Genetic Algorithms (GAs) are well-suited for hierarchical problems, where efficient solving requires proper problem decomposition and assembly of solution from sub-solution with strong non-linear interdependencies. The paper proposes a hill-climber operating over the building block (BB) space that can efficiently address hierarchical problems.
John Holland's two seminal works "Adaptation in Natural and Artificial Systems" and "Hidden Order" (less formal) discuss the theory of crossover in depth. IMO, Goldberg's "Genetic Algorithms in Search, Optimization, and Machine Learning" has a very approachable chapter on mathematical foundations which includes such conclusions as:
With both crossover and reproduction....those schemata with both above-average performance and short defining lengths are going to be sampled at exponentially increasing rates.
Another good reference might be Ankenbrandt's "An Extension to the Theory of Convergence and a Proof of the Time Complexity of Genetic Algorithms" (in "Foundations of Genetic Algorithms" by Rawlins).
I'm surprised that the power of crossover has not been apparent to you in your work; when I began using genetic algorithms and saw how powerfully "directed" crossover seemed, I felt I gained an insight into evolution that overturned what I had been taught in school. All the questions about "how could mutation lead to this and that?" and "Well, over the course of so many generations..." came to seem fundamentally misguided.
Crossover and mutation! Actually, both of them are necessary.
Crossover is an explorative operator, while mutation is an exploitative one. Considering the structure of the solutions, the problem, and the desired rate of convergence, it is very important to select correct values for Pc and Pm (the probabilities of crossover and mutation).
Check out this GA-TSP-Solver; it implements many crossover and mutation methods, and you can test any crossover alongside any mutation with given probabilities.
It mainly depends on the search space and the type of crossover you are using. For some problems I found that using crossover at the beginning and then mutation speeds up the process of finding a solution; however, this is not a very good approach, since I end up finding similar solutions. If I use both crossover and mutation, I usually get better-optimized solutions. However, for some problems crossover can be very destructive.
Also, genetic operators alone are not enough to solve large/complex problems. When your operators stop improving your solutions (i.e., when they no longer increase fitness), you should start considering other approaches, such as incremental evolution, etc.
Please correct me if I'm wrong, but it is my understanding that crossover tends to lead towards local optima, while mutation increases the randomness of the search and thus tends to help escape local optima. I got this insight from reading the following: Introduction to Genetic Algorithms and Wikipedia's article on Genetic Operators.
My question is, what is the best or most ideal way to pick which individuals go through crossover and which go through mutation? Is there a rule of thumb for this? What are the implications?
Thanks in advance. This is a pretty specific question that is a bit hard to Google with (for me at least).
The selection of individuals to participate in the crossover operation should consider fitness; that is, "better individuals are more likely to have more child programs than inferior individuals":
http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/23Selection.html#7_3
The most common way to perform this is using Tournament Selection (see wikipedia).
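A minimal sketch of tournament selection in Python; the tournament size k=3 is an assumption (values of roughly 2-7 are common), not something the linked text mandates.

    import random

    def tournament_select(population, fitness, k=3):
        """Pick k individuals uniformly at random and return the fittest of them."""
        contenders = random.sample(range(len(population)), k)
        winner = max(contenders, key=lambda i: fitness[i])
        return population[winner]

    # Both parents for a crossover would then come from independent tournaments:
    # mum = tournament_select(pop, fit)
    # dad = tournament_select(pop, fit)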
Selection of the individuals to mutate should not consider fitness; in fact, it should be random. And the number of elements mutated per generation (the mutation rate) should be very low, around 1% (otherwise the search may degenerate into random search):
http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/24RecombinationandMutation.html#7_4
In my experience, tweaking the tournament parameters just a bit could lead to substantial changes in the final results (for better or for worse), so it is usually a good idea to play with these parameters until you find a "sweet spot".
I am working on my M.Sc. dissertation, and in the theoretical part of my thesis I have a big problem.
Suppose we want to use genetic algorithms. We have two kinds of functions:
a) Functions where ||x1 - x2|| >> ||f(x1) - f(x2)||, for example y = (1/10)x^2.
b) Functions where ||x1 - x2|| << ||f(x1) - f(x2)||, for example y = x^2.
My question is: which of the above kinds of functions is more difficult than the other when we want to use genetic algorithms to find the optimum (never mind whether minimum or maximum)?
Thank you a lot,
Armin
I don't believe you can answer this question in general without imposing additional constraints.
It's going to depend on the particular type of genetic algorithm you're dealing with. If you use fitness-proportional (roulette-wheel) selection, then altering the range of fitness values can matter a great deal. With tournament selection or rank-biased selection, as long as the ordering relation between individuals holds, there will be no effect.
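To make the scale-dependence concrete, here is a small hypothetical example: shifting every fitness value by a constant changes roulette-wheel selection probabilities, but leaves the ordering, and hence tournament or rank selection, untouched.

    def roulette_probabilities(fitness):
        """Selection probability of each individual under fitness-proportional selection."""
        total = sum(fitness)
        return [f / total for f in fitness]

    print(roulette_probabilities([1.0, 2.0]))        # [0.333..., 0.666...]
    print(roulette_probabilities([101.0, 102.0]))    # roughly [0.4975, 0.5025]
    # Shifting every fitness by +100 nearly equalises the selection pressure here,
    # while the ranking of the two individuals (what tournament/rank selection sees)
    # is unchanged.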
Even if you can say that it does matter, it's still going to be difficult to say which version is harder for the GA. The main effect will be on selection pressure, which causes the algorithm to converge more or less quickly. Is that good or bad? It depends. For a function like f(x) = x^2, converging as fast as possible is probably great, because there's only one optimum, so you want to find it as soon as possible. For a more complex function, slower convergence may be required to find good solutions. So for any given function, scaling and/or translating the fitness values may or may not make a difference, and if it does, the difference may or may not be helpful.
There's probably also a No Free Lunch argument that no single best choice exists over all problems and optimization algorithms.
I'd be happy to be corrected, but I don't believe you can say one way or the other without specifying much more precisely exactly what class of algorithms and problems you're focusing on.
I'd like to pose a few abstract questions about computer vision research. I haven't quite been able to answer these questions by searching the web and reading papers.
How does someone know whether a computer vision algorithm is correct?
How do we define "correct" in the context of computer vision?
Do formal proofs play a role in understanding the correctness of computer vision algorithms?
A bit of background: I'm about to start my PhD in Computer Science. I enjoy designing fast parallel algorithms and proving the correctness of these algorithms. I've also used OpenCV for some class projects, though I don't have much formal training in computer vision.
I've been approached by a potential thesis advisor who works on designing faster and more scalable algorithms for computer vision (e.g. fast image segmentation). I'm trying to understand the common practices in solving computer vision problems.
You just don't prove them.
Instead of a formal proof, which is often impossible, you can test your algorithm on a set of test cases and compare the output with previously known algorithms or known correct answers (for example, when recognizing text, you can generate a set of images where you know what the text says).
In practice, computer vision is more like an empirical science: You gather data, think of simple hypotheses that could explain some aspect of your data, then test those hypotheses. You usually don't have a clear definition of "correct" for high-level CV tasks like face recognition, so you can't prove correctness.
Low-level algorithms are a different matter, though: you usually have a clear, mathematical definition of "correct" here. For example, if you invented an algorithm that could calculate a median filter or a morphological operation more efficiently than known algorithms, or that could be parallelized better, you would of course have to prove its correctness, just like any other algorithm.
It's also common to have certain requirements for a computer vision algorithm that can be formalized: for example, you might want your algorithm to be invariant to rotation and translation; these are properties that can be proven formally. It's also sometimes possible to create mathematical models of signal and noise, and design a filter that has the best possible signal-to-noise ratio (IIRC the Wiener filter and the Canny edge detector were designed that way).
Many image processing/computer vision algorithms have some kind of "repeat until convergence" loop (e.g. snakes or Navier-Stokes inpainting and other PDE-based methods). You would at least try to prove that the algorithm converges for any input.
This is my personal opinion, so take it for what it's worth.
You can't prove the correctness of most computer vision methods right now. I consider most of the current methods to be some kind of "recipe" where ingredients are thrown in until the "result" is good enough. Can you prove that a brownie cake is correct?
It is a bit similar to how machine learning evolved. At first, people built neural networks, but they were just a big "soup" that happened to work more or less. They worked in some cases and not in others, and no one really knew why. Then statistical learning (through Vapnik, among others) kicked in, with some real mathematical backing. You could prove that you had the unique hyperplane that minimized a particular loss function, PCA gives you the closest matrix of fixed rank to a given matrix (under the Frobenius norm, I believe), etc.
Now, there are still a few things that are "correct" in computer vision, but they are pretty limited. What comes to mind are wavelets: they are the sparsest representation in an orthogonal basis of functions (i.e., the most compressed way to represent an approximation of an image with minimal error).
Computer vision algorithms are not like theorems which you can prove; they usually try to interpret image data into terms that are more understandable to us humans, like face recognition, motion detection, video surveillance, etc. The exact correctness is not computable, unlike, say, image compression algorithms, where you can easily judge the result by the size of the images.
The most common way to present results for computer vision methods (especially classification problems) is with graphs of precision vs. recall or accuracy vs. false positives, measured on standard databases available on various sites. Usually, the harsher you set the parameters for correct detection, the more false positives you generate. The typical practice is to choose the point on the graph according to your requirement of how many false positives are tolerable for the application.
I did a little GP (note: very little) work in college and have been playing around with it recently. My question is in regards to the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
- Large population sizes seem to work better when you have a noisy fitness function; I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy ones.
- For the number of generations, it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out; if it is showing no improvement then you probably have an issue elsewhere.
- Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about anyway). I've also found total node counts of ~1000 to be quite useful hard limits.
- Percentages for different mutation/recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements, so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
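If it helps, the points above can be collapsed into a rough starting configuration; the numbers are just the ballpark figures mentioned above, not recommendations, and everything should be tuned to your problem and DSL.

    # Ballpark starting points distilled from the notes above; tune per problem and DSL.
    gp_run_config = {
        "population_size": 1000,     # ~100 for clean/deterministic fitness, 1000+ for noisy
        "max_generations": 300,      # but stop as soon as fitness meets your target criteria
        "max_total_nodes": 1000,     # hard cap on program size instead of strict depth limits
        "penalise_long_runs": True,  # penalise/eliminate programs that run too long
        "operator_mix": {            # any reasonably balanced distribution usually works
            "crossover": 0.6,
            "mutation": 0.3,
            "reproduction": 0.1,
        },
    }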
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
"Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection.)"
- David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem and relate given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size, etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
Speaking from my (limited) experience: if you decide to simplify the problem, use a fixed implementation of crossover and selection, and just play with population size and mutation rate (implemented in a given way) to try to come up with general results, you'll soon realize that too many variables are still in play. At the end of the day, the number of generations after which you will statistically get a decent result (however you want to define "decent") still depends primarily on the problem you're solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines, as the (rare but good) literature proves, but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in the exact same way and the fitness is evaluated in a somewhat equivalent way (which more often than not means you're dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community.
Some regard populations in the (low) thousands as sufficient, whereas Koza and others often don't deem it worthwhile to start a GP run with fewer than a million individuals in the GP population ;-)
As mentioned before, it depends on your personal taste and experience, your resources, and probably the GP system used!
Cheers,
Jan
I'm not interested in tiny optimizations giving few percents of the speed.
I'm interested in the most important heuristics for alpha-beta search, and in the most important components of the evaluation function.
I'm particularly interested in algorithms that have greatest (improvement/code_size) ratio.
(NOT (improvement/complexity)).
Thanks.
PS
Killer move heuristic is a perfect example - easy to implement and powerful.
A database of heuristics is too complicated.
Not sure if you're already aware of it, but check out the Chess Programming Wiki - it's a great resource that covers just about every aspect of modern chess AI. In particular, relating to your question, see the Search and Evaluation sections (under Principle Topics) on the main page. You might also be able to discover some interesting techniques used in some of the programs listed here. If your questions still aren't answered, I would definitely recommend you ask in the Chess Programming Forums, where there are likely to be many more specialists around to answer. (Not that you won't necessarily get good answers here, just that it's rather more likely on topic-specific expert forums).
MTD(f) or one of the MTD variants is a big improvement over standard alpha-beta, providing you don't have really fine detail in your evaluation function and assuming that you're using the killer heuristic. The history heuristic is also useful.
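For reference, the MTD(f) driver itself is tiny; all the real work happens in a zero-window alpha-beta with a transposition table, which is assumed here and passed in as a parameter (a sketch in Python, not any particular engine's API).

    INF = float("inf")

    def mtdf(root, depth, first_guess, zero_window_search):
        """MTD(f) driver: repeated null-window alpha-beta calls converging on the
        minimax value. `zero_window_search(node, alpha, beta, depth)` is whatever
        alpha-beta-with-transposition-table your engine already has."""
        g, lower, upper = first_guess, -INF, INF
        while lower < upper:
            beta = g + 1 if g == lower else g
            g = zero_window_search(root, beta - 1, beta, depth)
            if g < beta:
                upper = g   # the search failed low
            else:
                lower = g   # the search failed high
        return g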
The top-rated chess program Rybka has apparently abandoned MTD(f) in favour of PVS with a zero-aspiration window on the non-PV nodes.
Extended futility pruning, which incorporates both normal futility pruning and deep razoring, is theoretically unsound, but remarkably effective in practice.
Iterative deepening is another useful technique. And I listed a lot of good chess programming links here.
Even though many optimizations based on heuristics (I mean ways to increase the effective tree depth without actually searching) are discussed in the chess programming literature, I think most of them are rarely used. The reason is that they are good performance boosters in theory, but not in practice.
Sometimes these heuristics can return a bad (I mean not the best) move too.
The people I have talked to always recommend optimizing the alpha-beta search and implementing iterative deepening into the code rather than trying to add the other heuristics.
The main reason is that computers are increasing in processing power, and research [citation needed, I suppose] has shown that programs that use their full CPU time to brute-force the alpha-beta tree to the maximum depth have consistently outperformed programs that split their time between a certain level of alpha-beta and then some heuristics.
Even though using some heuristics to extend the tree depth can cause more harm than good, there are many performance boosters you can add to the alpha-beta search algorithm.
I am sure you are aware that for alpha-beta to work exactly as intended, you should have a move-sorting mechanism (iterative deepening). Iterative deepening can give you about a 10% performance boost.
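A rough sketch of the iterative-deepening loop in Python; the `search` function and the time handling are placeholders for whatever root alpha-beta and time management your engine provides.

    import time

    def iterative_deepening(position, max_depth, search, deadline):
        """Run alpha-beta at depth 1, 2, ..., feeding each iteration's best move back
        in as the first move to try; `search(position, depth, try_first)` is assumed
        to be your engine's root alpha-beta returning (best_move, score)."""
        best_move, score = None, None
        for depth in range(1, max_depth + 1):
            if time.monotonic() > deadline:      # deadline = time.monotonic() + budget
                break                            # the deepest completed iteration still stands
            best_move, score = search(position, depth, try_first=best_move)
        return best_move, score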
Adding the principal variation search (PVS) technique to alpha-beta may give you an additional 10% boost.
Try the MTD(f) algorithm too. It can also increase the performance of your engine.
One heuristic that hasn't been mentioned is Null move pruning.
Also, Ed Schröder has a great page explaining a number of tricks he used in his Rebel engine, and how much improvement each contributed to speed/performance: Inside Rebel
Using a transposition table with a Zobrist hash
It takes very little code to implement [one XOR on each move or unmove, and an if statement before recursing in the game tree], and the benefits are pretty good, especially if you are already using iterative deepening. It's also pretty tweakable (use a bigger table, a smaller table, different replacement strategies, etc.).
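A minimal sketch of the incremental Zobrist update (the 12 x 64 table assumes standard chess piece types and squares; names are my own):

    import random

    random.seed(2017)
    # One fixed random 64-bit key per (piece type, square): 12 x 64 for standard chess.
    ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

    def toggle(h, piece, square):
        """XOR a piece in or out of the hash; the same call makes and unmakes a move."""
        return h ^ ZOBRIST[piece][square]

    # A quiet move is two toggles: remove the piece from its origin, add it on the target.
    # h = toggle(toggle(h, piece, from_sq), piece, to_sq)
    # Before recursing, look h up in the transposition table and reuse the stored score
    # if that position was already searched to at least the current depth.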
Killer moves are a good example of small code size and a great improvement in move ordering.
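As a hypothetical sketch of how little bookkeeping this needs: keep a couple of quiet moves per ply that recently caused a beta cutoff, and try them early when ordering moves.

    MAX_PLY = 64
    killers = [[None, None] for _ in range(MAX_PLY)]   # two slots per search ply

    def record_killer(ply, move):
        """Remember a quiet move that just caused a beta cutoff at this ply."""
        if move not in killers[ply]:
            killers[ply] = [move, killers[ply][0]]     # simple shift-in replacement

    def order_moves(moves, ply):
        """Try this ply's recent killers before the remaining quiet moves."""
        return sorted(moves, key=lambda m: 0 if m in killers[ply] else 1)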
Most board-game AI algorithms are based on MinMax (http://en.wikipedia.org/wiki/Minmax). The goal is to minimize the opponent's options while maximizing your own, although with chess this is a very large and expensive runtime problem. To help reduce that, you can combine minmax with a database of previously played games. Any game that has a similar board position, and an established pattern for how that layout was won for your color, can be used when "analyzing" where to move next.
I am a bit confused about what you mean by improvement/code_size. Do you really mean improvement / runtime analysis (big O(n) vs. o(n))? If that is the case, talk to IBM and big blue, or Microsoft's Parallels team. At PDC I spoke with a guy (whose name escapes me now) who was demonstrating Mahjong using 8 cores per opponent, and they won first place in the game algorithm design competition (whose name also escapes me).
I do not think there are any "canned" algorithms out there that always win at chess and do it very fast. The way you would have to do it is to have EVERY possible previously played game indexed in a very large dictionary-based database, with the analysis of every game pre-cached. It would be a VERY complex algorithm and, in my opinion, a very poor improvement/complexity trade-off.
I might be slightly off-topic, but "state of the art" chess programs, such as Deep Blue, use MPI for massive parallel power.
Just consider that parallel processing plays a great role in modern chess.