Why do Genetic Algorithms evaluate the same genome several times? - genetic-algorithm

I'm trying to solve the DARP Problem, using a Genetic Algorithm DLL.
The thing is that even though it sometimes comes up with the right solution, other times it does not, even though I'm using a really simple version of the problem. When I checked the genomes the algorithm evaluated, I found out that it evaluates the same genome several times.
Why does it evaluate the same genome several times? Wouldn't it be more efficient if it did not?

There is a difference between evaluating the same chromosome twice, and using the same chromosome in a population (or different populations) more than once. The first can probably be usefully avoided; the second, maybe not.
Consider:
In some generation G1, you mate 0011 and 1100, cross them right down the middle, get lucky and fail to mutate any of the genes, and end up with 0000 and 1111. You evaluate them, stick them back into the population for the next generation, and continue the algorithm.
Then in some later generation G2, you mate 0111 and 1001 at the first index and (again, ignoring mutation) end up with 1111 and 0001. One of those has already been evaluated, so why evaluate it again? Especially if evaluating that function is very expensive, it may very well be better to keep a hash table (or some such) to store the results of previous evaluations, if you can afford the memory.
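A minimal sketch of such a cache, assuming the genome is a table of bits that can be serialised into a string key (the fitness function here is just a placeholder for whatever expensive evaluation you use):

    -- Cache of previous evaluations, keyed by the genome's string form.
    local fitness_cache = {}

    local function evaluate(genome, raw_fitness)
        local key = table.concat(genome, "")    -- e.g. {0,0,1,1} -> "0011"
        if fitness_cache[key] ~= nil then
            return fitness_cache[key]           -- already evaluated in an earlier generation
        end
        local value = raw_fitness(genome)       -- the expensive call
        fitness_cache[key] = value
        return value
    end

The memory trade-off mentioned above is the size of fitness_cache, which grows with every distinct genome ever evaluated.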
But! Just because you've already generated a value for a chromosome doesn't mean you shouldn't include it naturally in the results going forward, allowing it to either mutate further or allowing it to mate with other members of the same population. If you don't do that, you're effectively punishing success, which is exactly the opposite of what you want to do. If a chromosome persists or reappears from generation to generation, that is a strong sign that it is a good solution, and a well-chosen mutation operator will act to drive it out if it is a local maximum instead of a global maximum.

The basic explanation for why a GA might evaluate the same individual is precisely because it is non-guided, so the recreation of a previously-seen (or simultaneously-seen) genome is something to be expected.
More usefully, your question could be interpreted as about two different things:
A high cost associated with evaluating the fitness function, which could be mitigated by hashing the genome, at the cost of memory. This is conceivable, but I've never seen it. Usually GAs search high-dimensional spaces, so you'd end up with a very sparse hash.
A population in which many or all members have converged to a single or few patterns: at some point, the diversity of your genome will tend towards 0. This is the expected outcome as the algorithm has converged upon the best solution that it has found. If this happens too early, with mediocre results, it indicates that you are stuck in a local minimum and you have lost diversity too early.
In this situation, study the output to determine which of the two scenarios has happened:
You lose diversity because particularly-fit individuals win a very large percentage of the parental lotteries. Or,
You don't gain diversity because the population over time is quite static.
In the former case, you have to maintain diversity. Ensure that less-fit individuals get more chances, perhaps by decreasing turnover or scaling the fitness function.
In the latter case, you have to increase diversity. Ensure that you inject more randomness into the population, perhaps by increasing mutation rate or increasing turnover.
(In both cases, of course, you ultimately do want diversity to decrease as the GA converges in the solution space. So you don't just want to "turn up all the dials" and devolve into a Monte Carlo search.)

Basically, a genetic algorithm consists of:
- an initial population (size N)
- a fitness function
- a mutation operation
- a crossover operation (usually performed on 2 individuals by taking parts of their genomes and combining them into a new individual)
At every step it:
- chooses random individuals
- performs crossover, resulting in new individuals
- possibly performs mutation (changes a random gene in a random individual)
- evaluates all old and new individuals with the fitness function
- chooses the N best-fitted individuals to be the new population for the next iteration
The algorithm stops when it reaches a threshold of the fitness function, or if there have been no changes in the population in the last K iterations.
So it could stop not at the best solution, but at a local maximum.
Part of the population can stay unchanged from one iteration to another, because those individuals can have a good fitness value.
It is also possible to "fall back" to previous genomes because of mutation.
There are a lot of tricks to make a genetic algorithm work better: choosing an appropriate encoding of the population into genomes, finding a good fitness function, and playing with the crossover and mutation ratios.
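To make the steps above concrete, here is a minimal sketch of such a loop; the fitness, crossover and mutate functions and the mutation_rate parameter are placeholders, not part of the original answer:

    -- Skeleton of the loop described above; fitness, crossover and mutate are placeholders.
    local function evolve(population, fitness, crossover, mutate, max_iters, mutation_rate)
        local N = #population
        for iter = 1, max_iters do
            -- choose random individuals and perform crossover
            local a = population[math.random(N)]
            local b = population[math.random(N)]
            local c1, c2 = crossover(a, b)
            -- possibly perform mutation
            if math.random() < mutation_rate then c1 = mutate(c1) end
            if math.random() < mutation_rate then c2 = mutate(c2) end
            population[#population + 1] = c1
            population[#population + 1] = c2
            -- evaluate all old and new individuals, keep the N best for the next iteration
            local scored = {}
            for i = 1, #population do
                scored[i] = { genome = population[i], fit = fitness(population[i]) }
            end
            table.sort(scored, function(x, y) return x.fit > y.fit end)
            population = {}
            for i = 1, N do population[i] = scored[i].genome end
        end
        return population
    end

A real implementation would also check the stopping criteria mentioned above (a fitness threshold, or no change over the last K iterations) instead of always running max_iters iterations.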

Depending on the particulars of your GA, you may have the same genomes in successive populations. For example, elitism saves the best (or n best) genomes from each population.
Reevaluating genomes is inefficient from a computational standpoint. The easiest way to avoid this is to add a boolean HasFitness flag to each genome. You could also create a unique string key for each genome encoding and store all the fitness values in a dictionary. This lookup can get expensive, so it is only recommended if your fitness function is costly enough to warrant the added expense of the lookup.
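A sketch of the flag-based variant, assuming each genome is stored as a table (the field names here are illustrative, not from any particular library):

    -- Evaluate a genome only once; afterwards the stored value is reused.
    local function get_fitness(genome, fitness_function)
        if not genome.has_fitness then
            genome.fitness = fitness_function(genome.encoding)
            genome.has_fitness = true
        end
        return genome.fitness
    end

The dictionary variant works like the cache sketched under the first answer, with the genome's string encoding as the key.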

Elitism aside, the GA does not evaluate the same genome repeatedly. What you are seeing is identical genomes being regenerated and reevaluated. This is because each generation is a new set of genomes, which may or may not have been evaluated before.
To avoid the reevaluation you would need to keep a list of already produced genomes, together with their fitness. To reuse a fitness you would need to compare each member of your new population with that list; when a genome is not in the list, you would need to evaluate it and add it to the list.
As real-world applications have thousands of parameters, you end up with millions of stored genomes. This becomes massively expensive to search and maintain, so it is probably quicker to just evaluate the genome each time.

Related

Algorithm for incomplete ranking with imprecise comparisons

SUMMARY
I'm looking for an algorithm to rank objects. Two objects can be compared. However, the comparisons are real world comparisons that may be flawed. Also, I care more about finding out the very best object than which ones are the worst.
TO MOTIVATE:
Think that I'm scientifically evaluating materials. I combine two materials. I want to find the best working material for in-depth testing. So, I don't care about materials that are unpromising. However, each test can be a false positive or have anomalies between those particular two materials.
PRECISE PROBLEM:
There is an unlimited pool of objects.
Two objects can be compared to each other. It is resource expensive to compare two objects.
It's resource expensive to consider an additional object. So, an object should only be included in the evaluation if it can be fully ranked.
It is very important to find the very best object in the pool of the tested ones. If an object is in the bottom half, it doesn't matter to find out where in the bottom half it is. The importance of finding out the exact rank is a gradient with the top much more important.
Most of the time, if A > B and B > C, it is safe to assume that A > C. Sometimes, there are false positives. Sometimes A > B and B > C and C > A. This is not an abstract math space but real world measurements.
At the start, it is not known how many comparisons are allowed to be taken. The algorithm is granted permission to do another comparison until it isn't. Thus, the decision on including an additional object or testing more already tested objects has to be made.
TO MOTIVATE MORE IN-DEPTH:
Imagine that you are tasked with hiring a team of boxers. You know nothing about evaluating boxers but can ask two boxers to fight each other. There is an unlimited number of boxers in the world. But it's expensive to fly them in. Ideally, you want to hire the n best boxers. Realistically, you don't know if the boxers are going to accept your offer. Plus, you don't know how competitively the other boxing clubs bid. You are going to make offers to only the best n boxers, but have to be prepared to know which next n boxers to send offers to. That you only get the worst boxers is very unlikely.
SOME APPROACHES
I could think of the following approaches. However, they all have drawbacks. I feel like there should be a much better approach.
USE TRADITIONAL SORTING ALGORITHMS
Traditional sorting algorithms could be used.
Drawback:
- A false positive could seriously throw off the correctness of the algorithm.
- A sorting algorithm would spend half the time sorting the bottom half of the pack, which is unimportant.
- Sorting algorithms start with all items. With this problem, we are allowed to do the first test without knowing if we are allowed to do a second test. We may end up only being allowed to do two tests. Or we may be allowed to do a million tests.
USE TOURNAMENT ALGORITHMS
There are algorithms for tournaments. E.g., everyone gets a first match. The winner of the first match moves on to the next round. There is a variety of tournament strategies that accounts for people having a bad day or being paired with the champion in their first match.
Drawback:
- This seems pretty promising. The difficulty is to find one that allows adding one more player at a time as we are allowed more comparisons. It seems that there should be a highly specialized solution that's better than a standard tournament algorithm.
BINARY SEARCH
We could start with two objects. Each time an object is added, we could use a binary search to find its spot in the ranking. Because the top is more important, we could use a weighted binary search. E.g. instead of testing the mid point, it tests the point at the top 1/3.
Drawback:
- The algorithm doesn't correct for false positives. If there is a false positive at the top early on, it could skew the whole rest of the tests.
COUNT WINS AND LOSSES
The wins and losses could be counted. The algorithm would choose test subjects by a priority of the least losses and second priority of the most wins. This would focus on testing the best objects. If an object has zero losses, it would get the focus of the testing. It would either quickly get a loss and drop in priority, or it would get a lot more tests because it's the likely top candidate.
DRAWBACK:
- The approach is very nice in that it corrects for false positives. It also allows adding more objects to the test pool easily. However, it does not consider that a win against a top object counts a lot more than a win against a bottom object. Thus, comparisons are wasted.
GRAPH
All the objects could be added to a graph. The graph could be flattened.
DRAWBACK:
- I don't know how to flatten such a messy graph that could have cycles and ambiguous end nodes. There could be multiple objects that are undefeated. How would one pick a winner in such a messy graph? How would one know which comparison would be the most valuable?
SCORING
As a win depends on the rank of the loser, a win could be given a score. Say A > B means that A gets 1 point. If C > A, C gets 2 points because A has 1 point. In the end, objects are ranked by how many points they have.
DRAWBACK
- The approach seems promising in that it is easy to add new objects to the pool of tested objects. It also takes into account that wins against top objects should count for more. I can't think of a good way to determine the points. That first comparison, was awarded 1 point. Once 10,000 objects are in the pool, an average win would be worth 5,000 points. The award of both tests should be roughly equal. Later comparisons overpower the earlier comparisons and make them be ignored when they shouldn't.
Does anyone have a good idea on tackling this problem?
I would search for an easily computable value for an object, that could be compared between objects to give a good enough approximation of order. You could compare each new object with the current best accurately, then insertion sort the loser into a list of the rest using its computed value.
The best will always be accurate; the ordering of the rest depends on your "value".
I would suggest looking into Elo rating systems and their derivatives (like Glicko, BayesElo, WHR, TrueSkill, etc.).
So you assign each object a preliminary rating, and then update that value according to the matches/comparisons you make. (with bigger changes to the ratings the more unexpected the outcome was)
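As an illustration, here is a minimal Elo-style update for a single comparison; the K-factor of 32 and the 400-point scale are conventional defaults, not something prescribed by this answer:

    -- Standard Elo update: the more surprising the outcome, the bigger the rating change.
    local function elo_update(rating_a, rating_b, a_won, k)
        k = k or 32                                          -- conventional K-factor
        local expected_a = 1 / (1 + 10 ^ ((rating_b - rating_a) / 400))
        local score_a = a_won and 1 or 0
        local delta = k * (score_a - expected_a)
        return rating_a + delta, rating_b - delta
    end

If a low-rated object beats a high-rated one, expected_a is close to 0 and the winner gains nearly the full k points; an expected win changes the ratings only slightly.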
This still leaves open the question of how to decide which object to compare to which other object to gain most information. For that I suggest looking into tournament systems and playoff formats. Though I suspect that an optimal solution will be decidedly more ad-hoc than that.

How to valorize better offspring more than with my roulette selection method?

I am playing around with genetic programming algorithms, and I want to know how I can valorize my best exemplars and make sure they reproduce more, by substituting or improving the way I choose which ones will reproduce. Currently the method I use looks like this:
    -- Fitness-proportionate (roulette wheel) selection: pick a random "slice"
    -- of the total fitness and return the first individual whose cumulative
    -- fitness reaches it.
    function roulette(population)
        local slice = sum_of_fitnesses(population) * math.random()
        local sum = 0
        for iter = 1, #population do
            sum = sum + population[iter].fitness
            if sum >= slice then
                return population[iter]
            end
        end
    end
But I can't get my population to reach an average fitness which is above a certain value and I worry it's because of less fit members reproducing with more fit members and thus continuing to spread their weak genes around.
So how can I improve my roulette selection method? Or should I use a completely different fitness proportionate selector?
There are a couple of issues at play here.
You are choosing the probability of an individual replicating based on its fitness, so the fitness function that you are using needs to exaggerate small differences; otherwise a minor decrease in fitness costs almost nothing. For example, if a fitness drops from 81 to 80, this change is probably within the noise of the system and won't make much of a difference to evolution. It will certainly be almost impossible to climb to a very high fitness if a series of small changes needs to be made, because the selective pressure simply won't be strong enough.
The way you solve this problem is by using something like tournament selection. In its simplest form, every time you want to choose another individual to be born, you pick K random individuals (K is known as the "tournament size"). You calculate the fitness of each individual and whichever has the highest fitness is replicated. It doesn't matter if the fitness difference is 81 vs 80 or if it's 10000 vs 2, since it simply takes the highest fitness.
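A minimal sketch of that tournament step, assuming each individual carries a precomputed fitness field as in the roulette example above:

    -- Pick K random individuals and return the fittest of them; K is the tournament size.
    local function tournament(population, K)
        local best = population[math.random(#population)]
        for i = 2, K do
            local challenger = population[math.random(#population)]
            if challenger.fitness > best.fitness then
                best = challenger
            end
        end
        return best
    end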
Now the question is: what should you set K to? K can be thought of as the strength of selection. If you set it low (e.g., K=2) then many low-fitness individuals will get lucky and slip through, having been pitted only against other low-fitness individuals. You'll get a lot of diversity, but very little selection. On the flip side, if you set K to be high (say, K=100), you're ALWAYS going to pick one of the highest fitnesses in the population, ensuring that the population average is driven closer to the max, but also driving down diversity in the population.
The particular tradeoff here depends on the specific problem. I recommend trying out different options (including your original algorithm) with a few different problems to see what happens. For example, try the all-ones problem: potential solutions are bit strings and a fitness is simply the number of 1's. If you have weak selection (as in your original example, or with K=2), you'll see that it never quite gets to a perfect all-ones solution.
So, why not always use a high K? Well consider a problem where ones are negative unless they appear in a block of four consecutive ones (or eight, or however many), when they suddenly become very positive. Such a problem is "deceptive", which means that you need to explore through solutions that look bad in order to find ones that are good. If you set your strength of selection too high, you'll never collect three ones for that final mutation to give you the fourth.
Lots of more advanced techniques that build on tournament selection exist and are worth a look. For example, you can vary K over time, or even within a single population select some individuals using a low K and others using a high K. It's worth reading up on them if you're planning to build a better algorithm.

Why do you need fitness scaling in Genetic Algorithms?

Reading the book "Genetic Algorithms" by David E. Goldberg, he mentions fitness scaling in Genetic Algorithms.
My understanding of this function is to constrain the strongest candidates so that they don't flood the pool for reproduction.
Why would you want to constrain the best candidates? In my mind having as many of the best candidates as early as possible would help get to the optimal solution as fast as possible.
What if your early best candidates later on turn out to be evolutionary dead ends? Say, your early fittest candidates are big, strong agents that dominate smaller, weaker candidates. If all the weaker ones are eliminated, you're stuck with large beasts that maybe have a weakness to an aspect of the environment that hasn't been encountered yet that the weak ones can handle: think dinosaurs vs tiny mammals after an asteroid impact. Or, in a more deterministic setting that is more likely the case in a GA, the weaker candidates may be one or a small amount of evolutionary steps away from exploring a whole new fruitful part of the fitness landscape: imagine the weak small critters evolving flight, opening up a whole new world of possibilities that the big beasts most likely will never touch.
The underlying problem is that your early strongest candidates may actually be in or around a local maximum in fitness space, that may be difficult to come out of. It could be that the weaker candidates are actually closer to the global maximum.
In any case, by pruning your population aggressively, you reduce the genetic diversity of your population, which in general reduces the search space you are covering and limits how fast you can search this space. For instance, maybe your best candidates are relatively close to the global best solution, but just inbreeding that group may not move it much closer, and you may have to wait for enough random positive mutations to happen. However, perhaps one of the weak candidates that you wanted to cut out has some gene that on its own doesn't help much, but when crossed with the genes from your strong candidates it may cause a big evolutionary jump! Imagine, say, a human crossed with spider DNA.
@sgvd's answer makes valid points but I would like to elaborate more.
First of all, we need to define what fitness scaling actually means. If it means just multiplying the fitnesses by some factor then this does not change the relationships in the population - if the best individual had 10 times higher fitness than the worst one, after such multiplication this is still true (unless you multiply by zero which makes no real sense). So, a much more sensible fitness scaling is an affine transformation of the fitness values:
scaled(f) = a * f + b
i.e. the values are multiplied by some number and offset by another number, up or down.
Fitness scaling makes sense only with certain types of selection strategies, namely those where the selection probability is proportional to the fitness of the individuals [1].
Fitness scaling plays, in fact, two roles. The first one is merely practical - if you want a probability to be proportional to the fitness, you need the fitness to be positive. So, if your raw fitness value can be negative (but is limited from below), you can adjust it so you can compute probabilities out of it. Example: if your fitness gives values from the range [-10, 10], you can just add 10 to the values to get all positive values.
The second role is, as you and @sgvd already mentioned, to limit the capability of the strongest solutions to overwhelm the weaker ones. The best illustration would be with an example.
Suppose that your raw fitness function gives values in the range [0, 100]. If you left it this way, the worst individuals would have zero probability of being selected, and the best ones would have up to 100x higher probability than the worst ones (excluding the really worst ones). However, let's set the scaling factors to a = 1/2, b = 50. Then the range is transformed to [50, 100]. And right away, two things happen:
Even the worst individuals have non-zero probability of being selected.
The best individuals are now only 2x more likely to be selected than the worst ones.
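A small sketch of that scaling and its effect on selection probabilities, using the same a = 1/2, b = 50 as above:

    -- scaled(f) = a * f + b, then selection probability proportional to scaled fitness
    local a, b = 0.5, 50
    local raw = { 0, 50, 100 }                   -- worst, middle and best raw fitness
    local scaled, total = {}, 0
    for i, f in ipairs(raw) do
        scaled[i] = a * f + b
        total = total + scaled[i]
    end
    for i, s in ipairs(scaled) do
        print(raw[i], s, s / total)              -- raw value, scaled value, selection probability
    end
    -- raw 0   -> scaled 50,  probability ~0.22
    -- raw 100 -> scaled 100, probability ~0.44 (only 2x the worst, instead of infinitely more)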
Exploration vs. exploitation
By setting the scaling factors you can control whether the algorithm will do more exploration over exploitation and vice versa. The more "compressed" [2] the values are going to be after the scaling, the more exploration is going to be done (because the likelihood of the best individuals being selected compared to the worst ones will be decreased). And vice versa, the more "expanded" [2] the values are going to be, the more exploitation is going to be done (because the likelihood of the best individuals being selected compared to the worst ones will be increased).
Other selection strategies
As I have already written at the beginning, fitness scaling only makes sense with selection strategies which derive the selection probability proportionally from the fitness values. There are, however, other selection strategies that do not work like this.
Ranking selection
Ranking selection is identical to roulette wheel selection, except that the numbers the probabilities are derived from are not the raw fitness values. Instead, the whole population is sorted by the raw fitness values and the rank (i.e. the position in the sorted list) is the number you derive the selection probability from.
This totally erases the discrepancy when there are one or two "big" individuals and a lot of "small" ones; they will just be ranked.
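A sketch of that idea, reusing the kind of individuals from the roulette example earlier (the rank field is an assumption of this sketch):

    -- Sort by raw fitness, then use each individual's rank as its selection weight.
    local function assign_ranks(population)
        local sorted = {}
        for i, ind in ipairs(population) do sorted[i] = ind end
        table.sort(sorted, function(x, y) return x.fitness < y.fitness end)
        for rank, ind in ipairs(sorted) do
            ind.rank = rank          -- worst individual gets rank 1, best gets #population
        end
        return sorted
    end

Roulette selection then draws against the sum of ranks rather than the sum of raw fitnesses, so a single huge fitness value no longer dominates the wheel.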
Tournament selection
In this type of selection you don't even need to know the absolute fitness values at all, you just need to be able to compare two of them and tell which one is better. To select one individual using tournament selection, you randomly pick a number of individuals from the population (this number is a parameter) and you pick the best one of them. You repeat that as long as you have selected enough individuals.
Here you can also control the exploration vs. exploitation trade-off by the size of the tournament: the larger the tournament, the higher the chance that the best individuals will take part in the tournaments.
[1] An example of such a selection strategy is the classical roulette wheel selection. In this selection strategy, each individual has its own section of the roulette wheel which is proportional in size to the particular individual's fitness.
[2] Assuming the raw values are positive, the scaled values get compressed as a goes down to zero and as b goes up. Expansion goes the other way around.

GA Chromosome Representation with bits of different importance

In a genetic algorithm, is it OK to encode the chromosome in such a way that some bits have more importance than other bits in the same chromosome? For example, the bits at even indices (2, 4, 6, ...) are more important than the bits at odd indices (1, 3, 5, ...): if bit 2 has a value in the range [1, 5] we consider the value of bit 3, but if bit 2 has the value 0, the value of bit 3 has no effect.
For example, if the problem is that we have multiple courses to be offered by a school and we want to know which course should be offered in the next semester and which should not, and if a course should be offered who should teach that course and when he/she should teach it. So one way to represent the problem is to use a vector of length 2n, where n is the number of courses. Each course is represented by a 2-tuple (who,when), where when is when the course should be taught and who is who should teach it. The tuple in the i-th position holds assignment for the i-th course. Now the possible values for who are the ids of the teachers [1-10], and the possible values for when are all possible times plus 0, where 0 means at no time which means the course should not be offered.
Now, is it OK to have two different tuples with the same fitness? For instance, (3,0) and (2,0) are different values for the i-th course but they mean the same thing: this course should not be offered, since we don't care about who if when=0. Or should I add 0 to the domain of who, so that 0 means taught by no one, and a tuple means that the corresponding course should not be offered if and only if its value is (0,0)? But then how about (0,v) and (v,0), where v>0? Should I consider these to mean that the course should not be offered? I need help with this please.
I'm not sure I fully understand your question but I'll try to answer as best I can.
When using genetic algorithms to solve problems you can have a lot of flexibility in how it's encoded. Broadly, there are two places where certain bits can have more prominence: In the fitness function or in the implementation of the algorithms (namely selection, crossover and mutation). If you want to change the prominence of certain bits in the fitness function I'd go ahead. This would encourage the behaviour you want and generally lead towards a solution where certain bits are more prominent.
I've worked with a lot of genetic algorithms where the fitness function gives some bits (or groupings of bits) more weight than others. It's fairly standard.
I'd be a lot more careful when making certain bits more prominent than others in the genetic algorithm implementation. I've worked with algorithms that only allow certain bits to mutate, or that can only crossover at certain points. Sometimes they can work well (sometimes they're necessary given the problem) but for the most part they're a lot harder to get right, and more prone to problems like premature convergence.
EDIT:
In answer to the second part of your question, and your comments:
The best way to deal with situations where a course should not be offered is probably in the fitness function: simply give a low score (or no score) to these. The same applies to course duplicates in a chromosome. In theory, this should discourage them from becoming a prevalent part of your population. Alternatively, you could apply a form of "culling" every generation, which completely removes chromosomes which are not viable from the population. You can probably mix the two by completely excluding chromosomes with no fitness score from selection.
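A sketch of the culling idea, assuming non-viable chromosomes have been given a fitness of zero (or no fitness at all) by the fitness function:

    -- Remove chromosomes that received no fitness score before selection runs.
    local function cull(population)
        local viable = {}
        for _, chromosome in ipairs(population) do
            if chromosome.fitness and chromosome.fitness > 0 then
                viable[#viable + 1] = chromosome
            end
        end
        return viable
    end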
From what you've said about the problem it sounds like having non-viable chromosomes is probably going to be common. This doesn't have to be a problem. If your fitness function is encoded well, and you use the correct selection and crossover methods it shouldn't be an issue. As long as the more viable solutions are fitter you should be able to evolve a good solution.
In some cases it's a good idea to stop crossover at certain points in the chromosomes. It sounds like this might be the case, but again, without knowing more about your implementation it's hard to say.
I can't really give a more detailed answer without knowing more about how you plan to implement the algorithm. I'm not really familiar with the problem either; it's not something I've ever done. If you add a bit more detail on how you plan to encode the problem and the fitness function, I may be able to give more specific advice.

Algorithm to optimize parameters based on imprecise fitness function

I am looking for a general algorithm to help in situations with similar constraints as this example:
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walks and Markov chains, as those are my best guesses about where to look, but so far I have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for falls into a broad research area called metaheuristics (which includes many non-linear optimization algorithms such as genetic algorithms, simulated annealing or tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
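A minimal sketch of the traversal function described above, assuming the graph is stored as weights[from][to] with positive edge weights:

    -- Walk N nodes, at each step choosing the next operation proportionally to edge weight.
    local function traverse(weights, start, N)
        local sequence, current = { start }, start
        for step = 2, N do
            local total = 0
            for _, w in pairs(weights[current]) do total = total + w end
            local pick, chosen = math.random() * total, nil
            for node, w in pairs(weights[current]) do
                pick = pick - w
                if pick <= 0 then chosen = node; break end
            end
            sequence[#sequence + 1] = chosen
            current = chosen
        end
        return sequence
    end

The generator function just fills weights with 1s for every allowed transition, and the weighting update function adjusts those numbers as the votes come in.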
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. A simplex crawler, for example, is very efficient, but you can easily melt your brain trying to follow and debug its convergence steps!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more aggressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in fewer tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.
