How to deal with unfeasible individuals in Genetic Algorithm? - genetic-algorithm

I'm trying to optimize a thermal power plant thermoeconomically using Genetic Algorithms. Creating the population gives me a lot of unfeasible individuals (e.g. their evaluation raises a ValueError, TypeError, etc.). I tried using penalty functions, but the GA gets stuck in the first populations at the fitness of a feasible individual and doesn't evolve. Is there any other way to deal with this?
I would be grateful if anyone could help.
Thanks in advance.

Do not allow such individuals to become part of the population. It will slow down your convergence, but you will guarantee that the solutions found are valid.
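A minimal sketch of this idea, assuming an individual counts as unfeasible when its evaluation raises ValueError or TypeError. The names evaluate_plant, random_individual and the feasibility rule are hypothetical stand-ins, not the asker's model:

```python
import random

def evaluate_plant(individual):
    """Hypothetical stand-in for the real model; raises on unfeasible designs."""
    pressure, temperature = individual
    if temperature <= pressure:              # made-up feasibility rule
        raise ValueError("unfeasible design")
    return temperature - pressure            # made-up score

def random_individual(rng):
    return (rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0))

def is_feasible(individual):
    try:
        evaluate_plant(individual)
        return True
    except (ValueError, TypeError):
        return False

def feasible_population(size, rng, max_tries=10_000):
    """Keep sampling until `size` feasible individuals are found (rejection sampling)."""
    population = []
    for _ in range(max_tries):
        if len(population) == size:
            return population
        candidate = random_individual(rng)
        if is_feasible(candidate):
            population.append(candidate)
    raise RuntimeError("could not sample enough feasible individuals")

population = feasible_population(20, random.Random(0))
```

The same is_feasible check can be applied to offspring after crossover and mutation, at the cost of extra evaluations.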

You may want to look into Diversity Control.
In theory, invalid individuals may contain advantageous/valid pieces of code, and discarding them just because they have a bug is wasteful. In diversity control, your population is grouped into different species based on similarity metric (for tree structures it's usually edit distance), then the fitness of each individual is "shared" with other members of the group. In such a case fitness = performance/group_size. This is usually done to prevent premature convergence and to widen the exploration.
By combining your penalty function with diversity control, if the group of valid individuals becomes too numerous, fitness within that group will go down, and the groups that throw errors yet are less numerous will become more competitive, carrying the potentially valuable material forward.
Finally, something like rank-based selection should make the search insensitive to outliers, so when your top dog is 200% better than the others, it won't be selected all the time.
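The three ingredients above (penalty, fitness sharing, rank-based selection) could be combined roughly as follows. This is a sketch under assumptions: the group of an individual is taken to be everyone within a distance `radius` of it (a simple neighbourhood count rather than an explicit species partition), and PENALTY, toy_evaluate and the distance function are made up for illustration:

```python
import random

PENALTY = 0.01  # assumed small positive performance for individuals that raise errors

def raw_performance(individual, evaluate):
    try:
        return evaluate(individual)
    except (ValueError, TypeError):
        return PENALTY

def shared_fitness(population, evaluate, distance, radius):
    """fitness = performance / group_size, the group being everyone within `radius`."""
    result = []
    for ind in population:
        group_size = sum(1 for other in population if distance(ind, other) <= radius)
        result.append(raw_performance(ind, evaluate) / group_size)
    return result

def rank_select(population, fitnesses, rng):
    """Selection weight depends only on rank, not on how large the fitness gap is."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        weights[idx] = rank
    return rng.choices(population, weights=weights, k=1)[0]

def toy_evaluate(x):
    if x < 0:
        raise ValueError("unfeasible")
    return 1.0 + x

population = [1.0, 1.1, 1.2, 5.0, -3.0]
fitnesses = shared_fitness(population, toy_evaluate, lambda a, b: abs(a - b), radius=0.5)
parent = rank_select(population, fitnesses, random.Random(0))
```

In this toy run the three crowded individuals around 1.0 each have their performance divided by 3, while the isolated ones keep theirs, and the failing individual still gets a non-zero chance of being selected.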

Related

XGBoost/LightGBM over-fitting despite no indication in cross-validation test scores?

We aim to identify predictors that may influence the risk of a relatively rare outcome.
We are using a semi-large clinical dataset, with data on nearly 200,000 patients.
The outcome of interest is binary (i.e. yes/no), and quite rare (~ 5% of the patients).
We have a large set of nearly 1,200 mostly dichotomized possible predictors.
Our objective is not to create a prediction model, but rather to use the boosted trees algorithm as a tool for variable selection and for examining high-order interactions (i.e. to identify which variables, or combinations of variables, that may have some influence on the outcome), so we can target these predictors more specifically in subsequent studies. Given the paucity of etiological information on the outcome, it is somewhat possible that none of the possible predictors we are considering have any influence on the risk of developing the condition, so if we were aiming to develop a prediction model it would have likely been a rather bad one. For this work, we use the R implementation of XGBoost/lightgbm.
We have been having difficulties tuning the models. Specifically, when running cross-validation to choose the optimal number of iterations (nrounds), the CV test score continues to improve even at very high values (for example, see figure below for nrounds=600,000 from xgboost). This is observed even when increasing the learning rate (eta), or when adding some regularization parameters (e.g. max_delta_step, lambda, alpha, gamma, even at high values for these).
As expected, the CV test score is always lower than the train score, but continues to improve without ever showing a clear sign of over fitting. This is true regardless of the evaluation metric used (the example below is for logloss, but the same is observed for auc/aucpr/error rate, etc.). Relatedly, the same phenomenon is also observed when using a grid search to find the optimal value of tree depth (max_depth). CV test scores continue to improve regardless of the number of iterations, even at depth values exceeding 100, without showing any sign of over fitting.
Note that owing to the rare outcome, we use a stratified CV approach. Moreover, the same is observed when a train/test split is used instead of CV.
Are there situations in which over fitting happens despite continuous improvements in the CV-test (or test split) scores? If so, why is that and how would one choose the optimal values for the hyper parameters?
Relatedly, again, the idea is not to create a prediction model (since it would be a rather bad one, given that we don’t know much about the outcome), but to look for a signal in the data that may help identify a set of predictors for further exploration. If boosted trees are not the optimal method for this, are there others that come to mind? Again, part of the reason we chose to use boosted trees was to enable the identification of higher (i.e. more than 2) order interactions, which cannot be easily assessed using more conventional methods (including lasso/elastic net, etc.).
Welcome to Stack Overflow!
In the absence of some code and representative data it is not easy to make other than general suggestions.
Your descriptive statistics step may give some pointers to a starting model.
What does existing theory (if it exists!) suggest about the cause of the medical condition?
Is there a male/female difference or old/young age difference that could help get your foot in the door?
Your medical data has similarities to the fraud detection problem where one is trying to predict rare events usually much rarer than your cases.
It may pay you to check out the use of xgboost/lightgbm in the fraud detection literature.

Combining multiple genetic operators

Please correct me if I'm wrong, but it is my understanding that crossovers tend to lead towards local optima, while mutation adds a random-walk component to the search and thus tends to help in escaping local optima. I got this insight from reading the following: Introduction to Genetic Algorithms and Wikipedia's article on Genetic Operators.
My question is, what is the best or most ideal way to pick which individuals go through crossover and which go through mutation? Is there a rule of thumb for this? What are the implications?
Thanks in advance. This is a pretty specific question that is a bit hard to Google with (for me at least).
The selection of individuals to participate in crossover operation must consider the fitness, that is "better individuals are more likely to have more child programs than inferior individuals.":
http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/23Selection.html#7_3
The most common way to perform this is using Tournament Selection (see wikipedia).
Selection of the individuals to mutate should not consider fitness; in fact, it should be random. And the number of elements mutated per generation (mutation rate) should be very low, around 1% (or it may fall into random search):
http://cswww.essex.ac.uk/staff/rpoli/gp-field-guide/24RecombinationandMutation.html#7_4
In my experience, tweaking the tournament parameters just a bit could lead to substantial changes in the final results (for better or for worse), so it is usually a good idea to play with these parameters until you find a "sweet spot".
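A minimal sketch of the two rules described in this answer, assuming bit-string individuals with fitness equal to the number of ones. The tournament size k=3, the helper flip_one_bit and the population size are illustrative choices, not recommendations:

```python
import random

def tournament_select(population, fitness, k, rng):
    """Fitness-aware: draw k random contenders and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)

def mutate_population(population, mutate, rate, rng):
    """Fitness-blind: each individual is mutated with probability `rate`."""
    return [mutate(ind, rng) if rng.random() < rate else ind for ind in population]

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

rng = random.Random(42)
population = [tuple(rng.randint(0, 1) for _ in range(16)) for _ in range(50)]
parents = [tournament_select(population, sum, k=3, rng=rng) for _ in range(50)]
offspring = mutate_population(parents, flip_one_bit, rate=0.01, rng=rng)
```

Raising k increases selection pressure (with k equal to the population size the best individual always wins), which is the kind of tournament parameter worth tuning.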

Genetic algorithm - new generations getting worse

I have implemented a simple Genetic Algorithm to generate a short story based on Aesop's fables.
Here are the parameters I'm using:
Mutation: single-word swap mutation, tested with a rate of 0.01.
Crossover: swap the story sentences at a given point, rate 0.7.
Selection: roulette wheel selection - https://stackoverflow.com/a/5315710/536474
Fitness function: 3 different functions. The highest score of each is 1.0, so the highest total fitness score is 3.0.
Population size: since I'm using 86 Aesop fables, I tested with a population size of 50.
Initial population: the sentence order of all 86 fables is shuffled in order to make complete nonsense. My goal is to generate something meaningful (at least at a certain level) from these fables that have lost their structure.
Stop condition: 3000 generations.
And the results are below:
However, this still did not produce a favorable result. I was expecting a plot that goes up over the generations. Any ideas why my GA is producing worse results?
Update: As all of you suggested, I've employed elitism by copying 10% of the current generation to the next generation. The result still remains the same:
Probably I should use tournament selection.
All of the above responses are great and I'd look into them. I'll add my thoughts.
Mutation
Your mutation rate seems fine although with Genetic Algorithms mutation rate can cause a lot of issues if it's not right. I'd make sure you test a lot of other values to be sure.
With mutation I'd maybe use two types of mutation: one that replaces words with others from your dictionary, and one that swaps two words within a sentence. This would encourage both diversifying the population as a whole and shuffling words.
Crossover
I don't know exactly how you've implemented this but one-point crossover doesn't seem like it'll be that effective in this situation. I'd try to implement an n-point crossover, which will do a much better job of shuffling your sentences. Again, I'm not sure how it's implemented but just swapping may not be the best solution. For example, if a word is at the first point, is there ever any way for it to move to another position, or will it always be the first word if it's chosen by selection?
If word order is important for your chosen problem simple crossover may not be ideal.
Selection
Again, this seems fine but I'd make sure you test other options. In the past I've found rank based roulette selection to be a lot more successful.
Fitness
This is always the most important thing to consider in any genetic algorithm and with the complexity of problem you have I'd make doubly sure it works. Have you tested that it works with 'known' problems?
Population Size
Your value seems small but I have seen genetic algorithms work successfully with small populations. Again though, I'd experiment with much larger populations to see if your results are any better.
The most popular suggestion so far is to implement elitism and I'd definitely recommend it. It doesn't have to be much, even just the best couple of chromosomes every generation (although as with everything else I'd try different values).
Another sometimes useful operator to implement is culling. Destroy a portion of your weakest chromosomes, or ones that are similar to others (or both), and replace them with new chromosomes. This should help to stop your population going 'stale', which, from your graph, looks like it might be happening. Mutation only does so much to diversify the population.
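A rough sketch of culling, under two simplifying assumptions: "similar" is reduced to "identical", and the replacement chromosomes are purely random. The function names and the 20% fraction are illustrative only:

```python
import random

def cull(population, fitness, new_individual, rng, fraction=0.2):
    """Drop the weakest `fraction` and any exact duplicates, then refill with new individuals."""
    ranked = sorted(population, key=fitness, reverse=True)
    keep = ranked[:len(ranked) - int(len(ranked) * fraction)]
    unique = []
    for ind in keep:
        if ind not in unique:        # "similar" simplified to "identical" here
            unique.append(ind)
    fresh = [new_individual(rng) for _ in range(len(population) - len(unique))]
    return unique + fresh

def random_bits(rng, length=3):
    return tuple(rng.randint(0, 1) for _ in range(length))

population = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (1, 0, 0), (0, 1, 1)]
culled = cull(population, sum, random_bits, random.Random(7))
```

For real chromosomes the identity check would be replaced by a distance threshold suited to the representation.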
You may be losing the best combinations; you should keep the best of each generation without crossing (elite). Also, your function seems to be quite stable; try other types of mutations, that should improve it.
Set aside 5% to 10% of your population as elite, so that you don't lose the best you have.
Make sure your selection process is well set up; if bad candidates are passing through very often, it'll ruin your evolution.
You might also be stuck in a local optimum; you might need to introduce other stuff into your genome, otherwise you won't move far.
Moving sentences and words around will probably not get you very far; introducing new sentences or words might be interesting.
If you think of a story as a point x,y and your evaluation function as f(x,y), and you're trying to find the max of f(x,y), but your mutation and cross-over are limited to x -> y, y -> y, it makes sense that you won't move far. Granted, in your problem there are a lot more variables, but without introducing something new, I don't think you can avoid locality.
As @GettnDer said, elitism might help a lot.
What I would suggest is to use a different selection strategy. Roulette wheel selection has one big problem: imagine that the best individual's fitness is e.g. 90% of the sum of all fitnesses. Then the roulette wheel is not likely to select the other individuals (see e.g. here). The selection strategy I like the most is tournament selection. It is much more robust to big differences in fitness values, and the selection pressure can be controlled very easily.
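The 90% scenario can be worked through numerically. This sketch assumes a population of five with made-up fitness values, and a tournament whose k contenders are drawn with replacement:

```python
def roulette_probabilities(fitnesses):
    """Roulette wheel: each individual's selection chance is its share of total fitness."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def tournament_best_probability(n, k):
    """Chance that the single best of n individuals wins a size-k tournament
    whose contenders are drawn with replacement (it wins whenever it is drawn)."""
    return 1.0 - ((n - 1) / n) ** k

fitnesses = [90.0, 4.0, 3.0, 2.0, 1.0]   # made-up values: the best holds 90% of the total
p_roulette = roulette_probabilities(fitnesses)[0]
p_tournament = tournament_best_probability(len(fitnesses), k=2)
```

With these numbers the roulette wheel picks the best individual about 90% of the time, while a size-2 tournament picks it about 36% of the time, and that figure would stay the same however large the fitness gap became. Changing k is the knob for selection pressure.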
Novelty Search
I would also give Novelty Search a try. It's a relatively new approach in evolutionary computation, where you don't do the selection based on the actual fitness but rather based on novelty, which is supposed to be some metric of how different an individual is in its behaviour from the others (but you still compute the fitness to catch the good ones). Of special interest might be combinations of classical fitness-driven algorithms and novelty-driven ones, like this one by J.-B. Mouret.
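One common way to turn "how different is this individual's behaviour" into a number is the mean distance to its k nearest neighbours in behaviour space. The sketch below assumes that formulation, with behaviours reduced to single numbers and absolute difference as the distance, both chosen only for illustration:

```python
def novelty_scores(behaviours, distance, k=3):
    """Novelty of each individual = mean distance to its k nearest neighbours."""
    scores = []
    for i, b in enumerate(behaviours):
        dists = sorted(distance(b, other) for j, other in enumerate(behaviours) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores

behaviours = [0.0, 0.1, 0.2, 10.0]       # made-up behaviour descriptors
scores = novelty_scores(behaviours, lambda a, b: abs(a - b), k=2)
most_novel = behaviours[scores.index(max(scores))]
```

Selection would then use these scores in place of (or alongside) fitness, while fitness is still recorded to catch the good individuals.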
When working with genetic algorithms, it is good practice to structure your chromosome so that it reflects the actual knowledge about the process under optimization.
In your case, since you intend to generate stories, which are made of sentences, it could improve your results if you transformed your chromosomes into structured phrases, like <adjectives>* <subject> <verb> <object>* <adverbs>* (huge simplification here).
Each word could then be assigned a class. For instance, Fox=subject, looks=verb, grapes=object, and then your crossover operator would exchange elements from the same category between chromosomes. Besides, your mutation operator could only insert new elements of a proper category (for instance, an adjective before the subject) or replace a word with a random word in the same category.
This way you would minimize the number of nonsensical chromosomes (like Fox beautiful grape day sky) and improve the discourse generation power for your GA.
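A small sketch of category-aware operators, assuming an even simpler fixed template than the one above (one word per slot, no repetition) and a made-up lexicon:

```python
import random

TEMPLATE = ("adjective", "subject", "verb", "object")   # assumed fixed sentence shape
LEXICON = {                                              # made-up word classes
    "adjective": ["hungry", "sly", "proud"],
    "subject": ["fox", "crow", "lion"],
    "verb": ["sees", "wants", "takes"],
    "object": ["grapes", "cheese", "water"],
}

def random_sentence(rng):
    return [rng.choice(LEXICON[cat]) for cat in TEMPLATE]

def crossover(a, b, rng):
    """Each slot is taken from one of the parents, so words only move within their category."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(sentence, rng):
    """Replace one word with a random word of the same category."""
    i = rng.randrange(len(TEMPLATE))
    child = list(sentence)
    child[i] = rng.choice(LEXICON[TEMPLATE[i]])
    return child

def is_well_formed(sentence):
    return all(word in LEXICON[cat] for word, cat in zip(sentence, TEMPLATE))

rng = random.Random(3)
child = mutate(crossover(random_sentence(rng), random_sentence(rng), rng), rng)
```

Because both operators respect the template, every child is grammatical by construction, so the fitness function no longer has to weed out word salad.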
Besides, I agree with all previous comments: if you are using elitism and the best performance decreases, then you are implementing it wrong (notice that in a pathological situation it may remain constant for a long period of time).
I hope it helps.

Elitism in GA: Should I let the elites be selected as parents?

I am a little confused by the elitism concept in Genetic Algorithm (and other evolutionary algorithms). When I reserve and then copy 1 (or more) elite individuals to the next generation,
Should I consider the elite solution(s) in the parent selection of the current generation (making a new population)?
Or, should I use others (putting the elites aside) for making a new population and just copy the elites directly to the next generation?
If the latter, what is the use of elitism? Is it just for not losing the best solution? Because in this scheme, it won't help the convergence at all.
For example, here, under the crossover/mutation part, it is stated that the elites do not participate.
(Of course, the same question can be asked about the survivor selection part.)
Elitism only means that the most fit handful of individuals are guaranteed a place in the next generation - generally without undergoing mutation. They should still be able to be selected as parents, in addition to being brought forward themselves.
That article does take a slightly odd approach to elitism. It suggests duplicating the most fit individual - that individual gets two reserved slots in the next generation. One of these slots is mutated, the other is not. That means that, in the next generation, at least one of those slots will reenter the general population as a parent, and possibly two if both are overtaken.
It does seem a viable approach. Either way - whether by selecting elites as parents while also perpetuating them, or by copying the elites and then mutating one - the elites should still be closely attached to the population at large so that they can share their beneficial genes around.
@Peladao's answer and comment are also absolutely spot on - especially on the need to maintain diversity and avoid premature convergence, and the elites should only represent a small portion of the population.
I see no reason why one would not use the elites as parents, besides perhaps a small loss in diversity. (The number of elites should therefore be small compared to the population size).
Since the elites are the best individuals, they are valuable candidates to create new individuals using crossover, as long as the elites themselves are also copied (unchanged) into the new population.
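A minimal sketch of this scheme: the elites are copied unchanged and the parents are still drawn from the whole population, elites included. It assumes bit-string individuals, fitness equal to the number of ones, and size-2 tournaments for parent selection, all chosen for illustration:

```python
import random

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

def next_generation(population, fitness, rng, n_elite=2):
    ranked = sorted(population, key=fitness, reverse=True)
    elites = ranked[:n_elite]                      # copied unchanged
    children = []
    while len(children) < len(population) - n_elite:
        # Parents come from the whole population, so elites can be picked too.
        mother = max(rng.sample(population, 2), key=fitness)
        father = max(rng.sample(population, 2), key=fitness)
        children.append(flip_one_bit(one_point(mother, father, rng), rng))
    return elites + children

rng = random.Random(1)
population = [tuple(rng.randint(0, 1) for _ in range(20)) for _ in range(30)]
best_per_generation = []
for _ in range(40):
    best_per_generation.append(max(map(sum, population)))
    population = next_generation(population, sum, rng)
```

Since the elites survive untouched, the best fitness recorded in best_per_generation can never decrease from one generation to the next, which is a handy sanity check for any elitism implementation.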
Keeping sufficient diversity and avoiding premature convergence is always important, also when elites are not used as parents.
Different methodologies exist for implementing elitism, as the other valid answers also point out.
Generally, for elitism, just copy N individuals into the new generation without applying any kind of change. However, these individuals can be selected by fitness ranking (true elitism), guaranteeing that the best are really "saved", or they can be chosen via proportional selection (as pointed out in the book Machine Learning by Mitchell T.). The latter is the same as that used in roulette selection, but note that in this case the individuals are not used for generating new offspring, but are directly copied into the new population (survivors!).
When the selection for elitism is proportional, we obtain a good compromise between a lack of diversity and a premature over-fitting situation.
Applying true elitism while avoiding the use of the "elite" as parents will be counter-productive, especially considering the validity of the crossover operation.
In a nutshell, the main points about using elitism are:
The number of elites in the population should not exceed, say, 10% of the total population, to maintain diversity.
Of these, say, 5% may go directly into the next generation, and the remainder should undergo crossover and mutation with the rest of the non-elite population.

Initial Genetic Programming Parameters

I did a little GP (note: very little) work in college and have been playing around with it recently. My question is in regard to the initial run settings (population size, number of generations, min/max depth of trees, min/max depth of initial trees, percentages to use for different reproduction operations, etc.). What is the normal practice for setting these parameters? What papers/sites do people use as a good guide?
You'll find that this depends very much on your problem domain - in particular the nature of the fitness function, your implementation DSL etc.
Some personal experience:
Large population sizes seem to work better when you have a noisy fitness function, I think this is because the growth of sub-groups in the population over successive generations acts to give more sampling of the fitness function. I typically use 100 for less noisy/deterministic functions, 1000+ for noisy.
For number of generations it is best to measure improvements in the fitness function and stop when it meets your target criteria. I normally run a few hundred generations and see what kind of answers are coming out, if it is showing no improvement then you probably have an issue elsewhere.
Tree depth requirements are really dependent on your DSL. I sometimes try to do an implementation without explicit limits but penalise or eliminate programs that run too long (which is probably what you really care about....). I've also found total node counts of ~1000 to be quite useful hard limits.
Percentages for different mutation / recombination operators don't seem to matter all that much. As long as you have a comprehensive set of mutations, any reasonably balanced distribution will usually work. I think the reason for this is that you are basically doing a search for favourable improvements so the main objective is just to make sure the trial improvements are reasonably well distributed across all the possibilities.
Why don't you try using a genetic algorithm to optimise these parameters for you? :)
Any problem in computer science can be solved with another layer of indirection (except for too many layers of indirection.)
- David J. Wheeler
When I started looking into Genetic Algorithms I had the same question.
I wanted to collect data by varying parameters on a very simple problem, and link given operators and parameter values (such as mutation rates, etc.) to given results as a function of population size etc.
Once I started getting into GA a bit more I then realized that given the enormous number of variables this is a huge task, and generalization is extremely difficult.
Speaking from my (limited) experience: suppose you decide to simplify the problem by fixing the way crossover and selection are implemented, and just play with population size and mutation rate (implemented in a given way), trying to come up with general results. You'll soon realize that too many variables are still in play, because at the end of the day the number of generations after which you will statistically get a decent result (however you want to define decent) still obviously depends primarily on the problem you're solving, and consequently on the genome size (representing the same problem in different ways will obviously lead to different results in terms of the effect of given GA parameters!).
It is certainly possible to draft a set of guidelines - as the (rare but good) literature proves - but you will be able to generalize the results effectively in statistical terms only when the problem at hand can be encoded in exactly the same way and the fitness is evaluated in a somehow equivalent way (which more often than not means you're dealing with a very similar problem).
Take a look at Koza's voluminous tomes on these matters.
There are very different schools of thought even within the GP community -
Some regard populations in the (low) thousands as sufficient, whereas Koza and others often don't deem it worthy to start a GP run with fewer than a million individuals in the GP population ;-)
As mentioned before it depends on your personal taste and experiences, resources and probably the GP system used!
Cheers,
Jan

Resources