What are the differences between genetic algorithms and evolution strategies? - genetic-algorithm

I've read a couple of introductory sections of books as well as a few papers on both topics, and it looks to me that these two methods are pretty much exactly the same. That said, I haven't had the time to actually deeply research the topics yet, so I might be wrong.
What are the distinctions between genetic algorithms and evolution strategies? What makes them different, and where are they similar?

In evolution strategies, the individuals are coded as vectors of real numbers. On reproduction, parents are selected randomly, and the fittest offspring are selected and inserted into the next generation. ES individuals are self-adapting: the step size or "mutation strength" is encoded in the individual, so good parameters get carried into the next generation by selecting good individuals.
In genetic algorithms, the individuals are coded as integers. Selection is done by choosing parents with probability proportional to their fitness, so individuals must be evaluated before the first selection is done. Genetic operators work on the bit level (e.g. cutting a bit string into multiple pieces and interchanging them with the pieces of the other parent, or flipping single bits).
That's the theory. In practice, it is sometimes hard to distinguish between the two evolutionary algorithms, and you may need to create hybrid algorithms (e.g. integer (bit-string) individuals that encode the parameters of the genetic operators).

Just stumbled on this thread when researching Evolution Strategies (ES).
As Paul noted before, the encoding is not really the difference here, as it is an implementation detail of specific algorithms, although real-number encoding does seem more common in ES.
To answer the question, we first need to take a small step back and look at the internals of an ES algorithm.
In ES there is a concept of endogenous and exogenous parameters of the evolution. Endogenous parameters are associated with individuals and are therefore evolved together with them; exogenous parameters are provided from "outside" (e.g. set constant by the developer, or there may be a function/policy which sets their value depending on the iteration number).
An individual k therefore consists of two parts:
y(k) - a set of object parameters (e.g. a vector of real/int values) which denote the individual's genotype
s(k) - a set of strategy parameters (e.g. again a vector of real/int values) which can, for example, control statistical properties of the mutation
Both vectors are selected, mutated, and recombined together.
The main difference between GA and ES is that in classic GA there is no such distinction between types of algorithm parameters: all the parameters are set from "outside", so in ES terms they are all exogenous.
There are also other, minor differences, e.g. in ES the selection policy is usually fixed, whereas in GA there are multiple different selection schemes which can be interchanged.
You can find a more detailed explanation here (see Chapter 3): Evolution strategies. A comprehensive introduction
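To make the endogenous/exogenous distinction concrete, here is a minimal, illustrative sketch of a (1, lambda)-ES with a self-adaptive mutation strength in Python; the sphere objective and all constants are just assumptions for the example:

import numpy as np

def sphere(y):
    return np.sum(y ** 2)  # toy objective to minimize

rng = np.random.default_rng(0)
n, lam = 10, 20
tau = 1.0 / np.sqrt(2 * n)           # exogenous parameter: learning rate for the step size

y = rng.normal(size=n)               # y(k): object parameters
sigma = 1.0                          # s(k): endogenous strategy parameter (mutation strength)

for generation in range(200):
    offspring = []
    for _ in range(lam):
        s_child = sigma * np.exp(tau * rng.normal())   # mutate the strategy parameter first
        y_child = y + s_child * rng.normal(size=n)     # then use it to mutate the object parameters
        offspring.append((sphere(y_child), y_child, s_child))
    _, y, sigma = min(offspring, key=lambda t: t[0])   # comma selection: best of the lambda children survives

print(sphere(y))   # good sigma values survive only because they produced good individuals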

In most newer textbooks on GAs, real-valued coding is introduced as an alternative to integer coding, i.e. individuals can be coded as vectors of real numbers. This is called continuous-parameter GA (see e.g. Haupt & Haupt, "Practical Genetic Algorithms", J. Wiley & Sons, 1998). So this is practically identical to ES real-number coding.
With respect to parent selection, there are many different strategies published for GAs. I don't know them all, but I assume selection among all individuals (not only the best) has been used for some applications.

The main difference seems to be that a genetic algorithm represents a solution using a sequence of integers, whereas an evolution strategy uses a sequence of real numbers -- reference: http://en.wikipedia.org/wiki/Evolutionary_algorithm#

As the Wikipedia source (http://en.wikipedia.org/wiki/Genetic_algorithm) and Vaughn Cato said, the difference between the two techniques lies in the implementation: ES use real numbers and GA use integers.
However, in practice I think you could use integers or real numbers in the formulation of your problem and in your program; it depends on you. For instance, for protein folding you can say the set of dihedral angles forms a vector. This is a vector of real numbers, but the entries are labeled by integers, so I think you can formulate your problem and write your program based on integer arithmetic. It is just an idea.

Related

Method to choose overall winner across multiple categories [migrated]

I have four numeric variables. All of them are measures of soil quality: the higher the variable, the higher the quality. The range for each of them is different:
Var1 from 1 to 10
Var2 from 1000 to 2000
Var3 from 150 to 300
Var4 from 0 to 5
I need to combine the four variables into a single soil quality score which will successfully rank-order the samples.
My idea is very simple: standardize all four variables, sum them up, and whatever you get is the score, which should rank-order. Do you see any problem with applying this approach? Is there any other (better) approach that you would recommend?
Thanks
Edit:
Thanks guys. A lot of discussion went into "domain expertise"... agriculture stuff... whereas I expected more stats talk. In terms of technique that I will be using, it will probably be simple z-score summation + logistic regression as an experiment. Because the vast majority of samples (90%) has poor quality, I'm going to combine 3 quality categories into one and basically have a binary problem (some quality vs. no quality). I kill two birds with one stone: I increase my sample in terms of event rate, and I make use of experts by getting them to classify my samples. Expert-classified samples will then be used to fit a log-reg model to maximize the level of concordance/discordance with the experts... How does that sound to you?
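For concreteness, here is a minimal sketch of the z-score summation idea with made-up data (purely illustrative, and it assumes higher really does mean better for every variable):

import numpy as np

X = np.column_stack([np.random.uniform(1, 10, 100),        # Var1
                     np.random.uniform(1000, 2000, 100),   # Var2
                     np.random.uniform(150, 300, 100),     # Var3
                     np.random.uniform(0, 5, 100)])        # Var4

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each variable
quality_score = Z.sum(axis=1)              # combined score used for rank ordering
ranking = np.argsort(-quality_score)       # sample indices, best first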
The proposed approach may give a reasonable result, but only by accident. At this distance--that is, taking the question at face value, with the meanings of the variables disguised--some problems are apparent:
It is not even evident that each variable is positively related to "quality." For example, what if a 10 for 'Var1' means the "quality" is worse than the quality when Var1 is 1? Then adding it to the sum is about as wrong a thing as one can do; it needs to be subtracted.
Standardization implies that "quality" depends on the data set itself. Thus the definition will change with different data sets or with additions and deletions to these data. This can make the "quality" into an arbitrary, transient, non-objective construct and preclude comparisons between datasets.
There is no definition of "quality". What is it supposed to mean? Ability to block migration of contaminated water? Ability to support organic processes? Ability to promote certain chemical reactions? Soils good for one of these purposes may be especially poor for others.
The problem as stated has no purpose: why does "quality" need to be ranked? What will the ranking be used for--input to more analysis, selecting the "best" soil, deciding a scientific hypothesis, developing a theory, promoting a product?
The consequences of the ranking are not apparent. If the ranking is incorrect or inferior, what will happen? Will the world be hungrier, the environment more contaminated, scientists more misled, gardeners more disappointed?
Why should a linear combination of variables be appropriate? Why shouldn't they be multiplied or exponentiated or combined as a posynomial or something even more esoteric?
Raw soil quality measures are commonly re-expressed. For example, log permeability is usually more useful than the permeability itself and log hydrogen ion activity (pH) is much more useful than the activity. What are the appropriate re-expressions of the variables for determining "quality"?
One would hope that soils science would answer most of these questions and indicate what the appropriate combination of the variables might be for any objective sense of "quality." If not, then you face a multi-attribute valuation problem. The Wikipedia article lists dozens of methods for addressing this. IMHO, most of them are inappropriate for addressing a scientific question. One of the few with a solid theory and potential applicability to empirical matters is Keeney & Raiffa's multiple attribute valuation theory (MAVT). It requires you to be able to determine, for any two specific combinations of the variables, which of the two should rank higher. A structured sequence of such comparisons reveals (a) appropriate ways to re-express the values; (b) whether or not a linear combination of the re-expressed values will produce the correct ranking; and (c) if a linear combination is possible, it will let you compute the coefficients. In short, MAVT provides algorithms for solving your problem provided you already know how to compare specific cases.
Has anyone looked at Russell G. Congalton, "Review of Assessing the Accuracy of Classifications of Remotely Sensed Data", 1990? It describes a technique known as an error matrix for varying matrices, and also a term he uses called "normalizing data", whereby one takes all the different vectors and "normalizes" or rescales them so that they range from 0 to 1. You basically change all vectors to equal ranges from 0 to 1.
One other thing you did not discuss is the scale of the measurements. Var1 and Var4 look like rank-order scales and the others seem not to be, so standardization may be skewing the score. You may be better off transforming all of the variables into ranks and determining a weighting for each variable, since it is highly unlikely that they have the same weight. Equal weighting is more of a "know nothing" default. You might want to do some correlation or regression analysis to come up with some a priori weights.
I had a similar problem recently and thought I would add my approach to the nice answers. To find a simple way of determining which variable weighting leads to the best ranking, you could transform your problem into a grid-search approach:
Basically use a combined score for the ranking which is composed as such:
Final_score = Var1 * A + Var2 * B + Var3 * C ....
Then you can compute the final score with different values for A, B, C (sklearn's grid search could be used) and compare the resulting ranking to an expected ranking (some ground truth is needed to determine the goodness of your ranking). The best parameters give you the weights of the individual variables.
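A rough sketch of that weight search, here with a plain grid via itertools rather than sklearn's GridSearchCV, and with stand-in data and a stand-in ground-truth score (everything is illustrative):

import itertools
import numpy as np
from scipy.stats import spearmanr

X = np.random.rand(50, 3)            # columns: Var1, Var2, Var3 (already standardized)
true_quality = np.random.rand(50)    # stand-in for the expected (ground-truth) ranking score

best = None
for A, B, C in itertools.product(np.linspace(0, 1, 11), repeat=3):
    final_score = X @ np.array([A, B, C])          # Final_score = Var1*A + Var2*B + Var3*C
    rho, _ = spearmanr(final_score, true_quality)  # rank agreement with the ground truth
    if best is None or rho > best[0]:
        best = (rho, (A, B, C))

print(best)   # best rank correlation and the corresponding weights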
Following up on Ralph Winters' answer, you might use PCA (principal component analysis) on the matrix of suitably standardized scores. This will give you a "natural" weight vector that you can use to combine future scores.
Do this also after all scores have been transformed into ranks. If the results are very similar, you have good reasons to continue with either method. If there are discrepancies, this will lead to interesting questions and a better understanding.
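A small sketch of that PCA idea with stand-in data and scikit-learn; the sign flip just orients the score so that higher means better:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 4)                             # Var1..Var4 for 200 soil samples (stand-in data)
Z = StandardScaler().fit_transform(X)                  # standardize each variable
weights = PCA(n_components=1).fit(Z).components_[0]    # "natural" weight vector (first component)
weights *= np.sign(weights.sum())                      # orient so that a higher score means higher quality
quality_score = Z @ weights                            # combined score, reusable for future standardized scores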

Human-interpretable supervised machine learning algorithm

I'm looking for a supervised machine learning algorithm that would produce transparent rules or definitions that can be easily interpreted by a human.
Most algorithms that I work with (SVMs, random forests, PLS-DA) are not very transparent. That is, you can hardly summarize the models in a table in a publication aimed at a non-computer scientist audience. What authors usually do is, for example, publish a list of variables that are important based on some criterion (for example, Gini index or mean decrease of accuracy in the case of RF), and sometimes improve this list by indicating how these variables differ between the classes in question.
What I am looking for is a relatively simple output of the style "if (any of the variables V1-V10 > median or any of the variables V11-V20 < 1st quartile) and any of the variables V21-V30 > 3rd quartile, then class A".
Is there any such thing around?
Just to constrain my question a bit: I am working with highly multidimensional data sets (tens of thousands to hundreds of thousands of often collinear variables). So, for example, regression trees would not be a good idea (I think).
You sound like you are describing decision trees. Why would regression trees not be a good choice? Maybe not optimal, but they work, and those are the most directly interpretable models. Anything that works on continuous values works on ordinal values.
There's a tension between wanting an accurate classifier, and wanting a simple and explainable model. You could build a random decision forest model, and constrain it in several ways to make it more interpretable:
Small max depth
High minimum information gain
Prune the tree
Only train on "understandable" features
Quantize/round decision thresholds
The model won't be as good, necessarily.
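As a minimal illustration of those constraints on a single tree (scikit-learn, synthetic data; the parameter values are arbitrary), export_text prints the model as plain if/else rules:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
tree = DecisionTreeClassifier(max_depth=3,                 # small max depth
                              min_impurity_decrease=0.01,  # high minimum information gain
                              random_state=0).fit(X, y)
print(export_text(tree))   # human-readable "if feature_3 <= 0.42 then ..." rules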
You can find interesting research on understanding AI methods done by Been Kim at Google Brain.

Methods for crossover in genetic algorithms

When reading about the crossover part of genetic algorithms, books and papers usually refer to methods of simply swapping out bits in the data of two selected candidates which are to reproduce.
I have yet to see actual code of a genetic algorithm implemented for real industry applications, but I find it hard to imagine that it's enough to operate on simple data types.
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations, as opposed to just swapping out some bits in single integers.
Even Wikipedia just lists these kinds of operations for crossover.
Am I missing something important or are these kinds of crossover methods really the only thing used?
There are several things used... although the need for parallelism and several generations (and sometimes a big population) leads to using techniques that perform well...
Another point to keep in mind is that "swapping out some bits", when modeled correctly, resembles a simple and rather accurate version of what happens naturally (recombination of genes, mutations)...
For a very simple and nicely written walkthrough see http://www.electricmonk.nl/log/2011/09/28/evolutionary-algorithm-evolving-hello-world/
For some more info see
http://www.codeproject.com/KB/recipes/btl_ga.aspx
http://www.codeproject.com/KB/recipes/genetics_dot_net.aspx
http://www.codeproject.com/KB/recipes/GeneticandAntAlgorithms.aspx
http://www.c-sharpcorner.com/UploadFile/mgold/GeneticAlgorithm12032005044205AM/GeneticAlgorithm.aspx
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations, as opposed to just swapping out some bits in single integers.
You probably think complex mathematical operations are used because you think the Genetic Algorithm has to modify a complex object. That's usually not how a Genetic Algorithm works.
So what does happen? Well, usually, the programmer (or scientist) will identify various parameters in a configuration, and then map those parameters to integers/floats. This does limit the directions in which the algorithm can explore, but that's the only realistic method of getting any results.
Let's look at evolving an antenna. You could perform a complex simulation with a genetic algorithm rearranging copper molecules, but that would be very complex and take forever. Instead, you'd identify antenna "parameters". Most antennas are built out of certain lengths of wire, bent at certain places in order to maximize their coverage area. So you could identify a couple of parameters: number of starting wires, section lengths, angle of the bends. All of those are easily represented as integer numbers and are therefore easy for the genetic algorithm to manipulate. The resulting manipulations can be fed into an "antenna simulator" to see how well it receives signals.
In short, when you say:
I find it hard to imagine that it's enough to operate on simple data types.
you must realize that simple data types can be mapped to much more intricate structures. The Genetic Algorithm doesn't have to know anything about these intricate structures. All it needs to know is how it can manipulate the parameters that build up the intricate structures. That is, after all, the way DNA works.
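A toy sketch of that parameter-mapping idea in Python: the "antenna" is just a vector of integers (section lengths and bend angles), the GA only ever manipulates that vector, and the simulator call is a hypothetical placeholder:

import random

N_SECTIONS = 5

def random_individual():
    lengths = [random.randint(10, 200) for _ in range(N_SECTIONS)]      # section lengths in mm
    angles = [random.randint(0, 180) for _ in range(N_SECTIONS - 1)]    # bend angles in degrees
    return lengths + angles

def fitness(individual):
    # A real implementation would feed these parameters into an antenna simulator
    # and return e.g. the simulated gain; this dummy score just stands in for it.
    return -sum(abs(angle - 90) for angle in individual[N_SECTIONS:])

def crossover(a, b):
    point = random.randrange(1, len(a))          # single-point crossover on the parameter vector
    return a[:point] + b[point:]

def mutate(individual, rate=0.1):
    return [gene + random.randint(-5, 5) if random.random() < rate else gene
            for gene in individual]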
In genetic algorithms, bit swapping of some variety is usually used.
As you have said:
I always imagined that the various stages of genetic algorithms would be performed on complex objects involving complex mathematical operations
What I think you are looking for is genetic programming, where the chromosome describes a program; in that case you would be able to do more with the operators when applying crossover.
Also make sure you have understood the difference between your fitness function in genetic algorithms, and the operators within a chromosome in genetic programming.
Different applications require different encodings. The goal certainly is to find the most effective encoding, and often enough the simple encodings are the better suited. For example, a job shop scheduling problem might be represented as a list of permutations which represent the execution order of the jobs on the different machines (a so-called job sequence matrix). It can, however, also be represented as a list of priority rules that construct the schedule. A traveling salesman problem or a quadratic assignment problem is typically represented by a single permutation that denotes a tour in one case or an assignment in the other. Optimizing the parameters of a simulation model or finding the root of a complex mathematical function is typically represented by a vector of real values.
For all of those still rather simple types, crossover and mutation operators exist. For permutations these are e.g. OX, ERX, CX, PMX, UBX, OBX, and many more. If you can combine a number of simple representations to represent a solution to your complex problem, you might reuse these operations and apply them to each component individually.
For crossover to work effectively, a few properties should be fulfilled:
The crossover should conserve those parts that are similar in both parents
For those parts that are not similar, the crossover should not introduce an element that is not already part of one of the parents
The crossover of two solutions should, if possible, produce a feasible solution
You want to avoid so-called unwanted mutations in your crossovers. In that light, you also want to avoid having to repair a large part of your chromosomes after crossover, since that also introduces unwanted mutations.
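As an illustration of those properties, here is a small order-based crossover for permutation encodings (in the spirit of OX, simplified for the example): it conserves a slice of one parent, fills the rest in the order of the other parent, and therefore never introduces an element that is not in the parents and never needs repair.

import random

def order_crossover(p1, p2):
    n = len(p1)
    a, b = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[a:b] = p1[a:b]                          # conserve a slice from the first parent
    fill = [g for g in p2 if g not in child]      # remaining genes, in the order of the second parent
    for i in range(n):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child                                  # always a valid permutation, no repair needed

# Example: crossing two tours of an 8-city traveling salesman instance
print(order_crossover([0, 1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1, 0]))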
If you want to experiment with different operators and problems, we have a nice GUI-driven software: HeuristicLab.
Simple bit swapping is usually the way to go. The key thing to note is the encoding that is used in each candidate solution. Solutions should be encoded such that there is minimal or no error introduced into the new offspring. Any error would require the algorithm to provide a fix, which will lead to increased processing time.
As an example, I have developed a university timetable generator in C# that uses an integer encoding to represent the timeslots available on each day. This representation allows a very efficient single-point or multi-point crossover operator, which uses the LINQ Intersect function to combine parents.
Typical multipoint crossover with hill-climbing
public List<TimeTable> CrossOver(List<TimeTable> parents) // Multipoint crossover
{
    var baby1 = new TimeTable {Schedule = new List<string>(), Fitness = 0};
    var baby2 = new TimeTable {Schedule = new List<string>(), Fitness = 0};
    for (var gen = 0; gen < parents[0].Schedule.Count; gen++)
    {
        if (rnd.NextDouble() < (double) CrossOverProb)
        {
            baby2.Schedule.Add(parents[0].Schedule[gen]);
            baby1.Schedule.Add(parents[1].Schedule[gen]);
        }
        else
        {
            baby1.Schedule.Add(parents[0].Schedule[gen]);
            baby2.Schedule.Add(parents[1].Schedule[gen]);
        }
    }
    CalculateFitness(ref baby1);
    CalculateFitness(ref baby2);
    // allow hill-climbing
    parents.Add(baby1);
    parents.Add(baby2);
    return parents.OrderByDescending(i => i.Fitness).Take(2).ToList();
}

What are good algorithms for detecting abnormality?

Background
Here is the problem:
A black box outputs a new number each day.
Those numbers have been recorded for a period of time.
Detect when a new number from the black box falls outside the pattern of numbers established over the time period.
The numbers are integers, and the time period is a year.
Question
What algorithm will identify a pattern in the numbers?
The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.
Ideas
I have some ideas, but am uncertain as to the best approach, or what solutions already exist:
Machine learning algorithms?
Neural network?
Classify normal and abnormal numbers?
Statistical analysis?
Cluster your data.
If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion, BIC) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data from the past year, then given a new datapoint x you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has a low probability of being generated by every one of your clusters, it is very likely a true outlier.
If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.
Not only will it allow you to identify outliers, you will have the ability to put a p-value on a point being an outlier if you need this capability (or want it) at some point.
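A rough sketch of the same GMM + BIC procedure in Python with scikit-learn instead of MCLUST (the data, the number of candidate components, and the 1% density threshold are all illustrative assumptions):

import numpy as np
from sklearn.mixture import GaussianMixture

# a year of stand-in history with two "modes"
history = np.concatenate([np.random.normal(0, 1, 200),
                          np.random.normal(50, 5, 165)]).reshape(-1, 1)

# pick the number of clusters automatically via BIC
models = [GaussianMixture(n_components=k, random_state=0).fit(history) for k in range(1, 6)]
best = min(models, key=lambda m: m.bic(history))

# a new point is suspicious if its density is lower than almost all historical densities
new_point = np.array([[200.0]])
threshold = np.percentile(best.score_samples(history), 1)   # bottom 1% of historical log-densities
print(best.score_samples(new_point)[0] < threshold)         # True => flag as abnormal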
You could try trend-line prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you can calculate the standard deviation of the residuals along the line.
If the novel point lies within the trend line +- that standard deviation, it should not be regarded as an abnormality.
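A minimal sketch of that trend-line check with numpy (the data and the one-sigma band are illustrative; a 2-3 sigma band would give a looser rule):

import numpy as np

days = np.arange(365)
values = 2.0 * days + np.random.normal(0, 10, size=365)     # stand-in history with a linear trend

slope, intercept = np.polyfit(days, values, 1)              # fit the trend line
residual_std = np.std(values - (slope * days + intercept))  # spread around the line

new_day, new_value = 365, 900.0
predicted = slope * new_day + intercept
is_abnormal = abs(new_value - predicted) > residual_std     # outside the band => abnormal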
PCA is another technique that comes to mind when dealing with this type of data.
You could also look into unsupervised learning. This is a machine learning technique that can be used to detect differences in larger data sets.
Sounds like a fun problem! Good luck
There is little magic in all the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.
Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers changing direction abruptly" => compute u_{n+1} - u_n, and expect it to have constant sign, or to fall in some range. You may want to keep this flexible and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
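A toy sketch of such a derived quantity: monitor the first differences u_{n+1} - u_n and flag a new point whose step falls outside the range of steps seen so far (stand-in data; the rule itself is just an example):

import numpy as np

series = np.cumsum(np.random.randint(1, 5, size=365))   # stand-in: a series that always increases
diffs = np.diff(series)                                  # derived quantity u_{n+1} - u_n

def step_is_abnormal(new_value):
    step = new_value - series[-1]
    return not (diffs.min() <= step <= diffs.max())      # outside the historically observed range

print(step_is_abnormal(series[-1] + 2))      # a normal step
print(step_is_abnormal(series[-1] - 10))     # a drop in a series that only ever rose => abnormal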
Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should follow some distribution P(a, b) (uniform([a, b]) or Beta(a, b), possibly something more complex), you put prior laws on a, b, and you adjust them based on successive observations. The posterior likelihood of the information provided by the last point added should then give you some insight into whether it is normal or not. The relative entropy between the posterior and prior laws at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
I see little point in complex traditional machine learning machinery (multilayer perceptrons or SVMs, to cite only two) if you want to detect outliers. These methods work great when classifying data which is known to be reasonably clean.

Which data mining algorithm would you suggest for this particular scenario?

This is not a directly programming related question, but it's about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
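A small sketch of the average-age-per-name idea with pandas (the tiny data set and the bin edges are purely illustrative):

import pandas as pd

customers = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "birthyear": [1950, 1985, 1948, 1975],
})

avg_year = customers.groupby("name")["birthyear"].mean()        # one numeric value per name
customers["name_avg_year"] = customers["name"].map(avg_year)    # nominal name -> numeric feature
customers["age_group"] = pd.cut(customers["name_avg_year"],     # partition into coarse groups
                                bins=[1900, 1960, 1980, 2000],
                                labels=["old", "middle", "young"])
print(customers)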
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in sparse matrix form. Each row has 3 ones in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicted ages to see how effective the model is.
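A minimal sketch of that approach with SciPy's sparse machinery, mirroring the tiny four-row example above (the solver choice, lsqr, is just one reasonable option):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

A = csr_matrix(np.array([[1, 0, 0, 1, 0],     # Gertrude, Internet
                         [0, 1, 0, 1, 0],     # Jennifer, Internet
                         [1, 0, 0, 0, 1],     # Gertrude, No Internet
                         [0, 0, 1, 0, 1]]))   # Mary, No Internet
b = np.array([57, 23, 60, 35], dtype=float)

x = lsqr(A, b)[0]            # least-squares solution of A*x = b, one coefficient per column
predicted_ages = A @ x       # compare against b (e.g. with a correlation test) to judge the fit
print(predicted_ages)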
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a classification algorithm that accepts nominal attributes and a numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta-classifier to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.
The algorithm is implemented in R (library RWeka)
It can also be found in the open-source machine learning software Weka
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that you can easily interpret and in which you can identify the influence of each of these nominal variables and its different values.
You can also use regression with dummy variables in order to represent the nominal attributes; this is also a good solution.
But you can also use other algorithms, such as an SVM (SMO), with the same prior transformation of the nominal variables to binary dummy ones as in regression.
