Can someone explain to me what means "mutation step size" ?
I'm reading an article regarding genetic algorithms and it says:
"the mutation randomly changes the decision of
a node or mutate the value with a step-size of 0.25"
I know the role of mutation in GA life cycle , but I can't find a good explanation for what is step size of mutation.
thanks.
Essentially it's how far a mutation can be away from the last value.
"As far as real-valued search spaces are concerned, mutation is normally performed by adding a normally distributed random value to each vector component. The step size or mutation strength (i.e. the standard deviation of the normal distribution) is often governed by self-adaptation (see evolution window)."
That's complex talk for given a vector you are mutating (say X = [x1,x2,..,xN]) then you'll modify that vector's values by some random amount that won't be more than the mutation step size. So say we had a function called normal(v,stdDev) that generated random values with a normal distribution around some value with stdDev. Then we'd modify each value of that vector with the following psuedo code:
for x in X {
x = normal(x,mutationStepSize)
}
Related
I have been working on a literature study on Genetic Algorithms in preparation of a project. When researching mutation I encountered the terms "Uniform mutation" and "Non-Uniform mutation" quite often.
Wikipedia explains uniform and non-uniform mutation mutation as a "type":
Uniform Mutation: This operator replaces the value of the chosen gene with a uniform random value selected between the user-specified upper and lower bounds for that gene. This mutation operator can only be used for integer and float genes.
Non-Uniform Mutation: The probability that amount of mutation will go to 0 with the next generation is increased by using non-uniform mutation operator. It keeps the population from stagnating in the early stages of the evolution. It tunes solution in later stages of evolution. This mutation operator can only be used for integer and float genes.
A powerpoint presentation on the subject of genetic algorithms explains uniform mutation in the context of floating point mutations:
xi' is drawn randomly (uniform) from [Lower bound, Upper bound]. It is analogous to bit-flipping of binary strings or random resetting of integer strings.
The MathWorks documentation explains uniform mutation as:
Uniform mutation is a two-step process. First, the algorithm selects a fraction of the vector entries of an individual for mutation, where each entry has a probability Rate of being mutated. The default value of Rate is 0.01. In the second step, the algorithm replaces each selected entry by a random number selected uniformly from the range for that entry.
In line with MathWorks' explanation of uniform as "random", I found this source, which doesn't even name uniform or non-uniform mutation.
However, no information is given on what it actually is. I am unsure if it is an umbrella term for certain methods adhering to some properties, or if it is a method on its own, like Wikipedia says.
I can't find any real demonstration of the term as a method. But I can't find any definition of the term as an umbrella term either. Since one source cited it as being analogous to bit-flipping I am unsure.
What is meant, in a genetic algorithms context, with uniform and non-uniform mutation and what is an example of the use of such methods or terms?
uniform mutation - choose a certain percentage of genes, say 1%, at random and set them to random values, and do this at the same rate throughout the program.
non-uniform mutation - any other scheme, but typically you either lower the mutation rate as the population gets fitter (so mutate 0.1 % of genes after several thousand generations) or you make the mutations smaller as time progresses (so add or subtract one or two places instead of setting to random).
There is a very expensive computation I must make frequently.
The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.
I have 2 things going for me:
I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1
.1 .75 .05] will yield similar results.
Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.
To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.
I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.
So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?
Finding N points in M-space that are a best spread-out set is more-or-less equivalent to hypersphere packing (1,2) and in general answers are not known for M>10. While a fair amount of research has been done to develop faster methods for hypersphere packings or approximations, it is still regarded as a hard problem.
It probably would be better to apply a technique like principal component analysis or factor analysis to as large a set of histograms as you can conveniently generate. The results of either analysis will be a set of M numbers such that linear combinations of histogram data elements weighted by those numbers will predict some objective function. That function could be the “something that you can store pretty easily” numbers, or could be case numbers. Also consider developing and training a neural net or using other predictive modeling techniques to predict the objective function.
Building on #jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.
The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.
I second jwpat7's answer, but my very naive approach was to consider the count of items in each histogram bin as a y value, to consider the x values as just 0..1 in 20 steps, and then to obtain parameters a,b,c that describe x vs y as a cubic function.
To get a "covering" of the histograms I just iterated through "possible" values for each parameter.
e.g. to get 27 histograms to cover the "shape space" of my cubic histogram model I iterated the parameters through -1 .. 1, choosing 3 values linearly spaced.
Now, you could change the histogram model to be quartic if you think your data will often be represented that way, or whatever model you think is most descriptive, as well as generate however many histograms to cover. I used 27 because three partitions per parameter for three parameters is 3*3*3=27.
For a more comprehensive covering, like 100, you would have to more carefully choose your ranges for each parameter. 100**.3 isn't an integer, so the simple num_covers**(1/num_params) solution wouldn't work, but for 3 parameters 4*5*5 would.
Since the actual values of the parameters could vary greatly and still achieve the same shape it would probably be best to store ratios of them for comparison instead, e.g. for my 3 parmeters b/a and b/c.
Here is an 81 histogram "covering" using a quartic model, again with parameters chosen from linspace(-1,1,3):
edit: Since you said your histograms were described by arrays that were ~20 elements, I figured fitting parameters would be very fast.
edit2 on second thought I think using a constant in the model is pointless, all that matters is the shape.
this is my first question, and I hope it's not misdirected/in the wrong place.
Let's say I have a matrix of data that is fully populated except for one value. For example, Column 1 is Height, Column 2 is Weight, and Column 3 is Bench Press. So I surveyed 20 people and got their height, weight, and bench press weight. Now I have a 5'11 individual weighing 170 pounds, and would like to predict his/her bench press weight. You could look at this as the matrix having a missing value, or you could look at it as wanting to predict a dependent variable given a vector of independent variables. There are curve fitting approaches to this kind of problem, but I would like to know how to use the Singular Value Decomposition to answer this question.
I am aware of the Singular Value Decomposition as a means of predicting missing values, but virtually all the information I have found has been in relation to huge, highly sparse matrices, with respect to the Netflix Prize and related problems. I cannot figure out how to use SVD or a similar approach to predict a missing value from a small or medium sized, fully populated (except for one missing value) matrix.
A step-by-step algorithm for solving the example above using SVD would be very helpful to me. Thank you!
I was planning this as a comment but its too long by a fair bit so I've submitted it as an answer.
My reading of SVD suggests to me that it is not very applicable to your example. In particular it seems that you would need to somehow assign some difficulty ranking to the bench-press column of your matrix, or some ability ranking of the individuals. Perhaps both. Since the amount he can bench-press depends solely on his own height and weight I don't think SVD would provide any optimization over just calculating the statistical average of what others in the list have accomplished and using that to predict the outcome for your 5'11 170lb lifter. Perhaps if there was BMI (body mass index) column and if BMI could be ranked... and probably a larger data set. I think the problem is that there is no noise in your matrix to be reduced by using SVD. Here's a tut that appears to use a similar problem: http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html
Let TARGET be a set of strings that I expect to be spoken.
Let SOURCE be the set of strings returned by a speech recognizer (that is, the possible sentences that it has heard).
I need a way to choose a string from TARGET. I read about the Levenshtein distance and the Damerau-Levenshtein distance, which basically returns the distance between a source string and a target string, that is the number of changes needed to transform the source string into the target string.
But, how can I apply this algorithm to a set of target strings?
I thought I'd use the following method:
For each string that belongs to TARGET, I calculate the distance from each string in SOURCE. In this way we obtain an m-by-n matrix, where n is the cardinality of SOURCE and n is the cardinality of TARGET. We could say that the i-th row represents the similarity of the sentences detected by the speech recognizer with respect to the i-th target.
Calculating the average of the values on each row, you can obtain the average distance between the i-th target and the output of the speech recognizer. Let's call it average_on_row(i), where i is the row index.
Finally, for each row, I calculate the standard deviation of all values in the row. For each row, I also perform the sum of all the standard deviations. The result is a column vector, in which each element (Let's call it stadard_deviation_sum(i)) refers to a string of TARGET.
The string which is associated with the shortest stadard_deviation_sum could be the sentence pronounced by the user. Could be considered the correct method I used? Or are there other methods?
Obviously, too high values indicate that the sentence pronounced by the user probably does not belong to TARGET.
I'm not an expert but your proposal does not make sense. First of all, in practice I'd expect the cardinality of TARGET to be very large if not infinite. Second, I don't believe the Levensthein distance or some similar similarity metric will be useful.
If :
you could really define SOURCE and TARGET sets,
all strings in SOURCE were equally probable,
all strings in TARGET were equally probable,
the strings in SOURCE and TARGET consisted of not characters but phonemes,
then I believe your best bet would be to find the pair p in SOURCE, q in TARGET such that distance(p,q) is minimum. Since especially you cannot guarantee the equal-probability part, I think you should think about the problem from scratch, do some research and make a completely different design. The usual methodology for speech recognition is the use Hidden Markov models. I would start from there.
Answer to your comment: Choose whichever is more probable. If you don't consider probabilities, it is hopeless.
[Suppose the following example is on phonemes, not characters]
Suppose the recognized word the "chees". Target set is "cheese", "chess". You must calculate P(cheese|chees) and P(chess|chees) What I'm trying to say is that not every substitution is equiprobable. If you will model probabilities as distances between strings, then at least you must allow that for example d("c","s") < d("c","q") . (It is common to confuse c and s letters but it is not common to confuse c and q) Adapting the distance calculation algorithm is easy, coming with good values for all pairs is difficult.
Also you must somehow estimate P(cheese|context) and P(chess|context) If we are talking about board games chess is more probable. If we are talking about dairy products cheese is more probable. This is why you'll need large amounts of data to come up with such estimates. This is also why Hidden Markov Models are good for this kind of problem.
You need to calculate these probabilities first: probability of insertion, deletion and substitution. Then use log of these probabilities as penalties for each operation.
In a "context independent" situation, if pi is probability of insertion, pd is probability of deletion and ps probability of substitution, the probability of observing the same symbol is pp=1-ps-pd.
In this case use log(pi/pp/k), log(pd/pp) and log(ps/pp/(k-1)) as penalties for insertion, deletion and substitution respectively, where k is the number of symbols in the system.
Essentially if you use this distance measure between a source and target you get log probability of observing that target given the source. If you have a bunch of training data (i.e. source-target pairs) choose some initial estimates for these probabilities, align source-target pairs and re-estimate these probabilities (AKA EM strategy).
You can start with one set of probabilities and assume context independence. Later you can assume some kind of clustering among the contexts (eg. assume there are k different sets of letters whose substitution rate is different...).
I'm trying hard to do a lab for school. I'm trying to solve a crossword puzzle using genetic algorithms.
Problem is that is not very good (it is still too random)
I will try to give a brief explanation of how my program is implemented now:
If i have the puzzle (# is block, 0 is empty space)
#000
00#0
#000
and a collection of words that are candidates to this puzzle's solution.
My DNA is simply the matrix as a 1D array.
My first set of individuals have random generated DNAs from the pool of letters that my words contains.
I do selection using roulette-selection.
There are some parameters about the chance of combination and mutations, but if mutation will happen then i will always change 25% of the DNA.
I change it with random letters from my pool of letters.(this can have negative effects, as the mutations can destroy already formed words)
Now the fitness function:
I traverse the matrix both horizontaly and verticaly:
If i find a word then FITNESS += word.lengh +1
If i find a string that is a part of some word then FITNESS += word.length / (puzzle_size*4) . Anyway it should give a value between 0 and 1.
So it can find "to" from "tool" and ads X to FITNESS, then right after it finds "too" from "tool" and adds another Y to FITNESS.
My generations are not actually improving over time. They appear random.
So even after 400 generations with a pool of 1000-2000 (these numbers dont really matter) i get a solution with 1-2 words (of 2 or 3 letters) when the solution should have 6 words.
I think your fitness function might be ill-defined. I would set this up so each row has a binary fitness level. Either a row is fit, or it is not. (eg a Row is a word or it is not a word) Then the overall fitness of the solution would be #fit rows / total rows (both horizontally and vertically). Also, you might be changing too much of the dna, I would make that variable and experiment with that.
Your fitness function looks OK to me, although without more detail it's hard to get a really good picture of what you're doing.
You don't specify the mutation probability, but when you do mutate, 25% is a very high mutation. Also, roulette wheel selection applies a lot of selection pressure. What you often see is that the algorithm pretty early on finds a solution that is quite a bit better than all the others, and roulette wheel selection causes the algorithm to select it with such high probability that you quickly end up with a population full of copies of that. At that point, search halts except for the occasional blindly lucky mutation, and since your mutations are so large, it's very unlikely that you'll find an improving move without wrecking the rest of the chromosome.
I'd try binary tournament selection, and a more sensible mutation operator. The usual heuristic people use for mutation is to (on average) flip one "bit" of each chromosome. You don't want a deterministic one letter change each time though. Something like this:
for(i=0; i<chromosome.length(); ++i) {
// random generates double in the range [0, 1)
if(random() < 1.0/chromosome.length()) {
chromosome[i] = pick_random_letter();
}
}