TSP/TSPTW with different seeds - genetic-algorithm

I would like to ask whether it is possible to run the GA with different seeds to generate the initial solution and then analyse the results. At the beginning of applying a GA, you have to produce a population of solutions.
For example, you run the genetic algorithm using seed "12345" to generate the initial solution, populate a list of random solutions from that initial solution, and continue applying the GA steps to solve the problem.
Then you run the GA with another seed, for example "5678", to generate the initial solution, again populate a list of random solutions from it, and continue applying the GA steps.
That means the populated list in the first run may contain the initial solution that was generated in the second run.
My question is: is there any way I can use the GA with different seeds to make comparisons and analyses? If not, how can I compare and analyse the results? Should I just use different instance files for the problem?

To compare stochastic algorithms, you typically first run each of them multiple times with different random seeds; the output you obtain is then a sample of a random variable. You can assess whether one algorithm is better than another by performing a statistical hypothesis test on the resulting samples (ANOVA or Kruskal-Wallis for multiple comparisons, the t-test or Mann-Whitney U test for pairwise comparisons). If the obtained p-value is below your chosen threshold (typically 0.05, or lower, e.g. 0.01, for more rigorous claims), you reject the null hypothesis H0 that the results are equal, e.g. with respect to their means. You then assume that the results are unequal and, further, that the algorithm with the better average performance is the better one to choose (if you're interested in average performance; "better" usually has many dimensions).
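For instance, here is a minimal sketch in Python (assuming SciPy is available; the run results below are made-up placeholders) of a pairwise Mann-Whitney U test on the final objective values of two GA variants, each run 10 times with a different seed:

```python
# Hypothetical final tour lengths from 10 independent runs of each GA
# variant, each run using a different random seed.
from scipy.stats import mannwhitneyu

results_a = [412.3, 408.9, 415.0, 410.2, 409.7, 413.8, 411.1, 407.5, 414.2, 410.9]
results_b = [418.6, 421.0, 416.4, 419.9, 422.3, 417.8, 420.5, 418.1, 423.0, 419.4]

# Two-sided Mann-Whitney U test: H0 is that both samples come from
# the same distribution.
stat, p_value = mannwhitneyu(results_a, results_b, alternative="two-sided")
if p_value < 0.05:
    print(f"Reject H0 (p = {p_value:.4f}): the variants differ significantly.")
else:
    print(f"Cannot reject H0 (p = {p_value:.4f}).")
```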
One thing in your comments made me wonder:
If I run the GA algorithm multiple times with the same seed for initial solution, the result will be completely different
I think you have made an error in your code. You need to use the same random object for every random decision made inside your algorithm in order to obtain exactly the same result. Somewhere in your code you probably use a new Random() instead of the one you created initially with the given seed. Another possible reason is parallel processing of parts that draw random numbers: you can never guarantee that your threads are always executed in the same order, so one time thread 1 gets the first random number and thread 2 the second, and another time thread 2 executes first and gets the first number.
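As a minimal sketch of the correct pattern (shown in Python here; the same idea applies to Java's Random), where `run_ga` is a hypothetical stand-in for your algorithm:

```python
import random

def run_ga(seed):
    """Toy illustration (not the poster's code): every random decision
    goes through the ONE generator created from the seed."""
    rng = random.Random(seed)                      # seed a single generator once
    population = [[rng.random() for _ in range(10)] for _ in range(20)]
    parent = rng.choice(population)                # OK: uses the seeded rng
    # bad_value = random.Random().random()         # BUG: fresh, time-seeded generator
    return parent

assert run_ga(12345) == run_ga(12345)              # same seed => identical run
```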

Related

Different stream of pseudorandom numbers

I have a homework problem on running a simulation where I generate 100 random numbers and perform a calculation on each outcome. The next question asks me to repeat the previous one but with a different stream of pseudorandom numbers. A side note tells me to perform both computations within one call to the program, because changing the seed/state arbitrarily can lead to overlapping streams.
Can someone explain to me what this means? Why do I have to do it in one run?
Why can't I just call the same code twice using a different seed each time?
Pseudo-random number generators (PRNGs) work by iterating through a deterministic set of calculations on some internal information known as the generator's state, and then handing you back a value which is based on the state. There's a finite amount of state information that determines what the next state, and thus the next outcome, will be. Since it's finite, eventually the generator will revisit a state that it used before, and from that point forward all values will be exact duplicates of the sequence you've already seen. The PRNG is said to have cycled. "Seeding" a random number generator sets the starting point for the state, so it effectively corresponds to choosing an entry point to the cycle.
If a human intervenes by changing the seed arbitrarily, there's a chance that they will prematurely put the state back to where some portion of the output sequence gets repeated. This is referred to as overlapping streams. The solution is to seed your PRNG once, and then don't mess with it, so it can progress through its full cycle.
In your case it means that the values and ordering of your first set of 100 numbers will be distinct from the values and ordering of your second set of 100.
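Sketched in Python, one way to realize "two computations within one call" from a single seeded generator (the seed values shown are arbitrary):

```python
import random

rng = random.Random(12345)                         # seed once, up front

stream_1 = [rng.random() for _ in range(100)]      # values 1-100 of the cycle
stream_2 = [rng.random() for _ in range(100)]      # values 101-200: cannot overlap

# Reseeding by hand between batches instead, e.g.
#     rng.seed(12345); batch_1 = ...
#     rng.seed(54321); batch_2 = ...
# picks two arbitrary entry points into the same cycle, with no guarantee
# that they are far apart -- the two batches may partially repeat each other.
```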

Parallel Normal Distributions

I'm working on a simulation where a large task is completed by a series of independent smaller tasks, either in parallel or in series. Each smaller task's time to completion follows a normal distribution with a mean time, say "t", and a variance, say "v". I understand that if this task is repeated in series "n" times, then the total time distribution is normal with mean t*n and variance v*n, which is nice, but I don't know what happens to the mean and variance if a set of the same tasks is done simultaneously/in parallel; it's been a while since prob/stat class. Is there a nice/fast way to find the new time distribution for "n" of these independent normally distributed tasks done in parallel?
If the tasks are undertaken independently and in parallel, the time until completion is the time of the longest-running subprocess, i.e. the maximum of the individual completion times.
Unfortunately, the max function doesn't have particularly nice properties for theoretical analysis, but if you're already simulating there's an easy way to do it. For each subprocess i with mean t_i and variance v_i, draw time until completion for each i independently then look at the biggest. Repeating this lots of times will give you a bunch of samples from the max distribution you're interested in: you can compute the expectation (average), variance, or whatever you want.
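A minimal Python sketch of that sampling scheme, with made-up task parameters:

```python
import random

def parallel_completion_time(means, variances, rng):
    """One sample: n independent parallel tasks finish when the slowest does."""
    return max(rng.gauss(m, v ** 0.5) for m, v in zip(means, variances))

rng = random.Random(1)
means, variances = [10.0] * 5, [4.0] * 5           # hypothetical: 5 tasks, t=10, v=4
samples = [parallel_completion_time(means, variances, rng) for _ in range(100_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
print(f"estimated mean {mean:.3f}, estimated variance {var:.3f}")
```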
The question is: what is the distribution of the maximum (greatest value) of the random completion times? The distribution function (i.e. the integral of the probability density up to a given point) of the maximum of a collection of independent random variables is just the product of the distribution functions of the individual variables. (The distribution function of the minimum is 1 - (product of (1 - distribution function)).)
If you want to find a time such that probability(maximum > time) = (some given value), you may be able to solve that exactly, or else resort to a numerical method. Either way, solving the equation numerically (e.g. with the bisection method) is much faster and more accurate than the Monte Carlo approach you mentioned you have already tried.
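For illustration, a hedged sketch of that numerical approach in Python (assuming SciPy for the normal CDF), combining the product-of-CDFs identity above with bisection:

```python
from scipy.stats import norm

def p_max_leq(t, means, variances):
    """P(max of independent normals <= t) = product of the individual CDFs."""
    prob = 1.0
    for m, v in zip(means, variances):
        prob *= norm.cdf(t, loc=m, scale=v ** 0.5)
    return prob

def completion_quantile(q, means, variances, lo=-1e9, hi=1e9, tol=1e-9):
    """Bisection: find t with P(max <= t) = q, i.e. P(max > t) = 1 - q.
    Works because p_max_leq is monotonically increasing in t."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if p_max_leq(mid, means, variances) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Time by which 5 parallel tasks (mean 10, variance 4) have all finished
# with probability 0.95:
print(completion_quantile(0.95, [10.0] * 5, [4.0] * 5))
```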
This isn't exactly a programming problem, but what you're looking for are the distributions of order statistics of normal random variables, i.e., the expected value/variance/etc of the job that took the longest, shortest, etc. This is a solved problem for identical means and variances, because you can scale all the random variables to the standard normal distribution, which has been analyzed.
Here's the paper that gives you the answer, though you're going to need some math knowledge to understand it:
J. P. Royston, "Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate)", Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 31, No. 2 (1982), pp. 161-165.
See this post on stats.stackexchange for more information.

Why do Genetic Algorithms evaluate the same genome several times?

I'm trying to solve the DARP problem using a genetic algorithm DLL.
The thing is that even though it sometimes comes up with the right solution, other times it does not, although I'm using a really simple version of the problem. When I checked the genomes the algorithm evaluated, I found out that it evaluates the same genome several times.
Why does it evaluate the same genome several times? Wouldn't it be more efficient if it did not?
There is a difference between evaluating the same chromosome twice, and using the same chromosome in a population (or different populations) more than once. The first can probably be usefully avoided; the second, maybe not.
Consider:
In some generation G1, you mate 0011 and 1100, cross them right down the middle, get lucky and fail to mutate any of the genes, and end up with 0000 and 1111. You evaluate them, stick them back into the population for the next generation, and continue the algorithm.
Then in some later generation G2, you mate 0111 and 1001 at the first index and (again, ignoring mutation) end up with 1111 and 0001. One of those has already been evaluated, so why evaluate it again? Especially if evaluating that function is very expensive, it may very well be better to keep a hash table (or some such) to store the results of previous evaluations, if you can afford the memory.
But! Just because you've already generated a value for a chromosome doesn't mean you shouldn't include it naturally in the results going forward, allowing it to either mutate further or allowing it to mate with other members of the same population. If you don't do that, you're effectively punishing success, which is exactly the opposite of what you want to do. If a chromosome persists or reappears from generation to generation, that is a strong sign that it is a good solution, and a well-chosen mutation operator will act to drive it out if it is a local maximum instead of a global maximum.
The basic explanation for why a GA might evaluate the same individual is precisely that it is unguided, so the recreation of a previously-seen (or simultaneously-seen) genome is to be expected.
More usefully, your question could be interpreted as about two different things:
A high cost associated with evaluating the fitness function, which could be mitigated by hashing the genome, at the cost of memory. This is conceivable, but I've never seen it. Usually GAs search high-dimensional spaces, so you'd end up with a very sparse hash.
A population in which many or all members have converged to one or a few patterns: at some point, the diversity of your population will tend towards 0. This is the expected outcome, as the algorithm converges upon the best solution that it has found. If this happens too early, with mediocre results, it indicates that you are stuck in a local optimum and have lost diversity too early.
In this situation, study the output to determine which of the two scenarios has happened:
You lose diversity because particularly-fit individuals win a very large percentage of the parental lotteries. Or,
You don't gain diversity because the population over time is quite static.
In the former case, you have to maintain diversity. Ensure that less-fit individuals get more chances, perhaps by decreasing turnover or scaling the fitness function.
In the latter case, you have to increase diversity. Ensure that you inject more randomness into the population, perhaps by increasing mutation rate or increasing turnover.
(In both cases, of course, you ultimately do want diversity to decrease as the GA converges in the solution space. So you don't just want to "turn up all the dials" and devolve into a Monte Carlo search.)
Basically, a genetic algorithm consists of:
- an initial population (size N)
- a fitness function
- a mutation operation
- a crossover operation (usually performed on 2 individuals by taking parts of their genomes and combining them into a new individual)
At every step it:
1. chooses random individuals
2. performs crossover, resulting in new individuals
3. possibly performs mutation (changes a random gene in a random individual)
4. evaluates all old and new individuals with the fitness function
5. chooses the N best fitted to be the new population on the next iteration
The algorithm stops when it reaches a threshold of the fitness function, or when the population has not changed in the last K iterations.
So it can stop not at the best solution, but at a local maximum.
Part of the population can stay unchanged from one iteration to the next, because those individuals can have a good fitness value.
It is also possible to "fall back" to previous genomes because of mutation.
There are a lot of tricks to make a genetic algorithm work better: choosing an appropriate encoding of the problem into a genome, finding a good fitness function, and tuning the crossover and mutation rates.
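Putting the steps above together, a minimal Python sketch of such a loop, where the fitness, crossover, and mutation operators are hypothetical caller-supplied placeholders:

```python
import random

def genetic_algorithm(population, fitness, crossover, mutate,
                      n_keep, k_stale, rng):
    """Skeleton of the loop described above; fitness/crossover/mutate
    are problem-specific and supplied by the caller."""
    best, stale = max(map(fitness, population)), 0
    while stale < k_stale:                          # stop after K stale iterations
        mom, dad = rng.sample(population, 2)        # choose random individuals
        children = [mutate(c, rng) for c in crossover(mom, dad, rng)]
        population = sorted(population + children,  # evaluate old and new together
                            key=fitness, reverse=True)[:n_keep]
        top = fitness(population[0])                # keep only the N best fitted
        stale = 0 if top > best else stale + 1
        best = max(best, top)
    return population[0]
```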
Depending on the particulars of your GA, you may have the same genomes in successive populations. For example, elitism saves the best (or n best) genomes from each generation.
Reevaluating genomes is inefficient from a computational standpoint. The easiest way to avoid it is to keep a boolean HasFitness flag on each genome. You could also create a unique string key for each genome encoding and store all the fitness values in a dictionary. This lookup can itself get expensive, so it is only recommended if your fitness function is costly enough to warrant the added expense.
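A minimal sketch of the dictionary variant in Python (`raw_fitness` is a placeholder for your actual fitness function):

```python
fitness_cache = {}

def cached_fitness(genome, raw_fitness):
    """Memoized fitness: only worthwhile when raw_fitness is expensive
    compared with building the key and doing the dictionary lookup."""
    key = ",".join(map(str, genome))      # unique string key for the encoding
    if key not in fitness_cache:
        fitness_cache[key] = raw_fitness(genome)
    return fitness_cache[key]
```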
Elitism aside, the GA does not deliberately evaluate the same genome repeatedly. What you are seeing are identical genomes being regenerated and reevaluated: each generation is a new set of genomes, which may or may not have been evaluated before.
To avoid the reevaluation, you would need to keep a list of already-produced genomes along with their fitness. To access a fitness, you would need to compare each member of your new population with the list; when a genome is not in the list, you evaluate it and add it.
As real-world applications can have thousands of parameters, you can end up with millions of stored genomes. The list then becomes massively expensive to search and maintain, so it is probably quicker to just evaluate the genome each time.

Distribute numbers to two "containers" and minimize their difference of sum

Suppose there are n numbers; for example, say we have the following 4 numbers: 15, 20, 10, 25.
There are two containers, A and B, and my job is to distribute the numbers between them so that the sums of the numbers in the two containers have the least difference.
In the above example, A should have 15+20 and B should have 10+25, so the difference = 0.
I have thought of a method. It seems to work, but I don't know why:
Sort the number list in descending order first. In each round, take the maximum number out and put it into the container with the smaller sum.
By the way, can it be solved by DP?
Thanks.
In fact, your method doesn't always work. Consider 2, 4, 4, 5, 5: your method produces (5,4,2)(5,4), with a difference of 2, while the best answer is (5,5)(4,4,2), with a difference of 0.
Yes, it can be solved by dynamic programming. Here are some useful links:
Tutorial and Code: http://www.cs.cornell.edu/~wdtseng/icpc/notes/dp3.pdf
A practice: http://people.csail.mit.edu/bdean/6.046/dp/ (then click Balanced Partition)
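As a sketch of the idea behind those links (not the linked code itself), here is the subset-sum style DP in Python, using a set of reachable sums instead of an explicit matrix:

```python
def min_partition_difference(numbers):
    """Balanced partition by DP: mark every achievable subset sum, then
    take the achievable sum closest to half of the total."""
    total = sum(numbers)
    reachable = {0}
    for x in numbers:
        reachable |= {s + x for s in reachable}
    # One container sums to s, the other to total - s; minimize |total - 2s|.
    return min(abs(total - 2 * s) for s in reachable)

print(min_partition_difference([15, 20, 10, 25]))  # 0: (15,20) vs (10,25)
print(min_partition_difference([2, 4, 4, 5, 5]))   # 0: (5,5) vs (4,4,2)
```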
What's more, please note that if the scale of the problem is very large (say, 5 million numbers), you won't want to use DP, which needs a huge matrix. In that case you can use a kind of Monte Carlo algorithm:
1. divide the n numbers into two groups randomly (or use your method at this step if you like);
2. choose one number from each group; if swapping these two numbers decreases the difference of sums, swap them;
3. repeat step 2 until no swap has occurred for a long time.
You shouldn't expect this method to always find the best answer, but it is the only way I know to solve this problem at very large scale within reasonable time and memory.
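A Python sketch of that swap procedure (`max_stale` is an arbitrary stand-in for "a long time"):

```python
import random

def local_search_partition(numbers, max_stale=100_000, seed=0):
    """Randomized swap heuristic for huge instances: random split, then
    keep swapping a random pair across containers whenever the swap
    reduces the difference of sums. A heuristic, not guaranteed optimal."""
    rng = random.Random(seed)
    a, b = [], []
    for x in numbers:                               # step 1: random split
        (a if rng.random() < 0.5 else b).append(x)
    diff = sum(a) - sum(b)
    stale = 0
    while stale < max_stale and a and b:            # steps 2-3: swap until stale
        i, j = rng.randrange(len(a)), rng.randrange(len(b))
        # Swapping a[i] and b[j] changes the difference by 2 * (b[j] - a[i]).
        new_diff = diff + 2 * (b[j] - a[i])
        if abs(new_diff) < abs(diff):
            a[i], b[j] = b[j], a[i]
            diff = new_diff
            stale = 0
        else:
            stale += 1
    return a, b, abs(diff)
```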

Genetic algorithm on a knapsack-like optimization problem

I have an optimization problem I'm trying to solve using a genetic algorithm. Basically, there is a list of 10 bounded real-valued variables (-1 <= x <= 1), and I need to maximize some function of that list. The catch is that at most 4 variables in the list may be != 0 (the subset condition).
Mathematically speaking:
For some function f: [-1, 1]^10 -> R
max f(X)
s.t.
|{var in X with var != 0}| <= 4
Some background on f: The function is NOT similar to any kind of knapsack objective function like Sum x*weight or anything like that.
What I have tried so far:
Just a basic genetic algorithm over the genome [-1, 1]^10 with 1-point crossover and some Gaussian mutation on the variables. I tried to encode the subset condition in the fitness function by using just the first 4 nonzero values (zero as in "close enough to 0"). This approach doesn't work well: the algorithm gets stuck on the first 4 variables and never uses values beyond them. I saw this kind of approach work well in a GA for the 0-1 knapsack problem, but apparently it only works with binary variables.
What would you recommend I try next?
If your fitness function is quick to evaluate, then it's cheap to increase your total population size.
The problem you are running into is that you're trying to select two completely different things simultaneously: which 4 genes you care about, and what values for them are optimal.
I see two ways to do this.
You create 210 different "species", one for each of the C(10,4) = 210 ways of choosing which 4 of the 10 genes an individual is allowed to use. Then you can run a genetic algorithm on each species separately (either serially, or in parallel within a cluster).
Each organism has only 4 gene values (when creating random offspring, choose which genes at random). When two organisms mate, you only cross over on genes that match. If your pair of organisms shares 3 common genes, you could randomly pick which parent's remaining gene to use as the 4th. You could also, as a heuristic, avoid mating organisms that appear too genetically different (i.e. a pair that shares two or fewer genes may make for a bad offspring).
I hope that gives you some ideas you can work from.
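As a hedged sketch of the second idea, here is one possible Python encoding in which each organism stores only its 4 active gene indices and values (all names are illustrative):

```python
import random

rng = random.Random(42)

def random_individual():
    """An organism stores only its 4 active genes as {index: value}."""
    return {i: rng.uniform(-1.0, 1.0) for i in rng.sample(range(10), 4)}

def mate(mom, dad):
    """Cross over only on genes the parents share; fill the remaining
    slots from either parent's private genes, keeping exactly 4 active."""
    common = set(mom) & set(dad)
    child = {i: rng.choice((mom[i], dad[i])) for i in common}
    private = [(i, v) for p in (mom, dad) for i, v in p.items() if i not in common]
    rng.shuffle(private)
    for i, v in private:
        if len(child) == 4:
            break
        child[i] = v                   # private indices never collide with common
    return child

print(mate(random_individual(), random_individual()))
```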
You could try a "pivot"-style step: choose one of the existing nonzero values to become zero, and replace it by setting one of the existing zero values to become nonzero. (My "pivot" term comes from linear programming, in which a pivot is the basic step in the simplex method).
The simplest approach is to be evenhandedly random in the selection of each of these values; you can choose one random value, or several, for the new nonzero variable. A more local kind of step would be to use a Gaussian step only on the existing nonzero variables, but if one of those variables crosses zero, spawn variations that pivot to one of the zero values. (Note that these are not mutually exclusive: you can easily add both kinds of steps.)
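A minimal Python sketch of the evenhandedly random pivot move on a dense 10-vector (the function name is illustrative):

```python
import random

def pivot(x, rng):
    """Pivot move: zero out one random nonzero entry and activate one
    random zero entry with a fresh value in [-1, 1]."""
    x = list(x)
    nonzero = [i for i, v in enumerate(x) if v != 0.0]
    zero = [i for i, v in enumerate(x) if v == 0.0]
    if nonzero and zero:
        x[rng.choice(nonzero)] = 0.0
        x[rng.choice(zero)] = rng.uniform(-1.0, 1.0)
    return x

rng = random.Random(7)
print(pivot([0.3, 0.0, 0.0, -0.8, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0], rng))
```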
If you have any information about the local behavior of your fitness score, you can try to use that to guide your choice. Just because actual evolution doesn't look at the fitness function, doesn't mean you can't...
Does your GA solve the problem well without the subset constraint? If not, you might want to tackle that first.
Secondly, you might make your constraint soft instead of hard: penalize a solution's fitness for each nonzero variable it has beyond 4. (You might start by loosening the constraint even further, allowing 9 nonzero variables, then 8, etc., and making sure the GA can handle those problem variants before making the problem more difficult.)
Thirdly, maybe try 2-point or multi-point crossover instead of 1-point.
Hope that helps.
-Ted
