Im just learning about genetic algorithm when i was given a task to design a genetic algorithm that learns rules that predicts if a person will vote yes or no given a data set.
I've been reading in book and internet about GA and GP for 2 days straight. So now i kind of understand the concept of GA about the population management, genetic operators, fitness functions and crossover with the different types of crossover masks. But i'm still nowhere near making my own GA for a given data set. I just don't get how to start or with what and i'm kind of getting desperate since i get a feeling i'm to stupid for this.
So any kind of help, such as hints, tips or pseudo code, will be much appreciated!
The given data set is as follows (groups):
G1 | G2 | G3 | G4
A1 | B1 | C1 | None
A2 | B2 | C2 | D2
A3 | B3 | C3 | D3
A4 | B4 | C4 | D4
A5 | - | - | D5
Well the data are not a,b,c's. They are something else much longer, but i'm kind of lazy so yea :P The - means there is no more attributes. Note that none is an attribute.
Thanks for any sort of help guys!
First, and foremost, you'll have to determine what you're trying to solve with your data set in the first place. You generally use a genetic algorithm to tackle non-deterministic problems: problems that take a long time to solve, but whose answers are easily verifiable.
So the first question is: what does your dataset represent?
The second question: what are you trying to solve and is a genetic algorithm a fitting method to solve your problem?
Anyway, creating a genetic algorithm is done through the following steps:
Represent the problem variable domain as a chromosome of a fixed length, choose the size of the population N, the crossover probability p(c) and the mutation probability p(m)
Define a fitness function f(x) to measure the performance, or fitness, of an individual chromosome in the problem domain. The fitness function establishes the basis for selecting chromosomes that will be mated during reproduction
Randomly generate an initial population of chromosomes of size N: x1, x2, ..., xn
Calculate the fitness of each individual chromosome: f(x1), f(x2), ..., f(xn)
Select a pair of chromosomes for mating from the current population. Parent chromosomes are selected with a probability related to their fitness. Highly fit chromosomes have a higher probability of being selected for mating than less fit chromosomes.
Create a pair of offspring chromosomes by applying the genetic operators - crossover and mutation
Place the created offspring chromosomes in the new population
Repeat step 5 until the size of the new chromosome population become equal to the size of the initial population N
Replace the initial (parent) chromosome population with the new (offspring) population
Go to step 4 and repeat the process until the termination criterion is satisfied.
So, you have to find a notation for your solution (such as an array of bits or a string) that allows you to swap parts of chromosomes easily. Then you have to identify the crossover and mutation operations.
If you're dealing with ordered chromosomes, then depending on the applied crossover strategy you may have to repair your chromosomes afterwards. An ordered chromosome is a chromosome where the order or the genes matter. If you preform a standard crossover on two solution that represent the cities that the travelling salesman has to visit, you might end up with a chromosome where he visits some cities twice or more and some not at all!
There's no clear description on how to translate each problem in a genetic algorithm, because it's different for each problem. The above steps don't change, but you may need to introduce several different crossover and mutation operations to prevent premature convergence.
Well, I do not fully understand the description of the dataset, so my answer is based on the following assumptions:
We have a set of attributes, say n different one. Each attributes have a set of different possible symbolic (=non numeric) values, say m(i) different possibilities. Each person have the same attributes, but some of them might be missing or None.
If these assumptions are correct and the set of attributes and possible values are not too high, then one of these might work:
if these two sets are really small you could have an n dimensional array as an individual/genotype. Each dimension would have the size m(i) and each value of this structure would be the yes/no answer. It would be the generalization (=more dimensions) of a fixed size (bit) vector. How to create random/mutate/crossover should be easy. Fitness would be how often it makes a good prediction.
if they are bigger then you will need something more complicated. One possibility is to have lists of rules. Each rule could be an vector with length n + a yes/no flag. In each position of the vector you would have a possible value of the related attribute. You could also have a jolly joker attribute accepting everything.
Interpretation of a rule (p:person, r:rule) : if p1=r1 and p2=r2 and ... pn=rn then the result is the flag of the rule.
You will have to evaluate rules until you find a matching one. You will also need a default.
Genetic operators are a bit more tricky in this case, but I think you will find something if you search for variable length encoding.
I've used a similar encoding (for a different problem) and it worked fine.
to make it more general (but also more complicated) you can represent your rules as trees where the internal nodes are and/or/not and possibly other logical operators, leafs are predicates like pi=ri. This would be a kind of genetic programming, google for that if you like this solution.
To be honest I'm not 100% sure if a genetic algorithm is the best choice for this problem, especially if the values are not symbolic, but numeric. It seems to be a pattern matching problem, and for that there are much better solutions. I would look for some alternatives, e.g. neural networks in the numeric case.
Related
I have a table of 21 students (A1…A21) and their 25 characteristics (table 1) and I have another matrix (table 2) which shows if a student likes another student or not (0 means likes and 100 means dislike).
How can I find the least no. of characteristics that can give me similar distance in space as the likeability matrix?
For Example:
If we get 5 dimensions with characteristics C1, C3, C4, C5, C10, then the points A1,..A21 when plotted for these characteristics will have the proportional distance as the likeability matrix.
For example, if A3 and A2 have a small distance between them in that 5D characteristics space, then they will have a corresponding smaller distance/value in the likeability matrix.
Table 1
Table 2
You can make this look like a well-known statistical problem, but you have made assumptions (that similar students like each other), I will make further assumptions, and most of the solutions to the statistical problem are not very respectable, so you should take the results with a pinch of salt.
With 21 students, you have 21*20/2 = 210 pairs of students. Treat each pair as a separate observation. You have a likeability value for that pair. For each pair compute, for each characteristic, the absolute value of the difference between the values for each of the two students. This gives you a vector of 25 elements for each observation. You will now try and predict the 210 likeabilities given the 210 25-long vectors of absolute differences.
Procedures for this go under the names of all-subsets regression and stepwise regression. See https://www.r-bloggers.com/variable-selection-using-automatic-methods/ and https://www.r-bloggers.com/variable-selection-using-automatic-methods/. One way to compute these is to use the free open source statistical package R from https://www.r-project.org/.
For each possible selection of variables you can use linear regression to predict likeability from the vector of absolute differences. From that linear regression you can get a measure of how good the prediction is, and so whether that particular selection of variables was any good or not. All subsets regression uses a variation on branch and bound to work out, for each N, the set of variables of size N which predicts best. Stepwise regression starts off with a possibly incomplete selection of variables and performs a sort of hillclimb, adding or subtracting one variable from the set at each stage, trying all of the variables and choosing the one that gives the best prediction. Typically you start with no variables and add one variable at a time, or start will all variables, and remove one variable at a time. Stepwise selection isn't guaranteed to find the absolute best selection of variables that all subsets regression will find, but all subsets regression can be very expensive.
From this you will get a best selection of variables (probably one best selection for each number of variables) and you may get some indication of statistical significance. You have broken so many rules about multiple testing and independence (inflating 21 observations to 210) that you shouldn't take any statistical significance seriously. If you want some idea of whether you are looking at real information or prettied-up random noise, automate the procedure and see what it looks like on fake data where there is no underlying effect at all, and perhaps on fake data where you have constructed data from which there is an underlying effect that you know about because you have constructed it. See also https://en.wikipedia.org/wiki/Bootstrapping_(statistics) and https://en.wikipedia.org/wiki/Resampling_(statistics)#Permutation_tests
I have two genes with different sizes and I want to produce offspring from them. The position of the chromosome doesn't make difference in the gene.
I want to know what is common to do in this situation
Gene1:
123456789
Gene2:
ABCDEFGHIJKL
I can use a single cross point in each
12345.6789
ABCD.EFGHIJKL
And with this I have 8 possible combinations
1. 12345ABCD
2. 12345EFGHIJKL
3. 6789ABCD
4. 6789EFGHIJKL
5. ABCD12345
6. ABCD6789
7. EFGHIJKL12345
8. EFGHIJKL6789
Is it okay to create all the 8 offsprings, or should I just make 1, if so, do I need to randomize the method or just pick one and stick with it?
Genetic algorithms are mocking biological processes where chromosomes crossover at one point and exchange their parts after the crossover point if we are talking about single point crossover.
As you can see in picture above parents exchange "tail" parts of the chromosome after the crossover point. Therefore you have only 2 offspring/children produced by crossover. That's how crossover occurs in nature, how biologists describe it.
If you refer to any literature dealing with topic of Genetic Algorithms they also state this convention that when using single point crossover parent chromosomes are split into head and tail denoted as H/T like that (see citation below):
H T
123456.789
H T
ABCDEF.GHI
Therefore offspring produced with this crossover will be:
123456GHI
ABCDEF789
Following this convention is much better than creating all possible combinations and then selecting random or fittest of the offspring as it is computationally more efficient. If you want to solve more complex problems you just simply increase size of the population to allow more diversity.
"Single point crossover: A single random cut is made, producing two head sections and two tail sections. The two tail sections are then swapped to produce
two new individuals (chromosomes)".
Genetic Algorithms and Genetic Programming:
Affenzeller, Michael
Wagner, Stefan
Winkler, Stephan,
ISBN: 1584886293
Alternatively you can use multipoint crossover which follows similar convention where chromosomes are split into sections and parents exchange parts in a way that offspring chromosome is just alternation of parents chromosomes so if you have parents with chromosomes:
A1.A2.A3.A4
B1.B2.B3.B4
|
| this will produce offspring
|
A1 B2 A3 B4
and
B1 A2 B3 A4
This answer might help you as well:
Crossover of chromosomes with different length
It seams you use Gene instead of Chromosome and vice versa.
In this case and if different size of chromosome is okay, you can create all 8 offspring. But your population increases in each iteration and you should control this. For example keep 2 of the best offspring or 2 random offspring and replace their parents by.
First of all, this is a part of a homework.
I am trying to implement a genetic algorithm. I am confused about selecting parents to crossover.
In my notes (obviously something is wrong) this is what is done as example;
Pc (possibility of crossover) * population size = estimated chromosome count to crossover (if not even, round to one of closest even)
Choose a random number in range [0,1] for every chromosome and if this number is smaller then Pc, choose this chromosome for a crossover pair.
But when second step applied, chosen chromosome count is equals to result found in first step. Which is not always guaranteed because of randomness.
So this does not make any sense. I searched for selecting parents for crossover but all i found is crossover techniques (one-point, cut and slice etc.) and how to crossover between chosen parents (i do not have a problem with these). I just don't know which chromosomesi should choose for crossover. Any suggestions or simple example?
You can implement it like this:
For every new child you decide if it will result from crossover by random probability. If yes, then you select two parents, eg. through roulette wheel selection or tournament selection. The two parents make a child, then you mutate it with mutation probability and add it to the next generation. If no, then you select only one "parent" clone it, mutate it with probability and add it to the next population.
Some other observations I noted and that I like to comment. I often read the word "chromosomes" when it should be individual. You hardly ever select chromosomes, but full individuals. A chromosome is just one part of a solution. That may be nitpicking, but a solution is not a chromosome. A solution is an individual that consists of several chromosomes which consist of genes which show their expression in the form of alleles. Often an individual has only one chromosome, but it's still not okay to mix terms.
Also I noted that you tagged genetic programming which is basically only a special type of a genetic algorithm. In GP you consider trees as a chromosome which can represent mathematical formulas or computer programs. Your question does not seem to be about GP though.
This is very late answer, but hoping it will help someone in the future. Even if two chromosomes are not paired (and produced children), they goes to the next generation (without crossover) but after some mutation (subject to probability again). And on the other hand, if two chromosomes paired, then they produce two children (replacing the original two parents) for the next generation. So, that's why the no of chromosomes remain same in two generations.
I read various stuff on this and understand the principle and concepts involved, however, none of paper mentions the details of how to calculate the fitness of a chromosome (which represents a route) involving adjacent cities (in the chromosome) that are not directly connected by an edge (in the graph).
For example, given a chromosome 1|3|2|8|4|5|6|7, in which each gene represents the index of a city on the graph/map, how do we calculate its fitness (i.e. the total sum of distances traveled) if, say, there is no direct edge/link between city 2 and 8. Do we follow some sort of greedy algorithm to work out a route between 2 and 8, and add the distance of this route to the total?
This problem seems pretty common when applying GA to TSP. Anyone who's done it before please share your experience. Thanks.
If there is no link between 2 and 8 on your graph, then any chromosome with 2|8 or 8|2 in it is invalid for the classical travelling salesman problem. If you find some other route between 2 and 8, you are probably going to violate the "visit each location once" requirement.
One really dodgy-but-pragmatic solution is to include edges between those nodes with incredibly high distances, or even +INF if your language supports it. That way, your standard minimizing fitness function will naturally prune them.
I think the original formulation of the problem includes edges between all nodes, so this is a non-issue.
This is the exact kind of problem, specialized Crossover and mutation methods have been applied for GA based solutions to TSP problems. See this question.
if the chromosone does not represent a valid solution then it is completly unfit to solve the problem. So depending on how you order fitness. ie if a lower number represents more fitness (possibly a good idea when fitness represents total cost) then you'd assign it a max value and break any further fitness calculation on that chromosone when you get to a gene sequence that is invalid.
(or vice versa, assign it a fitness of zero if a higher fitness means a chromosone is more fit for the job)
however as others have pointed out it could be better to ensure that invalid chromosones dont occur. However if that is itself an overly compex process then allowing them and ensuring that broken chromosones are unlikely to make it into successive generations could be an acceptable approach.
My best shot so far:
A delivery vehicle needs to make a series of deliveries (d1,d2,...dn), and can do so in any order--in other words, all the possible permutations of the set D = {d1,d2,...dn} are valid solutions--but the particular solution needs to be determined before it leaves the base station at one end of the route (imagine that the packages need to be loaded in the vehicle LIFO, for example).
Further, the cost of the various permutations is not the same. It can be computed as the sum of the squares of distance traveled between di -1 and di, where d0 is taken to be the base station, with the caveat that any segment that involves a change of direction costs 3 times as much (imagine this is going on on a railroad or a pneumatic tube, and backing up disrupts other traffic).
Given the set of deliveries D represented as their distance from the base station (so abs(di-dj) is the distance between two deliveries) and an iterator permutations(D) which will produce each permutation in succession, find a permutation which has a cost less than or equal to that of any other permutation.
Now, a direct implementation from this description might lead to code like this:
function Cost(D) ...
function Best_order(D)
for D1 in permutations(D)
Found = true
for D2 in permutations(D)
Found = false if cost(D2) > cost(D1)
return D1 if Found
Which is O(n*n!^2), e.g. pretty awful--especially compared to the O(n log(n)) someone with insight would find, by simply sorting D.
My question: can you come up with a plausible problem description which would naturally lead the unwary into a worse (or differently awful) implementation of a sorting algorithm?
I assume you're using this question for an interview to see if the applicant can notice a simple solution in a seemingly complex question.
[This assumption is incorrect -- MarkusQ]
You give too much information.
The key to solving this is realizing that the points are in one dimension and that a sort is all that is required. To make this question more difficult hide this fact as much as possible.
The biggest clue is the distance formula. It introduces a penalty for changing directions. The first thing an that comes to my mind is minimizing this penalty. To remove the penalty I have to order them in a certain direction, this ordering is the natural sort order.
I would remove the penalty for changing directions, it's too much of a give away.
Another major clue is the input values to the algorithm: a list of integers. Give them a list of permutations, or even all permutations. That sets them up to thinking that a O(n!) algorithm might actually be expected.
I would phrase it as:
Given a list of all possible
permutations of n delivery locations,
where each permutation of deliveries
(d1, d2, ...,
dn) has a cost defined by:
Return permutation P such that the
cost of P is less than or equal to any
other permutation.
All that really needs to be done is read in the first permutation and sort it.
If they construct a single loop to compare the costs ask them what the big-o runtime of their algorithm is where n is the number of delivery locations (Another trap).
This isn't a direct answer, but I think more clarification is needed.
Is di allowed to be negative? If so, sorting alone is not enough, as far as I can see.
For example:
d0 = 0
deliveries = (-1,1,1,2)
It seems the optimal path in this case would be 1 > 2 > 1 > -1.
Edit: This might not actually be the optimal path, but it illustrates the point.
YOu could rephrase it, having first found the optimal solution, as
"Give me a proof that the following convination is the most optimal for the following set of rules, where optimal means the smallest number results from the sum of all stage costs, taking into account that all stages (A..Z) need to be present once and once only.
Convination:
A->C->D->Y->P->...->N
Stage costs:
A->B = 5,
B->A = 3,
A->C = 2,
C->A = 4,
...
...
...
Y->Z = 7,
Z->Y = 24."
That ought to keep someone busy for a while.
This reminds me of the Knapsack problem, more than the Traveling Salesman. But the Knapsack is also an NP-Hard problem, so you might be able to fool people to think up an over complex solution using dynamic programming if they correlate your problem with the Knapsack. Where the basic problem is:
can a value of at least V be achieved
without exceeding the weight W?
Now the problem is a fairly good solution can be found when V is unique, your distances, as such:
The knapsack problem with each type of
item j having a distinct value per
unit of weight (vj = pj/wj) is
considered one of the easiest
NP-complete problems. Indeed empirical
complexity is of the order of O((log
n)2) and very large problems can be
solved very quickly, e.g. in 2003 the
average time required to solve
instances with n = 10,000 was below 14
milliseconds using commodity personal
computers1.
So you might want to state that several stops/packages might share the same vj, inviting people to think about the really hard solution to:
However in the
degenerate case of multiple items
sharing the same value vj it becomes
much more difficult with the extreme
case where vj = constant being the
subset sum problem with a complexity
of O(2N/2N).
So if you replace the weight per value to distance per value, and state that several distances might actually share the same values, degenerate, some folk might fall in this trap.
Isn't this just the (NP-Hard) Travelling Salesman Problem? It doesn't seem likely that you're going to make it much harder.
Maybe phrasing the problem so that the actual algorithm is unclear - e.g. by describing the paths as single-rail railway lines so the person would have to infer from domain knowledge that backtracking is more costly.
What about describing the question in such a way that someone is tempted to do recursive comparisions - e.g. "can you speed up the algorithm by using the optimum max subset of your best (so far) results"?
BTW, what's the purpose of this - it sounds like the intent is to torture interviewees.
You need to be clearer on whether the delivery truck has to return to base (making it a round trip), or not. If the truck does return, then a simple sort does not produce the shortest route, because the square of the return from the furthest point to base costs so much. Missing some hops on the way 'out' and using them on the way back turns out to be cheaper.
If you trick someone into a bad answer (for example, by not giving them all the information) then is it their foolishness or your deception that has caused it?
How great is the wisdom of the wise, if they heed not their ego's lies?