How do I represent percentages in an evolutionary algorithm? - algorithm

Suppose I have 4 chromosomes (gi, i = 1 to 4) that represent 4 percentages of different things, so that the 4 percentages sum to 100. How do I represent this efficiently?
I know that it is possible with g1/(g1+g2+g3+g4). However, this is not efficient: setting all gi = 0.2 or all gi = 0.1 gives 25% for every gene in both cases. It is possible to generate many cases where different genes represent the same percentages. Is there a more efficient way, where each unique combination of genes represents a unique set of percentages?
Thanks in advance.

I think you're confusing genes and chromosomes. A chromosome encodes a candidate solution to your problem. A gene is part of a chromosome.
Under this setting, why would you want that constraint on the chromosomes? It sounds like you want it on the genes of a chromosome.
In order to do this you can do a number of things: have each gene encode an integer in [0, 100]. If the genes do not add to 100 in the end, penalize the fitness of those chromosomes.
Another way, which might make crossover operators more natural to apply, is to have each gene store 100 bits. If x bits are set, that means the gene will encode x%.
Yet another way is to have the entire chromosome encode 100 set bits. Then each gene will hold a value x, which represents an interval. The number of set bits between two split points is the percentage associated to that gene. For example:
1 2 3 4 5 6 7 8 ... 100
1 1 1 1 1 1 1 1 ... 1
| | | | |
g1 g2 g3 g4
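This can be done with 5 split points in [0, 100]: generate 3 random numbers in that range, sort them together with the fixed endpoints 0 and 100, and take the differences between consecutive split points. The 4 differences are the percentages, and they always sum to 100.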
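The split-point idea can be sketched as follows. This is only an illustration under one assumption: the two outer split points are fixed at 0 and 100, so only the 3 interior cut points are random. The function name `random_percentages` and its parameters are made up for this example.

```python
import random


def random_percentages(n_genes=4, total=100, rng=random):
    """Return n_genes non-negative integers that sum to total.

    Assumption: the outer split points are fixed at 0 and total,
    so only n_genes - 1 interior cut points are drawn at random.
    """
    cuts = sorted(rng.randint(0, total) for _ in range(n_genes - 1))
    points = [0] + cuts + [total]
    # Each gene's percentage is the gap between two consecutive split points.
    return [b - a for a, b in zip(points, points[1:])]


if __name__ == "__main__":
    print(random_percentages())
```

Every set of sorted cut points maps to exactly one set of percentages, so no repair or penalty step is needed for individuals created this way.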

One way to assign X units to N possibilities is to store X * (N-1) bits. Every unit is given (N-1) bits and if k of the (N-1) bits are set then the unit is assigned to k.
This is easy to work with as there are no invalid solutions and no penalties/repairs are necessary. This makes fitness evaluation, crossover and mutation easier to implement.
For example, the problem is to assign 5 units (X) to one of 4 (N) possibilities. Each individual is (4-1)x5=15 bits.
The bit string: 010 100 000 011 111 assigns the first 2 units to possibility 1 because both groups have 1 bit set. The third unit which has no bits set is assigned to 0. The fourth unit is assigned to 2 and the fifth to 3.
partition  units
0          1
1          2
2          1
3          1
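A minimal decoding sketch for this encoding, assuming the individual is given as a string of '0' and '1' characters (spaces are ignored). The name `decode_assignment` is hypothetical.

```python
def decode_assignment(bits, n_possibilities):
    """Count how many units are assigned to each possibility.

    Each unit owns (n_possibilities - 1) bits; a unit with k set bits
    is assigned to possibility k.
    """
    group = n_possibilities - 1
    bits = bits.replace(" ", "")
    if len(bits) % group != 0:
        raise ValueError("bit string length must be a multiple of N - 1")
    counts = [0] * n_possibilities
    for start in range(0, len(bits), group):
        k = bits[start:start + group].count("1")
        counts[k] += 1
    return counts


if __name__ == "__main__":
    # The 15-bit example from the text: 5 units, 4 possibilities.
    print(decode_assignment("010 100 000 011 111", 4))  # [1, 2, 1, 1]
```

For the percentage question this would presumably mean X = 100 units and N = 4 possibilities, i.e. 300 bits per individual, with the counts read directly as percentages.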

Related

Distance algorithm - minimum coins required to clear all the level

Thor is playing a game where there are N levels and M types of available weapons. The levels are numbered from 0 to N-1 and the weapons are numbered from 0 to M-1. He can clear these levels in any order. In each level, some subset of these M weapons is required to clear this level. If in a particular level, he needs to buy x new weapons, he will pay x^2 coins for it. Also note that he can carry all the weapons he has currently to the next level. Initially, he has no weapons. Can you find out the minimum coins required such that he can clear all the levels?
Input Format
The first line of input contains 2 space separated integers:
N = the number of levels in the game
M = the number of types of weapons
N lines follow. The ith of these lines contains a binary string of length M. If the jth character of this string is 1, it means we need a weapon of type j to clear the ith level.
Constraints
1 <= N <= 20
1 <= M <= 20
Output Format
Print a single integer which is the answer to the problem.
Sample TestCase 1
Input
1 4
0101
Output
4
Explanation
There is only one level in this game. We need 2 types of weapons - 1 and 3. Since, initially, Thor has no weapons he will have to buy these, which will cost him 2^2 = 4 coins.
Sample TestCase 2
Input
3 3
111
001
010
Output
3
Explanation
There are 3 levels in this game. The 0th level (111) requires all 3 types of weapons. The 1st level (001) requires only weapon of type 2. The 2nd level requires only weapon of type 1. If we clear the levels in the given order (0-1-2), total cost = 3^2 + 0^2 + 0^2 = 9 coins. If we clear the levels in the order 1-2-0, it will cost = 1^2 + 1^2 + 1^2 = 3 coins, which is the optimal way.
The beauty of Gassa's answer is partly in the fact that if a different state can be reached by OR-ing one of the levels' bitstring masks with the current state, we are guaranteed that reaching the current state did not include visiting this level (since otherwise those bits would already be set). This means that checking a transition from one state to another by adding a different bitmask guarantees we are looking at an ordering that did not yet include that mask. So a formulation like Gassa's could work: let f(st) represent the cost of achieving state st, then:
f(st) = min(
    some known cost of f(st),
    f(prev_st) + (popcount(prev_st | level) - popcount(prev_st))^2
)
for all level and prev_st that OR to st
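The recurrence can be sketched as a forward pass over states, pushing each reachable state to the states obtained by OR-ing in one level mask. This is an illustrative version, not Gassa's code; `min_coins` is a made-up name and the levels are passed as a list of binary strings instead of being read from standard input.

```python
def min_coins(levels):
    """Minimum coins to clear all levels; levels is a list of binary strings."""
    masks = [int(s, 2) for s in levels]
    n_weapons = len(levels[0])
    full = 0
    for m in masks:
        full |= m  # weapons needed once every level has been cleared

    inf = float("inf")
    cost = [inf] * (1 << n_weapons)
    cost[0] = 0
    # OR-ing never clears bits, so every transition goes to a state with a
    # larger or equal index; one increasing pass is therefore enough.
    for state in range(1 << n_weapons):
        if cost[state] == inf:
            continue
        have = bin(state).count("1")
        for m in masks:
            nxt = state | m
            bought = bin(nxt).count("1") - have
            cand = cost[state] + bought * bought
            if cand < cost[nxt]:
                cost[nxt] = cand
    return cost[full]


if __name__ == "__main__":
    print(min_coins(["0101"]))               # 4
    print(min_coins(["111", "001", "010"]))  # 3
```

With M up to 20 this touches at most 2^20 states times N masks, which is small enough for the stated constraints, although plain Python may need a few seconds for the largest inputs.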

Efficient solution to find the number of unique binary vectors

Suppose that you have a binary vector of length N, where each element can be 0, 1 or X (X stands for either 0 or 1).
for example, given N = 4:
1001 is a single binary vector
1XX1 denotes four different binary vectors {1001, 1011, 1101, 1111}
Now suppose you have three different descriptions, e.g.
X11X
1XX1
11XX
What would be an efficient solution to find the number of unique binary vectors described by this set of specifications?
Note that a brute force solution becomes impractical when N grows, so listing every possible vector and deleting duplicates is not a viable solution. Also note that we just want to know the number of unique vectors but we don't need to compute their exact value.
Edit: the solution for this example would be:
X11X --> 0110 0111 1110 1111
1XX1 --> 1001 1011 1101 1111
11XX --> 1100 1101 1110 1111
Among these 12 vectors, we only want to count the unique ones, of which there are 8:
0110 0111 1110 1111 1001 1011 1101 1100
I'd use the inclusion-exclusion principle. You want to know the cardinality of the union of the sets. For your example you have:
N(X11X || 1XX1 || 11XX) = N(X11X) + N(1XX1) + N(11XX) -
N(X11X && 1XX1) - N(X11X && 11XX) - N(1XX1 && 11XX) +
N(X11X && 1XX1 && 11XX)
The cardinality of a "single" element is easy to calculate (2^Nx, where Nx is the number of X elements). For an intersection, you compare the patterns position by position. If both symbols are different from X and different from each other, you get zero. If they are equal digits, you get 1. If you have an X and a digit, you get 1. If you have X and X, you get 2. Then you multiply these numbers. An example:
N(X11X && 1XX1) = 1 * 1 * 1 * 1 = 1.
which corresponds to the only common sequence (1111). This can be easily generalized to any N and shouldn't be hard to implement in any language.
If the number of patterns stays small then you can solve this using an inclusion-exclusion type approach.
The number of binary vectors for each individual pattern is easy to compute: it is just the appropriate power of 2. Now the total number of distinct vectors is the sum of the numbers of binary vectors for each pattern individually, minus the number of common solutions of each pair of patterns, plus the number of common solutions of each triplet, and so on.
The common solutions of a set of patterns are again the solutions of a single pattern: if, at some position, one pattern has a 0 and another has a 1, then there is no common solution. Otherwise we obtain a pattern by placing a 0 or 1 at a position if one of the patterns has a 0 or 1 at this position, and an X if all patterns have an X at this position.
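A short sketch of the inclusion-exclusion count described in both answers. The helper names `intersect` and `count_union` are invented for this example. It enumerates every non-empty subset of patterns, so it is only practical while the number of patterns stays small.

```python
from itertools import combinations


def intersect(a, b):
    """Merge two patterns; return None if they have no common vector."""
    out = []
    for x, y in zip(a, b):
        if x == "X":
            out.append(y)
        elif y == "X" or x == y:
            out.append(x)
        else:
            return None  # a 0 meets a 1 at this position
    return "".join(out)


def count_union(patterns):
    """Number of distinct binary vectors matched by at least one pattern."""
    total = 0
    for size in range(1, len(patterns) + 1):
        sign = 1 if size % 2 else -1
        for subset in combinations(patterns, size):
            merged = subset[0]
            for p in subset[1:]:
                merged = intersect(merged, p)
                if merged is None:
                    break
            if merged is not None:
                total += sign * 2 ** merged.count("X")
    return total


if __name__ == "__main__":
    print(count_union(["X11X", "1XX1", "11XX"]))  # 8
```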

AND of all natural numbers lying between A and B both inclusive

We are required to compute the bitwise AND of all natural numbers lying between A and B, both inclusive. I came across this problem on a website, and here is the approach they used, but I couldn't understand the method. Can anyone explain this more clearly with an example?
In order to solve this problem, we just need to focus on the occurrences of each power of 2, which turn out to be cyclic. Now for each 2^i (the length of the cycle will be 2^(i+1), having 2^i zeros followed by the same number of ones) we just need to compute whether the bit stays 1 throughout the given interval, which is done by simple arithmetic. If so, that power of 2 will be present in the answer, otherwise it won't.
Let's count (unsigned) with 3 bits to visualize some numbers first:
000 = 0
001 = 1
010 = 2
011 = 3
100 = 4
101 = 5
110 = 6
111 = 7
If you look at the columns, you can see that the lowest bit alternates in runs of length 1, the next in runs of length 2, then 4, and the nth lowest bit alternates in runs of length 2^(n-1).
As soon as a bit has been 0 once, it stays 0 in the result (because 0 AND anything is 0).
You could also say the nth bit of the result is 1 only if the nth bit of both A and B is 1 and d < 2^(n-1), where d = B - A. In other words, a bit will only be 1 if it is 1 at the beginning and at the end and didn't have time to change to 0 in between because its run is too long.
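The rule above translates almost directly into code. In this sketch the function name `range_and` is made up, bits are indexed from 0 (so the threshold for bit i is 2^i), and d is taken to be B - A.

```python
def range_and(a, b):
    """Bitwise AND of all integers in [a, b], assuming 0 <= a <= b."""
    d = b - a
    result = 0
    for i in range(max(b.bit_length(), 1)):
        bit = 1 << i
        # Bit i survives only if it is set at both ends and the interval
        # is too short for the bit to have flipped to 0 in between.
        if a & bit and b & bit and d < bit:
            result |= bit
    return result


if __name__ == "__main__":
    print(range_and(5, 7))    # 4  (101 & 110 & 111 = 100)
    print(range_and(12, 15))  # 12
```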

Confusion regarding genetic algorithms

My book (Artificial Intelligence: A Modern Approach) says that genetic algorithms begin with a set of k randomly generated states, called the population. Each state is represented as a string over a finite alphabet, most commonly a string of 0s and 1s. For example, an 8-queens state must specify the positions of 8 queens, each in a column of 8 squares, and so requires 8 * log2(8) = 24 bits. Alternatively, the state could be represented as 8 digits, each in the range from 1 to 8.
[ http://en.wikipedia.org/wiki/Eight_queens_puzzle ]
I don't understand the expression 8 * log2(8) = 24 bits. Why log2(8)? And what are these 24 bits supposed to be for?
If we take the first example on the Wikipedia page, the solution can be encoded as [2,4,6,8,3,1,7,5]: the first digit gives the row number for the queen in column A, the second for the queen in column B, and so on. Now instead of starting the row numbering at 1, we will start at 0. The solution is then encoded as [1,3,5,7,2,0,6,4]. Any position can be encoded this way.
We only have digits between 0 and 7, so if we write them in binary, 3 bits (= log2(8)) are enough:
000 -> 0
001 -> 1
...
110 -> 6
111 -> 7
A position can be encoded using 8 times 3 bits, e.g. from [1,3,5,7,2,0,6,4] we get [001,011,101,111,010,000,110,100], or more briefly 001011101111010000110100: 24 bits.
In the other direction, the bitstring 000010001011100101111110 decodes as 000.010.001.011.100.101.111.110, then [0,2,1,3,4,5,7,6], and gives [1,3,2,4,5,6,8,7]: the queen in column A is on row 1, the queen in column B is on row 3, etc.
The number of bits needed to store the possible squares (8 possibilities, 0-7) is log2(8) = 3. Note that 111 in binary is 7 in decimal. You have to specify the square for 8 columns, so you need 3 bits 8 times.
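The encoding and decoding steps walked through above can be sketched like this. The function names `encode` and `decode` are chosen for this example, and rows are assumed to be given 1-based as on the Wikipedia page.

```python
def encode(rows):
    """Encode 8 row numbers (1..8, one per column) as a 24-bit string."""
    # Shift to 0..7 so that each row fits in log2(8) = 3 bits.
    return "".join(format(r - 1, "03b") for r in rows)


def decode(bits):
    """Decode a 24-bit string back into 8 row numbers (1..8)."""
    return [int(bits[i:i + 3], 2) + 1 for i in range(0, len(bits), 3)]


if __name__ == "__main__":
    print(encode([2, 4, 6, 8, 3, 1, 7, 5]))    # 001011101111010000110100
    print(decode("000010001011100101111110"))  # [1, 3, 2, 4, 5, 6, 8, 7]
```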

Special scheduling Algorithm (pattern expansion)

Question
Do you think genetic algorithms are worth trying out for the problem below, or will I hit local-minima issues?
I think some aspects of the problem are a good fit for a generator / fitness-function style setup. (If you've botched a similar project I would love to hear from you, so that I don't repeat the mistake.)
Thank you for any tips on how to structure things and nail this right.
The problem
I'm searching for a good scheduling algorithm to use for the following real-world problem.
I have a sequence with 15 slots like this (the digits may vary from 0 to 20):
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
(And there are in total 10 different sequences of this type)
Each sequence needs to expand into an array, where each slot can take 1 position.
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
1 1 0 0 1 1 1 0 0 0 1 1 1 0 0
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
0 0 1 1 0 0 0 1 1 1 0 0 0 1 1
The constraints on the matrix are:
[row-wise, i.e. horizontally] Each run of consecutive ones must be either 11 or 111
[row-wise] Two runs of ones must be separated by at least 00
The sum of each column should match the original array.
The number of rows in the matrix should be minimized.
The array then needs to be allocated to one of 4 different matrices, which may have different numbers of rows:
A, B, C, D
A, B, C and D are real-world departments. The load needs to be placed reasonably fair during the course of a 10-day period, not to interfere with other department goals.
Each of the matrices is compared with the expansions of the 10 different original sequences, so you have:
A1, A2, A3, A4, A5, A6, A7, A8, A9, A10
B1, B2, B3, B4, B5, B6, B7, B8, B9, B10
C1, C2, C3, C4, C5, C6, C7, C8, C9, C10
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
Certain spots on these may be reserved (Not sure if I should make it just reserved/not reserved or function-based). The reserved spots might be meetings and other events
The sum of each row (for instance all the A's) should be approximately the same within 2%. i.e. sum(A1 through A10) should be approximately the same as (B1 through B10) etc.
The number of rows can vary, so you have for instance:
A1: 5 rows
A2: 5 rows
A3: 1 row, where that single row could for instance be:
0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
etc..
Sub problem*
I'd be very happy to solve only part of the problem. For instance, being able to input:
1 1 2 3 4 2 2 3 4 2 2 3 3 2 3
And get an appropriate array of sequences of 1's and 0's, with the number of rows minimized, following the constraints above.
Sub-problem solution attempt
Well, here's an idea. This solution is not based on using a genetic algorithm, but some ideas could be used in going in that direction.
Basis vectors
First of all, you should generate what I think of as the basis vectors. For instance, if your sequence were 3 numbers long rather than 15, the basis vectors would be:
v1 = [1 1 0]
v2 = [0 1 1]
v3 = [1 1 1]
Any solution for sequence length 3 would be a linear combination of these three vectors using only non-negative integers. In other words, the general solution would be
a*v1 + b*v2 + c*v3
where a, b and c are non-negative integers. For the sequence [1 2 1], the solution is a = 1, b = 1, c = 0. What you first want to do is find all of the possible basis vectors of length 15. From my rough calculations I think that there are somewhere between 300-400 basis vectors of length 15. I can give you some tips towards generating them if you want.
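One possible way to generate the basis vectors is plain enumeration: try every 0/1 row of the given length and keep those that satisfy the row constraints from the question. This is a brute-force sketch under my reading of those constraints (every run of ones has length 2 or 3, and two runs of ones are separated by at least two zeros); `is_valid_row`, `basis_vectors` and the `min_gap` parameter are hypothetical names.

```python
from itertools import groupby, product


def is_valid_row(row, min_gap=2):
    """Check the row constraints: runs of ones are 11 or 111, gaps >= min_gap."""
    runs = [(bit, len(list(group))) for bit, group in groupby(row)]
    if not any(bit == 1 for bit, _ in runs):
        return False  # the all-zero row contributes nothing
    for i, (bit, length) in enumerate(runs):
        if bit == 1 and length not in (2, 3):
            return False
        interior = 0 < i < len(runs) - 1
        if bit == 0 and interior and length < min_gap:
            return False
    return True


def basis_vectors(n):
    """All valid rows of length n, as tuples of 0/1."""
    return [row for row in product((0, 1), repeat=n) if is_valid_row(row)]


if __name__ == "__main__":
    for v in basis_vectors(3):
        print(v)
```

For length 15 this checks 2^15 = 32768 candidates, which is cheap; `len(basis_vectors(15))` then gives the exact count under these assumptions.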
Finding solutions
Now, what you want to do is sort these basis vectors by their sums/magnitudes. Then, in searching for your solution, you start with the basis vectors which have the largest sums, because they lead to fewer rows in total. We also have an array, veccoefs, which contains an entry for the linear coefficient of each basis vector. At the beginning of the search, all the veccoefs are 0.
So we take the first basis vector (the one with the largest sum/magnitude) and subtract this vector from the sequence until we either create an unsolvable result (having a 0 1 0 in it, for instance) or any of the numbers in the result is negative. We store the number of times we subtracted the vector in veccoefs. We use the result after subtracting the basis vector from the sequence as the sequence for the next basis vector. If there are only zeros left in the result, then we stop the loop.
I'm not sure of the efficiency/accuracy of this method, but it might at least give you some ideas.
Other possible solutions
Another idea for solving this is to use the basis vectors and form the problem as an optimization/least squares problem. You form a matrix of the basis vectors such that the basic problem will be minimizing Sum[(Ax - b)^2] where A is the matrix of basis vectors, b is the input sequence, and x are the basis vector coefficients. However, you also want to minimize the number of rows, so you can add a term like x^T*x to the minimization function where x^T is the transpose of x. The hard part in my opinion is finding differentiable terms to add that will encourage integer vector coefficients. If you can think of a way to do that, then optimization could very well be a good way to do this.
Also, you might consider a Metropolis-type Monte Carlo solution. You would choose randomly whether to add a vector, remove a vector, or substitute a vector at each step. The vector to be added/removed/substituted would be chosen randomly. The probability of this change being accepted would be a ratio of the suitabilities of the solutions before the change and after the change. The suitability could be equal to the difference between the current solution and the sequence, squared and summed, minus the number of rows/basis vectors involved in the solution. You would need to put in appropriate constants for the various terms to try to get the acceptance rate around 50%. I kind of doubt that this will work very well, but I thought that you should still consider it when looking for possible solutions.
A GA can be applied to this problem, but it won't be a 5-minute task. You need to put several things together, without knowing which implementation of each of them is best.
So:
Solution representation - how will you represent a possible solution? Using a matrix seems to be the most straightforward. Using a collection of one-dimensional arrays is also possible.
But you have some constraints, so maybe the SuperGene concept is worth considering?
You must use proper mutation/crossover operators for the given gene representation.
How will you enforce constraints on solutions? Destroying those that are not valid? What if they contain valuable information? Maybe let them stay in the population but add some penalty to their fitness, so they will contribute to offspring but won't go into the next generations?
Anyway, I think that a GA can be applied to this problem. Is it worth it? Usually GAs are not the best algorithm, but they are a decent algorithm if others fail. I would go with a GA, just because it would be the most fun, but I would look for an alternative solution (just in case).
P.S. Personal insight: I was solving the N Queens Problem for 70 < N < 100 (board NxN, N queens). The algorithm was working fine for lower N (maybe it was trying all combinations?), but with N in this range, I couldn't find a valid solution. Fitness quickly jumped to about 90% of max, but in the end there were always two queens conflicting. But it was a very naive implementation.
