Efficient genetic algorithm

Consider this problem: given a vector of 1000 positive real numbers, find the optimal partition of the 1000 elements into 7 parts so that the sums of the parts have approximately equal (close) values.
How would you design the chromosome representation, the operators (mutation, crossover), the fitness function, the selection scheme, and so on, so that the problem is solved in the most efficient and optimized way?
My idea is to give each number an index (the lowest number has index 1, the highest has index 1000, for example), but I don't think this is the most efficient way. Any suggestions are welcome!

Since it's a partition problem, I think you need to have the whole set in a single chromosome. Say you have an array of length 1000 whose entries can take values from 1 to 7. The fitness function can calculate the difference between the sums of the partitions (less is better). Crossover can be done with a single point or double point. Mutation can then randomly change an individual gene from its value to another random value; say position 102 holds 4, it mutates to 1. With this solution you guarantee that every chromosome is a valid solution, although possibly a bad one, so you don't have to check after every iteration for chromosomes that break the problem rules (a problem you would have if you chose to have one chromosome per partition). As usual, the crossover criteria and the likelihood of mutation need exploration and tuning before achieving the best performance.
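A minimal sketch of this encoding in Python. The data, population size, mutation rate, and truncation selection scheme below are all illustrative placeholders to tune, not part of the original answer:

```python
import random

random.seed(0)
values = [random.uniform(0.1, 100.0) for _ in range(1000)]  # made-up input
K = 7  # number of parts

def random_chromosome():
    # One gene per element: gene i says which of the 7 parts element i joins.
    return [random.randrange(K) for _ in range(len(values))]

def fitness(chrom):
    # Spread of the part sums; smaller is better, 0 means a perfect split.
    sums = [0.0] * K
    for gene, v in zip(chrom, values):
        sums[gene] += v
    return max(sums) - min(sums)

def crossover(a, b):
    # Single-point crossover; every child is still a valid assignment.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.005):
    # Reassign a random gene to a random part (e.g. position 102: 4 -> 1).
    return [random.randrange(K) if random.random() < rate else g for g in chrom]

pop = [random_chromosome() for _ in range(50)]
for generation in range(200):
    pop.sort(key=fitness)
    survivors = pop[:25]  # simple truncation selection
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(25)]
    pop = survivors + children
print("best spread:", fitness(min(pop, key=fitness)))
```

Because every gene is always a valid part label, no repair step is needed after crossover or mutation, which is the main point of this representation.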

Related

Return N Optimal Choices for Multiple Choice Knapsack Variation

Problem
I'm trying to return N optimal answers (for an exact number, I want 250). From what I understand, dynamic programming returns one optimal answer, so I believe I have to use backtracking in order to generate N optimal answers.
For this knapsack variant, I have a maximum weight that the combination of objects must not exceed. I have 4 sets of objects, and exactly one must be chosen from each set to give the highest value without surpassing the weight constraint. Each object in the sets has a value and a weight.
The sets have 164, 201, 90 and 104 objects which means there are 308,543,040 variations to try. I have a brute force algorithm implemented but it takes forever.
Attempts At Optimization
So far, my attempt at optimizing is to preprocess the input sets by sorting by increasing weight (lowest first). At the addition of each object, if the combination's weight exceeds the weight constraint, I can skip the rest of the set, since all remaining objects are at least as heavy and will not be valid either. This check can be run at any level of the recursive function.
I also have a min-heap that stores the best values I've found so far. If the value of a combination of four objects is less than the top of the heap, it is not added; otherwise, I push-pop it onto the heap. I'm not sure if I can use this to prune the backtracking even further, since it requires all four objects to be selected. It's used more as validation than as a speed improvement.
Questions
Are there any other optimizations I can do with backtracking that will speed up the process of finding N optimal answers? Have I exhausted optimization and should just use multiple threads?
Is it possible to use dynamic programming with this? How can I modify dynamic programming to return N optimal choices?
Any other algorithms to look into?
Since exactly one item has to be picked from each set, you can try this optimization:
Let the sets be A,B,C,D.
Create all combinations of items from sets A and B together, and from sets C and D together. This has O(n^2) complexity, assuming the lists have length n. Call the resulting combination lists X and Y.
Sort X and Y based on weight. You can use something like a cumulative array to track the combination with the max possible value under a given weight. (Other data structures might be used for the same task as well, this is just a suggestion to highlight the underlying idea).
Create a max heap to store the combinations with the highest values.
For each combination in X, pick the combination in Y with the highest value under the constraint that its weight is <= target weight - X_combination_weight. Based on this combination's value, insert it into the max heap. A sketch of this idea follows.
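A sketch of this meet-in-the-middle idea in Python, with made-up example data (the item tuples, MAX_WEIGHT, and N below are illustrative, not from the question). Note that, as described, it keeps one best Y partner per X combination, which recovers the optimum and a strong candidate pool; recovering the exact N best answers would require keeping more than one partner per X combination:

```python
import bisect
import heapq
from itertools import product

A = [(10, 4), (7, 2), (12, 9)]   # (value, weight) pairs, illustrative
B = [(5, 3), (9, 6)]
C = [(8, 5), (4, 1)]
D = [(6, 2), (11, 8)]
MAX_WEIGHT = 18
N = 5

def combine(s1, s2):
    # All pairwise combinations of two sets: O(n^2) (value, weight) items.
    return [(v1 + v2, w1 + w2) for (v1, w1), (v2, w2) in product(s1, s2)]

X = combine(A, B)
Y = sorted(combine(C, D), key=lambda vw: vw[1])  # sort by weight
weights = [w for _, w in Y]

# Cumulative array: best[i] = highest value among Y[0..i].
best, hi = [], float("-inf")
for v, _ in Y:
    hi = max(hi, v)
    best.append(hi)

heap = []  # max heap via negated values
for xv, xw in X:
    # Rightmost Y entry whose weight fits the remaining budget.
    i = bisect.bisect_right(weights, MAX_WEIGHT - xw) - 1
    if i >= 0:
        heapq.heappush(heap, (-(xv + best[i]), xw))

top = [-heapq.heappop(heap)[0] for _ in range(min(N, len(heap)))]
print(top)
```

The sort and cumulative array turn each "best partner under a weight budget" query into a single binary search, so the whole pass over X costs O(n^2 log n) instead of the brute-force product of all four sets.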

Algorithmic help needed (N bags and items distributed randomly)

I have encountered an algorithmic problem but am not able to figure out anything better than brute force, or to reduce it to a better-known problem. Any hints?
There are N bags of variable sizes and N types of items. Each type of item belongs to one bag. There are lots of items of each type, and each item may have a different size. Initially, these items are distributed across all the bags randomly. We have to place the items in their respective bags. However, we can only operate on a pair of bags at one time, exchanging items (as much as possible) and then proceeding to the next pair. The aim is to reduce the total number of pairs. Edit: the aim is to find a sequence of transfers that minimizes the total number of bag pairs involved.
Clarification:
The bags are not arbitrarily large (you can assume the bag and item sizes to be integers between 0 and 1000 if it helps). You'll frequently encounter scenarios where all the items between 2 bags cannot be swapped due to the limited capacity of one of the bags. This is where the algorithm needs to make an optimisation: perhaps, if another pair of bags were swapped first, the current swap could be done in one go. To illustrate this, let's consider bags A, B and C and their item types 1, 2, 3 respectively. The number in brackets is the size.
A(10) : 3(8)
B(10): 1(2), 1(3)
C(10): 1(4)
The swap orders can be AB, AC, AB or AC, AB. The latter is optimal, as it uses fewer swaps.
Since I cannot come up with an algorithm that will always find an optimal answer, and an approximation of the fitness of the solution (number of swaps) is also fine, I suggest a stochastic local search algorithm with pruning.
Given a random starting configuration, this algorithm considers all possible swaps and makes a weighted decision based on chance: the better a swap is, the more likely it is to be chosen.
The value of a swap would be the sum of the values of the individual item transfers, where a transfer's value is zero if the item does not end up in the bag it belongs to, and positive if it does. The value increases as the item's size increases (the idea behind this is that a larger item is harder to move many times than smaller items). This fitness function can be replaced by any other fitness function; its efficiency is unknown until shown empirically.
Since any configuration can be the consequence of many preceding swaps, we keep track of which configurations we have seen before, along with a fitness (based on how many items are in their correct bag; this fitness is not related to the value of a swap) and the list of preceding swaps. If the fitness function for a configuration is the number of items that are in their correct bags, then the total number of items in the problem is the highest fitness (and therefore marks a configuration as a solution).
A swap is not possible if:
Either of the affected bags would hold more than its capacity after the potential swap.
The new swap brings you back to the configuration you were in before your last swap (i.e. a reversed swap).
When we identify potential swaps, we look into our list of previously seen configurations (use a hash function for O(1) lookup). Then we either set its preceding swaps to our preceding swaps (if our list is shorter than its), or we set our preceding swaps to its list (if its list is shorter than ours). We can do this because it does not matter which swaps we did, as long as the number of swaps is as small as possible.
If there are no more possible swaps left in a configuration, it means you're stuck. Local search tells you to 'reset', which you can do in many ways, for instance:
Reset to a previously seen state (maybe the best one you've seen so far?)
Reset to a new valid random solution
Note
Since the algorithm only allows you to do valid swaps, all constraints will be met for each configuration.
The algorithm does not guarantee to 'stop' out of the box; you can implement a maximum number of iterations (swaps).
The algorithm does not guarantee to find a correct solution; it only does its best to find a better configuration each iteration. However, since a perfect solution (set of swaps) should look very similar to an almost-perfect one, a human might be able to finish what the local search algorithm could not when it ends in an imperfect configuration (where not every item is in its correct bag).
The fitness functions and strategies used here are very likely not the most efficient out there. You could look around for better ones; a more efficient fitness function / strategy should result in a good solution faster (fewer iterations).
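A small sketch of the weighted local search in Python, on the A/B/C example above. All names are illustrative, and to keep the code short a "move" transfers a single item between two bags; a fuller version would enumerate multi-item exchanges and implement the reset and history-merging logic described above:

```python
import random

random.seed(1)
capacity = {"A": 10, "B": 10, "C": 10}
# Each item is (home_bag, size); the example data from the question above.
bags = {"A": [("C", 8)], "B": [("A", 2), ("A", 3)], "C": [("A", 4)]}

def load(bag):
    # Total size currently stored in a bag.
    return sum(size for _, size in bags[bag])

def solved():
    # Every item sits in the bag it belongs to.
    return all(home == bag for bag in bags for home, _ in bags[bag])

def candidate_moves():
    # (value, src, dst, item): value > 0 only if the item reaches its home
    # bag, and larger items score higher, as suggested above.
    moves = []
    for src in bags:
        for item in bags[src]:
            home, size = item
            for dst in bags:
                if dst == src or load(dst) + size > capacity[dst]:
                    continue  # skip: would overflow dst
                moves.append((size if dst == home else 0, src, dst, item))
    return moves

def signature():
    # Hashable snapshot of the configuration, for the seen-before check.
    return tuple(sorted((b, tuple(sorted(bags[b]))) for b in bags))

seen = {signature()}
for step in range(100):
    if solved():
        print("solved after", step, "moves")
        break
    moves = candidate_moves()
    if not moves:
        break  # stuck: a full implementation would reset here
    # Weighted random choice: better moves are more likely to be chosen.
    value, src, dst, item = random.choices(
        moves, weights=[1 + v for v, *_ in moves], k=1)[0]
    bags[src].remove(item)
    bags[dst].append(item)
    seen.add(signature())  # a fuller version would merge swap histories here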

How to use a genetic algorithm in MATLAB for selection of a specific number of features?

I am trying to select 3 features from a data set of 24*461. My problem is in the generation part: after crossover, a new chromosome can have more than three 1s and therefore more than three variables. In the mutation step, when a zero is changed to a one, the number of selected features also becomes more than 3. Any help will be greatly appreciated.
A common technique to solve this problem is to impose a "penalty": any chromosome that has more than three 1s has a penalty added. For example, if a chromosome has five 1s (two more than allowed), add a penalty of 2x to its fitness score. In this way, chromosomes with more than three 1s are gradually removed from the population, permitting individuals with three or fewer 1s to be maintained.
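A minimal sketch of the penalty idea (in Python rather than MATLAB; the principle carries over). The names `accuracy`, `PENALTY`, and `MAX_FEATURES` are illustrative, with `accuracy` standing in for whatever base fitness a feature subset earns:

```python
PENALTY = 2.0      # tuning knob: cost per extra selected feature
MAX_FEATURES = 3

def penalized_fitness(chromosome, accuracy):
    # chromosome: list of 0/1 flags, one per feature (461 in the question).
    selected = sum(chromosome)
    excess = max(0, selected - MAX_FEATURES)
    # Each extra 1 beyond the limit subtracts a penalty, so over-sized
    # subsets gradually lose out in selection and leave the population.
    return accuracy - PENALTY * excess

print(penalized_fitness([1, 0, 1, 1, 0], accuracy=0.9))  # no penalty
print(penalized_fitness([1, 1, 1, 1, 1], accuracy=0.9))  # penalized twice
```

The alternative is a repair step that flips random 1s back to 0 until only three remain, but the penalty approach leaves the standard crossover and mutation operators untouched.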

Genetic Algorithm Implementation for weight optimization

I am a data mining student and I have a problem that I was hoping that you guys could give me some advice on:
I need a genetic algo that optimizes the weights between three inputs. The weights need to be positive values AND they need to sum to 100%.
The difficulty is in creating an encoding that satisfies the sum to 100% requirement.
As a first pass, I thought that I could simply create a chromosome with a series of numbers (e.g. 4, 7, 9). Each weight would simply be its number divided by the sum of all of the chromosome's numbers (e.g. 4/20 = 20%).
The problem with this encoding method is that any change to the chromosome will change the sum of all the chromosome's numbers resulting in a change to all of the chromosome's weights. This would seem to significantly limit the GA's ability to evolve a solution.
Could you give any advice on how to approach this problem?
I have read about real valued encoding and I do have an implementation of a GA but it will give me weights that may not necessarily add up to 100%.
It is mathematically impossible to change one value without changing at least one more if you need the sum to remain constant.
One way to make changes would be exactly what you suggest: weight = value/sum. In this case when you change one value, the difference to be made up is distributed across all the other values.
The other extreme is to only change pairs. Start with a set of values that add to 100, and whenever 1 value changes, change another by the opposite amount to maintain your sum. The other could be picked randomly, or by a rule. I'd expect this would take longer to converge than the first method.
If your chromosome is only 3 values long, then mathematically, these are your only two options.
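A short sketch of both options in Python, assuming 3 weights (the step size and starting values are illustrative):

```python
import random

random.seed(2)

def normalized(chromosome):
    # First method: genes are free positive numbers, phenotype = gene / sum.
    total = sum(chromosome)
    return [g / total for g in chromosome]

def mutate_pair(weights, step=0.05):
    # Second method: keep the sum fixed by moving mass between two genes.
    i, j = random.sample(range(len(weights)), 2)
    delta = min(step, weights[i])  # never push a weight below zero
    weights = weights[:]
    weights[i] -= delta
    weights[j] += delta
    return weights

chrom = [4, 7, 9]
print(normalized(chrom))               # [0.2, 0.35, 0.45]
print(mutate_pair([0.2, 0.35, 0.45]))  # still sums to 1.0
```

Both keep every chromosome feasible by construction, so the GA never has to repair or reject a candidate for violating the sum constraint.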

Calculating the actual average value

I've got a relatively small set of integers (~100 values): each of them represents how much time (in milliseconds) a test I ran lasted.
The trivial algorithm for the average is to sum up all n values and divide the result by n, but this doesn't take into account that some ridiculously high/low values must be wrong and should be discarded.
What algorithms are available to estimate the actual average value?
As you said, you can discard all values that diverge more than a given amount from the average and then recompute the average. Another value that can be interesting is the median, which is the middle value of the sorted data (the most frequent value is the mode, a different statistic).
It depends on the conditions of your test, and it is a task from probability theory.
One of the simplest ways is to calculate the median, which copes with ridiculously high/low values. Look at the link below:
Wiki about median
As you noted, the arithmetic mean isn't good if there are very high/low values.
You could compute the median, as someone suggested, which is, in a sorted list of your values, the "middle" value (if your set contains an odd number of items) or the arithmetic mean of the two "middle" values (otherwise).
Another method would be to drop, say, the lowest and highest five percent of values and compute the arithmetic mean of the rest.
Some options:
First discard the N highest and lowest values and compute the arithmetic mean of the rest. Set N to a suitable value so that, for example, 1% or 10% of the values are discarded.
Use the median, or middle value.
Use the geometric mean, which gives less weight to the outliers.
Wikipedia lists some ways to compute different "mean" values
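A short sketch of these estimators in Python, on made-up timings; the outlier 900 drags the plain mean up but barely moves the median or the trimmed mean:

```python
import statistics

times = [101, 98, 103, 99, 97, 102, 100, 98, 104, 900]  # illustrative data

mean = statistics.mean(times)
median = statistics.median(times)
geo_mean = statistics.geometric_mean(times)

def trimmed_mean(data, fraction=0.1):
    # Drop the lowest and highest `fraction` of values, average the rest.
    data = sorted(data)
    k = int(len(data) * fraction)
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

print(mean, median, geo_mean, trimmed_mean(times))
```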
