Algorithm to find best combination in multi-dimensional array based on certain criteria - algorithm

I am looking for an algorithm to work out the best combination of cells based on certain criteria (best == use as many cells as possible, and if possible use all of them). The source is always a multi-dimensional array consisting of only 2 different elements. The goal is to use every element in the array, and each element can only be used once.
As an example :
Input (Multi-Dimensional array) :
AAAAA
ABBBA
AAAAA
where the first item's position is (0,0).
Criteria :
Every combination must contain at least one A and at least one B.
The maximum number of cells per combination allowed is 6.
The solution for this example is three different combinations that together use every cell from the input, with none exceeding the maximum of 6 cells per combination:
AA A AA
AB B BA
AA A AA
The first combination spans rows 0-2 and columns 0-1.
The second combination spans rows 0-2 and column 2.
The third combination spans rows 0-2 and columns 3-4.
A more realistic example would be:
ABBBAAA
BBBBABB
AABAABA
ABBABBB
AAAAAAB
AAAAAAB
Criteria :
Every combination must contain at least one A and at least one B.
The maximum number of cells per combination allowed is 5.
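To make the criteria concrete, here is a minimal sketch of a validity checker, assuming (as in the examples above) that combinations are rectangular regions; Python and all the names here are my assumptions, since the question specifies no language:

def is_valid(grid, regions, max_cells):
    # grid: list of equal-length strings; regions: (top, left, bottom, right),
    # inclusive bounds, 0-indexed as in the examples above.
    rows, cols = len(grid), len(grid[0])
    seen = set()
    for top, left, bottom, right in regions:
        cells = [(r, c) for r in range(top, bottom + 1)
                 for c in range(left, right + 1)]
        if len(cells) > max_cells:
            return False                       # region exceeds the cell limit
        if {grid[r][c] for r, c in cells} != {"A", "B"}:
            return False                       # needs at least one A and one B
        if seen & set(cells):
            return False                       # a cell was used twice
        seen.update(cells)
    return len(seen) == rows * cols            # every cell must be used

grid = ["AAAAA", "ABBBA", "AAAAA"]
print(is_valid(grid, [(0, 0, 2, 1), (0, 2, 2, 2), (0, 3, 2, 4)], 6))  # True

A search for the best solution could then enumerate candidate partitions (e.g. by backtracking over the uncovered cells) and keep the one passing this check with the fewest unused cells.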


Partitioning N arrays into K groups with constraints

I have been stuck on this problem and can't find an efficient solution for it.
I have N (up to 10 million) arrays of at most 100 elements each. These arrays contain numbers from 1 to 10000.
Now my problem is to partition these arrays into K groups so as to minimize the duplicates across all the arrays; e.g. for an array containing 1, 4, 10, 100 and another containing 1, 100, I would like them to go into the same group because that minimizes duplication. My problem has two constraints:
I don't want the number of unique elements in a group of arrays to grow beyond 110. So if I have an array of size 100 and another array of size 100 that is a 60% match, I would rather create a new group, because merging them increases the number of unique elements to 140, and this would keep increasing.
The number of vectors in the groups should be uniformly distributed.
What I have tried: grouping these arrays by size in decreasing order, then finding unique vectors via hashing and applying a greedy algorithm of maximum match subject to the constraints. But the greedy approach doesn't seem to work well, because the result depends entirely on which partitions I picked first. I couldn't figure out how DP could be applied, because the number of combinations, given the total number of vectors, is just huge. I am not sure what methodology I should take.
Some of the failure cases of my algorithm: say there are two vectors that are mutually exclusive of each other, but grouped together they would match a third vector 100%. Instead, that third vector matched just 30% in some other group and filled that group up; this increases my duplication, because the third vector should have formed a group with the first two.
A simple but compute- and memory-intensive approach: for each array, iterate over all 10 million arrays to find the one with the maximum number of matching elements, store the match counts in an array, and then find matches among those arrays similarly by iterating, with the criterion that a match should be at least 60%.
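For reference, here is a minimal sketch of the greedy baseline described above (Python is an assumption, as are all the names): each array joins the existing group whose set of unique elements would grow the least, and a new group is opened when the 110-unique-element cap or a per-group size cap would be violated. It inherits exactly the order-dependence problem described in the question.

def greedy_group(arrays, cap=110, group_size=None):
    groups = []  # each group holds its member arrays and their union of elements
    for arr in sorted(arrays, key=len, reverse=True):       # largest first
        elems = set(arr)
        best, best_growth = None, None
        for g in groups:
            if group_size and len(g["members"]) >= group_size:
                continue                    # keep group sizes roughly uniform
            union = g["uniq"] | elems
            if len(union) > cap:
                continue                    # would exceed the unique-element cap
            growth = len(union) - len(g["uniq"])   # fewer new uniques = more overlap
            if best is None or growth < best_growth:
                best, best_growth = g, growth
        if best is None:                    # no feasible group: open a new one
            best = {"members": [], "uniq": set()}
            groups.append(best)
        best["members"].append(arr)
        best["uniq"] |= elems
    return groups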

Algorithm X to Solve the Exact Cover: Fat Matrices

As I was reading about Knuth's Algorithm X to solve the exact cover problem, I thought of an edge case that I wanted some clarification on.
Here are my assumptions:
Given a matrix A, Algorithm X's "goal is to select a subset of the rows so that the digit 1 appears in each column exactly once."
If the matrix is empty, the algorithm terminates successfully and the solution is then the subset of rows logged in the partial solution up to that point.
If there is a column of 0's, the algorithm terminates unsuccessfully.
For reference: http://en.wikipedia.org/wiki/Algorithm_X
Consider the matrix A:
[[1 1 0]
[0 1 1]]
Steps I took:
Given Matrix A:
1. Choose a column, c, with the fewest 1's. I choose: column 1.
2. Choose a row, r, that contains a 1 in column c. I choose: row 1.
3. Add r to the partial solution.
4. For each column j such that A(r, j) = 1:
       for each row i such that A(i, j) = 1:
           delete row i
       delete column j
5. Matrix A is empty. The algorithm terminates successfully, and the solution is allegedly: {row 1}.
However, this is clearly not a valid solution, as row 1 consists of [1 1 0] and does not cover the 3rd column.
I would assume that the algorithm should at some point reduce the matrix to the point where there is only a single 0 and terminate unsuccessfully.
Could someone please explain this?
I think the confusion here is simply in the use of the term empty matrix. If you read Knuth's original paper (linked on the Wikipedia article you cited), you can see that he was treating the rows and columns as doubly-linked lists. When he says that the matrix is empty, he doesn't mean that it has no entries, he means that all the row and column objects have been deleted.
To clarify, I'll label the rows with lower case letters and the columns with upper case letters, as follows:
| A | B | C
---------------
a | 1 | 1 | 0
---------------
b | 0 | 1 | 1
The algorithm states that you choose a column deterministically (using any rule you wish), and he suggests choosing a column with the fewest number of 1's. So, we'll proceed as you suggest and choose column A. The only row with a 1 in column A is row a, so we choose row a and add it to the possible solution { a }. Now, row a has 1s in columns A and B, so we must delete those columns, and any rows containing 1s in those columns, that is, rows a and b, just as you did. The resulting matrix has a single column C and no rows:
| C
-------
This is not an empty matrix (it has a column remaining). However, column C has no 1s in it, so we terminate unsuccessfully, as the algorithm indicates.
This may seem odd, but it is a very important case if we intend to use an incidence matrix for the Exact Cover Problem, because columns represent elements of the set X that we wish to cover and rows represent subsets of X. So a matrix with some columns and no rows represents the exact cover problem where the collection of subsets to choose from is empty (but there are still points to cover).
If this description causes problems for your implementation, there is a simple workaround: just include the empty set in every problem. The empty set (containing no points of X) is represented by a row of all zeros. It is never selected by your algorithm as part of a solution, never collides with any other selected rows, but always ensures that the matrix is nonempty (there is at least one row) until all the columns have been deleted, which is really all you care about since you need to make sure that each column is covered by some row.
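To see these termination cases executed, here is a short recursive sketch of Algorithm X in Python, using plain dicts of sets rather than Knuth's dancing links (so it illustrates the logic, not his implementation). X maps each column to the set of rows with a 1 in it, and Y maps each row to the list of columns it covers; "the matrix is empty" means X has no columns left, and a column of 0s is a key mapped to an empty set.

def solve(X, Y, solution=None):
    solution = solution if solution is not None else []
    if not X:                               # truly empty: every column deleted
        yield list(solution)
        return
    c = min(X, key=lambda col: len(X[col])) # column with the fewest 1s
    for r in list(X[c]):                    # a 0-column makes this loop empty,
        solution.append(r)                  # so the branch fails, as it should
        removed = select(X, Y, r)
        yield from solve(X, Y, solution)
        deselect(X, Y, r, removed)
        solution.pop()

def select(X, Y, r):
    removed = []
    for j in Y[r]:
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].remove(i)          # prune rows conflicting with r
        removed.append(X.pop(j))            # delete the covered column
    return removed

def deselect(X, Y, r, removed):             # exact inverse of select
    for j in reversed(Y[r]):
        X[j] = removed.pop()
        for i in X[j]:
            for k in Y[i]:
                if k != j:
                    X[k].add(i)

# The matrix from the question: after selecting row a, X == {'C': set()},
# which is NOT empty (column C remains), so the search fails as expected.
X = {'A': {'a'}, 'B': {'a', 'b'}, 'C': {'b'}}
Y = {'a': ['A', 'B'], 'b': ['B', 'C']}
print(list(solve(X, Y)))                    # [] -- no exact cover exists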

Genetic Algorithm: 2D chromosome; crossover and mutation probabilities

Let me start with the version of the genetic algorithm I am implementing. I apologize in advance for any terminology errors I make here; please feel free to correct me.
The chromosome for my problem is two-dimensional: three rows and thirty-two columns. Essentially, the alleles (values) are indexes contained by this chromosome.
How an Index is formulated
Each row and column (together) of the chromosome refers to a single gene. Each gene contains an integer value (0 - 30). A single column (I believe referred to as a genome) therefore refers to an index into a four-dimensional array containing user data, on which the fitness function operates.
This is how a chromosome would look like
11 22 33 14 27 15 16 ...
 3 29  1  7 18 24 22 ...
29  9 16 10 14 21  3 ...
e.g. column 0 ==> data[11][3][29]
where
11 -> (0, 0); 0th row, 0th column
3 -> (1, 0); 1st row, 0th column
29 -> (2, 0); 2nd row, 0th column
For completeness, the fitness function works as follows: (for a single chromosome)
for the first 10 iterations: (user 0 to 9)
    for each column (genome):
        use the gene value in the FIRST row as the first index of the data array
        use the gene value in the SECOND row as the second index of the data array
        use the gene value in the THIRD row as the third index of the data array
        (so if the first column contains [11][3][29] and user = 0,
        it refers to data[0][11][3][29])
    SUM the data array values over all columns and save the sum
    do the same for each of these iterations (users)
for the second 10 iterations: (user 10 to 19)
    for each column (genome):
        use the gene value in the SECOND row as the FIRST index of the data array
        use the gene value in the THIRD row as the SECOND index of the data array
        use the gene value in the FIRST row as the THIRD index of the data array
    SUM the data array values over all columns and save the sum
    do the same for each of these iterations (users)
for the third 10 iterations: (user 20 to 29)
    for each column (genome):
        use the gene value in the THIRD row as the FIRST index of the data array
        use the gene value in the FIRST row as the SECOND index of the data array
        use the gene value in the SECOND row as the THIRD index of the data array
    SUM the data array values over all columns and save the sum
    do the same for each of these iterations (users)
Out of the 30 (sum) values calculated so far, assign the minimum value as the fitness
value of this chromosome.
The point of explaining the fitness function here is to explain the optimization problem I am dealing with. I am sorry I could not formulate it in mathematical notation; anyone who can is more than welcome to comment. Essentially it is maximizing the minimum X, where X refers to the data contained in the data array. (The maximizing happens over generations, where the highest-fitness chromosomes are selected for the next generation.)
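For concreteness, here is how that fitness function might look in Python (a sketch under the structure described above, with data indexed as data[user][i][j][k]; the names are mine, not from the question):

def fitness(chromosome, data):
    # Row-to-index rotation per block of 10 users, as described above:
    # users 0-9 use rows (0, 1, 2), users 10-19 rows (1, 2, 0),
    # users 20-29 rows (2, 0, 1).
    rotations = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
    sums = []
    for user in range(30):
        a, b, c = rotations[user // 10]
        total = 0
        for col in range(32):
            i, j, k = chromosome[a][col], chromosome[b][col], chromosome[c][col]
            total += data[user][i][j][k]
        sums.append(total)
    return min(sums)    # maximizing this over generations maximizes the minimum sum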
Q1) I am using a single random number generator for the crossover and mutation probabilities. Generally speaking, is it correct to implement this with a single generator? I ask because the crossover rate I chose is 0.7 and the mutation rate 0.01. My random number generator produces uniformly distributed integers between 0 and 2^31 - 1. If a generated number lies below the threshold that triggers mutation, the same number also triggers crossover. Does this affect the evolution process?
NOTE: the highest number the generator produces is 2147483647, and 1% of that value is 21474836. So whenever a number is less than 21474836, it says this gene should be mutated; but the same number also says that crossover must be done. Shouldn't there be different generators?
Q2) I see that there is a relation between the genes in a column when calculating fitness. But when performing mutation, should all the genes be considered independent of each other, or should all the rows of a genome (column) be affected by mutation together?
Explanation
As I learned, in a binary string of e.g. 1000 bits where each bit corresponds to a gene, a mutation rate of 1% means roughly 1 out of every 100 bits might get flipped. In my case, however, the chromosome is 2D (3 rows, 32 columns). Should I consider all 96 genes independent of each other, or just 32 genes, and whenever I need a flip, flip the whole column together? How does mutation work in a 2D chromosome?
Q3) Do I really have a correlation between rows here? I am a bit confused.
Explanation
I have a 2D chromosome whose column values together point to the data I have to use to calculate the fitness of the chromosome. The genetic algorithm manipulates chromosomes, whereas fitness is assigned based on the data associated with a chromosome. My question is how a genetic algorithm should treat a 2D chromosome. Should there be a relation between the genes in a column? Can I get a reference to some paper/code where a 2D chromosome is manipulated?
I'm not sure I understood the chromosome structure, but it doesn't matter; the concepts are the same:
1 - You have a chromosome object, which you can access the individual genes
2 - You have a fitness function, which takes a chromosome and outputs a value
3 - You have a selection function, which selects chromosomes to mate
4 - You have a crossover function, which generally takes 2 chromosomes, exchanges genes between them, and outputs two new chromosomes
5 - You have a mutation operator, which acts randomly on the genes of a chromosome
So
Q1) You can use a single random generator, there's no problem at all. But why are you using integers? It's much easier to generate a random float in [0, 1).
Q2) This is up to you, but generally the genes are mutated randomly, independently of each other (mutation happens after the crossover, but I think you already know that).
EDIT: Yes, you should consider all 96 genes independent of each other. Visit each 'row' and 'column' and mutate that gene with some probability p, so:

import random

for row in range(3):                # 3 rows
    for col in range(32):           # 32 columns
        if random.random() < p:     # mutate this gene with probability p
            chromosome[row][col] = random.randint(0, 30)   # new random allele
Q3) It's up to you to decide what the fitness function will do. If this chromosome is 'good' or 'bad' at solving your problem, then you should return a value that reflects that.
All the random numbers you use would typically be independently generated, so whether you use one RNG or many doesn't matter. You should generate a new number for each gene for the crossover and mutation steps; if you reuse the same single random number for multiple purposes, you will limit the explorable solution space.
To make your algorithm easier to understand, generate uniformly distributed floats in [0, 1) as r() = rand()/2^31 (your generator's maximum is 2^31 - 1); then you can express things simply as, for example:

if r() < 0.3:
    mutate()
I don't understand your other questions. Please rewrite them.
An improvement you can make regarding mutation and crossover probabilities is to build a GA that chooses these probabilities by itself. Because any given set of probabilities (or a function that evolves them with the number of runs) is always arbitrary, encode your operators inside the chromosomes.
For example, say you have two operators. Add a bit to the end of the chromosome, where 1 codes for mutation and 0 for crossover. When you apply operators to parents, you obtain children that carry the code for the operator to apply to them. In this way, the GA performs a double search: in the space of solutions and in the space of operators. The choice of operators is driven by the nature of your problem and by the concrete conditions of the run; during the calculation, the probabilities of both operators change automatically to maximize your objective function.
The same works for an arbitrary number of operators; you simply need more bits to encode them. I generally use four operators (three for crossover and one for mutation), and this mechanism works fine.
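As a concrete illustration of this self-adaptive scheme, here is a small sketch in Python (the chromosome layout, the two-operator set, and the probabilities are assumptions made for the example): each chromosome carries one extra gene selecting the operator used to produce its offspring, so selection pressure also acts on the operator choice.

import random

OPERATORS = ["crossover", "mutation"]       # assumed two-operator setup

def make_child(parent_a, parent_b):
    # The last gene encodes the operator: 0 = crossover, 1 = mutation.
    op = OPERATORS[parent_a[-1]]
    if op == "crossover":
        cut = random.randrange(1, len(parent_a) - 1)   # one-point crossover
        child = parent_a[:cut] + parent_b[cut:]
    else:
        child = parent_a[:]
        i = random.randrange(len(child) - 1)           # skip the operator gene
        child[i] = random.randint(0, 30)               # new random allele
    # The operator gene itself can mutate, so the population keeps exploring
    # the space of operators as well as the space of solutions.
    if random.random() < 0.05:
        child[-1] = random.randrange(len(OPERATORS))
    return child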

Need pairing algorithm - based on Hungarian?

Hungarian or Kuhn-Munkres algorithm (good description here) pairs objects from two sets (of n and m objects respectively, n >= m) so that the overall "difference" (or "cost" of assignment) between paired objects is minimal. One feature of the algorithm doesn't suit me, however: it does only exhaustive pairing, in the sense that it will pair all m objects with some of the n objects. Instead, I'd like to be able to create an arbitrary number k of pairs (k <= m) with minimal overall cost. For example, given a 50x30 input cost matrix, Kuhn-Munkres will optimally create all 30 pairs, while I need just 20 pairs, created optimally.
Can the Hungarian algorithm be modified to allow for this, or is there maybe a totally different algorithm that does it? I'd appreciate your answers highly.
Here are a few ideas to think about:
1) Suppose you write down your cost matrix with n columns and m rows. If n is greater than m, you add padding rows of constant large cost to make it square. A minimum cost assignment of rows to columns will now discard some columns by matching them to padding rows. Suppose you now add a padding column with very low cost for the ordinary rows and the constant large cost for the padding rows. The solution will match one of the proper rows to this column, to take advantage of the very low cost. This reduces the number of rows that match to something sensible. I think if you add m-k such columns, you will end up with a minimum cost matching that really assigns only k of the rows.
Here is an example of pairing 3 with 3 in a 5x5 matrix, where ? marks problem-specific values > 0 but < 100 (you may need more extreme values than 0 and 100 to force the sort of solution you want, depending on what your data values are):
? ? ? ? ? 0 0
? ? ? ? ? 0 0
? ? ? ? ? 0 0
? ? ? ? ? 0 0
? ? ? ? ? 0 0
100 100 100 100 100 100 100
100 100 100 100 100 100 100
I expect that an optimal solution will use two 0s from the far right and two 100s from the bottom rows. The remaining cells form a 3 x 3 matching within the square of ?s.
OK - here is a proof that adding columns and then rows as above produces the sort of matching you want:
Suppose you take a cost matrix with values 0 < x < 100 and add a border of s columns and s rows of 0s and 100s as above, then solve it as an assignment problem. Draw two lines at the border of the 0s and 100s, extending them to cut the square into four regions, where the region at the top left is the original matrix. If the assignment algorithm didn't choose any of the cells in the bottom right region, then it chose s cells in the top right region (to cover the s rightmost columns), so s rows of the original cost matrix in the top left region are paired with cells in a zero column. The other rows in the top region must be paired with a non-zero column, so you have a matching in the original region that leaves s rows, and so s columns, unpaired (that is, paired with a zero cell).
Is it possible that the assignment solution has any cells in the s x s lower right region chosen? Consider any such assignment. To see that at least one cell in the upper left region must then be chosen, suppose none are: we must somehow choose a cell from each of the top rows, presumably by picking cells from the top right region, each in a separate column; but there are only s columns in the top right region, which isn't enough, since we need one of them for each matching we want to skip and we have already used one column of this region to fill in a cell in the lower right region. So suppose the solution chooses at least one cell in the original upper left region and at least one cell in the lower right region. Pick the two other cells that make these into the four corners of a rectangle; these two cells are currently unchosen. If we choose those cells instead of the two that are currently chosen, we get a different feasible solution. The two new cells are a 0 cell from the top right and a 100 cell from the bottom left; they would replace a 100 cell from the bottom right and a cell of value greater than zero in the main matrix. So the swap would make our supposed solution better, so any solution that contains a cell in the bottom right region is not a best solution, and the assignment algorithm will not return it to us.
So this trick of adding columns of 0s and then rows of large values produces an assignment solution that omits one matching from the original problem for each (row, column) pair added.
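Here is a sketch of idea 1 in Python, using SciPy's Hungarian solver scipy.optimize.linear_sum_assignment; for simplicity it assumes a square cost matrix, and the choice of the "large" constant is an assumption you may need to tune to your data.

import numpy as np
from scipy.optimize import linear_sum_assignment

def k_best_pairs(cost, k):
    # cost: square m x m matrix of positive costs; returns k (row, col) pairs.
    m = cost.shape[0]
    s = m - k                           # number of pairings to skip
    big = cost.max() * (m + 1) + 1      # "large" relative to the data
    padded = np.full((m + s, m + s), big, dtype=float)
    padded[:m, :m] = cost               # original costs in the top left
    padded[:m, m:] = 0.0                # cheap escape columns for real rows
    rows, cols = linear_sum_assignment(padded)
    # Keep only the pairs that fall inside the original matrix.
    return [(r, c) for r, c in zip(rows, cols) if r < m and c < m]

cost = np.random.randint(1, 100, size=(5, 5))
print(k_best_pairs(cost, 3))            # 3 optimally chosen pairs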
2) The assignment problem is a special case of the minimum-cost flow problem (http://en.wikipedia.org/wiki/Minimum-cost_flow_problem). I think you want a minimum cost flow that transfers k units from rows to columns, so you could try solving it like this.
3) The minimum cost flow problem is a special case of linear programming. I think you could write down a linear program that assigns numbers in the range [0, 1] to cells of the matrix such that each row and each column sums to no more than 1 and the total over all cells is k. The objective function is then the sum over cells of each cell's value times its cost.
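For illustration, idea 3 might be sketched with scipy.optimize.linprog as follows (the variable x[i, j] is flattened row-major; all the names are mine). Because this LP is equivalent to the flow formulation in idea 2, an optimal basic solution should come out (near-)integral:

import numpy as np
from scipy.optimize import linprog

def k_pairs_lp(cost, k):
    m, n = cost.shape
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1      # each row sums to at most 1
    for j in range(n):
        A_ub[m + j, j::n] = 1               # each column sums to at most 1
    A_eq = np.ones((1, m * n))              # all cells together sum to k
    res = linprog(cost.ravel(), A_ub=A_ub, b_ub=np.ones(m + n),
                  A_eq=A_eq, b_eq=[k], bounds=(0, 1))
    x = res.x.reshape(m, n)
    return [(i, j) for i in range(m) for j in range(n) if x[i, j] > 0.5]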
Maybe your approach is wrong: the Hungarian algorithm works only for bipartite graphs. For a general (non-bipartite) graph, i.e. weighted matching, look here: http://en.wikipedia.org/wiki/Edmonds%27s_matching_algorithm. Or do you want to cheat and keep only the top ten of the maximum-cost pair matching?

random number with ratio 1:2

I have to generate two random sets of matrices, each containing three-digit numbers (anywhere from 2 to 10 of them), like this:
matrix 1: 994,878,129,121
matrix 2: 272,794,378,212
The numbers in both matrices have to be greater than 100 and less than 999,
BUT
the means of the two matrices have to be in a ratio of 1:2 or 2:3, whatever constraint the user inputs.
My math skills are kind of limited, so any ideas how I can make this happen?
In order to do this, you have to know how many numbers are in each list. I'm assuming from your example that there are four numbers in each.
Fill the first list with four random numbers.
Calculate the mean of the first list.
Multiply the mean by 2 or by 3/2, whichever the user input. This is the required mean of the second list.
Multiply by 4. This is the required total of the second list.
Generate 3 random numbers.
Subtract the total of the three numbers in step 5 from the total in step 4. This is the fourth number for the second list.
If the number in step 6 is not in the correct range, start over from step 5.
Note that the last number in the second list is not truly random, since it's based on the other values in the list.
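A direct sketch of these steps in Python (the four-numbers-per-list assumption matches the example; the retry cap and the regeneration of the first list are mine, since for some first lists no in-range fourth number exists):

import random

def generate_pair(ratio, n=4, lo=101, hi=998, max_tries=10000):
    while True:
        first = [random.randint(lo, hi) for _ in range(n)]            # steps 1-2
        target_total = sum(first) * ratio                             # steps 3-4
        for _ in range(max_tries):
            partial = [random.randint(lo, hi) for _ in range(n - 1)]  # step 5
            last = round(target_total - sum(partial))                 # step 6
            if lo <= last <= hi:                                      # step 7
                return first, partial + [last]
        # No valid fourth number found (the required mean may be out of
        # reach for this first list); regenerate the first list and retry.

pair = generate_pair(2)       # means in ratio 1:2; use 1.5 for 2:3

Note that for fractional ratios the last number is rounded, so the ratio holds only approximately; and as noted above, that last number is not truly random.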
You have a set of random numbers, s1.
s1= [ random.randint(100,999) for i in range(n) ]
For some other set, s2, to have a different mean, it simply has to have a different range: either you select values randomly from a different range, or you filter random values to get a different range.
No matter how many random numbers you select from the range 100 to 999, the mean will be very close to 550. The odds of the sample mean landing farther away fall off according to an (approximately normal) distribution on either side of that value.
You can't have a radically different mean with values selected from the same range.
