Genetic algorithms: How to do crossover in "subset" problems? - algorithm

I have a problem which I am trying to solve with genetic algorithms. The problem is selecting some subset (say of size 4) of 100 integers (these integers are just ids that represent something else). Order does not matter: the solution to the problem is a SET of integers, not an ordered list. I have a good fitness function but am having trouble with the crossover function.
I want to be able to mate the following two chromosomes:
[1 2 3 4] and
[3 4 5 6] into something useful. Clearly I cannot use the typical crossover function because I could end up with duplicates in my children, which would represent invalid solutions. What is the best crossover method in this case?

Just ignore any element that occurs in both of the sets (i.e. in their intersection); that is, leave such elements unchanged in both sets.
The rest of the elements form two disjoint sets, to which you can apply pretty much any random transformation (e.g. swapping some pairs randomly) without getting duplicates.
This can be thought of as ordering and aligning both sets so that matching elements face each other and applying one of the standard crossover algorithms.
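A minimal sketch of this idea in Python, assuming chromosomes are represented as sets (the function name intersection_crossover is mine):

import random

def intersection_crossover(p1, p2):
    # keep the intersection in both children, then randomly redistribute the
    # remaining, pairwise-disjoint elements between them
    common = p1 & p2
    rest = list((p1 | p2) - common)
    random.shuffle(rest)
    k = len(p1) - len(common)
    return common | set(rest[:k]), common | set(rest[k:])

For the parents {1, 2, 3, 4} and {3, 4, 5, 6}, the common part {3, 4} is kept in both children and the remaining genes 1, 2, 5, 6 are split randomly between them, so no child ever contains duplicates.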

Sometimes it is beneficial to let your solution go "out of bounds" so that your search will converge more quickly. Rather than making a set of 4 unique integers a requirement for your chromosome, make the number of integers (and their uniqueness) part of the fitness function.

Since order doesn't matter, just collect all the numbers into an array, sort the array, throw out the duplicates (by disconnecting them from a linked list, or setting them to a negative number, or whatever). Shuffle the array and take the first 4 numbers.
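As a sketch, the same steps in Python, using a set for de-duplication instead of sort-and-remove (the function name pool_crossover is mine):

import random

def pool_crossover(p1, p2, size=4):
    # collect all parent genes, de-duplicate, shuffle, and take the first `size`
    pool = list(set(p1) | set(p2))
    random.shuffle(pool)
    return set(pool[:size])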

I don't really know what you mean by "typical crossover", but I think you could use a crossover similar to what is often used for permutations:
take m ints from the first parent (m < n, where n is the number of ints in your sets)
scan the second and fill your subset from it with (n-m) ints that are free (not in the subset already).
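For instance, a sketch of those two steps in Python, sampling both parents in random order (the function name fill_crossover is mine):

import random

def fill_crossover(parent1, parent2, m):
    n = len(parent1)
    # take m genes from the first parent, chosen at random
    child = set(random.sample(list(parent1), m))
    # fill up with genes from the second parent that are not already in the child
    free = [x for x in parent2 if x not in child]
    child.update(random.sample(free, n - m))
    return child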
This way you will have m ints from the first and n-m ints from the second parent, without duplications.
Sounds like a valid crossover to me :-).
I guess it might be beneficial not to do either step on ordered sets (or using an iterator where the order of returned elements correlates with the natural ordering of the ints); otherwise either smaller or larger numbers will get a higher chance to be in the child, making your search biased.
Whether it is the best method depends on the problem you want to solve...

In order to combine sets A and B, you could choose the resulting set S probabilistically so that the probability that x is in S is (number of sets out of A, B, which contain x) / 2. This will be guaranteed to contain the intersection and be contained in the union, and will have expected cardinality 4.
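A one-function sketch of this in Python (the name probabilistic_union is mine): intersection elements survive with probability 1 and the remaining elements with probability 1/2.

import random

def probabilistic_union(a, b):
    # keep x with probability (number of parents containing x) / 2
    return {x for x in a | b if random.random() < ((x in a) + (x in b)) / 2}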

Related

Efficiently search for pairs of numbers in various rows

Imagine you have N distinct people and records of where these people have been; there are exactly M of these records.
For example
1,50,299
1,2,3,4,5,50,287
1,50,299
So you can see that 'person 1' is at the same place as 'person 50' three times. Here M = 3, since there are only 3 lines. My question is: given M of these lines and a threshold value (i.e. person A and B have been at the same place more than threshold times), what do you suggest as the most efficient way of returning these co-occurrences?
So far I've built an N by N table and looped through each row, incrementing table(i, j) every time persons i and j co-occur in a row. Obviously this is an awful approach and takes O(n^2) to O(n^3) depending on how you implement it. Any tips would be appreciated!
There is no need to create the table. Just create a hash/dictionary/whatever your language calls it. Then, for example in Python:
from collections import defaultdict
from itertools import combinations

count = defaultdict(int)
answer = []
for S in sets:
    # consider every unordered pair of people appearing in this record
    for (i, j) in combinations(sorted(S), 2):
        count[(i, j)] += 1
        if count[(i, j)] == threshold:
            answer.append((i, j))
If you have M sets of size K, the running time will be O(M*K^2).
If you want you can actually keep the list of intersecting sets in a data structure parallel to count without changing the big-O.
Furthermore the same algorithm can be readily implemented in a distributed way using a map-reduce. For the count you just have to emit a key of (i, j) and a value of 1. In the reduce you count them. Actually generating the list of sets is similar.
The relevant concept for your case is Market Basket Analysis. In this context there are different algorithms; for example, the Apriori algorithm can be applied to your case, restricted to item sets of size 2.
Moreover, to find association rules with a specific support and condition (which for your case is the threshold value), LSH and MinHash can be used as well.
You could use probability to speed it up, e.g. only check each pair with 1/50 probability. That will give you a 50x speed-up. Then double-check any pairs whose count comes close enough to 1/50th of M.
To double-check any pairs, you can either go through the whole list again, or you could double-check more efficiently if you do some clever kind of reverse indexing as you go. E.g. encode each person's row indices into 64-bit integers; you could use binary-search / merge-sort type techniques to see which 64-bit integers to compare, and use bit operations to compare 64-bit integers for matches. Other things to look up could be reverse indexing and binary indexed range trees / Fenwick trees.

Conditional Randomization

Imagine there is a list of elements as follow:
1a, 2a, 3a, 4a, 5b, 6b, 7b, 8b
Now we need to randomize it such that no more than 2 "a"s or 2 "b"s end up next to each other. For instance the following list is not allowed because of the second, third and fourth elements:
3a, 7b, 8b, 5b, 2a, 1a, 6b, 4a
How can we write efficient code for this without generating many random sequences and performing many triad comparisons?
Create two bins, one for the a's and one for the b's. Pick from a random bin and record the bin. Pick a second number from a random bin. If the bin is not the same as before just record the bin. If the bin is the same as before then force the next pick to be from the other bin. Carry on forward, only forcing a bin when you have two picks in succession from the same bin.
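A minimal sketch of this two-bin procedure in Python (the function name constrained_shuffle is mine). One caveat not mentioned above: the greedy picking can occasionally dead-end (e.g. when one bin empties while the other still needs a run of three), in which case the sketch simply raises and the shuffle can be retried.

import random

def constrained_shuffle(a_items, b_items):
    bins = {'a': list(a_items), 'b': list(b_items)}
    result, prev, run = [], None, 0
    while bins['a'] or bins['b']:
        if run == 2:
            # two picks in a row from the same bin: force the other bin
            choices = [k for k in ('a', 'b') if k != prev and bins[k]]
        else:
            choices = [k for k in ('a', 'b') if bins[k]]
        if not choices:
            raise ValueError("dead end; retry the shuffle")
        k = random.choice(choices)
        result.append(bins[k].pop(random.randrange(len(bins[k]))))
        run = run + 1 if k == prev else 1
        prev = k
    return result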
I'm going to assume that:
There are only two kinds of element, a and b, and
There aren't "too many" of either kind (say, less than 30) or that you're willing to use a bignum package.
The basic idea is to (conceptually) first construct a valid sequence of as and bs, and then randomly assign the actual elements to the as and bs in the sequence. In practice, you could do both of these steps in parallel; every time you add an a to the sequence, you select a random a element from the set of such elements not yet assigned, and similarly with b elements.
The (slightly) complicated part is constructing the valid sequence without bias, and that's what I'm going to focus on.
As is often the case, the key is to be able to count the number of possible sequences, in a way which leads to an enumeration. We don't actually enumerate the possibilities -- that would take a really long time for even moderately long sequences -- but we do need to know, for every prefix, how to enumerate the sequences starting with that prefix.
Rather than produce the sequence element by element, we'll produce it in chunks of one or two elements of the same kind. Since we don't allow more than two consecutive elements of the same kind, the final sequence must be a series of alternating chunks. In effect, at every point except the very beginning, the choice is whether to select one or two of the "other" kind. At the beginning, we must select one or two of either kind, so we must first choose the starting kind, after which all the kinds are fixed; we merely need a sequence of 1's and 2's -- representing one element or two elements of the same kind -- with the kind alternating at each step. The sequence of 1s and 2s is constrained by the fact that we know how many elements there are of each kind, which corresponds to the sum of the numbers in the even and odd positions of the {1,2}-sequence.
Now, let's define f(m,n) as the count of sequences whose even and odd sums are m and n. (Using CS rather than maths rules, we'll assume that the first position is 0 (even) but it actually makes absolutely no difference.) Suppose that we have 6 as and 4 bs. There are then f(6,4) sequences which start with an a, and f(4,6) sequences which start with a b, so that the total count of valid sequences is f(6,4)+f(4,6).
Now, suppose we need to compute f(m,n). Assuming m is large enough, we have exactly two options: choose one of the m elements of the even kind or choose two of the m elements of the even kind. After that, we will swap even and odd because the next choice applies to the other kind.
That rather directly leads to the recursion
f(m, n) = f(n, m-1) + f(n, m-2)
which we might think of as a kind of two-dimensional Fibonacci recursion. (Recall that fib(m) = fib(m-1) + fib(m-2); the difference here is the second argument, and the fact that the argument order flip-flops at each recursion.)
As with Fibonacci numbers, computing the values naively without memoization leads to exponential blow-up of recursive calls, and a more efficient strategy is to compute the entire table starting from f(0,0) (which has the value 1, obviously); in essence, a dynamic programming approach. We could also just do the recursive computation with memoization, which is slightly less efficient but possibly easier to read.
For now, let's just assume that we've arranged for the computation of f(m,n) to be suitably fast, either because we've prebuilt the entire array of possibilities up to the largest values of m and n we will need, or because we're using a memoizing recursive solution so that we only need to do the slow computation once for any given m,n. Now let's construct the random sequence.
Suppose there are na a-elements and nb b-elements. Since we don't know whether the random sequence will start with an a or a b, we need to first make that decision. We know there are f(na,nb) valid sequences which start with an a, and f(nb,na) valid sequences which start with a b, so we start by generating a random non-negative integer less than f(na,nb) + f(nb,na). If that random number is less than f(na,nb) then we'll start with a-elements; otherwise we'll start with b-elements.
Having made that decision, we'll proceed as follows. We know what the next element kind is and how many elements remain of each kind, so we only need to know whether to select one or two elements of the correct kind. To make that choice, we generate a non-negative random integer less than f(m, n); if it is less than f(n, m-1) then we select one element; otherwise we select two elements. Then we swap the element sets, fix the counts, and continue until m and n are both 0.
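A sketch of this procedure in Python, assuming the two kinds are labelled 'a' and 'b' (the names f and random_valid_sequence are mine); f(m, n) is memoized, and the actual elements can then be assigned to the generated kinds at random afterwards:

from functools import lru_cache
import random

@lru_cache(maxsize=None)
def f(m, n):
    # number of valid arrangements when m elements of the "current" kind and
    # n elements of the "other" kind remain, and the next chunk must come
    # from the current kind
    if m == 0:
        return 1 if n == 0 else 0
    total = f(n, m - 1)          # take one element of the current kind
    if m >= 2:
        total += f(n, m - 2)     # take two elements of the current kind
    return total

def random_valid_sequence(na, nb):
    # choose the starting kind in proportion to how many valid sequences
    # begin with each kind
    total = f(na, nb) + f(nb, na)
    if total == 0:
        raise ValueError("no valid arrangement exists")
    if random.randrange(total) < f(na, nb):
        m, n, kind = na, nb, 'a'
    else:
        m, n, kind = nb, na, 'b'
    seq = []
    while m > 0:
        # pick one or two elements, weighted by the number of completions
        take = 1 if random.randrange(f(m, n)) < f(n, m - 1) else 2
        seq.extend([kind] * take)
        m, n = n, m - take
        kind = 'b' if kind == 'a' else 'a'
    return seq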

Dynamic algorithm to multiply elements in a sequence two at a time and find the total

I am trying to find a dynamic programming approach to multiply each element in a linear sequence by the following element, do the same with the next pair of elements, and so on, and find the sum of all of the products. Note that not just any two elements can be multiplied: it must be the first with the second, the third with the fourth, and so on. All I know about the linear sequence is that there is an even number of elements.
I assume I have to store the numbers being multiplied, and their product each time, then check some other "multipliable" pair of elements to see if the product has already been calculated (perhaps they possess opposite signs compared to the current pair).
However, by my understanding of a linear sequence, the values must be increasing or decreasing by the same amount each time. But since there are an even amount of numbers, I don't believe it is possible to have two "multipliable" pairs be the same (with potentially opposite signs), due to the issue shown in the following example:
Sequence: { -2, -1, 0, 1, 2, 3 }
Pairs: -2*-1, 0*1, 2*3
Clearly, since there are an even amount of pairs, the only case in which the same multiplication may occur more than once is if the elements are increasing/decreasing by 0 each time.
I fail to see how this is a dynamic programming question, and if anyone could clarify, it would be greatly appreciated!
A quick google for define linear sequence gave
A number pattern which increases (or decreases) by the same amount each time is called a linear sequence. The amount it increases or decreases by is known as the common difference.
In your case the common difference is 1. And you are not considering any other case.
The same multiplication may occur in the following sequence
Sequence = {-3, -1, 1, 3}
Pairs = -3 * -1 , 1 * 3
with a common difference of 2.
However, this does not necessarily have to be solved by dynamic programming. You can just iterate over the numbers, store the product of each pair in a set (as a set contains unique values), and then find the sum.
Probably not what you are looking for, but I've found a closed solution for the problem.
Suppose we observe the first two numbers. Denote the first number by a and the difference between the numbers by d, and say there are 2n numbers in the whole sequence in total. Then the sum you defined is:
sum = na^2 + n(2n-1)ad + (4n^2 - 3n - 1)nd^2/3
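As a quick numerical check of this formula (hypothetical helper functions; integer a and d are assumed so that the division by 3 is exact):

def pair_sum(a, d, n):
    # brute force: multiply consecutive pairs of the arithmetic sequence
    seq = [a + i * d for i in range(2 * n)]
    return sum(seq[i] * seq[i + 1] for i in range(0, 2 * n, 2))

def closed_form(a, d, n):
    return n * a**2 + n * (2*n - 1) * a * d + (4*n**2 - 3*n - 1) * n * d**2 // 3

# e.g. for the sequence {-2, -1, 0, 1, 2, 3}:
# pair_sum(-2, 1, 3) == closed_form(-2, 1, 3) == 8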
That aside, I also failed to see how this is a dynamic problem, or at least this seems to be a problem where dynamic programming approach really doesn't do much. It is not likely that the sequence will go from negative to positive at all, and even then the chance that you will see repeated entries decreases the bigger your difference between two numbers is. Furthermore, multiplication is so fast the overhead from fetching them from a data structure might be more expensive. (mul instruction is probably faster than lw).

Find medians in multiple sub-ranges of an unordered list

E.g. given an unordered list of N elements, find the medians for sub-ranges 0..100, 25..200, 400..1000, 10..500, ...
I don't see any better way than going through each sub-range and running the standard median-finding algorithm.
A simple example: [5 3 6 2 4]
The median for 0..3 is 5 (not 4, since we are asking for the median of the first three elements of the original list).
INTEGER ELEMENTS:
If your elements are integers, then the best way is to have a bucket for each number that lies in any of your sub-ranges, where each bucket counts how many times its associated integer occurs in your input elements (for example, bucket[100] stores how many 100s there are in your input sequence). Basically you can achieve it in the following steps:
create a bucket for each number that lies in any of your sub-ranges.
iterate through all elements, for each number n, if we have bucket[n], then bucket[n]++.
compute the medians based on the aggregated values stored in your buckets.
To put it another way, suppose you have a sub-range [0, 10] and you would like to compute the median. The bucket approach basically computes how many 0s there are in your inputs, how many 1s there are, and so on. Suppose there are n numbers lying in the range [0, 10]; then the median is the (n/2)th largest element, which can be identified by finding the i such that bucket[0] + bucket[1] + ... + bucket[i] is greater than or equal to n/2 while bucket[0] + ... + bucket[i - 1] is less than n/2.
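A minimal sketch of this bucket approach for a single value range, assuming integer elements (the function name bucket_median and the use of collections.Counter are my own choices):

from collections import Counter

def bucket_median(values, lo, hi):
    # count how many times each integer in [lo, hi] occurs in the input
    buckets = Counter(v for v in values if lo <= v <= hi)
    n = sum(buckets.values())
    if n == 0:
        return None
    # walk the buckets in increasing order until the running count
    # reaches the middle position
    target = (n + 1) // 2
    running = 0
    for v in range(lo, hi + 1):
        running += buckets[v]
        if running >= target:
            return v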
The nice thing about this is that even if your input elements are stored on multiple machines (i.e., the distributed case), each machine can maintain its own buckets and only the aggregated values need to be passed over the network.
You can also use hierarchical buckets, which involves multiple passes. In each pass, bucket[i] counts the number of input elements lying in a specific range (for example, [i * 2^K, (i+1) * 2^K]); you then narrow down the problem space by identifying which bucket the median lies in after each step, decrease K by 1 in the next step, and repeat until you can pinpoint the median exactly.
FLOATING-POINT ELEMENTS
The entire input fits into memory:
If your entire input can fit into memory, first sorting the N elements and then finding the median of each sub-range is the best option. The linear-time heap solution also works well in this case if the number of your sub-ranges is less than log N.
The entire input cannot fit into memory but is stored on a single machine:
Generally, an external sort requires about three disk scans. Therefore, if the number of your sub-ranges is greater than or equal to 3, first sorting the N elements and then finding the median of each sub-range by loading only the necessary elements from disk is the best choice. Otherwise, simply performing a scan for each sub-range and picking up the elements in that sub-range is better.
The entire input is stored across multiple machines:
Since finding the median is a holistic operator, meaning you cannot derive the final median of the entire input from the medians of several parts of the input, it is a hard problem whose solution cannot be described in a few sentences, but there is research (see this as an example) focused on this problem.
I think that as the number of sub ranges increases you will very quickly find that it is quicker to sort and then retrieve the element numbers you want.
In practice, because there will be highly optimized sort routines you can call.
In theory, and perhaps in practice too, because since you are dealing with integers you need not pay n log n for a sort - see http://en.wikipedia.org/wiki/Integer_sorting.
If your data are in fact floating point and contain no NaNs, then a little bit-twiddling will in fact allow you to use integer sort on them - from http://en.wikipedia.org/wiki/IEEE_754-1985#Comparing_floating-point_numbers - The binary representation has the special property that, excluding NaNs, any two numbers can be compared like sign-and-magnitude integers (although with modern computer processors this is no longer directly applicable): if the sign bit is different, the negative number precedes the positive number (except that negative zero and positive zero should be considered equal), otherwise, relative order is the same as lexicographical order but inverted for two negative numbers; endianness issues apply.
So you could check for NaNs and other funnies, pretend the floating point numbers are sign + magnitude integers, subtract when negative to correct the ordering for negative numbers, and then treat as normal 2s complement signed integers, sort, and then reverse the process.
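In Python the bit-twiddling can be sketched as follows (a hypothetical helper; it maps each non-NaN double to an unsigned integer whose ordering matches the floating-point ordering, so an integer sort can be applied to the keys):

import struct

def float_sort_key(x):
    # reinterpret the IEEE-754 bits of the double as a 64-bit unsigned int
    (u,) = struct.unpack('<Q', struct.pack('<d', x))
    # negative numbers: flip all bits; non-negative numbers: set the sign bit
    return u ^ 0xFFFFFFFFFFFFFFFF if (u >> 63) else u | 0x8000000000000000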
My idea:
Sort the list into an array (using any appropriate sorting algorithm)
For each range, find the indices of the start and end of the range using binary search
Find the median by simply adding their indices and dividing by 2 (i.e. median of range [x,y] is arr[(x+y)/2])
Preprocessing time: O(n log n) for a generic sorting algorithm (like quick-sort) or the running time of the chosen sorting routine
Time per query: O(log n)
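A sketch of these steps in Python, treating each range [x, y] as a value range (bisect is the standard library's binary search; the function names are mine):

import bisect

def preprocess(values):
    # one-time sort; all queries then run against this array
    return sorted(values)

def range_median(sorted_vals, x, y):
    # binary-search the indices of the first element >= x and the last element <= y
    lo = bisect.bisect_left(sorted_vals, x)
    hi = bisect.bisect_right(sorted_vals, y) - 1
    if hi < lo:
        return None            # no elements fall inside [x, y]
    return sorted_vals[(lo + hi) // 2]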
Dynamic list:
The above assumes that the list is static. If elements can freely be added or removed between queries, a modified Binary Search Tree could work, with each node keeping a count of the number of descendants it has. This will allow the same running time as above with a dynamic list.
The answer is ultimately going to be "it depends". There are a variety of approaches, any one of which will probably be suitable under most of the cases you may encounter. The problem is that each is going to perform differently for different inputs. Where one may perform better for one class of inputs, another will perform better for a different class of inputs.
As an example, the approach of sorting and then performing a binary search on the extremes of your ranges and then directly computing the median will be useful when the number of ranges you have to test is greater than log(N). On the other hand, if the number of ranges is smaller than log(N) it may be better to move elements of a given range to the beginning of the array and use a linear time selection algorithm to find the median.
All of this boils down to profiling to avoid premature optimization. If the approach you implement turns out to not be a bottleneck for your system's performance, figuring out how to improve it isn't going to be a useful exercise relative to streamlining those portions of your program which are bottlenecks.

How to find the subset with the greatest number of items in common?

Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as an input (for example {a, c, d, e}) and finds the known set that has the highest number of elements in common with it and no other items, i.e., the known set contained in the input with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set and m is the number of 'known' sets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets, calculate the cardinality of the intersection with the input, and take the one with the highest value. This would be O(n) where n is the number of known sets.
Candidate 2:
Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
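A minimal sketch of the first idea above (sort the known sets by size and return the first one that is a subset of the input); the function name is hypothetical:

def largest_known_subset(known_sets, query):
    q = set(query)
    # examine the biggest known sets first; the first subset found wins
    for s in sorted(known_sets, key=len, reverse=True):
        if set(s) <= q:
            return s
    return None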
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
http://packages.python.org/bitstring
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.
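A sketch of this bit-string approach in Python, using arbitrary-precision ints rather than the bitstring package (build_index and best_match are my names); the subset check inside the loop is my addition, restricting candidates to known sets fully contained in the query, as the question asks:

def build_index(known_sets):
    # one-time preprocessing: give each known element a bit position and
    # encode every known set as an integer bit string
    universe = sorted(set().union(*known_sets))
    pos = {e: i for i, e in enumerate(universe)}
    bits = [sum(1 << pos[e] for e in s) for s in known_sets]
    return pos, bits

def best_match(known_sets, known_bits, pos, query):
    qbits = sum(1 << pos[e] for e in query if e in pos)
    best, best_count = None, -1
    for s, b in zip(known_sets, known_bits):
        if b & qbits != b:
            continue                      # s has items outside the query: not a subset
        ones = bin(b & qbits).count("1")  # count the '1' elements in the intersection
        if ones > best_count:
            best, best_count = s, ones
    return best, best_count

For the example sets above, best_match with input {'a', 'c', 'd', 'e'} would return ({'a', 'c', 'd'}, 3).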
How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.
Eg: Create a Trie structure with all the known sets. Make sure to sort each set before inserting them. For the query, follow the links that are in the set.
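A sketch of such a trie in Python (class and function names are mine): inserting each set in sorted order and, at query time, only following links whose element is in the query guarantees that every stored set reached is a subset of the query.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.stored = None          # the known set that ends at this node, if any

def insert(root, known_set):
    node = root
    for e in sorted(known_set):     # sort before inserting, as suggested above
        node = node.children.setdefault(e, TrieNode())
    node.stored = known_set

def best_subset(root, query):
    q, best = set(query), None
    def dfs(node):
        nonlocal best
        if node.stored is not None and (best is None or len(node.stored) > len(best)):
            best = node.stored
        for e, child in node.children.items():
            if e in q:              # only follow links whose element is in the query
                dfs(child)
    dfs(root)
    return best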
