Arranging groups of people optimally - algorithm

I have this homework assignment that I think I managed to solve, but am not entirely sure as I cannot prove my solution. I would like comments on what I did, its correctness and whether or not there's a better solution.
The problem is as follows: we have N groups of people, where group ihas g[i]people in it. We want to put these people on two rows of S seats each, such that: each group can only be put on a single row, in a contiguous sequence, OR if the group has an even number of members, we can split them in two and put them on two rows, but with the condition that they must form a rectangle (so they must have the same indices on both rows). What is the minimum number of seats S needed so that nobody is standing up?
Example: groups are 4 11. Minimum S is 11. We put all 4 in one row, and the 11 on the second row. Another: groups are 6 2. We split the 6 on two rows, and also the two. Minimum is therefore 4 seats.
This is what I'm thinking:
Calculate T = (sum of all groups + 1) / 2. Store the group numbers in an array, but split all the even values x in two values of x / 2 each. So 4 5 becomes 2 2 5. Now run subset sum on this vector, and find the minimum value higher than or equal to T that can be formed. That value is the minimum number of seats per row needed.
Example: 4 11 => 2 2 11 => T = (15 + 1) / 2 = 8. Minimum we can form from 2 2 11 that's >= 8 is 11, so that's the answer.
This seems to work, at least I can't find any counter example. I don't have a proof though. Intuitively, it seems to always be possible to arrange the people under the required conditions with the number of seats supplied by this algorithm.
Any hints are appreciated.

I think your solution is correct. The minimum number of seats per row in an optimal distribution would be your T (which is mathematically obvious).
Splitting even numbers is also correct, since they have two possible arrangements; by logically putting all the "rectangular" groups of people on one end of the seat rows you can also guarantee that they will always form a proper rectangle, so that this condition is met as well.
So the question boils down to finding a sum equal or as close as possible to T (e.g. partition problem).

Minor nit: I'm not sure if the proposed solution above works in the edge case where each group has 0 members, because your numerator in T = SUM ALL + 1 / 2 is always positive, so there will never be a subset sum that is greater than or equal to T.
To get around this, maybe a modulus operation might work here. We know that we need at least n seats in a row if n is the maximal odd term, so maybe the equation should have a max(n * (n % 2)) term in it. It will come out to max(odd) or 0. Since the maximal odd term is always added to S, I think this is safe (stated boldly without proof...).
Then we want to know if we should split the even terms or not. Here's where the subset sum approach might work, but with T simply equal to SUM ALL / 2.

Related

Possible NxN matrices, t 1's in each row and column, none in diagonal?

Background:
This is extra credit in a logic and algorithms class, we are currently covering propositional logic, P implies Q that kind of thing, so I think the Prof wanted to give us and assignment out of our depth.
I will implement this in C++, but right now I just want to understand whats going on in the example....which I don't.
Example
Enclosed is a walkthrough for the Lefty algorithm which computes the number
of nxn 0-1 matrices with t ones in each row and column, but none on the main
diagonal.
The algorithm used to verify the equations presented counts all the possible
matrices, but does not construct them.
It is called "Lefty", it is reasonably simple, and is best described with an
example.
Suppose we wanted to compute the number of 6x6 0-1 matrices with 2 ones
in each row and column, but no ones on the main diagonal. We first create a
state vector of length 6, filled with 2s:
(2 2 2 2 2 2)
This state vector symbolizes the number of ones we must yet place in each
column. We accompany it with an integer which we call the "puck", which is
initialized to 1. This puck will increase by one each time we perform a ones
placement in a row of the matrix (a "round"), and we will think of the puck as
"covering up" the column that we wonít be able to place ones in for that round.
Since we are starting with the first row (and hence the first round), we place
two ones in any column, but since the puck is 1, we cannot place ones in the
first column. This corresponds to the forced zero that we must place in the first
column, since the 1,1 entry is part of the matrixís main diagonal.
The algorithm will iterate over all possible choices, but to show each round,
we shall make a choice, say the 2nd and 6th columns. We then drop the state
vector by subtracting 1 from the 2nd and 6th values, and advance the puck:
(2 1 2 2 2 1); 2
For the second round, the puck is 2, so we cannot place a one in that column.
We choose to place ones in the 4th and 6th columns instead and advance the
puck:
(2 1 2 1 2 0); 3
Now at this point, we can place two ones anywhere but the 3rd and 6th
columns. At this stage the algorithm treats the possibilities di§erently: We
can place some ones before the puck (in the column indexes less than the puck
value), and/or some ones after the puck (in the column indexes greater than
the puck value). Before the puck, we can place a one where there is a 1, or
where there is a 2; after the puck, we can place a one in the 4th or 5th columns.
Suppose we place ones in the 4th and 5th columns. We drop the state vector
and advance the puck once more:
(2 1 2 0 1 0); 4
1
For the 4th round, we once again notice we can place some ones before the
puck, and/or some ones after.
Before the puck, we can place:
(a) two ones in columns of value 2 (1 choice)
(b) one one in the column of value 2 (2 choices)
(c) one one in the column of value 1 (1 choice)
(d) one one in a column of value 2 and one one in a column of value 1 (2
choices).
After we choose one of the options (a)-(d), we must multiply the listed
number of choices by one for each way to place any remaining ones to the right
of the puck.
So, for option (a), there is only one way to place the ones.
For option (b), there are two possible ways for each possible placement of
the remaining one to the right of the puck. Since there is only one nonzero value
remaining to the right of the puck, there are two ways total.
For option (c), there is one possible way for each possible placement of the
remaining one to the right of the puck. Again, since there is only one nonzero
value remaining, there is one way total.
For option (d), there are two possible ways to place the ones.
We choose option (a). We drop the state vector and advance the puck:
(1 1 1 0 1 0); 5
Since the puck is "covering" the 1 in the 5th column, we can only place
ones before the puck. There are (3 take 2) ways to place two ones in the three
columns of value 1, so we multiply 3 by the number of ways to get remaining
possibilities. After choosing the 1st and 3rd columns (though it doesnít matter
since weíre left of the puck; any two of the three will do), we drop the state
vector and advance the puck one final time:
(0 1 0 0 1 0); 6
There is only one way to place the ones in this situation, so we terminate
with a count of 1. But we must take into account all the multiplications along
the way: 1*1*1*1*3*1 = 3.
Another way of thinking of the varying row is to start with the first matrix,
focus on the lower-left 2x3 submatrix, and note how many ways there were to
permute the columns of that submatrix. Since there are only 3 such ways, we
get 3 matrices.
What I think I understand
This algorithm counts the the all possible 6x6 arrays with 2 1's in each row and column with none in the descending diagonal.
Instead of constructing the matrices it uses a "state_vector" filled with 6 2's, representing how many 2's are in that column, and a "puck" that represents the index of the diagonal and the current row as the algorithm iterates.
What I don't understand
The algorithm comes up with a value of 1 for each row except 5 which is assigned a 3, at the end these values are multiplied for the end result. These values are supposed to be the possible placements for each row but there are many possibilities for row 1, why was it given a one, why did the algorithm wait until row 5 to figure all the possible permutations?
Any help will be much appreciated!
I think what is going on is a tradeoff between doing combinatorics and doing recursion.
The algorithm is using recursion to add up all the counts for each choice of placing the 1's. The example considers a single choice at each stage, but to get the full count it needs to add the results for all possible choices.
Now it is quite possible to get the final answer simply using recursion all the way down. Every time we reach the bottom we just add 1 to the total count.
The normal next step is to cache the result of calling the recursive function as this greatly improves the speed. However, the memory use for such a dynamic programming approach depends on the number of states that need to be expanded.
The combinatorics in the later stages is making use of the fact that once the puck has passed a column, the exact arrangement of counts in the columns doesn't matter so you only need to evaluate one representative of each type and then add up the resulting counts multiplied by the number of equivalent ways.
This both reduces the memory use and improves the speed of the algorithm.
Note that you cannot use combinatorics for counts to the right of the puck, as for these the order of the counts is still important due to the restriction about the diagonal.
P.S. You can actually compute the number of ways for counting the number of n*n matrices with 2 1's in each column (and no diagonal entries) with pure combinatorics as:
a(n) = Sum_{k=0..n} Sum_{s=0..k} Sum_{j=0..n-k} (-1)^(k+j-s)*n!*(n-k)!*(2n-k-2j-s)!/(s!*(k-s)!*(n-k-j)!^2*j!*2^(2n-2k-j))
According to OEIS.

Specific knapsack-like

I am looking for some ideas how to deal with this specific knapsack problem (I believe it is knapsack-like problem although I might be mistaken).
As input we get set of numbers, and each can be positive or negative - we don't know that.
We have to find minimum possible absolute value of sum of some these numbers.
We don't have to use all numbers. We have to do additions (or subtractions) in same order in which numbers are given and we have to start with first number (and add or subtract following ones).
Example would be:
4 11 5 5 => 0
because 4+11-5-5 = 0
10 3 9 4 100 => 2
because 10-3-9 = -2
In second example we skipped two last numbers - because adding next numbers wouldn't give us smaller absolute number.
Amount of numbers can be up to 5,000
, and the sum of them won't over exceed 10,000
They are integers
If you were to explore all combinations of addition and subtraction of 5000 numbers, you would have to go through 25000−1 ≈ 1.4⋅101505 alternatives. That's obviously not reasonable. However, since the sum of the numbers is at most 10000, we know that all partial sums (including subtraction) must lie between -10000 and 10000, so there can be less than 20000 different sums. If you only consider different sums when you work through the 5000 positions you have less than 100 million sums to consider, which is not that much work for a computer.
Example: suppose the first three numbers are 5,1,1. The possible sums that include exactly three numbers are
5+1+1=7
5+1-1=5
5-1+1=5
5-1-1=3
Before adding the fourth number it is important to recognize that you have only three unique results from the four computations.

Heuristics for this (probably) NP-complete puzzle game

I asked whether this problem was NP-complete on the Computer Science forum, but asking for programming heuristics seems better suited for this site. So here it goes.
You are given an NxN grid of unit squares and 2N binary strings of length N. The goal is to fill the grid with 0's and 1's so that each string appears once and only once in the grid, either horizontally (left to right) or vertically (top down). Or determine that no such solution exists. If N is not fixed I suspect this is an NP-complete problem. However are there any heuristics that can hopefully speed up the search to faster than brute force trying all ways to fill in the grid with N vertical strings?
I remember programming this for my friend that had the 5x5 physical version of this game, but I used brute force back then. I can only think of this heuristic:
Consider a 4x4 map with these 8 strings (read each from left to right):
1 1 0 1
1 0 0 1
1 0 1 1
1 0 1 0
1 1 1 1
1 0 0 0
0 0 1 1
1 1 1 0
(Note that this is already solved, since the second 4 is the first 4 transposed)
First attempt:
We will choose columns from left to right. Since 7 of 8 strings start with 1, we will try to put the one with most 1s to the first column (so that we can lay rows more easily when columns are done).
In the second column, most string have 0, so you can also try putting a string with most zeros to the second row, and so on.
This i would call a wide-1 prediction, since it only looks at one column at a time
(Possible) Improvement:
You can look at 2 columns at a time (a wide-2 prediction, if i may call it like that). In this case, from the 8 strings, the most common combination of first two bits is 10 (5/8), so you would like to choose first two columns so the the combination 10 occurring as much as possible (in this case, 1111 followed by 1000 has 3 of 4 10 at start).
(Of course you don't have to stop at 2)
Weaknesses:
I don't know if this would work. I just made it up and thought it might work.
If you choose to he wide-X prediction, the number of possibilities is exponential with X
This can absolutely fail if the distribution of combinations if even.
What you can do:
As i said, this game has physical 5x5 adaptation, only there you can also lay the string from right-to-left and bottom-to-top, if you found that name, you could google further. I unfortunately don't remember it.
Sounds like you want the crossword grid filling algorithm:
First, build 2N subsets of your 2N strings -- each subset has all the strings with a particular bit at a particular postion. So subset(0,3) is all the strings that have a 0 in the 3rd position and subset(1,5) is all the strings that have a 1 in the 5th position.
The algorithm is a basic brute-force depth fist search trying all possible mappings of strings to slots in the grid, with severe pruning of impossible branches
Your search state is a set of assignments of strings to slots and a set of sets of possible assignments to the remaining slots. The initial state has 0 assignments and 2N sets, all of which contain all 2N strings.
At each step of the search, pick the most constrained set (the set with the fewest elements) from the set of possible sets. Try each element of the set in turn in that slot (adding it to the assigments and removing it from the set of sets), and constrain all the remaining sets of sets by removing the chosen string and intersecting the crossing sets with subset(X,N) (computed in step 1) where X is the bit from the chosen string and N is the row/column number of the chosen string
If you find an empty set when picking above, there is no solution with the choices so far, so backtrack up the tree to a different choice
This is still EXPTIME, but it is about as fast as you can get it. Since the main time consuming step is the set intersections, using 2N bit binary strings for your set representation is very fast -- for N=32, the sets fit in a 64-bit word and can be intersected with a single AND instruction. It also helps to have a POPCOUNT instruction, since you also need set sizes.
This can be solved as a 0/1 integer linear program with O(N^2) variables and constraints. First there are variables Xij which are 1 if string i is assigned to line j (where j=1 to N are rows and j = (N+1) to 2N are columns). Then there is a variable for each square in the grid, which indicates if the entry is 0 or 1. If the position of the square is (i,j) with variable Yij then the sum of all X variables for line j that correspond to strings that have a 1 in position i is equal to Yij, and the sum of all X variables for line j that correspond to strings that have a 0 in position i is equal to (1 - Yij). And similarly for line i and position j. Finally, the sum of all X variables Xij for each string i (summed over all lines j) is equal to 1.
There has been a lot of research in speeding up solvers for 0/1 integer programming so this may be able to often handle fairly large N (like N=100) for many examples. Also, in some cases, solving the relaxed non-integer linear program and rounding the solution off to 0/1 may produce a valid solution, in polynomial time.
We could choose the first lg 2N rows out of the 2N strings, and then since 2^(lg 2N) = 2N, in a lot of cases there shouldn't be very many ways to assign the N columns so that the prefixes of length lg 2N are respected. Then all the rows are filled in so they can be checked to see if a solution has been found. We can also try assigning more rows in the beginning, and fill in different combinations of rows besides the initial rows. (e.g. we can try filling in contiguous rows starting anywhere in the grid).
Running time for assigning lg 2N rows out of 2N strings is O((2N)^(lg 2N)) = O(2^((lg 2N)^2)), which grows slower than 2^N. Assigning columns to match the prefixes is the part that's the hardest to predict run time. If a prefix occurs K times among the assigned rows, and there are M remaining strings that have the prefix, then the number of assignments for this prefix is M*(M-1)...(M-K+1). The total number of possible column assignments is the product of these terms over all prefixes that occur among the rows. If this gets to be too large, the number of rows initially assigned can be increased. But it's hard to predict the worst-case run time unless an assumption is made like the NxN grid is filled in randomly.

Solving ACM ICPC - SEERC 2009

I have been sitting on this for almost a week now. Here is the question in a PDF format.
I could only think of one idea so far but it failed. The idea was to recursively create all connected subgraphs which works in O(num_of_connected_subgraphs), but that is way too slow.
I would really appreciate someone giving my a direction. I'm inclined to think that the only way is dynamic programming but I can't seem to figure out how to do it.
OK, here is a conceptual description for the algorithm that I came up with:
Form an array of the (x,y) board map from -7 to 7 in both dimensions and place the opponents pieces on it.
Starting with the first row (lowest Y value, -N):
enumerate all possible combinations of the 2nd player's pieces on the row, eliminating only those that conflict with the opponents pieces.
for each combination on this row:
--group connected pieces into separate networks and number these
networks starting with 1, ascending
--encode the row as a vector using:
= 0 for any unoccupied or opponent position
= (1-8) for the network group that that piece/position is in.
--give each such grouping a COUNT of 1, and add it to a dictionary/hashset using the encoded vector as its key
Now, for each succeeding row, in ascending order {y=y+1}:
For every entry in the previous row's dictionary:
--If the entry has exactly 1 group, add it's COUNT to TOTAL
--enumerate all possible combinations of the 2nd player's pieces
on the current row, eliminating only those that conflict with the
opponents pieces. (change:) you should skip the initial combination
(where all entries are zero) for this step, as the step above actually
covers it. For each such combination on the current row:
+ produce a grouping vector as described above
+ compare the current row's group-vector to the previous row's
group-vector from the dictionary:
++ if there are any group-*numbers* from the previous row's
vector that are not adjacent to any gorups in the current
row's vector, *for at least one value of X*, then skip
to the next combination.
++ any groups for the current row that are adjacent to any
groups of the previous row, acquire the lowest such group
number
++ any groups for the current row that are not adjacent to
any groups of the previous row, are assigned an unused
group number
+ Re-Normalize the group-number assignments for the current-row's
combination (**) and encode the vector, giving it a COUNT equal
to the previous row-vector's COUNT
+ Add the current-row's vector to the dictionary for the current
Row, using its encoded vector as the key. If it already exists,
then add it's COUNT to the COUNT for the pre-exising entry
Finally, for every entry in the dictionary for the last row:
If the entry has exactly one group, then add it's COUNT to TOTAL
**: Re-Normalizing simply means to re-assign the group numbers so as to eliminate any permutations in the grouping pattern. Specifically, this means that new group numbers should be assigned in increasing order, from left-to-right, starting from one. So for example, if your grouping vector looked like this after grouping ot to the previous row:
2 0 5 5 0 3 0 5 0 7 ...
it should be re-mapped to this normal form:
1 0 2 2 0 3 0 2 0 4 ...
Note that as in this example, after the first row, the groupings can be discontiguous. This relationship must be preserved, so the two groups of "5"s are re-mapped to the same number ("2") in the re-normalization.
OK, a couple of notes:
A. I think that this approach is correct , but I I am really not certain, so it will definitely need some vetting, etc.
B. Although it is long, it's still pretty sketchy. Each individual step is non-trivial in itself.
C. Although there are plenty of individual optimization opportunities, the overall algorithm is still pretty complicated. It is a lot better than brute-force, but even so, my back-of-the-napkin estimate is still around (2.5 to 10)*10^11 operations for N=7.
So it's probably tractable, but still a long way off from doing 74 cases in 3 seconds. I haven't read all of the detail for Peter de Revaz's answer, but his idea of rotating the "diamond" might be workable for my algorithm. Although it would increase the complexity of the inner loop, it may drop the size of the dictionaries (and thus, the number of grouping-vectors to compare against) by as much as a 100x, though it's really hard to tell without actually trying it.
Note also that there isn't any dynamic programming here. I couldn't come up with an easy way to leverage it, so that might still be an avenue for improvement.
OK, I enumerated all possible valid grouping-vectors to get a better estimate of (C) above, which lowered it to O(3.5*10^9) for N=7. That's much better, but still about an order of magnitude over what you probably need to finish 74 tests in 3 seconds. That does depend on the tests though, if most of them are smaller than N=7, it might be able to make it.
Here is a rough sketch of an approach for this problem.
First note that the lattice points need |x|+|y| < N, which results in a diamond shape going from coordinates 0,6 to 6,0 i.e. with 7 points on each side.
If you imagine rotating this diamond by 45 degrees, you will end up with a 7*7 square lattice which may be easier to think about. (Although note that there are also intermediate 6 high columns.)
For example, for N=3 the original lattice points are:
..A..
.BCD.
EFGHI
.JKL.
..M..
Which rotate to
A D I
C H
B G L
F K
E J M
On the (possibly rotated) lattice I would attempt to solve by dynamic programming the problem of counting the number of ways of placing armies in the first x columns such that the last column is a certain string (plus a boolean flag to say whether some points have been placed yet).
The string contains a digit for each lattice point.
0 represents an empty location
1 represents an isolated point
2 represents the first of a new connected group
3 represents an intermediate in a connected group
4 represents the last in an connected group
During the algorithm the strings can represent shapes containing multiple connected groups, but we reject any transformations that leave an orphaned connected group.
When you have placed all columns you need to only count strings which have at most one connected group.
For example, the string for the first 5 columns of the shape below is:
....+ = 2
..+++ = 3
..+.. = 0
..+.+ = 1
..+.. = 0
..+++ = 3
..+++ = 4
The middle + is currently unconnected, but may become connected by a later column so still needs to be tracked. (In this diagram I am also assuming a up/down/left/right 4-connectivity. The rotated lattice should really use a diagonal connectivity but I find that a bit harder to visualise and I am not entirely sure it is still a valid approach with this connectivity.)
I appreciate that this answer is not complete (and could do with lots more pictures/explanation), but perhaps it will prompt someone else to provide a more complete solution.

Find the smallest set group to cover all combinatory possibilities

I'm making some exercises on combinatorics algorithm and trying to figure out how to solve the question below:
Given a group of 25 bits, set (choose) 15 (non-permutable and order NON matters):
n!/(k!(n-k)!) = 3.268.760
Now for every of these possibilities construct a matrix where I cross every unique 25bit member against all other 25bit member where
in the relation in between it there must be at least 11 common setted bits (only ones, not zeroes).
Let me try to illustrate representing it as binary data, so the first member would be:
0000000000111111111111111 (10 zeros and 15 ones) or (15 bits set on 25 bits)
0000000001011111111111111 second member
0000000001101111111111111 third member
0000000001110111111111111 and so on....
...
1111111111111110000000000 up to here. The 3.268.760 member.
Now crossing these values over a matrix for the 1 x 1 I must have 15 bits common. Since the result is >= 11 it is a "useful" result.
For the 1 x 2 we have 14 bits common so also a valid result.
Doing that for all members, finally, crossing 1 x 3.268.760 should result in 5 bits common so since it's < 11 its not "useful".
What I need is to find out (by math or algorithm) wich is the minimum number of members needed to cover all possibilities having 11 bits common.
In other words a group of N members that if tested against all others may have at least 11 bits common over the whole 3.268.760 x 3.268.760 universe.
Using a brute force algorithm I found out that with 81 25bit member is possible achive this. But i'm guessing that this number should be smaller (something near 12).
I was trying to use a brute force algorithm to make all possible variations of 12 members over the 3.268.760 but the number of possibilities
it's so huge that it would take more than a hundred years to compute (3,156x10e69 combinations).
I've googled about combinatorics but there are so many fields that i don't know in wich these problem should fit.
So any directions on wich field of combinatorics, or any algorithm for these issue is greatly appreciate.
PS: Just for reference. The "likeness" of two members is calculated using:
(Not(a xor b)) and a
After that there's a small recursive loop to count the bits given the number of common bits.
EDIT: As promissed (#btilly)on the comment below here's the 'fractal' image of the relations or link to image
The color scale ranges from red (15bits match) to green (11bits match) to black for values smaller than 10bits.
This image is just sample of the 4096 first groups.
tl;dr: you want to solve dominating set on a large, extremely symmetric graph. btilly is right that you should not expect an exact answer. If this were my problem, I would try local search starting with the greedy solution. Pick one set and try to get rid of it by changing the others. This requires data structures to keep track of which sets are covered exactly once.
EDIT: Okay, here's a better idea for a lower bound. For every k from 1 to the value of the optimal solution, there's a lower bound of [25 choose 15] * k / [maximum joint coverage of k sets]. Your bound of 12 (actually 10 by my reckoning, since you forgot some neighbors) corresponds to k = 1. Proof sketch: fix an arbitrary solution with m sets and consider the most coverage that can be obtained by k of the m. Build a fractional solution where all symmetries of the chosen k are averaged together and scaled so that each element is covered once. The cost of this solution is [25 choose 15] * k / [maximum joint coverage of those k sets], which is at least as large as the lower bound we're shooting for. It's still at least as small, however, as the original m-set solution, as the marginal returns of each set are decreasing.
Computing maximum coverage is in general hard, but there's a factor (e/(e-1))-approximation (≈ 1.58) algorithm: greedy, which it sounds as though you could implement quickly (note: you need to choose the set that covers the most uncovered other sets each time). By multiplying the greedy solution by e/(e-1), we obtain an upper bound on the maximum coverage of k elements, which suffices to power the lower bound described in the previous paragraph.
Warning: if this upper bound is larger than [25 choose 15], then k is too large!
This type of problem is extremely hard, you should not expect to be able to find the exact answer.
A greedy solution should produce a "fairly good" answer. But..how to be greedy?
The idea is to always choose the next element to be the one that is going to match as many possibilities as you can that are currently unmatched. Unfortunately with over 3 million possible members, that you have to try match against millions of unmatched members (note, your best next guess might already match another member in your candidate set..), even choosing that next element is probably not feasible.
So we'll have to be greedy about choosing the next element. We will choose each bit to maximize the sum of the probabilities of eventually matching all of the currently unmatched elements.
For that we will need a 2-dimensional lookup table P such that P(n, m) is the probability that two random members will turn out to have at least 11 bits in common, if m of the first n bits that are 1 in the first member are also 1 in the second. This table of 225 probabilities should be precomputed.
This table can easily be computed using the following rules:
P(15, m) is 0 if m < 11, 1 otherwise.
For n < 15:
P(n, m) = P(n+1, m+1) * (15-m) / (25-n) + P(n+1, m) * (10-n+m) / (25-n)
Now let's start with a few members that are "very far" from each other. My suggestion would be:
First 15 bits 1, rest 0.
First 10 bits 0, rest 1.
First 8 bits 1, last 7 1, rest 0.
Bits 1-4, 9-12, 16-23 are 1, rest 0.
Now starting with your universe of (25 choose 15) members, eliminate all of those that match one of the elements in your initial collection.
Next we go into the heart of the algorithm.
While there are unmatched members:
Find the bit that appears in the most unmatched members (break ties randomly)
Make that the first set bit of our candidate member for the group.
While the candidate member has less than 15 set bits:
Let p_best = 0, bit_best = 0;
For each unset bit:
Let p = 0
For each unmatched member:
p += P(n, m) where m = number of bits in common between
candidate member+this bit and the unmatched member
and n = bits in candidate member + 1
If p_best < p:
p_best = p
bit_best = this unset bit
Set bit_best as the next bit in our candidate member.
Add the candidate member to our collection
Remove all unmatched members that match this from unmatched members
The list of candidate members is our answer
I have not written code, I therefore have no idea how good an answer this algorithm will produce. But assuming that it does no better than your current, for 77 candidate members (we cheated and started with 4) you have to make 271 passes through your unmatched candidates (25 to find the first bit, 24 to find the second, etc down to 11 to find the 15th, and one more to remove the matched members). That's 20867 passes. If you have an average of 1 million unmatched members, that's on the order of a 20 billion operations.
This won't be quick. But it should be computationally feasible.

Resources