Find the smallest set group to cover all combinatory possibilities - algorithm

I'm making some exercises on combinatorics algorithm and trying to figure out how to solve the question below:
Given a group of 25 bits, set (choose) 15 (non-permutable and order NON matters):
n!/(k!(n-k)!) = 3.268.760
Now for every of these possibilities construct a matrix where I cross every unique 25bit member against all other 25bit member where
in the relation in between it there must be at least 11 common setted bits (only ones, not zeroes).
Let me try to illustrate representing it as binary data, so the first member would be:
0000000000111111111111111 (10 zeros and 15 ones) or (15 bits set on 25 bits)
0000000001011111111111111 second member
0000000001101111111111111 third member
0000000001110111111111111 and so on....
...
1111111111111110000000000 up to here. The 3.268.760 member.
Now crossing these values over a matrix for the 1 x 1 I must have 15 bits common. Since the result is >= 11 it is a "useful" result.
For the 1 x 2 we have 14 bits common so also a valid result.
Doing that for all members, finally, crossing 1 x 3.268.760 should result in 5 bits common so since it's < 11 its not "useful".
What I need is to find out (by math or algorithm) wich is the minimum number of members needed to cover all possibilities having 11 bits common.
In other words a group of N members that if tested against all others may have at least 11 bits common over the whole 3.268.760 x 3.268.760 universe.
Using a brute force algorithm I found out that with 81 25bit member is possible achive this. But i'm guessing that this number should be smaller (something near 12).
I was trying to use a brute force algorithm to make all possible variations of 12 members over the 3.268.760 but the number of possibilities
it's so huge that it would take more than a hundred years to compute (3,156x10e69 combinations).
I've googled about combinatorics but there are so many fields that i don't know in wich these problem should fit.
So any directions on wich field of combinatorics, or any algorithm for these issue is greatly appreciate.
PS: Just for reference. The "likeness" of two members is calculated using:
(Not(a xor b)) and a
After that there's a small recursive loop to count the bits given the number of common bits.
EDIT: As promissed (#btilly)on the comment below here's the 'fractal' image of the relations or link to image
The color scale ranges from red (15bits match) to green (11bits match) to black for values smaller than 10bits.
This image is just sample of the 4096 first groups.

tl;dr: you want to solve dominating set on a large, extremely symmetric graph. btilly is right that you should not expect an exact answer. If this were my problem, I would try local search starting with the greedy solution. Pick one set and try to get rid of it by changing the others. This requires data structures to keep track of which sets are covered exactly once.
EDIT: Okay, here's a better idea for a lower bound. For every k from 1 to the value of the optimal solution, there's a lower bound of [25 choose 15] * k / [maximum joint coverage of k sets]. Your bound of 12 (actually 10 by my reckoning, since you forgot some neighbors) corresponds to k = 1. Proof sketch: fix an arbitrary solution with m sets and consider the most coverage that can be obtained by k of the m. Build a fractional solution where all symmetries of the chosen k are averaged together and scaled so that each element is covered once. The cost of this solution is [25 choose 15] * k / [maximum joint coverage of those k sets], which is at least as large as the lower bound we're shooting for. It's still at least as small, however, as the original m-set solution, as the marginal returns of each set are decreasing.
Computing maximum coverage is in general hard, but there's a factor (e/(e-1))-approximation (≈ 1.58) algorithm: greedy, which it sounds as though you could implement quickly (note: you need to choose the set that covers the most uncovered other sets each time). By multiplying the greedy solution by e/(e-1), we obtain an upper bound on the maximum coverage of k elements, which suffices to power the lower bound described in the previous paragraph.
Warning: if this upper bound is larger than [25 choose 15], then k is too large!

This type of problem is extremely hard, you should not expect to be able to find the exact answer.
A greedy solution should produce a "fairly good" answer. But..how to be greedy?
The idea is to always choose the next element to be the one that is going to match as many possibilities as you can that are currently unmatched. Unfortunately with over 3 million possible members, that you have to try match against millions of unmatched members (note, your best next guess might already match another member in your candidate set..), even choosing that next element is probably not feasible.
So we'll have to be greedy about choosing the next element. We will choose each bit to maximize the sum of the probabilities of eventually matching all of the currently unmatched elements.
For that we will need a 2-dimensional lookup table P such that P(n, m) is the probability that two random members will turn out to have at least 11 bits in common, if m of the first n bits that are 1 in the first member are also 1 in the second. This table of 225 probabilities should be precomputed.
This table can easily be computed using the following rules:
P(15, m) is 0 if m < 11, 1 otherwise.
For n < 15:
P(n, m) = P(n+1, m+1) * (15-m) / (25-n) + P(n+1, m) * (10-n+m) / (25-n)
Now let's start with a few members that are "very far" from each other. My suggestion would be:
First 15 bits 1, rest 0.
First 10 bits 0, rest 1.
First 8 bits 1, last 7 1, rest 0.
Bits 1-4, 9-12, 16-23 are 1, rest 0.
Now starting with your universe of (25 choose 15) members, eliminate all of those that match one of the elements in your initial collection.
Next we go into the heart of the algorithm.
While there are unmatched members:
Find the bit that appears in the most unmatched members (break ties randomly)
Make that the first set bit of our candidate member for the group.
While the candidate member has less than 15 set bits:
Let p_best = 0, bit_best = 0;
For each unset bit:
Let p = 0
For each unmatched member:
p += P(n, m) where m = number of bits in common between
candidate member+this bit and the unmatched member
and n = bits in candidate member + 1
If p_best < p:
p_best = p
bit_best = this unset bit
Set bit_best as the next bit in our candidate member.
Add the candidate member to our collection
Remove all unmatched members that match this from unmatched members
The list of candidate members is our answer
I have not written code, I therefore have no idea how good an answer this algorithm will produce. But assuming that it does no better than your current, for 77 candidate members (we cheated and started with 4) you have to make 271 passes through your unmatched candidates (25 to find the first bit, 24 to find the second, etc down to 11 to find the 15th, and one more to remove the matched members). That's 20867 passes. If you have an average of 1 million unmatched members, that's on the order of a 20 billion operations.
This won't be quick. But it should be computationally feasible.

Related

Quickly determine if number divides any element in a set

Is there an algorithm that can quickly determine if a number is a factor of a given set of numbers ?
For example, 12 is a factor of [24,33,52] while 5 is not.
Is there a better approach than linear search O(n)? The set will contain a few million elements. I don't need to find the number, just a true or false result.
If a large number of numbers are checked against a constant list one possible approach to speed up the process is to factorize the numbers in the list into their prime factors first. Then put the list members in a dictionary and have the prime factors as the keys. Then when a number (potential factor) comes you first factorize it into its prime factors and then use the constructed dictionary to check whether the number is a factor of the numbers which can be potentially multiples of the given number.
I think in general O(n) search is what you will end up with. However, depending on how large the numbers are in general, you can speed up the search considerably assuming that the set is sorted (you mention that it can be) by observing that if you are searching to find a number divisible by D and you have currently scanned x and x is not divisible by D, the next possible candidate is obviously at floor([x + D] / D) * D. That is, if D = 12 and the list is
5 11 13 19 22 25 27
and you are scanning at 13, the next possible candidate number would be 24. Now depending on the distribution of your input, you can scan forwards using binary search instead of linear search, as you are searching now for the least number not less than 24 in the list, and the list is sorted. If D is large then you might save lots of comparisons in this way.
However from pure computational complexity point of view, sorting and then searching is going to be O(n log n), whereas just a linear scan is O(n).
For testing many potential factors against a constant set you should realize that if one element of the set is just a multiple of two others, it is irrelevant and can be removed. This approach is a variation of an ancient algorithm known as the Sieve of Eratosthenes. Trading start-up time for run-time when testing a huge number of candidates:
Pick the smallest number >1 in the set
Remove any multiples of that number, except itself, from the set
Repeat 2 for the next smallest number, for a certain number of iterations. The number of iterations will depend on the trade-off with start-up time
You are now left with a much smaller set to exhaustively test against. For this to be efficient you either want a data structure for your set that allows O(1) removal, like a linked-list, or just replace "removed" elements with zero and then copy non-zero elements into a new container.
I'm not sure of the question, so let me ask another: Is 12 a factor of [6,33,52]? It is clear that 12 does not divide 6, 33, or 52. But the factors of 12 are 2*2*3 and the factors of 6, 33 and 52 are 2*2*2*3*3*11*13. All of the factors of 12 are present in the set [6,33,52] in sufficient multiplicity, so you could say that 12 is a factor of [6,33,52].
If you say that 12 is not a factor of [6,33,52], then there is no better solution than testing each number for divisibility by 12; simply perform the division and check the remainder. Thus 6%12=6, 33%12=9, and 52%12=4, so 12 is not a factor of [6.33.52]. But if you say that 12 is a factor of [6,33,52], then to determine if a number f is a factor of a set ns, just multiply the numbers ns together sequentially, after each multiplication take the remainder modulo f, report true immediately if the remainder is ever 0, and report false if you reach the end of the list of numbers ns without a remainder of 0.
Let's take two examples. First, is 12 a factor of [6,33,52]? The first (trivial) multiplication results in 6 and gives a remainder of 6. Now 6*33=198, dividing by 12 gives a remainder of 6, and we continue. Now 6*52=312 and 312/12=26r0, so we have a remainder of 0 and the result is true. Second, is 5 a factor of [24,33,52]? The multiplication chain is 24%5=5, (5*33)%5=2, and (2*52)%5=4, so 5 is not a factor of [24,33,52].
A variant of this algorithm was recently used to attack the RSA cryptosystem; you can read about how the attack worked here.
Since the set to be searched is fixed any time spent organising the set for search will be time well spent. If you can get the set in memory, then I expect that a binary tree structure will suit just fine. On average searching for an element in a binary tree is an O(log n) operation.
If you have reason to believe that the numbers in the set are evenly distributed throughout the range [0..10^12] then a binary search of a sorted set in memory ought to perform as well as searching a binary tree. On the other hand, if the middle element in the set (or any subset of the set) is not expected to be close to the middle value in the range encompassed by the set (or subset) then I think the binary tree will have better (practical) performance.
If you can't get the entire set in memory then decomposing it into chunks which will fit into memory and storing those chunks on disk is probably the way to go. You would store the root and upper branches of the set in memory and use them to index onto the disk. The depth of the part of the tree which is kept in memory is something you should decide for yourself, but I'd be surprised if you needed more than the root and 2 levels of branch, giving 8 chunks on disk.
Of course, this only solves part of your problem, finding whether a given number is in the set; you really want to find whether the given number is the factor of any number in the set. As I've suggested in comments I think any approach based on factorising the numbers in the set is hopeless, giving an expected running time beyond polynomial time.
I'd approach this part of the problem the other way round: generate the multiples of the given number and search for each of them. If your set has 10^7 elements then any given number N will have about (10^7)/N multiples in the set. If the given number is drawn at random from the range [0..10^12] the mean value of N is 0.5*10^12, which suggests (counter-intuitively) that in most cases you will only have to search for N itself.
And yes, I am aware that in many cases you would have to search for many more values.
This approach would parallelise relatively easily.
A fast solution which requires some precomputation:
Organize your set in a binary tree with the following rules:
Numbers of the set are on the leaves.
The root of the tree contains r the minimum of all prime numbers that divide a number of the set.
The left subtree correspond to the subset of multiples of r (divided by r so that r won't be repeated infinitly).
The right subtree correspond to the subset of numbers not multiple of r.
If you want to test if a number N divides some element of the set, compute its prime decomposition and go through the tree until you reach a leaf. If the leaf contains a number then N divides it, else if the leaf is empty then N divides no element in the set.
Simply calculate the product of the set and mod the result with the test factor.
In your example
{24,33,52} P=41184
Tf 12: 41184 mod 12 = 0 True
Tf 5: 41184 mod 5 = 4 False
The set can be broken into chunks if calculating the product would overflow the arithmetic of the calculator, but huge numbers are possible by storing a strings.

Arranging groups of people optimally

I have this homework assignment that I think I managed to solve, but am not entirely sure as I cannot prove my solution. I would like comments on what I did, its correctness and whether or not there's a better solution.
The problem is as follows: we have N groups of people, where group ihas g[i]people in it. We want to put these people on two rows of S seats each, such that: each group can only be put on a single row, in a contiguous sequence, OR if the group has an even number of members, we can split them in two and put them on two rows, but with the condition that they must form a rectangle (so they must have the same indices on both rows). What is the minimum number of seats S needed so that nobody is standing up?
Example: groups are 4 11. Minimum S is 11. We put all 4 in one row, and the 11 on the second row. Another: groups are 6 2. We split the 6 on two rows, and also the two. Minimum is therefore 4 seats.
This is what I'm thinking:
Calculate T = (sum of all groups + 1) / 2. Store the group numbers in an array, but split all the even values x in two values of x / 2 each. So 4 5 becomes 2 2 5. Now run subset sum on this vector, and find the minimum value higher than or equal to T that can be formed. That value is the minimum number of seats per row needed.
Example: 4 11 => 2 2 11 => T = (15 + 1) / 2 = 8. Minimum we can form from 2 2 11 that's >= 8 is 11, so that's the answer.
This seems to work, at least I can't find any counter example. I don't have a proof though. Intuitively, it seems to always be possible to arrange the people under the required conditions with the number of seats supplied by this algorithm.
Any hints are appreciated.
I think your solution is correct. The minimum number of seats per row in an optimal distribution would be your T (which is mathematically obvious).
Splitting even numbers is also correct, since they have two possible arrangements; by logically putting all the "rectangular" groups of people on one end of the seat rows you can also guarantee that they will always form a proper rectangle, so that this condition is met as well.
So the question boils down to finding a sum equal or as close as possible to T (e.g. partition problem).
Minor nit: I'm not sure if the proposed solution above works in the edge case where each group has 0 members, because your numerator in T = SUM ALL + 1 / 2 is always positive, so there will never be a subset sum that is greater than or equal to T.
To get around this, maybe a modulus operation might work here. We know that we need at least n seats in a row if n is the maximal odd term, so maybe the equation should have a max(n * (n % 2)) term in it. It will come out to max(odd) or 0. Since the maximal odd term is always added to S, I think this is safe (stated boldly without proof...).
Then we want to know if we should split the even terms or not. Here's where the subset sum approach might work, but with T simply equal to SUM ALL / 2.

in a series of n elements of arithmetic progression, [n/2] elements are changed. Find the difference in the initial arithmetic progression

I have a list of size n which contains n consecutive members of an arithmetic progression which are not in order. I changed less than half of the elements in this list with some random integer. From this new list, how can I find the difference of the initial arithmetic progression?
I thought a lot about it but except brute force, I was not able to come up with any other thing :(
Thanks for thinking on this one :)
It's not possible to solve this in general and be 100% sure that your answer is correct. Let's say that the initial list is the following arithmetic progression (not in order):
1 3 2 4
Change less than half the elements at random... let's say for example that we changed 2 to 5:
1 3 5 4
If we can first find out which numbers we need to change to obtain a valid shuffled arithmetic sequence then we can easily solve the problem stated in the question. However we can see that there are multiple possible answers depending in which we number we choose to change:
6, 3, 5, 4 (difference is 1)
1, 3, 2, 4 (difference is 1)
1, 3, 5, 7 (difference is 2)
There is no way to know which of these possible sequence is the original sequence, so you cannot be sure what the original difference was.
Since there is no deterministic solution for the problem (as stated by #Mark Byers), you can try a probabilistic approach.
It's difficult to obtain the original progression, but its rate can be obtained easily by comparing the differences between elements. The difference of original ones will be multiples of rate.
Consider you take 2 elements from the list (probability that both of them belongs to the original sequence is 1/4), and compute the difference. This difference, with probability of 1/4, will be a multiple of the rate. Decompose it to prime factors and count them (for example, 12 = 2^^2 * 3 will add 2 to 2's counter and will increment 3's counter).
After many such iterations (it looks like a good problem for probabilistic methods, like Monte Carlo), you could analize the counters.
If a prime factor belongs to the rate, its counter will be at least num_iteartions/4 ( or num_iterations/2 if it appears twice).
The main problem is that small factors will have large probability on random input (for example, the difference between two random numbers will have 50% probability to be divisible by 2). So you'll have to compensate it: since 3/4 of your differences were random, you'll have to consider that (3/8)*num_iterations of 2's counter must be ignored. Since this also applies to all powers of two, the simpliest way is to pregenerate "white noise mask" by taking the differences only between random numbers.
EDIT: let's take this approach further. Consider that you create this "white noise mask" (let's call it spectrum) for random numbers, and consider that it's base-1 spectrum, since their smallest "largest common factor" is 1. By computing it for a differences of the arithmetic sequence, you'll obtain a base-R spectrum, where R is the rate, and it will equivalent to a shifted version of base-1 spectrum. So you have to find the value of R such that
your_spectrum ~= spectrum(1)*3/4 + spectrum(R)*1/4
You could also check for largest number R such that at least half of the elements will be equal modulo R.

Revisit: 2D Array Sorted Along X and Y Axis

So, this is a common interview question. There's already a topic up, which I have read, but it's dead, and no answer was ever accepted. On top of that, my interests lie in a slightly more constrained form of the question, with a couple practical applications.
Given a two dimensional array such that:
Elements are unique.
Elements are sorted along the x-axis and the y-axis.
Neither sort predominates, so neither sort is a secondary sorting parameter.
As a result, the diagonal is also sorted.
All of the sorts can be thought of as moving in the same direction. That is to say that they are all ascending, or that they are all descending.
Technically, I think as long as you have a >/=/< comparator, any total ordering should work.
Elements are numeric types, with a single-cycle comparator.
Thus, memory operations are the dominating factor in a big-O analysis.
How do you find an element? Only worst case analysis matters.
Solutions I am aware of:
A variety of approaches that are:
O(nlog(n)), where you approach each row separately.
O(nlog(n)) with strong best and average performance.
One that is O(n+m):
Start in a non-extreme corner, which we will assume is the bottom right.
Let the target be J. Cur Pos is M.
If M is greater than J, move left.
If M is less than J, move up.
If you can do neither, you are done, and J is not present.
If M is equal to J, you are done.
Originally found elsewhere, most recently stolen from here.
And I believe I've seen one with a worst-case O(n+m) but a optimal case of nearly O(log(n)).
What I am curious about:
Right now, I have proved to my satisfaction that naive partitioning attack always devolves to nlog(n). Partitioning attacks in general appear to have a optimal worst-case of O(n+m), and most do not terminate early in cases of absence. I was also wondering, as a result, if an interpolation probe might not be better than a binary probe, and thus it occurred to me that one might think of this as a set intersection problem with a weak interaction between sets. My mind cast immediately towards Baeza-Yates intersection, but I haven't had time to draft an adaptation of that approach. However, given my suspicions that optimality of a O(N+M) worst case is provable, I thought I'd just go ahead and ask here, to see if anyone could bash together a counter-argument, or pull together a recurrence relation for interpolation search.
Here's a proof that it has to be at least Omega(min(n,m)). Let n >= m. Then consider the matrix which has all 0s at (i,j) where i+j < m, all 2s where i+j >= m, except for a single (i,j) with i+j = m which has a 1. This is a valid input matrix, and there are m possible placements for the 1. No query into the array (other than the actual location of the 1) can distinguish among those m possible placements. So you'll have to check all m locations in the worst case, and at least m/2 expected locations for any randomized algorithm.
One of your assumptions was that matrix elements have to be unique, and I didn't do that. It is easy to fix, however, because you just pick a big number X=n*m, replace all 0s with unique numbers less than X, all 2s with unique numbers greater than X, and 1 with X.
And because it is also Omega(lg n) (counting argument), it is Omega(m + lg n) where n>=m.
An optimal O(m+n) solution is to start at the top-left corner, that has minimal value. Move diagonally downwards to the right until you hit an element whose value >= value of the given element. If the element's value is equal to that of the given element, return found as true.
Otherwise, from here we can proceed in two ways.
Strategy 1:
Move up in the column and search for the given element until we reach the end. If found, return found as true
Move left in the row and search for the given element until we reach the end. If found, return found as true
return found as false
Strategy 2:
Let i denote the row index and j denote the column index of the diagonal element we have stopped at. (Here, we have i = j, BTW). Let k = 1.
Repeat the below steps until i-k >= 0
Search if a[i-k][j] is equal to the given element. if yes, return found as true.
Search if a[i][j-k] is equal to the given element. if yes, return found as true.
Increment k
1 2 4 5 6
2 3 5 7 8
4 6 8 9 10
5 8 9 10 11

Sum of numbers making a sequence

While watching the rugby last night I was wondering if any scores were impossible given you can only score points in lots of 3, 5 or 7. It didn't take long to work out that any number greater than 4 is attainable. 5=5, 6=3+3, 7=7, 8=3+5, 9=3+3+3, 10=5+5 and so on.
Extending on that idea for 5, 7 and 9 yields the following possible scores:
5,7,9,10,12,14 // and now all numbers are possible.
For 7, 9 and 11:
7,9,11,14,16,18,20,22,23,25,27 // all possible from here
I did these in my head, can anyone suggest a good algorithm that would determine the lowest possible score above which all scores are attainable given a set of scores.
I modelled it like this:
forall a < 10:
forall b < 10:
forall c < 10:
list.add(3a + 5b + 7c);
list.sort_smallest_first();
Then check the list for a sequence longer than 3 (the smallest score possible). Seems pretty impractical and slow for anything beyond the trivial case.
There is only one unattainable number above which all scores are attainable.
This is called the frobenius number. See: http://en.wikipedia.org/wiki/Frobenius_number
The wiki page should have links for algorithms to solve it, for instance: http://www.combinatorics.org/Volume_12/PDF/v12i1r27.pdf
For 2 numbers a,b an exact formula (ab-a-b) is known (which you could use to cut down your search space), and for 3 numbers a,b,c a sharp lower bound (sqrt(3abc)-a-b-c) and quite fast algorithms are known.
If the numbers are in arithmetic progression, then an exact formula is known (see wiki). I mention this because in your examples all numbers are in arithmetic progression.
So to answer your question, find the Frobenius number and add 1 to it.
Hope that helps.

Resources