Finding overlapping sets - algorithm

I'm writing a Digital Fountain system in C#. Part of this system creates sets of integers for me, and I need to find the combinations of sets that can leave me with a set of just one item. What's the fastest way to do this?
Set A: 1,2,3,4,5,6
Set B: 1,2,3,4,6
Set C: 1,2,3
Set D: 5,6
Solutions:
A - B => 5
A - (C + D) => 4
I don't need to find all combinations, just enough to isolate as many unique numbers as possible. It may be possible to exploit this to create a more efficient algorithm.
An important point that I forgot to mention:
I do not know beforehand how many sets there are; instead I add them one by one and must determine each time whether I have found every number I require. So the algorithm must be one that can be run in stages as new sets are added.
Nb. Solutions in C# get bonus marks ;)

I think some nice solutions can be obtained by modifying the greedy set cover algorithm (http://en.wikipedia.org/wiki/Set_cover_problem):
Pseudocode:
1. sort the sets by size, descending
2. for each set, greedily collect covering sets:

foreach set in sets do:
    uncovered = set.size
    while uncovered > 1:
        current_set = the biggest set, not used before for this set, that covers
                      no more than (uncovered - 1) of the still-uncovered elements
        uncovered = uncovered - elements_covered_by(current_set)
        collect current_set into some array
    end
end
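A minimal C# sketch of that greedy step, assuming each set is a HashSet<int>; the names (TryIsolate, pool) are illustrative, not from the question:

using System;
using System.Collections.Generic;
using System.Linq;

static class GreedyIsolate
{
    // Tries to cover all but one element of `target` with sets from `pool`;
    // returns the leftover element, or null if the greedy choice gets stuck.
    public static int? TryIsolate(HashSet<int> target, List<HashSet<int>> pool)
    {
        var uncovered = new HashSet<int>(target);
        var used = new HashSet<HashSet<int>>();

        while (uncovered.Count > 1)
        {
            HashSet<int> best = null;
            int bestCover = 0;
            foreach (var s in pool)
            {
                if (used.Contains(s)) continue;
                int cover = s.Count(uncovered.Contains);
                // the biggest cover that still leaves at least one element uncovered
                if (cover > bestCover && cover <= uncovered.Count - 1)
                {
                    best = s;
                    bestCover = cover;
                }
            }
            if (best == null) return null;   // greedy step failed
            used.Add(best);
            uncovered.ExceptWith(best);
        }
        return uncovered.Single();
    }
}

For the example above, TryIsolate(A, new List<HashSet<int>> { B, C, D }) subtracts B first and returns 5.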
Edit: you can omit the foreach loop for the last set.
This will give you at most one solution per set. To fix that, you can turn the problem directly into a set cover problem and use greedy set cover: for example, if you have the array [1,3,4], you need to solve the set cover problem for each of its subsets of size 2: [1,3], [1,4], [3,4]. That makes the problem much more complex.
Another approach you may consider is evolutionary algorithms (the representation here would be very simple: treat each number as a bit, and make the fitness function grow as the result gets closer to a single remaining element). But this still doesn't solve the problem of adding a new set after the calculation; maybe, when you have the best population from the last run, you could just add a new position to each chromosome after adding the new set.

Related

Data structure for subset-reduced growing list

I'm working on a problem which involves going through a lot of data. To reduce the work (because current calculations take about two weeks of compute time, and I'd like to reduce that dramatically) I came up with an algorithm which would be much faster if it was able to avoid a certain type of duplication. (The current algorithm avoids storing this information because it is too large, unreduced, to fit in memory.)
I have a collection of sets, and I don't want to insert a set A if there is already a set B which is a subset of A. At the moment the sets are represented by integers where individual binary digits represent a particular element being present or absent. In that interpretation the set/integer A should not be inserted if there is already a set/integer B such that (~A) & B is 0, where ~ is bitwise negation and & is bitwise AND.
For example, if my collection has the following sets
[ {a,b}, {b,c}, {b,d,e} ]
and I asked to add {b,c,e} it should not be added (since {b,c} is already there) and similarly with {a,b} (since {a,b} is there) but {a,e} should be added.
The numeric equivalent would be starting with
[ `0b11`, `0b110`, `0b11010` ]
where 0b10110 is not added since (~0b10110) & 0b110 == 0, 0b11 is not added since (~0b11) & 0b11 == 0, but 0b10001 can be added.
Ideally the structure would prune itself as new sets are added, so if {c} were added all existing sets containing c would be removed. But it's acceptable if it doesn't update in that way as long as I can normalize it to that form in some not-too-expensive way every so often.
This is a well-known problem known as "finding extremal sets"; unfortunately, there is nothing fundamentally faster known than the obvious approach of testing a newly inserted set against all existing sets, but good heuristic improvements exist. Here is a recent paper discussing this problem: https://arxiv.org/abs/1508.01753
An open-source implementation of a related algorithm:
https://code.google.com/archive/p/google-extremal-sets/
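For reference, the "obvious approach" is easy to sketch in C# when sets fit in a 64-bit mask; the class and method names here are made up:

using System.Collections.Generic;

class ExtremalSetCollection
{
    private readonly List<ulong> sets = new List<ulong>();

    // Returns false (and does not insert) if some existing set is a subset of `candidate`.
    public bool TryAdd(ulong candidate)
    {
        foreach (var b in sets)
            if ((~candidate & b) == 0)          // b is a subset of candidate
                return false;

        // Optional pruning: drop existing supersets of the new set.
        sets.RemoveAll(b => (~b & candidate) == 0);
        sets.Add(candidate);
        return true;
    }
}

The RemoveAll call is the self-pruning behaviour described in the question (adding {c} would remove every existing set containing c).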

Algorithm determining the smallest coprime subset

Given a set A of n positive integers, determine a non-empty subset B
consisting of as few elements as possible such that their GCD is 1 and output its size.
For example (the first number in each input is the count n): 5 6 10 12 15 18
yields an output of "3", while:
5 2 4 6 8 10
yields "NONE", since no such subset exists.
So it seems really basic, but I'm still stuck on it. My thoughts so far: multiples of a number already present in the set are useless, since their divisors are the same (times some factor k) and we're after the smallest subset. Hence, for every n_i we can remove every k*n_i (k a positive integer) from further consideration.
That's where I get stuck, though. What should I do next? I can only think of a dumb, brute-force approach: check whether there is a 2-element subset, then a 3-element one, and so on. What should I check to determine it in some cleverer way?
Suppose for each pair of elements A, B we calculate their greatest common divisor D, and store these D values somewhere as a map of the form:
A,B -> D
Let's say we also store the reverse map
D -> A,B
If there's at least one D = 1 then there we go: the answer is 2.
Suppose now there's no D with D = 1. What condition should be met for the answer to be 3? I think this one: there exist two D values, say D1 and D2, such that GCD(D1, D2) = 1. Right?
So now, instead of As and Bs, we've transformed our problem into the same problem over the set of all Ds, and turned the question of a size-2 answer into the question of a size-3 answer. Right? I am not 100% sure, just thinking out loud. But this transformed problem is even worse, as we have to store many more values (the number of 2-element combinations of N elements).
Not sure; this problem you pose seems like a hard problem to me. I would be surprised if there were a better approach than brute force, and I would be interested to know it.
What you need to think about (and look for) is this: is there a way to express GCD(a1, a2, ..., aN) if you know the pairwise GCDs? If there's some sort of method or formula, you can simplify your search for the smallest subset matching the desired criterion a bit.
See also this link; maybe it could help:
https://cs.stackexchange.com/questions/10249/finding-the-size-of-the-smallest-subset-with-gcd-1
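As a concrete starting point, the "answer is 2" check described above is just a pairwise GCD scan. A rough C# sketch (helper names are mine):

using System.Collections.Generic;

static class CoprimePairs
{
    static long Gcd(long a, long b) => b == 0 ? a : Gcd(b, a % b);

    // True if some pair of elements is coprime, i.e. the answer is 2.
    public static bool HasCoprimePair(IList<long> values)
    {
        for (int i = 0; i < values.Count; i++)
            for (int j = i + 1; j < values.Count; j++)
                if (Gcd(values[i], values[j]) == 1)
                    return true;
        return false;
    }
}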
The problem is definitely a tough one to solve. I can't see any computationally efficient algorithm that is guaranteed to find the solution in reasonable time.
One approach is:
Form a list of sets, where each set contains the prime factors of one element of the original set.
Now you need to find the minimum number of these sets whose intersection is empty.
To do that, first order the sets in your list so that the sets with the fewest intersections with other sets come towards the beginning. Now, what counts as "fewest intersections"?
This is where heuristics come into play. It can be:
1. the set with the smallest minimum number of intersections with the other sets;
2. the set with the smallest maximum number of intersections with the other sets;
3. any other more suitable definition.
Then you will still need to iterate, expensively, through the combinations (perhaps by recursion) to determine the solution.
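Here is a rough C# sketch of that idea, without the ordering heuristic: map each element to the bitmask of its distinct prime factors (assuming at most 32 distinct primes across the input), then try subsets of increasing size until the bitwise AND of the chosen masks is 0, i.e. the prime-factor sets have an empty intersection. The names are mine and the search is still exponential in the worst case:

using System.Collections.Generic;
using System.Linq;

static class SmallestCoprimeSubset
{
    // Returns the size of the smallest subset with GCD 1, or -1 for "NONE".
    public static int Solve(int[] values)
    {
        // All distinct primes occurring in the input, in discovery order.
        var primes = new List<int>();
        foreach (var v in values)
            foreach (var p in PrimeFactors(v))
                if (!primes.Contains(p)) primes.Add(p);

        // Bitmask of which primes divide each element; identical masks collapsed.
        var masks = values
            .Select(v => PrimeFactors(v).Aggregate(0, (m, p) => m | (1 << primes.IndexOf(p))))
            .Distinct()
            .ToArray();

        for (int size = 1; size <= masks.Length; size++)
            if (Exists(masks, 0, ~0, size))
                return size;
        return -1;
    }

    // Is there a choice of `remaining` more masks (from index `start` on)
    // whose AND with `current` becomes 0?
    static bool Exists(int[] masks, int start, int current, int remaining)
    {
        if (remaining == 0) return current == 0;
        for (int i = start; i < masks.Length; i++)
            if (Exists(masks, i + 1, current & masks[i], remaining - 1))
                return true;
        return false;
    }

    static IEnumerable<int> PrimeFactors(int n)
    {
        for (int p = 2; p * p <= n; p++)
            while (n % p == 0) { yield return p; n /= p; }
        if (n > 1) yield return n;
    }
}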

Algorithm for finding basis of a set of bitstrings?

This is for a diff utility I'm writing in C++.
I have a list of n character-sets {"a", "abc", "abcde", "bcd", "de"} (taken from an alphabet of k=5 different letters). I need a way to observe that the entire list can be constructed by disjunctions of the character-sets {"a", "bc", "d", "e"}. That is, "b" and "c" are linearly dependent, and every other pair of letters is independent.
In the bit-twiddling version, the character-sets above are represented as {10000, 11100, 11111, 01110, 00011}, and I need a way to observe that they can all be constructed by ORing together bitstrings from the smaller set {10000, 01100, 00010, 00001}.
In other words, I believe I'm looking for a "discrete basis" of a set of n different bit-vectors in {0,1}^k. This paper claims the general problem is NP-complete... but luckily I'm only looking for a solution to small cases (k < 32).
I can think of really stupid algorithms for generating the basis. For example: for each of the k^2 pairs of letters, try to demonstrate (by an O(n) search) that they're dependent. But I really feel like there's an efficient bit-twiddling algorithm that I just haven't stumbled upon yet. Does anyone know it?
EDIT: I ended up not really needing a solution to this problem after all. But I'd still like to know if there is a simple bit-twiddling solution.
I'm thinking of a disjoint-set data structure, like union-find turned on its head (rather than combining nodes, we split them).
Algorithm:
Create an array main where you assign all the positions to the same group, then:
foreach bitstring curr:
    foreach position i:
        if curr[i] == 1:
            // the max of main can be cached for constant-time access
            main[i] += max of main from the previous iteration

Then all the distinct numbers in main are your different sets (possibly extracted using the actual union-find algorithm).
Example:
So, main = 22222. (I won't use 1 for the initial group, to avoid confusion with the bits in curr.)
curr = 10000
main = 42222 // first bit (=2) += max (=2)
curr = 11100
main = 86622 // first 3 bits (=422) += max (=4)
curr = 11111
main = 16-14-14-10-10
curr = 01110
main = 16-30-30-26-10
curr = 00011
main = 16-30-30-56-40
Then split by distinct numbers:
{10000, 01100, 00010, 00001}
Improvement:
To reduce the speed at which main increases, we can replace
main[i] += max of main from previous iteration
with
main[i] += 1 + (max - min) of main from previous iteration
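A small C# sketch of this splitting scheme, using the 1 + (max - min) increment from the improvement above. It follows the answer literally (so no claim here that accidental label collisions are impossible), and the function name is mine:

using System.Collections.Generic;
using System.Linq;

static class GroupSplitter
{
    // Each character-set is a bool[k]; positions that end up with equal
    // labels are candidates for the same basis element.
    public static int[] SplitGroups(IEnumerable<bool[]> bitstrings, int k)
    {
        var label = new int[k];                     // everyone starts in the same group
        foreach (var curr in bitstrings)
        {
            int offset = label.Max() - label.Min() + 1;
            for (int i = 0; i < k; i++)
                if (curr[i])
                    label[i] += offset;             // split 1-positions away from 0-positions
        }
        return label;
    }
}

On the example above ({10000, 11100, 11111, 01110, 00011}), the labels come out as 7, 10, 10, 15, 11, i.e. the groups {a}, {b,c}, {d}, {e}.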
EDIT: Edit based on j_random_hacker's comment
You could combine the passes of the stupid algorithm at the cost of space.
Make a bit vector called violations that is (k - 1) k / 2 bits long (so, 496 for k = 32.) Take a single pass over character sets. For each, and for each pair of letters, look for violations (i.e. XOR the bits for those letters, OR the result into the corresponding position in violations.) When you're done, negate and read off what's left.
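A C# sketch of that single-pass idea, with character-sets as uint masks and the packed (k-1)k/2-bit vector replaced by a plain bool matrix for readability (names are mine):

using System.Collections.Generic;

static class DependentPairFinder
{
    // Returns the pairs (i, j) of letter positions that always occur together,
    // i.e. no set contains one of them without the other.
    public static List<(int, int)> DependentPairs(IEnumerable<uint> sets, int k)
    {
        var violated = new bool[k, k];
        foreach (var s in sets)
            for (int i = 0; i < k; i++)
                for (int j = i + 1; j < k; j++)
                    if (((s >> i) & 1) != ((s >> j) & 1))
                        violated[i, j] = true;      // this set separates i and j

        var result = new List<(int, int)>();
        for (int i = 0; i < k; i++)
            for (int j = i + 1; j < k; j++)
                if (!violated[i, j])
                    result.Add((i, j));
        return result;
    }
}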
You could give Principal Component Analysis a try. There are some flavors of PCA designed for binary or more generally for categorical data.
Since the general problem has been shown to be NP-complete, for large vocabularies I doubt you will do better than a brute-force search (with various prunings possible) over the entire set of possibilities, O((2^k - 1) * n), at least in the worst case; some heuristics will probably help in many cases, as outlined in the paper you linked. This is your "stupid" approach generalized to all possible basis strings instead of just candidate bases of two letters.
However, for small vocabularies, I think an approach like this would do a lot better:
1. Are your current candidate words pairwise disjoint? If so, you are done (the simple case of independent words like "abc" and "def").
2. Perform a bitwise AND on each possible pair of words. This gives you a new set of candidate basis strings.
3. Go to step 1, but instead of using the original words, use the current candidate basis strings.
Afterwards you also need to include any individual letter which is not a subset of one of the final accepted candidates, and maybe some other minor bookkeeping for things like unused letters (using something like a bitwise OR over all original words). A rough code sketch follows at the end of this answer.
Considering your simple example:
First pass gives you a, abc, bc, bcd, de, d
Second pass gives you a, bc, d
Bookkeeping gives you a, bc, d, e
I don't have a proof that this is right but I think intuitively it is at least in the right direction. The advantage lies in using the words instead of the brute force's approach of using possible candidates. With a large enough set of words, this approach would become terrible, but for vocabularies up to say a few hundred or maybe even a few thousand I bet it would be pretty quick. The nice thing is that it will still work even for a huge value of k.
If you like the answer and bounty it I'd be happy to try to solve in 20 lines of code :) and come up with a more convincing proof. Seems very doable to me.
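Here is roughly what that pairwise-AND iteration might look like in C#, with words as uint bitmasks (names illustrative; as noted above there is no proof of correctness):

using System.Collections.Generic;
using System.Linq;

static class BasisCandidates
{
    public static HashSet<uint> Find(IReadOnlyCollection<uint> words, int k)
    {
        var current = new HashSet<uint>(words);

        // Repeat the pairwise-AND pass until the candidates are pairwise disjoint.
        while (true)
        {
            var list = current.ToList();
            var next = new HashSet<uint>();
            bool disjoint = true;
            for (int i = 0; i < list.Count; i++)
                for (int j = i + 1; j < list.Count; j++)
                {
                    uint c = list[i] & list[j];
                    if (c != 0) { next.Add(c); disjoint = false; }
                }
            if (disjoint) break;
            current = next;
        }

        // Bookkeeping: add any used letter not covered by the final candidates.
        uint used = 0, covered = 0;
        foreach (var w in words) used |= w;
        foreach (var c in current) covered |= c;
        for (int i = 0; i < k; i++)
            if ((used & ~covered & (1u << i)) != 0)
                current.Add(1u << i);
        return current;
    }
}

For the example word set this converges to {a, bc, d}, and the bookkeeping step adds {e}, giving the basis {a, bc, d, e}.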

What is determining the items that make the difference between two arrays called?

I want to find which elements of two arrays make the two arrays different.
For example, if I start off with
known_unacceptable_array = [bad, bad, good, good, good, bad, good]
known_acceptable_array = []
and an array is unacceptable only if it contains three bads (but I don't know that at the time). I am able to evaluate whether an array is acceptable or unacceptable, and I would like to find the smallest array that is unacceptable:
possibly_minimal_unacceptable = [bad, bad, bad]
maximal_acceptable = [bad, bad] # Third bad required to make the array unacceptable
What is this problem called, and what algorithms are there for this?
Edit: The order of the elements can't be changed, and adding an element can only change the list from acceptable to unacceptable or have no effect; it can't change it from unacceptable to acceptable.
Background: I've randomly generated thousands of instructions that make a Ruby interpreter crash, and I want to isolate the specific instructions that cause the crash; at the time I thought that multiple bad instructions were required to make it crash. A very naive attempt to determine what the bad instructions are is at this link.
What is determining the elements that make the difference between two arrays called?

Differencing is often called subtraction.

I want to determine which elements of two arrays make the two arrays different.

Again, that's subtraction (at least some form of it): given A = {x, y, z} and B = {x, y, a},
A - B = {z, -a}
or "only A has z and only B has a", or "z and a" make them different.

For example, if I start off with known_bad = [bad, bad, good, good, good, bad, good] and known_good = []

Why start with a full array and an empty one? Isn't this an extreme case, or are these "two arrays" not the two whose "difference" you are trying to determine?

possibly_minimal_bad = [bad, bad, bad]; maximal_good = [bad, bad] # Third bad required to make the list bad

Is this just a set of rules? Or is this the result of finding the difference between the two arrays of the previous (known_good, known_bad) set?

What is this problem called, and what algorithms are there for this?

If it isn't called "difference" or "subtraction", then why introduce it that way? Is the problem: (a) going from the first two arrays (known_xx) to the second two (min, max); or (b) classifying finite sequences of the words "good" and "bad"?
(a) I can't see a relation between the first two arrays and the second two. How did you get from the first two to the second?
(b) Classifying a sequence of words could be "parsing a language", decoding a message, recognizing a pattern, etc.
Is it "Pattern Recognition"? It appears that you are looking for a pattern in test-input (or test-point) data and its relationship to product failure, and want to represent that relationship in some codified form for further analysis; or you are searching for a correlation between certain test points and product failure. That makes this question rather interesting. However, the presentation of the question is quite confusing. Maybe those groups of equations could be explained a little more, clarifying whether they are related and, if so, in what way.
I'm not entirely sure if I understand the question. If my answer is unsatisfactory, please rephrase your question to be more clear. I'll base my answer on this.
I want to determine which elements of two arrays make the two arrays different.
This is a combination of the three set operations union, intersection and difference. Different combinations can achieve the same result.
The (relative) complement A\B is the subset of A which is not in B.
The intersection A∩B is the set of elements which are in both A and B.
The union A∪B is the set of elements which are in either A or B (without duplicates).
It sounds like you want the union of both complements, which is:
A\B ∪ B\A
or equivalently the union minus the intersection:
(A∪B) \ (A∩B)
See http://en.wikipedia.org/wiki/Set_operations_(Boolean) for more information.
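In C#, a quick way to compute that symmetric difference with LINQ (purely illustrative):

using System;
using System.Linq;

class SymmetricDifferenceDemo
{
    static void Main()
    {
        var a = new[] { "x", "y", "z" };
        var b = new[] { "x", "y", "a" };

        // A\B ∪ B\A : the elements that make the two collections different
        var symmetricDifference = a.Except(b).Union(b.Except(a));
        Console.WriteLine(string.Join(", ", symmetricDifference));   // z, a
    }
}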

Random Pairings that don't Repeat

This little project / problem came out of left field for me. Hoping someone can help me here. I have some rough ideas but I am sure (or at least I hope) a simple, fairly efficient solution exists.
Thanks in advance.... pseudo code is fine. I generally work in .NET / C# if that sheds any light on your solution.
Given:
A pool of n individuals will be meeting on a regular basis. I need to form pairs that have not previously met. The pool of individuals will slowly change over time. For the purposes of pairing, (A & B) and (B & A) constitute the same pair. The history of previous pairings is maintained. For the purposes of the problem, assume an even number of individuals. Within one meeting (a collection of pairs), an individual will pair up only once.
Is there an algorithm that will allow us to form these pairs? Ideally something better than putting the individuals in a random order, generating pairings, and then checking against the history of previous pairings. In general, randomness within the pairing is OK.
A bit more:
I can figure out a number of ways to create a randomized pool from which to pull pairs of individuals, check them against the history, and either throw them back into the pool or remove them and add them to the list of paired individuals. What I can't get my head around is that at some point I will be left with a list of individuals who cannot be paired up. But some of those individuals could possibly be paired with members who are already in the paired list. I could throw one of those partners back into the pool of unpaired members, but this seems to lead to a loop that would be difficult to test and that could run on forever.
Interesting idea for converting a standard search into a probability selection:
Load the history into a structure with O(1) "contains" tests, e.g. a HashSet of (A,B) pairs.
Loop through each of the 0.5*n*(n-1) possible pairings:
    check whether this pairing is in the history
    if it is, continue to the next iteration of the loop
    otherwise, increase the "number found" counter
    save the pairing as "result" with probability 1/"number found" (i.e. always for the first unused pairing found)
Finally, if "result" has an answer then use it; otherwise all possibilities are exhausted.
This will run in O(n^2) + O(size of history), and nicely detects the case when all possibilities are exhausted.
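A C# sketch of that selection (reservoir-style: each unused pairing seen so far is kept with probability 1/count); the method and parameter names are mine:

using System;
using System.Collections.Generic;

static class PairPicker
{
    static readonly Random Rng = new Random();

    // history stores previously used pairs normalized as (smaller, larger).
    public static (int, int)? PickUnusedPairing(IList<int> people, HashSet<(int, int)> history)
    {
        (int, int)? result = null;
        int found = 0;
        for (int i = 0; i < people.Count; i++)
            for (int j = i + 1; j < people.Count; j++)
            {
                var pair = (Math.Min(people[i], people[j]), Math.Max(people[i], people[j]));
                if (history.Contains(pair)) continue;
                found++;
                if (Rng.Next(found) == 0)        // keep with probability 1/found
                    result = pair;
            }
        return result;                           // null means every pairing is used up
    }
}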
Based on your requirements, I think what you really need is quasi-random numbers that ultimately result in uniform coverage of your data (i.e., everyone pairs up with everyone else one time). Quasi-random pairings give you a much less "clumped" result than simple random pairings, with the added benefit that you have a much much greater control of the resulting data, hence you can control the unique pairings rule without having to detect whether the newly randomized pairings duplicate the historically randomized pairings.
Check this wikipedia entry:
http://en.wikipedia.org/wiki/Low-discrepancy_sequence
More good reading:
http://www.johndcook.com/blog/2009/03/16/quasi-random-sequences-in-art-and-integration/
I tried to find a C# library that would help you generate the sort of quasi-random spreads you're looking for, but the only libs I could find were in c/c++. But I still recommend downloading the source since the full logic of the quasi-random algorithms (look for quasi-Monte Carlo) is there:
http://www.gnu.org/software/gsl/
I see this as a graph problem where individuals are nodes and edges join individuals who have not yet been paired. With this reformulation, creating new pairs is simply finding a set of independent edges (edges with no node in common), i.e. a matching.
That is not yet an answer, but there is a good chance that this is a common graph problem with well-known solutions.
One thing we can say at this point is that in some cases there may be no solution (you would have to redo some previous pairs).
It may also be simpler to consider the dual graph (exchanging the roles of edges and nodes: nodes would be pairs, and edges would join pairs that share an individual).
at startup, build a list of all possible pairings.
add all possible new pairings to this list as individuals are added, and remove any expired pairings as individuals are removed from the pool.
select new pairings randomly from this list, and remove them from the list when the pairing is selected.
Form an upper triangular matrix with your individuals:

Individual  A  B  C  D
A           *
B           *  *
C           *  *  *
D           *  *  *  *

Each blank element will contain True if that pair has been formed and False if not.
Each pairing session consists of looping through the columns of each row until a False is found, forming that pair and setting the matrix element to True.
When deleting an individual, delete its row and column.
If performance is an issue, you can keep the last pair formed for each row in a counter, updating it carefully when deleting.
When adding an individual, add a new last row and column.
Your best bet is probably:
Load the history in a structure with fast access e.g. a HashSet of (A,B) pairs.
Create a completely random set of pairings (e.g. by randomly shuffling the list of individuals and partitioning into adjacent pairs)
Check if each pairing is in the history (both (A,B) and (B,A) should be checked)
If none of the pairings are found, you have a completely new pairing set as required, else goto 2
Note that step 1 can be done once and simply updated when new pairings are created if you need to efficiently create large numbers of new unique pairings.
Also note that you will need to take some extra precautions if there is a chance that all possible pairings will be exhausted (in which case you need to bail out of the loop!)
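A C# sketch of this approach, with individuals as ints and pairs normalized as (min, max) so only one orientation has to be stored (names are mine):

using System;
using System.Collections.Generic;
using System.Linq;

static class MeetingPlanner
{
    static readonly Random Rng = new Random();

    // Returns a full set of never-before-seen pairs, or null if it gives up.
    public static List<(int, int)> TryCreateMeeting(
        List<int> people, HashSet<(int, int)> history, int maxAttempts = 1000)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            var shuffled = people.OrderBy(_ => Rng.Next()).ToList();
            var pairs = new List<(int, int)>();
            for (int i = 0; i + 1 < shuffled.Count; i += 2)
                pairs.Add((Math.Min(shuffled[i], shuffled[i + 1]),
                           Math.Max(shuffled[i], shuffled[i + 1])));

            if (pairs.All(p => !history.Contains(p)))
            {
                foreach (var p in pairs) history.Add(p);   // incremental update of the history
                return pairs;
            }
        }
        return null;   // crude bail-out for the "all pairings exhausted" case
    }
}

The maxAttempts cap is one way to implement the "extra precaution" mentioned above when all possible pairings may be exhausted.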
Is there any way of ordering two elements? If so, you can save one hash probe (or at least half of one) per iteration by always storing a pair in the same order. So, if you have A, B, C and D, the generated possible pairings would be [AB, CD], [AC, BD] or [AD, BC].
What I'd do then is something like:
pair_everyone(pool, pairs, history):
    if pool is empty:
        all done: update global history, return pairs
    repeat for pool_size/2:
        pick element1 (randomly from pool)
        pick element2 (randomly from pool)
        set pair = pair(e1, e2)
        until pair not in history or all possible pairs tried:
            pick element1 (randomly from pool)
            pick element2 (randomly from pool)
            set pair = pair(e1, e2)
        if pair is not in history:
            result = pair_everyone(pool - e1 - e2, pairs + pair, history + pair)
            if result != failure:
                return result
        else:
            return failure
How about:
create a set CI of all current individuals
then:
randomly select one individual A and remove from CI
create a new set of possible partners PP by copying CI and removing all previous partners of A
if PP is empty scan the list of pairs found and swap A for an individual C who is paired with someone not in A's history and who still has possible partners in CI. Recalculate PP for A = C.
if PP is not empty select one individual B from PP to be paired with A
remove B from CI
repeat until no new pair can be found
