Algorithm: Removing as few elements as possible from a set in order to enforce no subsets

I got a problem which I do not know how to solve:
I have a set of sets A = {A_1, A_2, ..., A_n} and I have a set B.
The goal is to remove as few elements as possible from B (creating B'), such that after the removal, A_i is not a subset of B' for any 1 <= i <= n.
For example, if we have A_1 = {1,2}, A_2 = {1,3,4}, A_3={2,5}, and B={1,2,3,4,5}, we could e.g. remove 1 and 2 from B (that would yield B'={3,4,5}, which is not a superset of any of the A_i).
Is there an algorithm for determining the (minimal number of) elements to be removed?

It sounds like you want to remove the minimal hitting set of A from B (this is closely related to the vertex cover problem).
A hitting set for some set-of-sets A is itself a set such that it contains at least one element from each set in A (it "hits" each set). The minimal hitting set is the smallest such hitting set. So, if you have an MHS for your set-of-sets A, you have an element from each set in A. Removing this from B means no set in A can be a subset of B.
All you need to do is calculate the MHS for (A1, A2, ... An), then remove that from B. Unfortunately, finding the MHS is an NP-complete problem. Knowing that though, you have a few options:
If your data set is small, do the obvious brute-force solution (sketched after this list)
Use a probabilistic algorithm to get a fast, approximate answer (see this PDF)
Run far, far away in the opposite direction
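For example, the brute-force option could look like this in Python (a sketch, assuming every A_i is non-empty):
```python
from itertools import combinations

def minimum_hitting_set(sets):
    """Brute force: try candidate hitting sets in order of increasing size and
    return the first one intersecting every input set. Exponential in the size
    of the universe, so only for small inputs."""
    universe = set().union(*sets)
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return chosen

A = [{1, 2}, {1, 3, 4}, {2, 5}]
B = {1, 2, 3, 4, 5}
print(B - minimum_hitting_set(A))  # e.g. {3, 4, 5}
```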

If you just need an approximation, start with the smallest set in A and remove one of its elements from B. (You could grab one at random, or pick the element that appears in the most sets in A, depending on how accurate and how fast you need it to be.)
Now the smallest set in A isn't a subset of B. Move on to the next set, but first check whether the sets you're examining are still subsets of B at this point.
This reminds me of the vertex cover problem, and I recall an approximation algorithm for it that is similar to this one.
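A sketch of that greedy idea, picking whichever element of B occurs in the most still-violating sets (a heuristic, not guaranteed minimal):
```python
def greedy_removals(sets, B):
    """Repeatedly remove from B the element appearing in the most sets that
    are still subsets of B, until no set in `sets` is a subset of B."""
    B, removed = set(B), set()
    while True:
        violating = [s for s in sets if s <= B]
        if not violating:
            return removed
        best = max(B, key=lambda e: sum(e in s for s in violating))
        B.discard(best)
        removed.add(best)

print(greedy_removals([{1, 2}, {1, 3, 4}, {2, 5}], {1, 2, 3, 4, 5}))  # e.g. {1, 2}
```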

I think you should find the smallest of the sets A_i and then delete from B the elements contained in that set.

Related

Select the most elements that do not overlap so that the sum of their size is maximized

I'm trying to find an algorithm to the following problem.
Say I have a number of objects A, B, C,...
I have a list of valid combinations of these objects. Each combination is of length 2 or 4.
For eg. AF, CE, CEGH, ADFG,... and so on.
For combinations of two objects, e.g. AF, the length of the combination is 2. For combinations of four objects, e.g. CEGH, the length of the combination is 4.
I can only pick non-overlapping combinations, i.e. I cannot pick AF and ADFG because both require objects 'A' and 'F'. I can pick combinations AF and CEGH because they do not require common objects.
If my solution consists of only the two combinations AF and CEGH, then my objective is the sum of the length of the combinations, which is 2 + 4 = 6.
Given a list of objects and their valid combinations, how do I pick the most valid combinations that don't overlap with each other so that I maximize the sum of the lengths of the combinations? I do not want to formulate it as an IP as I am working with a problem instance with 180 objects and 10 million valid combinations and solving an IP using CPLEX is prohibitively slow. Looking for some other elegant way to solve it. Can I perhaps convert this to a network? And solve it using a max-flow algorithm? Or a Dynamic program? Stuck as to how to go about solving this problem.
My first attempt at showing this problem to be NP-hard was wrong, as it did not take into account the fact that only combinations of size 2 or 4 were allowed. However, using Jim D.'s suggestion to reduce from 3-dimensional matching (3DM), we can show that the problem is nevertheless NP-hard.
I'll show that the natural decision problem form of your problem ("Given a set O of objects, and a set C of combinations of either 2 or 4 objects from O, and an integer m, does there exist a subset D of C such that all sets in D are pairwise disjoint, and the union of all sets in D has size at least m?") is NP-hard. Clearly the optimisation problem (i.e., your original problem, where we seek an actual subset of combinations that maximises m above) is at least as hard as this problem. (To see that the optimisation problem is not "much" harder than the decision problem, notice that you could first find the maximum m value for which a solution exists using a binary search on m in which you solve a decision problem at each step, and then, once this maximal m value has been found, solving a series of decision problems in which each combination in turn is removed: if the solution after removing some particular combination is still "YES", then it may also be left out of all future problem instances, while if the solution becomes "NO", then it is necessary to keep this combination in the solution.)
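For illustration, here is a sketch of that argument in Python, where `decide(combos, m)` stands for a hypothetical black-box solver of the decision problem (combinations are assumed distinct):
```python
def optimise_via_decisions(objects, combos, decide):
    """Recover an optimal solution from a YES/NO decision oracle.
    `decide(combos, m)`: does some pairwise-disjoint subset of `combos`
    cover at least m objects?"""
    # Binary search on m for the largest achievable value.
    lo, hi = 0, len(objects)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if decide(combos, mid):
            lo = mid
        else:
            hi = mid - 1
    # Drop each combination in turn; keep it only if dropping it makes the
    # optimum unreachable. What remains is an optimal solution.
    remaining = list(combos)
    for c in list(remaining):
        trial = [x for x in remaining if x != c]
        if decide(trial, lo):
            remaining = trial
    return lo, remaining
```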
Given an instance (X, Y, Z, T, k) of 3DM, where X, Y and Z are sets that are pairwise disjoint from each other, T is a subset of X*Y*Z (i.e., a set of ordered triples with first, second and third components from X, Y and Z, respectively) and k is an integer, our task is to determine whether there is any subset U of T such that |U| >= k and all triples in U are pairwise disjoint (i.e., to answer the question, "Are there at least k non-overlapping triples in T?"). To turn any such instance of 3DM into an instance of your problem, all we need to do is create a fresh 4-combination from each triple in T, by adding a distinct dummy value to each. The set of objects in the constructed instance of your problem will consist of the union of X, Y, Z, and the |T| dummy values we created. Finally, set m to k.
Suppose that the answer to the original 3DM instance is "YES", i.e., there are at least k non-overlapping triples in T. Then each of the k triples in such a solution corresponds to a 4-combination in the input C to your problem, and no two of these 4-combinations overlap, since by construction their 4th elements are all distinct, and by assumption the triples they came from are pairwise disjoint. Thus there are at least m = k non-overlapping 4-combinations in the instance of your problem, so the solution for that problem must also be "YES".
In the other direction, suppose that the solution to the constructed instance of your problem is "YES", i.e., there are at least m non-overlapping 4-combinations in C. We can simply take the first 3 elements of each of the 4-combinations (throwing away the fourth) to produce a set of k = m non-overlapping triples in T, so the answer to the original 3DM instance must also be "YES".
We have shown that a YES-answer to one problem implies a YES-answer to the other, thus a NO-answer to one problem implies a NO-answer to the other. Thus the problems are equivalent. The instance of your problem can clearly be constructed in polynomial time and space. It follows that your problem is NP-hard.
You can reduce this problem to the maximum weighted clique problem, which is, unfortunately, NP-hard.
Build a graph such that every combination is a vertex with weight equal to the length of the combination, and connect two vertices if the corresponding combinations do not share any object (i.e. if you can pick both of them at the same time). Then, a selection of combinations is valid if and only if it forms a clique in that graph.
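For illustration, a small sketch of that construction in Python (note that with millions of combinations, even materialising this graph becomes expensive):
```python
def compatibility_graph(combinations):
    """Vertices = combinations, weighted by length; an edge joins two
    combinations iff they share no object, so valid selections = cliques."""
    combos = [frozenset(c) for c in combinations]
    weights = [len(c) for c in combos]
    edges = [(i, j)
             for i in range(len(combos))
             for j in range(i + 1, len(combos))
             if combos[i].isdisjoint(combos[j])]
    return weights, edges

weights, edges = compatibility_graph(["AF", "CE", "CEGH", "ADFG"])
# ("AF", "CEGH") is an edge; ("AF", "ADFG") is not, since they share A and F.
```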
A quick Google search brings up a number of approximation algorithms for this problem, such as this one.

How to find a minimal set of keys?

I have a set of keys K and a finite set S ⊂ K^n of n-tuples of keys. Is there an efficient algorithm to find a bijective mapping f : S → S', where S' ⊂ K^k with k < n minimal, that strips some of the keys, leaving the others untouched?
I'm afraid this is NP-complete.
It is equivalent to set cover.
Each of your keys allows you to distinguish certain pairs of elements (i.e. a set of edges). Your task is to select the smallest number of keys that allows you to distinguish every element - i.e. the smallest number of sets of edges that allows you to cover every edge.
However, the wiki page shows an approximate solution based on integer programming that may give a useful solution in practice.
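If an approximation is acceptable in practice, a greedy heuristic in the spirit of greedy set cover is easy to sketch (illustrative Python; `tuples` stands for S as a list of equal-length tuples):
```python
from itertools import combinations

def greedy_key_positions(tuples):
    """Greedy set-cover heuristic for the view above: each tuple position
    'covers' the pairs of tuples it distinguishes; pick positions until every
    pair is distinguished. Approximate, not guaranteed minimal."""
    n = len(tuples[0])
    pairs_left = set(combinations(range(len(tuples)), 2))
    chosen = []
    while pairs_left:
        # Position distinguishing the most still-indistinguished pairs.
        best = max(range(n),
                   key=lambda i: sum(tuples[a][i] != tuples[b][i]
                                     for a, b in pairs_left))
        covered = {(a, b) for a, b in pairs_left
                   if tuples[a][best] != tuples[b][best]}
        if not covered:
            raise ValueError("tuples are not pairwise distinct")
        pairs_left -= covered
        chosen.append(best)
    return chosen  # indices of the key positions to keep
```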
Sketch of Proof
Suppose we have a generic set cover problem:
A,B,C
C,D
A,B,D
where we need to find the smallest number of these sets to cover every element A,B,C,D.
We construct a tuple for each letter A,B,C,D.
The tuple has a unique number in position i if and only if set i contains the letter; otherwise, that position contains 0.
There is also a zero tuple.
This means that the tuples would look like:
(0,0,0) The zero tuple
(1,0,2) The tuple for A (in sets 1 and 3)
(3,0,4) The tuple for B (in sets 1 and 3)
(5,6,0) The tuple for C (in sets 1 and 2)
(0,7,8) The tuple for D (in sets 2 and 3)
If you could solve your problem efficiently, you would then be able to use this mapping to solve set cover efficiently.
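For concreteness, a minimal sketch that builds these tuples from a generic set cover instance:
```python
def tuples_from_set_cover(sets, letters):
    """Build the gadget above: one tuple per letter, plus the zero tuple.
    Position i holds a globally unique number iff set i contains the letter."""
    counter = 1
    rows = [tuple(0 for _ in sets)]  # the zero tuple
    for letter in letters:
        row = []
        for s in sets:
            if letter in s:
                row.append(counter)
                counter += 1
            else:
                row.append(0)
        rows.append(tuple(row))
    return rows

# The example above: reproduces (0,0,0), (1,0,2), (3,0,4), (5,6,0), (0,7,8).
rows = tuples_from_set_cover([{"A", "B", "C"}, {"C", "D"}, {"A", "B", "D"}],
                             "ABCD")
```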

How to code the maximum set packing algorithm?

Suppose we have a finite set S and a list of subsets of S. Then, the set packing problem asks if some k subsets in the list are pairwise disjoint.
The optimization version of the problem, maximum set packing, asks for the maximum number of pairwise disjoint sets in the list.
http://en.wikipedia.org/wiki/Set_packing
So, let `S = {1,2,3,4,5,6,7,8,9,10}`
and `Sa = {1,2,3,4}`
and `Sb = {4,5,6}`
and `Sc = {5,6,7,8}`
and `Sd = {9,10}`.
Then the maximum number of pairwise disjoint sets is 3 (Sa, Sc, Sd).
I could not find any articles about the algorithm involved. Can you shed some light on the same?
My approach:
Sort the sets by size and start from the smallest set. If the next set has no element in common with what we have picked so far, we take it as well and increase the count of chosen sets. Does this sound good to you? Any better ideas?
As hivert pointed out, this problem is NP-hard, so there's no efficient way to do this. However, if your input is relatively small, you can still pull it off. Exponential doesn't mean impossible, after all. It's just that exponential problems become impractical very quickly, as the input size grows. But for something like 25 sets, you can easily brute force it.
Here's one approach. Let's say you have n subsets, called S0, S1, ..., etc. We can try every combination of subsets, and pick the one with maximum cardinality. There are only 2^25 = 33554432 choices, so this is probably reasonable enough.
An easy way to do this is to notice that any non-negative number strictly below 2^N represents a particular choice of subsets. Look at the binary representation of the number, and choose the sets whose indices correspond to the bits that are on. So if the number is 11, the 0th, 1st and 3rd bits are on, and this corresponds to the combination [S0, S1, S3]. Then you just verify that these three sets are in fact disjoint.
Your procedure is as follows:
Iterate i from 0 to 2^N - 1
For each value of i, use the bits that are on to figure out the corresponding combination of subsets.
If those subsets are pairwise disjoint, update your best answer with this combination (i.e., use this if it is bigger than your current best).
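A minimal Python sketch of that procedure, using the simple element-accumulation disjointness check described below:
```python
def max_set_packing(subsets):
    """Try every combination of subsets (one per bitmask of indices) and keep
    the pairwise-disjoint combination containing the most sets.
    Feasible up to N around 25 (faster still in a compiled language)."""
    n = len(subsets)
    best = []
    for mask in range(1 << n):
        chosen = [subsets[i] for i in range(n) if mask >> i & 1]
        seen, ok = set(), True
        for s in chosen:
            if seen & s:          # element already used: not pairwise disjoint
                ok = False
                break
            seen |= s
        if ok and len(chosen) > len(best):
            best = chosen
    return best

S_list = [{1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}]
print(len(max_set_packing(S_list)))  # 3 -- Sa, Sc, Sd
```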
Alternatively, use backtracking to generate your subsets. The two approaches are equivalent, modulo implementation tradeoffs. Backtracking will have some stack overhead, but can cut off entire lines of computation if you check disjointness as you go. For example, if S1 and S2 are not disjoint, then it will never bother with any bigger combinations containing those two, saving some time. The iterative method can't optimize itself in this way, but is fast and efficient because of the bitwise operations and tight loop.
The only nontrivial matter here is how to check if the subsets are pairwise disjoint. There are all sorts of tricks you can pull here as well, depending on the constraints.
A simple approach is to start with an empty set structure (pick whatever you want from the language of your choice) and add elements from each subset one by one. If you ever hit an element that's already in the set, then it occurs in at least two subsets, and you can give up on this combination.
If the original set S has m elements, and m is relatively small, you can map each of them to the range [0, m-1] and use bitmasks for each set. So if m <= 64, you can use a Java long to represent each subset. Turn on all the bits that correspond to the elements in the subset. This allows blazing fast set operations, because of the speed of bitwise operations. Bitwise AND corresponds to set intersection, and bitwise OR is a union. You can check if two subsets are disjoint by seeing if the intersection is empty (i.e., ANDing the two bitmasks gives you 0).
If you don't have so few elements, you can still avoid repeating the set intersections multiple times. You have very few sets, so precompute which ones are disjoint at the start. You can just store a boolean matrix D, such that D[i][j] = true iff i and j are disjoint. Then you just look up all pairs in a combination to verify pairwise disjointness, rather than doing real set operations.
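For example, a sketch of the bitmask encoding and the precomputed disjointness matrix:
```python
def to_bitmasks(subsets, universe):
    """Encode each subset as an integer: bit j is on iff the j-th element of
    the (sorted) universe is in the subset."""
    index = {e: j for j, e in enumerate(sorted(universe))}
    return [sum(1 << index[e] for e in s) for s in subsets]

masks = to_bitmasks([{1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}],
                    range(1, 11))
# AND == 0 means the intersection is empty, i.e. the two sets are disjoint.
D = [[(a & b) == 0 for b in masks] for a in masks]
```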
You can solve the set packing problem by searching for a maximum independent set. You encode your problem as follows:
for each set you put a vertex
you put an edge between two vertices if the corresponding sets share a common number.
Then you want a maximum set of vertices, no two of which are adjacent. Unfortunately, this is an NP-hard problem; every known algorithm is exponential.

Algorithm determining the smallest coprime subset

Given a set A of n positive integers, determine a non-empty subset B
consisting of as few elements as possible such that their GCD is 1 and output its size.
For example (the first number is the count of elements that follow):
5 6 10 12 15 18
yields an output of "3" (e.g. the subset {6, 10, 15}), while:
5 2 4 6 8 10
yields "NONE", since every element of {2, 4, 6, 8, 10} is even, so no subset can have GCD 1.
So it seems really basic but I'm still stuck with it. My thoughts on it are as follows: multiples of a number already present in the set are useless, since any subset containing such a multiple k*n_i does at least as well with n_i in its place, and we're going for the smallest subset. Hence, for every n_i, we remove any k*n_i (k a positive integer) from further calculations.
That's where I get stuck, though. What should I do next? I can only think of a dumb brute-force approach: checking whether some 2-element subset works, then 3-element subsets, and so on. What should I check to determine it in some cleverer way?
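To make that concrete, the dumb brute force I have in mind is something like:
```python
from functools import reduce
from itertools import combinations
from math import gcd

def smallest_coprime_subset_size(A):
    """Try all subsets in order of increasing size and return the size of the
    first one whose GCD is 1."""
    if reduce(gcd, A) != 1:
        return None  # "NONE": even the whole set has GCD > 1
    for size in range(1, len(A) + 1):
        for subset in combinations(A, size):
            if reduce(gcd, subset) == 1:
                return size

print(smallest_coprime_subset_size([6, 10, 12, 15, 18]))  # 3
print(smallest_coprime_subset_size([2, 4, 6, 8, 10]))     # None
```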
Suppose for each pair of elements A, B we calculate their greatest common divisor D, and we store these D values somewhere as a map of the form:
A,B -> D
Let's say we also store the reverse map:
D -> A,B
If there's at least one D = 1, then there we go - the answer is 2.
Suppose now there's no D with D = 1. What condition should be met for the answer to be 3? I think this one: there exist two D values, say D1 and D2, such that GCD(D1, D2) = 1. Right?
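For what it's worth, the answer-is-2 test is cheap; a sketch:
```python
from itertools import combinations
from math import gcd

def answer_is_two(A):
    """True iff some pair of elements is coprime: O(n^2) GCD computations."""
    return any(gcd(a, b) == 1 for a, b in combinations(A, 2))
```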
So now, instead of As and Bs, we've transformed our problem into the same problem over the set of all Ds, and we've turned the question "is the answer 2?" into the question "is the answer 3?". Right? I am not 100% sure, just thinking out loud.
But this transformed problem is even worse, as we have to store many more values: the number of 2-element combinations of the N elements.
Not sure - this problem you pose seems like a hard problem to me. I would be surprised if there exists a better approach than brute force, and I would be interested to know it.
What you need to think about (and look for) is this: is there a way to express GCD(a1, a2, ..., aN) if you know the pairwise GCDs? If there is some such method or formula, you can simplify your search for the smallest subset matching the desired criterion a bit.
See also this link. Maybe it could help.
https://cs.stackexchange.com/questions/10249/finding-the-size-of-the-smallest-subset-with-gcd-1
The problem is definitely a tough one to solve. I can't see any computationally efficient algorithm that would be guaranteed to find the solution in reasonable time.
One approach is:
Form a list of ordered sets that would contain the prime factors of each element in the original set.
Now you need to find the minimum number of these sets whose intersection is empty.
To do that, first order the sets in your list so that the sets having the fewest intersections with other sets come towards the beginning. Now what does "fewest intersections" mean?
This is where heuristics come into play. It can be:
1. the set with the smallest minimum number of intersections with the other sets;
2. the set with the smallest maximum number of intersections with the other sets;
3. any other, more suitable definition.
Now you will need to iterate, expensively, through the combinations, perhaps via recursion, to determine the solution.
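A sketch of the first two steps, with the ordering heuristic left out (so it degenerates to smallest-first brute force over the factor sets):
```python
from itertools import combinations

def prime_factors(n):
    """Set of prime factors of n by trial division (fine for small inputs)."""
    factors, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors

def smallest_coprime_subset_via_factors(A):
    """Search subsets of factor sets, smallest first, for an empty common
    intersection (GCD 1 <=> no shared prime factor)."""
    factor_sets = [prime_factors(a) for a in A]
    for size in range(1, len(A) + 1):
        for combo in combinations(factor_sets, size):
            if not set.intersection(*combo):
                return size
    return None  # no subset has GCD 1

print(smallest_coprime_subset_via_factors([6, 10, 12, 15, 18]))  # 3
```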

How to find the subset with the greatest number of items in common?

Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as input (for example {a, c, d, e}) and finds the known set that has the highest number of elements without containing any items outside the input; in other words, the known set of greatest cardinality that is a subset of the input. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set and m is the number of 'known' sets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets, calculate the cardinality of the intersection with the input, and take the one with the highest value. This would be O(n), where n is the number of known sets.
Candidate 2:
Create an inverse table and calculate the Euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
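For concreteness, Candidate 3 would be something like:
```python
def best_known_subset(known_sets, query):
    """Linear scan over the known sets, keeping the largest one that is
    contained in the query."""
    best = None
    for s in known_sets:
        if s <= query and (best is None or len(s) > len(best)):
            best = s
    return best

known = [{'a','b','c','d','e'}, {'b','c','d','e'}, {'a','c','d'}, {'c','d'}]
print(best_known_subset(known, {'a', 'c', 'd', 'e'}))  # {'a', 'c', 'd'}
```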
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
http://packages.python.org/bitstring
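A minimal sketch of those steps, using plain Python integers in place of the bitstring package:
```python
def build_bitstrings(known_sets):
    """One-time preprocessing: give every known element a bit position, then
    encode each known set as a plain Python int."""
    bit = {e: i for i, e in enumerate(sorted(set().union(*known_sets)))}
    masks = [sum(1 << bit[e] for e in s) for s in known_sets]
    return bit, masks

def best_match(bit, masks, query):
    """AND each known bit string with the query's and keep the index of the
    one with the most 1 bits. (To require a strict subset instead, also check
    `masks[i] & q == masks[i]`.)"""
    q = sum(1 << bit[e] for e in query if e in bit)
    return max(range(len(masks)), key=lambda i: bin(masks[i] & q).count("1"))
```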
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match, and then the parent thread picks the best from the threads.
How many searches are you making? If you are searching with multiple input sets, you should be able to pre-process all the known sets (perhaps as a tree structure), and your search time for each query would be on the order of the query set size.
E.g.: create a trie structure containing all the known sets, making sure to sort each set before inserting it. For a query, follow only the links whose labels are in the query set.
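A rough sketch of such a trie, with dicts as nodes (the "$end" terminal marker is an assumption of this sketch, chosen not to collide with element names):
```python
def build_trie(known_sets):
    """Insert each known set, sorted, as a path; mark terminal nodes with the
    set's size."""
    root = {}
    for s in known_sets:
        node = root
        for e in sorted(s):
            node = node.setdefault(e, {})
        node["$end"] = len(s)  # terminal marker: cardinality of the stored set
    return root

def largest_subset_size(root, query):
    """DFS that only follows elements present in the query, so every terminal
    reached corresponds to a known set contained in the query."""
    best = 0
    stack = [root]
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key == "$end":
                best = max(best, child)
            elif key in query:
                stack.append(child)
    return best
```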
