Suppose we have a finite set S and a list of subsets of S. Then, the set packing problem asks if some k subsets in the list are pairwise disjoint.
The optimization version of the problem, maximum set packing, asks for the maximum number of pairwise disjoint sets in the list.
So, let S = {1,2,3,4,5,6,7,8,9,10}
and `Sa = {1,2,3,4}`
and `Sb = {4,5,6}`
and `Sc = {5,6,7,8}`
and `Sd = {9,10}`
Then the maximum number of pairwise disjoint sets is 3 (Sa, Sc, Sd).
I could not find any articles about the algorithm involved. Can you shed some light on the same?
My approach:
Sort the sets according to the size. Start from the set of the smallest size. If no element of the next set intersects with the current set, then we unite the set and increase the count of maximum sets. Does this sound good to you? Any better ideas?

As hivert pointed out, this problem is NP-hard, so there's no known efficient (polynomial-time) way to do this. However, if your input is relatively small, you can still pull it off. Exponential doesn't mean impossible, after all. It's just that exponential algorithms become impractical very quickly as the input size grows. But for something like 25 sets, you can easily brute force it.
Here's one approach. Let's say you have N subsets, called S0, S1, ..., S(N-1). We can try every combination of subsets, and pick the largest one whose sets are pairwise disjoint. For N = 25 there are only 2^25 = 33554432 choices, so this is probably reasonable enough.
An easy way to do this is to notice that any non-negative number strictly below 2^N represents a particular choice of subsets. Look at the binary representation of the number, and choose the sets whose indices correspond to the bits that are on. So if the number is 11, the 0th, 1st and 3rd bits are on, and this corresponds to the combination [S0, S1, S3]. Then you just verify that these three sets are in fact disjoint.
Your procedure is as follows:
Iterate i from 0 to 2^N - 1
For each value of i, use the bits that are on to figure out the corresponding combination of subsets.
If those subsets are pairwise disjoint, update your best answer with this combination (i.e., use this if it is bigger than your current best).
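The procedure above can be sketched roughly as follows. This is only an illustration: the function name is made up, the subsets are assumed to be Python sets, and the disjointness check is the simple "collect elements into one set" idea rather than anything optimized. Pure Python will be slow for 2^25 combinations, so treat it as a reference version.

```python
def max_set_packing_bruteforce(subsets):
    """Try every combination of subsets, encoded as the bits of i."""
    n = len(subsets)
    best = []
    for i in range(1 << n):
        # Bit j of i being on means "subsets[j] is in this combination".
        chosen = [subsets[j] for j in range(n) if (i >> j) & 1]
        if len(chosen) <= len(best):
            continue  # cannot improve on the current best
        seen = set()
        disjoint = True
        for s in chosen:
            if seen & s:  # some element already occurred in an earlier set
                disjoint = False
                break
            seen |= s
        if disjoint:
            best = chosen
    return best


# The example from the question.
Sa, Sb, Sc, Sd = {1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}
packing = max_set_packing_bruteforce([Sa, Sb, Sc, Sd])
```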
Alternatively, use backtracking to generate your subsets. The two approaches are equivalent, modulo implementation tradeoffs. Backtracking will have some stack overhead, but can cut off entire lines of computation if you check disjointness as you go. For example, if S1 and S2 are not disjoint, then it will never bother with any bigger combinations containing those two, saving some time. The iterative method can't optimize itself in this way, but is fast and efficient because of the bitwise operations and tight loop.
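A minimal backtracking sketch, under the same assumptions (subsets given as Python sets, hypothetical function name). It keeps the union of the sets chosen so far and skips any set that overlaps it, which is the "cut off entire lines of computation" part.

```python
def max_set_packing_backtracking(subsets):
    best = []

    def extend(start, chosen, used):
        nonlocal best
        if len(chosen) > len(best):
            best = list(chosen)
        for j in range(start, len(subsets)):
            if used & subsets[j]:
                continue  # overlaps something already chosen: prune this branch
            chosen.append(subsets[j])
            extend(j + 1, chosen, used | subsets[j])
            chosen.pop()

    extend(0, [], set())
    return best
```

Every combination that gets extended is already pairwise disjoint, so no separate verification step is needed.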
The only nontrivial matter here is how to check if the subsets are pairwise disjoint. There are all sorts of tricks you can pull here as well, depending on the constraints.
A simple approach is to start with an empty set structure (pick whatever you want from the language of your choice) and add elements from each subset one by one. If you ever hit an element that's already in the set, then it occurs in at least two subsets, and you can give up on this combination.
If the original set S has m elements, and m is relatively small, you can map each of them to the range [0, m-1] and use bitmasks for each set. So if m <= 64, you can use a Java long to represent each subset. Turn on all the bits that correspond to the elements in the subset. This allows blazing fast set operation, because of the speed of bitwise operations. Bitwise AND corresponds to set intersection, and bitwise OR is a union. You can check if two subsets are disjoint by seeing if the intersection is empty (i.e., ANDing the two bitmasks gives you 0).
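Here is a small sketch of the bitmask idea in Python. The helper names and the index_of mapping are assumptions for illustration. Python integers are arbitrary precision, so the 64-element limit of a Java long does not apply, although very wide masks are no longer single machine words.

```python
def to_bitmask(subset, index_of):
    """Turn on one bit per element, using index_of to map elements to 0..m-1."""
    mask = 0
    for element in subset:
        mask |= 1 << index_of[element]
    return mask


def are_disjoint(mask_a, mask_b):
    # Bitwise AND is set intersection; an empty intersection is 0.
    return (mask_a & mask_b) == 0


universe = range(1, 11)
index_of = {element: i for i, element in enumerate(universe)}
mask_a = to_bitmask({1, 2, 3, 4}, index_of)
mask_b = to_bitmask({4, 5, 6}, index_of)
mask_c = to_bitmask({5, 6, 7, 8}, index_of)
```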
If you don't have so few elements, you can still avoid repeating the set intersections multiple times. You have very few sets, so precompute which ones are disjoint at the start. You can just store a boolean matrix D, such that D[i][j] = true iff i and j are disjoint. Then you just look up all pairs in a combination to verify pairwise disjointness, rather than doing real set operations.

You can solve the set packing problem by searching for a maximum independent set. Encode your problem as follows:
for each set, add a vertex
add an edge between two vertices if their sets share a common element.
Then you want a maximum set of vertices in which no two are adjacent. Unfortunately this is an NP-hard problem, and every known exact algorithm takes exponential time in the worst case.


Bloom filters for determining which sets in a family are subsets of a given set

I am trying to use a Bloom filter to determine which sets from a family of sets A1, A2,...,Am are subsets of another fixed set Q. I am hoping that someone can verify the correctness of the stated approach or offer any improvements.
Let Q be a given set of integers, containing anywhere from 1-10000 elements from the universe set U = {1,2,...,10000}.
Also, let there be a family of sets A1, A2,...,Am each containing anywhere from 1-3 elements from the same universe set U. The size m is on the order of 5000.
Outline of algorithm:
Let there be a collection of k hash functions. For each element of Q apply the hash functions and add it to a bitset of size n, denoted Q_b.
Also, for each of the Ai, i = 1,...,m sets, apply the hash functions to each element of Ai, generating the bitset (also of size n), denoted Ai_b.
To check if Ai is a subset of Q, perform a logical AND on the two bitsets, Q_b & Ai_b, and check if it is equal to the bitset Ai_b. That is, if Q_b & Ai_b == Ai_b is false, then we know that Ai is not a subset of Q; if it is true, then we do not know for sure (possibility of a false positive) and we need to check the given Ai using a deterministic approach.
The hope is that the filter tells us the majority of the Ai's that are not in Q and we can check the ones that return true more carefully.
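For concreteness, the outline could look something like the sketch below. The bitset size, the number of hash functions, and the use of seeded SHA-256 as the hash family are all placeholder assumptions, not recommendations. The exact check a <= q stands in for the "deterministic approach".

```python
import hashlib

N_BITS = 4096   # assumed bitset size n
K_HASHES = 3    # assumed number of hash functions k


def bloom_bits(elements, n_bits=N_BITS, k=K_HASHES):
    """Return the bitset (as an int) with k hashed positions set per element."""
    bits = 0
    for x in elements:
        for seed in range(k):
            digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % n_bits)
    return bits


def maybe_subset(a_bits, q_bits):
    # False means "definitely not a subset"; True may be a false positive.
    return (a_bits & q_bits) == a_bits


def subsets_of(q, family):
    q_bits = bloom_bits(q)
    # Filter first, then confirm the survivors exactly.
    return [a for a in family if maybe_subset(bloom_bits(a), q_bits) and a <= q]
```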
Is this a good approach for my problem?
(Side questions: How big should n be? What are some good hash functions to use?)
If the range of values is rather small (as in your example), you can use a simple deterministic solution with linear time complexity.
Let's create an array was (with indices from 1 to 10000, that is, one cell for each element of the universal set), initially filled with false values.
For each element q of Q, we set was[q] = true.
Now we iterate over all sets of the family. For each set A_i, we iterate over all elements x of the set and check if was[x] is true. If it's not for at least one x, then A_i is not a subset of Q. Otherwise, it is.
This solution is clearly correct as it checks if one set is a subset of the other by definition. It's also rather simple and deterministic. The only potential downside it has is that it requires an auxiliary array of 10000 elements, but it looks admissible for most practical purposes (a bloom filter would require some extra space too, anyway).
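A short sketch of this array-based check, with a hypothetical function name and the universe size taken from the question:

```python
def subsets_of_q(q, family, universe_size=10000):
    # was[x] is True exactly when x is an element of Q (index 0 is unused).
    was = [False] * (universe_size + 1)
    for x in q:
        was[x] = True
    # A_i is a subset of Q iff every one of its elements is marked.
    return [a for a in family if all(was[x] for x in a)]
```

With sets of at most three elements each, the whole pass costs at most three array lookups per set.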
Please try to ask only one question in your question.
I will address the first one: "Is this a good approach for my problem?", but not the last two, "How big should n be? What are some good hash functions to use?"
This is probably not a good approach.
First, Q is tiny; 10,000 elements from {1,...,10k} means Q can be stored with a bitset in 10k bits or about 1.2 kibibytes. That is very, very small. For instance, it is smaller than your question, which uses almost 1.5 kibibytes.
Second, Ai contains one to three elements, so Ai_b will likely be larger than Ai unless you choose the bitsets to be so small that the false positive rate is very high.
Finally, hash function computation is not free.
You can do this much more simply if you check each element of each Ai against a bitset representing Q.

Is there a common algorithm for this set content comparison? What is this called?

I'm coding in python.
I have a list of sets. These sets contain integers. If two sets share an integer item, they are "connected". My goal is to determine whether all of these sets are mutually connected into a single group (as opposed to no connected sets or multiple groups of mutually connected sets).
Is there a common algorithm for this? It seems like a widely applicable goal.
This is my proposed solution:
start with first set and check if contents are shared with any other set
delete any set with shared content and add other contents to first set
repeat until no change to first set
if all other sets have been deleted, then they are all connected
I want to distinguish one mutually connected chain of sets
from separate groups of mutually connected sets
o--o--o o--o--o
So, simply checking if each set is connected to another set is not enough.
Your solution is correct, and is a variant of DFS (though since you manipulate the sets it might be a bit inefficient)
Your problem is basically a graph problem, where the graph is:
G = (V,E)
V = { sets } = {S1, S2, ..., Sn}
E = { (Si,Sj) | Si and Sj share an integer }
This graph is undirected by nature, and your problem is finding whether it is connected. This can be done by BFS or DFS. Just start from one arbitrary vertex and run until you are "stuck" (without restarting from a new source). If, when that happens, you have "discovered" all sets, the graph is connected. Otherwise, it is not.
Run time is O(|V|+|E|), where |V| is the number of sets you have, and |E| is the number of connections.
Note: The set E can be calculated efficiently for sparse graphs by creating an inverted index. For each number, create a list of all the sets that contain this number (this is linear in the size of the input), and then generate the edges by going through all pairs in the list (for sparse graphs this should be fairly small).
For dense graphs, though, a more efficient way to generate it will probably be to just go through all pairs of sets.
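Since the asker is using Python, here is one possible sketch of the BFS. It uses the inverted index directly to find neighbours instead of materializing the edge list first, which is a small variation on what is described above. The function name is made up, and treating an empty list as connected is an arbitrary choice.

```python
from collections import defaultdict, deque


def all_connected(sets):
    if not sets:
        return True  # assumption: an empty collection counts as connected
    # Inverted index: integer -> indices of the sets that contain it.
    containing = defaultdict(list)
    for idx, s in enumerate(sets):
        for value in s:
            containing[value].append(idx)
    seen = {0}
    queue = deque([0])
    while queue:
        current = queue.popleft()
        for value in sets[current]:
            # pop() so that each integer's list is expanded only once.
            for neighbour in containing.pop(value, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return len(seen) == len(sets)
```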
Here's what I would try:
go through each set pair.
compare each of the set members:
if there is a common integer, then leave one set and continue with another; continue like this until there are no more sets.
If there aren't any members shared, then break and output false.
If the possible integers are known and their number has a small upper bound like 32 you can represent each set as vector of bits and apply bitwise and like this: x(n) = x(n-1) & s(n) with s(n) being the n-th set and & being bitwise and. If all bits of x(n) are zero for any n you will know that there are multiple groups of sets. The time complexity of this approach is linear and uses operations that current hardware can execute very efficiently.
This and any other solution can be extended by the following check. It has to be applied before the original solution. The idea is to be finished quickly in some cases. The check requires that the minimal and maximal integer of each set are known. If the greatest of all these minimum integers is greater than the smallest of all maximal integers you will know that there are multiple groups of sets. So, in this case you can finish. If the condition is not true you will have to continue with the original solution.

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e Generators) about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size of the set n
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
1. Sort the n integers (for millions of integers, i.e. 10s to 100s of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total offset.
2. Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
3. On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
4. Restrict your set of n integers to those less than or equal to this bound.
5. Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
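A rough Python sketch of the sampling loop, to make the steps concrete. The function and parameter names are invented, has_property stands in for P, and sampling without replacement via random.sample is my own choice rather than the "mod n" sampling mentioned above. It only accepts matches that lower the best sum found so far.

```python
import bisect
import random


def shrink_by_sampling(values, k, has_property, max_misses=1000, rng=random):
    pool = sorted(values)
    offset = sum(pool[:k - 1])  # sum of the k-1 smallest values
    best, best_sum = None, None
    misses = 0
    while misses < max_misses and len(pool) >= k:
        candidate = rng.sample(pool, k)
        total = sum(candidate)
        if has_property(candidate) and (best_sum is None or total < best_sum):
            best, best_sum = sorted(candidate), total
            # No k-subset with sum <= total can contain an element above this.
            bound = total - offset
            pool = pool[:bisect.bisect_right(pool, bound)]
            misses = 0
        else:
            misses += 1
    return best
```

The result is the best match seen, not a proven minimum. The loop just gives up after max_misses consecutive samples without an improvement.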
Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds, or about 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant fraction of the (2^20 choose 5) subsets, roughly 10^28 of them, before you find one that satisfies your condition. On a modern 100k MIPS CPU (10^11 instructions per second), assuming just 1 instruction can compute a set and evaluate that condition, going through that entire set would still take roughly 3 billion years. So even if you only need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
min sum_i w_i*x_i
subject to
sum_i x_i = k
x_i binary
(some constraints on w_i*x_i)
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (Example: max-flow/min-cut problem.) Also, you can use the approach of column generation to find a solution since you may have a very large number of values that cannot be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to the positions in the sorted list, where the maximum index is some number m, and some other index vector v', with some max index m' > m. The smallest sum for all such vectors v' is always greater than the smallest sum for all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
    for v = 5-element subsets of (1, ..., i)
        set = arr{v}
        if condition(set) is satisfied
            break_loop = true
            compute sum(set), keep set if it is the best so far
    break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
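In Python, the loop could be sketched as below. The names are hypothetical and condition is whatever test you need. To avoid re-checking combinations, it only enumerates subsets whose largest index is exactly i, which covers the same subsets as the prefix loop in the pseudocode. Like the pseudocode, it stops at the first prefix that contains a match, so the result is only approximately the smallest sum.

```python
from itertools import combinations


def first_prefix_match(values, k, condition):
    arr = sorted(values)
    for i in range(k - 1, len(arr)):
        best = None
        # Subsets made of arr[i] plus k-1 elements from before position i.
        for rest in combinations(arr[:i], k - 1):
            subset = rest + (arr[i],)
            if condition(subset) and (best is None or sum(subset) < sum(best)):
                best = subset
        if best is not None:
            return best  # stop at the first prefix containing a match
    return None
```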
This looks to be a perfect candidate for map-reduce. If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
Complete sort may not really be needed as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated / reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

finding max value on each subset

I'm banging my head here. Let X={x1,x2,...,xn} be a set of integers. Let A1,A2,...,Am be m subsets of X. For any i and j, Ai and Aj are not necessarily disjoint. The goal is to find the maximal value in each Ai (i=1,...,m) efficiently, with as few operations as possible.
For example, given X={2,4,6,3,1}, and its subsets A1={2,3,1}, A2={2,6,3,1}, A3={4,2,3,1}. We need to find Max{A1}, Max{A2}, Max{A3}, respectively.
The brute-force way for finding Max{A1}, Max{A2}, Max{A3} is to scan all the elements in each Ai, and (m*d) operations are required, with m the number of subsets of X, and d the average length of the subsets {Ai} of X.
Now, I have some observations:
(1) For any set Y⊆X, max{Y}≤max{X},
For instance, since Max{X}=6 and 6 is in A2, then Max{A2}=6 can be found directly.
(2) For any two sets A and B, if A∩B is non-empty, Max{A} and Max{B} can be identified as follows:
First, we find the maximum of the common part of A and B, denoted as c=max{A∩B}.
Then, we find Max{A}=Max{Max{A-(A∩B)}, c} and Max{B}=Max{Max{B-(A∩B)}, c}.
I am not sure whether there are other interesting observations for finding these max values.
Any ideas are warmly welcome!
My question is: in the general case, where X={x1,x2,...,xn} and there are m subsets of X, denoted A1,A2,...,Am, are there more efficient techniques for finding the max values Max{Ai} (i=1,...,m)?
Your help will be highly appreciated!
There is no method asymptotically better than brute force, assuming a typical representation of the given sets. Simply scanning through the sets to find the largest member of each requires linear time and linear time is optimal since every member of the set must be read in order to determine the maximum value.
Now if the input representation is not simply a listing of the elements in each set, then other bounds and algorithms may apply. For example, if we know the input sets are sorted and the length of each set is given as part of the input, we can obviously find the maximum elements in time linear only in the number of subsets and not in their length.
If your sets are implemented in a hash (or, more generally, if you can otherwise check for the presence of a value in the set in O(1) time) you can improve on a brute-force approach.
Instead of iterating through the elements of the subset and maintaining the maximum, iterate over the elements of the parent set in descending order, checking for the presence of those elements in the subset. The first found element is necessarily the subset's maximum. Technically, this still takes O(n) time (n = subset cardinality) in the general case, but will generally carry a great performance benefit in practice. (If you have any data regarding the number and size of the subsets, and they favor this approach, you can improve on O(n) in the average case.)
This approach requires sorting of the parent set's elements (n log n), however, so it may only be worthwhile if the number of subsets is much greater than the cardinality of the parent set.
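As a sketch of this descending scan (hypothetical function name, subsets assumed to be Python sets so that membership tests are O(1) on average):

```python
def subset_maxima(parent, subsets):
    # Sort the parent set once, largest first.
    ordered = sorted(parent, reverse=True)
    maxima = []
    for subset in subsets:
        # The first parent element found in the subset is its maximum;
        # None is returned for an empty subset.
        maxima.append(next((x for x in ordered if x in subset), None))
    return maxima


# The example from the question.
X = [2, 4, 6, 3, 1]
result = subset_maxima(X, [{2, 3, 1}, {2, 6, 3, 1}, {4, 2, 3, 1}])
```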

How to find the subset with the greatest number of items in common?

Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as an input (for example {a, c, d, e}) and finds the known set that has the highest number of elements in common with it and no other elements. In other words, the known set with the greatest cardinality that is a subset of the input. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set (which has 2^n subsets) and m is the number of 'known' sets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets and calculate the cardinality of the intersection, and take the one with the highest value. This would be O(n) where n is the number of known sets.
Candidate 2:
Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own portion to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.
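A single-threaded sketch of the bit-string steps in Python, with invented helper names. One deliberate difference from the description: because the question asks for a known set that is a subset of the input, this version only counts a known set when ANDing it with the input leaves it unchanged, instead of taking the largest raw intersection.

```python
def build_index(known_sets):
    """One-time step: fix a bit position per known element, precompute masks."""
    elements = sorted(set().union(*known_sets))
    position = {e: i for i, e in enumerate(elements)}
    masks = [sum(1 << position[e] for e in s) for s in known_sets]
    return position, masks


def largest_known_subset(query, known_sets, position, masks):
    # Elements unknown to every known set cannot affect the subset test.
    query_mask = sum(1 << position[e] for e in query if e in position)
    best, best_size = None, -1
    for s, mask in zip(known_sets, masks):
        if (mask & query_mask) == mask:      # s is a subset of the query
            size = bin(mask).count("1")      # number of '1' bits
            if size > best_size:
                best, best_size = s, size
    return best


known = [{"a", "b", "c", "d", "e"}, {"b", "c", "d", "e"}, {"a", "c", "d"}, {"c", "d"}]
position, masks = build_index(known)
answer = largest_known_subset({"a", "c", "d", "e"}, known, position, masks)
```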
How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.
Eg: Create a Trie structure with all the known sets. Make sure to sort each set before inserting them. For the query, follow the links that are in the set.
