Select the most non-overlapping combinations so that the sum of their sizes is maximized - algorithm

I'm trying to find an algorithm for the following problem.
Say I have a number of objects A, B, C,...
I have a list of valid combinations of these objects. Each combination is of length 2 or 4.
For example: AF, CE, CEGH, ADFG, and so on.
For a combination of two objects, e.g. AF, the length of the combination is 2; for a combination of four objects, e.g. CEGH, the length is 4.
I can only pick non-overlapping combinations, i.e. I cannot pick AF and ADFG because both require objects 'A' and 'F'. I can pick combinations AF and CEGH because they do not require common objects.
If my solution consists of only the two combinations AF and CEGH, then my objective is the sum of the length of the combinations, which is 2 + 4 = 6.
Given a list of objects and their valid combinations, how do I pick the most valid combinations that don't overlap with each other so that I maximize the sum of the lengths of the combinations? I do not want to formulate it as an IP, since I am working with a problem instance with 180 objects and 10 million valid combinations, and solving the IP using CPLEX is prohibitively slow. I'm looking for some other, more elegant way to solve it. Can I perhaps convert this to a network and solve it using a max-flow algorithm? Or a dynamic program? I'm stuck as to how to go about solving this problem.

My first attempt at showing this problem to be NP-hard was wrong, as it did not take into account the fact that only combinations of size 2 or 4 were allowed. However, using Jim D.'s suggestion to reduce from 3-dimensional matching (3DM), we can show that the problem is nevertheless NP-hard.
I'll show that the natural decision problem form of your problem ("Given a set O of objects, and a set C of combinations of either 2 or 4 objects from O, and an integer m, does there exist a subset D of C such that all sets in D are pairwise disjoint, and the union of all sets in D has size at least m?") is NP-hard. Clearly the optimisation problem (i.e., your original problem, where we seek an actual subset of combinations that maximises m above) is at least as hard as this problem. (To see that the optimisation problem is not "much" harder than the decision problem, notice that you could first find the maximum m value for which a solution exists using a binary search on m in which you solve a decision problem at each step, and then, once this maximal m value has been found, solving a series of decision problems in which each combination in turn is removed: if the solution after removing some particular combination is still "YES", then it may also be left out of all future problem instances, while if the solution becomes "NO", then it is necessary to keep this combination in the solution.)
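For concreteness, here is a minimal sketch of the binary-search step described above, assuming a hypothetical black-box `decide(objects, combinations, m)` that answers the decision problem (the oracle itself is, of course, the hard part and is not provided):

```python
def max_feasible_m(objects, combinations, decide):
    # Binary search for the largest m for which the answer is YES.
    # decide(objects, combinations, m) -> bool is a hypothetical oracle
    # for the decision problem; it is an assumption of this sketch.
    lo, hi = 0, len(objects)          # the union can cover at most |objects|
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if decide(objects, combinations, mid):
            lo = mid                  # YES: try a larger m
        else:
            hi = mid - 1              # NO: m is too large
    return lo
```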
Given an instance (X, Y, Z, T, k) of 3DM, where X, Y and Z are sets that are pairwise disjoint from each other, T is a subset of X*Y*Z (i.e., a set of ordered triples with first, second and third components from X, Y and Z, respectively) and k is an integer, our task is to determine whether there is any subset U of T such that |U| >= k and all triples in U are pairwise disjoint (i.e., to answer the question, "Are there at least k non-overlapping triples in T?"). To turn any such instance of 3DM into an instance of your problem, all we need to do is create a fresh 4-combination from each triple in T, by adding a distinct dummy value to each. The set of objects in the constructed instance of your problem will consist of the union of X, Y, Z, and the |T| dummy values we created. Finally, set m to 4k (k disjoint 4-combinations cover exactly 4k objects).
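As a sketch (the function and variable names are mine), the construction itself is straightforward:

```python
def construct_instance(X, Y, Z, T, k):
    # Turn a 3DM instance (X, Y, Z, T, k) into an instance of the
    # combination-picking problem: each triple gets a fresh dummy object.
    dummies = [("dummy", i) for i in range(len(T))]
    objects = set(X) | set(Y) | set(Z) | set(dummies)
    combinations = [frozenset((x, y, z, d))
                    for (x, y, z), d in zip(T, dummies)]
    m = 4 * k          # the union of k disjoint 4-combinations has size 4k
    return objects, combinations, m
```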
Suppose that the answer to the original 3DM instance is "YES", i.e., there are at least k non-overlapping triples in T. Then each of the k triples in such a solution corresponds to a 4-combination in the input C to your problem, and no two of these 4-combinations overlap, since by construction their 4th elements are all distinct, and by assumption the original triples are pairwise disjoint. Thus there are at least k non-overlapping 4-combinations in the instance of your problem, and their union has size 4k = m, so the solution for that problem must also be "YES".
In the other direction, suppose that the solution to the constructed instance of your problem is "YES", i.e., there is a set of pairwise disjoint 4-combinations in C whose union has size at least m = 4k. Since each 4-combination contributes exactly 4 elements to that union, there are at least k such 4-combinations. We can simply take the first 3 elements of each of these 4-combinations (throwing away the dummy) to produce a set of at least k non-overlapping triples in T, so the answer to the original 3DM instance must also be "YES".
We have shown that a YES-answer to one problem implies a YES-answer to the other, thus a NO-answer to one problem implies a NO-answer to the other. Thus the problems are equivalent. The instance of your problem can clearly be constructed in polynomial time and space. It follows that your problem is NP-hard.

You can reduce this problem to the maximum weighted clique problem, which is, unfortunately, NP-hard.
Build a graph such that every combination is a vertex with weight equal to the length of the combination, and connect two vertices if the corresponding combinations do not share any object (i.e. if you can pick both of them at the same time). Then, a solution is valid if and only if it is a clique in that graph.
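As an illustrative sketch of the encoding (not a clique solver, and with names of my own choosing); note that with millions of combinations an explicit adjacency matrix like this would be far too large, so this is only to make the construction concrete:

```python
def build_compatibility_graph(combinations):
    # Vertices are combinations, with weight = length of the combination.
    # An edge joins two vertices iff the combinations share no object,
    # so a set of mutually compatible combinations is exactly a clique.
    combos = [frozenset(c) for c in combinations]
    weights = [len(c) for c in combos]
    n = len(combos)
    adjacent = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if combos[i].isdisjoint(combos[j]):
                adjacent[i][j] = adjacent[j][i] = True
    return weights, adjacent

weights, adj = build_compatibility_graph(["AF", "CE", "CEGH", "ADFG"])
print(adj[0][2])   # True: AF and CEGH are compatible
print(adj[0][3])   # False: AF and ADFG share A and F
```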
A simple search on google brings up a lot of approximation algorithms for this problem, such as this one.

Related

Knapsack with unique elements

I'm trying to solve the following:
The knapsack problem is as follows: given a set of integers S={s1,s2,…,sn}, and a given target number T, find a subset of S that adds up exactly to T. For example, within S={1,2,5,9,10} there is a subset that adds up to T=22 but not T=23. Give a correct programming algorithm for knapsack that runs in O(nT) time.
but the only algorithm I could come up with is generating all the 1 to N combinations and try the sum out (exponential time).
I can't devise a dynamic programming solution, since the fact that I can't reuse an element makes this problem different from the coin change problem and from the general knapsack problem.
Can somebody help me out with this or at least give me a hint?
The O(nT) running time gives you the hint: do dynamic programming on two axes. That is, let f(a,b) denote the maximum sum <= b which can be achieved with the first a integers.
f satisfies the recurrence
f(a,b) = max( f(a-1,b), f(a-1,b-s_a)+s_a )
since the first value is the maximum without using s_a and the second is the maximum including s_a. From here the DP algorithm should be straightforward, as should outputting the correct subset of S.
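A minimal Python sketch of this DP (the names are mine, just to make the recurrence concrete):

```python
def knapsack_exact(S, T):
    # f[a][b] = maximum sum <= b achievable using the first a integers of S.
    n = len(S)
    f = [[0] * (T + 1) for _ in range(n + 1)]
    for a in range(1, n + 1):
        s = S[a - 1]
        for b in range(T + 1):
            f[a][b] = f[a - 1][b]                            # leave s out
            if s <= b:
                f[a][b] = max(f[a][b], f[a - 1][b - s] + s)  # take s
    return f[n][T] == T   # True iff some subset sums exactly to T

print(knapsack_exact([1, 2, 5, 9, 10], 22))  # True
print(knapsack_exact([1, 2, 5, 9, 10], 23))  # False
```

This fills an (n+1) x (T+1) table, so it runs in the required O(nT) time; the chosen subset can be recovered by walking the table backwards.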
I did find a solution, but with O(T·n²) time complexity, by building the table from the bottom up. In other words, sort the array, start with the greatest number available, and build a table where the columns are the target values and the rows are the provided numbers. For each cell we need to consider all possible ways of making i - cost[j] + j, which takes n² time, and this is multiplied by the target.

How to code the maximum set packing algorithm?

Suppose we have a finite set S and a list of subsets of S. Then, the set packing problem asks if some k subsets in the list are pairwise disjoint.
The optimization version of the problem, maximum set packing, asks for the maximum number of pairwise disjoint sets in the list.
http://en.wikipedia.org/wiki/Set_packing
So, let `S = {1,2,3,4,5,6,7,8,9,10}`
and `Sa = {1,2,3,4}`
and `Sb = {4,5,6}`
and `Sc = {5,6,7,8}`
and `Sd = {9,10}`
Then the maximum number of pairwise disjoint sets is 3 (`Sa`, `Sc`, `Sd`).
I could not find any articles about the algorithm involved. Can you shed some light on the same?
My approach:
Sort the sets according to their size and start from the set of the smallest size. If no element of the next set intersects with the current union, unite the sets and increase the count of selected sets. Does this sound good to you? Any better ideas?
As hivert pointed out, this problem is NP-hard, so there's no efficient way to do this. However, if your input is relatively small, you can still pull it off. Exponential doesn't mean impossible, after all. It's just that exponential problems become impractical very quickly, as the input size grows. But for something like 25 sets, you can easily brute force it.
Here's one approach. Let's say you have n subsets, called S0, S1, ..., etc. We can try every combination of subsets, and pick the one with maximum cardinality. There are only 2^25 = 33554432 choices, so this is probably reasonable enough.
An easy way to do this is to notice that any non-negative number strictly below 2^N represents a particular choice of subsets. Look at the binary representation of the number, and choose the sets whose indices correspond to the bits that are on. So if the number is 11, the 0th, 1st and 3rd bits are on, and this corresponds to the combination [S0, S1, S3]. Then you just verify that these three sets are in fact disjoint.
Your procedure is as follows (a sketch follows the steps):
Iterate i from 0 to 2^N - 1
For each value of i, use the bits that are on to figure out the corresponding combination of subsets.
If those subsets are pairwise disjoint, update your best answer with this combination (i.e., use this if it is bigger than your current best).
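Here is a rough Python sketch of that procedure (names are mine), using the simple element-seen check described further below:

```python
def best_packing(subsets):
    # Try every bitmask i over the n subsets; bit b set means subsets[b]
    # is chosen. Keep the largest choice whose subsets are pairwise disjoint.
    n = len(subsets)
    best = []
    for i in range(1 << n):
        chosen = [subsets[b] for b in range(n) if (i >> b) & 1]
        seen = set()
        ok = True
        for s in chosen:
            for x in s:
                if x in seen:          # element already used by another chosen subset
                    ok = False
                    break
                seen.add(x)
            if not ok:
                break
        if ok and len(chosen) > len(best):
            best = chosen
    return best

print(best_packing([{1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}]))
# e.g. [{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10}]
```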
Alternatively, use backtracking to generate your subsets. The two approaches are equivalent, modulo implementation tradeoffs. Backtracking will have some stack overhead, but can cut off entire lines of computation if you check disjointness as you go. For example, if S1 and S2 are not disjoint, then it will never bother with any bigger combinations containing those two, saving some time. The iterative method can't optimize itself in this way, but is fast and efficient because of the bitwise operations and tight loop.
The only nontrivial matter here is how to check if the subsets are pairwise disjoint. There are all sorts of tricks you can pull here as well, depending on the constraints.
A simple approach is to start with an empty set structure (pick whatever you want from the language of your choice) and add elements from each subset one by one. If you ever hit an element that's already in the set, then it occurs in at least two subsets, and you can give up on this combination.
If the original set S has m elements, and m is relatively small, you can map each of them to the range [0, m-1] and use bitmasks for each set. So if m <= 64, you can use a Java long to represent each subset. Turn on all the bits that correspond to the elements in the subset. This allows blazing fast set operation, because of the speed of bitwise operations. Bitwise AND corresponds to set intersection, and bitwise OR is a union. You can check if two subsets are disjoint by seeing if the intersection is empty (i.e., ANDing the two bitmasks gives you 0).
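A small sketch of the bitmask trick (shown in Python rather than Java; the mapping function is my own):

```python
def to_bitmasks(universe, subsets):
    # Map each element of S to a bit position; each subset becomes an
    # integer whose set bits are exactly its elements.
    index = {x: i for i, x in enumerate(sorted(universe))}
    masks = []
    for s in subsets:
        m = 0
        for x in s:
            m |= 1 << index[x]
        masks.append(m)
    return masks

masks = to_bitmasks(range(1, 11), [{1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}])
print(masks[0] & masks[2] == 0)   # True: Sa and Sc are disjoint
print(masks[0] & masks[1] == 0)   # False: Sa and Sb share the element 4
```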
If you don't have so few elements, you can still avoid repeating the set intersections multiple times. You have very few sets, so precompute which ones are disjoint at the start. You can just store a boolean matrix D, such that D[i][j] = true iff i and j are disjoint. Then you just look up all pairs in a combination to verify pairwise disjointness, rather than doing real set operations.
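A quick sketch of that precomputation:

```python
def disjointness_matrix(subsets):
    # D[i][j] is True iff subsets i and j share no element, so the check
    # inside the enumeration becomes a plain table lookup.
    sets = [set(s) for s in subsets]
    n = len(sets)
    return [[sets[i].isdisjoint(sets[j]) for j in range(n)] for i in range(n)]
```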
You can solve the set packing problem by searching for a maximum independent set. You encode your problem as follows:
for each set you put a vertex;
you put an edge between two vertices if the corresponding sets share a common number.
Then you want a maximum set of vertices no two of which are adjacent. Unfortunately this is an NP-hard problem; any known algorithm is exponential.
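A sketch of that encoding (an independent-set solver is not included; the names are mine):

```python
def conflict_graph(subsets):
    # One vertex per subset; an edge between two vertices iff the subsets
    # share a number. A maximum independent set here is a maximum set packing.
    sets = [set(s) for s in subsets]
    n = len(sets)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if sets[i] & sets[j]]
    return n, edges

print(conflict_graph([{1, 2, 3, 4}, {4, 5, 6}, {5, 6, 7, 8}, {9, 10}]))
# (4, [(0, 1), (1, 2)])  -- Sa-Sb and Sb-Sc conflict; {Sa, Sc, Sd} is independent
```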

How to find all possible pairs from three subsets of a set with constraints in Erlang?

I have a set M which consists of three subsets A,B and C.
Problem: I would like to calculate all possible subsets S(1)...S(N) of M which contain all possible pairs between elements of A, B and C in such manner that:
elements of A and B can occur in a pair only once for each of the two positions in a pair (that is, {a1,a2} and {b1,a1} can be in one subset S, but no further pairs {a1,_} or {_,a1} are allowed in this subset S);
elements of C can happen 1-N times in a subset S (that is {a,c}, {b,c}, {x,c} can happen in one subset S), but I would like to get subsets S for all possible numbers of elements of C in a subset S.
For example, if we have A = [a1,a2], B = [b1,b2], C = [c1,c2], then some of the resulting subsets S would be (remember, they should contain pairs of elements):
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1};
- {a1,b1}, {b1,a2}, {a2,b2}, {b2,c1}, {c1,c2};
- {a1,c1}, {c1,a2}, {c1,b2}, {b1,c1};
- etc.
I tend to think that first I need to find all possible subsets of M, which contain only one element of A, one element of B and 1..N elements of C (1). And after that I should somehow generate sets of pairs (2) from that. But I am not sure that this is the right strategy.
So, the more elaborated question would be:
what is the best way to create sets and find subsets in Erlang if the elements of the set M are integers?
are there any ready-made tools to find subsets of a set in Erlang?
are there any ready-made tools to generate all possible pairs of elements of a set in Erlang?
How can I solve the aforementioned problem in Erlang?
There is a sets module*, but I suspect you're better off thinking up an algorithm first -- its implementation in Erlang is the problem (or not) that comes after this. (Maybe you'll notice it's actually a graph algorithm (like, bipartite matching something something), and you'll be happy with Erlang's digraph module.)
Long story short, when you come up with an algorithm, Erlang can very probably be used to implement it. Yes, there is a certain support for sets. But solutions to a problem requiring "all possible subsets" tend to be exponential (i.e., given n elements, there are 2^n subsets; for every element you either have it in your subset or not) and thus bad.
(* there are some modules concerning sets)

Algorithm to find discriminating data points?

Given n samples and p >> n (discrete) data points for each of the n samples, what is a good algorithm for finding a smallest possible set of k data points such that those k data points discriminate between all n samples?
For my purposes, a good algorithm that finds an approximately smallest set would also suffice.
It sounds as though your problem is closely related to the test cover problem. The test cover problem is, given a ground set X = {1, …, n} and a collection T = {T1, …, Tm} of subsets of X, to find the smallest subcollection U of T such that for all y ≠ z in X, there exists a set S in U such that either (y in S and z not in S) or (y not in S and z in S).
The test cover problem is NP-hard, so in practice, optimal solutions are found using branch and bound techniques. See De Bontridder et al.
Here is a simple greedy algorithm (a sketch follows the steps); it shouldn't generate too bad results:
Check whether the data points are the same for two different samples; if so, there is no solution.
In each step we add one new data point to the set k.
We test each of the p candidate data points.
Try to add that point to k.
The new k divides the n samples into a number of distinct groups (some of these contain just one sample, some more; finally all will contain just one).
Pick the point which generates the most groups.
Do this until all groups are distinct.
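A possible Python sketch of this greedy procedure (my own names; `samples[i]` is assumed to be the vector of p values for sample i):

```python
def greedy_discriminators(samples):
    # Greedily pick the column that splits the current groups of samples
    # into the most parts, until every sample sits in its own group.
    n = len(samples)
    p = len(samples[0])
    if len({tuple(s) for s in samples}) < n:
        raise ValueError("two samples are identical; no solution")
    chosen = []
    groups = [tuple()] * n              # signature of each sample so far
    while len(set(groups)) < n:
        best_col, best_count = None, len(set(groups))
        for col in range(p):
            if col in chosen:
                continue
            refined = [groups[i] + (samples[i][col],) for i in range(n)]
            count = len(set(refined))
            if count > best_count:
                best_col, best_count = col, count
        if best_col is None:
            break                       # no column improves the split
        chosen.append(best_col)
        groups = [groups[i] + (samples[i][best_col],) for i in range(n)]
    return chosen
```

The result is not guaranteed to be the smallest possible set of data points, but greedy choices of this kind are the standard practical approach when the exact problem is NP-hard.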

Algorithm: Removing as few elements as possible from a set in order to enforce no subsets

I got a problem which I do not know how to solve:
I have a set of sets A = {A_1, A_2, ..., A_n} and I have a set B.
The target now is to remove as few elements as possible from B (creating B'), such that, after removing the elements, A_i is not a subset of B' for any 1 <= i <= n.
For example, if we have A_1 = {1,2}, A_2 = {1,3,4}, A_3 = {2,5}, and B = {1,2,3,4,5}, we could e.g. remove 1 and 2 from B (that would yield B' = {3,4,5}, which is not a superset of any of the A_i).
Is there an algorithm for determining the (minimal number of) elements to be removed?
It sounds like you want to remove the minimal hitting set of A from B (this is closely related to the vertex cover problem).
A hitting set for some set-of-sets A is itself a set such that it contains at least one element from each set in A (it "hits" each set). The minimal hitting set is the smallest such hitting set. So, if you have an MHS for your set-of-sets A, you have an element from each set in A. Removing these elements from B means no set in A can be a subset of what remains.
All you need to do is calculate the MHS for (A1, A2, ... An), then remove that from B. Unfortunately, finding the MHS is an NP-complete problem. Knowing that though, you have a few options:
If your data set is small, do the obvious brute-force solution
Use a probabilistic algorithm to get a fast, approximate answer (see this PDF)
Run far, far away in the opposite direction
If you just need some approximation, start with the smallest set in A, and remove one element from B. (You could just grab one at random, or check to see which element is in the most sets in A, depending on how accurate and how fast you need it to be.)
Now the smallest set in A isn't a subset of B. Move on from there, but check first to see whether or not the sets you're examining are subsets at this point or not.
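If the "pick the element that occurs in the most sets of A" heuristic is what you want, a minimal greedy sketch could look like this (names are mine; not optimal in general, since minimum hitting set is NP-hard):

```python
def greedy_hitting_set(A):
    # Repeatedly remove the element that hits (occurs in) the most
    # not-yet-hit sets of A. Returns the elements to delete from B,
    # assuming every A_i is a subset of B.
    remaining = [set(a) for a in A]
    removed = set()
    while remaining:
        counts = {}
        for s in remaining:
            for x in s:
                counts[x] = counts.get(x, 0) + 1
        best = max(counts, key=counts.get)          # most frequent element
        removed.add(best)
        remaining = [s for s in remaining if best not in s]
    return removed

print(greedy_hitting_set([{1, 2}, {1, 3, 4}, {2, 5}]))   # e.g. {1, 2}
```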
This reminds me of the vertex cover problem, and I remember an approximation algorithm for it that is similar to this one.
I think you should find the set of minimum length among these sets and then delete the elements which are in that set.
