An algorithm to find the minimum size set cover for the Set-cover problem

In the Set Covering problem, we are given a universe U with |U| = n, and sets S1, ..., Sk that are subsets of U. A set cover is a collection C of some of the sets S1, ..., Sk whose union is the entire universe U.
I'm trying to come up with an algorithm that finds a minimum set cover (one that uses the fewest sets), so that I can show that the greedy algorithm for set covering sometimes uses more sets than necessary.
Following is what I came up with:
repeat for each set:
1. Cover <- Set_i (i = 1, ..., k)
2. if a set is not a subset of any other set, then take that set into the cover.
but it doesn't work for some instances.
Please help me figure out an algorithm to find the minimum set cover.
I'm still having trouble finding such an algorithm online. Does anyone have any suggestions?

Set cover is NP-hard, so it's unlikely that there'll be an algorithm much more efficient than looking at all possible combinations of sets, and checking if each combination is a cover.
Basically, look at all combinations of 1 set, then 2 sets, etc. until they form a cover.
EDIT
This is an example pseudocode. Note that I do not claim that this is efficient. I simply claim that there isn't a much more efficient algorithm (algorithms will be worse than polynomial time unless something really cool is discovered)
for size in 1..|S|:
    for C in combination(S, size):
        if (union(C) == U) return C
where combination(K, n) returns all possible sets of size n whose elements come from K.
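The pseudocode above could be sketched in Python roughly as follows. This is only an illustration: the function name `minimum_set_cover` is made up here, `itertools.combinations` stands in for `combination(K, n)`, and the running time is still exponential in the number of sets.

```python
from itertools import combinations

def minimum_set_cover(universe, sets):
    """Return a smallest list of sets whose union covers `universe`, or None.

    Tries every combination of 1 set, then 2 sets, and so on, so the first
    cover found uses the fewest possible sets.
    """
    universe = set(universe)
    for size in range(1, len(sets) + 1):
        for candidate in combinations(sets, size):
            # `>=` is the superset test, so stray elements outside the universe are tolerated
            if set().union(*candidate) >= universe:
                return list(candidate)
    return None  # no combination of the given sets covers the universe
```

For example, `minimum_set_cover({1, 2, 3}, [{1, 2}, {3}, {1, 2, 3}])` returns the single set `{1, 2, 3}`.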
EDIT
However, I'm not too sure why you need an algorithm to find the minimum. In the question you state that you want to show that the greedy algorithm for set covering sometimes finds more sets. But this is easily achieved with a counterexample (one is shown in the Wikipedia entry for set cover). So I am quite puzzled.
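To make the counterexample idea concrete, here is a small sketch. The instance is one I made up for illustration (it is not the one from Wikipedia), and `greedy_set_cover` is a hypothetical helper that always picks the set covering the most uncovered elements, breaking ties by insertion order.

```python
def greedy_set_cover(universe, sets):
    """sets: dict mapping a name to a set. Returns the names chosen by the greedy rule."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # pick the set that covers the most still-uncovered elements
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Assumed instance: S2 and S3 together cover everything,
# but greedy is lured by the larger S1 first.
universe = {1, 2, 3, 4, 5, 6}
sets = {"S1": {1, 2, 3, 4}, "S2": {1, 2, 5}, "S3": {3, 4, 6}}
greedy = greedy_set_cover(universe, sets)
```

On this instance greedy takes S1 first and then needs both S2 and S3, using three sets, while {S2, S3} alone is already a cover.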
EDIT
A possible implementation of combination(K, n) is:
if n == 0: return [{}] // a list containing an empty set
r = []
for k in K:
    K = K \ {k} // remove k from K
    for s in combination(K, n-1):
        r.append(union({k}, s))
return r
When combining this with the cover problem, though, one probably wants to perform the coverage test in the base case n == 0 instead.

Try Donald E. Knuth's Algorithm X for exact cover, using a sparse matrix. It must be adapted a little to solve minimum set cover problems as well.

Related

Smallest set of multi-sets that together contains all numbers from 1 to N

Let's assume that we have only integers whose values are in the range 1 to N. Next we split them into K-element multi-sets. How would you find a set that contains the smallest possible number of those multi-sets and yet whose union contains all numbers from 1 to N? In case of ambiguity, the answer can be any set that matches the criteria (the first one found).
For instance, we have N = 9, K = 3
(1,2,3)(4,5,6)(7,8,8)(8,7,6)(1,9,2)(4,4,3)
Smallest number of multi-sets that contains all the numbers from 1 to 9 is equal to 4 and can be either (1,2,3)(4,5,6)(7,8,8)(1,9,2) or (1,2,3)(4,5,6)(8,7,6)(1,9,2).
Any ideas for an efficient algorithm to find such a set?
PS
After writing an answer I found two more 4-element sets: (4,5,6)(1,9,2)(4,4,3)(7,8,8) and (4,5,6)(1,9,2)(4,4,3)(8,7,6). But as I said, an algorithm that finds any minimum set would be fine.
Your question is a restricted version of the classic Set Covering problem, but it is still easy to show that it is NP-hard.
Any approximation technique for set cover would be reasonable here. In particular, the greedy solution of repeatedly choosing the subset that covers the most uncovered items is especially easy to implement.
This problem, as @Ami Tavroy said, is NP-hard by reduction from 3-dimensional matching.
To do the reduction, note the restricted decision variant of 3-dimensional matching, in which it reduces to an exact cover:
...given a set T and an integer k, decide whether there exists a 3-dimensional matching M ⊆ T with |M| ≥ k. ... The problem is NP-complete even in the special case that k = |X| = |Y| = |Z|. In this case, a 3-dimensional (dominating) matching is not only a set packing but also an exact cover: the set M covers each element of X, Y, and Z exactly once.
This variant can be solved in polynomial time if you can solve the other question in polynomial time: you can produce all the triples in O(N^3) time, then solve the set cover problem, and check whether the minimum cover has size K = N / 3 or not. Thus, by reduction, the original question is also NP-hard.

Greedy Set Coverage algorithm built by *removing* sets

I am trying to implement a solution for a set coverage problem using a greedy algorithm.
The classic greedy approximation algorithm for it is
input: collection C of sets over universe U, costs c: C→R≥0
output: set cover S
1. Let S←∅.
2. Repeat until S covers all elements:
3.     Add a set s to S, where s∈C maximizes the number of elements in s not yet covered by the sets in S, divided by the cost c(s).
4. Return S.
I have a question in 2 parts:
a. Would running the algorithm in reverse also be a valid algorithm? That is:
input: collection C of sets over universe U, costs c: C→R≥0
output: set cover S
1. Let S←C.
2. Repeat until there is no s∈S such that S−{s} covers the same elements as S (i.e. all elements in s are redundant):
3.     Remove such a redundant set s from S, where s minimises the number of elements in s, divided by the cost c(s).
4. Return S.
b. The nature of the problem is such that it is easy to get C, and there will be a limited number (<5) of redundant sets. In this case, would this removal algorithm perform better?
The algorithm will surely return a valid set cover as at every step it checks if all elements of s are redundant.
Intuitively I feel that part b is true, though I am unable to write a formal proof for it. Read chapter 2 of Vijay Vazirani's book, as it might help with the analysis.
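For reference, the removal variant from the question might be sketched like this. It is a rough illustration, not an analysed algorithm: `reverse_greedy_cover` is a made-up name, costs are assumed to be strictly positive, and only sets whose elements are all covered by the other remaining sets are ever removed.

```python
def reverse_greedy_cover(sets, cost):
    """sets: dict name -> set of elements; cost: dict name -> positive cost.

    Starts from all sets and repeatedly removes a redundant one, so the
    result always covers the same elements as the full collection.
    """
    chosen = dict(sets)

    def is_redundant(name):
        # a set is redundant if the other remaining sets already cover it
        others = set().union(*(s for other, s in chosen.items() if other != name))
        return chosen[name] <= others

    while True:
        redundant = [name for name in chosen if is_redundant(name)]
        if not redundant:
            return set(chosen)
        # remove the redundant set with the fewest elements per unit cost
        worst = min(redundant, key=lambda name: len(chosen[name]) / cost[name])
        del chosen[worst]
```

With sets a = {1, 2}, b = {2, 3}, c = {1, 2, 3} and unit costs, this removes a, then b, and returns just c.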

A greedy or dynamic algorithm to subset selection

I have a simple algorithmic question. I would be grateful if you could help me.
We have some 2-dimensional points, each with a positive weight (a sample problem is attached). We want to select a subset of them that maximizes the total weight such that no two selected points overlap each other (for example, in the attached file, we cannot select both A and C because they are in the same row, and likewise we cannot select both A and B because they are in the same column). Is there any greedy (or dynamic programming) approach I can use? I'm aware of the non-overlapping interval selection algorithm, but I cannot use it here because my problem is 2-dimensional.
Any reference or note is appreciated.
Regards
Attachment:
A simple sample of the problem:
A (30$) -------- B (10$)
|
|
|
|
C (8$)
If you are OK with a good solution and do not demand the best one, you can use heuristic algorithms to solve this.
Let S be the set of points, and w(s) the weight function.
Create a weight function W: 2^S -> R (from the subsets of S to the real numbers):
W(U) = -INFINITY if the solution is not feasible
       Sigma(w(u)) for each u in U otherwise
Also create a function next: 2^S -> 2^2^S (a function that takes a subset of S and returns a set of subsets of S):
next(U) = { V | you can get V from U by adding one element to U or removing one element from U }
Now, given that data - you can invoke any optimization algorithm in the Artificial Intelligence book, such as Genetic Algorithm or Hill Climbing.
For example, Hill Climbing with random restarts will look something like this:
1. best <- -INFINITY
2. while there is more time:
3.     choose a random subset s
4.     NEXT <- next(s)
5.     if max{ W(v) | v in NEXT } <= W(s): // s is a local maximum (<= also stops on plateaus)
5.1.       if W(s) > best: best <- W(s) // if s is better than the previous result, store it
5.2.       go to 2. // restart the hill climbing from a different random point
6.     else:
6.1.       s <- the v in NEXT with the largest W(v)
6.2.       go to 4.
7. return best // when out of time, return the best solution found so far
The above algorithm is anytime - meaning it will produce better results if given more time.
This can be treated as a linear assignment problem, which can be solved using an algorithm like the Hungarian algorithm. The algorithm tries to minimize the sum of costs, so just negate your weights, and use them as the costs. The assignment of rows to columns will give you the subset of points that you need. There are sparse variants for cases where not every (row,column) pair has an associated point, but you can also just use a large positive cost for these.
Well, you can think of this as a binary constraint optimization problem, for which there are various algorithms. The easiest algorithm for this problem is backtracking with arc propagation. However, it takes exponential time in the worst case. I am not sure whether there are any specific algorithms that take advantage of the geometrical nature of the problem.
This can be solved by a pretty straightforward dynamic programming approach with an exponential time complexity:
s = {A, B, C ...}
getMaxSum(s) = max( A.value + getMaxSum(compatibleSubSet(s, A)),
                    B.value + getMaxSum(compatibleSubSet(s, B)),
                    ...)
where compatibleSubSet(s, A) gets the subset of s that does not overlap with A.
To optimize it, you can memoize the result for each subset.
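A minimal memoized version of this recurrence in Python could look like the sketch below. The representation is an assumption made for illustration: each point is a (row, column, weight) triple keyed by name, and `max_weight_subset` is a hypothetical function name. It is still exponential in the worst case.

```python
from functools import lru_cache

def max_weight_subset(points):
    """points: dict name -> (row, col, weight). Returns the best total weight."""

    def compatible(candidates, name):
        # keep only the points that share neither a row nor a column with `name`
        row, col, _ = points[name]
        return frozenset(
            other for other in candidates
            if other != name and points[other][0] != row and points[other][1] != col
        )

    @lru_cache(maxsize=None)  # memoize the result for each subset
    def get_max_sum(candidates):
        if not candidates:
            return 0
        return max(
            points[name][2] + get_max_sum(compatible(candidates, name))
            for name in candidates
        )

    return get_max_sum(frozenset(points))
```

For the sample in the question (A conflicts with both B and C), taking A alone for 30 beats B plus C for 18.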
One way to do it:
Write a function that generates subsets ordered from the subset of maximum weight to the subset of minimum weight, ignoring the constraints.
Then call this function repeatedly until a subset that honors the constraints pops up.
To improve performance, you can write a smarter generator function that, for instance, honors the not-on-the-same-row constraint but ignores the not-on-the-same-column one.

Algorithm to find discriminating data points?

Given n samples and p >> n (discrete) data points for each of the n samples, what is a good algorithm for finding a smallest possible set of k data points such that those k data points discriminate between all n samples?
For my purposes, a good algorithm that finds an approximately smallest set would also suffice.
It sounds as though your problem is closely related to the test cover problem. The test cover problem is, given a ground set X = {1, …, n} and a collection T = {T1, …, Tm} of subsets of X, to find the smallest subcollection U of T such that for all x ≠ y in X, there exists a set S in U such that either (x in S and y not in S) or (x not in S and y in S).
The test cover problem is NP-hard, so in practice, optimal solutions are found using branch and bound techniques. See De Bontridder et al.
Here is a simple greedy algorithm that shouldn't produce results that are too bad:
First check whether the data points are identical for two different samples; if so, there is no solution.
In each step we add one new data point to the set k:
Test each of the p data points that is not yet in k.
Try adding that point to k.
The new k divides the n samples into a number of distinct groups (some of these contain just one sample, some more; in the end all will contain just one).
Pick the point that generates the most groups.
Do this until every sample is in its own group.
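The greedy steps above might be sketched in Python as follows. This assumes each sample is a non-empty tuple of discrete values, all of the same length, and that features are referred to by column index; the name `greedy_discriminating_features` is made up for this example, and the result is not guaranteed to be minimal.

```python
def greedy_discriminating_features(samples):
    """samples: list of equal-length tuples of discrete values.

    Returns a list of column indices whose values together tell all samples apart.
    """
    if len(set(samples)) < len(samples):
        raise ValueError("two samples are identical; no solution exists")
    n_features = len(samples[0])
    chosen = []

    def n_groups(features):
        # number of distinct groups the samples fall into using only these columns
        return len({tuple(s[f] for f in features) for s in samples})

    while n_groups(chosen) < len(samples):
        remaining = [f for f in range(n_features) if f not in chosen]
        # add the column that splits the samples into the most groups
        best = max(remaining, key=lambda f: n_groups(chosen + [f]))
        chosen.append(best)
    return chosen
```

Each step recomputes the grouping for every remaining column, so this is simple rather than fast; for p much larger than n one would probably want to update the groups incrementally.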

Point covering problem

I recently had this problem on a test: given a set of points m (all on the x-axis) and a set n of lines with endpoints [l, r] (again on the x-axis), find the minimum subset of n such that all points are covered by a line. Prove that your solution always finds the minimum subset.
The algorithm I wrote for it was something to the effect of:
(say lines are stored as arrays with the left endpoint in position 0 and the right in position 1)
algorithm coverPoints(set[] m, set[][] n):
    chosenLines = []
    while m is not empty:
        minX = min(m)
        bestLine = n[0]
        for i=1 to length of n:
            if n[i][0] <= minX and n[i][1] > bestLine[1] then
                bestLine = n[i]
        add bestLine to chosenLines
        for i=0 to length of m:
            if m[i] <= bestLine[1] then delete m[i] from m
    return chosenLines
I'm just not sure if this always finds the minimum solution. It's a simple greedy algorithm so my gut tells me it won't, but one of my friends who is much better than me at this says that for this problem a greedy algorithm like this always finds the minimal solution. For proving mine always finds the minimal solution I did a very hand wavy proof by contradiction where I made an assumption that probably isn't true at all. I forget exactly what I did.
If this isn't a minimal solution, is there a way to do it in less than something like O(n!) time?
Thanks
Your greedy algorithm IS correct.
We can prove this by showing that ANY other covering can only be improved by replacing it with the cover produced by your algorithm.
Let C be a valid covering for a given input (not necessarily an optimal one), and let S be the covering produced by your algorithm. Now let's inspect the points p1, p2, ..., pk that are the minimum points you deal with at each iteration step. The covering C must cover them all as well. Observe that there is no segment in C covering two of these points; otherwise, your algorithm would have chosen this segment! Therefore, |C| >= k. And what is the cost (segment count) of your algorithm? |S| = k.
That completes the proof.
Two notes:
1) Implementation: Initializing bestLine with n[0] is incorrect, since the loop may be unable to improve it, and n[0] does not necessarily cover minX.
2) Actually, this problem is a simplified version of the Set Cover problem. While the original is NP-complete, this variation turns out to be solvable in polynomial time.
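A possible Python version of the greedy that avoids the initialization issue from note 1 is sketched below. It assumes segments are (left, right) pairs with closed endpoints; `cover_points` is a hypothetical name, and raising an error for an uncoverable point is my own choice rather than part of the question.

```python
def cover_points(points, segments):
    """Repeatedly take the leftmost uncovered point and choose, among the
    segments that cover it, the one reaching furthest to the right."""
    remaining = sorted(set(points))
    chosen = []
    while remaining:
        min_x = remaining[0]
        # only segments that actually contain min_x are candidates
        candidates = [seg for seg in segments if seg[0] <= min_x <= seg[1]]
        if not candidates:
            raise ValueError(f"no segment covers point {min_x}")
        best = max(candidates, key=lambda seg: seg[1])
        chosen.append(best)
        # every remaining point is >= min_x, so anything <= best[1] is now covered
        remaining = [p for p in remaining if p > best[1]]
    return chosen
```

For points 1, 2, 5, 8 and segments (0,1), (0,3), (4,6), (5,9), (7,8), this picks (0,3) and then (5,9).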
Hint: first try proving your algorithm works for sets of size 0, 1, 2... and see if you can generalise this to create a proof by induction.

Resources