Number of elements required to occur at least once in each set of a set - algorithm

I have a list L of lists l[i] of elements e. I am looking for an algorithm that finds a minimum set S_min of elements such that at least one member of S_min occurs in each l.
I am not only curious to find a simple algorithm that does this for me, but also to learn what problems of this sort are actually called. I am sure there is something out there.
I have implemented brute force algorithms that start with adding all those elements to S_min which occur in sets of len(l[i])=1. The rest is simple trial and error.

The problem you describe is the vertex cover problem in hypergraphs (also known as the hitting set problem), an optimization problem which is NP-hard in the general case but admits approximation algorithms for suitably bounded instances.
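Since the problem is NP-hard, an exact answer on small inputs can be found by trying candidate sets in order of increasing size. The following is only a minimal sketch under that assumption; the function name `minimum_hitting_set` is made up for illustration, and the elements are assumed to be hashable and sortable.

```python
from itertools import combinations

def minimum_hitting_set(lists):
    """Return a smallest set of elements that intersects every list.

    Exhaustive search, exponential in the number of distinct elements,
    so this is only practical for small inputs.
    """
    sets = [set(l) for l in lists]
    if any(not s for s in sets):
        raise ValueError("an empty list can never be hit")
    universe = sorted(set().union(*sets))
    # Try candidate sizes 0, 1, 2, ...; the first hit is a minimum.
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return chosen

print(minimum_hitting_set([[1, 2], [2, 3], [4]]))
```

The whole universe always hits every non-empty list, so the loop is guaranteed to return before it runs out of sizes.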

Related

Extended version of the set cover problem

I don't generally ask questions on SO, so if this question seems inappropriate for SO, just tell me (help would still be appreciated of course).
I'm still a student and I'm currently taking a class in Algorithms. We recently learned about the Branch-and-Bound paradigm and since I didn't fully understand it, I tried to do some exercises in our course book. I came across this particular instance of the Set Cover problem with a special twist:
Let U be a set of elements and S = {S1, S2, ..., Sn} a set of subsets of U, where the union of all sets Si equals U. Outline a Branch-and-Bound algorithm to find a minimal subset Q of S, so that for all elements u in U there are at least two sets in Q, which contain u. Specifically, elaborate how to split the problem up into subproblems and how to calculate upper and lower bounds.
My first thought was to sort all the sets Si in S in descending order, according to how many elements they contain which aren't yet covered at least twice by the currently chosen subsets of S, so our current instance of Q. I was then thinking of recursively solving this, where I choose the first set Si in the sorted order and make one recursive call, where I take this set Si and one where I don't (meaning from those recursive calls onwards the subset is no longer considered). If I choose it I would then go through each element in this chosen subset Si and increase a counter for all its elements (before the recursive call), so that I'll eventually know, when an element is already covered by two or more chosen subsets. Since I sort the not chosen sets Si for each recursive call, I would theoretically (in my mind at least) always be making the best possible choice for the moment. And since I basically create a binary tree of recursive calls, because I always make one call with the current best subset chosen and one where I don't I'll eventually cover all 2^n possibilities, meaning eventually I'll find the optimal solution.
My problem now is I don't know or rather understand how I would implement a heuristic for upper and lower bounds, so the algorithm can discard some of the paths in the binary tree, which will never be better than the current best Q. I would appreciate any help I could get.
Here's a simple lower bound heuristic: Find the set containing the largest number of not-yet-twice-covered elements. (It doesn't matter which set you pick if there are multiple sets with the same, largest possible number of these elements.) Suppose there are u of these elements in total, and this set contains k <= u of them. Then, you need to add at least u/k further sets before you have a solution. (Why? See if you can prove this.)
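A rough sketch of this bound, assuming the current state is kept as a dictionary `cover_count` mapping every element of U to the number of chosen sets that contain it (that representation and the function name are assumptions, not part of the exercise). Because the number of sets is an integer, the sketch rounds u/k up.

```python
import math

def lower_bound_extra_sets(remaining_sets, cover_count, required=2):
    """Lower bound on how many more sets any solution below this node needs."""
    # Elements that are not yet covered `required` times.
    uncovered = {e for e, c in cover_count.items() if c < required}
    if not uncovered:
        return 0
    # k = most not-yet-covered elements contained in a single remaining set.
    k = max((len(uncovered & set(s)) for s in remaining_sets), default=0)
    if k == 0:
        return math.inf  # no remaining set helps: this branch is infeasible
    return math.ceil(len(uncovered) / k)

counts = {1: 0, 2: 1, 3: 0, 4: 2, 5: 0}
print(lower_bound_extra_sets([{1, 2, 3}, {4, 5}], counts))
```

A branch can be discarded when the number of sets already chosen plus this bound is no smaller than the size of the best solution found so far.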
This lower bound works for the regular set cover problem too. As always with branch and bound, using it may or may not result in better overall performance on a given instance than simply using the "heuristic" that always returns 0.
First, some advice: don't re-sort S every time you recurse/loop. Sorting is an expensive operation (O(N log N)) so putting it in a loop or a recursion usually costs more than you gain from it. Generally you want to sort once at the beginning, and then leverage that sort throughout your algorithm.
The sort you've chosen, descending by the length of the S subsets is a good "greedy" ordering, so I'd say just do that upfront and don't re-sort after that. You don't get to skip over subsets that are not ideal within your recursion, but checking a redundant/non-ideal subset is still faster than re-sorting every time.
Now what upper/lower bounds can you use? Well generally, you want your bounds and bounds-checking to be as simple and efficient as possible because you are going to be checking them a lot.
With this in mind, an upper bound is easy: use the size of the smallest solution that you've found so far. Initially set your upper bound as var bestQlength = int.MaxVal, some maximum value that is greater than n, the number of subsets in S. Then with every recursion you check if currentQ.length > bestQlength; if so, this branch is over the upper bound and you "prune" it. Obviously, when you find a new solution, you also need to check whether it is better (shorter) than your current bestQ and, if so, update both bestQ and bestQlength at the same time.
A good lower bound is a bit trickier. The simplest I can think of for this problem is: before you add a new subset Si to your currentQ, check whether Si has any elements that are not already covered two or more times by currentQ. If it does not, then this Si cannot contribute in any way to the currentQ solution that you are trying to build, so just skip it and move on to the next subset in S.
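Putting the pieces of this answer together, one possible sketch looks as follows. It sorts once up front, prunes against the best solution found so far, and skips subsets that cannot help. The function name is illustrative, and the pruning test is slightly stronger than the one described above: a branch is also cut when it could at best tie the current best.

```python
def min_double_cover(universe, subsets):
    """Smallest list of subsets covering every element at least twice, or None."""
    # Sort once, largest subsets first, and never re-sort afterwards.
    sets = sorted((frozenset(s) for s in subsets), key=len, reverse=True)
    count = {u: 0 for u in universe}
    chosen = []
    best = None

    def recurse(i):
        nonlocal best
        if all(c >= 2 for c in count.values()):
            if best is None or len(chosen) < len(best):
                best = list(chosen)
            return
        if i == len(sets):
            return  # ran out of subsets without a double cover
        # Upper-bound pruning: an incomplete branch needs at least one more set.
        if best is not None and len(chosen) + 1 >= len(best):
            return
        s = sets[i]
        # Only take s if it contains an element that is still under-covered.
        if any(count[u] < 2 for u in s):
            for u in s:
                count[u] += 1
            chosen.append(s)
            recurse(i + 1)
            chosen.pop()
            for u in s:
                count[u] -= 1
        recurse(i + 1)  # branch where s is not used

    recurse(0)
    return best

print(min_double_cover({1, 2, 3}, [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]))
```

The sketch assumes every subset contains only elements of the universe; it returns None when no double cover exists.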

Linear 3SAT : a version of 3SAT in linear time

Consider a 3SAT instance with the following special locality property. Suppose there are n variables in the Boolean formula, and that they are numbered 1, 2, 3, ..., n in such a way that each clause involves variables whose numbers are within ±10 of each other. Give a linear-time algorithm for solving such an instance of 3SAT.
I could not solve the problem, but my intuition is that if we could map the problem to a graph it might be solvable. I could not get much further than that.
This is a relatively straightforward dynamic programming problem. I'll describe a solution, ignoring the fairly straightforward indexing issues around either boundary.
After the m'th step we have the set of possible values for variables (m-10, m-9, ..., m+10) which could be solutions so far, each linked to a set of values for all previous variables that leads to solutions to equations 1..m.
For the m+1'th step we take each member of this possible solution set, ignore the m-10'th value, and consider each possibility for the m+11'th value. If the m+1'th equation is true, we add this to the next solution set, pointing to our history, only if that solution pattern has not already been added.
This lands us ready for the m+2nd step.
There are n steps required, each of which can have about 2 million possible cases to consider, so this is linear.
(Fun challenge. Modify this algorithm to not just find a solution, but to count how many solutions there are.)
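For illustration, here is a sketch of a closely related dynamic program. It is a variation, not the exact indexing described above: it sweeps over variables instead of clauses and keeps, for each step, the set of feasible assignments to the last `span + 1` variables. It assumes clauses are given as tuples of non-zero integers (a negative number means a negated variable) and that the variable numbers within a clause differ by at most `span`; both the input format and the function name are assumptions.

```python
def solve_local_3sat(n, clauses, span=10):
    """Return a satisfying assignment {variable: bool} or None."""
    # A clause is checked as soon as its highest-numbered variable is assigned.
    by_max = [[] for _ in range(n + 1)]
    for clause in clauses:
        idx = [abs(lit) for lit in clause]
        if max(idx) - min(idx) > span:
            raise ValueError("clause violates the locality assumption")
        by_max[max(idx)].append(clause)

    mask = (1 << (span + 1)) - 1
    # layers[v] maps a window state after assigning variable v to one
    # predecessor state. Bit j of a state is the value of variable v - j.
    layers = [{0: None}]
    for v in range(1, n + 1):
        nxt = {}
        for state in layers[-1]:
            for bit in (0, 1):
                new = ((state << 1) | bit) & mask
                if new in nxt:
                    continue  # this window pattern was already reached
                if all(any(((new >> (v - abs(lit))) & 1) == (lit > 0)
                           for lit in clause)
                       for clause in by_max[v]):
                    nxt[new] = state
        if not nxt:
            return None  # no partial assignment survives: unsatisfiable
        layers.append(nxt)

    # Walk the predecessor links backwards to recover one full assignment.
    state = next(iter(layers[n]))
    assignment = {}
    for v in range(n, 0, -1):
        assignment[v] = bool(state & 1)
        state = layers[v][state]
    return assignment

print(solve_local_3sat(5, [(1, 2, 3), (-1, -2, -3), (1, -2, 3), (-3, 4, 5)]))
```

Each step touches at most 2^(span+1) states, a constant for fixed `span`, so the total work grows linearly with n plus the number of clauses.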
I think you can just brute force it in poly time. Divide the clause list into two pieces. Exhaustive search over variables which are on both sides of the split. There are at most 30 of them, so that's 2^30 = O(1) settings to try. Once those variables are set, you can recursively solve both sides, each one is an independent SAT instance with n/2 variables.

SAT/CNF optimization

Problem
I'm looking at a special subset of SAT optimization problem. For those not familiar with SAT and related topics, here's the related Wikipedia article.
TRUE=(a OR b OR c OR d) AND (a OR f) AND ...
There are no NOTs and it's in conjunctive normal form. This is easily solvable. However I'm trying to minimize the number of true assignments to make the whole statement true. I couldn't find a way to solve that problem.
Possible solutions
I came up with the following ways to solve it:
Convert to a directed graph and search for the minimum spanning tree, spanning only a subset of vertices. There's Edmonds' algorithm, but that gives an MST for the complete graph instead of a subset of the vertices.
Maybe there's a version of Edmonds' algorithm that solves the problem for a subset of the vertices?
Maybe there's a way to construct a graph out of the original problem that's solvable with other algorithms?
Use a SAT solver, a LIP solver or exhaustive search. I'm not interested in those solutions as I'm trying to use this problem as lecture material.
Question
Do you have any ideas/comments? Can you come up with other approaches that might work?
This problem is NP-Hard as well.
One can show an easy reduction from Hitting Set:
Hitting Set problem: Given sets S1,S2,...,Sn and a number k: choose a set S of size k, such that for every Si there is an element s in S such that s is in Si. [alternative definition: the intersection between each Si and S is not empty].
Reduction:
for an instance (S1,...,Sn,k) of hitting set, construct the instance of your problem: (S'1 AND S'2 AND ... AND S'n,k) where S'i is all elements in Si, with OR. These elements in S'i are variables in the formula.
proof:
Hitting Set -> This problem: If there is a hitting set S of size k, then by assigning all of S's elements true, the formula is satisfied with k true variables, since for every S'i there is some variable v which is in S and Si and thus also in S'i.
This problem -> Hitting set: build S with all elements whose assignment is true [same idea as Hitting Set->This problem].
Since you are looking for the optimization version of this problem, it is also NP-Hard, and if you are looking for an exact solution, you should try an exponential algorithm.
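The correspondence used in the proof can be checked mechanically on a small instance. In this sketch the first two clauses come from the example formula in the question, while the third set is made up for illustration; the helper names are also hypothetical.

```python
from itertools import combinations

def formula_satisfied(clauses, true_vars):
    """Evaluate a negation-free CNF: each clause needs one true variable."""
    return all(any(v in true_vars for v in clause) for clause in clauses)

def is_hitting_set(sets, chosen):
    """True if `chosen` intersects every set."""
    return all(bool(chosen & s) for s in sets)

# Hitting set instance; the third set is an invented example.
sets = [{"a", "b", "c", "d"}, {"a", "f"}, {"d", "e"}]
# Reduction: every set Si becomes the clause S'i (an OR over its elements).
clauses = [sorted(s) for s in sets]
variables = sorted(set().union(*sets))

# For every possible choice of true variables, both notions agree.
agree = all(
    formula_satisfied(clauses, set(subset)) == is_hitting_set(sets, set(subset))
    for r in range(len(variables) + 1)
    for subset in combinations(variables, r)
)
print(agree)
```

This only confirms the equivalence on one instance; it is not a substitute for the proof.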

Find the priority function / alphabet order for extreme higher order elements relation

This question is an extension to the following one. The difference is that now our function to optimize will have higher order relations between elements:
We have an array of elements a1,a2,...aN from an alphabet E. Assuming N >> |E|.
For each symbol of the alphabet we define a unique integer priority = V(sym). Let's define V{i} := V(symbol(ai)) for simplicity.
The task is to find a priority function V for which:
Count(i)->MIN | V{i} > V{i+1} <= V{i+2}
In other words, I need to find the priorities / permutation of the alphabet for which the number of positions i, satisfying the condition V{i}>V{i+1}<=V{i+2}, is minimum.
Maximum required abstraction (low priority for me). I guess once the solution model for the initial question is extended to cover the first part of this one, extending it further (see below) will be easier.
Given a matrix of signs B of size MxK (basically B[i,j] is from the set {<,>,<=,>=}), find the priority function V for which:
Sum(for all j in range [1,M]) {Count(i)}->EXTREMUM | V{i} B[j,1] V{i+1} B[j,2] ... B[j,K] V{i+K}
As an example, find the priority function V, for which the number of i, satisfying V{i}<V{i+1}<V{i+2} or V{i}>V{i+1}>V{i+2}, is minimum.
My intuition is that all variations on this problem will prove to be NP-hard. So I'd begin looking for heuristics that produce reasonable answers. This may involve some trial and error.
A simplistic approach is to write down a possible permutation. And then try possible swaps until you've arrived at a local minimum. Try several times, and pick the best answer.
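A minimal sketch of this swap-based local search for the first objective (minimizing the number of positions with V{i} > V{i+1} <= V{i+2}). The function names, the number of restarts and the fixed seed are arbitrary choices for illustration, and the result is only a local minimum, not a proven optimum.

```python
import random
from itertools import combinations

def count_valleys(seq, priority):
    """Number of positions i with V{i} > V{i+1} <= V{i+2}."""
    v = [priority[s] for s in seq]
    return sum(1 for i in range(len(v) - 2) if v[i] > v[i + 1] <= v[i + 2])

def local_search(seq, alphabet, restarts=10, seed=0):
    """Random-restart hill climbing over pairwise priority swaps."""
    rng = random.Random(seed)
    symbols = list(alphabet)
    best_cost, best_priority = None, None
    for _ in range(restarts):
        order = symbols[:]
        rng.shuffle(order)  # start from a random permutation
        priority = {s: rank for rank, s in enumerate(order)}
        cost = count_valleys(seq, priority)
        improved = True
        while improved:
            improved = False
            for x, y in combinations(symbols, 2):
                priority[x], priority[y] = priority[y], priority[x]
                new_cost = count_valleys(seq, priority)
                if new_cost < cost:
                    cost = new_cost  # keep the improving swap
                    improved = True
                else:
                    # undo a swap that did not help
                    priority[x], priority[y] = priority[y], priority[x]
        if best_cost is None or cost < best_cost:
            best_cost, best_priority = cost, dict(priority)
    return best_cost, best_priority

cost, priority = local_search("abcabcabc", "abc")
print(cost, priority)
```

Each evaluation rescans the whole array, which is the simplest option; with N >> |E| it would be cheaper to precompute counts of symbol triples once and evaluate a permutation from those counts.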
Simulated annealing provides a more sophisticated version of this approach, see http://en.wikipedia.org/wiki/Simulated_annealing for a description. It may take some experimentation to find a set of parameters that seems to converge relatively well.
Another idea is to look for a genetic algorithm. Based on a quick Google search it looks like the standard way to do this is to try to turn an NP-complete problem into a SAT problem, and then use a genetic algorithm on that problem. This approach would require turning this into a SAT problem in some reasonable way. Unfortunately it is not obvious to me how one would go about doing this reduction. Indeed in the first version that you had, your problem was closely connected to a classic NP-hard problem. The fact that it is labeled NP-hard rather than NP-complete is evidence that people haven't found a good way to transform it into a SAT problem. So if it isn't obvious how to turn the simple version into a SAT problem, then you are unlikely to convert the hard problem either.
But you could still try some variation on genetic algorithms. Mutation is pretty simple, just swap some elements around. One way to combine elements would be to take 3 permutations and use quicksort to find the combination as follows: take a random pivot, and then use "majority wins" to bucket elements into bigger and smaller. Sort each half in the same way.
I'm sorry that I can't just give you an approach and say, "This should work." You've got what looks like an open-ended research project, and the best I can do is give you some ideas about things you can try that might work reasonably well.

backtracking algorithm for set cover

Can someone provide me with a backtracking algorithm to solve the "set cover" problem to find the minimum number of sets that cover all the elements in the universe?
The greedy approach almost always selects more sets than the optimal number of sets.
This paper uses Linear Programming Relaxation to solve covering problems.
Basically, the LP relaxation yields good bounds, and can be used to identify solutions that are optimum in many cases. Incidentally, when I last looked at open source LP solvers (~2003) I wasn't impressed (some gave incorrect results), but there seem to be some decent open source LP solvers now.
Your problem needs a little more clarification - it seems that you are given a family of subsets $$S_1,\ldots,S_n$$ of a set A, such that the union of the subsets equals A, and you want a minimum number of subsets whose union is still A.
The basic approach is branch and bound with some heuristics. E.g., if a particular element of A is in only one subset $$S_i$$, then you must select $$S_i$$. Similarly, if $$S_k$$ is a subset of $$S_j$$, then there's no reason to consider $$S_k$$; if element $$a_i$$ is in every subset that $$a_j$$ is in, then you need not consider $$a_i$$, because covering $$a_j$$ covers it as well.
For branch and bound you need good bounding heuristics. Lower bounds can come from independent sets: if there are k elements $$i_1,\ldots,i_k$$ in A such that no subset contains two of them (that is, any subset containing $$i_p$$ and any subset containing $$i_q$$ are different subsets whenever p and q differ), then at least k subsets are needed. Better lower bounds come from the LP relaxation described above.
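Two of these reductions can be sketched as a preprocessing pass. The sketch assumes the family is given as a dictionary from a (sortable) name to a set; that representation and the function name `reduce_instance` are assumptions. The element-dominance rule is left out to keep it short.

```python
def reduce_instance(subsets):
    """Drop dominated subsets, then report the subsets that are forced.

    Returns (kept, forced): `kept` maps names to the surviving subsets and
    `forced` holds the names that every cover of the kept family must use.
    """
    sets = {name: frozenset(s) for name, s in subsets.items()}
    kept = {}
    for name, s in sets.items():
        # s is dominated if it is strictly inside another subset; among
        # identical subsets only the one with the smallest name survives.
        dominated = any(
            s < t or (s == t and other < name)
            for other, t in sets.items() if other != name
        )
        if not dominated:
            kept[name] = s
    forced = set()
    for element in set().union(*kept.values()):
        owners = [name for name, s in kept.items() if element in s]
        if len(owners) == 1:
            forced.add(owners[0])  # only one subset can cover this element
    return kept, forced

kept, forced = reduce_instance({"A": {1, 2, 3}, "B": {1, 2}, "C": {3, 4}})
print(sorted(kept), sorted(forced))
```

In a full solver these rules would be reapplied after each forced selection, since removing covered elements can expose new dominated subsets.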
The Espresso logic minimization system from Berkeley has a very high quality set covering engine.

Resources