Suppose C refers to a set of containers {c1, c2, c3, ..., cn}, where each container holds a finite set of integers {i1, i2, i3, ..., im}. Further, suppose that an integer may appear in more than one container. Given a finite set of integers S = {s1, s2, s3, ..., sz}, find the size of the smallest subset of C whose containers together contain all the integers in S.
Note that there could be thousands of containers each with hundreds of integers. Therefore, brute force is slow for solving this problem.
I tried to solve the problem using a greedy algorithm. That is, each time I select the container with the largest number of integers in the set S, but I failed!
Can anyone suggest a fast algorithm for this problem?
This is the well known set cover problem. It is NP-hard — in fact, its decision version is one of the canonical NP-complete problems and was among the 21 problems in Karp's 1972 paper — and so no efficient algorithm is known for solving it exactly. Unless you can identify some special extra structure in your problem, you will have to be satisfied with an approximate result: that is, a subset of C whose union contains S, but which is not necessarily the smallest such subset of C.
The greedy algorithm is probably your best bet: it finds a collection of sets that is no more than O(log |S|) times the size of the smallest such collection.
You say that you were unable to get the greedy algorithm to work. I think this is probably because you failed to implement it correctly. You describe your algorithm like this:
each time I select the container with the largest number of integers in the set S
but the rule in the usual greedy algorithm is to select at each stage the container with the largest number of integers in the set S that are not in any container selected so far.
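For reference, a minimal sketch of that corrected greedy rule in Python (the container names and contents below are placeholders, not taken from the question):

```python
def greedy_cover(containers, S):
    """Greedy set cover: containers maps a name to a set of integers,
    S is the set of integers that must be covered."""
    remaining = set(S)
    chosen = []
    while remaining:
        # Pick the container covering the most *still-uncovered* integers.
        best = max(containers, key=lambda c: len(containers[c] & remaining))
        gained = containers[best] & remaining
        if not gained:
            raise ValueError("S cannot be covered by the given containers")
        chosen.append(best)
        remaining -= gained
    return chosen

# Example usage with made-up data
containers = {"c1": {1, 2, 3}, "c2": {2, 4}, "c3": {3, 4, 5}}
print(greedy_cover(containers, {1, 2, 3, 4, 5}))  # e.g. ['c1', 'c3']
```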
Related
I have a list L of lists l[i] of elements e. I am looking for an algorithm that finds a minimum set S_min of elements such that at least one member of S_min occurs in each l[i].
I am not only curious to find a simple algorithm that does this for me, but also to learn what problems of this sort are actually called. I am sure there is something out there.
I have implemented brute-force algorithms that start by adding to S_min all elements that occur in lists with len(l[i]) == 1. The rest is simple trial and error.
The problem you describe is the vertex cover problem in hypergraphs (equivalently, the hitting set problem), an optimization problem which is NP-hard in the general case but admits approximation algorithms for suitably bounded instances.
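The same greedy heuristic used for set cover carries over: repeatedly pick the element that occurs in the most not-yet-hit lists. A rough sketch in Python (it assumes every list is non-empty; the example data is made up):

```python
from collections import Counter

def greedy_hitting_set(lists):
    """Approximate minimum hitting set: pick elements until every list
    contains at least one picked element."""
    unhit = [set(l) for l in lists]
    picked = set()
    while unhit:
        # Count how many still-unhit lists each element appears in.
        counts = Counter(e for l in unhit for e in l)
        best = max(counts, key=counts.get)
        picked.add(best)
        unhit = [l for l in unhit if best not in l]
    return picked

print(greedy_hitting_set([[1, 2], [2, 3], [3, 4]]))  # e.g. {2, 3}
```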
I came across the following question in this course:
Consider a variation of the Knapsack problem where we have two
knapsacks, with integer capacities 𝑊1 and 𝑊2. As usual, we are given
𝑛 items with positive values and positive integer weights. We want to
pick subsets 𝑆1,𝑆2 with maximum total value such that the total weights of 𝑆1 and 𝑆2 are at most 𝑊1 and 𝑊2, respectively. Assume that every item fits in either knapsack. Consider the following two algorithmic approaches.
(1) Use the algorithm from lecture to pick a max-value feasible solution 𝑆1 for the first knapsack, and then run it again on the remaining items to pick a max-value feasible solution 𝑆2 for the second knapsack.
(2) Use the algorithm from lecture to pick a max-value feasible solution for a knapsack with capacity 𝑊1+𝑊2, and then split the chosen items into
two sets 𝑆1 and 𝑆2 whose total weights are at most 𝑊1 and 𝑊2, respectively.
Which of the following statements is true?
1. Algorithm (1) is guaranteed to produce an optimal feasible solution to the original problem provided 𝑊1=𝑊2.
2. Algorithm (1) is guaranteed to produce an optimal feasible solution to the original problem but algorithm (2) is not.
3. Algorithm (2) is guaranteed to produce an optimal feasible solution to the original problem but algorithm (1) is not.
4. Neither algorithm is guaranteed to produce an optimal feasible solution to the original problem.
The "algorithm from lecture" is on YouTube. https://www.youtube.com/watch?v=KX_6OF8X6HQ, which is 0-1 knapsack problem for one bag.
The correct answer to this question is option 4. This, this and this post present solutions to the problem. However, I'm having a hard time finding counterexamples showing that options 1 through 3 are incorrect. Can you cite any?
Edit:
The accepted answer doesn't provide a counterexample for option 1; see 2 knapsacks with same capacity - Why can't we just find the max-value twice for that.
(Weight; Value): (3;10), (3;10), (4;2)
capacities 7, 3
The first method puts the two (3;10) items into the first sack (value 20); the remaining (4;2) item does not fit into the second sack, so it is dropped. Packing (3;10) and (4;2) into the first sack and (3;10) into the second gives value 22 instead.
(Weight; Value): (4;10), (4;10), (4;10), (2;1)
capacities 6, 6
The second method chooses the three (4;10) items (total weight 12, value 30), but this set cannot be split across the two sacks without dropping an item; packing (4;10) and (2;1) into one sack and (4;10) into the other gives value 21, which is better.
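A small brute-force check of both counterexamples (this is just verification code, not the algorithm from the lecture): it tries every assignment of the items to sack 1, sack 2, or neither, and reports the best feasible total value.

```python
from itertools import product

def best_two_knapsacks(items, W1, W2):
    """Exhaustively assign each (weight, value) item to sack 1, sack 2, or neither."""
    best = 0
    for assignment in product((0, 1, 2), repeat=len(items)):
        w = [0, 0, 0]
        v = [0, 0, 0]
        for (weight, value), sack in zip(items, assignment):
            w[sack] += weight
            v[sack] += value
        if w[1] <= W1 and w[2] <= W2:
            best = max(best, v[1] + v[2])
    return best

# First counterexample: filling sack 1 optimally first gets 10+10=20, the optimum is 22.
print(best_two_knapsacks([(3, 10), (3, 10), (4, 2)], 7, 3))           # 22
# Second counterexample: a capacity-12 knapsack picks the three weight-4 items
# (value 30), which cannot be split 6/6; the true optimum is 21.
print(best_two_knapsacks([(4, 10), (4, 10), (4, 10), (2, 1)], 6, 6))  # 21
```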
I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because it is guaranteed that these distances are greater than the current known minimum distance.
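That pruning is just the triangle inequality: for any metric, dist(B,C) >= |dist(A,C) - dist(A,B)|, so a lower bound on an unmeasured distance can be computed from two measured ones. A tiny illustrative check in Python (the function name is only for illustration):

```python
def lower_bound(d_xz, d_yz):
    """Triangle inequality: dist(x, y) >= |dist(x, z) - dist(y, z)|."""
    return abs(d_xz - d_yz)

best = 0.5                        # current minimum: dist(A, B)
print(lower_bound(5, 0.5) > best) # dist(B, C) >= |dist(A, C) - dist(A, B)| = 4.5 -> True, skip
print(lower_bound(5, 2) > best)   # dist(A, D) >= |dist(A, C) - dist(C, D)| = 3.0 -> True, skip
```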
Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?
Is it possible to reduce the cost to something close to O(n log n) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a good approximation for small n.
Processing time (measured in big O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or anything like that. I have just been thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result to be, you may be able to just search the HashMap with the largest hashMap.size()/radius (because that HashMap contains the highest density of points and is therefore a good search candidate).
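In case it helps, here is a rough sketch of that two-step "best of the leaf winners" idea in Python (not the Java code from the repo above): `buckets` stands for whatever grouping the tree's leaves give you, and `dist` for your metric.

```python
from itertools import combinations

def closest_pair_in_bucket(bucket, dist):
    """Brute-force closest pair within one small bucket."""
    best = (float("inf"), None, None)
    for p, q in combinations(bucket, 2):
        d = dist(p, q)
        if d < best[0]:
            best = (d, p, q)
    return best

def approx_closest_pair(buckets, dist):
    """Approximate closest pair: the best pair found inside any single bucket.
    Pairs that straddle two buckets are missed, which is the accuracy
    trade-off described above."""
    best = (float("inf"), None, None)
    for bucket in buckets:
        candidate = closest_pair_in_bucket(bucket, dist)
        if candidate[0] < best[0]:
            best = candidate
    return best
```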
Good Luck
If you only have the sampled distances, not original point locations in a plane you can operate on, then I suspect you are bound to look at all O(E) distances.
Specifically, it would seem from your description that any valid solution needs to inspect every edge in order to rule it out as the closest pair, while simply inspecting every edge and taking the smallest one already solves the problem.
Planar versions get below O(V^2) by using planar geometry to deduce constraints on sets of edges, which lets them avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: points that are closer to the first point and points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
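A rough sketch of that recursive split, assuming the points are given as a list and only a `dist` function is available; the pivot choice here is simply random, which is one reasonable option:

```python
import random

def partition(points, dist, leaf_size=32):
    """Recursively split points into small groups: pick two pivots and send
    each point to whichever pivot it is closer to."""
    if len(points) <= leaf_size:
        return [points]
    a, b = random.sample(points, 2)
    near_a = [p for p in points if dist(p, a) <= dist(p, b)]
    near_b = [p for p in points if dist(p, a) > dist(p, b)]
    if not near_a or not near_b:          # degenerate split; stop recursing
        return [points]
    return partition(near_a, dist, leaf_size) + partition(near_b, dist, leaf_size)
```

The resulting leaves can then be searched like the buckets in the previous answer.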
I have N binary sequences of length L, where N and L may be very large, and the sequences may be very sparse, i.e. have many more 0s than 1s.
I want to select M sequences from them, namely b_1, b_2, b_3..., such that
b_1 | b_2 | b_3 ... | b_M = 1111...11 (L 1s)
Is there an algorithm to achieve it?
My idea is:
STEP1: for each position from 1 to L, count the total number of sequences that have a 1 at that position. Call this the position's 'owning number'.
STEP2: consider the position with the minimum owning number, and among the owning sequences of that position choose the one with the maximum number of 1s.
STEP3: ignore the chosen sequence, update the owning numbers and go back to STEP2.
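Roughly, in Python (assuming each sequence is represented as the set of positions holding a 1, and that covered positions are dropped when the owning numbers are updated):

```python
def pick_sequences(sequences, L):
    """Heuristic described above: repeatedly look at the rarest uncovered
    position and take the richest sequence that owns it."""
    uncovered = set(range(L))
    chosen = []
    available = list(sequences)
    while uncovered:
        # STEP1: owning number = how many remaining sequences cover each position.
        owning = {pos: sum(1 for s in available if pos in s) for pos in uncovered}
        pos = min(owning, key=owning.get)
        if owning[pos] == 0:
            raise ValueError("position %d cannot be covered" % pos)
        # STEP2: among the owners of that position, pick the one with the most 1s.
        owners = [s for s in available if pos in s]
        best = max(owners, key=len)
        chosen.append(best)
        # STEP3: ignore the chosen sequence and the positions it covers.
        available.remove(best)
        uncovered -= best
    return chosen

# Example: sequences as sets of 1-positions, L = 4;
# may pick three sequences even though {0, 1} and {2, 3} would suffice.
print(pick_sequences([{0, 1}, {1, 2}, {2, 3}, {0, 3}], 4))
```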
I believe that my method cannot generate the best answer.
Does anyone have a better idea?
This is the well known set cover problem. It is NP-hard — in fact, its decision version is one of the canonical NP-complete problems and was among the 21 problems included in Karp's 1972 paper — and so no efficient algorithm is known for solving it.
The algorithm you describe in your question is known as the "greedy algorithm" and (unless your problem has some special features that you are not telling us) it's essentially the best known approach. It finds a collection of sequences that is no more than O(log L) times the size of the smallest such collection, L being the universe size here, i.e. the sequence length.
Sounds like a typical backtracking task.
Yes, your algorithm sounds reasonable if you want a good answer quickly. If you want the combination with the fewest possible samples, you can't do better than trying all combinations.
Depending on the exact structure of the problem, there is another technique that often works well (and actually gives an optimal result):
Let x[j] be a boolean variable representing the choice of whether to include the j'th binary sequence in the result. A zero-suppressed binary decision diagram can now represent (perhaps succinctly, depending on the characteristics of the problem) the family of sets of variables x[j] such that the OR of the corresponding binary sequences is all ones. Finding the smallest such set (thus minimizing the number of sequences included) is relatively easy if the ZDD is succinct. Details can be found in The Art of Computer Programming chapter 7.1.4 (volume 4A).
It's also easy to adapt to an exact cover, by taking the family of sets such that there is exactly one 1 for every position.
Can someone provide me with a backtracking algorithm to solve the "set cover" problem to find the minimum number of sets that cover all the elements in the universe?
The greedy approach almost always selects more sets than the optimal number of sets.
This paper uses Linear Programming Relaxation to solve covering problems.
Basically, the LP relaxation yields good bounds, and can be used to identify solutions that are optimum in many cases. Incidentally, when I last looked at open source LP solvers (~2003) I wasn't impressed (some gave incorrect results), but there seem to be some decent open source LP solvers now.
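For concreteness, here is a small sketch of the set-cover LP relaxation using scipy.optimize.linprog (the subsets are a made-up example; rounding the fractional solution, or using it as a bound inside branch and bound, is a separate step):

```python
import numpy as np
from scipy.optimize import linprog

# Universe {0,...,4} and a made-up family of subsets.
subsets = [{0, 1, 2}, {1, 3}, {2, 3, 4}, {0, 4}]
universe = sorted(set().union(*subsets))

# Minimize sum of x_j subject to: for every element, the x_j of the
# subsets containing it sum to at least 1 (written as -A x <= -1).
c = np.ones(len(subsets))
A = np.array([[-1.0 if e in s else 0.0 for s in subsets] for e in universe])
b = -np.ones(len(universe))

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, 1)] * len(subsets), method="highs")
print(res.fun)   # LP optimum: a lower bound on the integer optimum
print(res.x)     # fractional selection of each subset
```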
Your problem needs a little more clarification - it seems that you are given a family of subsets $$S_1,\ldots,S_n$$ of a set A, such that the union of the subsets equals A, and you want a minimum number of subsets whose union is still A.
The basic approach is branch and bound with some heuristics. E.g., if a particular element of A is in only one subset $$S_i$$, then you must select $$S_i$$. Similarly, if $$S_k$$ is a subset of $$S_j$$, then there's no reason to consider $$S_k$$; and if element $$a_i$$ is in every subset that $$a_j$$ is in, then you need not bother considering $$a_i$$.
For branch and bound you need good bounding heuristics. Lower bounds can come from independent sets: if there are k elements $$i_1,\ldots,i_k$$ in A such that no subset contains two of them (that is, whenever $$i_p$$ is contained in $$S_p$$ and $$i_q$$ is contained in $$S_q$$ with $$p \neq q$$, the subsets $$S_p$$ and $$S_q$$ are distinct), then any cover must use at least k subsets. Better lower bounds come from the LP relaxation described above.
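A compact sketch of such a branch and bound in Python, using a greedily built independent set of elements as the lower bound (variable names are illustrative, the preprocessing reductions above are omitted for brevity, and it assumes the union of the given subsets covers the whole universe, as stated above):

```python
def lower_bound(uncovered, subsets):
    """Greedy independent set of elements: no subset contains two picked
    elements, so any cover needs at least one subset per picked element."""
    used, bound = set(), 0
    for e in uncovered:
        owners = {j for j, s in enumerate(subsets) if e in s}
        if owners.isdisjoint(used):
            used |= owners
            bound += 1
    return bound

def branch_and_bound(subsets, universe):
    best = [list(range(len(subsets)))]           # trivial cover: all subsets

    def recurse(uncovered, chosen):
        if not uncovered:
            if len(chosen) < len(best[0]):
                best[0] = list(chosen)
            return
        if len(chosen) + lower_bound(uncovered, subsets) >= len(best[0]):
            return                               # prune: cannot beat current best
        # Branch on an uncovered element: some subset containing it must be chosen.
        e = next(iter(uncovered))
        for j, s in enumerate(subsets):
            if e in s:
                recurse(uncovered - s, chosen + [j])

    recurse(frozenset(universe), [])
    return best[0]

subsets = [{1, 2, 3}, {2, 4}, {3, 4, 5}, {4, 5}]
print(branch_and_bound(subsets, {1, 2, 3, 4, 5}))   # e.g. [0, 2], a minimum cover of size 2
```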
The Espresso logic minimization system from Berkeley has a very high quality set covering engine.