How to find a minimal set of keys? - algorithm

I have a set of keys K and a finite set S ⊂ K^n of n-tuples of keys. Is there an efficient algorithm to find a bijective mapping f : S ↦ S', where S' ⊂ K^k with k < n minimal, that strips some of the keys, leaving the others untouched?

I'm afraid this is NP-complete.
It is equivalent to set cover.
Each of your keys allows you to distinguish certain pairs of elements (i.e. a set of edges). Your task is to select the smallest number of keys that allows you to distinguish every element - i.e. the smallest number of sets of edges that allows you to cover every edge.
However, the wiki page shows an approximate solution based on integer programming that may give a useful solution in practice.
Sketch of Proof
Suppose we have a generic set cover problem:
A,B,C
C,D
A,B,D
where we need to find the smallest number of these sets to cover every element A,B,C,D.
We construct a tuple for each letter A,B,C,D.
The tuple has a unique number in position i if and only if set i contains the letter. Otherwise, it contains 0 in that position.
There is also a zero tuple.
This means that the tuples would look like:
(0,0,0) The zero tuple
(1,0,2) The tuple for A (in sets 1 and 3)
(3,0,4) The tuple for B (in sets 1 and 3)
(5,6,0) The tuple for C (in sets 1 and 2)
(0,7,8) The tuple for D (in sets 2 and 3)
If you could solve your problem efficiently, you would then be able to use this mapping to solve set cover efficiently.
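To make the reduction concrete, here is a minimal Python sketch. It builds the tuples from the set cover instance above and then brute-forces the smallest set of key positions that keeps all tuples distinct. The function names are made up for illustration, and the search is exponential, so it only demonstrates the mapping and is not an efficient solver.

```python
from itertools import combinations

def tuples_from_set_cover(sets, universe):
    """Build the zero tuple plus one tuple per element of the universe."""
    counter = 0
    rows = [tuple(0 for _ in sets)]          # the zero tuple
    for item in universe:
        row = []
        for s in sets:
            if item in s:
                counter += 1                 # a fresh, unique number
                row.append(counter)
            else:
                row.append(0)
        rows.append(tuple(row))
    return rows

def minimal_key_positions(rows):
    """Smallest set of positions that keeps every tuple distinct (brute force)."""
    n = len(rows[0])
    for k in range(n + 1):
        for positions in combinations(range(n), k):
            projected = {tuple(r[p] for p in positions) for r in rows}
            if len(projected) == len(rows):
                return positions
    return None                              # rows contain duplicates

sets = [{"A", "B", "C"}, {"C", "D"}, {"A", "B", "D"}]
rows = tuples_from_set_cover(sets, "ABCD")
positions = minimal_key_positions(rows)      # (0, 1), i.e. sets 1 and 2
```

The positions returned correspond to the chosen sets: keeping positions 0 and 1 distinguishes every tuple from the zero tuple exactly because sets 1 and 2 together cover A, B, C and D.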

Related

Adjacency of intersected sets

Given a known number N of sets of items. Each set contains exactly D - 1 items (N >= D), unique within the set, and each item is shared between D - 1 sets. Therefore every set has two "neighbouring" sets: each neighbour differs from it by exactly one element. Also, every set has (if D is big enough) two more distant neighbouring sets that differ from it by exactly two elements, and so on. All sets together form a closed chain.
E.g. there are ten elements a x b n q p j t r c and D = 4. The sets are (the numbers in parentheses are hints for a possible ordering of neighbouring sets):
c x j (1)
p j x (2)
x a p (3)
p a n (4)
n q b (6)
a n q (5)
b r t (8)
b q t (7)
j c r (0)
c t r (9)
=> the respective chain of items is: r c j x p a n q b t. The example was generated by backward substitution. But how can the neighbourhood be restored algorithmically?
One way is obvious: simply enumerate all possible pairs of sets and check whether the two sets in each pair differ by exactly one element (small optimizations are possible, but they do not matter much).
Another way to solve the problem is to generate (for each set from the input) hashes for all possible (D - 2)-tuples of ordered elements, then find pairs of collisions. There is a knowledge domain called Locality-Sensitive Hashing.
The two approaches seem to me to be complete opposites. Hashing is faster, but it requires tuning (bucket sizes, the way hashes of vector elements are combined, etc.), and most of its operations run in amortized constant time, so some probabilistic behaviour is involved. I conclude that for some D and N performance degradation is possible.
I suspect that there is a deterministic (in the above sense) way to find all the neighbouring (adjacent) sets.
Here is an O(D*N) solution.
Two sets are defined to be neighbors if they differ in exactly 1 element. E.g. xap and pan.
Define N buckets, each labeled with a single element.
Place a copy of each set in each associated bucket. E.g. Bucket a holds xap, pan, and anq.
Start the chain by selecting any bucket, say a. Find 3 sets in the bucket that form a chain. E.g. xap, pan, anq. These are the first 3 sets in the chain.
Based on last two sets in the current chain, find the element in the last set that is not in the previous. For pan and anq, this is q. Go to this bucket.
In the present bucket, find the set which is a neighbor of the last set in the chain. E.g. in bucket q, the set neighboring anq is nqb. Add this set to the chain. Go to the previous step until you circle back to the first set of the chain, e.g. xap.
One slight optimization is to remove sets from buckets once they are put into the chain.
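The bucket walk above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the answerer's code: the names `is_neighbour` and `restore_chain` are invented, and the chain is seeded with one set and any neighbour of it (found through the buckets) rather than with three sets from a single bucket. It also assumes the input really forms one closed chain.

```python
from collections import defaultdict

def is_neighbour(a, b):
    """Two equally sized sets are neighbours if they differ in exactly one element."""
    return len(a) == len(b) and len(a - b) == 1

def restore_chain(sets):
    sets = [frozenset(s) for s in sets]
    # One bucket per element, holding every set that contains that element.
    buckets = defaultdict(list)
    for s in sets:
        for element in s:
            buckets[element].append(s)

    # Seed the chain with an arbitrary set and one of its neighbours.
    first = sets[0]
    second = next(t for e in first for t in buckets[e] if is_neighbour(first, t))
    chain = [first, second]
    used = {first, second}

    while len(chain) < len(sets):
        previous, last = chain[-2], chain[-1]
        (new_element,) = last - previous      # element that entered the chain last
        following = next(t for t in buckets[new_element]
                         if t not in used and is_neighbour(last, t))
        chain.append(following)
        used.add(following)
    return chain

example = ["cxj", "pjx", "xap", "pan", "nqb", "anq", "brt", "bqt", "jcr", "ctr"]
chain = restore_chain(example)
```

Each step inspects a single bucket of at most D - 1 sets, which is where the linear dependence on N comes from.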

Covering N sets of contiguous integers with minimum nos

We are given N sets of contiguous integers. Each such set is defined by two numbers, e.g. 2,5 represents the set containing 2,3,4,5. We have to print the minimum number of integers that must be selected in order to cover all N sets. A number is said to cover a set if it is contained in the set.
Ex: Given sets [2,5] , [3,4] , [10,100]. We can choose for example {3,10} so we cover up all 3 sets. Hence answer is 2.
I can't find a proper algorithm for N<=5000.
Here is an O(nlogn) approach to solve the problem:
1. Sort the sets by their last element (your example will be sorted as [3,4], [2,5], [10,100]).
2. Choose the end of the first interval.
3. Remove all sets that contain the chosen point.
4. If there is some uncovered set, return to 2.
Example (based on your example):
sort - your list of sets is sorted as l =[3,4], [2,5] , [10,100]
Choose 4
Remove the covered sets, you now have l=[10,100]
back to 2 - choose 100
Remove the last entry from the list l=[]
Stop clause is reached, you are done with two points: 4,100.
Correctness Proof (Guidelines) by #j_random_hacker:
Some element in that first (after sorting) range [i,j] must be
included in the answer, or that range would not be covered. Its
rightmost element j covers at least the same set of ranges as any
other element in [i,j]. Why? Suppose to the contrary that there was
some element k < j that covered a range that is not covered by j: then
that range must have an endpoint < j, which contradicts the fact that
[i,j] has the smallest endpoint (which we know because it's the first
in the sorted list)
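A compact Python sketch of this greedy approach, assuming each set is given as a `(low, high)` pair with inclusive bounds (the function name is hypothetical). Instead of physically removing covered sets, it skips every interval that already contains the most recently chosen point, which has the same effect after sorting.

```python
def min_cover_points(intervals):
    """Return a smallest list of integers such that every interval contains one."""
    points = []
    last = None
    # Sort by right endpoint; take that endpoint whenever an interval is uncovered.
    for low, high in sorted(intervals, key=lambda iv: iv[1]):
        if last is None or low > last:
            last = high
            points.append(high)
    return points

points = min_cover_points([(2, 5), (3, 4), (10, 100)])   # [4, 100]
```

Sorting dominates the running time, so the whole procedure stays at O(n log n).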
Note the following is a greedy algorithm that doesn't work (see the comments). I am leaving it here, in case it helps someone else.
I would approach this using a recursive algorithm. First, note that if the sets are disjoint, then you need "n" numbers. Second, the set of "covering" points can be restricted to the endpoints of the sets, which reduces the number of options.
You can iterate/recurse your way through this. The following is a high-level sketch of the algorithm:
One iteration step is:
Extract the endpoints from all the sets
Count the number of sets that each endpoint covers
Choose the endpoint with the maximum coverage
If the maximum coverage is 1, then choose an arbitrary point from each set.
Otherwise, choose the endpoint with the maximum coverage. If there are ties for the maximum, arbitrarily choose one. I don't believe it makes a difference when there are ties.
Remove all the sets covered by the endpoint, and add the endpoint to your "coverage points".
Repeat the process until either there are no sets left or the maximum coverage is 1.

Algorithm: Removing as few elements as possible from a set in order to enforce no subsets

I got a problem which I do not know how to solve:
I have a set of sets A = {A_1, A_2, ..., A_n} and I have a set B.
The goal now is to remove as few elements as possible from B (creating B') such that, after the removal, A_i is not a subset of B' for any 1 <= i <= n.
For example, if we have A_1 = {1,2}, A_2 = {1,3,4}, A_3={2,5}, and B={1,2,3,4,5}, we could e.g. remove 1 and 2 from B (that would yield B'={3,4,5}, which is not a superset of one of the A_i).
Is there an algorithm for determining the (minimal number of) elements to be removed?
It sounds like you want to remove the minimal hitting set of A from B (this is closely related to the vertex cover problem).
A hitting set for some set-of-sets A is itself a set such that it contains at least one element from each set in A (it "hits" each set). The minimal hitting set is the smallest such hitting set. So, if you have an MHS for your set-of-sets A, you have an element from each set in A. Removing this from B means no set in A can be a subset of the result B'.
All you need to do is calculate the MHS for (A1, A2, ... An), then remove that from B. Unfortunately, finding the MHS is NP-hard (the decision version is NP-complete). Knowing that though, you have a few options:
If your data set is small, do the obvious brute-force solution
Use a probabilistic algorithm to get a fast, approximate answer (see this PDF)
Run far, far away in the opposite direction
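For small inputs, the brute-force option could look like the sketch below. The function name is an assumption, and the running time is exponential in the number of distinct elements, so this is only practical for tiny data sets.

```python
from itertools import combinations

def minimum_hitting_set(sets):
    """Try candidate sets in order of increasing size; return the first that hits every set."""
    universe = sorted(set().union(*sets))
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & s for s in sets):
                return chosen
    return None   # only reached if some input set is empty

A = [{1, 2}, {1, 3, 4}, {2, 5}]
B = {1, 2, 3, 4, 5}
hitting = minimum_hitting_set(A)      # {1, 2}
B_prime = B - hitting                 # {3, 4, 5}
```

Other hitting sets of the same size exist for this input (for example {1, 5}); the sketch simply returns the first one it encounters.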
If you just need some approximation, start with the smallest set in A, and remove one of its elements from B. (You could just grab one at random, or check to see which element is in the most sets in A, depending on how accurate and how fast you need it to be.)
Now the smallest set in A isn't a subset of B. Move on from there, but check first to see whether or not the sets you're examining are subsets at this point or not.
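One possible reading of this heuristic, sketched in Python. The tie-breaking rule (prefer the element that occurs in the most sets still contained in B) and the function name are assumptions, the input sets are assumed to be non-empty, and the result is not guaranteed to be minimal.

```python
def greedy_removal(sets, b):
    """Remove elements from b until no set in `sets` is a subset of what remains."""
    remaining = set(b)
    removed = set()
    for s in sorted(sets, key=len):           # smallest sets first
        if not s <= remaining:
            continue                          # already broken by an earlier removal
        # Pick the element of s that appears in the most still-contained sets.
        victim = max(sorted(s),
                     key=lambda e: sum(1 for t in sets if e in t and t <= remaining))
        remaining.discard(victim)
        removed.add(victim)
    return removed, remaining

A = [{1, 2}, {1, 3, 4}, {2, 5}]
removed, b_prime = greedy_removal(A, {1, 2, 3, 4, 5})
```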
This reminds me of the vertex covering problem, and I remember some approximation algorithm for that that is similar to this one.
I think you should find the smallest of these sets and then delete from B the elements that are in it.

Finding Common Sets within noisy data

Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.
Suppose I have a giant list of giant sets G and an unknown to me list of sets H. Each individual set I in G was created by taking the union of some unknown number of sets from list H, then adding and removing an unknown number of elements.
Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this. E.g. something like, "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."
Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.
Any thoughts?
Example usage:
Compress G by replacing chunks of it with pieces of H, e.g.
G[1] = {1,2,3,5,6,7,9,10,11}
H[5] = {1,2,3}
H[6] = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
Define the distance d(i,j) = 1/(number of sets in G which contain both i and j) and then run a cluster analysis (http://en.wikipedia.org/wiki/Cluster_analysis). The resulting clusters are your candidates for the elements in H.
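As a rough illustration, the distance can be computed from co-occurrence counts, and a very simple clustering can be obtained by linking every pair whose distance is below a threshold. The threshold-based linking is a stand-in chosen here for brevity, not a method the answer prescribes, and all names are hypothetical. Pairs that never occur together have no entry (their distance is effectively infinite).

```python
from collections import Counter
from itertools import combinations

def cooccurrence_distances(G):
    """d(i, j) = 1 / (number of sets in G containing both i and j)."""
    together = Counter()
    for s in G:
        for i, j in combinations(sorted(s), 2):
            together[(i, j)] += 1
    return {pair: 1.0 / count for pair, count in together.items()}

def threshold_clusters(G, max_distance):
    """Union-find: merge elements whose distance is at most max_distance."""
    parent = {x: x for s in G for x in s}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for (i, j), d in cooccurrence_distances(G).items():
        if d <= max_distance:
            parent[find(i)] = find(j)

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())

G = [{1, 2, 3}, {1, 2, 3, 9}, {5, 6, 7}, {5, 6, 7, 1}]
clusters = threshold_clusters(G, 0.5)
```

With real data a proper clustering library would likely be a better fit; the point here is only how the distance is derived from G.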
There are tons of non-brainy ad hoc ways to attack this. Here's one.
Start by taking a random sample from G, say 64 sets.
For each file in these sets, construct a 64-bit integer telling which sets it appears in.
Group the files by this 64-bit value; so all the files that always appear together end up in the same group. Find the group with maximum ((number of files in group - 1) × (number of bits set in the bit-vector - 1)) and call that H[0].
Now throw that sample back and take a new random sample. Reduce it as much as you can using the H[0] you've already defined. Then apply the same algorithm to find H[1]. Rinse. Repeat.
Stop when additional H's are no longer helping you compress the sets.
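A single round of this procedure (finding one candidate such as H[0] from a sample) might look like the following sketch. Python integers are unbounded, so the 64-bit limit is only a convention here; the function name and the sample data are invented for illustration.

```python
from collections import defaultdict

def best_group(sample):
    """Group files by the bit-vector of sample sets they occur in; return the best-scoring group."""
    signature = defaultdict(int)
    for bit, files in enumerate(sample):
        for f in files:
            signature[f] |= 1 << bit          # bit i set: f occurs in sample[i]

    groups = defaultdict(set)
    for f, sig in signature.items():
        groups[sig].add(f)

    def score(item):
        sig, files = item
        # (number of files in group - 1) x (number of bits set - 1)
        return (len(files) - 1) * (bin(sig).count("1") - 1)

    _, files = max(groups.items(), key=score)
    return files

sample = [{1, 2, 3, 5, 6, 7, 9}, {1, 2, 3, 20}, {1, 2, 3, 5, 6, 7}, {30}]
h0 = best_group(sample)                       # {1, 2, 3}
```

Here files 1, 2 and 3 always appear together in three sets (score 4), while 5, 6 and 7 appear together in only two (score 2).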
To improve on this algorithm:
You can easily choose a slightly different measure of the goodness of groups that promotes groups with lots of nearby neighbors--files that appear in nearly the same set of sets.
You can also pretty easily test your existing H's against random samples from G to see if there are files you should consider adding or removing.
Well, the current ad-hoc way, which seems to be good enough, is as follows:
Remove all elements from all G_x that are in under 25 sets.
Create a mapping from element to set and from set to element.
For each element E in the element map, pick 3 sets and take their intersection. Make two copies of this, A and B.
For each set S in the set map that does not contain E, remove all elements of S from A or B (alternate between them)
Add Union(A,B) to H
Remove all elements of Union(A,B) from the element to set map (i.e. do not find overlapping sets).
How about a deterministic way (if you do not wish sets to intersect at all):
A) Turn sets in H into vertices labeled 1, 2, 3, ... size(H). Create a complete [un]directed graph between them all. Each vertex gets a value equal to the cardinality (size) of the set.
B) Go through all elements x in sets in H, create a mapping x -> [x1, x2, ... xm] if and only if x is in H[xi]. An array of sets will do. This helps you find overlapping sets.
C) Go through all sets in this array; for every pair of x1, x2 that are within the same set, eliminate the two edges between x1 and x2.
D) The remaining graph connects only non-overlapping sets (well, their indices in H).
E) Now find the non-intersecting path within this graph with the highest total value. From this you can reconstruct the list of non-intersecting sets with highest coverage. It is trivial to compute the missing elements.
F) If you want to minimize the cardinality of the remaining set, then subtract 0.5 from the value of each vertex. We know that 1 + 1 = 2, but 0.5 + 0.5 < 1.5 - so the algorithm will prefer a set {a,b} over {a} and {b}. This may not be exactly what you want, but it might inspire you.

Superset Search

I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
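The bit-vector variant can be sketched in Python using plain integers as bit vectors. The helper names are made up, the vocabulary is taken from the indexed sets themselves, and words of the document that are not in that vocabulary are simply ignored, which does not change the result.

```python
def build_positions(indexed_sets):
    """Assign each word of the indexed sets a bit position (a gap-free total order)."""
    vocabulary = sorted(set().union(*indexed_sets))
    return {word: i for i, word in enumerate(vocabulary)}

def to_bits(words, position):
    bits = 0
    for word in words:
        if word in position:                  # unknown words cannot affect containment
            bits |= 1 << position[word]
    return bits

def contained_sets(indexed_sets, document_words):
    """Indices of the indexed sets whose words all occur in the document."""
    position = build_positions(indexed_sets)
    vectors = [to_bits(s, position) for s in indexed_sets]
    query = to_bits(document_words, position)
    # v is a subset of the query exactly when OR-ing it in changes nothing.
    return [i for i, v in enumerate(vectors) if (v | query) == query]

terms = [{"stack", "overflow"}, {"foo", "bar"}, {"stack"}]
matches = contained_sets(terms, "stack overflow is great".split())   # [0, 2]
```

In a real index the position map and the vectors would be built once and reused across queries rather than rebuilt per call.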
What's your solution?
The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.
First I would construct 2 data structures, S and E.
S is an array of sets (it holds the N subsets).
S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N-1] = set(element1, element2, ...)
E is a map (element hash for index) of lists. Each list contains S-indices, where the element appears.
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, 2 new structures, set L and array C.
I store all the elements of D that exist in E in L. (O(n) operation)
C is an array (S-indices) of counters.
// count each subset's elements that are in L
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1
Finally,
for i in C:
    if C[i] == S[i].Count()
        // S[i] subset exists in D
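The counting scheme above could be written in Python roughly as follows. The names S, E, L and C follow the answer; the function name and the sample data are assumptions made for the example.

```python
from collections import defaultdict

def find_contained_sets(S, D):
    """Return the indices i for which S[i] is a subset of D."""
    # E: element -> list of indices of the sets in S that contain it.
    E = defaultdict(list)
    for idx, subset in enumerate(S):
        for element in subset:
            E[element].append(idx)

    # L: the elements of D that occur in at least one indexed set.
    L = {element for element in D if element in E}

    # C[idx]: how many elements of S[idx] were found in D.
    C = [0] * len(S)
    for element in L:
        for idx in E[element]:
            C[idx] += 1

    return [idx for idx, subset in enumerate(S) if C[idx] == len(subset)]

S = [{"stack", "overflow"}, {"foo", "bar"}, {"stack"}]
D = {"stack", "overflow", "is", "great"}
found = find_contained_sets(S, D)             # [0, 2]
```

The map E depends only on S, so it can be built once and reused for every document D.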
Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype), do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.

Resources