Superset Search - algorithm

I'm looking for an algorithm to solve the following in a reasonable amount of time.
Given a set of sets, find all such sets that are subsets of a given set.
For example, if you have a set of search terms like ["stack overflow", "foo bar", ...], then given a document D, find all search terms whose words all appear in D.
I have found two solutions that are adequate:
Use a list of bit vectors as an index. To query for a given superset, create a bit vector for it, and then iterate over the list performing a bitwise OR for each vector in the list. If the result is equal to the search vector, the search set is a superset of the set represented by the current vector. This algorithm is O(n) where n is the number of sets in the index, and bitwise OR is very fast. Insertion is O(1). Caveat: to support all words in the English language, the bit vectors will need to be several million bits long, and there will need to exist a total order for the words, with no gaps.
Use a prefix tree (trie). Sort the sets before inserting them into the trie. When searching for a given set, sort it first. Iterate over the elements of the search set, activating nodes that match if they are either children of the root node or of a previously activated node. All paths, through activated nodes to a leaf, represent subsets of the search set. The complexity of this algorithm is O(a log a + ab) where a is the size of the search set and b is the number of indexed sets.
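The first approach above (bit vectors) can be sketched as follows. This is a minimal illustration, assuming Python integers stand in for the bit vectors and that the vocabulary is simply numbered in order of first appearance; the names `build_vocab`, `to_bits` and `find_subsets` are made up for this example.

```python
def build_vocab(sets):
    """Assign each distinct word a bit position (a gap-free total order)."""
    vocab = {}
    for s in sets:
        for word in s:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bits(words, vocab):
    """Encode words as an int bit vector; words outside the vocabulary are ignored."""
    bits = 0
    for word in words:
        if word in vocab:
            bits |= 1 << vocab[word]
    return bits

def find_subsets(index, query_bits):
    """Positions of indexed vectors whose OR with the query leaves the query unchanged."""
    return [i for i, v in enumerate(index) if (v | query_bits) == query_bits]

terms = [{"stack", "overflow"}, {"foo", "bar"}, {"stack", "bar"}]
vocab = build_vocab(terms)
index = [to_bits(t, vocab) for t in terms]
doc = "the stack overflow site has a bar".split()
matches = find_subsets(index, to_bits(doc, vocab))
```

Here `matches` holds the positions of the search terms whose words all occur in the document. Document words that no search term uses can be dropped safely, since they cannot affect the subset test.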
What's your solution?

The prefix trie sounds like something I'd try if the sets were sparse compared to the total vocabulary. Don't forget that if the suffix set of two different prefixes is the same, you can share the subgraph representing the suffix set (this can be achieved by hash-consing rather than arbitrary DFA minimization), giving a DAG rather than a tree. Try ordering your words least or most frequent first (I'll bet one or the other is better than some random or alphabetic order).
For a variation on your first strategy, where you represent each set by a very large integer (bit vector), use a sparse ordered set/map of integers (a trie on the sequence of bits which skips runs of consecutive 0s) - http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.5452 (implemented in http://www.scala-lang.org/docu/files/api/scala/collection/immutable/IntMap.html).
If your reference set (of sets) is fixed, and you want to find for many of those sets which ones contain others, I'd compute the immediate containment relation (a directed acyclic graph with a path from a->b iff b is contained in a, and without the redundant arcs a->c where a->b and b->c). The branching factor is no more than the number of elements in a set. The vertices reachable from the given set are exactly those that are subsets of it.

First I would construct 2 data structures, S and E.
S is an array holding the N indexed sets.
S[0] = set(element1, element2, ...)
S[1] = set(element1, element2, ...)
...
S[N-1] = set(element1, element2, ...)
E is a map (keyed by element hash) of lists. Each list contains the S-indices of the sets in which that element appears.
// O( S_total_elements ) = O(n) operation
E[element1] = list(S1, S6, ...)
E[element2] = list(S3, S4, S8, ...)
...
Now, two new structures: a set L and an array C.
L stores all the elements of D that exist in E. (O(n) operation)
C is an array of counters, indexed like S.
// for each indexed set, count how many of its elements occur in D
foreach e in L:
    foreach idx in E[e]:
        C[idx] = C[idx] + 1
Finally,
for i in 0 .. N-1:
    if C[i] == S[i].Count()
        // S[i] is a subset of D
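The counting scheme above might look like this in Python. It is only a sketch: `build_index` and `subsets_of` are hypothetical names, and the counters live in a dictionary rather than a fixed array.

```python
from collections import defaultdict

def build_index(sets):
    """E maps each element to the indices of the sets that contain it."""
    E = defaultdict(list)
    for idx, s in enumerate(sets):
        for element in s:
            E[element].append(idx)
    return E

def subsets_of(document, sets, E):
    """Return the indices i such that sets[i] is a subset of document."""
    L = {e for e in document if e in E}   # elements of D known to the index
    C = defaultdict(int)                  # C[i] = elements of sets[i] seen in D
    for e in L:
        for idx in E[e]:
            C[idx] += 1
    # Note: an empty indexed set never gets a counter, so it is not reported.
    return sorted(i for i, hits in C.items() if hits == len(sets[i]))

S = [{"stack", "overflow"}, {"foo", "bar"}, {"bar"}]
E = build_index(S)
found = subsets_of("stack overflow bar baz".split(), S, E)
```

The work per query is proportional to the total length of the posting lists touched, so documents made of rare words are cheap to process.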

Can you build an index for your documents? i.e. a mapping from each word to those documents containing that word. Once you've built that, lookup should be pretty quick and you can just do set intersection to find the documents matching all words.
Here's Wiki on full text search.
EDIT: Ok, I got that backwards.
You could convert your document to a set (if your language has a set datatype), do the same with your searches. Then it becomes a simple matter of testing whether one is a subset of the other.
Behind the scenes, this is effectively the same idea: it would probably involve building a hash table for the document, hashing the queries, and checking each word in the query in turn. This would be O(nm) where n is the number of searches and m the average number of words in a search.

Related

How to effectively answer range queries in an array of integers?

How to effectively answer range queries in an array of integers?
Queries are of one type only, which is, given a range [a,b], find the sum of elements that are less than x (here x is a part of each query, say of the form a b x).
Initially, I tried to literally go from a to b, check whether the current element is less than x, and add it up. But this is very inefficient, as the complexity is O(n) per query.
Now I am trying segment trees, sorting the numbers while merging. But my challenge is that if I sort, I lose the integers' relative order. So when a query comes, I cannot use the sorted array to get the values from a to b.
Here are two approaches to solving this problem with segment trees:
Approach 1
You can use a segment tree of sorted arrays.
As usual, the segment tree divides your array into a series of subranges of different sizes. For each subrange you store a sorted list of the entries plus a cumulative sum of the sorted list. You can then use binary search to find the sum of entries below your threshold value in any subrange.
When given a query, you first work out the O(log(n)) subranges that cover your [a,b] range. For each of these you use an O(log(n)) binary search. Overall this is O(q log^2(n)) complexity to answer q queries (plus the preprocessing time).
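A compact sketch of Approach 1, assuming 0-based inclusive indices and the "strictly less than x" rule from the question. The class name `MergeSortTree` and the bottom-up array layout are choices made for this example, not something the answer prescribes.

```python
from bisect import bisect_left
from itertools import accumulate

class MergeSortTree:
    def __init__(self, values):
        self.n = len(values)
        # Nodes n..2n-1 are the leaves; node i holds the entries of nodes 2i and 2i+1.
        self.sorted = [[] for _ in range(2 * self.n)]
        for i, v in enumerate(values):
            self.sorted[self.n + i] = [v]
        for i in range(self.n - 1, 0, -1):
            self.sorted[i] = sorted(self.sorted[2 * i] + self.sorted[2 * i + 1])
        # prefix[i][k] = sum of the k smallest entries stored in node i
        self.prefix = [[0] + list(accumulate(s)) for s in self.sorted]

    def _below(self, node, x):
        k = bisect_left(self.sorted[node], x)   # number of entries < x in this node
        return self.prefix[node][k]

    def sum_below(self, a, b, x):
        """Sum of values[i] with values[i] < x for a <= i <= b."""
        total = 0
        lo, hi = a + self.n, b + self.n + 1
        while lo < hi:
            if lo & 1:
                total += self._below(lo, x)
                lo += 1
            if hi & 1:
                hi -= 1
                total += self._below(hi, x)
            lo >>= 1
            hi >>= 1
        return total

tree = MergeSortTree([5, 1, 4, 2, 3])
```

Each query visits O(log n) nodes and does one binary search in each, which matches the O(log^2 n) per-query cost stated above.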
Approach 2
You can use a dynamic segment tree.
A segment tree allows you to answer queries of the form "Compute sum of elements from a to b" in O(logn) time, and also to modify a single entry in O(logn).
Therefore if you start with an empty segment tree, you can reinsert the entries in increasing order. Suppose we have added all entries from 1 to 5, so our array may look like:
[0,0,0,3,0,0,0,2,0,0,0,0,0,0,1,0,0,0,4,4,0,0,5,1]
(The 0s represent entries that are bigger than 5 so haven't been added yet.)
At this point you can answer any queries that have a threshold of 5.
Overall this will cost O(nlog(n)) to add all the entries into the segment tree, O(qlog(q)) to sort the queries, and O(qlog(n)) to use the segment tree to answer the queries.
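Approach 2 can be sketched offline like this, assuming all queries are known up front and that "below the threshold" means strictly less than x, as in the question. The function name `offline_sum_below` and the (a, b, x) tuple layout with 0-based inclusive bounds are assumptions for illustration.

```python
def offline_sum_below(values, queries):
    """queries: list of (a, b, x). Returns, per query, the sum of values[a..b] that are < x."""
    n = len(values)
    tree = [0] * (2 * n)                  # bottom-up sum segment tree, initially empty

    def add(pos, v):
        pos += n
        while pos:                        # update the leaf and all of its ancestors
            tree[pos] += v
            pos >>= 1

    def range_sum(a, b):                  # inclusive bounds
        total, lo, hi = 0, a + n, b + n + 1
        while lo < hi:
            if lo & 1:
                total += tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += tree[hi]
            lo >>= 1
            hi >>= 1
        return total

    order = sorted(range(n), key=lambda i: values[i])               # positions by value
    by_threshold = sorted(range(len(queries)), key=lambda q: queries[q][2])
    answers = [0] * len(queries)
    nxt = 0
    for q in by_threshold:
        a, b, x = queries[q]
        while nxt < n and values[order[nxt]] < x:   # insert every entry below x
            add(order[nxt], values[order[nxt]])
            nxt += 1
        answers[q] = range_sum(a, b)
    return answers

answers = offline_sum_below([5, 1, 4, 2, 3], [(1, 3, 4), (0, 4, 10), (0, 4, 1), (2, 2, 5)])
```

Because the queries are processed in increasing order of threshold, each array entry is inserted exactly once, and the answers are written back in the original query order.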

How to find a minimal set of keys?

I have a set of keys K and a finite set S ⊂ K^n of n-tuples of keys. Is there an efficient algorithm to find a bijective mapping f : S ↦ S', where S' ⊂ K^k with k < n minimal, that strips some of the keys, leaving the others untouched?
I'm afraid this is NP-complete.
It is equivalent to set cover.
Each of your keys allows you to distinguish certain pairs of elements (i.e. a set of edges). Your task is to select the smallest number of keys that allows you to distinguish every element - i.e. the smallest number of sets of edges that allows you to cover every edge.
However, the wiki page shows an approximate solution based on integer programming that may give a useful solution in practice.
Sketch of Proof
Suppose we have a generic set cover problem:
A,B,C
C,D
A,B,D
where we need to find the smallest number of these sets to cover every element A,B,C,D.
We construct a tuple for each letter A,B,C,D.
The tuple has a unique number in position i if and only if set i contains the letter. Otherwise, they contain 0.
There is also a zero tuple.
This means that the tuples would look like:
(0,0,0) The zero tuple
(1,0,2) The tuple for A (in sets 1 and 3)
(3,0,4) The tuple for B (in sets 1 and 3)
(5,6,0) The tuple for C (in sets 1 and 2)
(0,7,8) The tuple for D (in sets 2 and 3)
If you could solve your problem efficiently, you would then be able to use this mapping to solve set cover efficiently.
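To make the reduction concrete, the sketch below builds the tuples from the example sets and then brute-forces the smallest set of positions that keeps all tuples distinct. The helper names are hypothetical, and the search is exponential, so it only illustrates the correspondence with set cover; it is not an efficient algorithm.

```python
from itertools import combinations, count

def tuples_from_sets(sets, universe):
    """Build the zero tuple plus one tuple per element, as in the sketch of proof."""
    fresh = count(1)                       # supplies numbers that are never reused
    rows = [tuple(0 for _ in sets)]        # the zero tuple
    for element in universe:
        rows.append(tuple(next(fresh) if element in s else 0 for s in sets))
    return rows

def smallest_key_positions(rows):
    """Brute force: fewest positions that still keep every row distinct."""
    width = len(rows[0])
    for k in range(width + 1):
        for positions in combinations(range(width), k):
            projected = {tuple(row[p] for p in positions) for row in rows}
            if len(projected) == len(rows):
                return positions
    return None

sets = [{"A", "B", "C"}, {"C", "D"}, {"A", "B", "D"}]
rows = tuples_from_sets(sets, "ABCD")
kept = smallest_key_positions(rows)
```

A letter's tuple differs from the zero tuple only at positions whose set contains that letter, so the kept positions (0-based) form a set cover, and a smallest choice of positions is a smallest cover.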

Find the pair of bitstrings with the largest number of common set bits

I want to find an algorithm to find the pair of bitstrings in an array that have the largest number of common set bits (among all pairs in the array). I know it is possible to do this by comparing all pairs of bitstrings in the array, but this is O(n^2). Is there a more efficient algorithm? Ideally, I would like the algorithm to work incrementally by processing one incoming bitstring in each iteration.
For example, suppose we have this array of bitstrings (of length 8):
B1:01010001
B2:01101010
B3:01101010
B4:11001010
B5:00110001
The best pair here is B2 and B3, which have four common set bits.
I found a paper that appears to describe such an algorithm (S. Taylor & T. Drummond (2011); "Binary Histogrammed Intensity Patches for Efficient and Robust Matching"; Int. J. Comput. Vis. 94:241–265), but I don't understand this description from page 252:
This can be incrementally updated in each iteration as the only [bitstring] overlaps that need recomputing are those for the new parent feature and any other [bitstrings] in the root whose “most overlapping feature” was one of the two selected for combination. This avoids the need for the O(N^2) overlap comparison in every iteration and allows a forest for a typically-sized database of 700 features to be built in under a second.
As far as I can tell, Taylor & Drummond (2011) do not purport to give an O(n) algorithm for finding the pair of bitstrings in an array with the largest number of common set bits. They sketch an argument that a record of the best such pairs can be updated in O(n) after a new bitstring has been added to the array (and two old bitstrings removed).
Certainly the explanation of the algorithm on page 252 is not very clear, and I think their sketch argument that the record can be updated in O(n) is incomplete at best, so I can see why you are confused.
Anyway, here's my best attempt to explain Algorithm 1 from the paper.
Algorithm
The algorithm takes an array of bitstrings and constructs a lookup tree. A lookup tree is a binary forest (set of binary trees) whose leaves are the original bitstrings from the array, whose internal nodes are new bitstrings, and where if node A is a parent of node B, then A & B = A (that is, all the set bits in A are also set in B).
For example, if the input is this array of bitstrings:
then the output is the lookup tree:
The algorithm as described in the paper proceeds as follows:
1. Let R be the initial set of bitstrings (the root set).
2. For each bitstring f1 in R that has no partner in R, find and record its partner (the bitstring f2 in R − {f1} which has the largest number of set bits in common with f1) and record the number of bits they have in common.
3. If there is no pair of bitstrings in R with any common set bits, stop.
4. Let f1 and f2 be the pair of bitstrings in R with the largest number of common set bits.
5. Let p = f1 & f2 be the parent of f1 and f2.
6. Remove f1 and f2 from R; add p to R.
7. Go to step 2.
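The steps above can be sketched naively in Python. This version rescans all pairs of roots on every round, so it does not include the incremental partner bookkeeping from the paper; the function name and the node-id representation are assumptions made for this example.

```python
from itertools import combinations

def build_lookup_forest(bitstrings):
    """Return (nodes, children, roots); nodes are ints, leaves occupy ids 0..n-1."""
    nodes = list(bitstrings)            # node id -> bitstring
    children = {}                       # parent id -> (child id, child id)
    roots = list(range(len(nodes)))     # the root set R, as node ids
    while True:
        best = None
        for i, j in combinations(roots, 2):          # naive rescan of every pair
            common = bin(nodes[i] & nodes[j]).count("1")
            if common > 0 and (best is None or common > best[0]):
                best = (common, i, j)
        if best is None:
            break                       # no two roots share a set bit: stop
        _, i, j = best
        nodes.append(nodes[i] & nodes[j])            # parent p = f1 & f2
        parent = len(nodes) - 1
        children[parent] = (i, j)
        roots = [r for r in roots if r not in (i, j)] + [parent]
    return nodes, children, roots

strings = [0b01010001, 0b01101010, 0b01101010, 0b11001010, 0b00110001]
nodes, children, roots = build_lookup_forest(strings)
```

Node ids are used instead of the bitstrings themselves because the input may contain duplicates, as B2 and B3 do in the question's example.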
Analysis
Suppose that the array contains n bitstrings of fixed length. Then the algorithm as described is O(n^3) because step 2 is O(n^2), and there are O(n) iterations, because at each iteration we remove two bitstrings from R and add one.
The paper contains an argument that step 2 is Ω(n^2) only on the first time around the loop, and on other iterations it is O(n) because we only have to find the partner of p "and any other bitstrings in R whose partner was one of the two selected for combination." However, this argument is not convincing to me: it is not clear that there are only O(1) other such bitstrings. (Maybe there's a better argument?)
We could bring the algorithm down to O(n^2) by storing the number of common set bits between every pair of bitstrings. This requires O(n^2) extra space.
Reference
S. Taylor & T. Drummond (2011). "Binary Histogrammed Intensity Patches for Efficient and Robust Matching". Int. J. Comput. Vis. 94:241–265.
Well for each bit position you could maintain two sets, those with that position on and those with it off. The sets could be placed in two binary trees for example.
Then you just perform set intersections of the "on" sets, first over all eight bit positions, then over every combination of seven, and so on, until you find an intersection with at least two elements.
The complexity here grows exponentially in the bit size, but if it is small and fixed this isn't a problem.
Another way to do it might be to look at the n k-bit strings as n points in a kD space, and your task is to find the two points closest together. There are a number of geometric algorithms to do this.

How to find the subset with the greatest number of items in common?

Let's say I have a number of 'known' sets:
1 {a, b, c, d, e}
2 {b, c, d, e}
3 {a, c, d}
4 {c, d}
I'd like a function which takes a set as an input (for example {a, c, d, e}) and finds the known set that has the highest number of elements in common with it and contains no other items. In other words, the known subset of the input with the greatest cardinality. The answer doesn't have to be a proper subset. The answer in this case would be {a, c, d}.
EDIT: the above example was wrong, now fixed.
I'm trying to find the absolute most efficient way of doing this.
(In the below, I am assuming that the cost of comparing two sets is O(1) for the sake of simplicity. That operation is outside my control so there's no point thinking about it. In truth it would be a function of the cardinality of the two sets being compared.)
Candidate 1:
Generate all subsets of the input, then iterate over the known sets and return the largest one that is a subset. The downside to this is that the complexity will be something like O(2^n × m), where n is the cardinality of the input set and m is the number of 'known' subsets.
Candidate 1a (thanks #bratbrat):
Iterate over all 'known' sets and calculate the cardinality of the intersection, and take the one with the highest value. This would be O(n) where n is the number of subsets.
Candidate 2:
Create an inverse table and calculate the euclidean distance between the input and the known sets. This could be quite quick. I'm not clear how I could limit this to include only subsets without a subsequent O(n) filter.
Candidate 3:
Iterate over all known sets and compare against the input. The complexity would be O(n) where n is the number of known sets.
I have at my disposal the set functions built into Python and Redis.
None of these seems particularly great. Ideas? The number of sets may get large (around 100,000 at a guess).
There's no possible way to do this in less than O(n) time... just reading the input is O(n).
A couple ideas:
Sort the sets by size (biggest first), and search for the first set which is a subset of the input set. Once you find one, you don't have to examine the rest.
If the number of possible items which could be in the sets is limited, you could represent them by bit-vectors. Then you could calculate a lookup table to tell you whether a given set is a subset of the input set. (Walk down the bits for each input set under consideration, word by word, indexing each word into the appropriate table. If you find an entry telling you that it's not a subset, again, you can move on directly to the next input set.) Whether this would actually buy you performance, depends on the implementation language. I imagine it would be most effective in a language with primitive integral types, like C or Java.
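The first idea above (sort by size, stop at the first hit) might be sketched like this in Python, using the built-in set type mentioned in the question. The name `largest_known_subset` is hypothetical, and the sort is assumed to be done once, ahead of all queries.

```python
known = [set("abcde"), set("bcde"), set("acd"), set("cd")]
by_size = sorted(known, key=len, reverse=True)    # biggest first, computed once

def largest_known_subset(by_size, query):
    """Return the first (hence largest) known set contained in query, or None."""
    for candidate in by_size:
        if len(candidate) > len(query):
            continue                   # too big to be a subset, skip cheaply
        if candidate <= query:         # set.__le__ is the subset test
            return candidate
    return None

result = largest_known_subset(by_size, set("acde"))
```

The worst case is still a full scan, but queries that match a large known set finish early.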
Take the union of the known sets. This becomes a dictionary of known elements.
Sort the known elements by their value (they're integers, right?). This defines a given integer's position in a bit string.
Use the above to define bit strings for each of the known sets. This is a one time operation - the results should be stored to avoid recomputation.
For an input set, run it through the same transform to obtain its bit string.
To get the largest subset, run through the list of known bit strings, taking the intersection (logical and) with the input bit string. Count the '1' elements. Remember the largest one.
http://packages.python.org/bitstring
As mentioned in the comments, this can be parallelized by subdividing the known sets and giving each thread its own subset to work on. Each thread serves up its best match and then the parent thread picks the best from the threads.
How many searches are you making? In case you are searching multiple input sets you should be able to pre-process all the known sets (perhaps as a tree structure) and your search time for each query would be in the order of your query set size.
E.g.: Create a trie structure with all the known sets. Make sure to sort each set before inserting it. For the query, follow the links that are in the set.
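A minimal sketch of the trie idea, assuming the elements are hashable and mutually comparable so that each set can be sorted. The nested-dict layout, the `None` end marker and the function names are choices made for this example.

```python
def build_trie(known_sets):
    """Insert each known set as a path of its sorted elements."""
    root = {}
    for s in known_sets:
        node = root
        for element in sorted(s):
            node = node.setdefault(element, {})
        node[None] = frozenset(s)        # None marks the end of a stored set
    return root

def largest_subset(trie, query):
    """Follow only links labelled with query elements; keep the largest set seen."""
    query = sorted(query)
    best = None
    stack = [(trie, 0)]                  # (node, first query index still usable)
    while stack:
        node, start = stack.pop()
        stored = node.get(None)
        if stored is not None and (best is None or len(stored) > len(best)):
            best = stored
        for k in range(start, len(query)):
            child = node.get(query[k])
            if child is not None:
                stack.append((child, k + 1))
    return best

trie = build_trie([set("abcde"), set("bcde"), set("acd"), set("cd")])
answer = largest_subset(trie, set("acde"))
```

The search only ever walks paths made of query elements, so known sets containing anything outside the query are never visited past their first foreign element.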

Generate all subset sums within a range faster than O((k+N) * 2^(N/2))?

Is there a way to generate all of the subset sums s1, s2, ..., sk that fall in a range [A,B] faster than O((k+N)*2^(N/2)), where k is the number of sums there are in [A,B]? Note that k is only known after we have enumerated all subset sums within [A,B].
I'm currently using a modified Horowitz-Sahni algorithm. For example, I first call it to find the smallest sum greater than or equal to A, giving me s1. Then I call it again for the next smallest sum greater than s1, giving me s2. Repeat this until we find a sum s(k+1) greater than B. There is a lot of computation repeated between each iteration, even without rebuilding the initial two 2^(N/2) lists, so is there a way to do better?
In my problem, N is about 15, and the magnitude of the numbers is on the order of millions, so I haven't considered the dynamic programming route.
Check the subset sum on Wikipedia. As far as I know, it's the fastest known algorithm, which operates in O(2^(N/2)) time.
Edit:
If you're looking for multiple possible sums, instead of just 0, you can save the end arrays and just iterate through them again (which is roughly an O(2^(n/2)) operation) and save re-computing them. The values of all the possible subsets don't change with the target.
Edit again:
I'm not wholly sure what you want. Are we running K searches for one independent value each, or looking for any subset that has a value in a specific range that is K wide? Or are you trying to approximate the second by using the first?
Edit in response:
Yes, you do get a lot of duplicate work even without rebuilding the list. But if you don't rebuild the list, that's not O(k * N * 2^(N/2)). Building the list is O(N * 2^(N/2)).
If you know A and B right now, you could begin iteration, and then simply not stop when you find the right answer (the bottom bound), but keep going until it goes out of range. That should be roughly the same as solving subset sum for just one solution, involving only +k more ops, and when you're done, you can ditch the list.
More edit:
You have a range of sums, from A to B. First, you solve subset sum problem for A. Then, you just keep iterating and storing the results, until you find the solution for B, at which point you stop. Now you have every sum between A and B in a single run, and it will only cost you one subset sum problem solve plus K operations for K values in the range A to B, which is linear and nice and fast.
s = *i + *j; if s > B then ++i; else if s < A then ++j; else { print s; ... what_goes_here? ... }
No, no, no. I get the source of your confusion now (I misread something), but it's still not as complex as what you had originally. If you want to find ALL combinations within the range, instead of one, you will just have to iterate over all combinations of both lists, which isn't too bad.
Excuse my use of auto. C++0x compiler.
#include <algorithm>
#include <vector>

// A and B are the inclusive bounds of the target range, defined elsewhere.
std::vector<int> sums;
std::vector<int> firstlist;
std::vector<int> secondlist;
// Fill in first/secondlist.
std::sort(firstlist.begin(), firstlist.end());
std::sort(secondlist.begin(), secondlist.end());
// Since we want all in a range, rather than just the first, we need to check all combinations. Horowitz/Sahni is only designed to find one.
for (auto firstit = firstlist.begin(); firstit != firstlist.end(); ++firstit) {
    // Restart the inner iterator for every element of the first list.
    for (auto secondit = secondlist.begin(); secondit != secondlist.end(); ++secondit) {
        int sum = *firstit + *secondit;
        if (sum >= A && sum <= B)
            sums.push_back(sum);
    }
}
It's still not great. But it could be optimized if you know in advance that N is very large, for example, mapping or hashmapping sums to iterators, so that any given firstit can find any suitable partners in secondit, reducing the running time.
It is possible to do this in O(N*2^(N/2)), using ideas similar to Horowitz-Sahni, but we try to do some optimizations to reduce the constants in the big-O.
We do the following:
Step 1: Split into sets of N/2, and generate all possible 2^(N/2) sets for each split. Call them S1 and S2. This we can do in O(2^(N/2)) (note: the N factor is missing here, due to an optimization we can do).
Step 2: Next sort the larger of S1 and S2 (say S1) in O(N*2^(N/2)) time (we optimize here by not sorting both).
Step 3: Find Subset sums in range [A,B] in S1 using binary search (as it is sorted).
Step 4: Next, for each sum in S2, find using binary search the sets in S1 whose union with this gives a sum in range [A,B]. This is O(N*2^(N/2)). At the same time, find if that corresponding set in S2 is in the range [A,B]. The optimization here is to combine loops. Note: This gives you a representation of the sets (in terms of two indexes in S1), not the sets themselves. If you want all the sets, this becomes O(K + N*2^(N/2)), where K is the number of sets.
Further optimizations might be possible, for instance when sum from S2, is negative, we don't consider sums < A etc.
Since Steps 2,3,4 should be pretty clear, I will elaborate further on how to get Step 1 done in O(2^(N/2)) time.
For this, we use the concept of Gray Codes. Gray codes are a sequence of binary bit patterns in which each pattern differs from the previous pattern in exactly one bit.
Example: 00 -> 01 -> 11 -> 10 is a gray code with 2 bits.
There are gray codes which go through all possible N/2 bit numbers and these can be generated iteratively (see the wiki page I linked to), in O(1) time for each step (total O(2^(N/2)) steps), given the previous bit pattern, i.e. given current bit pattern, we can generate the next bit pattern in O(1) time.
This enables us to form all the subset sums, by using the previous sum and changing that by just adding or subtracting one number (corresponding to the differing bit position) to get the next sum.
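A short sketch of the Gray-code enumeration, assuming the standard reflected Gray code, in which step i flips the bit at the position of the lowest set bit of i. The generator name `subset_sums_gray` is made up for this example.

```python
def subset_sums_gray(numbers):
    """Yield (mask, sum) for every subset, visiting masks in Gray-code order."""
    total, mask = 0, 0
    yield mask, total
    for step in range(1, 1 << len(numbers)):
        bit = (step & -step).bit_length() - 1   # index of the lowest set bit of step
        mask ^= 1 << bit                        # flip exactly one membership bit
        if (mask >> bit) & 1:
            total += numbers[bit]               # element was just added
        else:
            total -= numbers[bit]               # element was just removed
        yield mask, total

pairs = list(subset_sums_gray([3, 5, 11]))
```

Each step costs one addition or subtraction, so the whole list of sums is produced without re-adding up to N/2 numbers per subset.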
If you modify the Horowitz-Sahni algorithm in the right way, then it's hardly slower than original Horowitz-Sahni. Recall that Horowitz-Sahni works with two lists of subset sums: Sums of subsets in the left half of the original list, and sums of subsets in the right half. Call these two lists of sums L and R. To obtain subsets that sum to some fixed value A, you can sort R, and then look up a number in R that matches each number in L using a binary search. However, the algorithm is asymmetric only to save a constant factor in space and time. It's a good idea for this problem to sort both L and R.
In my code below I also reverse L. Then you can keep two pointers into R, updated for each entry in L: A pointer to the last entry in R that's too low, and a pointer to the first entry in R that's too high. When you advance to the next entry in L, each pointer might either move forward or stay put, but they won't have to move backwards. Thus, the second stage of the Horowitz-Sahni algorithm only takes linear time in the data generated in the first stage, plus linear time in the length of the output. Up to a constant factor, you can't do better than that (once you have committed to this meet-in-the-middle algorithm).
Here is Python code with example input:
# Input
terms = [29371, 108810, 124019, 267363, 298330, 368607,
         438140, 453243, 515250, 575143, 695146, 840979, 868052, 999760]
(A, B) = (500000, 600000)

# Subset iterator stolen from Sage
def subsets(X):
    yield []
    pairs = []
    for x in X:
        pairs.append((2**len(pairs), x))
        for w in range(2**(len(pairs)-1), 2**(len(pairs))):
            yield [x for m, x in pairs if m & w]

# Modified Horowitz-Sahni with toolow and toohigh indices
half = len(terms) // 2
L = sorted([(sum(S), S) for S in subsets(terms[:half])])
R = sorted([(sum(S), S) for S in subsets(terms[half:])])
(toolow, toohigh) = (-1, 0)
for (Lsum, S) in reversed(L):
    # Check the index bounds first so R is never indexed out of range.
    while toolow < len(R)-1 and R[toolow+1][0] < A-Lsum: toolow += 1
    while toohigh < len(R) and R[toohigh][0] <= B-Lsum: toohigh += 1
    for n in range(toolow+1, toohigh):
        print('+'.join(map(str, S+R[n][1])), '=', sum(S+R[n][1]))
"Moron" (I think he should change his user name) raises the reasonable issue of optimizing the algorithm a little further by skipping one of the sorts. Actually, because each list L and R is a list of sizes of subsets, you can do a combined generate and sort of each one in linear time! (That is, linear in the lengths of the lists.) L is the union of two lists of sums, those that include the first term, term[0], and those that don't. So actually you should just make one of these halves in sorted form, add a constant, and then do a merge of the two sorted lists. If you apply this idea recursively, you save a logarithmic factor in the time to make a sorted L, i.e., a factor of N in the original variable of the problem. This gives a good reason to sort both lists as you generate them. If you only sort one list, you have some binary searches that could reintroduce that factor of N; at best you have to optimize them somehow.
At first glance, a factor of O(N) could still be there for a different reason: If you want not just the subset sum, but the subset that makes the sum, then it looks like O(N) time and space to store each subset in L and in R. However, there is a data-sharing trick that also gets rid of that factor of O(N). The first step of the trick is to store each subset of the left or right half as a linked list of bits (1 if a term is included, 0 if it is not included). Then, when the list L is doubled in size as in the previous paragraph, the two linked lists for a subset and its partner can be shared, except at the head:
     0
     |
     v
1 -> 1 -> 0 -> ...
Actually, this linked list trick is an artifact of the cost model and never truly helpful. Because, in order to have pointers in a RAM architecture with O(1) cost, you have to define data words with O(log(memory)) bits. But if you have data words of this size, you might as well store each word as a single bit vector rather than with this pointer structure. I.e., if you need less than a gigaword of memory, then you can store each subset in a 32-bit word. If you need more than a gigaword, then you have a 64-bit architecture or an emulation of it (or maybe 48 bits), and you can still store each subset in one word. If you patch the RAM cost model to take account of word size, then this factor of N was never really there anyway.
So, interestingly, the time complexity for the original Horowitz-Sahni algorithm isn't O(N*2^(N/2)), it's O(2^(N/2)). Likewise the time complexity for this problem is O(K+2^(N/2)), where K is the length of the output.
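The combined generate-and-sort step described in this answer can be sketched as follows. This is an iterative version of the recursive idea (the list doubles once per term), and it uses `heapq.merge` from the standard library for the merge of two sorted lists; the function name is hypothetical.

```python
from heapq import merge

def sorted_subset_sums(terms):
    """All 2^len(terms) subset sums in nondecreasing order, built by repeated merging."""
    sums = [0]
    for t in terms:
        shifted = [s + t for s in sums]       # adding a constant keeps the order
        sums = list(merge(sums, shifted))     # merge two already-sorted lists
    return sums

half_sums = sorted_subset_sums([3, 5, 11])
```

The list sizes are 1, 2, 4, ..., so the total merge work is a geometric series dominated by the final list length, with no separate sorting pass.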
