Quickly checking if set is superset of stored sets - algorithm

The problem
I am given N arrays of C booleans. I want to organize these into a data structure that allows me to do the following operation as fast as possible: given a new array, return true if this array is a "superset" of any of the stored arrays. By "superset" I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) in a data structure so you can quickly look up whether a given set is a superset of any of the stored sets.
Building the data structure can take as long as necessary, but the lookup should be as efficient as possible, and the data structure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
I hope that in between this O(NC) and O(C) solution, there's maybe an O(log(N)C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not random, they're sparse; and 2. it's only intuition, I can't prove it.
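A minimal sketch of that lookup, assuming each stored array is inserted as its full C-bit path and a node is marked terminal where an array ends (the class and function names here are mine, purely illustrative):

class TrieNode:
    def __init__(self):
        self.children = [None, None]  # children[b] follows stored bit b
        self.terminal = False         # True if a stored array ends here

def insert(root, bits):
    node = root
    for b in bits:
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.terminal = True

def superset_of_any(node, query, i=0):
    # True if some stored array below `node` has all its 1-bits covered by query[i:].
    if node is None:
        return False
    if node.terminal:
        return True
    if i == len(query):
        return False
    if query[i] == 0:
        # Stored arrays with a 1 here cannot be covered: follow only the 0-branch.
        return superset_of_any(node.children[0], query, i + 1)
    # query[i] == 1: stored arrays may have either bit here, so visit both subtrees.
    return (superset_of_any(node.children[0], query, i + 1)
            or superset_of_any(node.children[1], query, i + 1))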
I will try out both this new idea and the BDD method, and see which of the two works out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

Just to add some background information to the prefix trie solution, recently I found the following paper:
I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure, a container that provides efficient storage and querying of sets of sets using a trie, supporting operations like finding all the supersets/subsets of a given set within a collection of sets.
For any Python users interested in an actual implementation, I came up with a Python 3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on GitHub.

I think the prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, both are included. So the idea is to OR-pack arrays by pairs, and to reiterate until there is only one "root" array (this takes only twice as much space). It allows you to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply to each array a hash function that preserves ordering.
I.e.: B ⊂ A ⇒ h(B) ≺ h(A)
ORing bits together is such a function, but you can also count the 1-bits in adequate partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
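A sketch of the OR-packing idea, representing each boolean array as a Python int bitmask (all names here are mine, and as described above the tree only accelerates the 'Yes' answer):

def is_subset(b, a):
    # b ⊂ a for bitmasks: every 1-bit of b must also be set in a.
    return b & ~a == 0

def build_or_tree(masks):
    # Level 0 holds the stored masks; each higher level ORs adjacent pairs.
    levels = [list(masks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] | prev[i + 1] if i + 1 < len(prev) else prev[i]
                       for i in range(0, len(prev), 2)])
    return levels

def any_stored_subset(levels, a):
    # Descend from the root: if an ORed node is a subset of `a`, every stored
    # mask beneath it is too, so we can answer 'Yes' right away.
    stack = [(len(levels) - 1, 0)]
    while stack:
        level, idx = stack.pop()
        if is_subset(levels[level][idx], a):
            return True
        if level > 0:
            # The union is not covered by `a`, but an individual child
            # (which has fewer bits set) still might be.
            for child in (2 * idx, 2 * idx + 1):
                if child < len(levels[level - 1]):
                    stack.append((level - 1, child))
    return False

The popcount-per-partition hash would slot in as an extra early-'No' test on individual masks.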

You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
From there I would use some kind of ID3 or C4.5 algorithm.
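The reduction step, for instance, can be sketched in a few lines of Python (my code, not the answer's; build time is quadratic in N, which the question says is acceptable):

def minimal_sets(sets):
    # Keep only sets that contain no other stored set; processing by
    # increasing size guarantees any potential subset was already kept.
    kept = []
    for s in sorted(map(frozenset, sets), key=len):
        if not any(k <= s for k in kept):
            kept.append(s)
    return kept

print(minimal_sets([{1, 2, 3}, {1, 2}, {4}]))  # keeps {4} and {1, 2} only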

Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for Python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.
from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    # setarr is the query set, encoded as a sorted string of chr(element) characters.
    if trieState is None:
        trieState = BaseState(trie)
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            # A stored set may end here, or continue with later elements.
            if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
                return True
            # Undo the walk and try skipping this element instead.
            trieState.copy_to(trieState2)
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
    trie["".join(chr(i) for i in subset)] = 0  # the assigned value does not matter
Note that the trie implementation above works only with elements larger than (and not equal to) 0. Otherwise, the integer-to-character mapping does not work properly. This problem can be solved with an index shift.
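A query could then look like this (assuming the stored keys were also built from elements in sorted order, so that subsets appear in the same relative order):

query = {2, 5, 7, 30}
key = "".join(chr(i) for i in sorted(query))
print(existsSubset(trie, key))  # True if some stored set is a subset of the query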
A cython implementation that also covers the conversion of elements can be found here.

Related

Simple ordering for a linked list

I want to create a doubly linked list with an order sequence (an integer attribute) such that sorting by the order sequence could create an array that would effectively be equivalent to the linked list.
given: a <-> b <-> c
a.index > b.index
b.index > c.index
This index would need to handle an arbitrary number of inserts efficiently.
Is there a known algorithm for accomplishing this?
The problem is when the list gets large and the index sequence has become packed. In that situation the list has to be scanned to put slack back in.
I'm just not sure how this should be accomplished. Ideally there would be some sort of automatic balancing so that this borrowing is both fast and rare.
The naive solution of changing all the left or right indices by 1 to make room for the insert is O(n).
I'd prefer to use integers, as I know numbers tend to get less reliable in floating point as they approach zero in most implementations.
This is one of my favorite problems. In the literature, it's called "online list labeling", or just "list labeling". There's a bit on it in wikipedia here: https://en.wikipedia.org/wiki/Order-maintenance_problem#List-labeling
Probably the simplest algorithm that will be practical for your purposes is the first one in here: https://www.cs.cmu.edu/~sleator/papers/maintaining-order.pdf.
It handles insertions in amortized O(log N) time, and to manage N items, you have to use integers that are big enough to hold N^2. 64-bit integers are sufficient in almost all practical cases.
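A minimal sketch of the tag-assignment idea (not the paper's exact algorithm): keep an integer label per node, insert at the midpoint of the neighbors' labels, and relabel when a gap closes. Here a full relabel stands in for the paper's local renumbering:

GAP = 1 << 16  # initial spacing between labels

class Node:
    def __init__(self, value, label):
        self.value = value
        self.label = label

def insert_after(items, i, value):
    # `items` is a plain list standing in for the linked list, for brevity.
    lo = items[i].label
    hi = items[i + 1].label if i + 1 < len(items) else lo + 2 * GAP
    if hi - lo < 2:
        # No room between neighbors: relabel everything evenly. (The paper
        # relabels only a geometrically growing neighborhood, which is what
        # yields the amortized O(log N) bound.)
        for j, node in enumerate(items):
            node.label = (j + 1) * GAP
        lo = items[i].label
        hi = items[i + 1].label if i + 1 < len(items) else lo + 2 * GAP
    items.insert(i + 1, Node(value, (lo + hi) // 2))

items = [Node("a", GAP)]
insert_after(items, 0, "b")
insert_after(items, 0, "c")  # "c" lands between "a" and "b"
print([(n.value, n.label) for n in items])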
What I wound up going with was a roll-my-own solution, because it looked like the algorithm wanted to have the entire list in memory before it would insert the next node, and that is no good.
My idea is to borrow some of the ideas from the algorithm. What I did was make IDs ints and sort orders longs. The algorithm is then lazy, stuffing entries anywhere they'll fit. Once it runs out of space in some little clump somewhere, it begins a scan up and down from the clump and tries to establish an even spacing, such that if there are n items scanned they need to share n^2 padding between them.
In theory this means that over time the list will be perfectly padded, and given that my IDs are ints and my sort orders are longs, there will never be a scenario where you cannot achieve n^2 padding. I can't speak to the upper bound on the number of operations, but my gut tells me that by doing polynomial work at 1/polynomial frequency, I'll be doing just fine.

data structure for finding the substring from large number of strings

I am given millions of strings, and I have to find the strings that contain a given substring.
e.g. given the strings "xyzoverflowasxs" and "werstackweq", a query for the substring "stack" should return "werstackweq". What kind of data structure can we use to solve this problem?
I think we can use a suffix tree for this, but I wanted some more suggestions.
I think the way to go is with a dictionary holding the actual words, and another data structure pointing to entries within this dictionary. One way to go would be with suffix trees and their variants, as mentioned in the question and the comments. I think the following is a far simpler (heuristic) alternative.
Say you choose some integer k. For each of your strings, computing the Rabin fingerprints of all its length-k substrings is efficient and easy (any language has an implementation).
So, for a given k, you could hold two data structures:
A dictionary of the words, say a hash table based on collision lists
A dictionary mapping each fingerprint to an array of the linked-list node pointers in the first data structure.
Given a word of length k or greater, you would choose a length-k subword, calculate its Rabin fingerprint, find the words which contain this fingerprint, and check whether they indeed contain this subword.
The question is which k to use, and whether to use multiple values of k. I would try this experimentally (starting with a few small values simultaneously, say k = 1, 2, and 3, and also a couple of larger ones). The performance of this heuristic depends on the distribution of your dictionary and queries anyway.
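A minimal sketch of the two-structure scheme, using the k-gram itself as the dictionary key in place of a true Rabin fingerprint (a rolling hash only matters for speed, not correctness):

from collections import defaultdict

K = 3

def build_index(words):
    # Map each k-gram to the set of words containing it.
    index = defaultdict(set)
    for w in words:
        for i in range(len(w) - K + 1):
            index[w[i:i + K]].add(w)
    return index

def find_containing(index, sub):
    # Use one k-gram of the query to narrow down candidates, then verify.
    if len(sub) < K:
        raise ValueError("query shorter than k; fall back to a scan")
    candidates = index.get(sub[:K], ())
    return [w for w in candidates if sub in w]

index = build_index(["xyzoverflowasxs", "werstackweq"])
print(find_containing(index, "stack"))  # ['werstackweq']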

Data structure for non overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. It is described in detail in "Diets for Fat Sets", Martin Erwig, Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. It is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation on Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which, rather than keeping only an interval at each node, keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published, as it no longer seems to exist anywhere on the Internet, at least not under that name.
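A minimal diet sketch in Python (unbalanced, membership and insert only; Erwig's version also handles deletion, balancing, and the merging of intervals across subtrees that this sketch glosses over):

class Diet:
    # Node of a Discrete Interval Encoding Tree: a BST of disjoint,
    # non-touching inclusive intervals [lo, hi].
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi = lo, hi
        self.left, self.right = left, right

def member(t, x):
    while t is not None:
        if x < t.lo:
            t = t.left
        elif x > t.hi:
            t = t.right
        else:
            return True
    return False

def insert(t, x):
    if t is None:
        return Diet(x, x)
    if x < t.lo - 1:
        t.left = insert(t.left, x)
    elif x > t.hi + 1:
        t.right = insert(t.right, x)
    elif x == t.lo - 1:
        t.lo = x  # extend the interval downward by one
    elif x == t.hi + 1:
        t.hi = x  # extend the interval upward, symmetric case
    return t      # otherwise x was already covered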
You are probably looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use binary search trees for this, where each node has two data fields: min_val and max_val.
During insertion, you just need to call a merging operation to check whether the left child, parent, and right child form a contiguous sequence, so they can be merged into a single node. This will take O(log n) time.
Other operations like deletion and look-up will take O(log n) time as usual, but special measures need to be taken during deletion.

Optimized Algorithm: Fastest Way to Derive Sets

I'm writing a program for a competition and I need to be faster than all the other competitors. For this I need a little algorithm help; ideally I'd be using the fastest algorithm.
For this problem I am given 2 things. The first is a list of tuples, each of which contains exactly two elements (strings), each of which represents an item. The second is an integer, which indicates how many unique items there are in total. For example:
# of items = 3
[("ball","chair"),("ball","box"),("box","chair"),("chair","box")]
(The same tuples can be repeated; they are not necessarily unique.) My program is supposed to figure out the maximum number of tuples that can "agree" when the items are sorted into two groups. That is: if all the items are split into two groups, group 1 and group 2, what is the maximum number of tuples that can have their first item in group 1 and their second item in group 2?
For example, the answer to my earlier example would be 2, with "ball" in group 1 and "chair" and "box" in group 2, satisfying the first two tuples. I do not necessarily need to know which items go in which group; I just need to know the maximum number of satisfied tuples.
At the moment I'm trying a recursive approach, but it's running in O(n^2), far too inefficient in my opinion. Does anyone have a method that could produce a faster algorithm?
Thanks!
Speed up approaches for your task:
1. Use integers
Convert the strings to integers (store the strings in an array and use the position in the tuples).
String[] words = {"ball", "chair", "box"};
In tuples, ball now has number 0 (pos 0 in the array), chair 1, box 2.
Comparing ints is faster than comparing Strings.
2. Avoid recursion
Recursion is slow, due to the recursion overhead.
For example, look at the binary search algorithm in a recursive implementation, then look at how Java implements binarySearch() (with a while loop and iteration).
Recursion is helpful when problems are so complex that a non-recursive implementation is too complex for a human brain.
An iteration is faster, but not when you mimic recursive calls by implementing your own stack.
However, you can start with a recursive implementation; once it works and it is a suitable algorithm, then try to convert it to a non-recursive implementation.
3. if possible avoid objects
If you want the fastest, then it becomes ugly!
A tuple array can either be stored as an array of class Point(x,y) or, probably faster, as an array of int.
Example:
(1,2), (2,3), (3,4) can be stored as the array (1,2,2,3,3,4).
This needs much less memory because an object needs at least 12 bytes (in Java).
Less memory becomes faster when the arrays are really big: your structure will hopefully fit in the processor cache, while the object array does not.
4. Programming language
In C it will be faster than in Java.
Maximum cut is a special case of your problem, so I doubt you have a quadratic algorithm for it. (Maximum cut is NP-complete and it corresponds to the case where every tuple (A,B) also appears in reverse as (B,A) the same number of times.)
The best strategy for you to try here is "branch and bound." It's a variant of the straightforward recursive search you've probably already coded up. You keep track of the value of the best solution you've found so far. In each recursive call, you check whether it's even possible to beat the best known solution with the choices you've fixed so far.
One thing that may help (or may hurt) is to "probe": for each as-yet-unfixed item, see if putting that item on one of the two sides leads only to suboptimal solutions; if so, you know that item needs to be on the other side.
Another useful trick is to recurse on items that appear frequently both as the first element and as the second element of your tuples.
You should pay particular attention to the "bound" step --- finding an upper bound on the best possible solution given the choices you've fixed.
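A rough Python skeleton of that branch and bound (the bound here optimistically assumes every still-undecided tuple can be satisfied; a real solver would use something tighter, plus the probing and ordering tricks above):

def max_agree(tuples, items):
    items = list(items)
    best = 0

    def satisfied_and_open(assign):
        sat = open_ = 0
        for a, b in tuples:
            ga, gb = assign.get(a), assign.get(b)
            if ga == 1 and gb == 2:
                sat += 1
            elif ga != 2 and gb != 1:
                open_ += 1  # could still end with a in group 1, b in group 2
        return sat, open_

    def rec(i, assign):
        nonlocal best
        sat, open_ = satisfied_and_open(assign)
        best = max(best, sat)
        if i == len(items) or sat + open_ <= best:
            return  # bound: even the optimistic count can't beat `best`
        for group in (1, 2):
            assign[items[i]] = group
            rec(i + 1, assign)
        del assign[items[i]]

    rec(0, {})
    return best

tuples = [("ball", "chair"), ("ball", "box"), ("box", "chair"), ("chair", "box")]
print(max_agree(tuples, {"ball", "chair", "box"}))  # 2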

Algorithm for checking if set A is a subset of set B in faster than linear time

Is there an algorithm (preferably constant time) to check if set A is a subset of set B?
Creating the data structures to facilitate this problem does not count against the runtime.
Well, you're going to have to look at each element of A, so it must be at least linear time in the size of A.
An O(A+B) algorithm is easy using hashtables (store elements of B in a hashtable, then look up each element of A). I don't think you can do any better unless you know some advance structure for B. For instance, if B is stored in sorted order, you can do O(A log B) using binary search.
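In Python, for instance, the hash-table version is just a few lines (the built-in set type already does this under the hood):

def is_subset(A, B):
    # O(A + B): build a hash set from B, then probe each element of A.
    lookup = set(B)
    return all(a in lookup for a in A)

print(is_subset([1, 2], [1, 2, 3]))  # True
print(is_subset([1, 4], [1, 2, 3]))  # False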
You might go for a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). However, there might be false positives, which can be addressed by the method mentioned by Keith above (but note that the worst-case complexity of hashing is NOT O(n), though you can do O(n log n)):
See if A is a subset of B according to Bloom filter
If yes, then do a thorough check
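A toy version of that two-step check (the filter size and hash count are arbitrary here; a real filter would size them from the expected number of elements and the target false-positive rate):

import hashlib

M = 1024   # bits in the filter
HASHES = 3

def _positions(x):
    digest = hashlib.sha256(repr(x).encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % M for i in range(HASHES)]

def make_filter(items):
    bits = 0
    for x in items:
        for p in _positions(x):
            bits |= 1 << p
    return bits

def maybe_subset(filter_a, filter_b):
    # Step 1: if any of A's bits fall outside B's, A is definitely not a subset.
    return filter_a & ~filter_b == 0

A, B = {1, 2}, {1, 2, 3}
if maybe_subset(make_filter(A), make_filter(B)):
    print(A <= B)  # Step 2: the thorough check rules out false positives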
If you have a list of the least common letters and letter pairs in your string set, you can store your sets sorted by their least common letters and letter pairs, and maximize your chances of tossing out negative matches as quickly as possible.
It's not clear to me how well this would combine with a Bloom filter; probably a hash table will do, since there aren't very many digrams and letters.
If you had some information about the maximum size of subsets, or even a common size, you could similarly preprocess the data by putting all of the subsets of a given size into a Bloom filter, as mentioned.
You could also do a combination of both of these.
