Let's say I have a set of n elements, divided into a number of sets. Each element is in exactly one set.
I want to be able to do the following queries as quickly as possible:
1. What set s is element e in?
2. What elements {e1, e2, ..., ei} are in set s?
What data structure should I use? The best I could think of is a map pointing to a bunch of sets, but I was wondering whether there's a better approach.
If it helps, you can assume my set is the integers {0,1,...,n-1}
If your set is the integers {0,1,...,n-1} without gaps, then it would be more efficient to use an array of sets; however, if the integers are sparse, a map of sets would require less space. Either way, operation (1) runs in constant time (worst case for the array, average case for a hash map).
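For illustration, here is a minimal sketch of that layout in Python, assuming the elements are {0,...,n-1}; the class and field names are my own:

class Partition:
    def __init__(self, set_ids):
        # set_of[e] answers query (1) in O(1): the id of the set containing e
        self.set_of = list(set_ids)
        # members[s] answers query (2): all elements of set s
        self.members = {}
        for e, s in enumerate(set_ids):
            self.members.setdefault(s, set()).add(e)

p = Partition(["a", "b", "a", "c", "b"])  # element i belongs to set set_ids[i]
print(p.set_of[3])     # 'c'
print(p.members["a"])  # {0, 2}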
Let's say I have two very large sets of strings. I'd like to make a very compressed representation of these sets so that, by comparing the two, it can be determined whether they have any members in common. I'd like the representations to be constant space, or perhaps O(log N) in the size of the set.
I don't care what the members are. Or even the count. Just a true/false for whether there's an intersection or not.
My first thought was to have a bit array and set bits based on the set contents, like a Bloom filter. But how would you check for intersection between two bit arrays? I don't think this would work, because the arrays would just be random bits.
Perhaps something like a radix tree?
I suspect an algorithm/data structure for this already exists. There are compact probabilistic data structures for set membership, so I don't think this is too far a stretch.
There is, in general, no structure that takes less than O(n) space and can answer set-intersection queries. It's easy to see why: if you had one, you could recover each entry of the set by checking for intersections with singleton sets, one for each possible element. By the pigeonhole principle, this means the representation must consume expected O(n) space.
The Theta data sketch looks intriguing. It allows set operations like intersection and union on hashed representations of the sets.
https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html
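To illustrate the underlying idea (this is not the actual Theta sketch algorithm, just a minimal bottom-k "KMV" sketch of my own in Python): each set is summarized by the k smallest hashes of its members, so the summary has constant size, and a hash value present in both summaries signals a common member.

import hashlib

def kmv_sketch(items, k=64):
    # Summarize a set by the k smallest 64-bit hashes of its members.
    hashes = sorted(int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
                    for s in items)
    return set(hashes[:k])

def may_intersect(sketch_a, sketch_b):
    # A hash in both sketches came (up to hash collisions) from a common
    # member. An intersection whose elements fall outside both bottom-k
    # samples is missed, so False can be a false negative for large sets.
    return bool(sketch_a & sketch_b)

a = kmv_sketch({"apple", "pear", "plum"})
b = kmv_sketch({"plum", "cherry"})
print(may_intersect(a, b))  # True: "plum" hashes into both sketches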
In this problem, I have a set of elements indexed from 1 to n. Each element actually corresponds to a graph node, and I am trying to compute random one-to-one matchings between the nodes. For the sake of simplicity, I neglect further details of the actual problem. I need a fast algorithm to randomly consume these elements (nodes), and to do this multiple times in order to compute different matchings. The purpose is to create randomized inputs for another algorithm; each matching computed at the end will be another input to that algorithm.
The most basic algorithm I can think of is to copy the elements into an array, generate random integers, and use them as array indices for swap operations. This way each random copy can be created in O(n), but in practice it uses a lot of copy and swap operations. Performance is very important, and I am looking for faster ways (algorithms and data structures) to achieve this goal. It just needs to satisfy two conditions:
It shall be able to consume a random element.
It shall be able to consume an element on the given index.
I tried to write this as clearly as possible. If you have any questions, feel free to ask and I will be happy to clarify. Thanks in advance.
Note: A matching is an operation where you pair the vertices of a graph if there exists an edge between them.
Shuffle an index array (for example, with a Fisher-Yates shuffle):
ia = [3,1,4,2]
Then walk through the index array and "consume" the set element at the current index:
for x in ia:
    consume(Set[x])
So for this example you will get the order Set[3], Set[1], Set[4], Set[2].
There are no element swaps; only the array of integers changes.
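A runnable version of this idea in Python; random.shuffle performs a Fisher-Yates shuffle in place, and Set and consume below are placeholders for the question's actual elements and operation:

import random

def consumption_order(n):
    # Shuffle the indices 1..n; the elements themselves never move.
    ia = list(range(1, n + 1))
    random.shuffle(ia)  # in-place Fisher-Yates, O(n)
    return ia

Set = {1: "a", 2: "b", 3: "c", 4: "d"}  # placeholder elements
consume = print                          # placeholder operation
for x in consumption_order(4):
    consume(Set[x])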
I know I can do a ZINTERSTORE with a normal set as an argument (see Redis: How to intersect a "normal" set with a sorted set?). Is that going to affect performance? Is it going to be faster or slower than working only with zsets?
According to the sorted-set source code, ZINTERSTORE treats a plain set like a sorted set whose members all have score 1; the relevant function is zunionInterGenericCommand.
Intersecting sets will take more or less time depending on the sorting algorithm used in this step, for example:
/* sort sets from the smallest to largest, this will improve our
* algorithm's performance */
qsort(src,setnum,sizeof(zsetopsrc),zuiCompareByCardinality);
There are also differences in how sets and sorted sets are stored, which affect how they are read. Redis decides how to encode a (sorted) set depending on how many elements it contains, so iterating through them requires different work.
However for any practical purposes, I'd say that your best bet is to use ZINTERSTORE, and I'll explain why: I hardly see how anything you might write in your source code will beat Redis performance when doing the intersection you want to do.
If your concern is performance, you're getting too deep into the details. Your focus should be on the big-O of the operation instead, shown in the command documentation:
Time complexity: O(N*K)+O(M*log(M)) worst case with N being the smallest input sorted set, K being the number of input sorted sets and M being the number of elements in the resulting sorted set.
What this tells you is:
1. The size of the smallest set and the number of sets you plan to intersect determine the first part. So if you know that you'll always intersect two sets, one small and one huge, you can treat the first part as constant. A good example would be intersecting a sorted set of all available products in a store (where the score is how many are in stock) with a plain set of the products in a user's cart.
In this case you'll have only two sets, and you'll know one of them will be very small.
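For instance, here is a sketch of that scenario using the redis-py client (the key names and stock values are made up):

import redis

r = redis.Redis()
r.zadd("stock", {"apple": 5, "pear": 3, "plum": 7})  # sorted set: score = units in stock
r.sadd("cart", "apple", "pear")                      # plain set: the user's cart

# Plain-set members are treated as having score 1; a weight of 0 zeroes
# them out, so the resulting scores are just the stock counts.
r.zinterstore("cart_stock", {"stock": 1, "cart": 0})
print(r.zrange("cart_stock", 0, -1, withscores=True))
# [(b'pear', 3.0), (b'apple', 5.0)]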
2. The size M of the resulting sorted set can cause a big performance issue. But there's a subtlety here: small sorted sets are stored as a ziplist, while sorted sets are encoded as a skip list once they grow too big, which can mean an important hit for big results.
However, for an intersection, you know that the resulting set cannot be bigger than the smallest set you provide. For a union, the resulting set will contain all elements of all sets, so attention needs to be on the size of the biggest sets rather than the smallest.
In summary, the answer to the question of performance with (sorted) sets is: it depends on the sizes of the sets much more than on the actual datatype. Take into consideration that the resulting data structure will be a sorted set even if all the inputs are plain sets; therefore a big result will be stored (less efficiently) as a skip list.
Knowing beforehand how many sets you plan to intersect (2? 3? depending on user input?) and the size of the smallest set (10? hundreds? thousands?) will give you a much better idea than the internal datatypes will. The algorithm for intersecting is the same for both types.
Redis by default assumes every element of the plain set has the same default score, so it treats the plain set like a sorted set in which all elements have an equal default score. I believe performance should be the same as intersecting two sorted sets.
Is there an algorithm (preferably constant time) to check if set A is a subset of set B?
Creating the data structures to facilitate this problem does not count against the runtime.
Well, you're going to have to look at each element of A, so it must take at least linear time in the size of A.
An O(|A|+|B|) algorithm is easy using hash tables: store the elements of B in a hash table, then look up each element of A. I don't think you can do any better unless you know some structure on B in advance. For instance, if B is stored in sorted order, you can do O(|A| log |B|) using binary search.
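A minimal sketch of that hash-table approach in Python (a Python set is a hash table under the hood):

def is_subset(a, b):
    # O(|B|) to build the table, then O(|A|) expected time for the lookups
    b_table = set(b)
    return all(x in b_table for x in a)

print(is_subset([1, 2], [3, 2, 1]))  # True
print(is_subset([1, 4], [3, 2, 1]))  # False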
You might go for a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). However, there might be false positives, which can be addressed by the method mentioned by Keith above (but note that the worst-case complexity of hashing is NOT O(n); you can, however, do O(n log n)):
1. See if A is a subset of B according to the Bloom filter.
2. If yes, then do a thorough check.
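As an illustration, here is a tiny version of that prefilter in Python; the filter size M, the hash count K, and the hashing scheme are arbitrary choices of mine:

import hashlib

M, K = 1024, 3  # filter size in bits and number of hash positions

def positions(item):
    # Derive K bit positions for an item from one SHA-256 digest.
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4*i:4*i + 4], "big") % M for i in range(K)]

def bloom(items):
    bits = [False] * M
    for item in items:
        for p in positions(item):
            bits[p] = True
    return bits

def maybe_subset(a, b_bits):
    # False is definitive; True may be a false positive, so follow it
    # with the thorough O(|A|+|B|) check.
    return all(all(b_bits[p] for p in positions(x)) for x in a)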
If you have a list of the least common letters and letter pairs in your string set, you can store your sets sorted by their least common letters and letter pairs, and maximize your chances of tossing out negative matches as quickly as possible.
It's not clear to me how well this would combine with a Bloom filter; probably a hash table will do, since there aren't very many digrams and letters.
If you had some information about the maximum size of the subsets, or even a common size, you could similarly preprocess the data by putting all of the subsets of a given size into a Bloom filter, as mentioned.
You could also do a combination of both of these.
The problem
I am given N arrays of C booleans. I want to organize these into a data structure that allows me to do the following operation as fast as possible: given a new array, return true if it is a "superset" of any of the stored arrays. By superset I mean: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each over C possible elements) in a data structure so you can quickly look up whether a given set is a superset of any of the stored sets.
Building the data structure can take as long as necessary, but the lookup should be as efficient as possible, and the data structure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(N*C) lookup: just iterate over all the arrays. This is simply too slow, though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD (binary decision diagram). While this has great lookup speed, it has an exponential number of nodes. With N and C this large, it takes too much space.
I hope that in between this O(N*C) solution and the O(C) solution, there's maybe an O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)*C) lookup: store the arrays in a prefix trie. When looking up an array A, descend only into the 0-subtree when A[i]=0, but visit both subtrees when A[i]=1.
My intuition tells me that this should make the (average) lookup complexity O(sqrt(N)*C), if you assume that the stored arrays are random. But: 1. they're not random, they're sparse; and 2. it's only intuition, I can't prove it.
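For what it's worth, here is a small sketch of that lookup in Python (the Node class and function names are my own, and all stored arrays are assumed to have length C):

class Node:
    __slots__ = ("zero", "one", "terminal")
    def __init__(self):
        self.zero = self.one = None
        self.terminal = False  # a stored array ends at this node

def insert(root, arr):
    node = root
    for bit in arr:
        attr = "one" if bit else "zero"
        if getattr(node, attr) is None:
            setattr(node, attr, Node())
        node = getattr(node, attr)
    node.terminal = True

def superset_of_any(node, a, i=0):
    if node is None:
        return False
    if node.terminal:
        return True  # a whole stored array was matched, so a covers it
    # A stored 0 at position i is compatible with any query bit.
    if superset_of_any(node.zero, a, i + 1):
        return True
    # A stored 1 at position i requires the query to have a 1 there too.
    return bool(a[i]) and superset_of_any(node.one, a, i + 1)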
I will try out both this new idea and the BDD method, and see which of the two works out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
Just to add some background information to the prefix trie solution, recently I found the following paper:
I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure (container), which provides efficient storage and querying of sets of sets using a trie, supporting operations like finding all supersets/subsets of a given set within a collection of sets.
For any Python users interested in an actual implementation, I came up with a Python 3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on GitHub.
I think prefix trie is a great start.
Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊆ A, both are included. So the idea is to OR-pack arrays in pairs, and to iterate until there is only one "root" array (it takes only twice as much space in total). This allows you to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know which array is actually contained.
Independently, you can apply to each array a hash function that preserves ordering,
i.e.: B ⊆ A ⇒ h(B) ≼ h(A).
ORing bits together is such a function, but you can also count the 1-bits in suitable partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
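A possible sketch of that second idea in Python (the number of partitions is an arbitrary choice of mine):

def signature(arr, parts=8):
    # Count the 1-bits in each of `parts` equal slices of the array.
    step = (len(arr) + parts - 1) // parts
    return [sum(arr[j:j + step]) for j in range(0, len(arr), step)]

def cannot_contain(sig_b, sig_a):
    # If B ⊆ A, every per-slice count of B is <= the matching count of A;
    # any violated slice answers 'No' without comparing individual bits.
    return any(b > a for b, a in zip(sig_b, sig_a))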
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only the sets that are not supersets of any other stored set. The problem remains the same, because if some input set A is a superset of a set B you removed, then it is also a superset of at least one "minimal" subset C of B that was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
From there I would use some kind of ID3 or C4.5 algorithm.
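Here is a brief sketch of the reduction step described above (quadratic, but it only runs once while building the data structure):

def minimal_sets(sets):
    # Process by increasing size and drop any set that already has a kept subset.
    kept = []
    for s in sorted(sets, key=len):
        if not any(k <= s for k in kept):  # k <= s means k is a subset of s
            kept.append(s)
    return kept

print(minimal_sets([{1, 2, 3}, {1, 2}, {4}]))  # [{4}, {1, 2}]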
Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to implement a method for finding subsets by using an already existing, efficient trie implementation for Python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to the Unicode code-point range (about 1,100,000).
from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    # setarr: the query set as a sorted sequence of characters; returns True
    # if the trie stores a key that is a subset of setarr.
    if trieState is None:
        trieState = BaseState(trie)
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            # either a stored set ends here, or it may continue with later elements
            if trieState2.is_terminal() or existsSubset(trie, setarr[i+1:], trieState2):
                return True
        trieState.copy_to(trieState2)  # undo the walk before trying the next element
    return False
The trie can be used like a dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)
for subset in sets:
trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter
Note that the trie implementation above works only with elements larger than (and not equal to) 0; otherwise the integer-to-character mapping does not work properly. This problem can be solved with an index shift.
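For example, a query against the container built above (the element values are made up):

query = "".join(chr(i) for i in sorted({1, 2, 7}))  # does any stored set ⊆ {1, 2, 7}?
print(existsSubset(trie, query))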
A cython implementation that also covers the conversion of elements can be found here.