I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering:
What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains?
Can anyone beat O(Z) where Z is the actual size of intersection?
If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.
A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.
Of interest:
An Implementation of B-Y Intersect
Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.
Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.
If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.
For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.
An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.
If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.
The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):
Confluently Persistent Sets and Maps
there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)
Abstractly, a set is something that supports an operation, "is X a member?". You can define that operation on the intersection A n B in terms of it on A and B. An implementation would look something like:
interface Set { bool isMember(Object X); };
class Intersection {
Set a, b;
public Intersection(Set A, Set B) { this.a = A; this.b = B; }
public isMember(Object X) {
return a.isMember(X) and b.isMember(Y);
}
}
A and B could be implemented using an explicit set type, like a HashSet. The cost of that operation on each is quite cheap, let's approximate it with O(1); so the cost on the intersection is just 2 O(n). ;-)
Admittedly if you build a large hierarchy of intersections like this, checking for a member can be more expensive, up to O(n) for n sets in the hierarchy. A potential optimisation for this could be to check the depth of the hierarchy against a threshold, and materialise it into a HashSet if it exceeds it. This will reduce the member operation cost, and perhaps amortise the construction cost when many intersections are applied.
Related
Input: An array of elements and a partial order on a subset of those elements, seen as a constraint set.
Output: An array (or any ordered sequence) fulfilling the partial order.
The Problem: How can one achieve the reordering efficiently? The number of introduced inversions (or swaps) compared to the original input sequence should be as small as possible. Note, that the partial order can be defined for any amount of elements (some elements may are not part of it).
The context: It arises from a situation in 2-layer graph crossing reduction: After a crossing reduction phase, I want to reorder some of the nodes (thus, the partial order may contain only a small subset).
In general, I had the idea to weaken this a little bit and solve the problem only for the elements being part of the partial order (though I think, that this could lead to non-optimal results). Thus, if I have a sequence A B C D E and the partial order only contains A, B and E, then C and D will stay at the same place. It somehow reminds me of the Kemeny Score, but I couldn't yet turn that into an algorithm.
Just to be sure: I am not searching for a topological sort. This would probably introduce a lot more inversions than required.
Edit 1:
Changed wording (sequence to array).
The amount of additional space for solving the problem can be arbitrary (well, polynomially bounded) large. Of course, less is better :) So, something like O(ArrayLen*ArrayLen) at most would be fantastic.
Why the min amount of swaps or inversions: As this procedure is part of crossing reduction, the input array's ordering is (hopefully) close to an optimum, in terms of edge crossings with the second node layer. Then, every additional swap or inversion would, probably, introduce edge crossings again. But in the process of computing the output, the number of swaps or movements done is not really important (though, again, something linear or quadratical would be cool), as only the output-quality is important. Right now, I require the constraints to be in a total order and only inspect the nodes of that order, thus it becomes trivial to solve. But the partial order constraints would be more flexible.
I found a paper, which looks promising: "A Fast and Simple Heuristic for Constrained Two-Level Crossing Reduction" by Michael Foster.
Together with the comments below my question, it is answered. Thanks again, #j_random_hacker!
Is there an algorithm (preferably constant time) to check if set A is a subset of set B?
Creating the data structures to facilitate this problem does not count against the runtime.
Well, you're going to have to look at each element of A, so it must be at least linear time in the size of A.
An O(A+B) algorithm is easy using hashtables (store elements of B in a hashtable, then look up each element of A). I don't think you can do any better unless you know some advance structure for B. For instance, if B is stored in sorted order, you can do O(A log B) using binary search.
You might go for bloom filter (http://en.wikipedia.org/wiki/Bloom_filter ). However there might be false positives, which can be addressed by the method mentioned by Keith above (but note that the worst case complexity of hashing is NOT O(n), but you can do O(nlogn).
See if A is a subset of B according to Bloom filter
If yes, then do a thorough check
If you have a list of the least common letters and pairs of letters in your string set, you can store your sets sorted with their least common letters and letter pairs and maximize your chances of tossing out negative matches as quickly as possible.
It's not clear to me how well this would combine with a bloom filter, Probably a hash table will do since there aren't very many digrams and letters.
If you had some information about the maximum size of subsets or even a common size you could similarly preproccess data by putting all of the subsets of a given size into a bloom filter as mentioned.
You could also do a combination of both of these.
The problem
I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
Or, in terms of sets instead of arrays:
Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.
Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.
Some context
I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:
N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)
What I've come up with so far
For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.
EDIT: A new idea I've come up with
For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.
My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.
I will try out both this new idea and the BDD method, and see which of the 2 work out best.
But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.
Just to add some background information to the prefix trie solution, recently I found the following paper:
I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.
The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.
For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.
I think prefix trie is a great start.
Since yours arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊂ A, both are included. So the idea is to OR-pack arrays by pairs, and to reiterate until there is only one "root" array (it would take only twice as much space). It allows to answer 'Yes' to your question earlier, which is mainly useful if you don't need to know with array is actually contained.
Independently, you can apply for each array a hash function preserving ordering.
Ie : B ⊂ A ⇒ h(B) ≺ h(A)
ORing bits together is such a function, but you can also count each 1-bit in adequate partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array).
You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
From there I would use some kind of ID3 or C4.5 algorithm.
Building on the trie solution and the paper mentioned by #mmihaltz, it is also possible to implement a method to find subsets by using already existing efficient trie implementations for python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to about 110000.
from datrie import BaseTrie, BaseState
def existsSubset(trie, setarr, trieState=None):
if trieState is None:
trieState = BaseState(trie)
trieState2 = BaseState(trie)
trieState.copy_to(trieState2)
for i, elem in enumerate(setarr):
if trieState2.walk(elem):
if trieState2.is_terminal() or existsSubset(trie, setarr[i:], trieState2):
return True
trieState.copy_to(trieState2)
return False
The trie can be used like dictionary, but the range of possible elements has to be provided at the beginning:
alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)
for subset in sets:
trie["".join(chr(i) for i in subset)] = 0 # the assigned value does not matter
Note that the trie implementation above works only with keys larger than (and not equal to) 0. Otherwise, the integer to character mapping does not work properly. This problem can be solved with an index shift.
A cython implementation that also covers the conversion of elements can be found here.
Consider a class of type doubles
class path_cost {
double length;
double time;
};
If I want to lexicographically order a list of path_costs, I have a problem. Read on :)
If I use exact equal for the equality test like so
bool operator<(const path_cost& rhs) const {
if (length == rhs.length) return time < rhs.time;
return length < rhs.length;
}
the resulting order is likely to be wrong, because a small deviation (e.g. due to numerical inaccuracies in the calculation of the length) may cause the length test to fail, so that e.g.
{ 231.00000000000001, 40 } < { 231.00000000000002, 10 }
erroneously holds.
If I alternatively use a tolerance like so
bool operator<(const path_cost& rhs) const {
if (std::fabs(length-rhs.length)<1-e6)) return time < rhs.time;
return length < rhs.length;
}
then the sorting algorithm may horribly fail since the <-operator is no longer transitive (that is, if a < b and b < c then a < c may not hold)
Any ideas? Solutions? I have thought about partitioning the real line, so that numbers within each partition is considered equal, but that still leaves too many cases where the equality test fails but should not.
(UPDATE by James Curran, hopefully explaining the problem):
Given the numbers:
A = {231.0000001200, 10}
B = {231.0000000500, 40}
C = {231.0000000100, 60}
A.Length & B.Length differ by 7-e7, so we use time, and A < B.
B.Length & C.Length differ by 4-e7, so we use time, and B < C.
A.Length & C.Length differ by 1.1-e6, so we use length, and A > C.
(Update by Esben Mose Hansen)
This is not purely theoretical. The standard sort algorithms tends to crash or worse when given a non-transitive sort operator. And this is exactly what I been contending with (and boy was that fun to debug ;) )
Do you really want just a compare function?
Why don't you sort by length first, then group the pairs into what you think are the same length and then sort within each group by time?
Once sorted by length, you can apply whatever heuristic you need, to determine 'equality' of lengths, to do the grouping.
I don't think you are going to be able to do what you want. Essentially you seem to be saying that in certain cases you want to ignore the fact that a>b and pretend a=b. I'm pretty sure that you can construct a proof that says if a and b are equivalent when the difference is smaller than a certain value then a and b are equivalent for all values of a and b. Something along the lines of:
For a tolerance of C and two numbers A and B where without loss of generality A > B then there exist D(n) = B+n*(C/10) where 0<=n<=(10*(A-B))/(C) such that trivially D(n) is within the tolerance of D(n-1) and D(n+1) and therefore equivalent to them. Also D(0) is B and D((10*(A-B))/(C))=A so A and B can be said to be equivalent.
I think the only way you can solve that problem is using a partitioning method. Something like multiplying by 10^6 and then converting to an int shoudl partition pretty well but will mean that if you have 1.00001*10^-6 and 0.999999*10^-6 then they will come out in different partitions which may not be desired.
The problem then becomes looking at your data to work out how to best partition it which I can't help with since I don't know anything about your data. :)
P.S. Do the algorithms actually crash when given the algorithm or just when they encounter specific unsolvable cases?
I can think of two solutions.
You could carefully choose a sorting algorithm that does not fail when the comparisons are intransitive. For example, quicksort shouldn't fail, at least if you implement it yourself. (If you are worried about the worst case behavior of quicksort, you can first randomize the list, then sort it.)
Or you could extend your tolerance patch so that it becomes an equivalence relation and you restore transitivity. There are standard union-find algorithms to complete any relation to an equivalence relation. After applying union-find, you can replace the length in each equivalence class with a consensus value (such as the average, say) and then do the sort that you wanted to do. It feels a bit strange to doctor floating point numbers to prevent spurious reordering, but it should work.
Actually, Moron makes a good point. Instead of union and find, you can sort by length first, then link together neighbors that are within tolerance, then do a subsort within each group on the second key. That has the same outcome as my second suggestion, but it is a simpler implementation.
I'm not familiar with your application, but I'd be willing to bet that the differences in distance between points in your graph are many orders of magnitude larger than the rounding errors on floating point numbers. Therefore, if two entries differ by only the round-off error, they are essentially the same, and it makes no difference in which order they appear in your list. From a common-sense perspective, I see no reason to worry.
You will never get 100% precision with ordinary doubles. You say that you are afraid that using tolerances will affect the correctness of your program. Have you actually tested this? What level of precision does your program actually need?
In most common applications I find a tolerance of something like 1e-9 suffices. Of course it all depends on your application. You can estimate the level of accuracy you need and just set the tolerance to an acceptable value.
If even that fails, it means that double is simply inadequate for your purposes. This scenario is highly unlikely, but can arise if you need very high precision calculations. In that case you have to use an arbitrary precision package (e.g. BigDecimal in Java or something like GMP for C). Again, only choose this option when there is no other way.
Suppose I have a set of objects, S. There is an algorithm f that, given a set S builds certain data structure D on it: f(S) = D. If S is large and/or contains vastly different objects, D becomes large, to the point of being unusable (i.e. not fitting in allotted memory). To overcome this, I split S into several non-intersecting subsets: S = S1 + S2 + ... + Sn and build Di for each subset. Using n structures is less efficient than using one, but at least this way I can fit into memory constraints. Since size of f(S) grows faster than S itself, combined size of Di is much less than size of D.
However, it is still desirable to reduce n, i.e. the number of subsets; or reduce the combined size of Di. For this, I need to split S in such a way that each Si contains "similar" objects, because then f will produce a smaller output structure if input objects are "similar enough" to each other.
The problems is that while "similarity" of objects in S and size of f(S) do correlate, there is no way to compute the latter other than just evaluating f(S), and f is not quite fast.
Algorithm I have currently is to iteratively add each next object from S into one of Si, so that this results in the least possible (at this stage) increase in combined Di size:
for x in S:
i = such i that
size(f(Si + {x})) - size(f(Si))
is min
Si = Si + {x}
This gives practically useful results, but certainly pretty far from optimum (i.e. the minimal possible combined size). Also, this is slow. To speed up somewhat, I compute size(f(Si + {x})) - size(f(Si)) only for those i where x is "similar enough" to objects already in Si.
Is there any standard approach to such kinds of problems?
I know of branch and bounds algorithm family, but it cannot be applied here because it would be prohibitively slow. My guess is that it is simply not possible to compute optimal distribution of S into Si in reasonable time. But is there some common iteratively improving algorithm?
EDIT:
As comments noted, I never defined "similarity". In fact, all I want is to split in such subsets Si that combined size of Di = f(Si) is minimal or at least small enough. "Similarity" is defined only as this and unfortunately simply cannot be computed easily. I do have a simple approximation, but it is only that — an approximation.
So, what I need is a (likely heuristical) algorithm that minimizes sum f(Si) given that there is no simple way to compute the latter — only approximations I use to throw away cases that are very unlikely to give good results.
About the slowness I found that in similar problems a good-enough solution is to compute the match just by picking a fixed number of random candidates.
True that the result will not be the best one (often worse than the full "greedy" solution you implemented) but it in my experience not too bad and you can decide the speed... it can even be implemented in a prescribed amount of time (that is you keep searching until the allocated time expires).
Another option I use is to keep searching until I see no improvement for a while.
To get past the greedy logic you could keep a queue of N "x" elements and trying to pack them simultaneously in groups of "k" (with k < N).
In this case I found that is important to also keep the "age" of an element in the queue and to use it as a "prize" for the result to avoid keeping "bad" elements forever in the queue because others will always match better (this would make the queue search useless and the results would be basically the same as the greedy approach).