Data structures for fast intersection operations? - algorithm

Randomly select two sets; each set contains distinct keys (one key may belong to multiple sets, but a set never contains duplicate keys).
Return an integer representing the number of keys that belong to both sets.
For example, intersect({1,2,3,4}, {3,4,5}) returns 2.
I just need the size of the intersection. I don't need to know exactly which keys are in the intersection.
Are there any data structures that support this kind of operation in less than O(n) time?
Edit:
Reading the data out does require O(n) time, but that does not lead to the conclusion that the intersection operation itself can't be done in less than O(n) time.
Imagine this scenario:
I have N sets, each containing 100 keys. I read them all in; that's N*100 operations. Now I want to know which pair of sets has the largest intersection; that's O(N²) intersection operations. So I want to reduce the complexity of the intersection operation. I don't really care how much time it takes to read and construct the sets; it's at most N*100, which is nothing compared to the O(N²) intersection operations.
Be aware that there's no way to find the pair of sets with the largest intersection using fewer than O(N²) intersection operations; I can prove that. You must do all the intersection operations.
(The basic idea: imagine a complete graph with N vertices, each representing a set, and N*(N-1)/2 edges, each representing the intersection of the connected pair. Now assign each edge any non-negative weight you want (representing the intersection size); I can always construct N sets satisfying those N*(N-1)/2 edge weights. That proves my claim.)
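For concreteness, here is a minimal brute-force sketch of that scenario (using plain Python sets and a made-up function name); every len(sets[i] & sets[j]) call is exactly the O(n) intersection step I would like to speed up:

from itertools import combinations

def largest_intersection_pair(sets):
    # Brute force over all N*(N-1)/2 pairs; len(a & b) is the O(n) step in question.
    best_pair, best_size = None, -1
    for i, j in combinations(range(len(sets)), 2):
        size = len(sets[i] & sets[j])
        if size > best_size:
            best_pair, best_size = (i, j), size
    return best_pair, best_size

# largest_intersection_pair([{1, 2, 3, 4}, {3, 4, 5}, {5, 6}]) -> ((0, 1), 2)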

I would advise you to take a look at two possible alternatives, which work particularly well in practice (especially in the case of large sets).
1. The Bloom Filter data structure
A Bloom filter is a space-efficient probabilistic data structure, based on a bit array, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
There is a trade-off between the false positive rate and the memory footprint of the Bloom Filter. Hence, it is possible to estimate an appropriate size of the Bloom Filter for different use cases.
Each set can be associated with its own Bloom Filter. It is very easy to obtain the Bloom Filter that corresponds to the intersection of different sets: the bit arrays of the individual Bloom Filters are simply combined using the bitwise AND operation.
Having the Bloom Filter that corresponds to the intersection, it is possible to quickly find the items that are present in all intersected sets.
Apart from that, it is possible to estimate the cardinality of the intersection without actually iterating over the entire sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets
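As an illustration, here is a minimal sketch of that estimate with a toy Bloom Filter whose bit array is packed into a Python integer (class and helper names are made up); the intersection cardinality is estimated by inclusion-exclusion over the two filters and their bitwise OR (the union), as described in the linked Wikipedia section:

import hashlib
import math

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0                                   # bit array packed into an int

    def _positions(self, item):
        # k bit positions via double hashing of a single SHA-256 digest
        digest = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") or 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def estimated_count(self):
        # Swamidass & Baldi estimate: n ~ -(m/k) * ln(1 - X/m), X = number of set bits
        x = bin(self.bits).count("1")
        return float("inf") if x >= self.m else -(self.m / self.k) * math.log(1 - x / self.m)

def estimated_intersection_size(f1, f2):
    # |A n B| ~ est(A) + est(B) - est(A u B); the union filter is the bitwise OR
    union = BloomFilter(f1.m, f1.k)
    union.bits = f1.bits | f2.bits
    return f1.estimated_count() + f2.estimated_count() - union.estimated_count()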
2. The Skip list data structure
A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each successive subsequence skipping over fewer elements than the previous one.
Put succinctly, a Skip List is very similar to a plain Linked List, except that each node has a few additional pointers to items located further along the list (pointers that "skip" over a couple of other nodes).
Hence, to compute the intersection, you maintain a pointer into each of the Skip Lists being intersected. During the intersection, these pointers jump over items that are not present in all of the intersected Skip Lists. As a result, the intersection operation usually runs faster than O(N).
The algorithm of the intersection of Skip Lists is described in the book "Introduction to Information Retrieval" (written by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze):
http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html
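A rough array-based sketch of that skip-pointer idea, counting matches only (which is all the original question needs); the "skip pointers" here are simply jumps of about sqrt(n) positions through a sorted list, the spacing the book suggests:

import math

def intersection_size_with_skips(a, b):
    # a and b are sorted lists of distinct keys
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            # follow the skip pointer only if it does not overshoot b[j]
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return count

# intersection_size_with_skips([1, 2, 3, 4], [3, 4, 5]) -> 2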
Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).
Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?

Let's assume there is an algorithm that computes the intersection size in less than O(n) time. Now let's read part of the input. We have two options:
we've read the whole of one set and part of the other, or we've read part of the first set and part of the other.
Option 1:
Counter-example: take an input such that there exists an element which was read from set 1, hasn't been read from set 2, but is in set 2; we'll receive an incorrect result.
Option 2:
Counter-example: we can have an input such that there exists an element that is in both sets but hasn't been read from at least one of them. We receive an incorrect result.
OK, we've proven that there is no algorithm that returns the correct result without reading the whole input.
Let's read the whole input: n numbers. Oops, the complexity is O(n).
End of proof.

Related

Randomly Consuming a Set of Elements Indexed From 1 to N

In this problem, I have a set of elements that are indexed from 1 to n. Each element actually corresponds to a graph node and I am trying to calculate random one-to-one matchings between the nodes. For the sake of simplicity, I neglect further details of the actual problem. I need to write a fast algorithm to randomly consume these elements (nodes) and do this operation multiple times in order to calculate different matchings. The purpose here is to create randomized inputs to another algorithm and each calculated matching at the end of this will be another input to that algorithm.
The most basic algorithm I can think of is to create copies of the elements in the form of an array, generate random integers, and use them as array indices to apply swap operations. This way each random copy can be created in O(n) but in practice, it uses a lot of copy and swap operations. Performance is very important and I am looking for faster ways (algorithms and data structures) of achieving this goal. It just needs to satisfy the two conditions:
It shall be able to consume a random element.
It shall be able to consume an element at a given index.
I tried to write this as clearly as possible. If you have any questions, feel free to ask and I am happy to clarify. Thanks in advance.
Note: Matching is an operation where you pair the vertices on a graph if there exists an edge between them.
Shuffle an index array (for example, with Fisher-Yates shuffling):
ia = [3,1,4,2]
Walk through the index array and "consume" the set element at the current index:
for x in ia:
    consume(Set[x])
So for this example you will get the order Set[3], Set[1], Set[4], Set[2].
No element swaps are needed; only the array of integers is changed.
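A small runnable Python sketch of the same idea (the element values and the consume callback are placeholders; indices here are 0-based):

import random

def consume_in_random_order(elements, consume):
    ia = list(range(len(elements)))        # index array
    for i in range(len(ia) - 1, 0, -1):    # Fisher-Yates shuffle of the indices
        j = random.randint(0, i)
        ia[i], ia[j] = ia[j], ia[i]
    for x in ia:                           # walk the shuffled indices
        consume(elements[x])

# consume_in_random_order(['a', 'b', 'c', 'd'], print)   # e.g. prints: c a d b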

Is a zinterstore going to be faster/slower when one of the two input sets is a normal set?

I know I can do a zinterstore with a normal set as an argument (Redis: How to intersect a "normal" set with a sorted set?). Is that going to affect performance? Is it going to be faster/slower than working only with zsets?
According to the sorted-set source code, ZINTERSTORE will treat a plain set like a sorted set whose members all have score 1; the relevant function is zunionInterGenericCommand.
Intersecting sets will take more or less time depending on the sorting algorithm used in this step, for example:
/* sort sets from the smallest to largest, this will improve our
* algorithm's performance */
qsort(src,setnum,sizeof(zsetopsrc),zuiCompareByCardinality);
There are also differences in how Sets and Zsets are stored, which will affect how they are read. Redis decides how to encode a (Sorted) Set depending on how many elements it contains, so iterating through them requires different work.
However, for any practical purpose, I'd say your best bet is to use ZINTERSTORE, and I'll explain why: I hardly see how anything you might write in your own code will beat Redis's performance for the intersection you want to do.
If your concern is performance, you're getting too deep into the details. Your focus should be on the big-O of the operation instead, shown in the command documentation:
Time complexity: O(N*K)+O(M*log(M)) worst case with N being the
smallest input sorted set, K being the number of input sorted sets and
M being the number of elements in the resulting sorted set.
What this tells you is:
1. The size of the smallest set and the number of sets you plan to intersect determine the first part. Therefore if you know you'll always intersect 2 sets, one small and the other huge, you can treat the first part as constant. A good example of this would be intersecting a set of all available products in a store (where the score is how many are in stock) with a sorted set of the products in a user's cart.
In this case you'll have only 2 sets, and you'll know one of them will be very small.
2. The size of the resulting sorted set M can cause a big performance issue. But there's a detail here: sorted sets are encoded as a ziplist while they are small and as a skip list once they grow too big, which is where the important hit for big sorted sets comes from.
However, for an intersection you know that the resulting set cannot be bigger than the smallest set you provide. For a union, the resulting set will contain all elements of all sets, so attention needs to be on the size of the bigger sets more than on the smallest.
In summary, the answer to the question of performance with (sorted) sets is: it depends on the sizes of the sets much more than on the actual datatype. Take into consideration that the resulting data structure will be a sorted set regardless of whether all the inputs are plain sets. Therefore a big resulting sorted set will be stored (less efficiently) as a skip list.
Knowing beforehand how many sets you plan to intersect (2, 3, depending on user input?) and the size of the smallest set (10? hundreds? thousands?) will give you a much better idea than the internal datatypes. The algorithm for intersecting is the same for both types.
Redis by default assumes each element of the plain set to have some default score, so it treats the plain set like a sorted set in which all elements have an equal default score. I believe performance should be the same as intersecting 2 sorted sets.
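For illustration, a small sketch using the redis-py client (key names and the stock/cart example are made up, echoing the example above); with the default SUM aggregate the plain set contributes a score of 1 per member:

import redis

r = redis.Redis()
r.sadd("cart:items", "apple", "pear")                   # plain set
r.zadd("stock", {"apple": 12, "pear": 3, "plum": 7})    # sorted set, score = units in stock
r.zinterstore("cart:stock", ["stock", "cart:items"])    # the plain set is treated as scores of 1
print(r.zrange("cart:stock", 0, -1, withscores=True))
# e.g. [(b'pear', 4.0), (b'apple', 13.0)]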

Algorithm for checking if set A is a subset of set B in faster than linear time

Is there an algorithm (preferably constant time) to check if set A is a subset of set B?
Creating the data structures to facilitate this problem does not count against the runtime.
Well, you're going to have to look at each element of A, so it must be at least linear time in the size of A.
An O(A+B) algorithm is easy using hashtables (store the elements of B in a hashtable, then look up each element of A). I don't think you can do any better unless you know something about the structure of B in advance. For instance, if B is stored in sorted order, you can do O(A log B) using binary search.
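A minimal sketch of that hash-table approach, with Python's built-in set standing in for the hashtable:

def is_subset(a, b):
    # O(|A| + |B|) expected time: hash the elements of B, then probe for each element of A
    b_hash = set(b)
    return all(x in b_hash for x in a)

# is_subset([1, 2], [3, 2, 1]) -> True
# is_subset([1, 4], [3, 2, 1]) -> False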
You might go for a Bloom filter (http://en.wikipedia.org/wiki/Bloom_filter). However, there might be false positives, which can be addressed by the method mentioned by Keith above (but note that the worst-case complexity of hashing is NOT O(n); you can, however, do O(n log n)).
See if A is a subset of B according to Bloom filter
If yes, then do a thorough check
If you have a list of the least common letters and pairs of letters in your string set, you can store your sets sorted by their least common letters and letter pairs, maximizing your chances of tossing out negative matches as quickly as possible.
It's not clear to me how well this would combine with a Bloom filter; probably a hash table will do, since there aren't very many digrams and letters.
If you had some information about the maximum size of subsets, or even a common size, you could similarly preprocess the data by putting all of the subsets of a given size into a Bloom filter as mentioned.
You could also do a combination of both of these.

Data structure for range query

I was recently asked a coding question on the problem below.
I have some solutions to this problem but I am not very sure whether they are the most efficient.
Problem:
Write a program to track a set of text ranges. The start point and end point will be strings.
Text range example : [AbA-Ef]
Aa would fall before this range
AB would fall inside this range
etc.
String comparison would be like 'A' < 'a' < 'B' < 'b' ... 'Z' < 'z'
We need to support the following operations on these ranges:
Add range - this should merge the ranges if applicable
Delete range - this deletes a range from the tracked ranges and recomputes the ranges
Query range - given a character, the function should return whether it is part of any of the tracked ranges or not
Note that tracked ranges can be discontinuous.
My solutions:
I came up with two approaches.
Store ranges as a doubly linked list, or
Store ranges as some sort of balanced tree with leaf nodes holding the actual data, inter-connected as a linked list.
Do you think these solutions are good enough, or can you think of a better way of doing this so that all three APIs give the best performance?
You are probably looking for an interval tree.
Use the data structure with your custom comparator to indicate "What's on range", and you will be able to do the required operations efficiently.
Note that an interval tree is actually an efficient way to implement your second idea (storing ranges as some sort of balanced tree).
I'm not clear on what the "delete range" operation is supposed to do. Does it:
Delete a previously inserted range, and recompute the merge of the remaining ranges?
Stop tracking the deleted range, regardless of how many times parts of it have been added?
That doesn't make a huge difference algorithmically; it's just bookkeeping. But it's important to clarify. Also, are the ranges closed or half-open? (Another detail which doesn't affect the algorithm but does affect the implementation).
The basic approach to this problem is to merge the tracked set into a sorted list of disjoint (non-overlapping) ranges; either as a vector or a binary search tree, or basically any structure which supports O(log n) searching.
One approach is to put both endpoints of every disjoint range into the datastructure. To find out if a target value is in a range, find the index of the smallest endpoint greater than the target. If the index is odd the target is in some range; even means it's outside.
Alternatively, index all the disjoint ranges by their start points; find the target by searching for the largest start-point not greater than the target, and then compare the target with the associated end-point.
I usually use the first approach with sorted vectors, which are plausible if (a) space utilization is important and (b) insert and merge are relatively rare. With binary search trees, I go for the second approach. But they differ only in details and constants.
Merging and deleting are not difficult, but there are an annoying number of cases. You start by finding the ranges corresponding to the endpoints of the range to be inserted/deleted (using the standard find operation), remove all the ranges in between the two, and fiddle with the endpoints to correct the partially overlapping ranges. While the find operation is always O(log n), the tree/vector manipulation is O(n) (if the inserted/deleted range is large, anyway).
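A minimal sketch of the first approach (sorted vector of endpoints), assuming half-open ranges [start, end) for simplicity; for the original question the string endpoints would first be mapped through a key implementing the custom character ordering:

from bisect import bisect_right

def in_tracked_ranges(endpoints, x):
    # endpoints is the flattened, sorted list [l0, r0, l1, r1, ...] of disjoint half-open ranges
    i = bisect_right(endpoints, x)    # index of the smallest endpoint greater than x
    return i % 2 == 1                 # odd -> inside some range, even -> outside

# Tracked ranges [2, 5) and [9, 12):
# in_tracked_ranges([2, 5, 9, 12], 3)  -> True
# in_tracked_ranges([2, 5, 9, 12], 7)  -> False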
Most languages, including Java and C++, have some sort of ordered map or ordered set in which you can both look up a value and find the next value after or the first value before a given value. You could use this as a building block: if it contains a set of disjoint ranges, then it will hold the least element of a range followed by the greatest element of that range, followed by the least element of the next range, followed by its greatest element, and so on. When you add a range you can check whether you have preserved this property; if not, you need to merge ranges. Similarly, you want to preserve this property when you delete. Then you can answer a query by just checking whether there is a least element just before your query point and a greatest element just after it.
If you want to create your own data structure from scratch, I would think about some sort of radix trie structure, because this avoids doing lots of repeated string comparisons.
I think you would go for a B+ tree; it's the same as what you mentioned as your second approach.
Here are some properties of B+ trees:
All data is stored in leaf nodes.
Every leaf is at the same level.
All leaf nodes have links to other leaf nodes.
Here are a few applications of B+ trees:
It reduces the number of I/O operations required to find an element in the tree.
Often used in the implementation of database indexes.
The primary value of a B+ tree is in storing data for efficient retrieval in a block-oriented storage context — in particular, file systems.
NTFS uses B+ trees for directory indexing.
Basically it helps with range-query lookups and minimizes tree traversal.

efficient algorithm to test _which_ sets a particular number belongs to

If I have a large set of continuous ranges (e.g. [0..5], [10..20], [7..13], [-1..37]) and can arrange those sets into any data structure I like, what's the most efficient way to test which sets a particular test_number belongs to?
I've thought about storing the sets in a balanced binary tree keyed on the low number of each set (each node would hold all the sets that share that lowest number). This would let you efficiently prune sets: if the test_number is less than the lowest number of a node's sets, you can prune that node and all the nodes to its right (which have a low end greater than the test_number). I think that would prune about 25% of the sets on average, but then I would need to look linearly at all the remaining nodes in the binary tree to determine whether the test_number belongs to their sets. (I could further optimize by sorting the lists of sets at any one node by the highest number in the set, which would allow me to do a binary search within a specific list to determine which sets, if any, contain the test_number. Unfortunately, most of the sets I'll be dealing with don't have overlapping set boundaries.)
I think that this problem has been solved in graphics processing since they've figured out ways to efficiently test which polygons in their entire model contribute to a specific pixel, but I don't know the terminology of that type of algorithm.
Your intuition about the relevance of your problem to graphics is correct. Consider building and querying a segment tree. It is particularly well suited for the counting query you want. See also its description in Computational Geometry.
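When only the count of containing ranges is needed, there is also a simpler structure than a full segment tree: keep the start points and the end points in two separate sorted arrays and binary-search both. A rough sketch of that simpler counting approach (not a segment tree), using the example ranges from the question:

from bisect import bisect_left, bisect_right

class RangeCounter:
    def __init__(self, ranges):                 # ranges are closed intervals [lo, hi]
        self.los = sorted(lo for lo, hi in ranges)
        self.his = sorted(hi for lo, hi in ranges)

    def count_containing(self, x):
        # ranges containing x = (# of lo <= x) - (# of hi < x)
        return bisect_right(self.los, x) - bisect_left(self.his, x)

# rc = RangeCounter([(0, 5), (10, 20), (7, 13), (-1, 37)])
# rc.count_containing(12) -> 3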
I think building a tree structure will speed things up considerably (provided you have enough sets and numbers to check that it's worth the initial cost). Instead of a binary tree it should be a ternary tree. Each node should have left, middle, and right children, where the left subtree contains sets that are strictly less than the node's set, the right contains sets that are strictly greater, and the middle contains sets with some overlap.
Set1
/ | \
/ | \
/ | \
Set2 Set3 Set4
It's quick and easy to tell if there's overlap in the sets, since you only have to compare the min and max values to order them. In the simple case above, Set2[max] < Set1[min], Set4[min] > Set1[max], and Set1 and Set3 have some overlap. This will speed up your search because if the number you're searching for is in Set1, it won't be in Set2 or Set4, and you don't have to check them.
I just want to point out that using a scheme like this only saves time over the naive implementation of checking every set if you have more numbers to check than you have sets.
I think I would organise them in the same way Mediawiki indexes pages - as a bucket sort. I don't know that it's the most efficient algorithm out there, but it should be fast, and is pretty easy to implement (even I've managed it, and in SQL at that!!).
Basically, the algorithm for sorting is
For Each SetOfNumbers
    For Each NumberInSet
        Put SetOfNumbers into Bin(NumberInSet)
Then to query, you can just count the number of items in Bin(MyNumber)
This approach will work well when your SetsOfNumbers rarely change, although if they change regularly it's generally not too hard to keep the Bins updated either. Its chief disadvantage is that it trades space, and initial sorting time, for very fast queries.
Note that in the algorithm I've expanded the Ranges into SetsOfNumbers - enumerating every number in a given range.
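A small sketch of the bin idea (names are illustrative), with each range expanded into its individual numbers exactly as described:

from collections import defaultdict

def build_bins(ranges):
    bins = defaultdict(list)
    for set_id, (lo, hi) in enumerate(ranges):
        for n in range(lo, hi + 1):      # enumerate every number in the range
            bins[n].append(set_id)
    return bins

# bins = build_bins([(0, 5), (10, 20), (7, 13), (-1, 37)])
# bins[12]      -> [1, 2, 3]   (which sets contain 12)
# len(bins[12]) -> 3           (how many)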
