I have a set of disjoint integer intervals and want to check whether a given integer lies in one of these intervals. Of course, this can be achieved by means of a binary search in logarithmic time. However, the vast majority of the queries return false, i.e., only very few integers lie in any interval. To speed up the application, I'm looking for a probabilistic, constant-time algorithm (some sort of hash function) that tells me whether a given integer is definitely not, or maybe, in an interval. Here is a sketch of the intended algorithm, where magic_data_structure is initialized with the intervals stored in tree:
x = some_integer;
if(!magic_data_structure.find(x))
return false; // definitely not in any interval
return tree.find(x); // binary search on tree
Any ideas or hints for literature? Thank you very much in advance for your help!
P.S.: Does anybody know of improvements to interval trees for non-overlapping intervals which (unlike the ones described above) may include other intervals?
This is a naive solution, but it is constant-time.
If you are not dealing with extremely large quantities of numbers, you could just use a hash table where the keys are the numbers and the values are pointers to the interval they're in. But of course if there is a lot of data it might take too long (and too much memory) to index it this way.
Looks like there are various disjoint-set data structures and algorithms to store/search them, but I doubt if any of them have constant times.
So, I was watching one of these Google videos on how they conduct interviews (https://www.youtube.com/watch?v=XKu_SEDAykw) and I found their solution quite weird. Since there are a lot of clever people working at Google, I am wondering now if I got something wrong or if they do. Let me sum up the task and solution in case you don't want to watch it:
The task is to give an efficient algorithm for the following problem:
Given an array A of integers and a separate integer a, find two indices i,j, such that A[i]+A[j] = a.
They start off with the array being sorted and produce a nice linear time algorithm. But then the interviewer asks what happens if the array isn't sorted. And they propose the following linear time algorithm (they say that first sorting the array and then using their linear time algorithm is too slow, although it would run in nlogn time):
They go through the array from first to last and use a hash set to store the numbers they have already seen. Then they only need to check the hash set at every index of the array (i.e., did I already see the number I need to reach the sum), and since that is apparently possible in constant time, the whole algorithm runs in linear time (essentially a constant number of hash-set operations times Array.length).
Now to my criticism: I think there is a huge flaw in this solution, which essentially lies in the possibility of collisions. Since they assume nlogn to be slow, we can assume that the hash set has less than logn many different entries. Now, given any large input, the probability of getting a collision tends to 1 when hashing n numbers into at most logn many sets. So they trade a very modest speed increase (they assume that ten billion is large for that array, but then the log (base 2) is still less than 30. However, matching this speed with the hash set algorithm would mean that over 300 million numbers would be hashed to the same spot) for an almost definite erroneous algorithm.
I either misunderstand something with hashing or this is an awful solution for the problem. Again the safe nlogn algorithm is not much slower than the one they give, unless the array gets so big that the hash algorithm will get a collision for sure.
I wouldn't be surprised if a constant time algorithm that throws a coin for small arrays and always says yes for large arrays would have the same rate of success on average as their hash set solution.
If I misunderstand something about hashing please point that out, because I find it rather hard to believe that they would make such an error at a top notch computer engineering company.
To be clear, a "hash set" is a hash table where the key is the entire entry; there is no associated value, so the only interesting fact about a key is its presence. It's a minor variation on a hash-table implementation.
As has been noted, there is no basis to your statement that the size of the hash set needs to be less than log n in order to reduce lookup time. It's the other way around: the size of the hash set (the number of buckets) should be linear in the size of the dataset, so that the expected length of a hash chain is O(1). (For complexity analysis, it doesn't matter whether the expected length of a hash chain is 1 or 1,000: both are O(1).)
But even if the expected hash table lookup wasn't O(1), there is still a huge advantage to hashing over sorting: hashing is easily parallelizable. And that's something very important to Google, since only parallel algorithms can cope with Google-sized data sets.
In practice, a googly solution to this problem (I think: I haven't watched the video) would be to use two different hashes. The first hash assigns each number to a server, so it has a very large bucket size since each server has a lot of data. Each server then uses its own hash function to map its own data to its own buckets.
Now I can scan the entire dataset in parallel (using other servers), and for each entry, ask the appropriate storage server (which I work out by using the primary hash) whether the needed complement is in its dataset. Since every entry can only be stored on one server (or set of servers, if the data is replicated for reliability), I don't actually have to disturb the irrelevant servers. (In practice, I would take a bunch of queries, bucket them by server, and then -- in parallel -- send each server a list of queries, because that reduces connection setup time. But the principle is the same.)
That's a very easy and almost infinitely scalable approach to the problem, and I think that an interviewer would be happy to hear about it. Parallel sorting is much more difficult, and in this case the complexity is totally unnecessary.
Of course, you may well have a good argument in favour of your own preferred strategy, and a good interviewer will be happy to hear a good argument as well, even if it's not one they thought of before. Good engineers are always open to discussing good ideas. And that discussion cannot start with the assumption that one of two different ideas must be "stupid".
Since they assume nlogn to be slow, we can assume that the hash set has less than logn many different entries
Wrong. The size of the hash table will be O(len(A)). This will not make the algorithm take more than linear expected time, since there is no multiplicative factor of the hash table size in the algorithm's runtime.
Also, while collisions are likely, hash tables are generally assumed to have some sort of collision resolution strategy. Collisions will not produce incorrect results.
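To make the discussed algorithm concrete, here is a sketch (names are mine) of the one-pass hash-set approach: for each element, check whether the complement needed to reach the target was already seen.

```python
def two_sum(A, a):
    """Return indices (i, j) with A[i] + A[j] == a, or None.

    One pass: for each A[j], check whether the complement a - A[j]
    was already seen. Expected O(len(A)) time with a hash map.
    """
    seen = {}  # value -> index where it was first seen
    for j, x in enumerate(A):
        if a - x in seen:
            return seen[a - x], j
        seen.setdefault(x, j)  # keep the first index of duplicates
    return None
```

Keeping the first index of a duplicate value means pairs such as 4 + 4 = 8 are still found with two distinct indices.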
I'm interested in finding a comparison sorting algorithm that minimizes the number of times each single element is compared with others during the run of the algorithm.
For a randomly sorted list, I'm interested in two distributions: the number of comparisons that are needed to sort a list (this is the traditional criterion) and the number of comparisons in which each single element of the list is involved.
Among the algorithms that have a good performance in terms of the number of comparisons, say achieving O(n log(n)) on average, I would like to find out the one for which, on average, the number of times a single element is compared with others is minimized.
I suppose the theoretical minimum is O(log(n)), obtained by dividing the above figure for the total number of comparisons, O(n log(n)), by n.
I'm also interested in the case where data are likely to be already ordered to some extent.
Is perhaps a simulation the best way to go about finding an answer?
Yes, you definitely should do simulations.
There you will implicitly set the size and pre-ordering constraints in a way that may allow more specific statements than the general question you raised.
There can, however, not be a clear answer to such a question in general.
Big-O deals with asymptotic behaviour, while your question seems to target smaller problem sizes. So Big-O could hint at the best candidates for sufficiently large input sets to a sort run. (But, e.g., if you are interested in size <= 5, the results may be completely different!)
To get a proper estimate of the comparison operations, you would need to analyze each individual algorithm.
In the end, the result (for a given algorithm) will necessarily be specific to the dataset being sorted.
Also, "on average" is not well defined in your context. I'd assume you intend to refer to the number of comparisons involving the participating objects for a given sort run, and not an average over a (sufficiently large) set of sort runs.
Even within a single algorithm, the distribution of comparisons an individual object takes part in may show a large standard deviation in one case and be (nearly) equal in another.
As the complexity of a sorting algorithm is determined by the total number of comparisons (and position changes), I do not expect theoretical analysis to contribute much to an answer.
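The suggested simulation can be sketched as follows (a toy in Python; the `Counted` wrapper is mine, and the built-in `sorted` stands in for whichever comparison sort you want to instrument):

```python
import random

class Counted:
    """Wraps a value and counts the comparisons it participates in."""
    def __init__(self, value):
        self.value = value
        self.comparisons = 0
    def __lt__(self, other):
        # Every comparison involves two objects; credit both.
        self.comparisons += 1
        other.comparisons += 1
        return self.value < other.value

def per_element_comparisons(values):
    items = [Counted(v) for v in values]
    sorted(items)  # Timsort; swap in any comparison sort to compare algorithms
    return [it.comparisons for it in items]

random.seed(0)
counts = per_element_comparisons([random.random() for _ in range(1000)])
print(max(counts), sum(counts) / len(counts))  # worst and average per element
```

Feeding in partially ordered inputs instead of `random.random()` would cover the "already ordered to some extent" case from the question.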
Maybe you can add some background on what would make an answer to your question "interesting" in a practical sense?
I'm preparing to attend technical interviews and have mostly faced questions which are situation-based. Often the situation is a big dataset, and I'm asked to decide which would be the most optimal data structure to use.
I'm familiar with most data structures, their implementations, and their performance. But I face a dilemma when given such situations and have to be decisive about a structure.
I'm looking for steps or an algorithm to follow in a given situation which can help me arrive at the optimal data structure within the time frame of the interview.
It depends on what operations you need to support efficiently.
Let's start with the simplest example - you have a large list of elements and you have to find a given element. Let's consider various candidates.
You can use a sorted array to find an element in O(log N) time using binary search. What if you want to support insertion and deletion along with that? Inserting an element into a sorted array takes O(N) time in the worst case. (Think of adding an element at the beginning: you have to shift all the elements one place to the right.) Now here come binary search trees (BSTs). They can support insertion, deletion and searching for an element in O(log N) time.
Now suppose you need to support two more operations, namely finding the minimum and maximum. With the sorted array, this is just returning the first and the last element respectively, and hence the complexity is O(1). Assuming the BST is a balanced one like a red-black tree or AVL tree, finding the min and max needs O(log N) time. Consider another situation where you need to return the kth order statistic. Again, the sorted array wins. As you can see, there is a tradeoff and it really depends on the problem you are given.
Let's take another example. You are given a graph of V vertices and E edges and you have to find the number of connected components in the graph. It can be done in O(V+E) time using Depth first search (assuming adjacency list representation). Consider another situation where edges are added incrementally and the number of connected components can be asked at any point of time in the process. In that situation, Disjoint Set Union data structure with rank and path compression heuristics can be used and it is extremely fast for this situation.
One more example - you need to support range updates and finding the sum of a subarray efficiently, and no new elements are inserted into the array. If you have an array of N elements and Q queries are given, then there are two choices. If the range-sum queries come only after all the update operations (say Q' of them), then you can preprocess the array in O(N+Q') time and answer any query in O(1) time (store prefix sums). What if there is no such order enforced? You can use a segment tree with lazy propagation for that. It can be built in O(N log N) time and each query can be performed in O(log N) time, so you need O((N+Q) log N) time in total. Again, what if insertion and deletion are supported along with all these operations? You can use a data structure called a treap, which is a probabilistic data structure, and all these operations can be performed in O(log N) time (using an implicit treap).
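The prefix-sum option mentioned above can be sketched like this (names are mine):

```python
from itertools import accumulate

def build_prefix_sums(arr):
    """prefix[i] = sum of arr[0..i-1]; one O(N) preprocessing pass."""
    return [0] + list(accumulate(arr))

def range_sum(prefix, l, r):
    """Sum of arr[l..r] inclusive, answered in O(1) per query."""
    return prefix[r + 1] - prefix[l]

arr = [2, 5, 1, 7, 3]
prefix = build_prefix_sums(arr)
print(range_sum(prefix, 1, 3))  # 5 + 1 + 7 = 13
```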
Note: The constant is omitted while using Big Oh notation. Some of them have large constant hidden in their complexities.
Start with common data structures. Can the problem be solved efficiently with arrays, hashtables, lists or trees (or a simple combination of them, e.g. an array of hashtables or similar)?
If there are multiple options, just iterate the runtimes for common operations. Typically one data structure is a clear winner in the scenario set up for the interview. If not, just tell the interviewer your findings, e.g. "A takes O(n^2) to build but then queries can be handled in O(1), whereas for B build and query time are both O(n). So for one-time usage, I'd use B, otherwise A". Space consumption might be relevant in some cases, too.
Highly specialized data structures (e.g. prefix trees aka "Trie") are often that: highly specialized for one particular specific case. The interviewer should usually be more interested in your ability to build useful stuff out of an existing general purpose library -- opposed to knowing all kinds of exotic data structures that may not have much real world usage. That said, extra knowledge never hurts, just be prepared to discuss pros and cons of what you mention (the interviewer may probe whether you are just "name dropping").
I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve a space-efficient representation. I already implemented this using a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation the number of resulting ranges is typically within 5,000-10,000, and many of them are single-item, of course.
Test membership of some integer; information about the specific range in the set is not required. This one must be very fast -- O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space-inefficient. Other structures, like binary trees, have O(log n) complexity; the worst thing with them is that the implementation makes many conditional jumps which the processor cannot predict well, giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
Regarding the second issue:
You could look up Bloom filters. Bloom filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear-cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.
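A minimal sketch of such a gatekeeper (a toy Bloom filter; the bit count, hash count, and use of `blake2b` are arbitrary choices of mine, not tuned values):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: might_contain() returning False is definite,
    True means "maybe" and needs a follow-up check on the real data."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions by salting one cryptographic hash.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))
```

Sizing `num_bits` and `num_hashes` to the expected element count controls the false-positive rate; a real implementation would use a faster non-cryptographic hash.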
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1,000 real ranges (i.e., an order of magnitude fewer), and log(1000) ~ 10.
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The sorted array of ranges is tested before the hash table because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
One last note: beware of O(1), while it may seem appealing, you are not here in an asymptotic case. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by getting a O(1) solution with such a high constant factor that it actually runs slower than a O(log N) solution :)
If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] < [c, d] iff b < c. Because all of the ranges are disjoint, this gives a total ordering. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the last slot. This construction takes O(n lg n) time.
To check if some interval contains a given integer, do a binary search on this array. Starting at the middle interval, check whether the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the interval, continue the search on the left, and if it is greater than the largest value in the interval, continue on the right. This is essentially a standard binary search, and it runs in O(lg n) time.
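A sketch of this lookup, assuming the merged intervals are stored as sorted, inclusive `(lo, hi)` pairs (here Python's `bisect` performs the binary search described above):

```python
import bisect

def contains(intervals, x):
    """intervals: disjoint (lo, hi) pairs, inclusive, sorted by lo.
    Find the last interval whose lower bound is <= x, then check its
    upper bound: a single binary search, O(lg n)."""
    i = bisect.bisect_right(intervals, (x, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= x <= intervals[i][1]

ranges = [(1, 3), (7, 7), (10, 20)]
```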
Hope this helps!
AFAIK there is no algorithm that searches an arbitrary integer list in O(1).
One can only do O(1) search with a vast amount of memory.
So it is not very promising to try to find an O(1) search algorithm over a list of integer ranges.
On the other hand, you could try time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).
You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably the y-fast tree. It is probably faster than doing binary search, though it requires roughly lg w = lg 32 = 5 hash-table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, so binary search may still be faster.
Rather than 'comparison'-based storage/retrieval (which will always be O(log n)), you need to work with 'radix'-based storage/retrieval.
In other words: extract nibbles from the uint32 and make a trie.
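A sketch of such a nibble trie (a dict-of-dicts toy; names are mine, and a real implementation would use fixed 16-entry arrays in C). A lookup always takes exactly 8 steps, one per nibble, independent of how many keys are stored:

```python
def trie_insert(root, x):
    """Insert a uint32 into a 16-way trie keyed by its 8 nibbles,
    most significant nibble first."""
    node = root
    for shift in range(28, -1, -4):
        node = node.setdefault((x >> shift) & 0xF, {})

def trie_contains(root, x):
    """Membership test: a fixed 8 dictionary hops, regardless of size."""
    node = root
    for shift in range(28, -1, -4):
        nib = (x >> shift) & 0xF
        if nib not in node:
            return False
        node = node[nib]
    return True

root = {}
for v in (0, 123456, 0xFFFFFFFF):
    trie_insert(root, v)
```

Because every key is exactly 8 nibbles long, reaching the bottom of the trie already proves membership, so no end-of-key marker is needed.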
Keep your ranges into a sorted array and use binary search for lookups.
It's easy to implement, it's O(log N), and it uses less memory and needs fewer memory accesses than any tree-based approach, so it will probably also be much faster.
From the description of your problem it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible objects:
a NONE object means no elements beginning with those 16 bits are in the set
an ALL object means all elements beginning with those 16 bits are in the set
a RANGE object means all items with the final 16 bits between an upper and lower bound are in the set
a SINGLE object means just one element beginning with those 16 bits is in the set
a BITSET object handles all remaining cases with a 65536-bit bitset
Of course, you don't need to split at 16 bits; you can adjust the split to reflect the statistics of your set. In fact you don't need to use consecutive bits, but consecutive bits speed up the bit twiddling, and if many of your elements are consecutive, as you claim, they will give good properties.
Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a time/space tradeoff. If you need to save memory, make the tree deeper, with a corresponding slight increase in lookup time.
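As an illustration only, here is a sketch of the scheme in Python (the names and the simplistic degrade-to-bitset policy are mine; a C version would use the union/structure approach described above):

```python
class RangeSet:
    """Sketch: a table of 65536 buckets indexed by the top 16 bits; each
    bucket is NONE, ALL, a ("RANGE", lo, hi) or ("SINGLE", v) over the
    low 16 bits, or a 65536-bit bitset for the remaining cases."""
    NONE, ALL = "NONE", "ALL"

    def __init__(self):
        self.table = [self.NONE] * 65536

    def _to_bitset(self, node):
        # Degrade a RANGE/SINGLE node to a bitset when its bucket
        # receives a second, non-mergeable insertion.
        bits = bytearray(65536 // 8)
        lo, hi = (node[1], node[1]) if node[0] == "SINGLE" else (node[1], node[2])
        for v in range(lo, hi + 1):
            bits[v >> 3] |= 1 << (v & 7)
        return bits

    def add_range(self, lo, hi):
        """Insert the inclusive uint32 range [lo, hi]."""
        for top in range(lo >> 16, (hi >> 16) + 1):
            a = lo & 0xFFFF if (lo >> 16) == top else 0
            b = hi & 0xFFFF if (hi >> 16) == top else 0xFFFF
            node = self.table[top]
            if node == self.ALL:
                continue
            if node == self.NONE and (a, b) == (0, 0xFFFF):
                self.table[top] = self.ALL
            elif node == self.NONE and a == b:
                self.table[top] = ("SINGLE", a)
            elif node == self.NONE:
                self.table[top] = ("RANGE", a, b)
            else:
                bits = node if isinstance(node, bytearray) else self._to_bitset(node)
                for v in range(a, b + 1):
                    bits[v >> 3] |= 1 << (v & 7)
                self.table[top] = bits

    def __contains__(self, x):
        node, low = self.table[x >> 16], x & 0xFFFF
        if node == self.NONE:
            return False
        if node == self.ALL:
            return True
        if isinstance(node, bytearray):
            return bool(node[low >> 3] & (1 << (low & 7)))
        if node[0] == "SINGLE":
            return low == node[1]
        return node[1] <= low <= node[2]
```

Lookup is a fixed number of operations per query: one index into the table plus one bucket test.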
I've been able to find details on several self-balancing BSTs through several sources, but I haven't found any good descriptions detailing which one is best to use in different situations (or if it really doesn't matter).
I want a BST that is optimal for storing in excess of ten million nodes. The order of insertion of the nodes is basically random, and I will never need to delete nodes, so insertion time is the only thing that would need to be optimized.
I intend to use it to store previously visited game states in a puzzle game, so that I can quickly check if a previous configuration has already been encountered.
Red-black is better than AVL for insertion-heavy applications. If you foresee relatively uniform look-up, then Red-black is the way to go. If you foresee a relatively unbalanced look-up where more recently viewed elements are more likely to be viewed again, you want to use splay trees.
Why use a BST at all? From your description a dictionary will work just as well, if not better.
The only reason for using a BST would be if you wanted to list out the contents of the container in key order. It certainly doesn't sound like you want to do that, in which case go for the hash table. O(1) insertion and search, no worries about deletion, what could be better?
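For instance, a sketch of the visited-state check with a hash set (the tuple encoding of a game configuration is hypothetical):

```python
# A state is assumed to be a hashable encoding of a configuration,
# e.g. a tuple of tile positions (hypothetical).
visited = set()

def check_and_record(state):
    """Return True if this configuration was seen before, else record it.
    Expected O(1) per call; no ordering is maintained, which is fine
    here because we never need to list states in key order."""
    if state in visited:
        return True
    visited.add(state)
    return False
```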
The two self-balancing BSTs I'm most familiar with are red-black and AVL, so I can't say for certain if any other solutions are better, but as I recall, red-black has faster insertion and slower retrieval compared to AVL.
So if insertion is a higher priority than retrieval, red-black may be a better solution.
[hash tables have] O(1) insertion and search
I think this is wrong.
First of all, if you limit the keyspace to be finite, you could store the elements in an array and do an O(1) linear scan. Or you could shufflesort the array and then do a linear scan in O(1) expected time. When stuff is finite, stuff is easily O(1).
So let's say your hash table will store any arbitrary bit string; it doesn't much matter, as long as there's an infinite set of keys, each of which are finite. Then you have to read all the bits of any query and insertion input, else I insert y0 in an empty hash and query on y1, where y0 and y1 differ at a single bit position which you don't look at.
But let's say the key lengths are not a parameter. If your insertion and search take O(1), in particular hashing takes O(1) time, which means that you only look at a finite amount of output from the hash function (from which there's likely to be only a finite output, granted).
This means that with finitely many buckets, there must be an infinite set of strings which all have the same hash value. Suppose I insert a lot, i.e. ω(1), of those, and start querying. This means that your hash table has to fall back on some other O(1) insertion/search mechanism to answer my queries. Which one, and why not just use that directly?