Data structure which maps non-overlapping ranges to values? - data-structures

I need a data structure which maps non-overlapping ranges (e.g. 8..15, 16..19) to a pointer to a struct.
I will need to be able to look up any element in that range and retrieve that pointer. For example:
structure[8..15] = 0x12345678;
structure[16..19] = 0xdeadbeef;
structure[7]; // => NULL
structure[12]; // => 0x12345678
structure[18]; // => 0xdeadbeef
I'm considering using a binary search tree at the moment. Since the ranges will never overlap, I can search for indexes relatively easily in logarithmic time.
However, I'm wondering if there are any data structures more suitable for this case. I need to be able to efficiently insert, delete and lookup. All of these operations are O(log n) in a BST, but I'm wondering if there's anything that's faster for this.

If you want something faster than O(log n), use a van Emde Boas tree.
Use it the same way you would use a binary search tree: the start of each range is the key, and the end of the range is stored as part of the value (together with the pointer) mapped to that key. Time complexity is O(log log M), where M is the range of keys (INT_MAX, if any integer value is possible as the start of a range).
In some cases a van Emde Boas tree has large memory overhead. If that is not acceptable, use either a simple trie, as explained by Beni, or a y-fast trie.
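To make that key/value layout concrete, here is a minimal C sketch of a range lookup keyed by range start. A sorted array with a binary search stands in for the ordered structure (BST or van Emde Boas tree); the names and types are illustrative, not from the original question.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t start;   /* key: first point in the range */
    uint32_t end;     /* stored with the value: last point in the range */
    void    *value;   /* the pointer the range maps to */
} range_entry;

/* Find the entry whose range contains x, or NULL if none does.
 * Assumes the entries are sorted by start and non-overlapping. */
void *range_lookup(const range_entry *a, size_t n, uint32_t x)
{
    size_t lo = 0, hi = n;                 /* find the largest start <= x */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid].start <= x)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo == 0)
        return NULL;                       /* x lies before every range */
    const range_entry *e = &a[lo - 1];
    return (x <= e->end) ? e->value : NULL;
}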

I don't think you can do much better.
Non-overlapping ranges are equivalent to a sequence of alternating start/end points, so lookup is just "find the largest element ≤ x" followed by an O(1) check of whether it's a start or an end. I.e. an ordered map.
The usual suspects for that - binary trees, B-trees, various tries - are all essentially O(log n). Which is best in practice is a matter of tuning and depends on knowing something about the ranges. Are they sparse or dense? Are they of similar size or do they vary widely? How large is the data (fits in cache / RAM / disk)? Do you insert/delete a lot, or are lookups dominant? Is access random or does it have high locality?
One tradeoff applicable to many schemes is splitting ranges, replicating the same pointer in several places. This may speed up lookups at the expense of insert/delete time and memory usage. An extreme application is just a flat array indexed by point, where lookup is O(1) but insertion is O(size of range); this begs for a multi-level structure: an array of k uniform subranges, each pointing to the value if it is entirely covered by one range, or to a sub-array if not. Hey, I just described a trie! Lookup is log(maxint)/log(k), very fast if k is a power of 2 (e.g. 256); insertion and memory are k*log(n).
But remember that wasting memory hurts cache performance, so any such "optimization" may actually be counter-productive, even for lookups.

Related

Interval search algorithm with O(1) lookup speed

I need to design an interval search algorithm that works on 64-bit keys. The match is when key k is between k1 and k2. An important requirement is that the lookup speed is better than O(log n). Researching available literature didn't turn up anything better than interval search trees. I wonder if it's feasible at all.
If your keys have a distribution close to uniform, you can use
interpolation search, which takes O(log log N) time - much better than O(log n).
UPD: Just an idea:
If you have enough extra memory, you can build a trie-like structure with O(1) search time. The idea: build a tree of arrays[256], where each array is indexed by one byte of the key and the arrays are linked into a trie. So the root element of the trie is an array[256] indexed by the high byte of the key. But this is not really practical, because in the bottom node you would still need a linear search over ~64 entries to find the range borders.
You can dispatch by leading bytes until the problem is small. That avoids most of the overhead of an interval tree, while maintaining the flexibility of one.
So you have a table of 256 structs that point to 256 structs, on down as far as needed, until you either encounter a flag saying "no match" or you are pointed to a small interval tree for the exact matching condition. Processing the top of this tree with straightforward jumps, rather than multiple comparisons with possible pipeline stalls, etc., may be a significant performance improvement for you.
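As a rough illustration of that dispatch-by-bytes idea (names and layout are mine, not from the answer), each node either rejects the key, dispatches on the next byte, or falls back to a small sorted interval list once the problem is small:

#include <stddef.h>
#include <stdint.h>

enum node_kind { NODE_EMPTY, NODE_LEAF, NODE_DISPATCH };

struct interval { uint64_t lo, hi; void *value; };

struct node {
    enum node_kind kind;
    union {
        struct { struct interval *ivs; size_t n; } leaf;  /* few, sorted intervals */
        struct node *child[256];                          /* dispatch on next byte */
    } u;
};

/* Look up a 64-bit key, starting with the most significant byte (byte = 7). */
void *interval_lookup(const struct node *nd, uint64_t key, int byte)
{
    while (nd && nd->kind != NODE_EMPTY) {
        if (nd->kind == NODE_LEAF) {
            for (size_t i = 0; i < nd->u.leaf.n; i++)     /* small: linear scan is cheap */
                if (key >= nd->u.leaf.ivs[i].lo && key <= nd->u.leaf.ivs[i].hi)
                    return nd->u.leaf.ivs[i].value;
            return NULL;
        }
        nd = nd->u.child[(key >> (8 * byte)) & 0xff];     /* straightforward jump */
        byte--;
    }
    return NULL;
}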

Why implement a Hashtable with a Binary Search Tree?

When implementing a Hashtable using an array, we inherit the constant time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree since it offers search with O(log n)? Why not just use a Binary Search Tree directly?
If the elements don't have a total order (i.e. "greater than" and "less than" are not defined for all pairs, or are not consistent between elements), you can't compare all pairs, so you can't use a BST directly. But nothing stops you from indexing the BST by the hash value - since this is an integral value, it obviously has a total order (although you'd still need to resolve collisions, that is, have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked lists for O(n) worst-case performance. But we can also achieve O(log n) performance with BSTs (as is done in Java's hash table implementation when the number of elements with some hash is above a threshold) - that is, have your hash table array where each element points to a BST in which all elements have the same hash value.
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and will also require the occasional resizing, but that is simply a memory copy, as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
But, since I haven't personally ever heard of this ever being done, presumably the benefits are not worth the increased cost of operations from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).
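For illustration only (not a standard library design), here is a C sketch of such a "hash table over a BST": the tree is keyed by the full, un-modded hash value, and keys that share a hash are chained off the node.

#include <string.h>

struct entry { const char *key; void *value; struct entry *next; };

struct hnode {
    unsigned long hash;            /* full hash value, never modded      */
    struct entry *entries;         /* all keys that share this hash      */
    struct hnode *left, *right;    /* ordinary BST ordered by hash value */
};

void *bst_ht_get(const struct hnode *n, unsigned long hash, const char *key)
{
    while (n) {
        if (hash < n->hash)
            n = n->left;
        else if (hash > n->hash)
            n = n->right;
        else {
            /* same hash: resolve collisions by comparing the actual keys */
            for (const struct entry *e = n->entries; e; e = e->next)
                if (strcmp(e->key, key) == 0)
                    return e->value;
            return NULL;
        }
    }
    return NULL;
}

Insertion would descend the same way and never needs to rehash or resize, which is the "no resizing" point made above.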
Pros:
Potentially uses less space because we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the expected O(1) for a chained hash table.
Since the requirements of a Hash Table are O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collisions are an issue with the array implementation (well, not likely a big issue), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want expected O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure to guarantee logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(log n) insertion/deletion.
I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1 node, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is that a bit shift and an addition or subtraction can be used to walk the tree, instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover if the value exists in some other store. Or for a hash collision (or partial collision) search. I can't think of many other uses for it, but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in CPU cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element
Walk Up (go to parent):
(i >> 1) - ((i + 1) % 2)
(Obviously you probably need to test if i is zero)
Walk Left (down and left):
(i << 1) + 2
(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)
Walk Right (down and right):
(i << 1) + 1
As you can see, each walk is a short formula based on the index. A bit shift and addition for going left and right, and a bit shift, addition and modulus for ascending. Two instructions to move down, 4 to move up (in assembler, or as above in C and other HLL operator notation)
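In C the parentheses matter, since << and >> bind more loosely than + and %. A literal sketch of the three walks, assuming the root lives at index 0 and mirroring the formulas above (left child at 2i+2, right child at 2i+1):

#include <stddef.h>

/* Callers must check i > 0 before walking up, and check against
 * 2^depth of the structure before walking down. */
size_t walk_up(size_t i)    { return (i >> 1) - ((i + 1) % 2); }
size_t walk_left(size_t i)  { return (i << 1) + 2; }
size_t walk_right(size_t i) { return (i << 1) + 1; }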
edit:
I can see from further commentary that slashing the insert time would definitely be of benefit. But I don't think that a conventional vector-based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, will naturally travel through memory in a linear fashion when searched, which should help reduce cache misses and thus reduce search latency significantly; there is also a latency hit when accessing memory randomly compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is my current state of a WiP to implement what I am calling a Bifurcation Array Search Tree. For the purpose of fast insert/delete and not-horribly-slow search through a sorted collection of hashes, I think this would be of quite large benefit for cases where a lot of data is coming and going through the structure - or, more to the point, for more realtime applications.

Which search data structure works best for sorted integer data?

I have over a billion sorted integers; which data structure do you think can exploit the sorted behavior? The main goal is to search items faster...
Options I can think of --
1) Regular binary search trees with the recursive split-in-the-middle approach.
2) Any other balanced binary search trees should work well, but they do not exploit the sorted heuristics..
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes; I think plain arrays can't do that unless it is a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bit vector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use the sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log U) time (where U is the size of the integer universe), which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
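For example (illustrative, using only the standard library's bsearch):

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);          /* avoids overflow of x - y */
}

/* Returns 1 if key is present in the sorted array, 0 otherwise. */
int contains(const int *sorted, size_t n, int key)
{
    return bsearch(&key, sorted, n, sizeof *sorted, cmp_int) != NULL;
}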
With a sorted array the best you can achieve is an interpolation search, which gives you O(log(log(n))) average time. It is essentially a binary search, but it doesn't divide the array into two subarrays of the same size.
It's really fast and extraordinary easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to hit.
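A rough sketch of interpolation search on a sorted int array (my own illustrative code, not from the answer): instead of probing the middle, probe where the key "should" be if the values were spread uniformly.

/* Returns the index of key in the sorted array a[0..n-1], or -1 if absent. */
long interp_search(const int *a, long n, int key)
{
    long lo = 0, hi = n - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo])                        /* whole window is one value */
            return (a[lo] == key) ? lo : -1;
        /* linear interpolation between a[lo] and a[hi] picks the probe point */
        long pos = lo + (long)(((double)key - a[lo]) / ((double)a[hi] - a[lo]) * (hi - lo));
        if (a[pos] == key)
            return pos;
        else if (a[pos] < key)
            lo = pos + 1;
        else
            hi = pos - 1;
    }
    return -1;
}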
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table of size 2³² (roughly 4 billion elements), where each index holds the count of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values, if not, you might want to combine the 32-bit strategy with a hash lookup.
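A minimal sketch of the direct-table idea for 32-bit values, shown here with one bit per value (512 MiB) for pure membership rather than a per-value count (assumptions mine):

#include <stdint.h>
#include <stdlib.h>

static uint64_t *bits;   /* 2^32 bits = 512 MiB */

int table_init(void)             { bits = calloc((1ULL << 32) / 64, sizeof *bits); return bits != NULL; }
void table_add(uint32_t x)       { bits[x >> 6] |= 1ULL << (x & 63); }
int  table_contains(uint32_t x)  { return (int)((bits[x >> 6] >> (x & 63)) & 1); }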

Data structure to build and lookup set of integer ranges

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve a space-efficient representation. I have already implemented this using a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation, the number of resulting ranges is typically within 5,000-10,000, and many of them are single-item, of course.
Test membership of some integer; information about the specific range in the set is not required. This one must be very fast -- O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space-inefficient. Other structures, like binary trees, have O(log n) complexity; the worst thing with them is that the implementation makes many conditional jumps which the processor cannot predict well, giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
Regarding the second issue:
You could look up Bloom filters. Bloom filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear-cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom Filter may act as a gate keeper, and reject most of the queries outright.
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 real ranges (i.e., an order of magnitude fewer), and log(1000) ~ 10.
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The Sorted Array test is performed first because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
One last note: beware of O(1); while it may seem appealing, you are not in an asymptotic case here. Barely 5000-10000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by picking an O(1) solution with such a high constant factor that it actually runs slower than an O(log N) solution :)
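Put together, the lookup suggested above is just three short steps; bloom_maybe_contains, ranges_contain and singles_contain are hypothetical helpers standing in for the Bloom filter, the sorted range array, and the hash table:

#include <stdint.h>

/* Hypothetical helpers; each would be backed by one of the structures described above. */
int bloom_maybe_contains(uint32_t x);
int ranges_contain(uint32_t x);       /* binary search over the sorted real ranges */
int singles_contain(uint32_t x);      /* hash table of degenerate (single) ranges  */

int set_contains(uint32_t x)
{
    if (!bloom_maybe_contains(x))     /* a definite "no": stop immediately */
        return 0;
    if (ranges_contain(x))            /* most members fall inside a real range */
        return 1;
    return singles_contain(x);        /* otherwise check the single elements */
}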
If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.
To check whether some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
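A small C version of that search (illustrative; assumes the ranges are disjoint and sorted as described):

#include <stddef.h>
#include <stdint.h>

struct range { uint32_t lo, hi; };    /* inclusive endpoints */

/* Returns 1 if x falls inside one of the n sorted, disjoint ranges. */
int in_ranges(const struct range *r, size_t n, uint32_t x)
{
    size_t left = 0, right = n;
    while (left < right) {
        size_t mid = left + (right - left) / 2;
        if (x < r[mid].lo)
            right = mid;              /* continue the search on the left  */
        else if (x > r[mid].hi)
            left = mid + 1;           /* continue the search on the right */
        else
            return 1;                 /* the middle interval contains x   */
    }
    return 0;
}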
Hope this helps!
AFAIK there is no algorithm that searches an integer list in O(1).
One can only do an O(1) search with a vast amount of memory.
So it is not very promising to look for an O(1) search algorithm over a list of integer ranges.
On the other hand, you could try a time/memory trade-off approach by carefully examining your data set (eventually building a kind of hash table).
You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably y-fast tries. They are probably faster than doing binary search, though they require roughly lg w = lg 32 = 5 hash table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, so binary search may be faster.
Rather than 'comparison'-based storage/retrieval (which will always be O(log(n))),
you need to work on 'radix'-based storage/retrieval.
In other words: extract nibbles from the uint32 and build a trie.
Keep your ranges into a sorted array and use binary search for lookups.
It's easy to implement, it's O(log N), and it uses less memory and needs fewer memory accesses than any other tree-based approach, so it will probably also be much faster.
From the description of your problem it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible kinds of object:
a NONE object means no elements beginning with those 16 bits are in the set
an ALL object means all elements beginning with those 16 bits are in the set
a RANGE object means all items with the final 16 bits between an upper and lower bound are in the set
a SINGLE object means just one element beginning with those 16 bits is in the set
a BITSET object handles all remaining cases with a 65536-bit bitset
Of course, you don't need to split at 16 bits; you can adjust to reflect the statistics of your set. In fact you don't need to use consecutive bits, but doing so speeds up the bit twiddling, and if many of your elements are consecutive, as you claim, it will give good properties.
Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a space/speed tradeoff. If you need to save memory, make the tree deeper with a corresponding slight increase in lookup time.
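A possible C rendering of that layout, with a type tag plus a union as suggested (field names and sizes are illustrative):

#include <stdint.h>

enum kind { NONE, ALL, RANGE, SINGLE, BITSET };

struct node {
    enum kind kind;
    union {
        struct { uint16_t lo, hi; } range;   /* RANGE: inclusive bounds on the low 16 bits */
        uint16_t single;                     /* SINGLE: the one low-16-bit value present   */
        uint64_t *bits;                      /* BITSET: 65536 bits for the remaining cases */
    } u;
};

static struct node table[65536];             /* indexed by the top 16 bits */

int set_member(uint32_t x)
{
    const struct node *n = &table[x >> 16];
    uint16_t low = (uint16_t)x;
    switch (n->kind) {
    case NONE:   return 0;
    case ALL:    return 1;
    case RANGE:  return low >= n->u.range.lo && low <= n->u.range.hi;
    case SINGLE: return low == n->u.single;
    case BITSET: return (int)((n->u.bits[low >> 6] >> (low & 63)) & 1);
    }
    return 0;
}

Since NONE is the zero value of the enum, the statically initialized table conveniently starts out as an empty set.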

Best continuously sorting algorithm?

I have a set of double-precision data and I need their list to be always sorted. What is the best algorithm to sort the data as it is being added?
By best I mean least big-O in data count, least small-o in data count (worst-case scenario), and least small-o in the space needed, in that order if possible.
The set size is really variable, from a small number (30) to lots of data (+10M).
Building a self-balancing binary tree like a red-black tree or AVL tree will allow for Θ(lg n) insertion and removal, and Θ(n) retrieval of all elements in sorted order (by doing a depth-first traversal), with Θ(n) memory usage. The implementation is somewhat complex, but they're efficient, and most languages will have library implementations, so they're a good first choice in most cases.
Additionally, retrieving the i-th element can be done by annotating each edge (or, equivalently, each node) in the tree with the total number of nodes below it. Then one can find the i-th element in Θ(lg n) time and Θ(1) space with something like:
node *find_index(node *root, int i) {
    while (root) {
        if (i == root->left_count)
            return root;
        else if (i < root->left_count)
            root = root->left;
        else {
            i -= root->left_count + 1;
            root = root->right;
        }
    }
    return NULL; // i > number of nodes
}
An implementation that supports this can be found in debian's libavl; unfortunately, the maintainer's site seems down, but it can be retrieved from debian's servers.
The structure that is used for indexes of database programs is a B+ Tree. It is a balanced bucketed n-ary tree.
From Wikipedia:
For a b-order B+ tree with h levels of index:
The maximum number of records stored is n = b^h
The minimum number of keys is 2(b/2)^(h−1)
The space required to store the tree is O(n)
Inserting a record requires O(log_b(n)) operations in the worst case
Finding a record requires O(log_b(n)) operations in the worst case
Removing a (previously located) record requires O(log_b(n)) operations in the worst case
Performing a range query with k elements occurring within the range requires O(log_b(n) + k) operations in the worst case.
I use this in my program. You can add your data to the structure as it comes and you can always traverse it in order, front to back or back to front, or search quickly for any value. If you don't find the value, you will have the insertion point where you can add the value.
You can optimize the structure for your program by playing around with b, the size of the buckets.
An interesting presentation about B+ trees: Tree-Structured Indexes
You can get the entire code in C++.
Edit: Now I see your comment that your requirement to know the "i-th sorted element in the set" is an important one. All of a sudden, that makes many data structures less than optimal.
You are probably best off with a SortedList or even better, a SortedDictionary. See the article: Squeezing more performance from SortedList. Both structures have a GetKey function that will return the i-th element.
Likely a heap sort. Heaps are only O(log N) to add new data, and you can pop off the sorted results at any time in O(N log N) total time.
If you always need the whole list sorted every time, then there aren't many options other than an insertion sort. It will likely be O(N^2), though with the (considerable) hassle of linked skip lists you can make it O(N log N).
I would use a heap/priority queue. Worst case is same as average case for runtime. Next element can be found in O(log n) time.
Here is a templatized C# implementation that I derived from this code.
If you just need to know the i-th smallest element, as it says in the comments, use the BFPRT (median-of-medians) algorithm, which is named after the last names of the authors - Blum, Floyd, Pratt, Rivest, and Tarjan - and is generally agreed to be the biggest concentration of big computer science brains in the same paper. O(n) worst-case.
OK, you want your data sorted, but you need to extract it via an index number.
Start with a basic tree such as the aforementioned red-black tree.
Modify the tree algorithm so that, as you insert or delete elements, all nodes encountered during insertion and deletion keep a count of the number of elements under each branch.
Then when you are extracting data from the tree, you can calculate the index as you go, and know which branch to take based on whether it is greater or less than the index you are trying to extract.
One other consideration: 10M+ elements in a tree that uses dynamic memory allocation will suck up a lot of memory overhead. I.e., the pointers may take up more space than your actual data, plus whatever other members are used to implement the data structure. This will lead to serious memory fragmentation and, in the worst cases, degrade the system's overall performance (churning data back and forth from virtual memory). You might want to consider implementing a combination of block and dynamic memory allocation - something wherein you store the tree in blocks of data, thus reducing the memory overhead.
Check out the comparison of sorting algorithms in Wikipedia.
Randomized jumplists are interesting as well.
They require less space than BSTs and skip lists.
Insertion and deletion are O(log n).
By a "set of double data," do you mean a set of real-valued numbers? One of the more commonly used algorithms for that is a heap sort, I'd check that out. Most of its operations are O( n * log(n) ), which is pretty good but doesn't meet all of your criteria. The advantages of heapsort is that it's reasonably simple to code on your own, and many languages provide libraries to manage a sorted heap.

Resources