Data structure to build and lookup set of integer ranges - algorithm

I have a set of uint32 integers; there may be millions of items in the set. 50-70% of them are consecutive, but in the input stream they appear in unpredictable order.
I need to:
Compress this set into ranges to achieve a space-efficient representation. I have already implemented this with a trivial algorithm; since the ranges are computed only once, speed is not important here. After this transformation the number of resulting ranges is typically within 5,000-10,000, and many of them are single-item, of course.
Test membership of some integer; information about the specific range it falls in is not required. This one must be very fast -- O(1). I was thinking about minimal perfect hash functions, but they do not play well with ranges. Bitsets are very space-inefficient. Other structures, like binary trees, have O(log n) complexity; worse, their implementations make many conditional jumps that the processor cannot predict well, giving poor performance.
Is there any data structure or algorithm specialized in integer ranges to solve this task?
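(For reference, the range-coalescing step can be as simple as a sort followed by merging adjacent values; a minimal C++ sketch, with illustrative names:)

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Turn an unordered stream of uint32 values into sorted, disjoint,
// inclusive ranges [first, last].
std::vector<std::pair<uint32_t, uint32_t>> coalesce(std::vector<uint32_t> values) {
    std::vector<std::pair<uint32_t, uint32_t>> ranges;
    if (values.empty()) return ranges;
    std::sort(values.begin(), values.end());
    values.erase(std::unique(values.begin(), values.end()), values.end());
    uint32_t first = values[0], last = values[0];
    for (std::size_t i = 1; i < values.size(); ++i) {
        if (values[i] == last + 1) {
            last = values[i];                 // extend the current run
        } else {
            ranges.emplace_back(first, last); // close the finished range
            first = last = values[i];         // start a new one
        }
    }
    ranges.emplace_back(first, last);
    return ranges;
}
```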

Regarding the second issue:
You could look up Bloom filters. Bloom filters are specifically designed to answer the membership question in O(1), though the response is either no or maybe (which is not as clear-cut as a yes/no :p).
In the maybe case, of course, you need further processing to actually answer the question (unless a probabilistic answer is sufficient in your case), but even so the Bloom filter may act as a gatekeeper and reject most of the queries outright.
Also, you might want to keep actual ranges and degenerate ranges (single elements) in different structures.
single elements may be best stored in a hash-table
actual ranges can be stored in a sorted array
This diminishes the number of elements stored in the sorted array, and thus the complexity of the binary search performed there. Since you state that many ranges are degenerate, I take it that you only have some 500-1000 real ranges (i.e., an order of magnitude fewer), and log(1000) ≈ 10.
I would therefore suggest the following steps:
Bloom Filter: if no, stop
Sorted Array of real ranges: if yes, stop
Hash Table of single elements
The Sorted Array test is performed first because, from the numbers you give (millions of numbers coalesced into a few thousand ranges), if a number is contained, chances are it'll be in a range rather than being single :)
One last note: beware of O(1); while it may seem appealing, you are not in an asymptotic case here. A mere 5,000-10,000 ranges is few, as log(10000) is something like 13. So don't pessimize your implementation by picking an O(1) solution with such a high constant factor that it actually runs slower than an O(log N) solution :)
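A minimal sketch of that three-stage test, assuming a hand-rolled Bloom filter with two hash probes, a sorted array of real ranges, and std::unordered_set for the degenerate ones (the filter size and hash mixing below are illustrative, not tuned):

```cpp
#include <algorithm>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

struct RangeSet {
    // Build step (not shown): call markBloom(v) for every member v (including
    // every value inside every range), fill `ranges` sorted by .first, and put
    // the degenerate single values into `singles`.
    std::vector<bool> bloom = std::vector<bool>(1 << 20);   // ~1 Mbit filter, size not tuned
    std::vector<std::pair<uint32_t, uint32_t>> ranges;      // real ranges, inclusive, sorted
    std::unordered_set<uint32_t> singles;                   // degenerate ranges

    static uint32_t mix(uint32_t x) {                       // cheap integer hash
        x ^= x >> 16; x *= 0x7feb352dU;
        x ^= x >> 15; x *= 0x846ca68bU;
        return x ^ (x >> 16);
    }
    void markBloom(uint32_t x) {
        bloom[mix(x) & (bloom.size() - 1)] = true;
        bloom[mix(x ^ 0x9e3779b9U) & (bloom.size() - 1)] = true;
    }
    bool maybeInBloom(uint32_t x) const {
        return bloom[mix(x) & (bloom.size() - 1)] &&
               bloom[mix(x ^ 0x9e3779b9U) & (bloom.size() - 1)];
    }
    bool inRanges(uint32_t x) const {                       // binary search on real ranges
        auto it = std::upper_bound(ranges.begin(), ranges.end(),
                                   std::pair<uint32_t, uint32_t>(x, UINT32_MAX));
        if (it == ranges.begin()) return false;
        --it;
        return x >= it->first && x <= it->second;
    }
    bool contains(uint32_t x) const {
        if (!maybeInBloom(x)) return false;                 // gatekeeper: definite "no"
        return inRanges(x) || singles.count(x) != 0;        // ranges first: more likely hit
    }
};
```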

If you know in advance what the ranges are, then you can check whether a given integer is present in one of the ranges in O(lg n) using the strategy outlined below. It's not O(1), but it's still quite fast in practice.
The idea behind this approach is that if you've merged all of the ranges together, you have a collection of disjoint ranges on the number line. From there, you can define an ordering on those intervals by saying that the interval [a, b] ≤ [c, d] iff b ≤ c. This is a total ordering because all of the ranges are disjoint. You can thus put all of the intervals together into a static array and then sort them by this ordering. This means that the leftmost interval is in the first slot of the array, and the rightmost interval is in the rightmost slot. This construction takes O(n lg n) time.
To check if some interval contains a given integer, you can do a binary search on this array. Starting at the middle interval, check if the integer is contained in that interval. If so, you're done. Otherwise, if the value is less than the smallest value in the range, continue the search on the left, and if the value is greater than the largest value in the range, continue the search on the right. This is essentially a standard binary search, and it should run in O(lg n) time.
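A sketch of that binary search over the sorted, disjoint intervals (inclusive bounds assumed):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Interval { uint32_t lo, hi; };   // inclusive bounds; intervals disjoint, sorted by lo

// Returns true if x lies inside one of the sorted, disjoint intervals.
bool contains(const std::vector<Interval>& a, uint32_t x) {
    std::size_t lo = 0, hi = a.size();  // half-open search window [lo, hi)
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (x < a[mid].lo)      hi = mid;        // continue on the left
        else if (x > a[mid].hi) lo = mid + 1;    // continue on the right
        else                    return true;     // x is inside a[mid]
    }
    return false;
}
```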
Hope this helps!

AFAIK there is no algorithm that searches over an integer list in O(1).
One can only do O(1) search with a vast amount of memory.
So it is not very promising to try to find an O(1) search algorithm over a list of integer ranges.
On the other hand, you could try a time/memory trade-off approach by carefully examining your data set (possibly building a kind of hash table).

You can use y-fast trees or van Emde Boas trees to achieve O(lg w) time queries, where w is the number of bits in a word, and you can use fusion trees to achieve O(lg_w n) time queries. The optimal tradeoff in terms of n is O(sqrt(lg(n))).
The easiest of these to implement is probably y-fast trees. They may or may not beat binary search: they require roughly lg w = lg 32 = 5 hash table queries, while binary search requires roughly lg n = lg 10000 ≈ 13 comparisons, and a hash query typically costs more than a comparison, so binary search may be faster.

Rather than 'comparison'-based storage/retrieval (which will always be O(log n)),
you need to work with 'radix'-based storage/retrieval.
In other words: extract nibbles from the uint32 and build a trie.
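A sketch of such a nibble trie: sixteen-way branching, eight fixed levels for a uint32, so a lookup is at most eight indexed loads with no data-dependent comparisons (a production version would likely flatten or compress the nodes):

```cpp
#include <cstdint>
#include <memory>

// Radix trie keyed on the eight 4-bit nibbles of a uint32 (16-way branching).
struct NibbleTrie {
    struct Node {
        std::unique_ptr<Node> child[16];
        bool present = false;           // true if the path to this depth-8 node is a member
    };
    Node root;

    void insert(uint32_t x) {
        Node* n = &root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            const unsigned nib = (x >> shift) & 0xF;
            if (!n->child[nib]) n->child[nib] = std::make_unique<Node>();
            n = n->child[nib].get();
        }
        n->present = true;
    }
    bool contains(uint32_t x) const {
        const Node* n = &root;
        for (int shift = 28; shift >= 0; shift -= 4) {
            n = n->child[(x >> shift) & 0xF].get();
            if (!n) return false;       // fixed 8 steps at most
        }
        return n->present;
    }
};
```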

Keep your ranges in a sorted array and use binary search for lookups.
It's easy to implement, it's O(log N), and it uses less memory and needs fewer memory accesses than any tree-based approach, so it will probably also be much faster.

From the description of your problem it sounds like the following might be a good compromise. I've described it using an object-oriented language, but it is easily convertible to C using a union type or a structure with a type member and a pointer.
Use the first 16 bits to index an array of objects (of size 65536). In that array there are 5 possible kinds of object:
a NONE object means no elements beginning with those 16 bits are in the set
an ALL object means all elements beginning with those 16 bits are in the set
a RANGE object means all items with the final 16 bits between a lower and upper bound are in the set
a SINGLE object means just one element beginning with those 16 bits is in the set
a BITSET object handles all remaining cases with a 65536-bit bitset
Of course, you don't need to split at 16 bits; you can adjust to reflect the statistics of your set. In fact you don't need to use consecutive bits, but using them speeds up the bit twiddling, and if many of your elements are consecutive, as you claim, it will give good properties.
Hopefully this makes sense; please comment if I need to explain more fully. Effectively you've combined a depth-2 tree with ranges and a bitset for a time/space tradeoff. If you need to save memory, then make the tree deeper, with a corresponding slight increase in lookup time.
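A C++ sketch of that table, with a tagged struct standing in for the five object kinds (the build step that classifies each 16-bit prefix from the coalesced ranges is omitted):

```cpp
#include <bitset>
#include <cstdint>
#include <memory>
#include <vector>

// One slot per 16-bit prefix; describes which low 16-bit values are present.
struct Slot {
    enum Kind : uint8_t { NONE, ALL, RANGE, SINGLE, BITSET } kind = NONE;
    uint16_t lo = 0, hi = 0;                   // RANGE bounds; SINGLE stores its value in lo
    std::unique_ptr<std::bitset<65536>> bits;  // allocated only for BITSET slots
};

struct SplitSet {
    std::vector<Slot> top = std::vector<Slot>(65536);   // indexed by the high 16 bits

    bool contains(uint32_t x) const {
        const Slot& s = top[x >> 16];
        const uint16_t low = static_cast<uint16_t>(x);  // the final 16 bits
        switch (s.kind) {
            case Slot::NONE:   return false;
            case Slot::ALL:    return true;
            case Slot::RANGE:  return low >= s.lo && low <= s.hi;
            case Slot::SINGLE: return low == s.lo;
            case Slot::BITSET: return s.bits->test(low);
        }
        return false;
    }
};
```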

Related

How to count the different groups of one million binary sequences?

I have one million binary sequences, all of the same length, such as (1000010011, 1100110000, ...) and so on. I want to know how many different groups they form (identical sequences belong to the same group). What is the fastest way?
No stoi please.
Depending on the length L of a sequence:
L < ~20: bucket sort
This is short enough in comparison to the input size; a bucket sort with 2^L buckets is all you need.
Preallocate an array of size 2^L; since you have ~a million sequences and 2^20 is ~a million, you will only need O(n) additional memory.
Go through your sequences, sorting them into the buckets.
Go through the buckets, count the results. Return them.
And we're done.
The time complexity will be O(n) with O(n) memory cost. This is optimal complexity-wise since you have to visit every element at least once to check its value anyway.
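A sketch of the bucket approach, parsing each sequence's bits by hand since stoi was ruled out:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Count distinct L-bit sequences (L <= ~20) given as strings of '0'/'1'.
std::size_t countDistinctShort(const std::vector<std::string>& seqs, unsigned L) {
    std::vector<bool> seen(std::size_t(1) << L, false);         // one bucket per possible value
    std::size_t distinct = 0;
    for (const std::string& s : seqs) {
        uint32_t v = 0;
        for (char c : s) v = (v << 1) | (c == '1' ? 1u : 0u);   // parse bits by hand, no stoi
        if (!seen[v]) { seen[v] = true; ++distinct; }
    }
    return distinct;
}
```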
L reasonably large: hash table
If you pick a reasonable hashing function and a good size for the hash table (or a dictionary if we need to store the counts)¹, you will have a small number of collisions while inserting. The time will be O(n) overall, since if the hash is good each insert is amortized O(1).
As a side note, the bucket sort is technically a perfect hash, since the hash function in this case is a one-to-one function.
L unreasonably large: binary tree
If for some reason constructing a hash is not feasible, or you wish for consistency, then building a binary tree to hold the values is the way to go.
This will take O(n log n), as binary trees usually do.
¹ ~2M should be enough and it is still O(n). Maybe you could go even lower, to around 1.5M.
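A sketch of the hash-table variant using std::unordered_set; the reserve size follows the footnote's suggestion:

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// For longer sequences, hash the strings directly; the number of distinct
// groups is simply the final size of the set. Expected O(n) with a decent hash.
std::size_t countDistinct(const std::vector<std::string>& seqs) {
    std::unordered_set<std::string> groups;
    groups.reserve(2000000);                 // room for ~2M elements, per the footnote
    for (const std::string& s : seqs) groups.insert(s);
    return groups.size();
}
```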

Data structure which maps non-overlapping ranges to values?

I need a data structure which maps non-overlapping ranges (e.g. 8..15, 16..19) to a pointer to a struct.
I will need to be able to look up any element in that range and retrieve that pointer. For example:
structure[8..15] = 0x12345678;
structure[16..19] = 0xdeadbeef;
structure[7]; // => NULL
structure[12]; // => 0x12345678
structure[18]; // => 0xdeadbeef
I'm considering using a binary search tree at the moment. Since the ranges will never overlap, I can search for indexes relatively easily in logarithmic time.
However, I'm wondering if there are any data structures more suitable for this case. I need to be able to efficiently insert, delete and lookup. All of these operations are O(log n) in a BST, but I'm wondering if there's anything that's faster for this.
If you want something faster than O(log n), use a van Emde Boas tree.
It should be used the same way you use a binary search tree: use the start of each range as the key, and the end of the range as part of the value (together with the pointer) mapped to that key. Time complexity is O(log log M), where M is the range of keys (INT_MAX, if any integer value is possible for the start of a range).
In some cases a van Emde Boas tree has a large memory overhead. If that is not acceptable, use either a simple trie, as explained by Beni, or a Y-fast trie.
I don't think you can do much better.
Non-overlapping ranges are equivalent to a sequence of alternating start/end points. So lookup is just "find the largest element ≤ x" followed by an O(1) check of whether it's a start or an end, i.e. an ordered map.
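A sketch of that ordered-map lookup using std::map, with the range start as the key and the inclusive end plus the pointer as the value:

```cpp
#include <cstdint>
#include <map>

struct RangeValue { uint32_t end; void* ptr; };   // inclusive range end + payload

// key = range start; lookup = "largest start <= x", then check x against that end.
void* lookup(const std::map<uint32_t, RangeValue>& m, uint32_t x) {
    auto it = m.upper_bound(x);                   // first range starting after x
    if (it == m.begin()) return nullptr;          // x is before every range
    --it;                                         // largest start <= x
    return (x <= it->second.end) ? it->second.ptr : nullptr;
}
// Usage: m[8] = {15, p1}; m[16] = {19, p2};
//        lookup(m, 12) == p1, lookup(m, 7) == nullptr, lookup(m, 18) == p2
```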
The usual suspects for that - binary trees, B trees, various tries - are all essentially O(log n). Which is best in practice is a matter of tuning, depends on knowing something about the ranges. Are they sparse or dense? Are they of similar size or vary widely? How large is the data (fits in cache / ram / disk)? Do you insert/delete a lot or are lookup dominant? Is access random or with high locality?
One tradeoff applicable to many schemes is splitting ranges, replicating the same pointer in several places. This may speed up lookups at the expense of insert/delete cost and memory usage. An extreme application is just a flat array indexed by point, where lookup is O(1) but insertion is O(size of range); this begs for a multi-level structure: an array of k uniform subranges, each pointing to the value if it is entirely covered by one range, or to a sub-array if not. Hey, I just described a trie! Lookup is log(maxint)/log(k), very fast if k is a power of 2 (e.g. 256); insertion and memory are k*log(n).
But remember that wasting memory hurts cache performance, so any such "optimization" may actually be counter-productive, even for lookups.

What Big-O equation describes my search?

I have a sorted array of doubles (latitudes, actually) that are spread relatively uniformly over a range of -10 to -43. Now, if I do a binary search over that list I get O(log N).
But I can further optimise my search by having a lookup table with 34 keys (-10 to -43) that lets me jump straight to the starting point for that number.
E.g. for -23.123424, first look up key 23 to get the start-end range of all -23 values. I can then binary search from the middle of that range.
What would my Big-O look like?
It's still O(log n). Consider: it takes constant time to look up the starting indices in your integer lookup table, so that part doesn't add anything. Then it's O(log n) to do the binary search. Actually it will take roughly log(n/34) time, because you expect to search through an array 34 times smaller on average (the values are distributed over 34 different intervals with boundaries from -43 to -10), but constant multipliers aren't considered in big-O notation.
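For concreteness, a sketch of the two-level lookup being analyzed here; it assumes a precomputed offset table of 34 degree buckets plus a sentinel entry, and the names are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// lats: sorted ascending, values in [-43, -10].
// starts: 35 entries; starts[k] is the index of the first latitude >= -43 + k,
// and starts[34] == lats.size() acts as a sentinel.
bool containsLat(const std::vector<double>& lats,
                 const std::vector<std::size_t>& starts, double x) {
    const int k = static_cast<int>(std::floor(x)) + 43;          // degree bucket 0..33
    if (k < 0 || k + 1 >= static_cast<int>(starts.size())) return false;
    const auto first = lats.begin() + static_cast<std::ptrdiff_t>(starts[k]);
    const auto last  = lats.begin() + static_cast<std::ptrdiff_t>(starts[k + 1]);
    const auto it = std::lower_bound(first, last, x);             // O(log of bucket size)
    return it != last && *it == x;                                // exact-match test
}
```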
It would still be O(log N), but for a reduced dataset (think smaller value for N).
Since the lookup table narrows the search to about 1/34 of the data, which is close to 1/32, i.e. it saves about 5 steps of the binary search, you might want to benchmark whether this really helps: the additional code paths, with extra cache misses and the occasional branch misprediction/pipeline flush, might make this slower than the direct binary search.
Additionally, if lookup time for an in-memory table is the bottleneck, you might want to consider representing your lats as Int32 values - definitely precise enough, but much faster to search through.
It sounds like your optimization would help, but I'm thinking it's still considered O(log N) because you still have to search for the exact value. If it took you directly to the value, it would be O(1).
This is a limitation of the Big-Oh analysis. It doesn't take into account that you reduced the number of values you have to search.
Your concept is close to that of interpolation search, except instead of only "interpolating" once on the integral part of the key, it recursively uses interpolation to intelligently drive a binary search. Since your domain is relatively uniform, the expected runtime is O(log log n).

Data structure with O(1) insertion time and O(log m) lookup?

Backstory (skip to second-to-last paragraph for data structure part): I'm working on a compression algorithm (of the LZ77 variety). The algorithm boils down to finding the longest match between a given string and all strings that have already been seen.
To do this quickly, I've used a hash table (with separate chaining) as recommended in the DEFLATE spec: I insert every string seen so far one at a time (one per input byte) with m slots in the chain for each hash code. Insertions are fast (constant-time with no conditional logic), but searches are slow because I have to look at O(m) strings to find the longest match. Because I do hundreds of thousands of insertions and tens of thousands of lookups in a typical example, I need a highly efficient data structure if I want my algorithm to run quickly (currently it's too slow for m > 4; I'd like an m closer to 128).
I've implemented a special case where m is 1, which runs very fast but offers only so-so compression. Now I'm working on an algorithm for those who'd prefer improved compression ratio over speed, where the larger m is, the better the compression gets (to a point, obviously). Unfortunately, my attempts so far are too slow for the modest gains in compression ratio as m increases.
So, I'm looking for a data structure that allows very fast insertion (since I do more insertions than searches), but still fairly fast searches (better than O(m)). Does an O(1) insertion and O(log m) search data structure exist? Failing that, what would be the best data structure to use? I'm willing to sacrifice memory for speed. I should add that on my target platform, jumps (ifs, loops, and function calls) are very slow, as are heap allocations (I have to implement everything myself using a raw byte array in order to get acceptable performance).
So far, I've thought of storing the m strings in sorted order, which would allow O(log m) searches using a binary search, but then the insertions also become O(log m).
Thanks!
You might be interested in this match-finding structure:
http://encode.ru/threads/1393-A-proposed-new-fast-match-searching-structure
It has O(1) insertion time and O(m) lookup, but its m is many times lower than that of a standard hash table for an equivalent match-finding result. As an example, with m=4, this structure gets results equivalent to an 80-probe hash table.
You might want to consider using a trie (aka prefix tree) instead of a hash table.
For your particular application, you might be able to additionally optimize insertion. If you know that after inserting ABC you're likely to insert ABCD, then keep a reference to the entry created for ABC and just extend it with D---no need to repeat the lookup of the prefix.
One common optimization in hash tables is to move the item you just found to the head of the list (with the idea that it's likely to be used again soon for that bucket). Perhaps you can use a variation of this idea.
If you do all of your insertions before you do your lookups, you can add a bit to each bucket that says whether the chain for that bucket is sorted. On each lookup, you can check the bit to see if the bucket is sorted. If not, you would sort the bucket and set the bit. Once the bucket is sorted, each lookup is O(lg m).
If you interleave insertions and lookups, you could have 2 lists for each bucket: one that's sorted and one that isn't. Inserts would always go to the non-sorted list. A lookup would first check the sorted list, and only if it's not there would it look in the non-sorted list. When it's found in the non-sorted list you would remove it and put it in the sorted list. This way you only pay to sort items that you look up.
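A rough sketch of the lazily-sorted-bucket idea from the first paragraph, with strings standing in for the match candidates (the two-list variant for interleaved insertions extends this in the obvious way):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct LazyBucket {
    std::vector<std::string> items;   // chain for this hash code
    bool sorted = false;              // "is this chain currently sorted?" bit
};

struct LazyChainedTable {
    std::vector<LazyBucket> buckets = std::vector<LazyBucket>(1 << 15);

    std::size_t index(const std::string& s) const {
        return std::hash<std::string>{}(s) & (buckets.size() - 1);   // power-of-two table
    }
    void insert(const std::string& s) {          // no comparisons, no sorting here
        LazyBucket& b = buckets[index(s)];
        b.items.push_back(s);
        b.sorted = false;
    }
    bool contains(const std::string& s) {        // sort the chain lazily, then O(lg m) search
        LazyBucket& b = buckets[index(s)];
        if (!b.sorted) {
            std::sort(b.items.begin(), b.items.end());
            b.sorted = true;
        }
        return std::binary_search(b.items.begin(), b.items.end(), s);
    }
};
```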

Which search data structure works best for sorted integer data?

I have over a billion sorted integers; which data structure do you think can exploit the sorted behavior? The main goal is to search for items faster...
Options I can think of --
1) Regular binary search trees built by recursively splitting in the middle.
2) Any other balanced binary search tree should work well, but it does not exploit the sortedness.
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes; I think plain arrays can't do that unless they are lists, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bitvector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't that much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, this will increase the locality of the binary searches and can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use this sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
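A sketch of the packed bit vector itself, membership only; the associated data would live in the secondary structure mentioned above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Packed bit vector over the integer key space: bit i is set iff i is present.
struct BitVector {
    std::vector<uint64_t> words;
    explicit BitVector(uint64_t universe)          // universe = number of distinct keys
        : words(static_cast<std::size_t>((universe + 63) / 64), 0) {}
    void set(uint32_t x)        { words[x >> 6] |= uint64_t(1) << (x & 63); }
    bool test(uint32_t x) const { return (words[x >> 6] >> (x & 63)) & 1; }
};
// e.g. BitVector present(uint64_t(1) << 32);      // 2^32 bits = 512 MiB
```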
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log n) time, which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
With a sorted array the best you can achieve is an interpolation search, which gives you log(log(n)) average time. It is essentially a binary search, but it doesn't divide the array into 2 sub-arrays of the same size.
It's really fast and extraordinary easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you; with 1 billion integers it's practically impossible to hit.
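A sketch of interpolation search over sorted uint32 values; the probe position is estimated from the key's value instead of always taking the midpoint:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Interpolation search on a sorted array of uint32 values.
// Roughly O(log log n) on near-uniform data, O(n) in the worst case.
bool interpolationSearch(const std::vector<uint32_t>& a, uint32_t key) {
    if (a.empty()) return false;
    std::size_t lo = 0, hi = a.size() - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo]) return a[lo] == key;          // avoid division by zero
        // Probe where the key "should" be under a uniform distribution.
        const std::size_t pos = lo + static_cast<std::size_t>(
            (uint64_t(key - a[lo]) * (hi - lo)) / (a[hi] - a[lo]));
        if (a[pos] == key)  return true;
        if (a[pos] < key)   lo = pos + 1;
        else if (pos == 0)  return false;                 // guard unsigned underflow
        else                hi = pos - 1;
    }
    return false;
}
```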
O(1) solutions:
Assuming 32-bit integers and a lot of ram:
A lookup table of size roughly 2³² (4 billion elements), where the entry at each index is the number of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values; if not, you might want to combine the 32-bit strategy with a hash lookup.
